The author discusses the importance of attention stability in the model's predictions: after an input perturbation, the model should preserve a consistent ranking of the words it attends to. Stable attention alone, however, does not guarantee stable outputs, because perturbations still shift the text embeddings, so additional constraints must be placed directly on the model's predictions to keep them consistent under perturbation. Since accuracy and robustness naturally trade off, the objective is to improve stability while minimizing the drop in accuracy, avoiding catastrophic errors. This calls for a mechanism that preserves the model's performance on the original, unperturbed input while enforcing consistency on perturbed ones.
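The ideas above — rank-consistent attention and a prediction-stability penalty balanced against the task loss — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the Spearman-style rank measure, the mean-squared drift penalty, and the weight `lam` are all assumptions introduced here for clarity.

```python
import numpy as np

def rank_consistency(attn_a, attn_b):
    """Spearman-style rank agreement between two attention weight vectors
    (e.g., before and after a text perturbation). Returns 1.0 when the
    word ranking is identical, and less than 1.0 as rankings diverge.
    This is an illustrative metric, not the paper's exact formulation."""
    ra = np.argsort(np.argsort(-np.asarray(attn_a, dtype=float)))
    rb = np.argsort(np.argsort(-np.asarray(attn_b, dtype=float)))
    n = len(ra)
    if n < 2:
        return 1.0
    d = ra - rb
    return 1.0 - 6.0 * float(np.sum(d ** 2)) / (n * (n ** 2 - 1))

def combined_objective(pred_orig, pred_pert, task_loss, lam=0.1):
    """Hypothetical combined objective: the task loss on the original
    input plus a penalty on prediction drift under perturbation.
    `lam` trades off accuracy (task loss) against robustness (drift)."""
    drift = float(np.mean((np.asarray(pred_orig, dtype=float)
                           - np.asarray(pred_pert, dtype=float)) ** 2))
    return task_loss + lam * drift
```

Setting `lam` too high sacrifices accuracy for robustness; too low, and perturbations change the predictions, which is the trade-off the paragraph describes.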
https://sato-team.github.io/Stable-Text-to-Motion-Framework/