T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations
Abstract
In this work, we investigate a simple and must-know conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), while our FID of 0.116 largely outperforms the 0.630 of MotionDiffuse. Additionally, we conduct analyses on HumanML3D and observe that dataset size is a limitation of our approach. Our work suggests that VQ-VAE still remains a competitive approach for human motion generation.
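The corruption strategy mentioned in the abstract can be illustrated with a minimal sketch, assuming a PyTorch setup: during GPT training, some ground-truth motion tokens are replaced with random codebook indices so the conditioning context resembles the imperfect sequences produced during autoregressive inference. The function name `corrupt_tokens` and the `corrupt_prob` rate are illustrative assumptions, not the paper's exact implementation or hyper-parameters.

```python
import torch

def corrupt_tokens(tokens: torch.Tensor, codebook_size: int, corrupt_prob: float = 0.5) -> torch.Tensor:
    """Randomly replace ground-truth motion-token indices with random codebook
    indices before feeding them to the GPT, so that training contexts resemble
    the imperfect sequences seen at autoregressive inference time.
    (Illustrative sketch; not the authors' exact code.)"""
    # Boolean mask of positions to corrupt, drawn independently per token.
    mask = torch.rand(tokens.shape, device=tokens.device) < corrupt_prob
    # Random replacement indices drawn uniformly from the codebook.
    random_ids = torch.randint(0, codebook_size, tokens.shape, device=tokens.device)
    return torch.where(mask, random_ids, tokens)

# Usage: a batch of 2 sequences of 8 motion-token indices from a 512-entry codebook.
gt_tokens = torch.randint(0, 512, (2, 8))
noisy_tokens = corrupt_tokens(gt_tokens, codebook_size=512, corrupt_prob=0.5)
```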