3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars
3DXTalker generates identity-consistent, expressive 3D talking avatars from a single reference image and speech audio, with accurate lip synchronization, controllable emotion, and natural head-pose dynamics. Expressive facial animation is driven by data-curated identity modeling, audio-rich representations, and controllable spatial dynamics. By introducing frame-wise amplitude and emotional cues beyond standard speech embeddings, 3DXTalker delivers precise lip synchronization and nuanced expression modulation. Built on a flow-matching transformer architecture, it generates natural head-pose motion with support for stylized control, unifying lip synchronization, emotional expression, and head-pose dynamics in a single framework.
- Release the 3DTalking benchmark dataset
- Release the raw dataset
- Release the processed dataset
- Release the data processing code
- Release the training and inference code
- Release the pretrained models
- Python 3.10
- PyTorch 2.2.2
- CUDA 12.1
- PyTorch3D 0.7.7
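The PyTorch build must match the CUDA version listed above. If `requirements.txt` does not already pin a CUDA 12.1 build, one way to install matching wheels inside the environment created below is shown here; the package set and `--index-url` are our assumption, not taken from this repo's requirements:

```bash
# Assumption (not from this repo's requirements): install PyTorch 2.2.2 built
# against CUDA 12.1 from the official PyTorch wheel index.
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 \
    --index-url https://download.pytorch.org/whl/cu121
```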
```bash
conda create -n env_3DXTalker python==3.10
conda activate env_3DXTalker
pip install -r requirements.txt
```

For some users the PyTorch3D compilation fails during the requirements install but succeeds when run afterwards. If that happens, install it separately:

```bash
pip install "git+https://github.com/facebookresearch/pytorch3d.git@v0.7.7"
```
- Download the `emotion2vec_plus_base` model and place it in `./pretrained_models/`:

```bash
# Create directory
mkdir -p pretrained_models/emotion2vec_plus_base

# Option 1: Using git-lfs (recommended)
cd pretrained_models
git lfs install
git clone https://huggingface.co/iic/emotion2vec_plus_base

# Option 2: Manual download from https://huggingface.co/iic/emotion2vec_plus_base
# Download all files to ./pretrained_models/emotion2vec_plus_base/
```
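If git-lfs is unavailable, a third option (our suggestion, not from the original instructions) is the Hugging Face CLI, assuming a recent `huggingface_hub` is installed in the environment:

```bash
# Requires huggingface_hub >= 0.17, which provides `huggingface-cli download`.
huggingface-cli download iic/emotion2vec_plus_base \
    --local-dir pretrained_models/emotion2vec_plus_base
```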
- Download `microsoft/wavlm-base-plus` (audio encoder):

```bash
# Option 1: Auto-download on first run (recommended)
# The model will be automatically downloaded from HuggingFace when you run training

# Option 2: Pre-download manually
cd pretrained_models
git lfs install
git clone https://huggingface.co/microsoft/wavlm-base-plus
# Then update config/default_config.yaml:
# audio_encoder_repo: './pretrained_models/wavlm-base-plus'
```
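If you pre-downloaded the encoder, a quick way to confirm the local copy is usable (assuming `transformers` is installed via `requirements.txt`) is:

```bash
# Loads the local WavLM checkpoint; fails loudly if git-lfs left pointer files behind.
python -c "from transformers import WavLMModel; m = WavLMModel.from_pretrained('./pretrained_models/wavlm-base-plus'); print('wavlm-base-plus loaded, hidden size:', m.config.hidden_size)"
```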
Expected directory structure:

```
pretrained_models/
├── emotion2vec_plus_base/
│   ├── config.json
│   ├── pytorch_model.bin
│   └── ...
└── wavlm-base-plus/          # Optional (auto-downloads if not present)
    ├── config.json
    ├── pytorch_model.bin
    └── ...
```
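A minimal check that the expected files are in place:

```bash
# emotion2vec_plus_base is required; wavlm-base-plus is optional (it auto-downloads).
for d in emotion2vec_plus_base wavlm-base-plus; do
  [ -f "pretrained_models/$d/config.json" ] && echo "$d: OK" || echo "$d: missing"
done
```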
- Download the raw video datasets following these links: V0-GRID; V1-RAVDESS; V2-MEAD; V3-VoxCeleb2; V4-HDTF; V5-CelebV-HQ.

If you don't want to process the data manually, we also provide the processed data on Hugging Face.
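The processed-data repo name is not reproduced here. As a sketch, downloading a Hugging Face dataset repo generally looks like the following, where `<org>/<processed-dataset>` is a hypothetical placeholder for the actual repo:

```bash
# <org>/<processed-dataset> is a placeholder -- substitute the repo linked above.
huggingface-cli download <org>/<processed-dataset> \
    --repo-type dataset --local-dir data/processed
```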
- Run data curation (duration, noise, language, sync, resolution normalization).
  - Edit `raw_video_dir` in `data_prepare/data_curation_pipeline.py` to point to your raw video folder.

```bash
cd data_prepare
python data_curation_pipeline.py
```

Output will be in `data_prepare/final_curated_videos/`.
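After the pipeline finishes, a quick way to see how many clips survived curation is shown below; the `.mp4` extension is our assumption about the pipeline's output format:

```bash
# Count curated clips; the .mp4 extension is an assumption about the output format.
find data_prepare/final_curated_videos -type f -name "*.mp4" | wc -l
```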
- Rename videos for dataset indexing.
  - Edit `dataset_name`, `input_dir`, and `output_dir` in `data_prepare/rename.py` if needed.
  - By default it expects input at `data_prepare/Scaled_videos` and outputs to `data_prepare/Renamed_videos`.

```bash
cd data_prepare
python rename.py
```

- Download EMOCA-related assets (models and FLAME files).
```bash
bash gdl_apps/EMOCA/demos/download_assets.sh
```

- Run EMOCA reconstruction to extract FLAME parameters.
  - Edit `data_root_dir` and `dataset_name` in `gdl_apps/EMOCA/demos/my_recons_video.py`. `data_root_dir` should contain `<dataset_name>/all_videos_path.txt` (see the sketch after this list for one way to generate it).

```bash
python gdl_apps/EMOCA/demos/my_recons_video.py \
    --dataset_name VoxCeleb2 \
    --output_folder video_output \
    --model_name EMOCA_v2_lr_mse_20
```

- Data structures are provided in DATASET_STRUCTURE.md.
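The exact format of `all_videos_path.txt` is defined by `my_recons_video.py`; assuming it is one video path per line (an assumption, please check the script), it could be generated like this, with the `data_root/` and `VoxCeleb2` names as hypothetical placeholders for your own `data_root_dir` layout:

```bash
# Assumption: all_videos_path.txt lists one video path per line -- verify against my_recons_video.py.
# data_root/ and VoxCeleb2 are placeholders; adapt them to your data_root_dir layout.
mkdir -p data_root/VoxCeleb2
find "$(pwd)/data_prepare/Renamed_videos" -type f -name "*.mp4" | sort \
    > data_root/VoxCeleb2/all_videos_path.txt
wc -l data_root/VoxCeleb2/all_videos_path.txt
```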
