MeiGen-AI/InfiniteTalk is an audio-driven video generation project. Given input audio and a video or image of a person, it generates a new video in which lip shape, head motion, body posture, and facial expressions follow the audio.
It is not positioned as a simple mouth replacement tool. The README describes InfiniteTalk as a sparse-frame video dubbing framework. It can preserve identity in video-to-video scenarios, supports long-video generation, and can also generate talking videos from a single image plus audio in image-to-video scenarios.
Main Capabilities
InfiniteTalk’s key capabilities are straightforward:
- Audio-driven video-to-video generation.
- Image-to-video generation from an input image and audio.
- Synchronization beyond lips, including head, body, and facial expressions.
- Long-video generation.
- Fewer hand and body distortions and better lip-sync accuracy than MultiTalk, according to the project description.
This kind of project is suitable for interview dubbing, digital human video, localized lip synchronization, long-video redubbing, and virtual character expression. It is closer to a research and engineering tool than a lightweight desktop app.
Installation
The official README installation process is preserved below. Prepare a CUDA-capable Linux environment, Conda, Git, enough VRAM, and enough disk space first. The environment name in the README remains multitalk.
Clone the repository and enter the directory:
```shell
git clone https://github.com/MeiGen-AI/InfiniteTalk.git
cd InfiniteTalk
```
1. Create a Conda environment and install PyTorch and xformers
```shell
conda create -n multitalk python=3.10
conda activate multitalk
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U xformers==0.0.28 --index-url https://download.pytorch.org/whl/cu121
```
This uses the PyTorch wheel for CUDA 12.1. If your local CUDA, driver, or platform differs, adjust the command according to the official PyTorch installation guide.
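As a small illustration of what "adjusting the command" means, the sketch below derives the wheel index URL from a CUDA version string. The helper name torch_index_url is hypothetical, and the URL pattern is simply the one visible in the command above; not every CUDA tag is published for every PyTorch release, so confirm the result against the official installation guide before using it.

```shell
# Hypothetical helper: map a CUDA version such as "12.1" to the PyTorch
# wheel index URL that follows the same pattern as the install command.
torch_index_url() {
    cuda_version="$1"
    # Drop the dot: 12.1 -> 121, 11.8 -> 118
    tag=$(printf '%s' "$cuda_version" | tr -d '.')
    printf 'https://download.pytorch.org/whl/cu%s\n' "$tag"
}

torch_index_url 12.1   # prints https://download.pytorch.org/whl/cu121
```

The printed URL would replace the value after --index-url in both pip commands.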
2. Install Flash-Attn dependencies
```shell
pip install misaki[en]
pip install ninja
pip install psutil
pip install packaging
pip install wheel
pip install flash_attn==2.7.4.post1
```
flash_attn is sensitive to CUDA, compiler toolchain, and PyTorch versions. If this step fails, first check whether CUDA, nvcc, GCC, Python, and the PyTorch wheel match each other.
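One way to gather those versions in a single pass is sketched below. The function name report_tool is an assumption made for illustration; the script only prints what each tool reports and does not decide whether the combination is compatible.

```shell
# Print "<tool> --version" output, or note that the tool is not on PATH.
report_tool() {
    tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $tool =="
        "$tool" --version 2>&1
    else
        echo "$tool: not found"
    fi
}

report_tool nvcc
report_tool gcc
report_tool python

# The torch check is only meaningful inside the multitalk environment.
if command -v python >/dev/null 2>&1; then
    python -c 'import torch; print("torch", torch.__version__, "built for CUDA", torch.version.cuda)' 2>/dev/null \
        || echo "torch: not importable"
fi
```

If the CUDA release printed by nvcc differs from the CUDA version the PyTorch wheel was built for, that mismatch is a likely first suspect when the flash_attn build fails.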
3. Install other dependencies
```shell
pip install -r requirements.txt
conda install -c conda-forge librosa
```
4. Install FFmpeg
Inside the Conda environment, you can use:
```shell
conda install -c conda-forge ffmpeg
```
You can also install it at the system level. The README gives this yum example:
```shell
sudo yum install ffmpeg ffmpeg-devel
```
Model Preparation
InfiniteTalk needs three types of models:
| Model | Purpose |
| --- | --- |
| Wan2.1-I2V-14B-480P | Base model |
| chinese-wav2vec2-base | Audio encoder |
| MeiGen-InfiniteTalk | InfiniteTalk audio condition weights |
The README uses huggingface-cli to download models. The official commands are:
```shell
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download MeiGen-AI/InfiniteTalk --local-dir ./weights/InfiniteTalk
```
If huggingface-cli is not available, install Hugging Face Hub first:
```shell
pip install -U huggingface_hub
```
Some models may require logging in to Hugging Face or accepting the usage terms on the model page. After download, the directory structure should roughly contain:
```
weights/
    Wan2.1-I2V-14B-480P/
    chinese-wav2vec2-base/
    InfiniteTalk/
```
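The layout above can be checked before the first run with a short script such as the following. This is a minimal sketch: check_weights is a hypothetical helper, and it only confirms that the three directories exist, not that every file inside them finished downloading.

```shell
# Report which of the three expected model directories exist under a root.
check_weights() {
    root="$1"
    missing=0
    for name in Wan2.1-I2V-14B-480P chinese-wav2vec2-base InfiniteTalk; do
        if [ -d "$root/$name" ]; then
            echo "ok       $root/$name"
        else
            echo "missing  $root/$name"
            missing=$((missing + 1))
        fi
    done
    # Non-zero exit status when anything is missing, so callers can stop early.
    [ "$missing" -eq 0 ]
}

check_weights ./weights || echo "Download the missing models before running inference."
```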
Common Runtime Parameters
The README lists several key parameters:
```
--mode streaming: long video generation.
--mode clip: generate a short video with one chunk.
--use_teacache: run with TeaCache.
--size infinitetalk-480: generate 480P video.
--size infinitetalk-720: generate 720P video.
--use_apg: run with APG.
--teacache_thresh: a coefficient used for TeaCache acceleration.
--sample_text_guide_scale: when not using LoRA, the optimal value is 5. After applying LoRA, the recommended value is 1.
--sample_audio_guide_scale: when not using LoRA, the optimal value is 4. After applying LoRA, the recommended value is 2.
--max_frame_num: the maximum length of the generated video in frames; the default is 1000 frames (40 seconds).
```
In the README, the two guide-scale options are printed with a nonstandard leading dash, and the audio guide-scale line appears twice. They are shown here once each with the standard -- prefix, which is how the README's own LoRA example writes them. If you copy these options from the README instead, retype the dashes.
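The default cap quoted for --max_frame_num, 1000 frames for 40 seconds, implies 25 frames per second. The sketch below uses that ratio to estimate a frame count for a longer clip. The 25 fps figure is inferred from those two numbers rather than stated separately, the helper name frames_for_seconds is hypothetical, and whether the script expects the value rounded in some particular way is not covered here.

```shell
# 1000 frames / 40 seconds, taken from the --max_frame_num description.
FPS=$((1000 / 40))

# Frames needed to cover a clip of the given whole number of seconds.
frames_for_seconds() {
    echo $(( $1 * FPS ))
}

frames_for_seconds 40   # prints 1000, the default cap
frames_for_seconds 90   # prints 2250
```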
Single-GPU Inference Example
Official single-GPU example:
```shell
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res
```
To run at 720P, change --size to infinitetalk-720:
```shell
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-720 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_720p
```
For low-VRAM mode, add --num_persistent_param_in_dit 0:
```shell
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --num_persistent_param_in_dit 0 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_lowvram
```
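The three single-GPU commands differ only in --size, --save_file, and the optional low-VRAM flag, so they can be generated from one place. The wrapper below is a hypothetical convenience, not part of the repository; it prints the command rather than running it, so the result can be inspected first.

```shell
# Hypothetical wrapper: print the single-GPU command for a given size and
# output name. Pass "lowvram" as the third argument to add the low-VRAM flag.
infinitetalk_cmd() {
    size="$1"
    save_file="$2"
    extra=""
    if [ "${3:-}" = "lowvram" ]; then
        extra="--num_persistent_param_in_dit 0"
    fi
    printf 'python generate_infinitetalk.py'
    # $extra is left unquoted on purpose: it splits into flag and value,
    # or disappears entirely when empty.
    printf ' %s' \
        --ckpt_dir weights/Wan2.1-I2V-14B-480P \
        --wav2vec_dir weights/chinese-wav2vec2-base \
        --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
        --input_json examples/single_example_image.json \
        --size "$size" \
        --sample_steps 40 \
        $extra \
        --mode streaming \
        --motion_frame 9 \
        --save_file "$save_file"
    printf '\n'
}

infinitetalk_cmd infinitetalk-720 infinitetalk_res_720p
infinitetalk_cmd infinitetalk-480 infinitetalk_res_lowvram lowvram
```

Once the printed command looks right, it can be pasted into the shell or run with eval "$(infinitetalk_cmd ...)".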
Multi-GPU, Multi-Person, And Gradio
Multi-GPU inference example:
```shell
GPU_NUM=8
torchrun --nproc_per_node=$GPU_NUM --standalone generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --dit_fsdp --t5_fsdp \
    --ulysses_size=$GPU_NUM \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_multigpu
```
Multi-person animation example:
```shell
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/multi/infinitetalk.safetensors \
    --input_json examples/multi_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --num_persistent_param_in_dit 0 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_multiperson
```
Gradio example with single-person weights:
```shell
python app.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --num_persistent_param_in_dit 0 \
    --motion_frame 9
```
Gradio example with multi-person weights:
```shell
python app.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/multi/infinitetalk.safetensors \
    --num_persistent_param_in_dit 0 \
    --motion_frame 9
```
Acceleration And Quantized Inference
The README also provides an example for running with the FusionX or lightx2v LoRA. FusionX requires 8 sampling steps, while lightx2v requires only 4. The example below uses FusionX.
```shell
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --lora_dir weights/Wan2.1_I2V_14B_FusionX_LoRA.safetensors \
    --input_json examples/single_example_image.json \
    --lora_scale 1.0 \
    --size infinitetalk-480 \
    --sample_text_guide_scale 1.0 \
    --sample_audio_guide_scale 2.0 \
    --sample_steps 8 \
    --mode streaming \
    --motion_frame 9 \
    --sample_shift 2 \
    --num_persistent_param_in_dit 0 \
    --save_file infinitetalk_res_lora
```
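The guide scales and step counts quoted in this article depend on which LoRA, if any, is applied. The sketch below collects them in one lookup. The function name sampling_flags and its none/fusionx/lightx2v keywords are hypothetical. The 40-step value for the no-LoRA case comes from the examples above, and the lightx2v guide scales are assumed to follow the general "after applying LoRA" recommendation.

```shell
# Print the sampling flags recommended in this article for each setup.
sampling_flags() {
    case "$1" in
        none)     text=5; audio=4; steps=40 ;;
        fusionx)  text=1; audio=2; steps=8 ;;
        lightx2v) text=1; audio=2; steps=4 ;;
        *) echo "unknown LoRA: $1" >&2; return 1 ;;
    esac
    echo "--sample_text_guide_scale $text --sample_audio_guide_scale $audio --sample_steps $steps"
}

sampling_flags fusionx
# prints --sample_text_guide_scale 1 --sample_audio_guide_scale 2 --sample_steps 8
```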
The quantized model only supports single-GPU inference. Official example:
```shell
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --mode streaming \
    --quant fp8 \
    --quant_dir weights/InfiniteTalk/quant_models/infinitetalk_single_fp8.safetensors \
    --motion_frame 9 \
    --num_persistent_param_in_dit 0 \
    --save_file infinitetalk_res_quant
```
Practical Notes
InfiniteTalk has heavy dependencies and large models. Before deployment, check a few things:
- Whether your GPU has enough VRAM. Low-VRAM machines should first try --num_persistent_param_in_dit 0 or the quantized model.
- Whether flash_attn can be installed correctly with the current CUDA and PyTorch combination.
- Whether the Hugging Face models are fully downloaded and whether paths match the weights/... paths in the commands.
- Whether the input JSON follows the example format.
- 720P, long videos, multi-person generation, and multi-GPU inference all increase resource requirements significantly.
The README also notes that although the FusionX LoRA can speed up inference and improve quality, it may worsen color shift in videos longer than 1 minute and reduce identity preservation. For I2V, generation from a single image works better for videos under 1 minute; beyond that, color shift becomes more obvious.
Summary
InfiniteTalk is an audio-driven video generation project aimed at research and engineering use. It fits scenarios that need long-form lip sync, character dubbing, digital human video, and image-to-talking-video generation.
For a quick trial, prepare the environment and models according to the official installation steps, then start with the 480P single-GPU example. After paths, dependencies, and VRAM are confirmed, try 720P, multi-person, multi-GPU, LoRA acceleration, or the quantized model.