MultiTalk: Audio-Driven Multi-Person Conversational Video Generation

MeiGen-AI/MultiTalk is an audio-driven multi-person conversational video generation project with support for single-person and multi-person generation, TTS, low-VRAM inference, LoRA acceleration, quantization, and Gradio.

The accompanying paper is titled "Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation". The project generates multi-person conversational videos from multi-stream audio, a reference image, and a text prompt, aiming to keep each speaker's lip motion in sync with the audio and the characters' interaction consistent with the prompt.

MultiTalk is not limited to single-person digital humans. It emphasizes multi-person conversations, singing, interactive control, and cartoon character generation. It supports 480P and 720P output. The README notes that the current code mainly supports 480P inference, while 720P requires multiple GPUs.

Main Capabilities

MultiTalk’s capabilities can be summarized as follows:

  • Single-person and multi-person audio-driven video generation.
  • Prompt-based control of virtual character interaction.
  • Generalized generation for cartoon characters and singing.
  • 480P and 720P output.
  • Video generation up to about 15 seconds.
  • Multi-GPU, TeaCache, APG, low-VRAM inference, TTS, Gradio, LoRA acceleration, and INT8 quantization.

If you are building multi-person interviews, podcast videos, digital human conversations, character interactions, or TTS-driven character videos, MultiTalk is a closer fit than single-person lip-sync tools.

Installation

The official README installation process is preserved below. Use a CUDA-capable Linux environment and prepare Conda, Git, enough VRAM, and enough disk space. The environment name in the README is multitalk.

Clone the repository and enter the directory:

git clone https://github.com/meigen-ai/multitalk.git
cd multitalk

1. Create a conda environment and install PyTorch and xformers

conda create -n multitalk python=3.10
conda activate multitalk
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U xformers==0.0.28 --index-url https://download.pytorch.org/whl/cu121

This uses the PyTorch wheel for CUDA 12.1. If your driver, CUDA version, or platform differs, adjust the command according to the official PyTorch installation guide.
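
After the install, it can help to confirm that the pinned versions actually landed in the environment. The following is a minimal sketch, not part of the README: the helper names `base_version`, `installed_versions`, and `find_mismatches` are hypothetical, and the pins are copied from the commands above.

```python
from importlib import metadata

# Versions pinned by the pip commands above.
PINS = {
    "torch": "2.4.1",
    "torchvision": "0.19.1",
    "torchaudio": "2.4.1",
    "xformers": "0.0.28",
}


def base_version(version: str) -> str:
    """Drop a local build tag such as '+cu121' from a version string."""
    return version.split("+", 1)[0]


def installed_versions(names):
    """Look up installed versions; packages that are absent map to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, pins=PINS):
    """Return {package: message} for every pin that is missing or different."""
    problems = {}
    for name, wanted in pins.items():
        have = installed.get(name)
        if have is None:
            problems[name] = f"not installed (expected {wanted})"
        elif base_version(have) != wanted:
            problems[name] = f"found {have}, expected {wanted}"
    return problems


if __name__ == "__main__":
    for name, message in find_mismatches(installed_versions(PINS)).items():
        print(f"{name}: {message}")
```

Run it inside the activated multitalk environment; no output means every pin matched.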

2. Install Flash-Attn

pip install misaki[en]
pip install ninja
pip install psutil
pip install packaging
pip install flash_attn==2.7.4.post1

flash_attn is sensitive to CUDA, the compiler toolchain, and PyTorch versions. If installation fails, first check whether CUDA, nvcc, GCC, Python, and the PyTorch wheel match.
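
To compare the toolchain pieces listed above in one place, you can collect their version output before retrying the build. This is only a sketch: `tool_report` is a hypothetical helper, and it assumes each tool accepts a `--version` flag, which holds for nvcc, gcc, and python.

```python
import shutil
import subprocess


def tool_report(command, args=("--version",)):
    """Return the tool's version output, or None if it is not on PATH."""
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run(
        [path, *args], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr instead of stdout.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    for tool in ("nvcc", "gcc", "python"):
        report = tool_report(tool)
        print(f"== {tool} ==")
        print(report if report is not None else "not found on PATH")
```

A missing nvcc is a common reason for a failed flash_attn build, so a "not found" line is worth fixing first.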

3. Install other dependencies

pip install -r requirements.txt
conda install -c conda-forge librosa

4. Install FFmpeg

Inside the Conda environment:

conda install -c conda-forge ffmpeg

Or install it at the system level. The README gives this yum example:

sudo yum install ffmpeg ffmpeg-devel

Model Preparation

MultiTalk needs these models:

Model                   Purpose
Wan2.1-I2V-14B-480P     Base model
chinese-wav2vec2-base   Audio encoder
Kokoro-82M              TTS weights
MeiGen-MultiTalk        MultiTalk audio condition weights

The README uses huggingface-cli to download models. The official commands are:

huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download hexgrad/Kokoro-82M --local-dir ./weights/Kokoro-82M
huggingface-cli download MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk

If huggingface-cli is not available, install Hugging Face Hub first:

pip install -U huggingface_hub
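
If you prefer to script the downloads from Python, the huggingface_hub package provides `snapshot_download` and `hf_hub_download`. The sketch below mirrors the five CLI commands under that assumption; `REPOS`, `local_dir_for`, and `download_all` are hypothetical names, and the README itself only documents the CLI route.

```python
from pathlib import Path

# Repository id -> folder name under the weights directory,
# taken from the huggingface-cli commands above.
REPOS = {
    "Wan-AI/Wan2.1-I2V-14B-480P": "Wan2.1-I2V-14B-480P",
    "TencentGameMate/chinese-wav2vec2-base": "chinese-wav2vec2-base",
    "hexgrad/Kokoro-82M": "Kokoro-82M",
    "MeiGen-AI/MeiGen-MultiTalk": "MeiGen-MultiTalk",
}


def local_dir_for(repo_id, weights_root="weights"):
    """Return the target directory for one repository."""
    return Path(weights_root) / REPOS[repo_id]


def download_all(weights_root="weights"):
    """Download every repository, then the extra wav2vec2 safetensors file."""
    # Imported lazily so the path helpers work without huggingface_hub.
    from huggingface_hub import hf_hub_download, snapshot_download

    for repo_id in REPOS:
        snapshot_download(
            repo_id=repo_id, local_dir=local_dir_for(repo_id, weights_root)
        )

    # Mirrors the CLI step that pulls model.safetensors from refs/pr/1.
    wav2vec = "TencentGameMate/chinese-wav2vec2-base"
    hf_hub_download(
        repo_id=wav2vec,
        filename="model.safetensors",
        revision="refs/pr/1",
        local_dir=local_dir_for(wav2vec, weights_root),
    )


if __name__ == "__main__":
    download_all()
```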

After download, the directory should roughly contain:

weights/
  Wan2.1-I2V-14B-480P/
  chinese-wav2vec2-base/
  Kokoro-82M/
  MeiGen-MultiTalk/
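
A quick way to confirm this layout before running inference is to check for the four folders. This is a minimal sketch; `missing_weight_dirs` is a hypothetical helper and it only tests that the directories exist, not that the downloads are complete.

```python
from pathlib import Path

# Folder names from the layout shown above.
EXPECTED_DIRS = (
    "Wan2.1-I2V-14B-480P",
    "chinese-wav2vec2-base",
    "Kokoro-82M",
    "MeiGen-MultiTalk",
)


def missing_weight_dirs(weights_root="weights"):
    """Return the expected model folders that do not exist under weights_root."""
    root = Path(weights_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_weight_dirs()
    print("all model folders present" if not missing else f"missing: {missing}")
```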

The README asks you to link or copy the MultiTalk model into the Wan2.1-I2V-14B-480P directory.

Using symlinks:

mv weights/Wan2.1-I2V-14B-480P/diffusion_pytorch_model.safetensors.index.json weights/Wan2.1-I2V-14B-480P/diffusion_pytorch_model.safetensors.index.json_old
sudo ln -s {Absolute path}/weights/MeiGen-MultiTalk/diffusion_pytorch_model.safetensors.index.json weights/Wan2.1-I2V-14B-480P/
sudo ln -s {Absolute path}/weights/MeiGen-MultiTalk/multitalk.safetensors weights/Wan2.1-I2V-14B-480P/

Or copy the files:

mv weights/Wan2.1-I2V-14B-480P/diffusion_pytorch_model.safetensors.index.json weights/Wan2.1-I2V-14B-480P/diffusion_pytorch_model.safetensors.index.json_old
cp weights/MeiGen-MultiTalk/diffusion_pytorch_model.safetensors.index.json weights/Wan2.1-I2V-14B-480P/
cp weights/MeiGen-MultiTalk/multitalk.safetensors weights/Wan2.1-I2V-14B-480P/

If you use symlinks, replace {Absolute path} with your real absolute path. On Windows or WSL, also verify symlink permissions and path resolution.
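
The symlink variant can also be scripted so the absolute path is resolved for you. The sketch below is an illustration rather than the README's method: `link_multitalk_weights` is a hypothetical helper, it assumes the directory layout shown earlier, and it creates the links as the current user instead of using sudo.

```python
import os
from pathlib import Path

INDEX = "diffusion_pytorch_model.safetensors.index.json"
ADAPTER = "multitalk.safetensors"


def link_multitalk_weights(weights_root="weights"):
    """Back up the Wan index file and symlink the MultiTalk files next to it."""
    root = Path(weights_root).resolve()  # symlink targets must be absolute
    base = root / "Wan2.1-I2V-14B-480P"
    multitalk = root / "MeiGen-MultiTalk"

    # Keep the original Wan index under the same "_old" name the README uses.
    original_index = base / INDEX
    if original_index.exists() and not original_index.is_symlink():
        original_index.rename(base / (INDEX + "_old"))

    for name in (INDEX, ADAPTER):
        link = base / name
        if link.is_symlink():
            link.unlink()  # replace a stale link left by an earlier run
        os.symlink(multitalk / name, link)
    return base


if __name__ == "__main__":
    print(f"linked MultiTalk files into {link_multitalk_weights()}")
```

If a regular file with the same name already exists (for example from the copy variant), `os.symlink` raises `FileExistsError` rather than overwriting it, which is a deliberately conservative choice here.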

Common Parameters

The README lists these parameters:

--mode streaming: long video generation.
--mode clip: generate short video with one chunk.
--use_teacache: run with TeaCache.
--size multitalk-480: generate 480P video.
--size multitalk-720: generate 720P video.
--use_apg: run with APG.
--teacache_thresh: a coefficient used for TeaCache acceleration.
--sample_text_guide_scale: When not using LoRA, the optimal value is 5. After applying LoRA, the recommended value is 1.
--sample_audio_guide_scale: When not using LoRA, the optimal value is 4. After applying LoRA, the recommended value is 2.

The README writes the last two flags with an em dash followed by a hyphen rather than a double hyphen. They are shown here with the standard -- prefix, which matches the example commands later in this article. If you copy the flags from the README itself, retype the dashes.
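
The LoRA-dependent guidance values in the list above can be captured in a small lookup so scripts do not hard-code them in several places. This is a sketch only; `recommended_guide_scales` and `as_cli_args` are hypothetical helpers, and the numbers are the ones the README recommends.

```python
def recommended_guide_scales(use_lora):
    """Return the README's suggested guidance scales with or without LoRA."""
    if use_lora:
        return {"sample_text_guide_scale": 1.0, "sample_audio_guide_scale": 2.0}
    return {"sample_text_guide_scale": 5.0, "sample_audio_guide_scale": 4.0}


def as_cli_args(scales):
    """Turn the scale mapping into command-line arguments."""
    args = []
    for name, value in scales.items():
        args += [f"--{name}", str(value)]
    return args


if __name__ == "__main__":
    print(" ".join(as_cli_args(recommended_guide_scales(use_lora=True))))
```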

Single-Person Inference

Single-person, single-GPU run:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/single_example_1.json \
    --sample_steps 40 \
    --mode streaming \
    --use_teacache \
    --save_file single_long_exp

Single-person low-VRAM run:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/single_example_1.json \
    --sample_steps 40 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --use_teacache \
    --save_file single_long_lowvram_exp

Single-person multi-GPU inference:

GPU_NUM=8
torchrun --nproc_per_node=$GPU_NUM --standalone generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --dit_fsdp --t5_fsdp \
    --ulysses_size=$GPU_NUM \
    --input_json examples/single_example_1.json \
    --sample_steps 40 \
    --mode streaming \
    --use_teacache \
    --save_file single_long_multigpu_exp

Single-person TTS run:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/single_example_tts_1.json \
    --sample_steps 40 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --use_teacache \
    --save_file single_long_lowvram_tts_exp \
    --audio_mode tts
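
The single-GPU commands above differ only in a few flags, so a small builder can assemble them when you want to drive runs from a script. This is a sketch under that observation: `build_command` and its keyword names are hypothetical, while the flags themselves are the ones used in the README examples.

```python
import shlex


def build_command(
    input_json,
    save_file,
    *,
    low_vram=False,
    tts=False,
    sample_steps=40,
    ckpt_dir="weights/Wan2.1-I2V-14B-480P",
    wav2vec_dir="weights/chinese-wav2vec2-base",
):
    """Assemble a generate_multitalk.py argument list for a single-GPU run."""
    cmd = [
        "python", "generate_multitalk.py",
        "--ckpt_dir", ckpt_dir,
        "--wav2vec_dir", wav2vec_dir,
        "--input_json", input_json,
        "--sample_steps", str(sample_steps),
        "--mode", "streaming",
    ]
    if low_vram:
        # Same low-VRAM switch as in the examples above.
        cmd += ["--num_persistent_param_in_dit", "0"]
    cmd += ["--use_teacache", "--save_file", save_file]
    if tts:
        cmd += ["--audio_mode", "tts"]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "examples/single_example_tts_1.json",
        "single_long_lowvram_tts_exp",
        low_vram=True,
        tts=True,
    )
    print(shlex.join(cmd))  # pass cmd to subprocess.run to execute it
```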

Multi-Person Inference

Multi-person, single-GPU run:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/multitalk_example_2.json \
    --sample_steps 40 \
    --mode streaming \
    --use_teacache \
    --save_file multi_long_exp

Multi-person low-VRAM run:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/multitalk_example_2.json \
    --sample_steps 40 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --use_teacache \
    --save_file multi_long_lowvram_exp

Multi-person multi-GPU inference:

GPU_NUM=8
torchrun --nproc_per_node=$GPU_NUM --standalone generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --dit_fsdp --t5_fsdp --ulysses_size=$GPU_NUM \
    --input_json examples/multitalk_example_2.json \
    --sample_steps 40 \
    --mode streaming --use_teacache \
    --save_file multi_long_multigpu_exp

Multi-person TTS run:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/multitalk_example_tts_1.json \
    --sample_steps 40 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --use_teacache \
    --save_file multi_long_lowvram_tts_exp \
    --audio_mode tts

FusionX, Quantization, and Gradio

The FusionX or Lightx2v LoRA can reduce the number of sampling steps: the examples here use 8 steps instead of the 40 used without LoRA. The README’s single-person example:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/single_example_1.json \
    --lora_dir weights/Wan2.1_I2V_14B_FusionX_LoRA.safetensors \
    --lora_scale 1.0 \
    --sample_text_guide_scale 1.0 \
    --sample_audio_guide_scale 2.0 \
    --sample_steps 8 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --save_file single_long_lowvram_fusionx_exp \
    --sample_shift 2

Multi-person example:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/multitalk_example_2.json \
    --lora_dir weights/Wan2.1_I2V_14B_FusionX_LoRA.safetensors \
    --lora_scale 1.0 \
    --sample_text_guide_scale 1.0 \
    --sample_audio_guide_scale 2.0 \
    --sample_steps 8 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --save_file multi_long_lowvram_fusionx_exp

The INT8 quantized model only supports single-GPU runs:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/multitalk_example_2.json \
    --sample_steps 40 \
    --mode streaming \
    --use_teacache \
    --quant int8 \
    --quant_dir weights/MeiGen-MultiTalk \
    --num_persistent_param_in_dit 0 \
    --save_file multi_long_lowvram_exp_quant

Quantization with LoRA:

python generate_multitalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --input_json examples/multitalk_example_1.json \
    --quant int8 \
    --quant_dir weights/MeiGen-MultiTalk \
    --lora_dir weights/MeiGen-MultiTalk/quant_models/quant_model_int8_FusionX.safetensors \
    --sample_text_guide_scale 1.0 \
    --sample_audio_guide_scale 2.0 \
    --sample_steps 8 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --save_file multi_long_lowvram_fusionx_exp_quant \
    --sample_shift 2
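
Two constraints mentioned in this article are easy to violate when mixing options: the INT8 model supports only single-GPU runs, and 720P requires multiple GPUs. A pre-flight check can catch both before a long run starts. The sketch below is illustrative; `check_run_config` and its parameter names are hypothetical and are not part of generate_multitalk.py.

```python
def check_run_config(*, gpu_count=1, quant=None, size="multitalk-480"):
    """Return a list of problems with the chosen option combination."""
    problems = []
    if quant == "int8" and gpu_count > 1:
        problems.append("the INT8 quantized model only supports single-GPU runs")
    if size == "multitalk-720" and gpu_count < 2:
        problems.append("720P requires multiple GPUs")
    return problems


if __name__ == "__main__":
    for problem in check_run_config(gpu_count=8, quant="int8"):
        print(f"config error: {problem}")
```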

Gradio example:

python app.py \
    --lora_dir weights/Wan2.1_I2V_14B_FusionX_LoRA.safetensors \
    --lora_scale 1.0 \
    --num_persistent_param_in_dit 0 \
    --sample_shift 2

Or:

python app.py --num_persistent_param_in_dit 0

Quantized Gradio example:

python app.py \
    --quant int8 \
    --quant_dir weights/MeiGen-MultiTalk \
    --lora_dir weights/MeiGen-MultiTalk/quant_models/quant_model_int8_FusionX.safetensors \
    --sample_shift 2 \
    --num_persistent_param_in_dit 0

Practical Notes

MultiTalk has heavy dependencies and large models. Before deployment, check these points:

  • The current code mainly supports 480P inference; 720P requires multiple GPUs.
  • The official notes say audio CFG usually works well between 3 and 5; raising it can improve lip synchronization.
  • The model was trained on 81-frame videos at 25 FPS. Generating 81 frames gives the best prompt following; longer clips may follow the prompt less closely.
  • During long-video generation, audio CFG also affects color consistency across segments. Try setting it to 3 to reduce tonal variation.
  • The recommended --teacache_thresh range is 0.2 to 0.5. Higher values may run faster but can reduce video quality.
  • For low-VRAM environments, try --num_persistent_param_in_dit 0, INT8 quantization, or community low-VRAM solutions first.

The README also notes that TeaCache can provide about a 2x to 3x speedup. Actual results depend on GPU, resolution, sampling steps, number of people, and whether LoRA or quantization is used.
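
To put the numbers in these notes in perspective: 81 frames at 25 FPS is 81 / 25 = 3.24 seconds, so a clip near the 15-second limit is several times longer than the training length. The sketch below does that arithmetic and checks a TeaCache threshold against the recommended range. The helper names are hypothetical, and applying 25 FPS to the 15-second figure is an assumption.

```python
TRAIN_FRAMES = 81  # training clip length in frames, from the notes above
FPS = 25           # training frame rate, from the notes above


def clip_seconds(frames=TRAIN_FRAMES, fps=FPS):
    """Duration in seconds of a clip with the given frame count."""
    return frames / fps


def teacache_thresh_in_range(value, low=0.2, high=0.5):
    """True if value lies in the recommended --teacache_thresh range."""
    return low <= value <= high


if __name__ == "__main__":
    print(f"training clip length: {clip_seconds():.2f} s")
    # Assumes a 15-second output also runs at 25 FPS.
    print(f"frames in 15 s: {15 * FPS}")
    print(f"0.3 in recommended range: {teacache_thresh_in_range(0.3)}")
```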

Summary

MultiTalk is an audio-driven generation project for multi-person conversational video. It fits multi-person digital humans, interview videos, TTS-driven videos, cartoon character interaction, and singing scenarios.

For a quick trial, prepare the environment and models using the official installation process, complete the model linking or copying step, then start with the 480P single-person example. After that works, gradually try multi-person generation, TTS, TeaCache, LoRA acceleration, INT8 quantization, and Gradio.
