LongCat-Video-Avatar-1.5: Meituan's Open Audio-Driven Avatar Video Model

A concise look at Meituan LongCat's LongCat-Video-Avatar-1.5 on Hugging Face: an audio-driven avatar video model supporting AT2V, ATI2V, video continuation, single- and multi-person audio input, distilled inference, and INT8 quantization.

LongCat-Video-Avatar-1.5 is an audio-driven avatar video generation model released by Meituan’s LongCat team.

Project: https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5

It is not a general text-to-video model. It is designed for “given speech and character conditions, generate a video where the person speaks, moves steadily, and keeps a consistent identity.” According to the model card, it supports Audio-Text-to-Video, Audio-Text-Image-to-Video, and Video Continuation, with both single-stream and multi-stream audio inputs.

At the time of writing, the Hugging Face page lists the model under the MIT License, with tags such as audio-text-to-video, audio-image-text-to-video, audio-driven-video-continuation, avatar, and video-generation.

What changed in 1.5

The official model card describes LongCat-Video-Avatar 1.5 as a more production-oriented open-source framework focused on improving stability for audio-driven human video generation.

Several changes stand out.

First, the audio encoder has moved from Wav2Vec2 to Whisper-Large. The official description says this brings smoother and more natural lip dynamics. In practice, if lip sync matters for your use case, prefer --model_type avatar-v1.5.

Second, it emphasizes long-video stability and identity consistency. Avatar videos often fail in two ways: the mouth does not match the audio in short clips, or the face, body, clothes, and motion drift in longer clips. One selling point of LongCat-Video-Avatar-1.5 is that it looks at lip sync, full-body temporal stability, and identity consistency together.

Third, it is not limited to realistic talking-head broadcasting. The model card says it generalizes to anime, animals, multi-person interactions, object handling, and more complex conditions. That makes it relevant not only for news-style digital humans, but also short drama, singing, e-commerce narration, animated characters, and animal characters.

Fourth, it provides 8-step inference. The model card mentions DMD2-based step distillation, which reduces inference to 8 NFE (number of function evaluations) to balance serving cost and visual quality. This matters for video models because generation is expensive, and every sampling step that is removed lowers per-clip latency and cost.

Supported tasks

Based on the model card and sample commands, the model mainly covers three task groups.

The first is single-person animation.

It supports video generation from audio plus text, and video generation from audio plus an image. A typical use case is supplying a voice clip so that a character speaks, performs, or presents.

The second is video continuation.

The examples use parameters such as --num_segments=5, --ref_img_index=10, and --mask_frame_range=3 to continue generating longer clips under existing character conditions. This is useful for long narration, courses, singing, and continuous performance.

The third is multi-person animation.

Multi-person mode uses run_demo_avatar_multi_audio_to_video.py and supports multiple audio streams. The model card also explains two dual-audio modes: when audio_type is para, merge mode requires two equal-length clips; when it is add, concatenation mode sequentially joins two clips and pads gaps with silence.
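The difference between the two dual-audio modes is easiest to see in terms of stream lengths. The following is a minimal Python sketch, not the repository's implementation: combine_dual_audio is a hypothetical helper, audio clips are represented as plain lists of samples, and the exact placement of the silence padding in add mode is an assumption based on the description above.

```python
def combine_dual_audio(clip_a, clip_b, audio_type):
    """Return one stream per speaker, both of the same length.

    Hypothetical helper illustrating the two modes; not the project's code.
    """
    if audio_type == "para":
        # Merge mode: both speakers share one timeline,
        # so the two clips must already have equal length.
        if len(clip_a) != len(clip_b):
            raise ValueError("para mode requires two equal-length clips")
        return list(clip_a), list(clip_b)
    if audio_type == "add":
        # Concatenation mode: speaker A talks first, then speaker B.
        # Assumption: each stream is silent (zeros) while the other speaks.
        stream_a = list(clip_a) + [0.0] * len(clip_b)
        stream_b = [0.0] * len(clip_a) + list(clip_b)
        return stream_a, stream_b
    raise ValueError(f"unknown audio_type: {audio_type!r}")
```

Under this reading, para keeps the total length at that of a single clip, while add produces a timeline as long as both clips combined.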

Installation and model download

The official flow starts by cloning the LongCat-Video repository:

git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video

Then create a Python 3.10 environment and install PyTorch according to your CUDA version. The CUDA 12.4 example in the model card is:

conda create -n longcat-video python=3.10
conda activate longcat-video
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

You also need to install flash_attn==2.7.4.post1, the project requirements, librosa, ffmpeg, and the packages listed in requirements_avatar.txt. The model card says FlashAttention-2 is enabled by default, and the config can also be changed to FlashAttention-3 or xformers.
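Before running any demo, it can help to confirm that the dependencies listed above are actually visible from the active environment. The sketch below is a hypothetical helper (missing_dependencies is not part of the repository); it assumes the Python packages import as torch, flash_attn, and librosa, and that ffmpeg is expected on PATH.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("torch", "flash_attn", "librosa"),
                         binaries=("ffmpeg",)):
    """Return the names of modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on PATH.
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All listed dependencies were found.")
```

This only checks presence, not versions, so a mismatched PyTorch or FlashAttention build would still need to be caught separately.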

Download weights with huggingface-cli:

pip install "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
huggingface-cli download meituan-longcat/LongCat-Video-Avatar-1.5 --local-dir ./weights/LongCat-Video-Avatar-1.5

Note that inference depends on two weight directories: LongCat-Video as the base video generation model, and LongCat-Video-Avatar-1.5 as the avatar model.
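Since both directories are needed, a quick pre-flight check can save a failed launch. The following is a small sketch under the assumption that the weights were downloaded into ./weights with the directory names used in the commands above; missing_weight_dirs is a hypothetical helper, and it only checks that each directory exists and is non-empty, not that the download is complete.

```python
from pathlib import Path

# Directory names taken from the download commands above.
REQUIRED_WEIGHT_DIRS = ("LongCat-Video", "LongCat-Video-Avatar-1.5")


def missing_weight_dirs(weights_root):
    """Return the required weight directories that are absent or empty."""
    root = Path(weights_root)
    missing = []
    for name in REQUIRED_WEIGHT_DIRS:
        path = root / name
        # Treat an empty directory the same as a missing one.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_weight_dirs("./weights")
    if missing:
        print("Missing or empty weight directories:", ", ".join(missing))
    else:
        print("Both weight directories are present.")
```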

Quick inference examples

Single-person Audio-Text-to-Video:

torchrun --nproc_per_node=2 run_demo_avatar_single_audio_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 --stage_1=at2v --input_json=assets/avatar/single_example_1.json --use_distill --model_type avatar-v1.5 --use_int8

Single-person Audio-Image-to-Video:

torchrun --nproc_per_node=2 run_demo_avatar_single_audio_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 --stage_1=ai2v --input_json=assets/avatar/single_example_1.json --use_distill --model_type avatar-v1.5 --use_int8

Multi-person Audio-Image-to-Video:

torchrun --nproc_per_node=2 run_demo_avatar_multi_audio_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 --input_json=assets/avatar/multi_example_1.json --use_distill --model_type avatar-v1.5 --use_int8

All three commands pass --model_type avatar-v1.5 together with --use_distill and --use_int8. The model card states that --use_distill is required when using avatar-v1.5; --use_int8 loads the INT8-quantized DiT model to reduce VRAM usage and is only supported with avatar-v1.5.
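The two flag constraints are easy to violate when commands are assembled by a script. Below is a minimal sketch of a hypothetical wrapper, build_avatar_command, that assembles the same argument list as the examples and rejects invalid combinations before launching. It uses only flags that appear in the commands above; setting --context_parallel_size equal to --nproc_per_node mirrors the examples and is otherwise an assumption.

```python
def build_avatar_command(script, input_json, model_type="avatar-v1.5",
                         use_distill=True, use_int8=False, stage_1=None,
                         num_gpus=2,
                         checkpoint_dir="./weights/LongCat-Video-Avatar-1.5"):
    """Build a torchrun argument list, enforcing the model card's flag rules."""
    is_v15 = model_type == "avatar-v1.5"
    if is_v15 and not use_distill:
        raise ValueError("--use_distill is required with avatar-v1.5")
    if use_int8 and not is_v15:
        raise ValueError("--use_int8 is only supported with avatar-v1.5")

    cmd = [
        "torchrun", f"--nproc_per_node={num_gpus}", script,
        # Assumption: context parallel size matches the GPU count, as in the examples.
        f"--context_parallel_size={num_gpus}",
        f"--checkpoint_dir={checkpoint_dir}",
    ]
    if stage_1 is not None:
        # The single-person script takes --stage_1; the multi-person example omits it.
        cmd.append(f"--stage_1={stage_1}")
    cmd.append(f"--input_json={input_json}")
    if use_distill:
        cmd.append("--use_distill")
    cmd += ["--model_type", model_type]
    if use_int8:
        cmd.append("--use_int8")
    return cmd
```

The returned list can be printed with " ".join(cmd) for inspection, or passed to subprocess.run(cmd, check=True) from inside the LongCat-Video checkout.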

Tuning parameters

The model card gives several practical tips.

If lip sync is not good enough, increase audio CFG. The recommended range is 3 to 5, and higher values usually help synchronization.

Prompts should not be too short. Longer, more specific descriptions usually improve character consistency and naturalness. Character appearance, action, scene, clothing, and expression are all useful details.

If repeated actions appear, adjust --ref_img_index and --mask_frame_range. The model card says --ref_img_index between 0 and 24 is better for consistency, while setting it to 30 can help reduce repeated actions. Increasing --mask_frame_range may also help, but overly large values can introduce artifacts.
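When tuning, it is usually simpler to sweep a few values than to guess one. The sketch below enumerates candidate settings from the ranges mentioned above. The dictionary keys audio_cfg and ref_img_index are illustrative labels only (the command-line flag for audio CFG is not named here), and the specific sample points within each range are an assumption.

```python
from itertools import product

# Recommended audio CFG range from the model card: 3 to 5.
AUDIO_CFG_VALUES = (3.0, 4.0, 5.0)
# 0 to 24 favors consistency; 30 is the value suggested against repeated actions.
REF_IMG_INDEX_VALUES = (0, 10, 24, 30)


def tuning_grid():
    """Return every combination of the candidate values as a list of dicts."""
    return [
        {"audio_cfg": cfg, "ref_img_index": idx}
        for cfg, idx in product(AUDIO_CFG_VALUES, REF_IMG_INDEX_VALUES)
    ]


if __name__ == "__main__":
    for setting in tuning_grid():
        print(setting)
```

Each entry can then be mapped onto one generation run, comparing lip sync and motion repetition across the outputs.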

For resolution, the model supports 480P and 720P through --resolution.

Good use cases

The official previews cover broadcasting, acting, singing, e-commerce marketing, multi-person conversation, animation, and animal characters.

In practice, it fits these directions:

  • News broadcasting, knowledge explanation, and course narration.
  • E-commerce product introduction and marketing shorts.
  • Virtual streamers, virtual-character short drama, and singing performance.
  • Audio-driven animation for anime or animal characters.
  • Multi-person conversational digital human videos.

The most interesting point is that it handles “lip sync” and “long-video stability” in the same framework. Many avatar models look fine in short clips, but drift in identity, repeat motions, or lose body stability once generation is extended. LongCat-Video-Avatar-1.5 explicitly treats these as optimization targets.

Things to watch

First, this is not a hosted model directly available through Hugging Face Inference Providers. The page says it is not currently deployed by an Inference Provider, so real usage requires preparing the environment, downloading weights, and running the LongCat-Video code yourself.

Second, local deployment is not lightweight. The examples use torchrun --nproc_per_node=2 and context_parallel_size=2, and depend on PyTorch, FlashAttention, ffmpeg, librosa, and multiple model weights. Even with INT8 quantization, it is better suited to users with a multi-GPU or otherwise well-provisioned environment.

Third, avatar video involves likeness, voice, privacy, and content safety. The model card also reminds developers to assess accuracy, safety, and fairness themselves, and to comply with laws and regulations around data protection, privacy, and content safety. When generating real human likenesses or commercial videos, authorization and compliance matter more than visual quality.

Fourth, do not treat the generic Hugging Face “Diffusers/Transformers usage snippets” on the model card as the full inference path for this project. Real avatar inference should follow the LongCat-Video repository and the run_demo_avatar_* examples in the model card.

Summary

LongCat-Video-Avatar-1.5 is a notable open-source avatar video model. It does more than make a face talk: it combines audio driving, character consistency, long-video stability, multi-person audio, and distilled inference in one framework.

If you care about virtual streamers, e-commerce narration, course videos, animated characters, or multi-person dialogue videos, it is worth testing. But it is closer to a model for research and engineering teams to deploy and tune than an out-of-the-box web tool. Real deployment needs compute, asset authorization, prompt tuning, and content compliance workflows.
