Hugging Face on KnightLi Blog

LongCat-Video-Avatar-1.5: Meituan's Open Audio-Driven Avatar Video Model

Mon, 25 May 2026 07:53:43 +0800

LongCat-Video-Avatar-1.5 is an audio-driven avatar video generation model released by Meituan’s LongCat team.

Project: https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5

It is not a general text-to-video model. It targets a narrower task: given speech and character conditions, generate a video in which the person speaks, moves stably, and keeps a consistent identity. According to the model card, it supports Audio-Text-to-Video, Audio-Text-Image-to-Video, and Video Continuation, with both single-stream and multi-stream audio inputs.

At the time of writing, the Hugging Face page lists the model under the MIT License, with tags such as audio-text-to-video, audio-image-text-to-video, audio-driven-video-continuation, avatar, and video-generation.

What changed in 1.5

The official model card describes LongCat-Video-Avatar 1.5 as a more production-oriented open-source framework focused on improving stability for audio-driven human video generation.

Several changes stand out.

First, the audio encoder has moved from Wav2Vec2 to Whisper-Large. The official description says this brings smoother and more natural lip dynamics. In practice, scenarios that care about lip sync should prefer --model_type avatar-v1.5.

Second, it emphasizes long-video stability and identity consistency. Avatar videos often fail in two ways: the mouth does not match the audio in short clips, or the face, body, clothes, and motion drift in longer clips. One selling point of LongCat-Video-Avatar-1.5 is that it looks at lip sync, full-body temporal stability, and identity consistency together.

Third, it is not limited to realistic talking-head broadcasting. The model card says it generalizes to anime, animals, multi-person interactions, object handling, and more complex conditions. That makes it relevant not only for news-style digital humans, but also short drama, singing, e-commerce narration, animated characters, and animal characters.

Fourth, it provides 8-step inference. The model card mentions DMD2-based step distillation, which reduces inference to 8 NFE (number of function evaluations) to balance serving cost and visual quality. This matters for video models because each denoising step is a full forward pass over the video, so fewer steps directly lower latency and serving cost.

Supported tasks

Based on the model card and sample commands, the model mainly covers three task groups.

The first is single-person animation.

It supports video generation from audio plus text, and video generation from audio plus an image. A typical use case is giving a voice clip to make a character speak, perform, or present.

The second is video continuation.

The examples use parameters such as --num_segments=5, --ref_img_index=10, and --mask_frame_range=3 to continue generating longer clips under existing character conditions. This is useful for long narration, courses, singing, and continuous performance.

The third is multi-person animation.

Multi-person mode uses run_demo_avatar_multi_audio_to_video.py and supports multiple audio streams. The model card describes two dual-audio modes, selected by audio_type: para (merge mode) requires two clips of equal length, while add (concatenation mode) joins the two clips one after the other and pads the gaps with silence.
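To make the two audio_type modes concrete, here is a minimal sketch of one plausible reading of them. It is not the repository's code: align_streams is a hypothetical helper, plain lists of samples stand in for waveform arrays, and the assumption is that add mode keeps each speaker silent while the other one talks.

```python
from typing import List, Tuple


def align_streams(
    clip_a: List[float], clip_b: List[float], audio_type: str
) -> Tuple[List[float], List[float]]:
    """Return one track per speaker, both of equal length (illustrative only)."""
    if audio_type == "para":
        # Merge mode: both speakers share the same time span,
        # so the two clips must already have the same length.
        if len(clip_a) != len(clip_b):
            raise ValueError("para mode expects two clips of equal length")
        return list(clip_a), list(clip_b)
    if audio_type == "add":
        # Concatenation mode: speaker A first, then speaker B.
        # Each track is padded with silence (0.0) while the other speaker talks.
        track_a = list(clip_a) + [0.0] * len(clip_b)
        track_b = [0.0] * len(clip_a) + list(clip_b)
        return track_a, track_b
    raise ValueError(f"unknown audio_type: {audio_type!r}")
```

With this reading, add mode always yields tracks whose length is the sum of both clips, which is why it does not need equal-length inputs.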

Installation and model download

The official flow starts by cloning the LongCat-Video repository:

git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video

Then create a Python 3.10 environment and install PyTorch according to your CUDA version. The CUDA 12.4 example in the model card is:

conda create -n longcat-video python=3.10
conda activate longcat-video
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

You also need to install flash_attn==2.7.4.post1, the project requirements, librosa, ffmpeg, and the packages listed in requirements_avatar.txt. The model card says FlashAttention-2 is enabled by default, and the config can also be changed to FlashAttention-3 or xformers.

Download weights with huggingface-cli:

pip install "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
huggingface-cli download meituan-longcat/LongCat-Video-Avatar-1.5 --local-dir ./weights/LongCat-Video-Avatar-1.5

Note that it depends on two weight directories: LongCat-Video as the base video generation model, and LongCat-Video-Avatar-1.5 as the avatar model.
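Because inference needs both directories, a quick pre-flight check can save a failed launch. The sketch below is an assumption-laden helper, not part of the LongCat-Video repository: missing_weight_dirs is a hypothetical name, and it only checks that each directory exists and is non-empty, not that the files inside are complete.

```python
from pathlib import Path
from typing import List

# Directory names taken from the --local-dir values in the download commands.
REQUIRED_WEIGHT_DIRS = ("LongCat-Video", "LongCat-Video-Avatar-1.5")


def missing_weight_dirs(weights_root: str) -> List[str]:
    """Return the required weight directories that are absent or empty."""
    root = Path(weights_root)
    missing = []
    for name in REQUIRED_WEIGHT_DIRS:
        path = root / name
        # A directory that exists but holds no entries counts as missing.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing
```

Calling missing_weight_dirs("./weights") before torchrun returns an empty list when both downloads are in place.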

Quick inference examples

Single-person Audio-Text-to-Video:

torchrun --nproc_per_node=2 run_demo_avatar_single_audio_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 --stage_1=at2v --input_json=assets/avatar/single_example_1.json --use_distill --model_type avatar-v1.5 --use_int8

Single-person Audio-Image-to-Video:

torchrun --nproc_per_node=2 run_demo_avatar_single_audio_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 --stage_1=ai2v --input_json=assets/avatar/single_example_1.json --use_distill --model_type avatar-v1.5 --use_int8

Multi-person Audio-Image-to-Video:

torchrun --nproc_per_node=2 run_demo_avatar_multi_audio_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 --input_json=assets/avatar/multi_example_1.json --use_distill --model_type avatar-v1.5 --use_int8

These commands share a few choices: they all use --model_type avatar-v1.5, include --use_distill, and enable --use_int8 in the examples. The model card states that --use_distill is required when using avatar-v1.5; --use_int8 loads the INT8-quantized DiT model to reduce VRAM usage and is only supported with avatar-v1.5.
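The flag rules above are easy to get wrong when scripting many runs, so here is a small sketch that assembles the argument list and enforces them. build_avatar_command is a hypothetical wrapper, not something shipped with the project; it uses only the flags shown in the examples, and it assumes context_parallel_size should equal the number of processes because both are 2 in every sample command.

```python
from typing import List, Optional


def build_avatar_command(
    script: str,
    input_json: str,
    stage_1: Optional[str] = None,
    model_type: str = "avatar-v1.5",
    use_distill: bool = True,
    use_int8: bool = True,
    num_gpus: int = 2,
    checkpoint_dir: str = "./weights/LongCat-Video-Avatar-1.5",
) -> List[str]:
    """Build a torchrun argument list mirroring the sample commands."""
    # Constraints as stated by the model card.
    if model_type == "avatar-v1.5" and not use_distill:
        raise ValueError("--use_distill is required with avatar-v1.5")
    if use_int8 and model_type != "avatar-v1.5":
        raise ValueError("--use_int8 is only supported with avatar-v1.5")

    cmd = [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        script,
        # Assumption: context parallel size matches the process count.
        f"--context_parallel_size={num_gpus}",
        f"--checkpoint_dir={checkpoint_dir}",
    ]
    if stage_1 is not None:  # the multi-person example passes no --stage_1
        cmd.append(f"--stage_1={stage_1}")
    cmd.append(f"--input_json={input_json}")
    if use_distill:
        cmd.append("--use_distill")
    cmd += ["--model_type", model_type]
    if use_int8:
        cmd.append("--use_int8")
    return cmd
```

The returned list can be passed to subprocess.run from inside the LongCat-Video checkout; joining it with spaces reproduces the single-person Audio-Text-to-Video command when stage_1 is "at2v".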

Tuning parameters

The model card gives several practical tips.

If lip sync is not good enough, increase audio CFG. The recommended range is 3 to 5, and higher values usually help synchronization.

Prompts should not be too short. Longer, more specific descriptions usually improve character consistency and naturalness. Character appearance, action, scene, clothing, and expression are all useful details.

If repeated actions appear, adjust --ref_img_index and --mask_frame_range. The model card says --ref_img_index between 0 and 24 is better for consistency, while setting it to 30 can help reduce repeated actions. Increasing --mask_frame_range may also help, but overly large values can introduce artifacts.

For resolution, the model supports 480P and 720P through --resolution.
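The tuning tips can be turned into a tiny sanity check before a long run. This is only a sketch of the guidance quoted above: tuning_hints is a hypothetical helper, and treating every ref_img_index outside 0 to 24 as the "fewer repeated actions" setting is an assumption generalized from the single value 30 mentioned in the text.

```python
from typing import List


def tuning_hints(audio_cfg: float, ref_img_index: int) -> List[str]:
    """Summarize how the chosen values relate to the model card's tips."""
    hints = []
    # The recommended audio CFG range is 3 to 5.
    if not 3 <= audio_cfg <= 5:
        hints.append("audio CFG is outside the recommended 3 to 5 range")
    # 0 to 24 is described as better for consistency.
    if 0 <= ref_img_index <= 24:
        hints.append("ref_img_index favors identity consistency")
    else:
        # Assumption: values outside 0..24 (such as 30) trade consistency
        # for fewer repeated actions.
        hints.append("ref_img_index favors fewer repeated actions")
    return hints
```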

Good use cases

The official previews cover broadcasting, acting, singing, e-commerce marketing, multi-person conversation, animation, and animal characters.

In practice, it fits these directions:

News broadcasting, knowledge explanation, and course narration.
E-commerce product introduction and marketing shorts.
Virtual streamers, virtual-character short drama, and singing performance.
Audio-driven animation for anime or animal characters.
Multi-person conversational digital human videos.

The most interesting point is that it handles “lip sync” and “long-video stability” in the same framework. Many avatar models look fine in short clips, but drift in identity, repeat motions, or lose body stability once generation is extended. LongCat-Video-Avatar-1.5 explicitly treats these as optimization targets.

Things to watch

First, this is not a hosted model directly available through Hugging Face Inference Providers. The page says it is not currently deployed by an Inference Provider, so real usage requires preparing the environment, downloading weights, and running the LongCat-Video code yourself.

Second, local deployment is not lightweight. The examples use torchrun --nproc_per_node=2 and context_parallel_size=2, and depend on PyTorch, FlashAttention, ffmpeg, librosa, and multiple model weights. Even with INT8 quantization, it is better suited to users with a stronger GPU environment.

Third, avatar video involves likeness, voice, privacy, and content safety. The model card also reminds developers to assess accuracy, safety, and fairness themselves, and to comply with laws and regulations around data protection, privacy, and content safety. When generating real human likenesses or commercial videos, authorization and compliance matter more than visual quality.

Fourth, do not treat the generic Hugging Face “Diffusers/Transformers usage snippets” on the model card as the full inference path for this project. Real avatar inference should follow the LongCat-Video repository and the run_demo_avatar_* examples in the model card.

Summary

LongCat-Video-Avatar-1.5 is a notable open-source avatar video model. It is not just making a face talk; it combines audio driving, character consistency, long-video stability, multi-person audio, and distilled inference in one framework.

If you care about virtual streamers, e-commerce narration, course videos, animated characters, or multi-person dialogue videos, it is worth testing. But it is closer to a model for research and engineering teams to deploy and tune than an out-of-the-box web tool. Real deployment needs compute, asset authorization, prompt tuning, and content compliance workflows.

References

LongCat-Video-Avatar-1.5 Hugging Face: https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5
LongCat-Video GitHub: https://github.com/meituan-longcat/LongCat-Video
LongCat-Video-Avatar-1.5 Technical Report: https://github.com/meituan-longcat/LongCat-Video

Gemma 4 E4B Uncensored vs Official: What Actually Changes

Sat, 18 Apr 2026 10:20:00 +0800

If you see a model like HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive, the most important point is this: it is not a new Google base model. It is a derivative release built on top of the official google/gemma-4-E4B-it, but with alignment behavior intentionally pushed toward fewer refusals.

That means the real difference is usually behavioral policy and response style, not a brand-new architecture.

What the derivative model explicitly claims

According to its Hugging Face model card, the HauhauCS release says:

it is based on google/gemma-4-E4B-it
it makes “no changes to datasets or capabilities”
it is “just without the refusals”
the Aggressive variant is “fully unlocked and won’t refuse prompts”

Those are the creator’s claims, not an independent benchmark. Still, they tell you the intended positioning very clearly: this is an unofficial derivative optimized to reduce safety refusals.

Official model vs “uncensored” derivative

Dimension	Official `google/gemma-4-E4B-it`	`Gemma-4-E4B-Uncensored-HauhauCS-Aggressive`
Source	Official Google release	Third-party derivative on Hugging Face
Base architecture	Gemma 4 E4B instruction-tuned model	Same base family, explicitly described as based on `google/gemma-4-E4B-it`
Main goal	General-purpose helpful assistant with responsible-use framing	Reduce refusals and keep answering even when the official model might decline
Safety posture	Aligned with Gemma family safety docs and prohibited-use policy	Intentionally weakened refusal behavior
Response style	More likely to refuse, redirect, or soften certain requests	More likely to answer directly, including prompts the official model may block
Risk profile	Lower misuse risk by default, but still not risk-free	Higher misuse risk, higher chance of unsafe or non-compliant output
Predictability in products	Easier to justify in normal apps and enterprise environments	Harder to justify in public-facing, business, or policy-sensitive deployments
Compliance burden	Still requires application-level safeguards	Requires even stronger downstream safeguards because the model itself is less restrictive

The core difference is alignment, not raw capability

Many users mistakenly treat “uncensored” as if it means “smarter.” That is usually the wrong frame.

For a derivative like this, what changes first is:

how often the model refuses
how strongly it follows harmful or policy-sensitive instructions
how much filtering remains in its final answers

What does not automatically change:

the underlying Gemma 4 family architecture
context window class
multimodal support class
general reasoning ceiling

In other words, an uncensored derivative is often better described as a different behavioral tuning of the same model family, not a higher-tier model.

Why the official version behaves differently

Google’s official Gemma materials frame the family as being built for responsible AI development. The Gemma model card highlights misuse, harmful content, privacy, and bias risks, and Google’s Gemma Prohibited Use Policy explicitly forbids using Gemma or model derivatives to:

facilitate dangerous, illegal, or malicious activities
generate harmful or deceptive content
override or circumvent safety filters

So the official model is not just “more conservative” by accident. Its surrounding policy and intended deployment posture are deliberately different.

When the official model is the better choice

Use the official google/gemma-4-E4B-it path if you care about:

product deployment
enterprise or team use
lower legal and policy exposure
fewer obviously unsafe outputs
easier documentation and review

For most normal applications, this is the safer default.

When people choose the uncensored derivative

Users usually choose an uncensored derivative for:

local private experimentation
testing where the official model refuses too early
roleplay or open-ended creative prompting
comparing alignment behavior across variants

But this comes with a real trade-off: you are moving more safety responsibility from the model provider to yourself.

Practical conclusion

The difference between a so-called “jailbroken” Gemma 4 E4B and the ordinary official version is mostly this:

the official version is optimized for usable capability with guardrails
the uncensored derivative is optimized for fewer refusals with weaker guardrails

That does not automatically make the uncensored model stronger. It mainly makes it more permissive.

If your goal is stable, explainable, and lower-risk deployment, use the official model first. If your goal is local experimentation and you understand the compliance and safety trade-offs, then an uncensored derivative is a behavior variant worth testing separately, not a drop-in “better” replacement.

Sources

Hugging Face: HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
Hugging Face: google/gemma-4-E4B-it
Google AI for Developers: Gemma Prohibited Use Policy
Google AI for Developers: Gemma model card

Where Does llama-cli -hf Save Hugging Face Models by Default

Fri, 17 Apr 2026 14:48:04 +0800

If you use llama-cli to download and run a model directly from Hugging Face, for example:

llama-cli -hf unsloth/gemma-4-E4B-it-GGUF

this uses the Hugging Face download support built into llama.cpp. Recent llama.cpp builds store models downloaded with -hf in the standard Hugging Face Hub cache directory.

Default cache locations

The cache location used by llama-cli -hf is first controlled by the LLAMA_CACHE environment variable. If LLAMA_CACHE is not set, llama.cpp checks Hugging Face cache variables such as HF_HUB_CACHE, HUGGINGFACE_HUB_CACHE, and HF_HOME.

If none of those variables are set, common default paths are:

System	Default cache directory
Linux	`~/.cache/huggingface/hub`
macOS	`~/.cache/huggingface/hub`
Windows	`%USERPROFILE%\.cache\huggingface\hub`

On Windows, %USERPROFILE% usually expands to:

C:\Users\<username>

So the default cache directory is roughly:

C:\Users\<username>\.cache\huggingface\hub
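The lookup order described in this section can be sketched as a small function. This is an illustration of the precedence as the text states it, not llama.cpp's actual implementation: resolve_cache_dir is a hypothetical name, and the relative order among the Hugging Face variables simply follows the order in which they are listed above.

```python
from pathlib import Path
from typing import Mapping


def resolve_cache_dir(env: Mapping[str, str], home: Path) -> Path:
    """Pick the cache directory using the precedence described in the text."""
    # 1. LLAMA_CACHE wins when it is set.
    if env.get("LLAMA_CACHE"):
        return Path(env["LLAMA_CACHE"])
    # 2. Variables that point directly at the Hub cache.
    for name in ("HF_HUB_CACHE", "HUGGINGFACE_HUB_CACHE"):
        if env.get(name):
            return Path(env[name])
    # 3. HF_HOME is the Hugging Face root; the Hub cache is its "hub" subfolder.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    # 4. Common default under the user's home directory.
    return home / ".cache" / "huggingface" / "hub"
```

Passing os.environ and Path.home() shows where a download would be expected to land on the current machine under these assumptions.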

How to change the llama-cli cache directory

Set LLAMA_CACHE if you want to store the downloaded models on a specific disk or in a specific folder. You can also follow the Hugging Face convention and set HF_HOME; in that case, the Hub cache directory will be $HF_HOME/hub.

Temporary Windows CMD example:

set LLAMA_CACHE=D:\models\llama-cache
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF

Temporary PowerShell example:

$env:LLAMA_CACHE="D:\models\llama-cache"
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF

Temporary Linux / macOS example:

export LLAMA_CACHE=/data/models/llama-cache
llama-cli -hf unsloth/gemma-4-E4B-it-GGUF

Summary

llama-cli -hf ... uses the download logic from llama.cpp, but recent builds default to the Hugging Face Hub cache.
Linux / macOS default: ~/.cache/huggingface/hub
Windows default: %USERPROFILE%\.cache\huggingface\hub
To change the location, set LLAMA_CACHE, or set HF_HOME / HF_HUB_CACHE

How to Fix SSL Certificate Verification Failed When llama-cli Downloads from Hugging Face on Windows

Fri, 17 Apr 2026 14:20:29 +0800

If you run this command on Windows:

llama-cli -hf unsloth/gemma-4-E4B-it-GGUF

and see an error like this:

get_repo_commit: error: HTTPLIB failed: SSL server verification failed
error: failed to download model from Hugging Face

the problem is usually not CUDA or llama.cpp itself. More often, the program cannot correctly access the system certificate chain in the current environment, so HTTPS verification fails.

If the log shows that backend libraries such as ggml-rpc.dll and ggml-cpu-alderlake.dll loaded successfully before the error, the runtime environment is mostly fine and the issue is in the model download step.

The easiest workaround: download the model manually

If you just want to get it running quickly, downloading the model manually is usually the most stable option.

Open the matching Hugging Face repository page.
Download the required .gguf file from Files and versions.
After the download finishes, run it with the local file path:

llama-cli -m C:\Users\knightli\Downloads\gemma-4-e4b-it.gguf

This skips the -hf download step entirely, so llama-cli never has to perform the failing SSL verification. It is useful when you only want to verify that the model can run locally.

If you still want to use `-hf` automatic download

You can manually specify a certificate file path so the program can find a usable CA bundle in the current session.

cacert.pem can be obtained from the CA Extract page maintained by the curl project:

Page: https://curl.se/docs/caextract.html
Direct download: https://curl.se/ca/cacert.pem

If you download it in a browser, open the direct download link and save it as cacert.pem. You can also download it to a fixed directory with PowerShell:

New-Item -ItemType Directory -Force C:\certs
Invoke-WebRequest -Uri https://curl.se/ca/cacert.pem -OutFile C:\certs\cacert.pem

After the download finishes, set these variables in the current CMD session (in PowerShell, use the $env:NAME="value" form instead):

set SSL_CERT_FILE=C:\certs\cacert.pem
set CURL_CA_BUNDLE=C:\certs\cacert.pem

Then run the original command again:

llama-cli -hf unsloth/gemma-4-E4B-it-GGUF

If the issue really comes from the certificate chain, this usually fixes it directly.
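A common failure at this step is pointing the variables at a file that is missing or is not actually a PEM bundle (for example, a saved HTML error page). The sketch below checks the file and returns the two variables used above; ca_bundle_env is a hypothetical helper, and the PEM check is a simple marker search, not real certificate validation.

```python
from pathlib import Path
from typing import Dict

PEM_MARKER = "-----BEGIN CERTIFICATE-----"


def ca_bundle_env(bundle_path: str) -> Dict[str, str]:
    """Return the environment variables that point at a CA bundle file."""
    path = Path(bundle_path)
    if not path.is_file():
        raise FileNotFoundError(f"CA bundle not found: {path}")
    # Cheap plausibility check: a PEM bundle contains at least one
    # BEGIN CERTIFICATE marker. This does not validate the certificates.
    text = path.read_text(encoding="utf-8", errors="replace")
    if PEM_MARKER not in text:
        raise ValueError(f"{path} does not look like a PEM certificate bundle")
    return {"SSL_CERT_FILE": str(path), "CURL_CA_BUNDLE": str(path)}
```

The returned dictionary can be merged into a copy of os.environ and passed as env= to subprocess.run when launching llama-cli from a script.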

How to Get GGUF Models from Hugging Face with llama.cpp

Sun, 12 Apr 2026 09:31:38 +0800

llama.cpp can work directly with GGUF models hosted on Hugging Face, so you do not always need to download model files manually first.

If a model repository already provides GGUF files, you can use the -hf argument in the CLI, for example:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

By default, this downloads from Hugging Face.
If you use another service that exposes a Hugging Face compatible API, you can switch the download endpoint with the MODEL_ENDPOINT environment variable.

One important detail is that llama.cpp only works directly with the GGUF format.
If your model is in another format, you need to convert it first with the convert_*.py scripts provided in the repository.

Hugging Face also offers several online tools related to llama.cpp, including:

converting models to GGUF
quantizing weights to reduce size
converting LoRA adapters
editing GGUF metadata in the browser
hosting llama.cpp inference endpoints

If you only want the practical takeaway, start with repositories that already provide GGUF, then use llama-cli -hf <user>/<model>. In most cases, that is the simplest path.

Choosing Llama GGUF Quantization on Hugging Face: Practical Advice from Q8 to Q2

Sat, 11 Apr 2026 20:07:29 +0800

When selecting a Llama GGUF model on Hugging Face, you can think of quantization levels like resolution: lower levels need less VRAM/RAM, but quality drops gradually.

Understand 32, 16, and Q levels first

32 (F32, 32-bit float): closest to original/uncompressed quality, but hardware demand is extreme.
16 (F16 or BF16, 16-bit float): still very close to original quality, around half the size of 32.
Q8: common entry point for quantized models (Q8_0 or Q8).
Q6, Q5, Q4, Q3, Q2: lower number means lower resource use and higher quality loss risk.

What `K_M` / `K_S` means

K_M and K_S are mixed quantization variants (the M and S suffixes stand for medium and small):

most weights stay at the target quantization level
important parts keep higher precision

So at the same level, Qx_K_M or Qx_K_S is usually slightly better than plain Qx.

Practical picking strategy

If hardware allows, start with Q8.
If memory is tight, step down through Q6 / Q5 / Q4.
Try not to go below Q4; Q4_K_M is a common lower bound.
Below Q4, quality degradation becomes increasingly visible.
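The step-down strategy can be sketched as a loop over nominal bit widths. This is a rough estimate under stated assumptions, not a sizing tool: pick_quant is a hypothetical helper, the bits-per-weight values are the nominal numbers in the level names, and real GGUF files are somewhat larger because of quantization scales and metadata. Runtime memory also needs room for the context cache, so leave headroom in the budget.

```python
from typing import Optional

# Nominal bits per weight for each level (assumption: real files are larger).
NOMINAL_BITS = [("Q8", 8), ("Q6", 6), ("Q5", 5), ("Q4", 4)]


def pick_quant(params_billion: float, budget_gb: float) -> Optional[str]:
    """Return the highest level from Q8 down to Q4 that fits the budget."""
    for label, bits in NOMINAL_BITS:
        # params_billion * 1e9 weights * bits / 8 bytes ~= this many GB.
        estimated_gb = params_billion * bits / 8
        if estimated_gb <= budget_gb:
            return label
    # Nothing at Q4 or above fits; the text advises against going lower.
    return None
```

For an 8B-parameter model, a 10 GB budget selects Q8, while a 5.5 GB budget steps down to Q5.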

Quality order (best to worst)

32
16

– Above this point, quality is effectively the same, but hardware requirements are extreme –

Q8
Q6_K_M
Q6_K_S
Q6
Q5_K_M
Q5_K_S
Q5

– This is the typical sweet spot –

Q4_K_M
Q4_K_S
Q4

– Below this point, quality loss becomes visible –

Q3_K_M
Q3_K_S
Q3
Q2_K_M
Q2_K_S
Q2

If you want one short rule: start with Q8 or Q6_K_M, then move down to Q5 or Q4_K_M only when needed.

How to Download a GGUF Model from Hugging Face and Import It into Ollama

Thu, 09 Apr 2026 11:00:07 +0800

If a model is not available in the official Ollama library, or if you want to use a specific GGUF file from Hugging Face, you can download it manually and then import it into Ollama.

Step 1: Download the GGUF file from Hugging Face

First, find the target model’s GGUF file on Hugging Face. You will usually see multiple quantized versions, such as:

Q4_K_M
Q5_K_M
Q8_0

Which version you choose depends on your VRAM, RAM, and your tradeoff between speed and quality. After downloading, place the .gguf file in a fixed directory so you can reference it from the Modelfile.

Step 2: Write the Modelfile

Create a Modelfile in the same directory as the model file. The most basic version looks like this:

FROM ./model.gguf

If the filename is different, replace it with the actual filename, for example:

FROM ./gemma-3-12b-it-q4_k_m.gguf

If your goal is just to get it running, this single FROM line is usually enough.
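If you import models often, writing the one-line Modelfile can be scripted. The sketch below is a hypothetical convenience helper (write_minimal_modelfile is not an Ollama API); it writes a Modelfile next to the .gguf file with a single relative FROM line.

```python
from pathlib import Path


def write_minimal_modelfile(gguf_path: str) -> Path:
    """Create a one-line Modelfile in the same directory as the GGUF file."""
    gguf = Path(gguf_path)
    if gguf.suffix.lower() != ".gguf":
        raise ValueError(f"expected a .gguf file, got: {gguf.name}")
    if not gguf.is_file():
        raise FileNotFoundError(f"GGUF file not found: {gguf}")
    modelfile = gguf.parent / "Modelfile"
    # The relative path works when ollama create is run from this directory.
    modelfile.write_text(f"FROM ./{gguf.name}\n", encoding="utf-8")
    return modelfile
```

The relative FROM path keeps the Modelfile valid as long as it stays in the same folder as the model file.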

Step 3: Import it into Ollama

Then run:

ollama create myModelName -f Modelfile

myModelName is the local model name you want to use inside Ollama
-f Modelfile tells Ollama to create the model from that file

Once the creation succeeds, the GGUF file becomes a local model that you can call directly.

Step 4: Run the model

After creation, run:

ollama run myModelName

From that point on, it works much like a model pulled with ollama pull.

How to inspect an existing model’s Modelfile

If you are not sure how to write a Modelfile, you can inspect the configuration of an existing model directly:

ollama show --modelfile llama3.2

This command prints the Modelfile for llama3.2, which is useful as a reference for:

How FROM should be written
How the template and system prompt are structured
How parameters are declared

When this approach makes sense

This manual Hugging Face import flow is useful when:

The model you want is not available in Ollama’s official library
You want a specific quantized variant
You have already downloaded the GGUF file manually
You want finer control over how the model is packaged

If Ollama already provides an official version, using pull is usually simpler. But when you need a specific quantization or a custom wrapper, GGUF + Modelfile gives you more flexibility.

Common notes

The path after FROM must match the actual location of the .gguf file.
If the filename contains spaces or special characters, it is better to rename it first.
Different GGUF quantization levels can greatly affect memory use and speed, so successful import does not guarantee smooth runtime performance.
If the model is a chat model, you may still need to adjust the prompt template later for better results.

Conclusion

Downloading a GGUF file from Hugging Face and importing it into Ollama is not complicated. Prepare the model file, write a minimal Modelfile, then run ollama create, and you can bring a third-party GGUF model into your Ollama workflow.
