<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Hugging Face on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/hugging-face/</link>
        <description>Recent content in Hugging Face on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Mon, 25 May 2026 07:53:43 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/hugging-face/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>LongCat-Video-Avatar-1.5: Meituan&#39;s Open Audio-Driven Avatar Video Model</title>
        <link>https://knightli.com/en/2026/05/25/longcat-video-avatar-1-5-audio-driven-avatar-video/</link>
        <pubDate>Mon, 25 May 2026 07:53:43 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/25/longcat-video-avatar-1-5-audio-driven-avatar-video/</guid>
        <description>&lt;p&gt;&lt;code&gt;LongCat-Video-Avatar-1.5&lt;/code&gt; is an audio-driven avatar video generation model released by Meituan&amp;rsquo;s LongCat team.&lt;/p&gt;
&lt;p&gt;Project: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It is not a general text-to-video model. Given speech and character conditions, it is designed to generate a video in which the person speaks, moves steadily, and keeps a consistent identity. According to the model card, it supports Audio-Text-to-Video, Audio-Text-Image-to-Video, and Video Continuation, with both single-stream and multi-stream audio inputs.&lt;/p&gt;
&lt;p&gt;At the time of writing, the Hugging Face page lists the model under the MIT License, with tags such as &lt;code&gt;audio-text-to-video&lt;/code&gt;, &lt;code&gt;audio-image-text-to-video&lt;/code&gt;, &lt;code&gt;audio-driven-video-continuation&lt;/code&gt;, &lt;code&gt;avatar&lt;/code&gt;, and &lt;code&gt;video-generation&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;what-changed-in-15&#34;&gt;What changed in 1.5
&lt;/h2&gt;&lt;p&gt;The official model card describes &lt;code&gt;LongCat-Video-Avatar 1.5&lt;/code&gt; as a more production-oriented open-source framework focused on improving stability for audio-driven human video generation.&lt;/p&gt;
&lt;p&gt;Several changes stand out.&lt;/p&gt;
&lt;p&gt;First, the audio encoder has moved from Wav2Vec2 to Whisper-Large. The official description says this brings smoother and more natural lip dynamics. In practice, scenarios that care about lip sync should prefer &lt;code&gt;--model_type avatar-v1.5&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Second, it emphasizes long-video stability and identity consistency. Avatar videos often fail in two ways: the mouth does not match the audio in short clips, or the face, body, clothes, and motion drift in longer clips. One selling point of LongCat-Video-Avatar-1.5 is that it looks at lip sync, full-body temporal stability, and identity consistency together.&lt;/p&gt;
&lt;p&gt;Third, it is not limited to realistic talking-head broadcasting. The model card says it generalizes to anime, animals, multi-person interactions, object handling, and more complex conditions. That makes it relevant not only for news-style digital humans, but also short drama, singing, e-commerce narration, animated characters, and animal characters.&lt;/p&gt;
&lt;p&gt;Fourth, it provides 8-step inference. The model card mentions DMD2-based step distillation, reducing inference to 8 NFE (number of function evaluations) to balance serving cost and visual quality. This matters for video models because generation is expensive, so fewer inference steps directly reduce the cost of deployment.&lt;/p&gt;
&lt;h2 id=&#34;supported-tasks&#34;&gt;Supported tasks
&lt;/h2&gt;&lt;p&gt;Based on the model card and sample commands, the model mainly covers three task groups.&lt;/p&gt;
&lt;p&gt;The first is single-person animation.&lt;/p&gt;
&lt;p&gt;It supports video generation from audio plus text, and video generation from audio plus an image. A typical use case is giving a voice clip to make a character speak, perform, or present.&lt;/p&gt;
&lt;p&gt;The second is video continuation.&lt;/p&gt;
&lt;p&gt;The examples use parameters such as &lt;code&gt;--num_segments=5&lt;/code&gt;, &lt;code&gt;--ref_img_index=10&lt;/code&gt;, and &lt;code&gt;--mask_frame_range=3&lt;/code&gt; to continue generating longer clips under existing character conditions. This is useful for long narration, courses, singing, and continuous performance.&lt;/p&gt;
&lt;p&gt;The third is multi-person animation.&lt;/p&gt;
&lt;p&gt;Multi-person mode uses &lt;code&gt;run_demo_avatar_multi_audio_to_video.py&lt;/code&gt; and supports multiple audio streams. The model card also explains two dual-audio modes, selected by &lt;code&gt;audio_type&lt;/code&gt;. With &lt;code&gt;para&lt;/code&gt; (merge mode), the two clips are merged and must be of equal length. With &lt;code&gt;add&lt;/code&gt; (concatenation mode), the two clips are joined one after the other and the gaps are padded with silence.&lt;/p&gt;
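&lt;p&gt;The difference between the two modes is easier to see as a timeline. The sketch below is only an illustration of the description above, not code from the LongCat-Video repository: &lt;code&gt;plan_dual_audio&lt;/code&gt; is a hypothetical helper, durations are whole seconds, and it assumes that in &lt;code&gt;add&lt;/code&gt; mode the first speaker talks first while the second speaker&amp;rsquo;s track is silent, and then the reverse.&lt;/p&gt;

```bash
# Hypothetical helper: prints the total length and the time range in which
# each speaker is audible. Durations are whole seconds for simplicity.
plan_dual_audio() {
  local mode="$1" len1="$2" len2="$3" total
  case "$mode" in
    para)
      # Merge mode: both clips play at the same time, so they must match.
      if [ "$len1" -ne "$len2" ]; then
        echo "error: para mode needs two equal-length clips"
        return 1
      fi
      echo "total=$len1 speaker1=0-$len1 speaker2=0-$len1"
      ;;
    add)
      # Concatenation mode: clip 2 starts where clip 1 ends; the rest of
      # each track is assumed to be filled with silence.
      total=$((len1 + len2))
      echo "total=$total speaker1=0-$len1 speaker2=$len1-$total"
      ;;
    *)
      echo "error: unknown audio_type $mode"
      return 1
      ;;
  esac
}

plan_dual_audio add 4 6
```

&lt;p&gt;For a 4-second and a 6-second clip in &lt;code&gt;add&lt;/code&gt; mode, this prints &lt;code&gt;total=10 speaker1=0-4 speaker2=4-10&lt;/code&gt;.&lt;/p&gt;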
&lt;h2 id=&#34;installation-and-model-download&#34;&gt;Installation and model download
&lt;/h2&gt;&lt;p&gt;The official flow starts by cloning the LongCat-Video repository:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; LongCat-Video
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then create a Python 3.10 environment and install PyTorch according to your CUDA version. The CUDA 12.4 example in the model card is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda create -n longcat-video &lt;span class=&#34;nv&#34;&gt;python&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;3.10
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda activate longcat-video
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install &lt;span class=&#34;nv&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;2.6.0+cu124 &lt;span class=&#34;nv&#34;&gt;torchvision&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;0.21.0+cu124 &lt;span class=&#34;nv&#34;&gt;torchaudio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;2.6.0 --index-url https://download.pytorch.org/whl/cu124
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You also need &lt;code&gt;flash_attn==2.7.4.post1&lt;/code&gt;, the project&amp;rsquo;s requirements, &lt;code&gt;librosa&lt;/code&gt;, &lt;code&gt;ffmpeg&lt;/code&gt;, and the packages listed in &lt;code&gt;requirements_avatar.txt&lt;/code&gt;. The model card says FlashAttention-2 is enabled by default, and that the config can be changed to use FlashAttention-3 or xformers instead.&lt;/p&gt;
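&lt;p&gt;Before installing anything heavy, it can help to confirm that the command-line tools used in this walkthrough are on your &lt;code&gt;PATH&lt;/code&gt;. The following is a minimal sketch: &lt;code&gt;missing_tools&lt;/code&gt; is a hypothetical helper, and the list of tools is simply the set of commands that appear in this article.&lt;/p&gt;

```bash
# Hypothetical helper: print each given command that is not found on PATH.
# Prints nothing when every command is available.
missing_tools() {
  local tool
  for tool in "$@"; do
    if [ -z "$(command -v "$tool")" ]; then
      echo "$tool"
    fi
  done
}

# Commands used elsewhere in this article.
missing_tools git conda ffmpeg torchrun
```

&lt;p&gt;This only checks executables. Python packages such as &lt;code&gt;librosa&lt;/code&gt; still need to be verified inside the activated environment.&lt;/p&gt;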
&lt;p&gt;Download weights with &lt;code&gt;huggingface-cli&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install &lt;span class=&#34;s2&#34;&gt;&amp;#34;huggingface_hub[cli]&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;huggingface-cli download meituan-longcat/LongCat-Video-Avatar-1.5 --local-dir ./weights/LongCat-Video-Avatar-1.5
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Note that avatar inference depends on two weight directories: LongCat-Video as the base video generation model, and LongCat-Video-Avatar-1.5 as the avatar model.&lt;/p&gt;
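&lt;p&gt;Since a missing directory tends to surface only after the model starts loading, a quick check up front can save time. This is a sketch, assuming the &lt;code&gt;./weights&lt;/code&gt; layout from the download commands above; &lt;code&gt;check_weights&lt;/code&gt; is a hypothetical helper and only tests that the directories exist, not that the downloads are complete.&lt;/p&gt;

```bash
# Hypothetical helper: report whether both weight directories exist under
# the given root (default ./weights). Returns non-zero if either is missing.
check_weights() {
  local root="${1:-./weights}" name status=0
  for name in LongCat-Video LongCat-Video-Avatar-1.5; do
    if [ -d "$root/$name" ]; then
      echo "found: $root/$name"
    else
      echo "missing: $root/$name"
      status=1
    fi
  done
  return "$status"
}

check_weights ./weights || echo "download the missing weights before running the demos"
```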
&lt;h2 id=&#34;quick-inference-examples&#34;&gt;Quick inference examples
&lt;/h2&gt;&lt;p&gt;Single-person Audio-Text-to-Video:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;torchrun --nproc_per_node&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; run_demo_avatar_single_audio_to_video.py --context_parallel_size&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --checkpoint_dir&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;./weights/LongCat-Video-Avatar-1.5 --stage_1&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;at2v --input_json&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;assets/avatar/single_example_1.json --use_distill --model_type avatar-v1.5 --use_int8
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Single-person Audio-Image-to-Video:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;torchrun --nproc_per_node&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; run_demo_avatar_single_audio_to_video.py --context_parallel_size&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --checkpoint_dir&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;./weights/LongCat-Video-Avatar-1.5  --stage_1&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;ai2v --input_json&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;assets/avatar/single_example_1.json --use_distill --model_type avatar-v1.5 --use_int8
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Multi-person Audio-Image-to-Video:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;torchrun --nproc_per_node&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; run_demo_avatar_multi_audio_to_video.py --context_parallel_size&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --checkpoint_dir&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;./weights/LongCat-Video-Avatar-1.5 --input_json&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;assets/avatar/multi_example_1.json --use_distill --model_type avatar-v1.5 --use_int8
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;These commands share a few choices: they all use &lt;code&gt;--model_type avatar-v1.5&lt;/code&gt;, include &lt;code&gt;--use_distill&lt;/code&gt;, and enable &lt;code&gt;--use_int8&lt;/code&gt; in the examples. The model card states that &lt;code&gt;--use_distill&lt;/code&gt; is required when using &lt;code&gt;avatar-v1.5&lt;/code&gt;; &lt;code&gt;--use_int8&lt;/code&gt; loads the INT8-quantized DiT model to reduce VRAM usage and is only supported with &lt;code&gt;avatar-v1.5&lt;/code&gt;.&lt;/p&gt;
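&lt;p&gt;These two constraints are easy to get wrong when adapting the commands, so a wrapper script might validate them before calling &lt;code&gt;torchrun&lt;/code&gt;. The sketch below encodes only the two rules stated above; &lt;code&gt;check_avatar_flags&lt;/code&gt; is a hypothetical helper, and the demo scripts may perform their own checks.&lt;/p&gt;

```bash
# Hypothetical pre-launch check. First argument is the model type, the rest
# are the flags you plan to pass to the demo script.
check_avatar_flags() {
  local model_type="$1"
  shift
  local has_distill=no has_int8=no flag
  for flag in "$@"; do
    case "$flag" in
      --use_distill) has_distill=yes ;;
      --use_int8) has_int8=yes ;;
    esac
  done
  if [ "$model_type" = "avatar-v1.5" ]; then
    # Rule 1: avatar-v1.5 must be run with --use_distill.
    if [ "$has_distill" = no ]; then
      echo "error: avatar-v1.5 requires --use_distill"
      return 1
    fi
  elif [ "$has_int8" = yes ]; then
    # Rule 2: INT8 weights are only supported with avatar-v1.5.
    echo "error: --use_int8 is only supported with avatar-v1.5"
    return 1
  fi
  echo "ok"
}

check_avatar_flags avatar-v1.5 --use_distill --use_int8
```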
&lt;h2 id=&#34;tuning-parameters&#34;&gt;Tuning parameters
&lt;/h2&gt;&lt;p&gt;The model card gives several practical tips.&lt;/p&gt;
&lt;p&gt;If lip sync is not good enough, increase audio CFG. The recommended range is 3 to 5, and higher values usually help synchronization.&lt;/p&gt;
&lt;p&gt;Prompts should not be too short. Longer, more specific descriptions usually improve character consistency and naturalness. Character appearance, action, scene, clothing, and expression are all useful details.&lt;/p&gt;
&lt;p&gt;If repeated actions appear, adjust &lt;code&gt;--ref_img_index&lt;/code&gt; and &lt;code&gt;--mask_frame_range&lt;/code&gt;. The model card says &lt;code&gt;--ref_img_index&lt;/code&gt; between 0 and 24 is better for consistency, while setting it to 30 can help reduce repeated actions. Increasing &lt;code&gt;--mask_frame_range&lt;/code&gt; may also help, but overly large values can introduce artifacts.&lt;/p&gt;
&lt;p&gt;For resolution, the model supports 480P and 720P through &lt;code&gt;--resolution&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;good-use-cases&#34;&gt;Good use cases
&lt;/h2&gt;&lt;p&gt;The official previews cover broadcasting, acting, singing, e-commerce marketing, multi-person conversation, animation, and animal characters.&lt;/p&gt;
&lt;p&gt;In practice, it fits these directions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;News broadcasting, knowledge explanation, and course narration.&lt;/li&gt;
&lt;li&gt;E-commerce product introduction and marketing shorts.&lt;/li&gt;
&lt;li&gt;Virtual streamers, virtual-character short drama, and singing performance.&lt;/li&gt;
&lt;li&gt;Audio-driven animation for anime or animal characters.&lt;/li&gt;
&lt;li&gt;Multi-person conversational digital human videos.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The most interesting point is that it handles “lip sync” and “long-video stability” in the same framework. Many avatar models look fine in short clips, but drift in identity, repeat motions, or lose body stability once generation is extended. LongCat-Video-Avatar-1.5 explicitly treats these as optimization targets.&lt;/p&gt;
&lt;h2 id=&#34;things-to-watch&#34;&gt;Things to watch
&lt;/h2&gt;&lt;p&gt;First, this is not a hosted model directly available through Hugging Face Inference Providers. The page says it is not currently deployed by an Inference Provider, so real usage requires preparing the environment, downloading weights, and running the LongCat-Video code yourself.&lt;/p&gt;
&lt;p&gt;Second, local deployment is not lightweight. The examples use &lt;code&gt;torchrun --nproc_per_node=2&lt;/code&gt; and &lt;code&gt;context_parallel_size=2&lt;/code&gt;, and depend on PyTorch, FlashAttention, ffmpeg, librosa, and multiple model weights. Even with INT8 quantization, it is better suited to users with a stronger GPU environment.&lt;/p&gt;
&lt;p&gt;Third, avatar video involves likeness, voice, privacy, and content safety. The model card also reminds developers to assess accuracy, safety, and fairness themselves, and to comply with laws and regulations around data protection, privacy, and content safety. When generating real human likenesses or commercial videos, authorization and compliance matter more than visual quality.&lt;/p&gt;
&lt;p&gt;Fourth, do not treat the generic Hugging Face “Diffusers/Transformers usage snippets” on the model card as the full inference path for this project. Real avatar inference should follow the LongCat-Video repository and the &lt;code&gt;run_demo_avatar_*&lt;/code&gt; examples in the model card.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;LongCat-Video-Avatar-1.5 is a notable open-source avatar video model. It is not just making a face talk; it combines audio driving, character consistency, long-video stability, multi-person audio, and distilled inference in one framework.&lt;/p&gt;
&lt;p&gt;If you care about virtual streamers, e-commerce narration, course videos, animated characters, or multi-person dialogue videos, it is worth testing. But it is closer to a model for research and engineering teams to deploy and tune than an out-of-the-box web tool. Real deployment needs compute, asset authorization, prompt tuning, and content compliance workflows.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;LongCat-Video-Avatar-1.5 Hugging Face: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;LongCat-Video GitHub: &lt;a class=&#34;link&#34; href=&#34;https://github.com/meituan-longcat/LongCat-Video&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/meituan-longcat/LongCat-Video&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;LongCat-Video-Avatar-1.5 Technical Report: &lt;a class=&#34;link&#34; href=&#34;https://github.com/meituan-longcat/LongCat-Video&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/meituan-longcat/LongCat-Video&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Gemma 4 E4B Uncensored vs Official: What Actually Changes</title>
        <link>https://knightli.com/en/2026/04/18/gemma-4-e4b-uncensored-vs-official/</link>
        <pubDate>Sat, 18 Apr 2026 10:20:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/18/gemma-4-e4b-uncensored-vs-official/</guid>
        <description>&lt;p&gt;If you see a model like &lt;code&gt;HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive&lt;/code&gt;, the most important point is this: it is &lt;strong&gt;not a new Google base model&lt;/strong&gt;. It is a derivative release built on top of the official &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt;, but with alignment behavior intentionally pushed toward fewer refusals.&lt;/p&gt;
&lt;p&gt;That means the real difference is usually &lt;strong&gt;behavioral policy and response style&lt;/strong&gt;, not a brand-new architecture.&lt;/p&gt;
&lt;h2 id=&#34;what-the-derivative-model-explicitly-claims&#34;&gt;What the derivative model explicitly claims
&lt;/h2&gt;&lt;p&gt;According to its Hugging Face model card, the HauhauCS release says:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it is based on &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;it makes &amp;ldquo;no changes to datasets or capabilities&amp;rdquo;&lt;/li&gt;
&lt;li&gt;it is &amp;ldquo;just without the refusals&amp;rdquo;&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;Aggressive&lt;/code&gt; variant is &amp;ldquo;fully unlocked and won&amp;rsquo;t refuse prompts&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those are the creator&amp;rsquo;s claims, not an independent benchmark. Still, they tell you the intended positioning very clearly: this is an unofficial derivative optimized to reduce safety refusals.&lt;/p&gt;
&lt;h2 id=&#34;official-model-vs-uncensored-derivative&#34;&gt;Official model vs &amp;ldquo;uncensored&amp;rdquo; derivative
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Dimension&lt;/th&gt;
          &lt;th&gt;Official &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt;&lt;/th&gt;
          &lt;th&gt;&lt;code&gt;Gemma-4-E4B-Uncensored-HauhauCS-Aggressive&lt;/code&gt;&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Source&lt;/td&gt;
          &lt;td&gt;Official Google release&lt;/td&gt;
          &lt;td&gt;Third-party derivative on Hugging Face&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Base architecture&lt;/td&gt;
          &lt;td&gt;Gemma 4 E4B instruction-tuned model&lt;/td&gt;
          &lt;td&gt;Same base family, explicitly described as based on &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Main goal&lt;/td&gt;
          &lt;td&gt;General-purpose helpful assistant with responsible-use framing&lt;/td&gt;
          &lt;td&gt;Reduce refusals and keep answering even when the official model might decline&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Safety posture&lt;/td&gt;
          &lt;td&gt;Aligned with Gemma family safety docs and prohibited-use policy&lt;/td&gt;
          &lt;td&gt;Intentionally weakened refusal behavior&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Response style&lt;/td&gt;
          &lt;td&gt;More likely to refuse, redirect, or soften certain requests&lt;/td&gt;
          &lt;td&gt;More likely to answer directly, including prompts the official model may block&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Risk profile&lt;/td&gt;
          &lt;td&gt;Lower misuse risk by default, but still not risk-free&lt;/td&gt;
          &lt;td&gt;Higher misuse risk, higher chance of unsafe or non-compliant output&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Predictability in products&lt;/td&gt;
          &lt;td&gt;Easier to justify in normal apps and enterprise environments&lt;/td&gt;
          &lt;td&gt;Harder to justify in public-facing, business, or policy-sensitive deployments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Compliance burden&lt;/td&gt;
          &lt;td&gt;Still requires application-level safeguards&lt;/td&gt;
          &lt;td&gt;Requires even stronger downstream safeguards because the model itself is less restrictive&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;the-core-difference-is-alignment-not-raw-capability&#34;&gt;The core difference is alignment, not raw capability
&lt;/h2&gt;&lt;p&gt;Many users mistakenly treat &amp;ldquo;uncensored&amp;rdquo; as if it means &amp;ldquo;smarter.&amp;rdquo; That is usually the wrong frame.&lt;/p&gt;
&lt;p&gt;For a derivative like this, what changes first is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how often the model refuses&lt;/li&gt;
&lt;li&gt;how strongly it follows harmful or policy-sensitive instructions&lt;/li&gt;
&lt;li&gt;how much filtering remains in its final answers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What does &lt;strong&gt;not&lt;/strong&gt; automatically change:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the underlying Gemma 4 family architecture&lt;/li&gt;
&lt;li&gt;context window class&lt;/li&gt;
&lt;li&gt;multimodal support class&lt;/li&gt;
&lt;li&gt;general reasoning ceiling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, an uncensored derivative is often better described as a &lt;strong&gt;different behavioral tuning&lt;/strong&gt; of the same model family, not a higher-tier model.&lt;/p&gt;
&lt;h2 id=&#34;why-the-official-version-behaves-differently&#34;&gt;Why the official version behaves differently
&lt;/h2&gt;&lt;p&gt;Google&amp;rsquo;s official Gemma materials frame the family as being built for responsible AI development. The Gemma model card highlights misuse, harmful content, privacy, and bias risks, and Google&amp;rsquo;s Gemma Prohibited Use Policy explicitly forbids using Gemma or model derivatives to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;facilitate dangerous, illegal, or malicious activities&lt;/li&gt;
&lt;li&gt;generate harmful or deceptive content&lt;/li&gt;
&lt;li&gt;override or circumvent safety filters&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the official model is not just &amp;ldquo;more conservative&amp;rdquo; by accident. Its surrounding policy and intended deployment posture are deliberately different.&lt;/p&gt;
&lt;h2 id=&#34;when-the-official-model-is-the-better-choice&#34;&gt;When the official model is the better choice
&lt;/h2&gt;&lt;p&gt;Use the official &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt; path if you care about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;product deployment&lt;/li&gt;
&lt;li&gt;enterprise or team use&lt;/li&gt;
&lt;li&gt;lower legal and policy exposure&lt;/li&gt;
&lt;li&gt;fewer obviously unsafe outputs&lt;/li&gt;
&lt;li&gt;easier documentation and review&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For most normal applications, this is the safer default.&lt;/p&gt;
&lt;h2 id=&#34;when-people-choose-the-uncensored-derivative&#34;&gt;When people choose the uncensored derivative
&lt;/h2&gt;&lt;p&gt;Users usually choose an uncensored derivative for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;local private experimentation&lt;/li&gt;
&lt;li&gt;testing where the official model refuses too early&lt;/li&gt;
&lt;li&gt;roleplay or open-ended creative prompting&lt;/li&gt;
&lt;li&gt;comparing alignment behavior across variants&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But this comes with a real trade-off: you are moving more safety responsibility from the model provider to yourself.&lt;/p&gt;
&lt;h2 id=&#34;practical-conclusion&#34;&gt;Practical conclusion
&lt;/h2&gt;&lt;p&gt;The difference between a so-called &amp;ldquo;jailbroken&amp;rdquo; Gemma 4 E4B and the ordinary official version is mostly this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the official version is optimized for usable capability &lt;strong&gt;with guardrails&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;the uncensored derivative is optimized for fewer refusals &lt;strong&gt;with weaker guardrails&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That does &lt;strong&gt;not&lt;/strong&gt; automatically make the uncensored model stronger. It mainly makes it more permissive.&lt;/p&gt;
&lt;p&gt;If your goal is stable, explainable, and lower-risk deployment, use the official model first. If your goal is local experimentation and you understand the compliance and safety trade-offs, then an uncensored derivative is a behavior variant worth testing separately, not a drop-in &amp;ldquo;better&amp;rdquo; replacement.&lt;/p&gt;
&lt;h2 id=&#34;sources&#34;&gt;Sources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Hugging Face: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hugging Face: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-4-E4B-it&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/gemma-4-E4B-it&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google AI for Developers: &lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/prohibited_use_policy&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma Prohibited Use Policy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google AI for Developers: &lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/docs/core/model_card&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma model card&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Where Does llama-cli -hf Save Hugging Face Models by Default</title>
        <link>https://knightli.com/en/2026/04/17/llama-cli-hf-download-default-cache-path/</link>
        <pubDate>Fri, 17 Apr 2026 14:48:04 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/17/llama-cli-hf-download-default-cache-path/</guid>
        <description>&lt;p&gt;If you use &lt;code&gt;llama-cli&lt;/code&gt; to download and run a model directly from Hugging Face, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf unsloth/gemma-4-E4B-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;this uses the Hugging Face download support built into &lt;code&gt;llama.cpp&lt;/code&gt;. Recent &lt;code&gt;llama.cpp&lt;/code&gt; builds store models downloaded with &lt;code&gt;-hf&lt;/code&gt; in the standard Hugging Face Hub cache directory.&lt;/p&gt;
&lt;h2 id=&#34;default-cache-locations&#34;&gt;Default cache locations
&lt;/h2&gt;&lt;p&gt;The cache location used by &lt;code&gt;llama-cli -hf&lt;/code&gt; is first controlled by the &lt;code&gt;LLAMA_CACHE&lt;/code&gt; environment variable. If &lt;code&gt;LLAMA_CACHE&lt;/code&gt; is not set, &lt;code&gt;llama.cpp&lt;/code&gt; checks Hugging Face cache variables such as &lt;code&gt;HF_HUB_CACHE&lt;/code&gt;, &lt;code&gt;HUGGINGFACE_HUB_CACHE&lt;/code&gt;, and &lt;code&gt;HF_HOME&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If none of those variables are set, common default paths are:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;System&lt;/th&gt;
          &lt;th&gt;Default cache directory&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Linux&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;~/.cache/huggingface/hub&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;macOS&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;~/.cache/huggingface/hub&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Windows&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;%USERPROFILE%\.cache\huggingface\hub&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
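&lt;p&gt;On Linux and macOS, the lookup order described above can be sketched as a small shell function. This mirrors the prose only and is not taken from the &lt;code&gt;llama.cpp&lt;/code&gt; source: &lt;code&gt;resolve_cache_dir&lt;/code&gt; is a hypothetical helper, and the precedence among the three Hugging Face variables is assumed to follow the order in which they are listed.&lt;/p&gt;

```bash
# Hypothetical helper: print the cache directory that the lookup order
# described above would select for the current environment.
resolve_cache_dir() {
  if [ -n "${LLAMA_CACHE:-}" ]; then
    printf '%s\n' "$LLAMA_CACHE"
  elif [ -n "${HF_HUB_CACHE:-}" ]; then
    printf '%s\n' "$HF_HUB_CACHE"
  elif [ -n "${HUGGINGFACE_HUB_CACHE:-}" ]; then
    printf '%s\n' "$HUGGINGFACE_HUB_CACHE"
  elif [ -n "${HF_HOME:-}" ]; then
    # HF_HOME points at the Hugging Face root; the Hub cache lives in hub/.
    printf '%s\n' "$HF_HOME/hub"
  else
    printf '%s\n' "$HOME/.cache/huggingface/hub"
  fi
}

resolve_cache_dir
```

&lt;p&gt;Running it in the same shell you use for &lt;code&gt;llama-cli&lt;/code&gt; is a quick way to see which variable, if any, is steering the download location.&lt;/p&gt;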
&lt;p&gt;On Windows, &lt;code&gt;%USERPROFILE%&lt;/code&gt; usually expands to:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;C:\Users\YourUserName
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;So the default cache directory is roughly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;C:\Users\&amp;lt;username&amp;gt;\.cache\huggingface\hub
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;how-to-change-the-llama-cli-cache-directory&#34;&gt;How to change the llama-cli cache directory
&lt;/h2&gt;&lt;p&gt;Set &lt;code&gt;LLAMA_CACHE&lt;/code&gt; if you want to store the downloaded models on a specific disk or in a specific folder. You can also follow the Hugging Face convention and set &lt;code&gt;HF_HOME&lt;/code&gt;; in that case, the Hub cache directory will be &lt;code&gt;$HF_HOME/hub&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Temporary Windows CMD example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;set LLAMA_CACHE=D:\models\llama-cache
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf unsloth/gemma-4-E4B-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Temporary PowerShell example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;$env:LLAMA_CACHE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;D:\models\llama-cache&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;llama-cli&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-hf&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;unsloth&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gemma&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-E4B-it-GGUF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Temporary Linux / macOS example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;LLAMA_CACHE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;/data/models/llama-cache
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf unsloth/gemma-4-E4B-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llama-cli -hf ...&lt;/code&gt; uses the download logic from &lt;code&gt;llama.cpp&lt;/code&gt;, but recent builds default to the Hugging Face Hub cache.&lt;/li&gt;
&lt;li&gt;Linux / macOS default: &lt;code&gt;~/.cache/huggingface/hub&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Windows default: &lt;code&gt;%USERPROFILE%\.cache\huggingface\hub&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;To change the location, set &lt;code&gt;LLAMA_CACHE&lt;/code&gt;, or set &lt;code&gt;HF_HOME&lt;/code&gt; / &lt;code&gt;HF_HUB_CACHE&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Fix SSL Certificate Verification Failed When llama-cli Downloads from Hugging Face on Windows</title>
        <link>https://knightli.com/en/2026/04/17/llama-cli-hugging-face-ssl-certificate-failed-on-windows/</link>
        <pubDate>Fri, 17 Apr 2026 14:20:29 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/17/llama-cli-hugging-face-ssl-certificate-failed-on-windows/</guid>
        <description>&lt;p&gt;If you run this command on Windows:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf unsloth/gemma-4-E4B-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;and see an error like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;get_repo_commit: error: HTTPLIB failed: SSL server verification failed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;error: failed to download model from Hugging Face
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;the problem is usually not CUDA or &lt;code&gt;llama.cpp&lt;/code&gt; itself. More often, the program cannot correctly access the system certificate chain in the current environment, so HTTPS verification fails.&lt;/p&gt;
&lt;p&gt;From the log, &lt;code&gt;ggml-rpc.dll&lt;/code&gt; and &lt;code&gt;ggml-cpu-alderlake.dll&lt;/code&gt; were loaded successfully, which means the runtime environment is mostly fine. The issue is mainly in the model download step.&lt;/p&gt;
&lt;h2 id=&#34;the-easiest-workaround-download-the-model-manually&#34;&gt;The easiest workaround: download the model manually
&lt;/h2&gt;&lt;p&gt;If you just want to get it running quickly, downloading the model manually is usually the most stable option.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open the matching Hugging Face repository page.&lt;/li&gt;
&lt;li&gt;Download the required &lt;code&gt;.gguf&lt;/code&gt; file from &lt;code&gt;Files and versions&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;After the download finishes, run it with the local file path:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -m C:\Users\knightli\Downloads\gemma-4-e4b-it.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This skips the &lt;code&gt;-hf&lt;/code&gt; download step entirely, so the SSL verification error never comes up. It is useful when you only want to verify that the model can run locally.&lt;/p&gt;
&lt;h2 id=&#34;if-you-still-want-to-use--hf-automatic-download&#34;&gt;If you still want to use &lt;code&gt;-hf&lt;/code&gt; automatic download
&lt;/h2&gt;&lt;p&gt;You can manually specify a certificate file path so the program can find a usable CA bundle in the current session.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cacert.pem&lt;/code&gt; can be obtained from the CA Extract page maintained by the curl project:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Page: &lt;a class=&#34;link&#34; href=&#34;https://curl.se/docs/caextract.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://curl.se/docs/caextract.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Direct download: &lt;a class=&#34;link&#34; href=&#34;https://curl.se/ca/cacert.pem&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://curl.se/ca/cacert.pem&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you download it in a browser, open the direct download link and save it as &lt;code&gt;cacert.pem&lt;/code&gt;. You can also download it to a fixed directory with PowerShell:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;New-Item&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-ItemType&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Directory&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-Force&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;C:&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;\&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;certs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;Invoke-WebRequest&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-Uri&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;https&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;//&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;curl&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;se&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ca&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cacert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;pem&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-OutFile&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;C:&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;\&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;certs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;\&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cacert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;pem&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After the download finishes, set these variables in the current CMD session (in PowerShell, use the &lt;code&gt;$env:SSL_CERT_FILE&lt;/code&gt; form instead):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;set SSL_CERT_FILE=C:\certs\cacert.pem
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;set CURL_CA_BUNDLE=C:\certs\cacert.pem
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then run the original command again:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf unsloth/gemma-4-E4B-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the issue really comes from the certificate chain, this usually fixes it directly.&lt;/p&gt;
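&lt;p&gt;Before retrying, it can help to confirm that the downloaded file really is a CA bundle and to build the environment in one place. The sketch below is only an illustration: the helper names &lt;code&gt;count_certificates&lt;/code&gt; and &lt;code&gt;env_with_ca_bundle&lt;/code&gt; are hypothetical, and it assumes the two variables shown above are the ones your &lt;code&gt;llama-cli&lt;/code&gt; build reads.&lt;/p&gt;

```python
import os

PEM_MARKER = "-----BEGIN CERTIFICATE-----"


def count_certificates(pem_text):
    """Count PEM certificate blocks in the text of a CA bundle."""
    return pem_text.count(PEM_MARKER)


def env_with_ca_bundle(ca_path, base_env=None):
    """Return a copy of the environment with both CA variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SSL_CERT_FILE"] = ca_path
    env["CURL_CA_BUNDLE"] = ca_path
    return env


sample = PEM_MARKER + "\nMIIB...\n-----END CERTIFICATE-----\n"
print(count_certificates(sample))
print(env_with_ca_bundle(r"C:\certs\cacert.pem", base_env={}))
```

&lt;p&gt;A count of zero for the real &lt;code&gt;cacert.pem&lt;/code&gt; suggests the download saved an error page rather than the bundle. The returned mapping can be passed as &lt;code&gt;env&lt;/code&gt; to &lt;code&gt;subprocess.run&lt;/code&gt; when launching &lt;code&gt;llama-cli&lt;/code&gt; from a script.&lt;/p&gt;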
</description>
        </item>
        <item>
        <title>How to Get GGUF Models from Hugging Face with llama.cpp</title>
        <link>https://knightli.com/en/2026/04/12/llama-cpp-hugging-face-gguf-models/</link>
        <pubDate>Sun, 12 Apr 2026 09:31:38 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/12/llama-cpp-hugging-face-gguf-models/</guid>
        <description>&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; can work directly with GGUF models hosted on Hugging Face, so you do not always need to download model files manually first.&lt;/p&gt;
&lt;p&gt;If a model repository already provides GGUF files, you can use the &lt;code&gt;-hf&lt;/code&gt; argument in the CLI, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;By default, this downloads from Hugging Face.&lt;br&gt;
If you use another service that exposes a Hugging Face compatible API, you can switch the download endpoint with the &lt;code&gt;MODEL_ENDPOINT&lt;/code&gt; environment variable.&lt;/p&gt;
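&lt;p&gt;When &lt;code&gt;llama-cli&lt;/code&gt; is started from a script, the endpoint switch can be expressed as an environment override. The sketch below is a hedged example: &lt;code&gt;build_invocation&lt;/code&gt; is a hypothetical helper, and the mirror URL is a placeholder rather than a real service.&lt;/p&gt;

```python
import os


def build_invocation(repo, endpoint=None, base_env=None):
    """Return (argv, env) for running llama-cli against a Hugging Face style repo."""
    env = dict(os.environ if base_env is None else base_env)
    if endpoint:
        # MODEL_ENDPOINT switches the download endpoint, as described above.
        env["MODEL_ENDPOINT"] = endpoint
    argv = ["llama-cli", "-hf", repo]
    return argv, env


argv, env = build_invocation(
    "ggml-org/gemma-3-1b-it-GGUF",
    endpoint="https://hf-mirror.example.com",  # hypothetical mirror URL
    base_env={},
)
print(argv)
print(env)
```

&lt;p&gt;To actually launch the download, pass both values to &lt;code&gt;subprocess.run(argv, env=env)&lt;/code&gt; with the full system environment as &lt;code&gt;base_env&lt;/code&gt;.&lt;/p&gt;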
&lt;p&gt;One important detail is that &lt;code&gt;llama.cpp&lt;/code&gt; only works directly with the &lt;code&gt;GGUF&lt;/code&gt; format.&lt;br&gt;
If your model is in another format, you need to convert it first with the &lt;code&gt;convert_*.py&lt;/code&gt; scripts provided in the repository.&lt;/p&gt;
&lt;p&gt;Hugging Face also offers several online tools related to &lt;code&gt;llama.cpp&lt;/code&gt;, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;converting models to &lt;code&gt;GGUF&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;quantizing weights to reduce size&lt;/li&gt;
&lt;li&gt;converting LoRA adapters&lt;/li&gt;
&lt;li&gt;editing GGUF metadata in the browser&lt;/li&gt;
&lt;li&gt;hosting &lt;code&gt;llama.cpp&lt;/code&gt; inference endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only want the practical takeaway, start with repositories that already provide &lt;code&gt;GGUF&lt;/code&gt;, then use &lt;code&gt;llama-cli -hf &amp;lt;user&amp;gt;/&amp;lt;model&amp;gt;&lt;/code&gt;. In most cases, that is the simplest path.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Choosing Llama GGUF Quantization on Hugging Face: Practical Advice from Q8 to Q2</title>
        <link>https://knightli.com/en/2026/04/11/llama-gguf-quantization-selection/</link>
        <pubDate>Sat, 11 Apr 2026 20:07:29 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/11/llama-gguf-quantization-selection/</guid>
        <description>&lt;p&gt;When selecting a Llama GGUF model on Hugging Face, you can think of quantization levels like resolution: lower levels need less VRAM/RAM, but quality drops gradually.&lt;/p&gt;
&lt;h2 id=&#34;understand-32-16-and-q-levels-first&#34;&gt;Understand 32, 16, and Q levels first
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32&lt;/code&gt;: closest to original/uncompressed quality, but hardware demand is extreme.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;16&lt;/code&gt;: still very close to original quality, around half the size of &lt;code&gt;32&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q8&lt;/code&gt;: common entry point for quantized models (&lt;code&gt;Q8_0&lt;/code&gt; or &lt;code&gt;Q8&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6&lt;/code&gt;, &lt;code&gt;Q5&lt;/code&gt;, &lt;code&gt;Q4&lt;/code&gt;, &lt;code&gt;Q3&lt;/code&gt;, &lt;code&gt;Q2&lt;/code&gt;: a lower number means lower resource use and a higher risk of quality loss.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-k_m--k_s-means&#34;&gt;What &lt;code&gt;K_M&lt;/code&gt; / &lt;code&gt;K_S&lt;/code&gt; means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;K_M&lt;/code&gt; and &lt;code&gt;K_S&lt;/code&gt; are k-quant variants that mix precisions (the suffix stands for medium or small):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;most weights stay at the target quantization level&lt;/li&gt;
&lt;li&gt;important parts keep higher precision&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So at the same level, &lt;code&gt;Qx_K_M&lt;/code&gt; or &lt;code&gt;Qx_K_S&lt;/code&gt; is usually slightly better than plain &lt;code&gt;Qx&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;practical-picking-strategy&#34;&gt;Practical picking strategy
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;If hardware allows, start with &lt;code&gt;Q8&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If memory is tight, step down through &lt;code&gt;Q6&lt;/code&gt; / &lt;code&gt;Q5&lt;/code&gt; / &lt;code&gt;Q4&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Try not to go below &lt;code&gt;Q4&lt;/code&gt;; &lt;code&gt;Q4_K_M&lt;/code&gt; is a common lower bound.&lt;/li&gt;
&lt;li&gt;Below &lt;code&gt;Q4&lt;/code&gt;, quality degradation becomes increasingly visible.&lt;/li&gt;
&lt;/ul&gt;
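&lt;p&gt;The step-down strategy above can be turned into a rough size check. The sketch below is only an estimate under stated assumptions: it uses nominal bits per weight, ignores per-block scales, metadata, and the context cache, and the names &lt;code&gt;estimate_gib&lt;/code&gt; and &lt;code&gt;pick_level&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
# Nominal bits per weight, ordered from best quality to smallest size.
# Real GGUF files are somewhat larger, so treat these as rough assumptions.
NOMINAL_BITS = [("Q8", 8), ("Q6", 6), ("Q5", 5), ("Q4", 4)]


def estimate_gib(params_billion, bits):
    """Approximate weight size in GiB for a model with the given parameter count."""
    return params_billion * 1e9 * bits / 8 / 2**30


def pick_level(params_billion, budget_gib):
    """Return the highest level whose estimated size fits the budget, else None."""
    for name, bits in NOMINAL_BITS:
        if budget_gib >= estimate_gib(params_billion, bits):
            return name
    return None  # nothing at Q4 or above fits; consider a smaller model


print(round(estimate_gib(8, 8), 2))
print(pick_level(8, 16))
print(pick_level(8, 5))
```

&lt;p&gt;Leave headroom in the budget for the context cache and the rest of the system; a model that only just fits on paper will usually not run smoothly.&lt;/p&gt;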
&lt;h2 id=&#34;quality-order-best-to-worst&#34;&gt;Quality order (best to worst)
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;16&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; Above this point, quality is effectively the same, but hardware requirements are extreme &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q8&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; This is the typical sweet spot &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; Below this point, quality loss becomes visible &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want one short rule: start with &lt;code&gt;Q8&lt;/code&gt; or &lt;code&gt;Q6_K_M&lt;/code&gt;, then move down to &lt;code&gt;Q5&lt;/code&gt; or &lt;code&gt;Q4_K_M&lt;/code&gt; only when needed.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Download a GGUF Model from Hugging Face and Import It into Ollama</title>
        <link>https://knightli.com/en/2026/04/09/import-huggingface-gguf-into-ollama/</link>
        <pubDate>Thu, 09 Apr 2026 11:00:07 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/09/import-huggingface-gguf-into-ollama/</guid>
        <description>&lt;p&gt;If a model is not available in the official Ollama library, or if you want to use a specific &lt;code&gt;GGUF&lt;/code&gt; file from Hugging Face, you can download it manually and then import it into Ollama.&lt;/p&gt;
&lt;h2 id=&#34;step-1-download-the-gguf-file-from-hugging-face&#34;&gt;Step 1: Download the GGUF file from Hugging Face
&lt;/h2&gt;&lt;p&gt;First, find the target model&amp;rsquo;s &lt;code&gt;GGUF&lt;/code&gt; file on Hugging Face. You will usually see multiple quantized versions, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Which version you choose depends on your VRAM, RAM, and your tradeoff between speed and quality. After downloading, place the &lt;code&gt;.gguf&lt;/code&gt; file in a fixed directory so you can reference it from the &lt;code&gt;Modelfile&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;step-2-write-the-modelfile&#34;&gt;Step 2: Write the Modelfile
&lt;/h2&gt;&lt;p&gt;Create a &lt;code&gt;Modelfile&lt;/code&gt; in the same directory as the model file. The most basic version looks like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;FROM ./model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the filename is different, replace it with the actual filename, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;FROM ./gemma-3-12b-it-q4_k_m.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If your goal is just to get it running, this single &lt;code&gt;FROM&lt;/code&gt; line is usually enough.&lt;/p&gt;
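&lt;p&gt;If you import models often, the minimal &lt;code&gt;Modelfile&lt;/code&gt; can be generated instead of typed by hand. This is a small sketch, assuming the &lt;code&gt;Modelfile&lt;/code&gt; should sit next to the &lt;code&gt;.gguf&lt;/code&gt; file; &lt;code&gt;write_modelfile&lt;/code&gt; is a hypothetical helper, not part of Ollama.&lt;/p&gt;

```python
from pathlib import Path


def write_modelfile(gguf_path):
    """Write a one-line Modelfile next to the given .gguf file and return its path."""
    gguf = Path(gguf_path)
    modelfile = gguf.parent / "Modelfile"
    # A relative FROM path works because both files share the same directory.
    modelfile.write_text(f"FROM ./{gguf.name}\n", encoding="utf-8")
    return modelfile
```

&lt;p&gt;For example, &lt;code&gt;write_modelfile(&#34;gemma-3-12b-it-q4_k_m.gguf&#34;)&lt;/code&gt; writes the same &lt;code&gt;FROM&lt;/code&gt; line as the example above into the current directory.&lt;/p&gt;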
&lt;h2 id=&#34;step-3-import-it-into-ollama&#34;&gt;Step 3: Import it into Ollama
&lt;/h2&gt;&lt;p&gt;Then run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama create myModelName -f Modelfile
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;myModelName&lt;/code&gt; is the local model name you want to use inside Ollama&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-f Modelfile&lt;/code&gt; tells Ollama to create the model from that file&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once creation succeeds, the GGUF file is registered as a local model that you can call directly.&lt;/p&gt;
&lt;h2 id=&#34;step-4-run-the-model&#34;&gt;Step 4: Run the model
&lt;/h2&gt;&lt;p&gt;After creation, run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run myModelName
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;From that point on, it works much like a model pulled with &lt;code&gt;ollama pull&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-to-inspect-an-existing-models-modelfile&#34;&gt;How to inspect an existing model&amp;rsquo;s Modelfile
&lt;/h2&gt;&lt;p&gt;If you are not sure how to write a &lt;code&gt;Modelfile&lt;/code&gt;, you can inspect the configuration of an existing model directly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama show --modelfile llama3.2
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This command prints the &lt;code&gt;Modelfile&lt;/code&gt; for &lt;code&gt;llama3.2&lt;/code&gt;, which is useful as a reference for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;FROM&lt;/code&gt; should be written&lt;/li&gt;
&lt;li&gt;How the template and system prompt are structured&lt;/li&gt;
&lt;li&gt;How parameters are declared&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;when-this-approach-makes-sense&#34;&gt;When this approach makes sense
&lt;/h2&gt;&lt;p&gt;This manual Hugging Face import flow is useful when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model you want is not available in Ollama&amp;rsquo;s official library&lt;/li&gt;
&lt;li&gt;You want a specific quantized variant&lt;/li&gt;
&lt;li&gt;You have already downloaded the &lt;code&gt;GGUF&lt;/code&gt; file manually&lt;/li&gt;
&lt;li&gt;You want finer control over how the model is packaged&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If Ollama already provides an official version, using &lt;code&gt;pull&lt;/code&gt; is usually simpler. But when you need a specific quantization or a custom wrapper, &lt;code&gt;GGUF + Modelfile&lt;/code&gt; gives you more flexibility.&lt;/p&gt;
&lt;h2 id=&#34;common-notes&#34;&gt;Common notes
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;The path after &lt;code&gt;FROM&lt;/code&gt; must match the actual location of the &lt;code&gt;.gguf&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;If the filename contains spaces or special characters, it is better to rename it first.&lt;/li&gt;
&lt;li&gt;Different &lt;code&gt;GGUF&lt;/code&gt; quantization levels can greatly affect memory use and speed, so successful import does not guarantee smooth runtime performance.&lt;/li&gt;
&lt;li&gt;If the model is a chat model, you may still need to adjust the prompt template later for better results.&lt;/li&gt;
&lt;/ul&gt;
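&lt;p&gt;The renaming advice in the notes above can be automated. The sketch below shows one possible convention, not a rule imposed by Ollama: the helper &lt;code&gt;safe_gguf_name&lt;/code&gt; is hypothetical and simply replaces every run of characters outside letters, digits, dots, underscores, and hyphens with a single underscore.&lt;/p&gt;

```python
import re


def safe_gguf_name(filename):
    """Replace spaces and special characters in a filename with underscores."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", filename)


print(safe_gguf_name("my model (v2).gguf"))
print(safe_gguf_name("gemma-3-12b-it-q4_k_m.gguf"))
```

&lt;p&gt;Names that are already safe pass through unchanged, so the helper can be applied to every downloaded file before writing the &lt;code&gt;FROM&lt;/code&gt; line.&lt;/p&gt;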
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Downloading a &lt;code&gt;GGUF&lt;/code&gt; file from Hugging Face and importing it into Ollama is not complicated. Prepare the model file, write a minimal &lt;code&gt;Modelfile&lt;/code&gt;, then run &lt;code&gt;ollama create&lt;/code&gt;, and you can bring a third-party &lt;code&gt;GGUF&lt;/code&gt; model into your Ollama workflow.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
