<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>GGUF on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/gguf/</link>
        <description>Recent content in GGUF on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sun, 24 May 2026 23:52:16 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/gguf/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Qwen3.6-35B-A3B jailbreak local deployment: uncensored GGUF, llama.cpp, and safety boundaries</title>
        <link>https://knightli.com/en/2026/05/24/qwen36-35b-a3b-local-deployment-llamacpp-gguf/</link>
        <pubDate>Sun, 24 May 2026 23:52:16 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/24/qwen36-35b-a3b-local-deployment-llamacpp-gguf/</guid>
        <description>&lt;p&gt;Freedidi recently introduced a popular local model: &lt;code&gt;Qwen3.6-35B-A3B Uncensored HauhauCS Aggressive&lt;/code&gt;. The original article describes it as a jailbreak or uncensored open model, and gives GGUF quantized files, a llama.cpp launch method, and ideas for connecting it to agents.&lt;/p&gt;
&lt;p&gt;This kind of model is worth watching, but it should be understood calmly. The point is not only that it has fewer restrictions. It brings several important local AI capabilities together:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A 35B-class model with a MoE architecture.&lt;/li&gt;
&lt;li&gt;GGUF quantization that can run on consumer GPUs.&lt;/li&gt;
&lt;li&gt;An OpenAI-compatible local API through llama.cpp.&lt;/li&gt;
&lt;li&gt;Multimodal vision input through &lt;code&gt;mmproj&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Integration with local agent tools such as Hermes and OpenClaw.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you care about local models, the more important trend is not the jailbreak label. It is that local models are moving from “can chat” toward “can use tools, understand images, and serve as agent backends.”&lt;/p&gt;
&lt;h2 id=&#34;what-this-model-is&#34;&gt;What this model is
&lt;/h2&gt;&lt;p&gt;The model name mentioned in the original article is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Qwen3.6-35B-A3B Uncensored HauhauCS Aggressive
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The name contains several key pieces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6&lt;/code&gt;: based on the Qwen model family.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;35B&lt;/code&gt;: around 35B total parameters.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;A3B&lt;/code&gt;: roughly 3B active parameters per inference step, following a MoE-style design.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Uncensored&lt;/code&gt; / &lt;code&gt;Aggressive&lt;/code&gt;: fewer safety restrictions or a more aggressive tuning style.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GGUF&lt;/code&gt;: a quantized format for local inference tools such as llama.cpp.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One important note: &lt;code&gt;Uncensored&lt;/code&gt; does not mean more reliable. It usually means the model refuses less often, but it may also generate unconstrained, unverified, or risky content more easily. It can be useful for technical experiments, but it should not be connected directly to public services, production systems, or unattended workflows.&lt;/p&gt;
&lt;h2 id=&#34;why-a-35b-model-can-run-locally&#34;&gt;Why a 35B model can run locally
&lt;/h2&gt;&lt;p&gt;Many people see &lt;code&gt;35B&lt;/code&gt; and assume it requires a server or high-end multi-GPU machine. The key point in the original article is that this model uses a MoE architecture.&lt;/p&gt;
&lt;p&gt;MoE can be understood simply: the model has many total parameters, but each inference step activates only part of the experts. The original article says it activates roughly 3B parameters per run, so with quantization it can have much lower speed and VRAM pressure than a traditional dense 35B model.&lt;/p&gt;
&lt;p&gt;After GGUF quantization, it becomes possible to run it on consumer GPUs. The article says the smallest quantized version is around 11GB, and 6GB/8GB GPUs can try it, though at least 8GB VRAM is recommended.&lt;/p&gt;
&lt;p&gt;A more realistic expectation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6GB VRAM: possible with low-bit quantization, but reduce expectations for context length and speed.&lt;/li&gt;
&lt;li&gt;8GB VRAM: better for entry-level testing with smaller quantization.&lt;/li&gt;
&lt;li&gt;16GB VRAM: more comfortable for longer context and more GPU offload.&lt;/li&gt;
&lt;li&gt;24GB VRAM: better for higher-quality quantizations such as Q4_K_M and Q4_K_P.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Whether a local model is usable is not only about whether it starts. Context length, generation speed, remaining VRAM, KV cache, multimodal mode, concurrency, and task type all matter.&lt;/p&gt;
&lt;h2 id=&#34;how-to-read-the-quantization-choices&#34;&gt;How to read the quantization choices
&lt;/h2&gt;&lt;p&gt;The original article roughly recommends:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_K_P&lt;/code&gt;: better for RTX 4090 or other 24GB VRAM machines.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;: more stable and higher quality.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IQ4_NL&lt;/code&gt;: strong compression while preserving quality as much as possible.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IQ2_M&lt;/code&gt;: for 6GB/8GB VRAM users.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Think of this as a trade-off between quality and resource usage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Q4 quantizations are usually more stable, but use more VRAM.&lt;/li&gt;
&lt;li&gt;IQ2 / IQ3 quantizations save resources, but may reduce answer quality, long-text stability, and detail handling.&lt;/li&gt;
&lt;li&gt;If you only want to test agent calls and a local API, low quantization can help you get the flow running.&lt;/li&gt;
&lt;li&gt;If you plan to write code, analyze images, or do complex reasoning for long periods, choose higher-quality quantization when possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not treat “it starts” as “it is good enough for long-term use.” Low-VRAM startup and stable task completion are different things.&lt;/p&gt;
&lt;h2 id=&#34;llamacpp-deployment-approach&#34;&gt;llama.cpp deployment approach
&lt;/h2&gt;&lt;p&gt;The original article recommends &lt;code&gt;llama.cpp&lt;/code&gt; because it supports Windows, Linux, macOS, and backends such as NVIDIA CUDA, AMD, Intel, Vulkan, and CPU.&lt;/p&gt;
&lt;p&gt;A typical launch command looks like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;llama-server&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;exe&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;^&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;-m&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;model-path.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;^&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-mmproj&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;mmproj.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;^&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;-ngl&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;999&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;^&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;-c&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;131072&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;^&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;-n&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;8192&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;^&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-host&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;127.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;^&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-port&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;8080&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;^&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-jinja&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Several parameters are worth understanding:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-m&lt;/code&gt;: path to the main GGUF model.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--mmproj&lt;/code&gt;: multimodal projection file required for vision input.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-ngl&lt;/code&gt;: offload layers to GPU as much as possible, depending on VRAM and backend.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-c&lt;/code&gt;: context length; higher values use more memory and VRAM.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-n&lt;/code&gt;: maximum generated tokens per response.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--host 127.0.0.1&lt;/code&gt;: listen only locally, safer than exposing publicly.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--port 8080&lt;/code&gt;: local API port.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--jinja&lt;/code&gt;: important for newer Qwen chat templates; without it, formatting issues, repetition, or Chinese output problems may occur.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The easiest trap is context length. &lt;code&gt;-c 131072&lt;/code&gt; looks attractive, but long context significantly increases KV cache usage. On low-VRAM machines, start smaller and increase gradually.&lt;/p&gt;
&lt;h2 id=&#34;how-multimodal-support-works&#34;&gt;How multimodal support works
&lt;/h2&gt;&lt;p&gt;The article says this build supports multimodal vision, including image analysis, screenshots, OCR, complex UI analysis, and code screenshots.&lt;/p&gt;
&lt;p&gt;In llama.cpp, multimodal support usually requires both the main model and the matching &lt;code&gt;mmproj&lt;/code&gt; file. If &lt;code&gt;--mmproj&lt;/code&gt; is not loaded correctly, image upload may be unavailable or the model may not understand images correctly.&lt;/p&gt;
&lt;p&gt;Useful local multimodal scenarios include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Analyzing UI screenshots.&lt;/li&gt;
&lt;li&gt;OCR on image text.&lt;/li&gt;
&lt;li&gt;Reading code screenshots or error screenshots.&lt;/li&gt;
&lt;li&gt;Providing visual input to local agents.&lt;/li&gt;
&lt;li&gt;Processing private images without uploading them to the cloud.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But vision understanding is not strict OCR or a guaranteed source of truth. For invoices, contracts, IDs, medical images, and other high-risk material, human review is still required.&lt;/p&gt;
&lt;h2 id=&#34;openai-compatible-api&#34;&gt;OpenAI-compatible API
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;llama-server&lt;/code&gt; in llama.cpp can expose a local interface similar to the OpenAI API. The local base URL from the original article is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://127.0.0.1:8080/v1
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This means many tools that support custom OpenAI-compatible providers can send requests to the local model. The API key can often be any placeholder value, depending on whether the client enforces validation.&lt;/p&gt;
&lt;p&gt;This is useful because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No cloud API key is needed.&lt;/li&gt;
&lt;li&gt;There is no per-token billing.&lt;/li&gt;
&lt;li&gt;Data can remain on the local machine.&lt;/li&gt;
&lt;li&gt;It can connect to local agents, coding assistants, or chat frontends.&lt;/li&gt;
&lt;li&gt;It can be used as a local OpenAI API replacement for experiments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not expose the local API directly to the public internet. Even when the model runs locally, an open API can be abused, consume machine resources, or produce content you did not intend to generate.&lt;/p&gt;
&lt;h2 id=&#34;why-hermes-and-openclaw-matter&#34;&gt;Why Hermes and OpenClaw matter
&lt;/h2&gt;&lt;p&gt;The original article says the value becomes clearer when connecting this local model to Hermes or OpenClaw.&lt;/p&gt;
&lt;p&gt;The meaning is that the model itself is only the inference core. Agent tools connect it to real tasks, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Writing code.&lt;/li&gt;
&lt;li&gt;Calling tools.&lt;/li&gt;
&lt;li&gt;Reading files.&lt;/li&gt;
&lt;li&gt;Analyzing images.&lt;/li&gt;
&lt;li&gt;Searching the web.&lt;/li&gt;
&lt;li&gt;Executing multi-step tasks.&lt;/li&gt;
&lt;li&gt;Maintaining long-context workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A local model used only for chat has limited value. If it can act as a stable agent backend, it becomes closer to a local AI workstation.&lt;/p&gt;
&lt;p&gt;However, connecting an uncensored model to an agent requires extra caution. When the agent can operate files, run commands, visit web pages, and call tools, model output turns into real actions. The fewer restrictions the model has, the more important external permissions, human confirmation, and audit logs become.&lt;/p&gt;
&lt;h2 id=&#34;safety-boundaries-for-uncensored-models&#34;&gt;Safety boundaries for uncensored models
&lt;/h2&gt;&lt;p&gt;The main selling point of these models is often that they refuse less. But fewer refusals also mean higher risk.&lt;/p&gt;
&lt;p&gt;Keep in mind:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It may more easily produce illegal, dangerous, or misleading content.&lt;/li&gt;
&lt;li&gt;It may not actively remind you of safety boundaries.&lt;/li&gt;
&lt;li&gt;It may give overconfident advice on high-risk topics.&lt;/li&gt;
&lt;li&gt;It may be induced by prompts to perform inappropriate tasks.&lt;/li&gt;
&lt;li&gt;It is not suitable for direct public exposure.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A safer approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Test only on a local machine or controlled LAN.&lt;/li&gt;
&lt;li&gt;Do not connect it to high-privilege tools.&lt;/li&gt;
&lt;li&gt;Do not let it automatically delete, pay, publish, or bulk-submit.&lt;/li&gt;
&lt;li&gt;Put file, command, network, and browser permission boundaries around agent tools.&lt;/li&gt;
&lt;li&gt;Keep human review for high-risk outputs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The freer the model is, the more external system constraints it needs.&lt;/p&gt;
&lt;h2 id=&#34;who-should-try-it&#34;&gt;Who should try it
&lt;/h2&gt;&lt;p&gt;This kind of model fits users who:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Want to study local LLM deployment.&lt;/li&gt;
&lt;li&gt;Have at least 8GB VRAM and are willing to tune GGUF and llama.cpp.&lt;/li&gt;
&lt;li&gt;Want to connect local models to OpenAI-compatible clients.&lt;/li&gt;
&lt;li&gt;Care about local multimodal input, screenshot analysis, and agent backends.&lt;/li&gt;
&lt;li&gt;Want to process some private data offline.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is less suitable for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Beginners who do not want to tune parameters.&lt;/li&gt;
&lt;li&gt;Services that require stable production SLA.&lt;/li&gt;
&lt;li&gt;Teams with strict security and compliance requirements.&lt;/li&gt;
&lt;li&gt;Business workflows that require strict factual reliability.&lt;/li&gt;
&lt;li&gt;People who want to expose the model directly to external users.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Models like &lt;code&gt;Qwen3.6-35B-A3B Uncensored HauhauCS Aggressive&lt;/code&gt; show that local AI capabilities are moving quickly. Consumer GPUs can run larger models, GGUF quantization lowers deployment barriers, llama.cpp gives local models OpenAI-compatible APIs, and multimodal plus agent tools push them from chat toward task execution.&lt;/p&gt;
&lt;p&gt;But it should not be understood only as a jailbreak model. The more valuable angle is that local AI is becoming composable infrastructure. The final experience depends on the model, inference engine, API server, frontend, agent tools, and permission controls together.&lt;/p&gt;
&lt;p&gt;If you try it, start with low-risk local testing: choose an appropriate quantization, reduce context length, verify &lt;code&gt;--jinja&lt;/code&gt; and &lt;code&gt;--mmproj&lt;/code&gt;, then connect a client. After it is stable, consider connecting agent workflows.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Freedidi article: &lt;a class=&#34;link&#34; href=&#34;https://www.freedidi.com/24284.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://www.freedidi.com/24284.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;llama.cpp GitHub: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running Qwen3.6-35B Locally on an RTX 3070 8GB: llama.cpp Deployment Notes and Tuning Parameters</title>
        <link>https://knightli.com/en/2026/05/22/rtx-3070-8gb-qwen36-35b-llama-cpp-local-deployment/</link>
        <pubDate>Fri, 22 May 2026 22:44:16 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/22/rtx-3070-8gb-qwen36-35b-llama-cpp-local-deployment/</guid>
        <description>&lt;p&gt;Whether an 8GB GPU can run a 35B-class model depends on more than the total parameter count. Model architecture, quantization format, and the way the inference framework schedules work all matter.&lt;/p&gt;
&lt;p&gt;The core idea in this setup is to use a GGUF quantized version of an MoE model such as Qwen3.6-35B-A3B, then use llama.cpp with CUDA acceleration, CPU Offload, MoE parameter scheduling, and KV Cache quantization to split memory pressure between the GPU and system RAM. With that approach, an older GPU such as the RTX 3070 8GB can still have a chance to run a 35B-class local multimodal model.&lt;/p&gt;
&lt;p&gt;One point needs to be clear first: this is not &amp;ldquo;fitting a full 35B model entirely into 8GB of VRAM.&amp;rdquo; A more accurate way to understand it is that the GPU handles the compute that benefits most from GPU acceleration, while some expert layers and cache pressure are carried by system memory. The real experience depends on RAM capacity, CPU performance, quantization format, context length, and parameter choices.&lt;/p&gt;
&lt;h2 id=&#34;test-environment&#34;&gt;Test environment
&lt;/h2&gt;&lt;p&gt;This kind of setup is sensitive to system memory. A reference configuration is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CPU: Intel Core i7-12700 class&lt;/li&gt;
&lt;li&gt;GPU: NVIDIA RTX 3070 8GB&lt;/li&gt;
&lt;li&gt;RAM: 64GB&lt;/li&gt;
&lt;li&gt;OS: Windows 11&lt;/li&gt;
&lt;li&gt;Inference framework: llama.cpp CUDA build&lt;/li&gt;
&lt;li&gt;Model format: GGUF&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only have 16GB or 32GB of RAM, it is not necessarily impossible to try, but a 35B MoE model is more likely to create memory pressure during loading and long-context inference. For stable use, 64GB of RAM is a safer target.&lt;/p&gt;
&lt;h2 id=&#34;why-8gb-vram-can-still-run-a-35b-model&#34;&gt;Why 8GB VRAM can still run a 35B model
&lt;/h2&gt;&lt;p&gt;The key to Qwen3.6-35B-A3B is its MoE architecture. Its total parameter scale is 35B, but not all parameters are activated during each inference step; only part of the expert parameters are active.&lt;/p&gt;
&lt;p&gt;That leads to two consequences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The full model file is still large and requires enough disk space and system memory.&lt;/li&gt;
&lt;li&gt;The active compute per inference step is lower than a full 35B Dense model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;llama.cpp&amp;rsquo;s CPU Offload and MoE-related parameters can further reduce the VRAM threshold. The GPU mainly handles attention and some high-value compute, while the CPU and system memory carry part of the expert-layer weights. The tradeoff is that speed, response latency, and stability depend more on the whole machine, not only the GPU model.&lt;/p&gt;
&lt;h2 id=&#34;preparing-llamacpp&#34;&gt;Preparing llama.cpp
&lt;/h2&gt;&lt;p&gt;Windows users can download a prebuilt CUDA version of llama.cpp directly. Pay attention to three points:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The GPU driver should be new enough, and the CUDA runtime should match the llama.cpp package you download.&lt;/li&gt;
&lt;li&gt;After downloading, place it in a path without Chinese characters or special characters so batch scripts are easier to run.&lt;/li&gt;
&lt;li&gt;Put model files under a unified &lt;code&gt;models&lt;/code&gt; directory to avoid very long paths in commands.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you use AMD, Intel graphics, or a CPU-only environment, you can also choose Vulkan, HIP, SYCL, or CPU builds, but the parameters and performance will be different. This article focuses on the CUDA route for NVIDIA GPUs.&lt;/p&gt;
&lt;h2 id=&#34;download-the-model-and-multimodal-projection-file&#34;&gt;Download the model and multimodal projection file
&lt;/h2&gt;&lt;p&gt;The model used here is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-35B-A3B-UD-Q4_K_M.gguf&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;Q4_K_M&lt;/code&gt; quantization format is chosen mainly to balance accuracy, file size, and speed. On low-VRAM machines, it is not a good idea to start with a higher-precision version, because loading failures or frequent system paging become much more likely.&lt;/p&gt;
&lt;p&gt;If you want image understanding, you also need the multimodal projection file, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;mmproj-BF16.gguf&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This file is important. Downloading only the main model usually gives you text inference only. Without &lt;code&gt;mmproj&lt;/code&gt;, the web UI may not expose a usable image upload feature, or uploaded images may not be processed correctly.&lt;/p&gt;
&lt;p&gt;Keep the directory structure simple:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama.cpp/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├─ llama-server.exe
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;└─ models/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   ├─ Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   └─ mmproj-BF16.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;rtx-3070-8gb-startup-parameters&#34;&gt;RTX 3070 8GB startup parameters
&lt;/h2&gt;&lt;p&gt;Below is an example startup script for an RTX 3070 8GB. Change the path to your own llama.cpp directory.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bat&#34; data-lang=&#34;bat&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;@&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; off
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;chcp 65001 &lt;span class=&#34;p&#34;&gt;&amp;gt;&lt;/span&gt;nul
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;cd&lt;/span&gt; /d D:\AI\llama.cpp
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server.exe &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -m &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --mmproj &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\mmproj-BF16.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -ngl 99 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --n-cpu-moe 999 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --flash-attn on &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --jinja &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -c 32768 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -t 12 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -b 512 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -ub 128 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --cache-type-k q4_0 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --cache-type-v q4_0 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --mlock &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --host 127.0.0.1 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --port 8080
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;pause&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After startup, open this address in your browser:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://127.0.0.1:8080
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the page opens and the model replies normally, the service has started successfully. The first model load can be slow. Avoid launching multiple instances repeatedly during loading, because that can fill system memory more easily.&lt;/p&gt;
&lt;h2 id=&#34;understanding-the-key-parameters&#34;&gt;Understanding the key parameters
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;-ngl 99&lt;/code&gt; tries to place as many layers as possible on the GPU. How many layers actually fit depends on the model structure, quantization format, and VRAM usage.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--n-cpu-moe 999&lt;/code&gt; pushes more MoE expert layers to the CPU side, reducing VRAM pressure. It is one of the key parameters for running large MoE models on low-VRAM hardware.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--flash-attn on&lt;/code&gt; enables Flash Attention, which can reduce the cost of attention computation. Whether it is available depends on the current llama.cpp version and GPU support.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;-c 32768&lt;/code&gt; sets the context length. Long context significantly increases KV Cache pressure. If startup fails or inference is very slow, try lowering it to &lt;code&gt;8192&lt;/code&gt; or &lt;code&gt;16384&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--cache-type-k q4_0&lt;/code&gt; and &lt;code&gt;--cache-type-v q4_0&lt;/code&gt; quantize the KV Cache, saving memory and VRAM, though they may have a small impact on output quality and speed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;-b 512&lt;/code&gt; and &lt;code&gt;-ub 128&lt;/code&gt; control batching-related parameters. In a low-VRAM environment, do not start with overly aggressive batch settings.&lt;/p&gt;
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues
&lt;/h2&gt;&lt;p&gt;If startup reports insufficient VRAM, first reduce the context length, for example changing &lt;code&gt;-c 32768&lt;/code&gt; to &lt;code&gt;-c 8192&lt;/code&gt;, then try lowering &lt;code&gt;-b&lt;/code&gt; and &lt;code&gt;-ub&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If the image upload button is unavailable, first check whether the &lt;code&gt;--mmproj&lt;/code&gt; path is correct and whether the &lt;code&gt;mmproj&lt;/code&gt; file matches the model.&lt;/p&gt;
&lt;p&gt;If the model responds slowly after loading, it usually does not mean the GPU is idle. Large amounts of weights or expert layers may be handled by the CPU and system memory. Use Task Manager to observe GPU, CPU, memory, and disk usage to identify the bottleneck.&lt;/p&gt;
&lt;p&gt;If the output format looks wrong, confirm that &lt;code&gt;--jinja&lt;/code&gt; is enabled and check whether the model requires the corresponding chat template.&lt;/p&gt;
&lt;p&gt;If the browser cannot open the service after startup, check the &lt;code&gt;--host&lt;/code&gt; and &lt;code&gt;--port&lt;/code&gt; settings, and make sure port 8080 is not occupied by another program.&lt;/p&gt;
&lt;h2 id=&#34;who-should-try-this&#34;&gt;Who should try this
&lt;/h2&gt;&lt;p&gt;This setup is suitable for users who already have 8GB VRAM devices such as RTX 3070, RTX 4060 Laptop, or RTX 3060 8GB, but want to experiment with larger MoE models.&lt;/p&gt;
&lt;p&gt;It is not suitable for people who need maximum speed. Running a 35B MoE model on low VRAM essentially trades CPU and system memory for a lower VRAM requirement. Being able to run it is one thing; whether it feels smooth enough is another.&lt;/p&gt;
&lt;p&gt;If your goal is high-frequency daily chatting, 7B, 8B, or 14B models may feel better. If your goal is to explore larger MoE models, multimodal capability, and the boundary of local deployment, an RTX 3070 8GB with 64GB of RAM is still worth trying.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The reason an RTX 3070 8GB can run Qwen3.6-35B-A3B is not that the GPU suddenly has more VRAM. It is the combination of MoE architecture, GGUF quantization, llama.cpp CPU Offload, and KV Cache optimization that lowers the threshold.&lt;/p&gt;
&lt;p&gt;The most interesting part of this setup is that it lets older GPUs still participate in local large-model experiments. As long as you accept tradeoffs in speed and stability, an 8GB VRAM machine can still be a local AI model testing platform, not only an entry-level device for small models.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Original article: &lt;a class=&#34;link&#34; href=&#34;https://www.freedidi.com/24267.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://www.freedidi.com/24267.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>llama.cpp b9196 Update: Windows Prebuilt Binaries Support CUDA 13.1, Vulkan, HIP, and SYCL</title>
        <link>https://knightli.com/en/2026/05/18/llama-cpp-windows-cuda-vulkan-gguf/</link>
        <pubDate>Mon, 18 May 2026 23:20:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/18/llama-cpp-windows-cuda-vulkan-gguf/</guid>
        <description>&lt;p&gt;The recent Windows release of &lt;code&gt;llama.cpp&lt;/code&gt; is much friendlier for local LLM users. In the past, running GGUF models on Windows often meant dealing with environment issues: CUDA version mismatches, missing DLLs, incompatible drivers, failed CMake builds, wrong environment variables, or complicated Vulkan / HIP / SYCL setup.&lt;/p&gt;
&lt;p&gt;Now the official Release page provides several Windows prebuilt packages. In many cases, users no longer need to compile from source. Download the right build, unzip it, place the model file, and you can start a local inference service directly.&lt;/p&gt;
&lt;h2 id=&#34;what-llamacpp-is-good-for&#34;&gt;What llama.cpp Is Good For
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; is one of the most commonly used local GGUF model inference frameworks. It is lightweight, cross-platform, can run on CPU or GPU, and has a large ecosystem of GGUF model resources.&lt;/p&gt;
&lt;p&gt;Common model families include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Qwen&lt;/li&gt;
&lt;li&gt;Llama&lt;/li&gt;
&lt;li&gt;DeepSeek&lt;/li&gt;
&lt;li&gt;Gemma&lt;/li&gt;
&lt;li&gt;Mistral&lt;/li&gt;
&lt;li&gt;Mixtral&lt;/li&gt;
&lt;li&gt;Hermes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As GGUF quantized models become more common, many open source models now provide GGUF versions suitable for local deployment. For regular users, the value of &lt;code&gt;llama.cpp&lt;/code&gt; is simple: you do not need a full complex inference stack to run a usable chat service on your own machine.&lt;/p&gt;
&lt;h2 id=&#34;how-to-choose-a-windows-prebuilt-build&#34;&gt;How to Choose a Windows Prebuilt Build
&lt;/h2&gt;&lt;p&gt;Windows users can choose different builds based on their hardware:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Windows x64 CPU&lt;/li&gt;
&lt;li&gt;Windows x64 CUDA 12.4&lt;/li&gt;
&lt;li&gt;Windows x64 CUDA 13.1&lt;/li&gt;
&lt;li&gt;Windows x64 Vulkan&lt;/li&gt;
&lt;li&gt;Windows x64 HIP Radeon&lt;/li&gt;
&lt;li&gt;Windows x64 SYCL&lt;/li&gt;
&lt;li&gt;Windows ARM64 CPU&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you use an NVIDIA GPU, the CUDA build is usually the first choice. Cards such as RTX 3060, 4060, 4070, 4080, and 4090 are better suited to the CUDA route.&lt;/p&gt;
&lt;p&gt;If you use an AMD GPU, try HIP or Vulkan. In practice, Vulkan can sometimes be easier than HIP, especially if you do not want to set up a full ROCm environment.&lt;/p&gt;
&lt;p&gt;If you use Intel integrated graphics or an Arc GPU, try SYCL or Vulkan. Performance is usually behind NVIDIA CUDA, but it is already enough to test many small and medium GGUF models.&lt;/p&gt;
&lt;p&gt;The CPU build is suitable for users without a discrete GPU, or for those who only want to verify a model or run small models. It will not be fast, but deployment is the simplest.&lt;/p&gt;
&lt;h2 id=&#34;start-a-regular-gguf-model&#34;&gt;Start a Regular GGUF Model
&lt;/h2&gt;&lt;p&gt;Assume you have downloaded the &lt;code&gt;llama.cpp&lt;/code&gt; Windows prebuilt package and placed your model in the &lt;code&gt;models&lt;/code&gt; directory. Enter the extracted &lt;code&gt;llama.cpp&lt;/code&gt; directory and run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;llama-server&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;exe&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-m&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;models&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;\&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;your-model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;gguf&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-ngl&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;999&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Here, &lt;code&gt;-m&lt;/code&gt; points to the GGUF model file, and &lt;code&gt;-ngl 999&lt;/code&gt; tells llama.cpp to load as many layers as possible onto the GPU. The actual number depends on VRAM size, model size, and quantization format.&lt;/p&gt;
&lt;p&gt;After startup succeeds, open this address in your browser:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://127.0.0.1:8080
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You will enter the local web chat interface.&lt;/p&gt;
&lt;p&gt;If VRAM is not enough, switch to a smaller model or a lower quantization version, such as Q4 or Q5 GGUF files. Do not only look at parameter count; also check quantization format and context length settings.&lt;/p&gt;
&lt;h2 id=&#34;start-a-multimodal-vision-model&#34;&gt;Start a Multimodal Vision Model
&lt;/h2&gt;&lt;p&gt;Multimodal vision models usually need more than the main model file. They also need an &lt;code&gt;mmproj&lt;/code&gt; vision projection file. Start them by specifying both:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;llama-server&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;exe&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-m&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\main-model.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-mmproj&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\mmproj-model.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-ngl&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;999&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Common uses include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OCR recognition&lt;/li&gt;
&lt;li&gt;Screenshot understanding&lt;/li&gt;
&lt;li&gt;Webpage screenshot analysis&lt;/li&gt;
&lt;li&gt;Image Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;Simple visual content judgment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, Qwen2-VL / Qwen2.5-VL models are useful for Chinese screenshot understanding, OCR, and image-text Q&amp;amp;A. Make sure the main model and &lt;code&gt;mmproj&lt;/code&gt; file match; version mismatches can easily cause loading failures or abnormal output.&lt;/p&gt;
&lt;h2 id=&#34;use-a-bat-script-to-manage-multiple-models&#34;&gt;Use a bat Script to Manage Multiple Models
&lt;/h2&gt;&lt;p&gt;If you keep multiple models locally, you can write a simple &lt;code&gt;.bat&lt;/code&gt; script to switch between them. The following example needs your own path and model names:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bat&#34; data-lang=&#34;bat&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;@&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; off
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;chcp 65001 &lt;span class=&#34;p&#34;&gt;&amp;gt;&lt;/span&gt;nul
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;cd&lt;/span&gt; /d C:\path\to\llama-b9196-bin-win-cuda-13.1-x64
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; 请选择模型：
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; 1. Gemma
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; 2. Qwen VL 多模态
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; 3. DeepSeek
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;set&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;/p&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;choice&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;输入数字：
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;%choice%&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;1&amp;#34;&lt;/span&gt; llama-server.exe -m &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\gemma.gguf&amp;#34;&lt;/span&gt; -ngl 999
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;%choice%&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;2&amp;#34;&lt;/span&gt; llama-server.exe -m &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\qwen-vl.gguf&amp;#34;&lt;/span&gt; --mmproj &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\mmproj.gguf&amp;#34;&lt;/span&gt; -ngl 999
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;%choice%&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;3&amp;#34;&lt;/span&gt; llama-server.exe -m &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\deepseek.gguf&amp;#34;&lt;/span&gt; -ngl 999
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;pause&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Save it as UTF-8, then change the extension to &lt;code&gt;.bat&lt;/code&gt;. Double-clicking the script lets you choose different models by number.&lt;/p&gt;
&lt;h2 id=&#34;three-things-to-check-when-choosing-models&#34;&gt;Three Things to Check When Choosing Models
&lt;/h2&gt;&lt;p&gt;First, check hardware. More VRAM means you can run larger models. If VRAM is limited, do not force a large model; start with 7B, 8B, or a lower quantization version.&lt;/p&gt;
&lt;p&gt;Second, check the use case. For everyday Q&amp;amp;A, summarization, and rewriting, small models or medium quantization are often enough. For coding, long-document analysis, or multimodal understanding, you need stronger models and more VRAM.&lt;/p&gt;
&lt;p&gt;Third, check licenses and safety boundaries. Many community-modified models have different capabilities, restrictions, and licenses. Before downloading, confirm the source, license, intended use, and risks. Do not hand production work directly to models from unclear sources.&lt;/p&gt;
&lt;h2 id=&#34;common-issues&#34;&gt;Common Issues
&lt;/h2&gt;&lt;p&gt;If startup reports missing DLLs, first confirm that the downloaded package matches your GPU route. NVIDIA users should not download the HIP build by mistake, and AMD users should not download the CUDA build.&lt;/p&gt;
&lt;p&gt;If model loading is slow, the model may be too large, the disk may be slow, or part of the model may be falling back to CPU due to insufficient VRAM.&lt;/p&gt;
&lt;p&gt;If the web page does not open, check whether the command line service started successfully, then confirm the port is &lt;code&gt;8080&lt;/code&gt;. If the port is occupied, check &lt;code&gt;llama-server&lt;/code&gt; parameters and change the port.&lt;/p&gt;
&lt;p&gt;If a multimodal model behaves incorrectly, first check whether the &lt;code&gt;mmproj&lt;/code&gt; file matches the main model instead of only changing prompts.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The value of these Windows prebuilt packages is that they lower the entry barrier for local AI. Many users previously got stuck at compilation and dependency setup. Now they can move faster into downloading models, starting a service, and testing results.&lt;/p&gt;
&lt;p&gt;For Windows users, the route can be summarized simply:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NVIDIA: prefer CUDA.&lt;/li&gt;
&lt;li&gt;AMD: try Vulkan first, then HIP.&lt;/li&gt;
&lt;li&gt;Intel: try SYCL or Vulkan.&lt;/li&gt;
&lt;li&gt;No discrete GPU: use the CPU build for small models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Before real use, still confirm model source, license, VRAM needs, and actual results. Local AI gives you control, offline operation, and low latency, but it is not free of cost: model management, hardware resources, and output quality are still your responsibility.&lt;/p&gt;
&lt;p&gt;Source: &lt;a class=&#34;link&#34; href=&#34;https://www.freedidi.com/24211.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://www.freedidi.com/24211.html&lt;/a&gt;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Local LLM Models Recommended for an RTX 3060 GPU</title>
        <link>https://knightli.com/en/2026/05/08/rtx-3060-local-llm-models/</link>
        <pubDate>Fri, 08 May 2026 09:25:24 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/08/rtx-3060-local-llm-models/</guid>
        <description>&lt;p&gt;The most common RTX 3060 variant has 12GB of VRAM. It is not a top-tier AI GPU, but it is a very usable card for local LLMs, especially 7B, 8B, 9B, and 12B models.&lt;/p&gt;
&lt;p&gt;If you only want a quick rule of thumb, remember this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On an RTX 3060 12GB, prioritize around-8B models in Q4_K_M or Q5_K_M quantization. Choose Q4 for stability, and try Q5 if you want better quality.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Do not start by chasing 32B or 70B models. Even if they can run with low-bit quantization and CPU offloading, their speed and experience are usually not suitable for daily use.&lt;/p&gt;
&lt;h2 id=&#34;start-with-the-vram-limit&#34;&gt;Start With the VRAM Limit
&lt;/h2&gt;&lt;p&gt;For local LLMs on an RTX 3060 12GB, the real limit is VRAM.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model Size&lt;/th&gt;
          &lt;th&gt;Recommended Quantization&lt;/th&gt;
          &lt;th&gt;RTX 3060 12GB Experience&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;3B / 4B&lt;/td&gt;
          &lt;td&gt;Q4, Q5, Q8&lt;/td&gt;
          &lt;td&gt;Very easy, fast&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;7B / 8B / 9B&lt;/td&gt;
          &lt;td&gt;Q4_K_M, Q5_K_M&lt;/td&gt;
          &lt;td&gt;Best balance of quality and speed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;12B / 14B&lt;/td&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;Usable, but avoid huge context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;30B+&lt;/td&gt;
          &lt;td&gt;Q2 / Q3 or partial offload&lt;/td&gt;
          &lt;td&gt;Possible to tinker with, not recommended daily&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;70B+&lt;/td&gt;
          &lt;td&gt;Very low quantization or heavy CPU/RAM use&lt;/td&gt;
          &lt;td&gt;More like an experiment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Local LLMs do not only consume VRAM for the model file. Context length, KV cache, batch size, inference framework, and drivers all consume resources.&lt;/p&gt;
&lt;p&gt;So 12GB of VRAM does not mean you can load a 12GB model file directly. It is better to leave room for the system and context.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-1-qwen3-8b&#34;&gt;Recommendation 1: Qwen3 8B
&lt;/h2&gt;&lt;p&gt;If you mainly use Chinese, &lt;code&gt;Qwen3 8B&lt;/code&gt; is one of the first models worth trying on an RTX 3060.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chinese Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Summarization and rewriting.&lt;/li&gt;
&lt;li&gt;Everyday knowledge assistant work.&lt;/li&gt;
&lt;li&gt;Simple code explanation.&lt;/li&gt;
&lt;li&gt;Local RAG.&lt;/li&gt;
&lt;li&gt;Lightweight Agent flows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Qwen3 8B GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M: first choice
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q5_K_M: better quality, more VRAM pressure
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Qwen models are friendly to Chinese usage. For daily writing, information organization, and Chinese instruction following, Qwen3 8B is a good first model.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-2-llama-31-8b-instruct&#34;&gt;Recommendation 2: Llama 3.1 8B Instruct
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Llama 3.1 8B Instruct&lt;/code&gt; is a stable general-purpose model with mature English capability and ecosystem support.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;English Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Lightweight coding help.&lt;/li&gt;
&lt;li&gt;General chat.&lt;/li&gt;
&lt;li&gt;Document summarization.&lt;/li&gt;
&lt;li&gt;Prompt testing.&lt;/li&gt;
&lt;li&gt;Comparing different inference tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Llama 3.1 8B Instruct GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M: better speed and VRAM stability
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q5_K_M: better answer quality
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you mainly process English materials, or want a model with many tutorials and broad compatibility, Llama 3.1 8B is still a good baseline.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-3-gemma-3-12b&#34;&gt;Recommendation 3: Gemma 3 12B
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Gemma 3 12B&lt;/code&gt; is closer to the upper practical limit for an RTX 3060 12GB.&lt;/p&gt;
&lt;p&gt;It uses more VRAM than 8B models, but Q4 quantization can still make it usable on a 12GB card. It is a good option if you want to try a slightly larger model on one GPU.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Higher-quality general Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;English content processing.&lt;/li&gt;
&lt;li&gt;More complex summarization and analysis.&lt;/li&gt;
&lt;li&gt;Trying an upgrade over 8B models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Gemma 3 12B GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M or official QAT Q4
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Keep context modest
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you run out of VRAM, reduce context length first, or return to an 8B model. For an RTX 3060, 12B is &amp;ldquo;worth trying,&amp;rdquo; not a no-brainer default.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-4-deepseek-r1-distill-qwen-8b&#34;&gt;Recommendation 4: DeepSeek R1 Distill Qwen 8B
&lt;/h2&gt;&lt;p&gt;If you want to experience reasoning-style local models, try models like &lt;code&gt;DeepSeek R1 Distill Qwen 8B&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple reasoning tasks.&lt;/li&gt;
&lt;li&gt;Step-by-step analysis.&lt;/li&gt;
&lt;li&gt;Learning reasoning-model output style.&lt;/li&gt;
&lt;li&gt;Low-cost local experiments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;DeepSeek R1 Distill Qwen 8B GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;These models may produce longer reasoning-style outputs, so speed and context usage can be heavier than ordinary instruction models. They are not always more comfortable for daily chat, but they are useful for reasoning experiments.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-5-phi--minicpm--smaller-models&#34;&gt;Recommendation 5: Phi / MiniCPM / Smaller Models
&lt;/h2&gt;&lt;p&gt;If your RTX 3060 is an 8GB variant, or your system RAM is limited, consider 3B and 4B models first.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fast Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Simple summaries.&lt;/li&gt;
&lt;li&gt;Embedding into local tools.&lt;/li&gt;
&lt;li&gt;Low-latency chat.&lt;/li&gt;
&lt;li&gt;Testing on older machines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These models may not match 8B or 12B quality, but they are light, fast, and easy to deploy.&lt;/p&gt;
&lt;h2 id=&#34;which-quantization-to-use&#34;&gt;Which Quantization to Use
&lt;/h2&gt;&lt;p&gt;Local models commonly use &lt;code&gt;GGUF&lt;/code&gt;, with quantization types such as Q4, Q5, Q6, and Q8.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th&gt;Traits&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;Small, fast, good enough&lt;/td&gt;
          &lt;td&gt;RTX 3060 first choice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q5_K_M&lt;/td&gt;
          &lt;td&gt;Better quality, higher usage&lt;/td&gt;
          &lt;td&gt;Try with 8B models&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q6 / Q8&lt;/td&gt;
          &lt;td&gt;Closer to original quality, larger&lt;/td&gt;
          &lt;td&gt;Small models or more VRAM&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q2 / Q3&lt;/td&gt;
          &lt;td&gt;Saves VRAM but quality drops&lt;/td&gt;
          &lt;td&gt;Large-model tinkering&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For RTX 3060 12GB, the practical choices are:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B models: Q4_K_M or Q5_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;12B models: Q4_K_M first
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Larger models: not recommended as daily drivers
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;which-tool-to-use&#34;&gt;Which Tool to Use
&lt;/h2&gt;&lt;p&gt;Beginners can start with &lt;code&gt;Ollama&lt;/code&gt;, because installation and running models are simple.&lt;/p&gt;
&lt;p&gt;Common commands:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run qwen3:8b
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run llama3.1:8b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want finer control over GGUF files, GPU layers, and context length, use &lt;code&gt;llama.cpp&lt;/code&gt; or GUI tools based on it.&lt;/p&gt;
&lt;p&gt;Common choices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Ollama&lt;/code&gt;: easiest, best for beginners.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LM Studio&lt;/code&gt;: friendly GUI, good for downloading and switching models.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama.cpp&lt;/code&gt;: most control, best for performance tuning.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;text-generation-webui&lt;/code&gt;: many features, good for backend testing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For local chat and simple Q&amp;amp;A, Ollama or LM Studio is enough.&lt;/p&gt;
&lt;h2 id=&#34;do-not-set-context-too-high&#34;&gt;Do Not Set Context Too High
&lt;/h2&gt;&lt;p&gt;Many models advertise long-context support, but do not blindly set context to the maximum on an RTX 3060.&lt;/p&gt;
&lt;p&gt;Longer context uses more KV cache and increases VRAM pressure. Even if the model loads, long context can slow generation down.&lt;/p&gt;
&lt;p&gt;Suggested settings:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Normal chat: 4K to 8K
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Document summaries: 8K to 16K
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Long-document RAG: chunk first; do not paste everything at once
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;An RTX 3060 is better suited to &amp;ldquo;moderate context + good model + good retrieval&amp;rdquo; than forcing hundreds of thousands of tokens into one prompt.&lt;/p&gt;
&lt;h2 id=&#34;choose-by-use-case&#34;&gt;Choose by Use Case
&lt;/h2&gt;&lt;p&gt;If you mainly write Chinese:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;First choice: Qwen3 8B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Alternative: DeepSeek R1 Distill Qwen 8B
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you mainly write English:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;First choice: Llama 3.1 8B Instruct Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Alternative: Gemma 3 12B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want speed:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;3B / 4B models
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Keep context at 4K to 8K
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want better quality:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B Q5_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;12B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Accept slower speed
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want coding help:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B coding models can help with explanations and small edits
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;For complex engineering tasks, use stronger cloud models
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Local RTX 3060 models are good for code explanation, function completion, small scripts, and offline assistance. For large refactors, difficult bugs, and cross-file Agent work, do not expect Claude Sonnet or GPT-5-level performance.&lt;/p&gt;
&lt;h2 id=&#34;reasonable-expectations&#34;&gt;Reasonable Expectations
&lt;/h2&gt;&lt;p&gt;The RTX 3060 12GB is good enough to turn local LLMs from toys into daily tools, but it will not recreate top cloud models at home.&lt;/p&gt;
&lt;p&gt;Its strengths:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Low cost.&lt;/li&gt;
&lt;li&gt;More VRAM than 8GB cards.&lt;/li&gt;
&lt;li&gt;Good 8B model experience.&lt;/li&gt;
&lt;li&gt;Offline use.&lt;/li&gt;
&lt;li&gt;Local processing for privacy-sensitive materials.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Its limits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Large models are hard to run smoothly.&lt;/li&gt;
&lt;li&gt;Long context consumes VRAM.&lt;/li&gt;
&lt;li&gt;Slower than high-end GPUs.&lt;/li&gt;
&lt;li&gt;Small local models have limited complex reasoning.&lt;/li&gt;
&lt;li&gt;Multimodal and Agent workflows need more resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The stable route is: use 8B models as everyday local assistants, try 12B models for quality, and leave complex tasks to cloud models.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Recommended local LLM choices for RTX 3060 12GB:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chinese general use: &lt;code&gt;Qwen3 8B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;English general use: &lt;code&gt;Llama 3.1 8B Instruct Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Higher-quality experiment: &lt;code&gt;Gemma 3 12B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Reasoning experiment: &lt;code&gt;DeepSeek R1 Distill Qwen 8B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Low-VRAM fast use: 3B / 4B small models&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Choose &lt;code&gt;Q4_K_M&lt;/code&gt; first. Try &lt;code&gt;Q5_K_M&lt;/code&gt; for 8B models if you want better quality. Start with Ollama or LM Studio.&lt;/p&gt;
&lt;p&gt;Do not treat the RTX 3060 as a large-model server. Treat it as a local knowledge assistant, privacy document processor, lightweight coding helper, and model experiment card, and it will fit its real capabilities much better.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Qwen3 8B GGUF: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3-8B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/Qwen/Qwen3-8B-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Llama 3.1 8B GGUF: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/macandchiz/Llama-3.1-8B-Instruct-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/macandchiz/Llama-3.1-8B-Instruct-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Gemma 3 12B GGUF: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-3-12b-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/unsloth/gemma-3-12b-it-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;llama.cpp: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama: &lt;a class=&#34;link&#34; href=&#34;https://ollama.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://ollama.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models</title>
        <link>https://knightli.com/en/2026/05/01/qwen3-6-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 12:02:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/qwen3-6-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;The Qwen3.6 open-weight models that are most relevant for local deployment are mainly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt;: a 27B dense model.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt;: a 35B total / 3B active MoE model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are also online product or API model names such as &lt;code&gt;Qwen3.6-Plus&lt;/code&gt; and &lt;code&gt;Qwen3.6-Max&lt;/code&gt;.
If a model does not have public full weights and stable quantized files, it is not suitable for a local VRAM table.
This article only covers versions that can be deployed around Hugging Face weights and GGUF quantized files.&lt;/p&gt;
&lt;p&gt;As with the Gemma 4 table in &lt;code&gt;/05/10&lt;/code&gt;, two concepts need to be separated first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GGUF file size&lt;/strong&gt;: how large the model weight file is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actual VRAM usage&lt;/strong&gt;: affected by weights, KV cache, context length, runtime backend, multimodal modules, and batch size.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Qwen3.6 has a very long default context. The official model card states native support for &lt;code&gt;262,144&lt;/code&gt; tokens and extension to &lt;code&gt;1,010,000&lt;/code&gt; tokens.
So the “minimum VRAM” column below only applies to short or medium context.
If you really want 128K, 256K, or longer context, reserve much more room for KV cache.&lt;/p&gt;
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM&lt;/th&gt;
          &lt;th&gt;Good Fit&lt;/th&gt;
          &lt;th&gt;Avoid&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;8GB&lt;/td&gt;
          &lt;td&gt;Extreme 2-bit tests for 27B / 35B-A3B, with clear quality risk&lt;/td&gt;
          &lt;td&gt;Q4 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;12GB&lt;/td&gt;
          &lt;td&gt;27B Q2/Q3, 35B-A3B Q2/Q3 with short context&lt;/td&gt;
          &lt;td&gt;27B Q4 with long context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16GB&lt;/td&gt;
          &lt;td&gt;27B Q3/Q4, 35B-A3B Q3/IQ4_XS&lt;/td&gt;
          &lt;td&gt;35B-A3B Q4 with long context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;27B Q4/Q5/Q6, 35B-A3B Q4&lt;/td&gt;
          &lt;td&gt;35B-A3B Q8, BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32GB&lt;/td&gt;
          &lt;td&gt;27B Q8, 35B-A3B Q5/Q6&lt;/td&gt;
          &lt;td&gt;BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;35B-A3B Q8, 27B with longer context more comfortably&lt;/td&gt;
          &lt;td&gt;35B-A3B BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB+&lt;/td&gt;
          &lt;td&gt;27B / 35B-A3B BF16&lt;/td&gt;
          &lt;td&gt;No need to chase BF16 for ordinary local chat&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you have a 24GB GPU, focus on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-27B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-27B Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-35B-A3B UD-Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only have 16GB VRAM, start with low-bit variants and do not enable very long context right away.&lt;/p&gt;
&lt;h2 id=&#34;official-weight-sizes&#34;&gt;Official Weight Sizes
&lt;/h2&gt;&lt;p&gt;The following BF16 weight sizes come from &lt;code&gt;model.safetensors.index.json&lt;/code&gt; in the official Hugging Face repositories.
They are useful as a reference for the original model scale.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Architecture&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Official BF16 Weight Size&lt;/th&gt;
          &lt;th&gt;Official Context&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;27B dense&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;55.56GB&lt;/td&gt;
          &lt;td&gt;Native 262K, extendable to 1,010K&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;35B total / 3B active MoE&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;71.90GB&lt;/td&gt;
          &lt;td&gt;Native 262K, extendable to 1,010K&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Although &lt;code&gt;35B-A3B&lt;/code&gt; activates about 3B parameters per step, it still needs to load the full MoE weights.
So it should not be estimated like a 3B small model.&lt;/p&gt;
&lt;h2 id=&#34;qwen36-27b-vram-table&#34;&gt;Qwen3.6-27B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt; is a dense model. Its advantage is stable behavior, while its inference cost is closer to a traditional 27B model.
For local deployment, it is more compute-heavy than 35B-A3B, but its VRAM requirements are easier to estimate.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.39GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.85GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.85GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;Low-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ3_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.99GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;VRAM-saving 3-bit&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.36GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;3-bit entry point&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.59GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;Common 3-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;15.44GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Near-Q4, more VRAM efficient&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_NL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.07GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.82GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Recommended 27B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;19.51GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.52GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28.60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td&gt;Research, evaluation, precision comparison&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For ordinary local coding and chat, &lt;code&gt;Q4_K_M&lt;/code&gt; is the easiest starting point to recommend.
A 24GB GPU can run &lt;code&gt;Q4_K_M&lt;/code&gt; fairly comfortably, but for long context, reduce quantization size or context length.&lt;/p&gt;
&lt;h2 id=&#34;qwen36-35b-a3b-vram-table&#34;&gt;Qwen3.6-35B-A3B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt; is an MoE model with 35B total parameters and about 3B active parameters per step.
Its advantage is a strong balance between speed and capability, especially for local agents, tool use, and coding workflows.&lt;/p&gt;
&lt;p&gt;But note that MoE &lt;code&gt;3B active&lt;/code&gt; mainly affects compute. It does not mean VRAM usage is comparable to a 3B model.
Full operation still needs the expert weights.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.76GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.52GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.29GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;Low-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ3_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.21GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;VRAM-saving 3-bit&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q3_K_S&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;15.36GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;3-bit entry point&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Common 3-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;17.73GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ4_NL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18.04GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Near-Q4 recommended option&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.13GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Recommended 35B-A3B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;26.46GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;29.31GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;36.90GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;69.37GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Research, evaluation, precision comparison&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;With 24GB VRAM, &lt;code&gt;UD-Q4_K_M&lt;/code&gt; is a key option, but do not set the context too high.
If you want room for 128K+ context, &lt;code&gt;UD-IQ4_XS&lt;/code&gt;, &lt;code&gt;UD-IQ4_NL&lt;/code&gt;, or 3-bit versions are more realistic.&lt;/p&gt;
&lt;h2 id=&#34;27b-vs-35b-a3b&#34;&gt;27B vs 35B-A3B
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Need&lt;/th&gt;
          &lt;th&gt;Better Choice&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Stable dense-model behavior&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Faster response, agents, and tool use&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Daily local use on 24GB VRAM&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;35B-A3B UD-Q4_K_M&lt;/code&gt; or &lt;code&gt;27B Q4_K_M&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Testing on 16GB VRAM&lt;/td&gt;
          &lt;td&gt;Use 2-bit/3-bit for both; avoid long context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Long context first&lt;/td&gt;
          &lt;td&gt;Use lower-bit quantization and leave more KV cache room&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quality first with 32GB+ VRAM&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;27B Q5/Q6&lt;/code&gt; or &lt;code&gt;35B-A3B Q5/Q6&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you mainly write code, run agents, or use tools, &lt;code&gt;35B-A3B&lt;/code&gt; is worth trying first.
If you care more about dense-model stability and consistency, &lt;code&gt;27B&lt;/code&gt; is more straightforward.&lt;/p&gt;
&lt;h2 id=&#34;why-long-context-uses-so-much-vram&#34;&gt;Why Long Context Uses So Much VRAM
&lt;/h2&gt;&lt;p&gt;The Qwen3.6 model card recommends keeping longer context for complex tasks and even notes that 128K+ context can help reasoning.
But for local deployment, long context means a much larger &lt;code&gt;KV cache&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Actual VRAM usage is affected by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;KV cache&lt;/code&gt;: longer context means higher usage.&lt;/li&gt;
&lt;li&gt;Whether vision input is enabled: Qwen3.6 includes a vision encoder, and multimodal use adds overhead.&lt;/li&gt;
&lt;li&gt;Whether &lt;code&gt;--language-model-only&lt;/code&gt; is used: in runtimes such as vLLM, skipping vision can free memory for KV cache.&lt;/li&gt;
&lt;li&gt;Batch size and concurrency: more concurrency requires more VRAM.&lt;/li&gt;
&lt;li&gt;KV cache quantization: &lt;code&gt;q8_0&lt;/code&gt;, &lt;code&gt;q4_0&lt;/code&gt;, and similar settings can save VRAM, but may affect details.&lt;/li&gt;
&lt;li&gt;Runtime differences: llama.cpp, vLLM, SGLang, KTransformers, and LM Studio do not use exactly the same amount of memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So do not look only at GGUF file size.
If the file is already close to the VRAM limit, the model may load but still OOM when generating long outputs or using long context.&lt;/p&gt;
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you just want to try Qwen3.6 locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;12GB VRAM: try &lt;code&gt;27B UD-IQ2_M&lt;/code&gt; or &lt;code&gt;35B-A3B UD-IQ2_M&lt;/code&gt;, with short context.&lt;/li&gt;
&lt;li&gt;16GB VRAM: try &lt;code&gt;27B Q3_K_M&lt;/code&gt; or &lt;code&gt;35B-A3B UD-IQ3_XXS&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;24GB VRAM: prefer &lt;code&gt;27B Q4_K_M&lt;/code&gt;, &lt;code&gt;35B-A3B UD-IQ4_NL&lt;/code&gt;, or &lt;code&gt;35B-A3B UD-Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;32GB VRAM: consider &lt;code&gt;27B Q5/Q6&lt;/code&gt; or &lt;code&gt;35B-A3B Q5/Q6&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;48GB and above: try &lt;code&gt;Q8_0&lt;/code&gt;, or reserve more room for long context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most users do not need BF16.
The point of local Qwen3.6 deployment is not to choose the largest file, but to balance VRAM, context length, speed, and output quality.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-27B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-27B - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-35B-A3B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-35B-A3B - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-27B-FP8&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-27B-FP8 - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-35B-A3B-FP8 - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/Qwen3.6-27B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/Qwen3.6-27B-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/Qwen3.6-35B-A3B-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models</title>
        <link>https://knightli.com/en/2026/05/01/gemma-4-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 11:42:34 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/gemma-4-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;Gemma 4 currently has four main sizes for local deployment: &lt;code&gt;E2B&lt;/code&gt;, &lt;code&gt;E4B&lt;/code&gt;, &lt;code&gt;26B A4B&lt;/code&gt;, and &lt;code&gt;31B&lt;/code&gt;.
&lt;code&gt;E2B&lt;/code&gt; and &lt;code&gt;E4B&lt;/code&gt; target lightweight and edge devices, &lt;code&gt;26B A4B&lt;/code&gt; uses an MoE architecture, and &lt;code&gt;31B&lt;/code&gt; is the larger dense model.&lt;/p&gt;
&lt;p&gt;The easiest mistake in local inference is mixing up two numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GGUF file size&lt;/strong&gt;: how large the model weight file is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actual VRAM usage&lt;/strong&gt;: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tables below estimate VRAM requirements based on GGUF file size.
The default assumption is local text inference with &lt;code&gt;llama.cpp&lt;/code&gt;, LM Studio, Ollama, or similar runtimes, using short to medium context.
If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.&lt;/p&gt;
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM&lt;/th&gt;
          &lt;th&gt;Good Fit&lt;/th&gt;
          &lt;th&gt;Avoid&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;4GB&lt;/td&gt;
          &lt;td&gt;Low-bit E2B quantizations&lt;/td&gt;
          &lt;td&gt;E4B and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;6GB&lt;/td&gt;
          &lt;td&gt;E2B Q4/Q5, low-bit E4B&lt;/td&gt;
          &lt;td&gt;26B, 31B&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;8GB&lt;/td&gt;
          &lt;td&gt;E2B Q8, E4B Q4/Q5&lt;/td&gt;
          &lt;td&gt;26B Q4, 31B Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;12GB&lt;/td&gt;
          &lt;td&gt;E4B Q8, low-quality 2-bit/3-bit 26B or 31B tests&lt;/td&gt;
          &lt;td&gt;26B Q4 with long context, 31B Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16GB&lt;/td&gt;
          &lt;td&gt;Low-bit 26B, low-bit 31B&lt;/td&gt;
          &lt;td&gt;31B Q4 with long context, 26B Q5 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;26B Q4/Q5, 31B Q4&lt;/td&gt;
          &lt;td&gt;31B Q8, BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32GB&lt;/td&gt;
          &lt;td&gt;26B Q6/Q8, 31B Q5/Q6&lt;/td&gt;
          &lt;td&gt;BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;31B Q8 more comfortably, 26B Q8 with longer context&lt;/td&gt;
          &lt;td&gt;31B BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB+&lt;/td&gt;
          &lt;td&gt;26B/31B BF16&lt;/td&gt;
          &lt;td&gt;Single consumer GPU deployment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you just want something usable locally, start with &lt;code&gt;E4B Q4_K_M&lt;/code&gt; or &lt;code&gt;E2B Q4_K_M&lt;/code&gt;.
With 24GB VRAM, &lt;code&gt;26B A4B Q4_K_M&lt;/code&gt; and &lt;code&gt;31B Q4_K_M&lt;/code&gt; start to become realistic choices.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-e2b-vram-table&#34;&gt;Gemma 4 E2B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;E2B&lt;/code&gt; is the lightest version, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing.
It is easy to run, but complex reasoning, coding, and long tasks are limited.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.29GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.54GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td&gt;Lightweight chat and summaries&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.98GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Balance of quality and size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.11GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Recommended E2B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.36GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Slightly steadier than Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.50GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td&gt;Higher-quality small model&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5.05GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td&gt;Near-original precision for lightweight deployment&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.31GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Debugging, comparison, research&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For daily use, &lt;code&gt;E2B Q4_K_M&lt;/code&gt; is already enough.
With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-e4b-vram-table&#34;&gt;Gemma 4 E4B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;E4B&lt;/code&gt; is the more practical lightweight model.
Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.53GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.06GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td&gt;Lightweight local assistant&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.72GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td&gt;Balance of quality and speed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.98GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td&gt;Recommended E4B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5.48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td&gt;Steadier everyday use&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;7.07GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.19GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;15.05GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Research, evaluation, precision comparison&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your GPU has 8GB VRAM, &lt;code&gt;E4B Q4_K_M&lt;/code&gt; is a realistic starting point.
With 12GB or 16GB VRAM, &lt;code&gt;E4B Q8_0&lt;/code&gt; is also worth considering.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-26b-a4b-vram-table&#34;&gt;Gemma 4 26B A4B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;26B A4B&lt;/code&gt; is the MoE version. It has a larger total parameter count, but activates only part of the experts during inference.
It is better suited to more complex Q&amp;amp;A, coding, tool use, and agent workflows.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.97GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme 16GB GPU tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.55GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Running 26B with low VRAM&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.53GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;Better quality while still VRAM-conscious&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.42GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Balance of quality and size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.87GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Recommended 26B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;21.15GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;23.17GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;26.86GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.51GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td&gt;Not realistic for most single consumer GPUs&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;24GB VRAM is the comfortable dividing line for 26B A4B.
A 16GB GPU can try low-bit versions, but context length, concurrency, and multimodal input should be kept modest.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-31b-vram-table&#34;&gt;Gemma 4 31B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;31B&lt;/code&gt; is the larger dense model.
Its strength is stronger overall capability, but its VRAM pressure is more direct than 26B A4B.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.53GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests with clear quality loss&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.75GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.77GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;16GB GPU experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.21GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;More VRAM-efficient 3-bit&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14.74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Common 3-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.37GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Near-Q4 compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18.32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Recommended 31B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;21.66GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;25.20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32.64GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;61.41GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Server or large-VRAM workstation&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point.
&lt;code&gt;Q4_K_M&lt;/code&gt; is the balanced choice, while &lt;code&gt;Q5_K_M&lt;/code&gt; and above make more sense with 32GB+ VRAM.&lt;/p&gt;
&lt;h2 id=&#34;why-actual-usage-is-higher-than-file-size&#34;&gt;Why Actual Usage Is Higher Than File Size
&lt;/h2&gt;&lt;p&gt;The GGUF file size is only the weight size.
Runtime usage also includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;KV cache&lt;/code&gt;: longer context means higher memory use.&lt;/li&gt;
&lt;li&gt;Batch size and concurrency: processing more tokens or more users increases VRAM.&lt;/li&gt;
&lt;li&gt;Multimodal components: image, audio, or video input often requires &lt;code&gt;mmproj&lt;/code&gt; or extra modules.&lt;/li&gt;
&lt;li&gt;Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.&lt;/li&gt;
&lt;li&gt;KV cache quantization: &lt;code&gt;q8_0&lt;/code&gt;, &lt;code&gt;q4_0&lt;/code&gt;, and similar modes can save VRAM, but may affect detail.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the “minimum VRAM” column should be read as the threshold for startup and short-context inference.
For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.&lt;/p&gt;
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you just want to try Gemma 4 locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4GB to 6GB VRAM: choose &lt;code&gt;E2B Q3_K_M&lt;/code&gt; or &lt;code&gt;E2B Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;8GB VRAM: prefer &lt;code&gt;E4B Q4_K_M&lt;/code&gt;; &lt;code&gt;E2B Q8_0&lt;/code&gt; is also fine.&lt;/li&gt;
&lt;li&gt;12GB VRAM: choose &lt;code&gt;E4B Q8_0&lt;/code&gt;, or try low-bit 26B/31B variants.&lt;/li&gt;
&lt;li&gt;16GB VRAM: try &lt;code&gt;26B A4B UD-Q3_K_M&lt;/code&gt; or &lt;code&gt;31B Q3_K_S&lt;/code&gt;, but do not expect long context to feel comfortable.&lt;/li&gt;
&lt;li&gt;24GB VRAM: focus on &lt;code&gt;26B A4B UD-Q4_K_M&lt;/code&gt; and &lt;code&gt;31B Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;32GB and above: consider &lt;code&gt;Q5_K_M&lt;/code&gt;, &lt;code&gt;Q6_K&lt;/code&gt;, or longer context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most users do not need BF16.
Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-4-E2B-it&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/gemma-4-E2B-it - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-4-E4B-it&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/gemma-4-E4B-it - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ggml-org/gemma-4-26B-A4B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-E2B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-E4B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-26B-A4B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-31B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-31B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Use llama-quantize for GGUF Models</title>
        <link>https://knightli.com/en/2026/04/12/llama-quantize-gguf-guide/</link>
        <pubDate>Sun, 12 Apr 2026 09:42:36 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/12/llama-quantize-gguf-guide/</guid>
        <description>&lt;p&gt;&lt;code&gt;llama-quantize&lt;/code&gt; is the quantization tool in &lt;code&gt;llama.cpp&lt;/code&gt;. It is used to convert high-precision &lt;code&gt;GGUF&lt;/code&gt; models into smaller quantized versions.&lt;/p&gt;
&lt;p&gt;Its most common use is turning formats such as &lt;code&gt;F32&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, or &lt;code&gt;FP16&lt;/code&gt; into versions like &lt;code&gt;Q4_K_M&lt;/code&gt;, &lt;code&gt;Q5_K_M&lt;/code&gt;, or &lt;code&gt;Q8_0&lt;/code&gt; that are easier to run locally. After quantization, models usually become much smaller and often faster at inference, but some quality loss is expected.&lt;/p&gt;
&lt;h2 id=&#34;basic-workflow&#34;&gt;Basic workflow
&lt;/h2&gt;&lt;p&gt;A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# install Python dependencies&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python3 -m pip install -r requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# convert the model to ggml FP16 format&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python3 convert_hf_to_gguf.py ./models/mymodel/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# quantize the model to 4-bits (using Q4_K_M method)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After that, you can run the quantized model with &lt;code&gt;llama-cli&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# start inference on a gguf model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p &lt;span class=&#34;s2&#34;&gt;&amp;#34;You are a helpful assistant&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;common-options&#34;&gt;Common options
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--allow-requantize&lt;/code&gt;: allows requantizing an already quantized model, usually not ideal for quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--leave-output-tensor&lt;/code&gt;: keeps the output layer unquantized, increasing size but sometimes helping quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--pure&lt;/code&gt;: disables mixed quantization and uses a more uniform quant type&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--imatrix&lt;/code&gt;: uses an importance matrix to improve quantization quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--keep-split&lt;/code&gt;: keeps the original shard layout instead of producing one merged file&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you just want a practical starting point, this is often enough:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;how-to-choose-a-quant&#34;&gt;How to choose a quant
&lt;/h2&gt;&lt;p&gt;You can think of quant levels as a tradeoff between size, speed, and quality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q8_0&lt;/code&gt;: larger, but usually safer for quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K&lt;/code&gt; / &lt;code&gt;Q5_K_M&lt;/code&gt;: common balanced choices&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;: a very common default with a good size-quality balance&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3&lt;/code&gt; / &lt;code&gt;Q2&lt;/code&gt;: useful when hardware is very limited, but quality loss is more visible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.&lt;/p&gt;
&lt;h2 id=&#34;practical-takeaway&#34;&gt;Practical takeaway
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;start with &lt;code&gt;Q4_K_M&lt;/code&gt; or &lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;move up to &lt;code&gt;Q6_K&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt; if quality matters more&lt;/li&gt;
&lt;li&gt;move down to &lt;code&gt;Q3&lt;/code&gt; or &lt;code&gt;Q2&lt;/code&gt; if memory is tight&lt;/li&gt;
&lt;li&gt;compare versions with the same prompt set&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, &lt;code&gt;llama-quantize&lt;/code&gt; is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Get GGUF Models from Hugging Face with llama.cpp</title>
        <link>https://knightli.com/en/2026/04/12/llama-cpp-hugging-face-gguf-models/</link>
        <pubDate>Sun, 12 Apr 2026 09:31:38 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/12/llama-cpp-hugging-face-gguf-models/</guid>
        <description>&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; can work directly with GGUF models hosted on Hugging Face, so you do not always need to download model files manually first.&lt;/p&gt;
&lt;p&gt;If a model repository already provides GGUF files, you can use the &lt;code&gt;-hf&lt;/code&gt; argument in the CLI, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;By default, this downloads from Hugging Face.&lt;br&gt;
If you use another service that exposes a Hugging Face compatible API, you can switch the download endpoint with the &lt;code&gt;MODEL_ENDPOINT&lt;/code&gt; environment variable.&lt;/p&gt;
&lt;p&gt;One important detail is that &lt;code&gt;llama.cpp&lt;/code&gt; only works directly with the &lt;code&gt;GGUF&lt;/code&gt; format.&lt;br&gt;
If your model is in another format, you need to convert it first with the &lt;code&gt;convert_*.py&lt;/code&gt; scripts provided in the repository.&lt;/p&gt;
&lt;p&gt;Hugging Face also offers several online tools related to &lt;code&gt;llama.cpp&lt;/code&gt;, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;converting models to &lt;code&gt;GGUF&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;quantizing weights to reduce size&lt;/li&gt;
&lt;li&gt;converting LoRA adapters&lt;/li&gt;
&lt;li&gt;editing GGUF metadata in the browser&lt;/li&gt;
&lt;li&gt;hosting &lt;code&gt;llama.cpp&lt;/code&gt; inference endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only want the practical takeaway, start with repositories that already provide &lt;code&gt;GGUF&lt;/code&gt;, then use &lt;code&gt;llama-cli -hf &amp;lt;user&amp;gt;/&amp;lt;model&amp;gt;&lt;/code&gt;. In most cases, that is the simplest path.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Choosing Llama GGUF Quantization on Hugging Face: Practical Advice from Q8 to Q2</title>
        <link>https://knightli.com/en/2026/04/11/llama-gguf-quantization-selection/</link>
        <pubDate>Sat, 11 Apr 2026 20:07:29 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/11/llama-gguf-quantization-selection/</guid>
        <description>&lt;p&gt;When selecting a Llama GGUF model on Hugging Face, you can think of quantization levels like resolution: lower levels need less VRAM/RAM, but quality drops gradually.&lt;/p&gt;
&lt;h2 id=&#34;understand-32-16-and-q-levels-first&#34;&gt;Understand 32, 16, and Q levels first
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32&lt;/code&gt;: closest to original/uncompressed quality, but hardware demand is extreme.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;16&lt;/code&gt;: still very close to original quality, around half the size of &lt;code&gt;32&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q8&lt;/code&gt;: common entry point for quantized models (&lt;code&gt;Q8_0&lt;/code&gt; or &lt;code&gt;Q8&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6&lt;/code&gt;, &lt;code&gt;Q5&lt;/code&gt;, &lt;code&gt;Q4&lt;/code&gt;, &lt;code&gt;Q3&lt;/code&gt;, &lt;code&gt;Q2&lt;/code&gt;: lower number means lower resource use and higher quality loss risk.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-k_m--k_s-means&#34;&gt;What &lt;code&gt;K_M&lt;/code&gt; / &lt;code&gt;K_S&lt;/code&gt; means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;K_M&lt;/code&gt; and &lt;code&gt;K_S&lt;/code&gt; are mixed quantization variants:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;most weights stay at the target quantization level&lt;/li&gt;
&lt;li&gt;important parts keep higher precision&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So at the same level, &lt;code&gt;Qx_K_M&lt;/code&gt; or &lt;code&gt;Qx_K_S&lt;/code&gt; is usually slightly better than plain &lt;code&gt;Qx&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;practical-picking-strategy&#34;&gt;Practical picking strategy
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;If hardware allows, start with &lt;code&gt;Q8&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If memory is tight, step down through &lt;code&gt;Q6&lt;/code&gt; / &lt;code&gt;Q5&lt;/code&gt; / &lt;code&gt;Q4&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Try not to go below &lt;code&gt;Q4&lt;/code&gt;; &lt;code&gt;Q4_K_M&lt;/code&gt; is a common lower bound.&lt;/li&gt;
&lt;li&gt;Below &lt;code&gt;Q4&lt;/code&gt;, quality degradation becomes increasingly visible.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;quality-order-best-to-worst&#34;&gt;Quality order (best to worst)
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;16&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; Above this point, quality is effectively the same, but hardware requirements are extreme &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q8&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; This is the typical sweet spot &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; Below this point, quality loss becomes visible &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want one short rule: start with &lt;code&gt;Q8&lt;/code&gt; or &lt;code&gt;Q6_K_M&lt;/code&gt;, then move down to &lt;code&gt;Q5&lt;/code&gt; or &lt;code&gt;Q4_K_M&lt;/code&gt; only when needed.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Download a GGUF Model from Hugging Face and Import It into Ollama</title>
        <link>https://knightli.com/en/2026/04/09/import-huggingface-gguf-into-ollama/</link>
        <pubDate>Thu, 09 Apr 2026 11:00:07 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/09/import-huggingface-gguf-into-ollama/</guid>
        <description>&lt;p&gt;If a model is not available in the official Ollama library, or if you want to use a specific &lt;code&gt;GGUF&lt;/code&gt; file from Hugging Face, you can download it manually and then import it into Ollama.&lt;/p&gt;
&lt;h2 id=&#34;step-1-download-the-gguf-file-from-hugging-face&#34;&gt;Step 1: Download the GGUF file from Hugging Face
&lt;/h2&gt;&lt;p&gt;First, find the target model&amp;rsquo;s &lt;code&gt;GGUF&lt;/code&gt; file on Hugging Face. You will usually see multiple quantized versions, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Which version you choose depends on your VRAM, RAM, and your tradeoff between speed and quality. After downloading, place the &lt;code&gt;.gguf&lt;/code&gt; file in a fixed directory so you can reference it from the &lt;code&gt;Modelfile&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;step-2-write-the-modelfile&#34;&gt;Step 2: Write the Modelfile
&lt;/h2&gt;&lt;p&gt;Create a &lt;code&gt;Modelfile&lt;/code&gt; in the same directory as the model file. The most basic version looks like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;FROM ./model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the filename is different, replace it with the actual filename, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;FROM ./gemma-3-12b-it-q4_k_m.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If your goal is just to get it running, this single &lt;code&gt;FROM&lt;/code&gt; line is usually enough.&lt;/p&gt;
&lt;h2 id=&#34;step-3-import-it-into-ollama&#34;&gt;Step 3: Import it into Ollama
&lt;/h2&gt;&lt;p&gt;Then run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama create myModelName -f Modelfile
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;myModelName&lt;/code&gt; is the local model name you want to use inside Ollama&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-f Modelfile&lt;/code&gt; tells Ollama to create the model from that file&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once the creation succeeds, the GGUF file becomes a local model that you can call directly.&lt;/p&gt;
&lt;h2 id=&#34;step-4-run-the-model&#34;&gt;Step 4: Run the model
&lt;/h2&gt;&lt;p&gt;After creation, run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run myModelName
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;From that point on, it works much like a model pulled with &lt;code&gt;ollama pull&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-to-inspect-an-existing-models-modelfile&#34;&gt;How to inspect an existing model&amp;rsquo;s Modelfile
&lt;/h2&gt;&lt;p&gt;If you are not sure how to write a &lt;code&gt;Modelfile&lt;/code&gt;, you can inspect the configuration of an existing model directly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama show --modelfile llama3.2
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This command prints the &lt;code&gt;Modelfile&lt;/code&gt; for &lt;code&gt;llama3.2&lt;/code&gt;, which is useful as a reference for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;FROM&lt;/code&gt; should be written&lt;/li&gt;
&lt;li&gt;How the template and system prompt are structured&lt;/li&gt;
&lt;li&gt;How parameters are declared&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;when-this-approach-makes-sense&#34;&gt;When this approach makes sense
&lt;/h2&gt;&lt;p&gt;This manual Hugging Face import flow is useful when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model you want is not available in Ollama&amp;rsquo;s official library&lt;/li&gt;
&lt;li&gt;You want a specific quantized variant&lt;/li&gt;
&lt;li&gt;You have already downloaded the &lt;code&gt;GGUF&lt;/code&gt; file manually&lt;/li&gt;
&lt;li&gt;You want finer control over how the model is packaged&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If Ollama already provides an official version, using &lt;code&gt;pull&lt;/code&gt; is usually simpler. But when you need a specific quantization or a custom wrapper, &lt;code&gt;GGUF + Modelfile&lt;/code&gt; gives you more flexibility.&lt;/p&gt;
&lt;h2 id=&#34;common-notes&#34;&gt;Common notes
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;The path after &lt;code&gt;FROM&lt;/code&gt; must match the actual location of the &lt;code&gt;.gguf&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;If the filename contains spaces or special characters, it is better to rename it first.&lt;/li&gt;
&lt;li&gt;Different &lt;code&gt;GGUF&lt;/code&gt; quantization levels can greatly affect memory use and speed, so successful import does not guarantee smooth runtime performance.&lt;/li&gt;
&lt;li&gt;If the model is a chat model, you may still need to adjust the prompt template later for better results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Downloading a &lt;code&gt;GGUF&lt;/code&gt; file from Hugging Face and importing it into Ollama is not complicated. Prepare the model file, write a minimal &lt;code&gt;Modelfile&lt;/code&gt;, then run &lt;code&gt;ollama create&lt;/code&gt;, and you can bring a third-party &lt;code&gt;GGUF&lt;/code&gt; model into your Ollama workflow.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
