<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Local LLM on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/local-llm/</link>
        <description>Recent content in Local LLM on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Mon, 18 May 2026 23:20:00 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/local-llm/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>llama.cpp b9196 Update: Windows Prebuilt Binaries Support CUDA 13.1, Vulkan, HIP, and SYCL</title>
        <link>https://knightli.com/en/2026/05/18/llama-cpp-windows-cuda-vulkan-gguf/</link>
        <pubDate>Mon, 18 May 2026 23:20:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/18/llama-cpp-windows-cuda-vulkan-gguf/</guid>
        <description>&lt;p&gt;The recent Windows release of &lt;code&gt;llama.cpp&lt;/code&gt; is much friendlier for local LLM users. In the past, running GGUF models on Windows often meant dealing with environment issues: CUDA version mismatches, missing DLLs, incompatible drivers, failed CMake builds, wrong environment variables, or complicated Vulkan / HIP / SYCL setup.&lt;/p&gt;
&lt;p&gt;Now the official Release page provides several Windows prebuilt packages. In many cases, users no longer need to compile from source. Download the right build, unzip it, place the model file, and you can start a local inference service directly.&lt;/p&gt;
&lt;h2 id=&#34;what-llamacpp-is-good-for&#34;&gt;What llama.cpp Is Good For
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; is one of the most commonly used local GGUF model inference frameworks. It is lightweight, cross-platform, can run on CPU or GPU, and has a large ecosystem of GGUF model resources.&lt;/p&gt;
&lt;p&gt;Common model families include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Qwen&lt;/li&gt;
&lt;li&gt;Llama&lt;/li&gt;
&lt;li&gt;DeepSeek&lt;/li&gt;
&lt;li&gt;Gemma&lt;/li&gt;
&lt;li&gt;Mistral&lt;/li&gt;
&lt;li&gt;Mixtral&lt;/li&gt;
&lt;li&gt;Hermes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As GGUF quantized models become more common, many open source models now provide GGUF versions suitable for local deployment. For regular users, the value of &lt;code&gt;llama.cpp&lt;/code&gt; is simple: you do not need a full complex inference stack to run a usable chat service on your own machine.&lt;/p&gt;
&lt;h2 id=&#34;how-to-choose-a-windows-prebuilt-build&#34;&gt;How to Choose a Windows Prebuilt Build
&lt;/h2&gt;&lt;p&gt;Windows users can choose different builds based on their hardware:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Windows x64 CPU&lt;/li&gt;
&lt;li&gt;Windows x64 CUDA 12.4&lt;/li&gt;
&lt;li&gt;Windows x64 CUDA 13.1&lt;/li&gt;
&lt;li&gt;Windows x64 Vulkan&lt;/li&gt;
&lt;li&gt;Windows x64 HIP Radeon&lt;/li&gt;
&lt;li&gt;Windows x64 SYCL&lt;/li&gt;
&lt;li&gt;Windows ARM64 CPU&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you use an NVIDIA GPU, the CUDA build is usually the first choice. Cards such as RTX 3060, 4060, 4070, 4080, and 4090 are better suited to the CUDA route.&lt;/p&gt;
&lt;p&gt;If you use an AMD GPU, try HIP or Vulkan. In practice, Vulkan can sometimes be easier than HIP, especially if you do not want to set up a full ROCm environment.&lt;/p&gt;
&lt;p&gt;If you use Intel integrated graphics or an Arc GPU, try SYCL or Vulkan. Performance is usually behind NVIDIA CUDA, but it is already enough to test many small and medium GGUF models.&lt;/p&gt;
&lt;p&gt;The CPU build is suitable for users without a discrete GPU, or for those who only want to verify a model or run small models. It will not be fast, but deployment is the simplest.&lt;/p&gt;
&lt;h2 id=&#34;start-a-regular-gguf-model&#34;&gt;Start a Regular GGUF Model
&lt;/h2&gt;&lt;p&gt;Assume you have downloaded the &lt;code&gt;llama.cpp&lt;/code&gt; Windows prebuilt package and placed your model in the &lt;code&gt;models&lt;/code&gt; directory. Enter the extracted &lt;code&gt;llama.cpp&lt;/code&gt; directory and run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;llama-server&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;exe&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-m&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;models&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;\&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;your-model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;gguf&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-ngl&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;999&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Here, &lt;code&gt;-m&lt;/code&gt; points to the GGUF model file, and &lt;code&gt;-ngl 999&lt;/code&gt; tells llama.cpp to load as many layers as possible onto the GPU. The actual number depends on VRAM size, model size, and quantization format.&lt;/p&gt;
&lt;p&gt;After startup succeeds, open this address in your browser:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://127.0.0.1:8080
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You will enter the local web chat interface.&lt;/p&gt;
&lt;p&gt;If VRAM is not enough, switch to a smaller model or a lower quantization version, such as Q4 or Q5 GGUF files. Do not only look at parameter count; also check quantization format and context length settings.&lt;/p&gt;
&lt;h2 id=&#34;start-a-multimodal-vision-model&#34;&gt;Start a Multimodal Vision Model
&lt;/h2&gt;&lt;p&gt;Multimodal vision models usually need more than the main model file. They also need an &lt;code&gt;mmproj&lt;/code&gt; vision projection file. Start them by specifying both:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;llama-server&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;py&#34;&gt;exe&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-m&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\main-model.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-mmproj&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\mmproj-model.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-ngl&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;999&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Common uses include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OCR recognition&lt;/li&gt;
&lt;li&gt;Screenshot understanding&lt;/li&gt;
&lt;li&gt;Webpage screenshot analysis&lt;/li&gt;
&lt;li&gt;Image Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;Simple visual content judgment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, Qwen2-VL / Qwen2.5-VL models are useful for Chinese screenshot understanding, OCR, and image-text Q&amp;amp;A. Make sure the main model and &lt;code&gt;mmproj&lt;/code&gt; file match; version mismatches can easily cause loading failures or abnormal output.&lt;/p&gt;
&lt;h2 id=&#34;use-a-bat-script-to-manage-multiple-models&#34;&gt;Use a bat Script to Manage Multiple Models
&lt;/h2&gt;&lt;p&gt;If you keep multiple models locally, you can write a simple &lt;code&gt;.bat&lt;/code&gt; script to switch between them. The following example needs your own path and model names:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bat&#34; data-lang=&#34;bat&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;@&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; off
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;chcp 65001 &lt;span class=&#34;p&#34;&gt;&amp;gt;&lt;/span&gt;nul
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;cd&lt;/span&gt; /d C:\path\to\llama-b9196-bin-win-cuda-13.1-x64
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; 请选择模型：
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; 1. Gemma
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; 2. Qwen VL 多模态
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; 3. DeepSeek
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;set&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;/p&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;choice&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;输入数字：
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;%choice%&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;1&amp;#34;&lt;/span&gt; llama-server.exe -m &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\gemma.gguf&amp;#34;&lt;/span&gt; -ngl 999
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;%choice%&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;2&amp;#34;&lt;/span&gt; llama-server.exe -m &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\qwen-vl.gguf&amp;#34;&lt;/span&gt; --mmproj &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\mmproj.gguf&amp;#34;&lt;/span&gt; -ngl 999
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;%choice%&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;==&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;3&amp;#34;&lt;/span&gt; llama-server.exe -m &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\deepseek.gguf&amp;#34;&lt;/span&gt; -ngl 999
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;pause&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Save it as UTF-8, then change the extension to &lt;code&gt;.bat&lt;/code&gt;. Double-clicking the script lets you choose different models by number.&lt;/p&gt;
&lt;h2 id=&#34;three-things-to-check-when-choosing-models&#34;&gt;Three Things to Check When Choosing Models
&lt;/h2&gt;&lt;p&gt;First, check hardware. More VRAM means you can run larger models. If VRAM is limited, do not force a large model; start with 7B, 8B, or a lower quantization version.&lt;/p&gt;
&lt;p&gt;Second, check the use case. For everyday Q&amp;amp;A, summarization, and rewriting, small models or medium quantization are often enough. For coding, long-document analysis, or multimodal understanding, you need stronger models and more VRAM.&lt;/p&gt;
&lt;p&gt;Third, check licenses and safety boundaries. Many community-modified models have different capabilities, restrictions, and licenses. Before downloading, confirm the source, license, intended use, and risks. Do not hand production work directly to models from unclear sources.&lt;/p&gt;
&lt;h2 id=&#34;common-issues&#34;&gt;Common Issues
&lt;/h2&gt;&lt;p&gt;If startup reports missing DLLs, first confirm that the downloaded package matches your GPU route. NVIDIA users should not download the HIP build by mistake, and AMD users should not download the CUDA build.&lt;/p&gt;
&lt;p&gt;If model loading is slow, the model may be too large, the disk may be slow, or part of the model may be falling back to CPU due to insufficient VRAM.&lt;/p&gt;
&lt;p&gt;If the web page does not open, check whether the command line service started successfully, then confirm the port is &lt;code&gt;8080&lt;/code&gt;. If the port is occupied, check &lt;code&gt;llama-server&lt;/code&gt; parameters and change the port.&lt;/p&gt;
&lt;p&gt;If a multimodal model behaves incorrectly, first check whether the &lt;code&gt;mmproj&lt;/code&gt; file matches the main model instead of only changing prompts.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The value of these Windows prebuilt packages is that they lower the entry barrier for local AI. Many users previously got stuck at compilation and dependency setup. Now they can move faster into downloading models, starting a service, and testing results.&lt;/p&gt;
&lt;p&gt;For Windows users, the route can be summarized simply:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NVIDIA: prefer CUDA.&lt;/li&gt;
&lt;li&gt;AMD: try Vulkan first, then HIP.&lt;/li&gt;
&lt;li&gt;Intel: try SYCL or Vulkan.&lt;/li&gt;
&lt;li&gt;No discrete GPU: use the CPU build for small models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Before real use, still confirm model source, license, VRAM needs, and actual results. Local AI gives you control, offline operation, and low latency, but it is not free of cost: model management, hardware resources, and output quality are still your responsibility.&lt;/p&gt;
&lt;p&gt;Source: &lt;a class=&#34;link&#34; href=&#34;https://www.freedidi.com/24211.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://www.freedidi.com/24211.html&lt;/a&gt;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Claude Code &#43; Ollama Local Deployment Guide: Build a Free AI Coding Assistant with CC Switch</title>
        <link>https://knightli.com/en/2026/05/15/claude-code-ollama-cc-switch-local-agent/</link>
        <pubDate>Fri, 15 May 2026 23:27:50 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/15/claude-code-ollama-cc-switch-local-agent/</guid>
        <description>&lt;p&gt;&lt;code&gt;Claude Code&lt;/code&gt; has become a popular AI coding assistant recently. Its appeal is not just that it can chat about code, but that it can read a project, modify files, run commands, install dependencies, and keep fixing errors in an agent-like workflow.&lt;/p&gt;
&lt;p&gt;The hard part is cost. Once a project grows, long context and repeated agent turns can burn through API quota quickly. If you just want to experiment, refactor small utilities, generate scripts, or work on a private local project, it is natural to ask: can Claude Code&amp;rsquo;s workflow be kept while the model runs locally?&lt;/p&gt;
&lt;p&gt;The key tool in this setup is &lt;code&gt;CC Switch&lt;/code&gt;. It lets Claude Code connect to the local &lt;code&gt;Ollama&lt;/code&gt; service through an OpenAI-compatible API endpoint, so requests can be forwarded to a local model instead of the official Claude API.&lt;/p&gt;
&lt;h2 id=&#34;what-this-setup-solves&#34;&gt;What This Setup Solves
&lt;/h2&gt;&lt;p&gt;You can think of the whole setup as:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Claude Code desktop
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;+ CC Switch API forwarding layer
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;+ Ollama local model
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Claude Code is still responsible for the coding workflow and project operations. CC Switch handles model provider configuration and API compatibility. Ollama runs the model locally.&lt;/p&gt;
&lt;p&gt;This does not make a local model suddenly become Claude. Its real value is that it makes Claude Code&amp;rsquo;s agent workflow usable in lower-cost, offline, and private local scenarios.&lt;/p&gt;
&lt;h2 id=&#34;basic-preparation&#34;&gt;Basic Preparation
&lt;/h2&gt;&lt;p&gt;Before you start, prepare these pieces:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install &lt;code&gt;Git&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;Ollama&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pull a local model suitable for coding.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;CC Switch&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Have Claude Code available on your machine.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For the model side, you can start with coding-oriented models, such as Qwen Coder, DeepSeek Coder, or other models with decent tool-calling and code generation behavior. The larger the model, the better the result may be, but memory and GPU pressure will also rise.&lt;/p&gt;
&lt;p&gt;If your machine only has limited memory, start with a smaller model first. Confirm that the workflow runs smoothly before trying a larger one.&lt;/p&gt;
&lt;h2 id=&#34;key-cc-switch-configuration&#34;&gt;Key CC Switch Configuration
&lt;/h2&gt;&lt;p&gt;After Ollama starts, its default local API address is usually:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://127.0.0.1:11434/v1
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;In CC Switch, choose an OpenAI-compatible provider type, commonly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;OpenAI Chat Completions
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then point the base URL to Ollama&amp;rsquo;s local address.&lt;/p&gt;
&lt;p&gt;For the API key field, local Ollama normally does not need a real key, but many tools still require an environment variable or placeholder. You can use:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ANTHROPIC_API_KEY
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;or another placeholder variable accepted by your local setup.&lt;/p&gt;
&lt;p&gt;One configuration item is worth special attention:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&amp;#34;inferenceModels&amp;#34;=&amp;#34;[\&amp;#34;haiku\&amp;#34;,\&amp;#34;sonnet\&amp;#34;,\&amp;#34;opus\&amp;#34;]&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This means mapping Claude Code&amp;rsquo;s expected model roles to the local provider. In practice, you need to bind &lt;code&gt;haiku&lt;/code&gt;, &lt;code&gt;sonnet&lt;/code&gt;, and &lt;code&gt;opus&lt;/code&gt; to the model names exposed by Ollama or CC Switch. If this mapping is wrong, Claude Code may fail to call the model or may keep falling back to an unexpected configuration.&lt;/p&gt;
&lt;h2 id=&#34;where-claude-code-is-strong&#34;&gt;Where Claude Code Is Strong
&lt;/h2&gt;&lt;p&gt;Claude Code&amp;rsquo;s biggest advantage is not raw completion. It is the full coding workflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reading and understanding project structure;&lt;/li&gt;
&lt;li&gt;locating related files based on a task;&lt;/li&gt;
&lt;li&gt;editing code directly;&lt;/li&gt;
&lt;li&gt;running commands and tests;&lt;/li&gt;
&lt;li&gt;observing errors and iterating;&lt;/li&gt;
&lt;li&gt;completing multi-step tasks in one session.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is why many people want to keep Claude Code even when switching to a local model. A normal chat UI can generate code snippets, but it does not naturally operate inside a repository. Claude Code is closer to an executable development assistant.&lt;/p&gt;
&lt;h2 id=&#34;what-role-ollama-plays-here&#34;&gt;What Role Ollama Plays Here
&lt;/h2&gt;&lt;p&gt;Ollama is responsible for local model runtime and management. It handles model downloading, loading, and local inference.&lt;/p&gt;
&lt;p&gt;The advantage is clear: requests stay on your machine, repeated use does not create API bills, and you can use it when the network is limited. For private code, this is also easier to accept than sending every context window to a cloud model.&lt;/p&gt;
&lt;p&gt;The trade-off is also clear. Local models depend heavily on your hardware and on model quality. A smaller model can handle simple edits, explanations, and script generation, but it may struggle with large cross-file refactors or subtle architectural decisions.&lt;/p&gt;
&lt;h2 id=&#34;where-the-experience-has-boundaries&#34;&gt;Where The Experience Has Boundaries
&lt;/h2&gt;&lt;p&gt;This setup should not be treated as a full replacement for Claude&amp;rsquo;s strongest cloud models.&lt;/p&gt;
&lt;p&gt;You may run into these issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;weaker long-context understanding;&lt;/li&gt;
&lt;li&gt;unstable tool-calling behavior in complex tasks;&lt;/li&gt;
&lt;li&gt;slower inference on CPU-only machines;&lt;/li&gt;
&lt;li&gt;more hallucinated file paths or APIs;&lt;/li&gt;
&lt;li&gt;less reliable multi-round planning;&lt;/li&gt;
&lt;li&gt;lower success rate on large repository refactors.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the better expectation is: use it as a free local development assistant, not as a perfect substitute for a top-tier cloud model.&lt;/p&gt;
&lt;h2 id=&#34;multimodal-compatibility-is-still-unstable&#34;&gt;Multimodal Compatibility Is Still Unstable
&lt;/h2&gt;&lt;p&gt;Some users want Claude Code to handle screenshots, UI images, diagrams, or other multimodal inputs. This part depends on the local model and the forwarding layer.&lt;/p&gt;
&lt;p&gt;If the selected Ollama model does not support vision, or CC Switch does not translate the request format correctly, multimodal features may fail. Even with a vision model, behavior may differ from Claude&amp;rsquo;s official API.&lt;/p&gt;
&lt;p&gt;For now, this setup is more suitable for text and code workflows. Treat multimodal support as experimental.&lt;/p&gt;
&lt;h2 id=&#34;who-should-try-it&#34;&gt;Who Should Try It
&lt;/h2&gt;&lt;p&gt;This setup is suitable for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;developers who want to try Claude Code&amp;rsquo;s workflow at low cost;&lt;/li&gt;
&lt;li&gt;users who frequently write scripts, small tools, and automation snippets;&lt;/li&gt;
&lt;li&gt;teams that want to keep code on local machines;&lt;/li&gt;
&lt;li&gt;learners who want an AI coding assistant without constant API spend;&lt;/li&gt;
&lt;li&gt;people testing different local coding models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is less suitable if you rely heavily on long context, large monorepos, strict code review quality, or complex full-project refactors.&lt;/p&gt;
&lt;h2 id=&#34;usage-advice&#34;&gt;Usage Advice
&lt;/h2&gt;&lt;p&gt;Start with small tasks.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;explain a single file;&lt;/li&gt;
&lt;li&gt;refactor a small function;&lt;/li&gt;
&lt;li&gt;generate a shell script;&lt;/li&gt;
&lt;li&gt;fix a simple error;&lt;/li&gt;
&lt;li&gt;add a small feature;&lt;/li&gt;
&lt;li&gt;write unit tests for a narrow module.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After each change, run tests or at least review the diff yourself. A local model can be useful, but you should not blindly accept every generated edit.&lt;/p&gt;
&lt;p&gt;If the model keeps losing context, reduce the task scope. Instead of asking it to &amp;ldquo;refactor the whole project&amp;rdquo;, ask it to &amp;ldquo;refactor this function&amp;rdquo; or &amp;ldquo;add validation in this file&amp;rdquo;.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Claude Code + CC Switch + Ollama&lt;/code&gt; is an interesting combination. It keeps Claude Code&amp;rsquo;s agent-style development workflow while moving inference to a local model.&lt;/p&gt;
&lt;p&gt;Its biggest strengths are lower cost, local privacy, and a smooth development workflow. Its limits are also obvious: model quality, hardware performance, long context, and tool-calling stability all affect the final experience.&lt;/p&gt;
&lt;p&gt;If you already use Ollama and want a more practical local AI coding workflow, this setup is worth trying. Just remember to start small, verify every change, and treat the local model as an assistant rather than an automatic engineer.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Running DeepSeek 4 Locally: Antirez&#39;s ds4 Experiment on Apple Silicon Mac</title>
        <link>https://knightli.com/en/2026/05/11/deepseek-v4-flash-ds4-metal/</link>
        <pubDate>Mon, 11 May 2026 08:51:37 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/11/deepseek-v4-flash-ds4-metal/</guid>
        <description>&lt;p&gt;Antirez has open sourced a new project: &lt;code&gt;ds4&lt;/code&gt;. It is not a general-purpose LLM framework, but a local inference engine for DeepSeek V4 Flash, with a focus on Apple Silicon and the Metal backend.&lt;/p&gt;
&lt;p&gt;Project URL: &lt;a class=&#34;link&#34; href=&#34;https://github.com/antirez/ds4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/antirez/ds4&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;what-is-ds4&#34;&gt;What is ds4?
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;ds4&lt;/code&gt; has a clear goal: running DeepSeek V4 Flash locally on a Mac.&lt;/p&gt;
&lt;p&gt;It currently provides three ways to use it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Interactive CLI.&lt;/li&gt;
&lt;li&gt;HTTP server.&lt;/li&gt;
&lt;li&gt;An experimental Agent mode.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Judging from its positioning, it is more like an inference project deeply optimized for one specific model than a replacement for general-purpose tools such as &lt;code&gt;llama.cpp&lt;/code&gt;, Ollama, or vLLM.&lt;/p&gt;
&lt;h2 id=&#34;why-it-is-worth-watching&#34;&gt;Why it is worth watching
&lt;/h2&gt;&lt;p&gt;There are three main reasons this kind of project is worth following.&lt;/p&gt;
&lt;p&gt;First, the author is Antirez, the creator of Redis. He has long focused on low-level systems, performance, and simple tools, and his projects are usually quite direct in style.&lt;/p&gt;
&lt;p&gt;Second, DeepSeek V4 Flash points toward efficient inference. If the local running experience is good enough, it could be very attractive for Mac users.&lt;/p&gt;
&lt;p&gt;Third, &lt;code&gt;ds4&lt;/code&gt; directly targets Apple Metal. Compared with the route of supporting every platform first and optimizing later, it feels more like a project trying to go deep on one well-defined scenario.&lt;/p&gt;
&lt;h2 id=&#34;who-should-try-it&#34;&gt;Who should try it
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;ds4&lt;/code&gt; is better suited for users who:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use an Apple Silicon Mac.&lt;/li&gt;
&lt;li&gt;Want to run DeepSeek V4 Flash locally.&lt;/li&gt;
&lt;li&gt;Care about Metal inference performance.&lt;/li&gt;
&lt;li&gt;Are willing to try an alpha-stage project.&lt;/li&gt;
&lt;li&gt;Want to study lightweight inference engines and model runtime details.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your goal is stable deployment, cross-platform operation, or OpenAI API-compatible infrastructure, it may not be the first choice at this stage. It is better treated as an experimental tool and a technical project to watch.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use-it&#34;&gt;How to use it
&lt;/h2&gt;&lt;p&gt;The basic workflow in the project README is to build it first, then run it.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/antirez/ds4.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ds4
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;make
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Run it interactively:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./ds4
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Start the HTTP server:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./ds4 --server
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Agent mode:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./ds4 --agent
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For exact parameters and model file preparation, follow the repository README, because the project is still changing quickly.&lt;/p&gt;
&lt;h2 id=&#34;current-risks&#34;&gt;Current risks
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;ds4&lt;/code&gt; is still at an early stage, so set expectations before using it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Features may be incomplete.&lt;/li&gt;
&lt;li&gt;Parameters, model formats, and command-line behavior may change.&lt;/li&gt;
&lt;li&gt;Compatibility mainly revolves around Apple Silicon and Metal.&lt;/li&gt;
&lt;li&gt;Agent mode is more experimental and is not suitable for direct production use.&lt;/li&gt;
&lt;li&gt;When something breaks, you may need to read the README, issues, or source code yourself.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, it is currently more of an open source experiment worth trying than a one-click tool for ordinary users.&lt;/p&gt;
&lt;h2 id=&#34;how-it-differs-from-general-inference-tools&#34;&gt;How it differs from general inference tools
&lt;/h2&gt;&lt;p&gt;General-purpose inference tools usually aim for broad compatibility across model formats, platforms, backends, and APIs. &lt;code&gt;ds4&lt;/code&gt; takes a narrower path: local DeepSeek V4 Flash inference on Metal.&lt;/p&gt;
&lt;p&gt;That choice has both benefits and trade-offs.&lt;/p&gt;
&lt;p&gt;The benefit is that the implementation can stay focused, making performance and user experience easier to optimize around a single target. The trade-off is a limited scope: it is not meant to run every possible model, nor to replace a complete deployment platform.&lt;/p&gt;
&lt;p&gt;If you already use &lt;code&gt;llama.cpp&lt;/code&gt; or Ollama, &lt;code&gt;ds4&lt;/code&gt; is better treated as a supplementary testing tool, not an immediate replacement for your existing workflow.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The interesting part of &lt;code&gt;ds4&lt;/code&gt; is not that it is yet another local LLM tool. It is that its scope is intentionally narrow: DeepSeek V4 Flash, Apple Silicon, Metal, and local inference.&lt;/p&gt;
&lt;p&gt;If you have a suitable Mac and are willing to tinker with an early-stage project, it is worth watching its performance, model support approach, and server/agent capabilities. For production environments, it is better to keep observing until the interfaces and usage patterns become more stable.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;GitHub project: &lt;a class=&#34;link&#34; href=&#34;https://github.com/antirez/ds4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/antirez/ds4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>A Practical llama.cpp Multi-GPU Benchmarking Approach: Is 2x V100 16GB Faster Than One 32GB Card?</title>
        <link>https://knightli.com/en/2026/05/09/llama-cpp-multi-gpu-offload-performance/</link>
        <pubDate>Sat, 09 May 2026 15:05:41 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/09/llama-cpp-multi-gpu-offload-performance/</guid>
        <description>&lt;p&gt;Short version: llama.cpp multi-GPU offload is not free performance just because you add a second card. If the model already fits fully on one 32GB GPU, 2x V100 16GB is often less convenient than a single 32GB card and may even be slower. If the model does not fit on one 16GB card, the main value of dual GPUs is that the model can stay on GPU, and the benefit can be obvious.&lt;/p&gt;
&lt;h2 id=&#34;first-understand-split-mode&#34;&gt;First, Understand split mode
&lt;/h2&gt;&lt;p&gt;llama.cpp multi-GPU usage mainly revolves around &lt;code&gt;--split-mode&lt;/code&gt; and &lt;code&gt;--tensor-split&lt;/code&gt;. When discussing performance, distinguish these modes first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;layer&lt;/code&gt;: splits layers across GPUs. It is usually the most compatible starting point.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tensor&lt;/code&gt;: splits tensor computation across multiple GPUs. It is closer to true parallel compute, but depends more heavily on inter-GPU bandwidth and backend support.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;row&lt;/code&gt;: an older row-splitting mode that still appears in some setups, but is usually not the first choice for new deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In simple terms, &lt;code&gt;layer&lt;/code&gt; is like putting different floors on different cards. During single-token generation, it may not keep both cards fully busy at the same time. &lt;code&gt;tensor&lt;/code&gt; is more like letting both cards work on the same layer together. It has more theoretical parallelism, but inter-GPU communication can become the bottleneck.&lt;/p&gt;
&lt;h2 id=&#34;if-one-32gb-card-can-fit-the-model-dual-16gb-is-not-always-faster&#34;&gt;If One 32GB Card Can Fit the Model, Dual 16GB Is Not Always Faster
&lt;/h2&gt;&lt;p&gt;If the model and KV cache fit fully on one 32GB GPU, a single card is usually steadier and often faster. For hardware in the same generation, such as 1x V100 32GB versus 2x V100 16GB, the dual-card setup does not necessarily win.&lt;/p&gt;
&lt;p&gt;A conservative expectation is that 2x V100 16GB may be 10% to 40% slower than one V100 32GB, especially for single-user chat, Continue Agent, and code Q&amp;amp;A workloads where one request is mainly generating one answer.&lt;/p&gt;
&lt;p&gt;The reason is straightforward: multi-GPU does not simply merge VRAM into one fast pool. With layer splitting, inference moves across GPUs and one card may wait for the other during token generation. With tensor splitting, both cards can compute together, but intermediate results need cross-GPU synchronization, so bandwidth and latency directly affect throughput.&lt;/p&gt;
&lt;p&gt;So if your choice is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1x V100 32GB&lt;/li&gt;
&lt;li&gt;2x V100 16GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and the target model already fits fully on one 32GB card, the single 32GB card is often the more comfortable option.&lt;/p&gt;
&lt;h2 id=&#34;if-one-16gb-card-cannot-fit-the-model-dual-cards-matter&#34;&gt;If One 16GB Card Cannot Fit the Model, Dual Cards Matter
&lt;/h2&gt;&lt;p&gt;The situation changes completely when the model does not fit on one 16GB card but can fit across two 16GB cards.&lt;/p&gt;
&lt;p&gt;In that case, the value of dual GPUs is very direct:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One 16GB card: may require heavy CPU offload, which can slow things down a lot.&lt;/li&gt;
&lt;li&gt;2x 16GB cards: weights can stay mostly on GPU, which may be much faster than mixed CPU/GPU execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this scenario, 2x V100 16GB is not guaranteed to beat one 32GB card, but it may be several times faster than a single 16GB card with heavy system-memory offload. In other words, the first value of dual cards is not acceleration. It is avoiding the need to push model weights into slower system RAM.&lt;/p&gt;
&lt;h2 id=&#34;v100-pcie-and-v100-sxm2-are-very-different&#34;&gt;V100 PCIe and V100 SXM2 Are Very Different
&lt;/h2&gt;&lt;p&gt;The easiest thing to overlook in multi-GPU inference is the interconnect.&lt;/p&gt;
&lt;p&gt;If you have V100 SXM2 with NVLink, cross-GPU communication bandwidth is much higher. NVIDIA&amp;rsquo;s V100 material lists NVLink interconnect bandwidth up to 300GB/s. In that environment, &lt;code&gt;tensor&lt;/code&gt; mode or higher-batch workloads have a better chance of approaching or exceeding single-card performance.&lt;/p&gt;
&lt;p&gt;If you have V100 PCIe, expectations should be more conservative. V100 PCIe mainly uses PCIe Gen3, and the listed interconnect bandwidth is 32GB/s. That is a very different class from NVLink, which is why dual PCIe cards often provide enough VRAM without doubling speed.&lt;/p&gt;
&lt;p&gt;So when judging whether 2x V100 16GB is worthwhile, do not only add the VRAM to 32GB. Also check whether the cards are PCIe or SXM2/NVLink.&lt;/p&gt;
&lt;h2 id=&#34;a-practical-buying-rule&#34;&gt;A Practical Buying Rule
&lt;/h2&gt;&lt;p&gt;If the model fits on one 32GB GPU, choose the single card first. Its latency, stability, and tuning cost are usually better.&lt;/p&gt;
&lt;p&gt;If the model does not fit on one 16GB GPU but can fit on two 16GB GPUs, dual cards are worth using. At that point, the goal is to keep weights on GPU as much as possible, not to expect linear performance scaling.&lt;/p&gt;
&lt;p&gt;If you have dual V100 PCIe cards, start with &lt;code&gt;--split-mode layer&lt;/code&gt; and aim for stable execution with less CPU fallback.&lt;/p&gt;
&lt;p&gt;If you have V100 SXM2/NVLink, it is more worth testing &lt;code&gt;tensor&lt;/code&gt;-related modes, especially for prefill, larger batches, or concurrent serving.&lt;/p&gt;
&lt;h2 id=&#34;when-to-buy-2x16gb-and-when-to-buy-1x32gb&#34;&gt;When to Buy 2x16GB and When to Buy 1x32GB
&lt;/h2&gt;&lt;p&gt;If you serve only one user and mainly do chat, code completion, Continue Agent, or long-context Q&amp;amp;A, and the target model fits within 32GB, 1x32GB is usually the better choice. It avoids cross-GPU scheduling, has steadier latency, and is easier to debug.&lt;/p&gt;
&lt;p&gt;If you already own one 16GB card and want a lower-cost path to run 30B, 32B, or higher-quantized models, 2x16GB makes sense. It may not double token/s, but it can keep weights on GPU that would otherwise require CPU offload.&lt;/p&gt;
&lt;p&gt;If you are buying from scratch, the priority can look like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single model, single user, latency-sensitive: prefer 1x32GB.&lt;/li&gt;
&lt;li&gt;Model does not fit on one card and budget is limited: consider 2x16GB.&lt;/li&gt;
&lt;li&gt;Machine has NVLink or SXM2: 2x16GB is much more interesting than ordinary PCIe dual cards.&lt;/li&gt;
&lt;li&gt;You want longer context later: do not only count model weights; reserve VRAM for KV cache too.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;practical-advice-for-layer-split-and-tensor-split&#34;&gt;Practical Advice for layer split and tensor split
&lt;/h2&gt;&lt;p&gt;The practical rule is: start with &lt;code&gt;layer&lt;/code&gt;, then benchmark &lt;code&gt;tensor&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;layer&lt;/code&gt; is the default starting point. It splits the model by layer, has better compatibility, and is friendlier to PCIe dual-card systems. The downside is that generation can behave more like a pipeline: at certain moments one card is busy while the other waits.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;tensor&lt;/code&gt; is better suited to machines with strong interconnects, such as V100 SXM2/NVLink. It splits part of the same layer&amp;rsquo;s computation across GPUs, so it has more parallelism in theory, but it also synchronizes across cards more often. On PCIe dual cards, communication overhead may eat the benefit.&lt;/p&gt;
&lt;p&gt;You can start with these tests:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode layer --tensor-split 1,1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode tensor --tensor-split 1,1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode layer --tensor-split 1,0
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The third command is not meant as the long-term configuration. It gives you a single-card reference, so you can see whether dual GPUs are actually faster or only distributing VRAM pressure.&lt;/p&gt;
&lt;h2 id=&#34;why-prefill-and-decode-behave-differently&#34;&gt;Why prefill and decode Behave Differently
&lt;/h2&gt;&lt;p&gt;Local LLM performance should usually be viewed in two stages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;prefill&lt;/code&gt;: processes the input prompt. A typical metric is prompt-processing throughput such as &lt;code&gt;pp512&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;decode&lt;/code&gt;: generates the response token by token. A typical metric is token-generation throughput such as &lt;code&gt;tg128&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;prefill&lt;/code&gt; is more like large-batch matrix computation. With larger batches, it is easier to keep GPUs busy and more likely to benefit from multi-GPU parallelism. &lt;code&gt;decode&lt;/code&gt; generates one token after another. The batch is smaller and synchronization is more frequent, so cross-card communication and scheduling latency are easier to notice.&lt;/p&gt;
&lt;p&gt;That is why you may see dual GPUs improve &lt;code&gt;pp512&lt;/code&gt; while &lt;code&gt;tg128&lt;/code&gt; barely improves or even gets worse. For chat and agent workflows, user experience is closer to &lt;code&gt;tg128&lt;/code&gt;. For long document ingestion, batch prefill, or concurrent serving, &lt;code&gt;pp512&lt;/code&gt; also matters.&lt;/p&gt;
&lt;h2 id=&#34;can-kv-cache-become-a-second-vram-bottleneck&#34;&gt;Can KV cache Become a Second VRAM Bottleneck?
&lt;/h2&gt;&lt;p&gt;Yes. Many people only count model weights and forget KV cache.&lt;/p&gt;
&lt;p&gt;Model weights decide whether the model can load. KV cache decides whether you can use the context length you want. The longer the context, the higher the concurrency, and the larger the batch, the more visible KV cache usage becomes. You may find that the model itself fits in 32GB, but 32K or 64K context pushes VRAM over the limit.&lt;/p&gt;
&lt;p&gt;At minimum, leave VRAM headroom for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;KV cache&lt;/li&gt;
&lt;li&gt;CUDA graph or backend runtime overhead&lt;/li&gt;
&lt;li&gt;prompt batch and ubatch&lt;/li&gt;
&lt;li&gt;desktop, driver, and other process usage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you use 2x16GB, VRAM is not a fully equivalent 32GB pool. Some buffers, KV cache, or intermediate tensors may still be limited by remaining memory on a single card. When testing long context, use the target &lt;code&gt;--ctx-size&lt;/code&gt; and target concurrency directly instead of only checking whether the model starts.&lt;/p&gt;
&lt;h2 id=&#34;how-to-benchmark-dual-cards-with-llama-bench&#34;&gt;How to Benchmark Dual Cards with llama-bench
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;llama-bench&lt;/code&gt; is better than direct chatting for hardware comparison because it separates prompt processing and token generation into comparable metrics. The default example in the official README is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For dual V100 cards, test at least these sets:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Single-card baseline&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Dual-card layer split&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode layer --tensor-split 1,1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Dual-card tensor split&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode tensor --tensor-split 1,1
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Focus on two columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pp512&lt;/code&gt;: prompt processing, more relevant to long inputs and batch prefill.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tg128&lt;/code&gt;: token generation, more relevant to single-user chat and agent responsiveness.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Keep the model, quantization, context length, batch settings, driver version, and llama.cpp version fixed. Run each group several times and compare medians rather than one-off results. Finally, test your real workflow too, such as Continue Agent, an OpenAI-compatible server, or your own RAG requests, because a good benchmark does not always mean better interactive experience.&lt;/p&gt;
&lt;h2 id=&#34;one-sentence-conclusion&#34;&gt;One-Sentence Conclusion
&lt;/h2&gt;&lt;p&gt;The main advantage of 2x V100 16GB is VRAM capacity, not guaranteed generation speed. If the model fits on one card, a single 32GB card is usually faster and steadier. If the model does not fit on one 16GB card, dual 16GB cards become valuable because they avoid heavy CPU offload. Whether they are faster depends on split mode, batch size, model size, and whether the two V100 cards are connected through PCIe or NVLink.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp server README&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.mintlify.com/ggml-org/llama.cpp/concepts/backends&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp Compute Backends&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-gb/data-center/tesla-v100/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA Tesla V100&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet.pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA V100 Datasheet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>RTX 5090 / 5080 AI Inference Benchmarks: Choosing for Local LLMs, 4K Video, and Real-Time 3D</title>
        <link>https://knightli.com/en/2026/05/08/rtx-5090-5080-ai-inference-benchmark/</link>
        <pubDate>Fri, 08 May 2026 10:07:19 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/08/rtx-5090-5080-ai-inference-benchmark/</guid>
        <description>&lt;p&gt;For local AI users, the RTX 50 series is exciting not only because of gaming performance, but because Blackwell, GDDR7 memory, and fifth-generation Tensor Cores change what a desktop AI workstation can do. If you run local LLMs, image generation, video enhancement, or real-time 3D workflows, the GPU is no longer just a rendering device.&lt;/p&gt;
&lt;p&gt;RTX 5090 and RTX 5080 should not be judged by model name alone. Both use Blackwell, support DLSS 4, fifth-generation Tensor Cores, and FP4, but local AI experience is usually decided by VRAM capacity, memory bandwidth, software support, and model compatibility.&lt;/p&gt;
&lt;p&gt;The short version: RTX 5090 is the better single-card flagship for local AI, large models, long context, image generation, and video AI. RTX 5080 is better for smaller models, tighter budgets, and workflows that fit inside 16GB of VRAM. Both improve on the previous generation, but not every AI app can immediately use all Blackwell features.&lt;/p&gt;
&lt;h2 id=&#34;start-with-the-hardware-gap&#34;&gt;Start With The Hardware Gap
&lt;/h2&gt;&lt;p&gt;RTX 5090 has 32GB GDDR7, a 512-bit memory bus, 21760 CUDA cores, and 3352 AI TOPS. Public testing from Puget Systems also highlights about 1.79TB/s of memory bandwidth, compared with RTX 4090&amp;rsquo;s 24GB and about 1.01TB/s. That matters for AI workloads.&lt;/p&gt;
&lt;p&gt;RTX 5080 is more restrained: 16GB GDDR7, a 256-bit memory bus, 10752 CUDA cores, and 1801 AI TOPS. Its bandwidth is about 960GB/s, a clear jump over RTX 4080-class cards, but VRAM stays at 16GB.&lt;/p&gt;
&lt;p&gt;That gives the two cards very different roles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RTX 5090 is stronger for larger models, longer context, and heavier multimodal workloads because of 32GB VRAM and high bandwidth.&lt;/li&gt;
&lt;li&gt;RTX 5080 is more cost- and power-conscious, and fits small to medium models, image generation, lighter video work, and development.&lt;/li&gt;
&lt;li&gt;If a workload is already VRAM-limited, RTX 5080 cannot solve that with compute alone.&lt;/li&gt;
&lt;li&gt;If a workload is software-limited, RTX 5090 may not always pull far ahead of RTX 4090 in proportion to its specs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Local AI inference often follows a simple rule: VRAM decides whether it runs, bandwidth decides how fast it feels. That is why RTX 5090 is more attractive for local LLM users.&lt;/p&gt;
&lt;h2 id=&#34;local-llms-32gb-matters-more&#34;&gt;Local LLMs: 32GB Matters More
&lt;/h2&gt;&lt;p&gt;When running LLMs, VRAM is mainly used by model weights, KV cache, and runtime overhead. Larger models, longer context, and higher concurrency all increase pressure.&lt;/p&gt;
&lt;p&gt;RTX 5080&amp;rsquo;s 16GB can cover many 7B, 8B, and 14B models, and can run some larger models with 4-bit quantization. But if you want 30B-class models, longer context, or WebUI, RAG, voice, and tool calls at the same time, 16GB becomes a limit quickly.&lt;/p&gt;
&lt;p&gt;RTX 5090&amp;rsquo;s 32GB gives local inference much more room. It is better for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Running quantized models around the 30B level.&lt;/li&gt;
&lt;li&gt;Keeping longer context on 7B and 14B models.&lt;/li&gt;
&lt;li&gt;Local coding assistants, knowledge-base Q&amp;amp;A, and Agent debugging.&lt;/li&gt;
&lt;li&gt;Loading embedding, reranker, or multimodal components alongside the main model.&lt;/li&gt;
&lt;li&gt;Reducing model switching and context compromises on a single machine.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Still, 32GB is not magic. Even 70B-class models with 4-bit quantization often need careful context, runtime settings, and memory management. For high-concurrency service, multi-GPU or server GPUs remain more suitable.&lt;/p&gt;
&lt;p&gt;For personal use, RTX 5090&amp;rsquo;s biggest benefit is less friction: more model choices, more comfortable context length, and enough room for GUI tools and companion components.&lt;/p&gt;
&lt;h2 id=&#34;fp4-is-potential-not-instant-acceleration-everywhere&#34;&gt;FP4 Is Potential, Not Instant Acceleration Everywhere
&lt;/h2&gt;&lt;p&gt;One major Blackwell change is FP4 support in fifth-generation Tensor Cores. NVIDIA&amp;rsquo;s TensorRT materials note that FP4 can reduce model memory use and data movement, and can help local inference for generative models such as FLUX.&lt;/p&gt;
&lt;p&gt;That is important for image generation and future LLM inference. Lower precision means less VRAM pressure and less bandwidth pressure. On a high-bandwidth GPU such as RTX 5090, FP4 can theoretically amplify the advantage if frameworks and models support it well.&lt;/p&gt;
&lt;p&gt;But FP4 gains depend on the software path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether the model has a suitable FP4 quantized version.&lt;/li&gt;
&lt;li&gt;Whether the inference framework supports the needed operators.&lt;/li&gt;
&lt;li&gt;Whether TensorRT, ComfyUI, PyTorch, ONNX, or plugins are adapted.&lt;/li&gt;
&lt;li&gt;Whether the task can accept the precision tradeoff.&lt;/li&gt;
&lt;li&gt;Whether the user is willing to adjust the workflow for speed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So RTX 50 AI performance should not be judged only by FP4 peak numbers. Blackwell provides the hardware base, but the real experience depends on app updates. Early adopters will see some benefits first; mainstream users may need to wait for the ecosystem.&lt;/p&gt;
&lt;h2 id=&#34;image-generation-and-4k-video-bandwidth-and-vram-work-together&#34;&gt;Image Generation And 4K Video: Bandwidth And VRAM Work Together
&lt;/h2&gt;&lt;p&gt;Stable Diffusion, FLUX, video super-resolution, frame interpolation, denoising, matting, and generative video all care about VRAM. Higher resolution costs more memory; more nodes add runtime overhead; ControlNet, LoRA, high-res fix, and batch generation increase pressure further.&lt;/p&gt;
&lt;p&gt;RTX 5080 can handle many image-generation jobs inside 16GB. For 1024px images, light LoRA use, and normal ComfyUI workflows, it is already fast enough. Problems appear with larger canvases, more complex node graphs, higher batch sizes, or long-sequence video generation.&lt;/p&gt;
&lt;p&gt;RTX 5090 has clearer advantages in 4K video workflows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;32GB VRAM is better for high-resolution frames, long sequences, and complex node graphs.&lt;/li&gt;
&lt;li&gt;Around 1.79TB/s bandwidth helps reduce data-movement bottlenecks.&lt;/li&gt;
&lt;li&gt;Three ninth-generation NVENC encoders are useful for export, transcoding, and creator workflows.&lt;/li&gt;
&lt;li&gt;Once FP4 and TensorRT support matures, image generation models may benefit more.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Public video AI benchmarks also show a caution: application optimization has not fully caught up. Puget Systems found that RTX 5090 does not always dramatically beat RTX 4090 in DaVinci Resolve AI and Topaz Video AI, and RTX 5080 does not always create a large gap over RTX 4080-class cards. Video AI is not just about specs; plugins, drivers, and model implementations matter.&lt;/p&gt;
&lt;p&gt;In other words, RTX 50 is more compelling if your workflow already supports Blackwell, TensorRT, or FP4. If you mostly rely on commercial software that has not been optimized yet, the upgrade value depends on the exact version.&lt;/p&gt;
&lt;h2 id=&#34;real-time-3d-and-ai-modeling-rtx-5090-fits-heavier-scenes&#34;&gt;Real-Time 3D And AI Modeling: RTX 5090 Fits Heavier Scenes
&lt;/h2&gt;&lt;p&gt;Real-time 3D modeling, neural rendering, 3D asset generation, and viewport AI acceleration use CUDA, RT Cores, Tensor Cores, and VRAM at the same time. Unlike pure LLM work, the goal is not only token speed. Scene complexity, materials, geometry, ray tracing, AI denoising, and viewport frame rate all matter.&lt;/p&gt;
&lt;p&gt;RTX 5080 can handle many 4K gaming, real-time preview, and medium-scale creative projects. For independent creators, it is a realistic high-performance option.&lt;/p&gt;
&lt;p&gt;RTX 5090 is a better fit for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Complex 3D scene preview.&lt;/li&gt;
&lt;li&gt;High-resolution materials and large asset libraries.&lt;/li&gt;
&lt;li&gt;AI denoising, upscaling, and generative modeling assistance running together.&lt;/li&gt;
&lt;li&gt;Heavy D5 Render, Blender, Unreal Engine, and similar workloads.&lt;/li&gt;
&lt;li&gt;Modeling while also running a local AI assistant or reference-image generator.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;NVIDIA says RTX 50 can improve generative AI, video editing, and 3D rendering in creative apps, but production projects still depend on whether the software uses the new hardware paths. The reliable method is to test with your own project files, not only marketing charts.&lt;/p&gt;
&lt;h2 id=&#34;how-to-choose&#34;&gt;How To Choose
&lt;/h2&gt;&lt;p&gt;If your main goal is local LLMs, start with VRAM. RTX 5080&amp;rsquo;s 16GB can run many lightweight models, but it is closer to an entry high-performance local AI card. RTX 5090&amp;rsquo;s 32GB is closer to a single-card local LLM workstation.&lt;/p&gt;
&lt;p&gt;For image generation, RTX 5080 covers many daily workflows. If you often use high resolution, complex node graphs, batch generation, FLUX, or video generation, RTX 5090&amp;rsquo;s VRAM headroom matters more.&lt;/p&gt;
&lt;p&gt;For 4K video AI, RTX 5090 is safer, but check the exact software version. Topaz, DaVinci Resolve, ComfyUI, TensorRT plugins, and drivers can all affect results.&lt;/p&gt;
&lt;p&gt;For real-time 3D, RTX 5080 can satisfy many creators. RTX 5090 is better for heavier scenes, parallel apps, and long production sessions.&lt;/p&gt;
&lt;p&gt;If you already own an RTX 4090, upgrade carefully. RTX 5090 has more VRAM and bandwidth, but some AI software has not fully unlocked Blackwell yet. Unless you clearly need 32GB, higher bandwidth, or the new encoders, waiting for the ecosystem is reasonable.&lt;/p&gt;
&lt;p&gt;If you are still on RTX 30 series or older, RTX 50 will feel much more meaningful. Moving from 8GB, 10GB, or 12GB to 16GB or 32GB directly expands what local AI can run.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;RTX 5090 and RTX 5080 both push consumer GPUs further into local AI, but they serve different users.&lt;/p&gt;
&lt;p&gt;RTX 5090 is about 32GB GDDR7, very high memory bandwidth, and a stronger creative hardware stack. It suits users who want larger local models, more complex image generation, heavier video AI, and real-time 3D on one machine.&lt;/p&gt;
&lt;p&gt;RTX 5080 is about entering Blackwell at a lower cost. It suits small and medium models, daily image generation, development tests, and high-performance creative work that fits in 16GB.&lt;/p&gt;
&lt;p&gt;The buying rule is simple: first check whether your models and projects fit in VRAM, then check whether your software is optimized for Blackwell, and only then look at theoretical AI TOPS. For local AI, finishing reliably matters more than peak numbers.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA GeForce RTX 5090 official specifications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5080/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA GeForce RTX 5080 official specifications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-us/geforce/news/rtx-5090-5080-out-now/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA: GeForce RTX 5090 &amp;amp; 5080 Out Now&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://developer.nvidia.com/blog/nvidia-tensorrt-unlocks-fp4-image-generation-for-nvidia-blackwell-geforce-rtx-50-series-gpus/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA Technical Blog: TensorRT Unlocks FP4 Image Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.pugetsystems.com/labs/articles/nvidia-geforce-rtx-5090-amp-5080-ai-review/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Puget Systems: NVIDIA GeForce RTX 5090 &amp;amp; 5080 AI Review&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>DeepSeek V4 Local Private Deployment: Choosing Domestic Chips or Consumer GPU Clusters</title>
        <link>https://knightli.com/en/2026/05/08/deepseek-v4-local-private-deployment/</link>
        <pubDate>Fri, 08 May 2026 09:39:35 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/08/deepseek-v4-local-private-deployment/</guid>
        <description>&lt;p&gt;After DeepSeek V4 was released, many enterprises started asking one question: can we avoid external APIs and deploy the model in our own data center, private cloud, or dedicated cluster?&lt;/p&gt;
&lt;p&gt;This is a very practical need. Finance, healthcare, government, manufacturing, legal, and R&amp;amp;D teams often cannot send internal documents, code, contracts, tickets, or customer data directly to public cloud models. For these scenarios, DeepSeek V4 is attractive not only because of model capability, but because it gives enterprises an option closer to controllable LLM infrastructure.&lt;/p&gt;
&lt;p&gt;However, local deployment of DeepSeek V4 is not as simple as downloading a model and finding a few GPUs. Especially for very large MoE models such as Pro, total parameter size, active parameters, context length, KV cache, concurrency, and inference framework all directly affect hardware cost. What enterprises really need is not blindly chasing the full version, but first deciding what deployment shape the business actually needs.&lt;/p&gt;
&lt;h2 id=&#34;clarify-the-deployment-goal-first&#34;&gt;Clarify the Deployment Goal First
&lt;/h2&gt;&lt;p&gt;Enterprise local private deployment usually has three goals:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Keep data inside the domain: internal documents, code, customer materials, logs, and knowledge bases do not leave the enterprise environment.&lt;/li&gt;
&lt;li&gt;Make operations stable and controllable: model services, permissions, audit, logs, and upgrade cadence are controlled by the enterprise.&lt;/li&gt;
&lt;li&gt;Reduce long-term cost: for high-frequency calls, local inference may be more controllable than long-term external API purchases.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If only a few employees ask occasional questions, local deployment may not be cost-effective. Private deployment is truly suitable for high-frequency, stable, data-sensitive, and workflow-defined scenarios, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Internal knowledge-base Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Code review and development assistants.&lt;/li&gt;
&lt;li&gt;Customer-service ticket summarization.&lt;/li&gt;
&lt;li&gt;Contract, medical-record, and report analysis.&lt;/li&gt;
&lt;li&gt;Database query assistants.&lt;/li&gt;
&lt;li&gt;Agent workflow automation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These scenarios share the same traits: sensitive data, stable call patterns, and the ability to fit into enterprise governance through permissions and logs.&lt;/p&gt;
&lt;h2 id=&#34;do-not-chase-full-pro-from-day-one&#34;&gt;Do Not Chase Full Pro From Day One
&lt;/h2&gt;&lt;p&gt;Common DeepSeek V4 versions include Pro and Flash. In public materials, Pro targets stronger reasoning and complex Agent tasks, while Flash emphasizes cost and response speed. Enterprises should not assume every workload needs Pro.&lt;/p&gt;
&lt;p&gt;You can split tasks by complexity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple Q&amp;amp;A, summarization, classification, and tag generation: prioritize Flash or smaller models.&lt;/li&gt;
&lt;li&gt;Internal knowledge-base retrieval augmentation: Flash is enough for many cases; RAG, permissions, and retrieval quality matter more.&lt;/li&gt;
&lt;li&gt;Code Agents, complex reasoning, and long-context analysis: then evaluate Pro.&lt;/li&gt;
&lt;li&gt;High-value, low-frequency tasks: Pro can be used, but high concurrency may not be necessary.&lt;/li&gt;
&lt;li&gt;Regular office assistants: there is no need to occupy the most expensive inference resources for long periods.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The advantage of MoE models is that each inference only activates part of the parameters, but this does not mean the hardware pressure is small. Weight storage, expert parallelism, network communication, context cache, and concurrent scheduling are still heavy. With 1M-token-level long context in particular, the real resource consumer is often not a single answer, but long context, multi-user concurrency, and persistent sessions.&lt;/p&gt;
&lt;h2 id=&#34;domestic-chip-route-better-for-enterprise-private-cloud&#34;&gt;Domestic Chip Route: Better for Enterprise Private Cloud
&lt;/h2&gt;&lt;p&gt;If an enterprise already has a domestic compute pool, or has requirements around Xinchuang, compliance, or supply-chain control, it can first evaluate domestic chips such as Ascend and Cambricon.&lt;/p&gt;
&lt;p&gt;The advantages of this route are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Better alignment with localization and supply-chain control requirements.&lt;/li&gt;
&lt;li&gt;Suitable for enterprise data centers, dedicated clouds, and government/enterprise projects.&lt;/li&gt;
&lt;li&gt;Easier to unify permissions, audit, resource isolation, and operations.&lt;/li&gt;
&lt;li&gt;Friendlier to long-term stable services.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the domestic chip route also has three practical issues.&lt;/p&gt;
&lt;p&gt;First, framework adaptation. Whether the model can run depends not only on chip compute power, but also on the maturity of the inference framework, operators, communication libraries, quantization formats, MoE expert parallelism, and long-context optimization.&lt;/p&gt;
&lt;p&gt;Second, engineering experience. Enterprises need more than &amp;ldquo;it starts successfully&amp;rdquo;; they need stable services: multi-tenancy, rate limiting, monitoring, failure recovery, gray releases, log audit, and permission isolation all need to be built.&lt;/p&gt;
&lt;p&gt;Third, ecosystem differences. The same model will not have identical performance, accuracy, quantization support, or deployment tools on NVIDIA, Ascend, Cambricon, and other platforms. Before launch, real stress testing is required instead of relying only on nominal compute.&lt;/p&gt;
&lt;p&gt;Therefore, domestic chips are more suitable for enterprises with clear budgets, high compliance requirements, and willingness to invest in platform engineering. It is not the easiest route, but it may be the route that best fits long-term governance.&lt;/p&gt;
&lt;h2 id=&#34;consumer-gpu-clusters-better-for-pilots-and-small-teams&#34;&gt;Consumer GPU Clusters: Better for Pilots and Small Teams
&lt;/h2&gt;&lt;p&gt;If the goal is to validate business value first, a consumer GPU cluster is easier to start with. GPUs such as RTX 4090, RTX 5090, RTX 3090, and RTX 3060 12GB have more community tools, quantized models, and local inference references, so trial-and-error cost is lower.&lt;/p&gt;
&lt;p&gt;The consumer GPU route fits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Internal pilots by R&amp;amp;D teams.&lt;/li&gt;
&lt;li&gt;Knowledge-base Q&amp;amp;A for small and medium businesses.&lt;/li&gt;
&lt;li&gt;Low-concurrency code assistants.&lt;/li&gt;
&lt;li&gt;Offline document processing.&lt;/li&gt;
&lt;li&gt;Internal tools without strict SLA requirements.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But it also has obvious limits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VRAM is small, making it hard to host a full large model directly.&lt;/li&gt;
&lt;li&gt;Multi-GPU communication is weak, and cross-machine communication is more troublesome.&lt;/li&gt;
&lt;li&gt;Long-term full-load stability is weaker than server-grade solutions.&lt;/li&gt;
&lt;li&gt;Chassis, power, cooling, drivers, and operations become hidden costs.&lt;/li&gt;
&lt;li&gt;It is not suitable for promising enterprise-grade high availability from the start.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A more realistic approach is to first run Flash, distilled versions, quantized versions, or smaller models on consumer GPUs, get the business workflow working, and then decide whether to migrate to server GPUs or a domestic compute platform after call volume, quality, and data governance have been validated.&lt;/p&gt;
&lt;h2 id=&#34;a-possible-deployment-architecture&#34;&gt;A Possible Deployment Architecture
&lt;/h2&gt;&lt;p&gt;A relatively stable enterprise private architecture can be divided into six layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Model layer: DeepSeek V4 Pro, V4 Flash, or smaller distilled models selected by task.&lt;/li&gt;
&lt;li&gt;Inference layer: SGLang, vLLM, llama.cpp, vendor NPU inference stacks, or enterprise self-developed services.&lt;/li&gt;
&lt;li&gt;Gateway layer: unified authentication, rate limiting, audit, model routing, and call logs.&lt;/li&gt;
&lt;li&gt;Knowledge layer: vector database, full-text search, document parsing, permission filtering, and RAG.&lt;/li&gt;
&lt;li&gt;Application layer: customer service, code assistants, document analysis, report Q&amp;amp;A, and Agent workflows.&lt;/li&gt;
&lt;li&gt;Operations layer: monitoring, alerts, cost statistics, gray releases, rollback, and security audit.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The gateway layer and knowledge layer are the easiest to underestimate. Many projects fail not because the model is completely unusable, but because permissions, retrieval, logs, context management, prompt templates, and business workflows were not done well.&lt;/p&gt;
&lt;p&gt;When deploying LLMs internally, enterprises should treat the model as infrastructure, not as an isolated chat page. The real value appears only when the model enters workflows and can stably process the enterprise&amp;rsquo;s own data and tasks.&lt;/p&gt;
&lt;h2 id=&#34;hardware-selection&#34;&gt;Hardware Selection
&lt;/h2&gt;&lt;p&gt;Hardware selection should not only ask &amp;ldquo;can it run&amp;rdquo;; it should also ask &amp;ldquo;can it serve stably&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;You can choose by stage:&lt;/p&gt;
&lt;h3 id=&#34;validation-stage&#34;&gt;Validation Stage
&lt;/h3&gt;&lt;p&gt;The goal is to prove whether the business is worth doing.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use 1-4 consumer GPUs.&lt;/li&gt;
&lt;li&gt;Prioritize Flash, smaller models, distilled models, or quantized models.&lt;/li&gt;
&lt;li&gt;Keep concurrency low and focus on task completion rate.&lt;/li&gt;
&lt;li&gt;Do not promise high availability.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not buy large-scale hardware too early at this stage. First confirm whether employees actually use it, whether the business really saves time, and whether answers can enter real workflows.&lt;/p&gt;
&lt;h3 id=&#34;pilot-stage&#34;&gt;Pilot Stage
&lt;/h3&gt;&lt;p&gt;The goal is to let one department or one business line use it steadily.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use 4-16 GPUs or a set of domestic NPU nodes.&lt;/li&gt;
&lt;li&gt;Add a unified gateway, logs, and permission controls.&lt;/li&gt;
&lt;li&gt;Build RAG, document parsing, model routing, and caching.&lt;/li&gt;
&lt;li&gt;Start tracking tokens, concurrency, latency, and failure rate.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At this stage, operations begin to matter. Model quality is only one part; stability, cost, and data governance are equally important.&lt;/p&gt;
&lt;h3 id=&#34;production-stage&#34;&gt;Production Stage
&lt;/h3&gt;&lt;p&gt;The goal is to enter enterprise-grade service.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use server GPUs, domestic compute clusters, or private-cloud resource pools.&lt;/li&gt;
&lt;li&gt;Build multi-replica deployment, rate limiting, failover, and capacity planning.&lt;/li&gt;
&lt;li&gt;Route models by task: simple tasks use lightweight models, complex tasks use Pro.&lt;/li&gt;
&lt;li&gt;Connect to enterprise identity systems, audit systems, and security policies.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In production, it is not recommended to send every request to the strongest model. Proper model routing usually saves more money than simply adding hardware.&lt;/p&gt;
&lt;h2 id=&#34;choosing-an-inference-framework&#34;&gt;Choosing an Inference Framework
&lt;/h2&gt;&lt;p&gt;Models such as DeepSeek V4 have high requirements for inference frameworks. When MoE, long context, sparse attention, quantization, and multi-GPU parallelism are involved, framework maturity directly affects speed and stability.&lt;/p&gt;
&lt;p&gt;Common choices can be understood this way:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SGLang&lt;/code&gt;: suitable for teams focused on high-performance inference, Agents, multi-turn tool calls, and complex service orchestration.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vLLM&lt;/code&gt;: mature ecosystem, suitable for general LLM services, but actual support depends on version and model adaptation progress.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama.cpp&lt;/code&gt;: better for small models, quantized models, and edge deployment; not suitable for directly hosting a full very large MoE model.&lt;/li&gt;
&lt;li&gt;Domestic NPU inference stacks: suitable for Xinchuang and domestic compute environments, but operator, quantization, and long-context support must be carefully verified.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not choose a framework only by benchmark. Enterprises should test their own real inputs: internal document length, concurrency, average output length, RAG hit rate, number of Agent tool calls, and retry count after failures.&lt;/p&gt;
&lt;h2 id=&#34;data-security-must-be-built-outside-the-model&#34;&gt;Data Security Must Be Built Outside the Model
&lt;/h2&gt;&lt;p&gt;Private deployment does not automatically mean security. Running the model locally only solves part of the question of whether data leaves the enterprise.&lt;/p&gt;
&lt;p&gt;You still need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Accounts and permissions: different departments can only access their own knowledge bases.&lt;/li&gt;
&lt;li&gt;Log audit: who asked what, which model was called, and which documents were accessed.&lt;/li&gt;
&lt;li&gt;Data masking: customer information, ID numbers, phone numbers, contract amounts, and other sensitive fields must be handled.&lt;/li&gt;
&lt;li&gt;Prompt security: prevent users from bypassing permissions or leaking system prompts through prompts.&lt;/li&gt;
&lt;li&gt;Output review: important scenarios need human review or rule-based review.&lt;/li&gt;
&lt;li&gt;Data lifecycle: uploaded documents, vector indexes, caches, and session records must be deletable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Enterprise local LLM deployment cannot involve only the algorithm team. Security, legal, operations, and business owners should all participate; otherwise, risks will be exposed after launch.&lt;/p&gt;
&lt;h2 id=&#34;cost-is-more-than-gpus&#34;&gt;Cost Is More Than GPUs
&lt;/h2&gt;&lt;p&gt;The cost of local deployment is often underestimated. Beyond GPUs or NPUs, you also need to count:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Servers, racks, power, cooling, and networking.&lt;/li&gt;
&lt;li&gt;Storage and backup.&lt;/li&gt;
&lt;li&gt;Inference framework adaptation and engineering development.&lt;/li&gt;
&lt;li&gt;Operations monitoring and incident handling.&lt;/li&gt;
&lt;li&gt;Model upgrades, rollback, and compatibility tests.&lt;/li&gt;
&lt;li&gt;Security audit and permission systems.&lt;/li&gt;
&lt;li&gt;Business-side prompts, RAG, and workflow construction.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If call volume is very low, external APIs may be cheaper. If call volume is high, data is sensitive, and workflows are stable, local deployment is more likely to amortize cost.&lt;/p&gt;
&lt;p&gt;A more reasonable strategy is hybrid deployment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Highly sensitive data goes to local models.&lt;/li&gt;
&lt;li&gt;Low-sensitivity general tasks can use external APIs.&lt;/li&gt;
&lt;li&gt;Simple tasks use small models.&lt;/li&gt;
&lt;li&gt;Complex tasks use DeepSeek V4 Pro.&lt;/li&gt;
&lt;li&gt;High-frequency tasks prioritize caching, retrieval, and model routing optimization.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;recommended-rollout-path&#34;&gt;Recommended Rollout Path
&lt;/h2&gt;&lt;p&gt;Enterprises can proceed in this order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Choose 2-3 high-value scenarios first; do not roll out company-wide.&lt;/li&gt;
&lt;li&gt;Use consumer GPUs or small-scale compute for a PoC.&lt;/li&gt;
&lt;li&gt;Run Flash, distilled models, or quantized models first, and connect RAG and permissions.&lt;/li&gt;
&lt;li&gt;Introduce Pro for comparison tests on complex tasks.&lt;/li&gt;
&lt;li&gt;Record real call volume, latency, failure rate, and time saved by humans.&lt;/li&gt;
&lt;li&gt;Then decide whether to purchase domestic chip clusters or server GPUs.&lt;/li&gt;
&lt;li&gt;Before production, complete gateway, audit, monitoring, rate limiting, and rollback.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This path is more stable than buying a large cluster from the start. The biggest enterprise risk is not that the model is not strong enough, but that a lot of money is spent before the business workflow is ready to absorb the model capability.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;DeepSeek V4 gives enterprises more room to imagine local private deployment, but it is not simply a &amp;ldquo;local ChatGPT&amp;rdquo;. The real difficulty is engineering: hardware, frameworks, model routing, permissions, RAG, audit, monitoring, and cost control all need to be considered together.&lt;/p&gt;
&lt;p&gt;The domestic chip route better fits enterprises with high compliance requirements and long-term private cloud plans. Consumer GPU clusters are better for pilots and quick validation by small and medium teams. Pro fits complex reasoning and Agent tasks; Flash or smaller models fit many ordinary tasks.&lt;/p&gt;
&lt;p&gt;If you only remember one sentence: DeepSeek V4 private deployment should not start with hardware procurement, but with business scenarios, data boundaries, and call volume. First get the scenario working, then decide whether to use a large model, how large it should be, and what compute platform to use.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://apnews.com/article/deepseek-ai-china-gpt-v4-d2ed33f2521917193616e061674d5f92&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AP News: DeepSeek launches an update of its AI model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/blog/deepseekv4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hugging Face Blog: DeepSeek-V4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.lmsys.org/blog/2026-04-25-deepseek-v4/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LMSYS Blog: DeepSeek-V4 on Day 0&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Local LLM Models Recommended for an RTX 3060 GPU</title>
        <link>https://knightli.com/en/2026/05/08/rtx-3060-local-llm-models/</link>
        <pubDate>Fri, 08 May 2026 09:25:24 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/08/rtx-3060-local-llm-models/</guid>
        <description>&lt;p&gt;The most common RTX 3060 variant has 12GB of VRAM. It is not a top-tier AI GPU, but it is a very usable card for local LLMs, especially 7B, 8B, 9B, and 12B models.&lt;/p&gt;
&lt;p&gt;If you only want a quick rule of thumb, remember this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On an RTX 3060 12GB, prioritize around-8B models in Q4_K_M or Q5_K_M quantization. Choose Q4 for stability, and try Q5 if you want better quality.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Do not start by chasing 32B or 70B models. Even if they can run with low-bit quantization and CPU offloading, their speed and experience are usually not suitable for daily use.&lt;/p&gt;
&lt;h2 id=&#34;start-with-the-vram-limit&#34;&gt;Start With the VRAM Limit
&lt;/h2&gt;&lt;p&gt;For local LLMs on an RTX 3060 12GB, the real limit is VRAM.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model Size&lt;/th&gt;
          &lt;th&gt;Recommended Quantization&lt;/th&gt;
          &lt;th&gt;RTX 3060 12GB Experience&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;3B / 4B&lt;/td&gt;
          &lt;td&gt;Q4, Q5, Q8&lt;/td&gt;
          &lt;td&gt;Very easy, fast&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;7B / 8B / 9B&lt;/td&gt;
          &lt;td&gt;Q4_K_M, Q5_K_M&lt;/td&gt;
          &lt;td&gt;Best balance of quality and speed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;12B / 14B&lt;/td&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;Usable, but avoid huge context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;30B+&lt;/td&gt;
          &lt;td&gt;Q2 / Q3 or partial offload&lt;/td&gt;
          &lt;td&gt;Possible to tinker with, not recommended daily&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;70B+&lt;/td&gt;
          &lt;td&gt;Very low quantization or heavy CPU/RAM use&lt;/td&gt;
          &lt;td&gt;More like an experiment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Local LLMs do not only consume VRAM for the model file. Context length, KV cache, batch size, inference framework, and drivers all consume resources.&lt;/p&gt;
&lt;p&gt;So 12GB of VRAM does not mean you can load a 12GB model file directly. It is better to leave room for the system and context.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-1-qwen3-8b&#34;&gt;Recommendation 1: Qwen3 8B
&lt;/h2&gt;&lt;p&gt;If you mainly use Chinese, &lt;code&gt;Qwen3 8B&lt;/code&gt; is one of the first models worth trying on an RTX 3060.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chinese Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Summarization and rewriting.&lt;/li&gt;
&lt;li&gt;Everyday knowledge assistant work.&lt;/li&gt;
&lt;li&gt;Simple code explanation.&lt;/li&gt;
&lt;li&gt;Local RAG.&lt;/li&gt;
&lt;li&gt;Lightweight Agent flows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Qwen3 8B GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M: first choice
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q5_K_M: better quality, more VRAM pressure
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Qwen models are friendly to Chinese usage. For daily writing, information organization, and Chinese instruction following, Qwen3 8B is a good first model.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-2-llama-31-8b-instruct&#34;&gt;Recommendation 2: Llama 3.1 8B Instruct
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Llama 3.1 8B Instruct&lt;/code&gt; is a stable general-purpose model with mature English capability and ecosystem support.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;English Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Lightweight coding help.&lt;/li&gt;
&lt;li&gt;General chat.&lt;/li&gt;
&lt;li&gt;Document summarization.&lt;/li&gt;
&lt;li&gt;Prompt testing.&lt;/li&gt;
&lt;li&gt;Comparing different inference tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Llama 3.1 8B Instruct GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M: better speed and VRAM stability
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q5_K_M: better answer quality
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you mainly process English materials, or want a model with many tutorials and broad compatibility, Llama 3.1 8B is still a good baseline.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-3-gemma-3-12b&#34;&gt;Recommendation 3: Gemma 3 12B
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Gemma 3 12B&lt;/code&gt; is closer to the upper practical limit for an RTX 3060 12GB.&lt;/p&gt;
&lt;p&gt;It uses more VRAM than 8B models, but Q4 quantization can still make it usable on a 12GB card. It is a good option if you want to try a slightly larger model on one GPU.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Higher-quality general Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;English content processing.&lt;/li&gt;
&lt;li&gt;More complex summarization and analysis.&lt;/li&gt;
&lt;li&gt;Trying an upgrade over 8B models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Gemma 3 12B GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M or official QAT Q4
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Keep context modest
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you run out of VRAM, reduce context length first, or return to an 8B model. For an RTX 3060, 12B is &amp;ldquo;worth trying,&amp;rdquo; not a no-brainer default.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-4-deepseek-r1-distill-qwen-8b&#34;&gt;Recommendation 4: DeepSeek R1 Distill Qwen 8B
&lt;/h2&gt;&lt;p&gt;If you want to experience reasoning-style local models, try models like &lt;code&gt;DeepSeek R1 Distill Qwen 8B&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple reasoning tasks.&lt;/li&gt;
&lt;li&gt;Step-by-step analysis.&lt;/li&gt;
&lt;li&gt;Learning reasoning-model output style.&lt;/li&gt;
&lt;li&gt;Low-cost local experiments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;DeepSeek R1 Distill Qwen 8B GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;These models may produce longer reasoning-style outputs, so speed and context usage can be heavier than ordinary instruction models. They are not always more comfortable for daily chat, but they are useful for reasoning experiments.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-5-phi--minicpm--smaller-models&#34;&gt;Recommendation 5: Phi / MiniCPM / Smaller Models
&lt;/h2&gt;&lt;p&gt;If your RTX 3060 is an 8GB variant, or your system RAM is limited, consider 3B and 4B models first.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fast Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Simple summaries.&lt;/li&gt;
&lt;li&gt;Embedding into local tools.&lt;/li&gt;
&lt;li&gt;Low-latency chat.&lt;/li&gt;
&lt;li&gt;Testing on older machines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These models may not match 8B or 12B quality, but they are light, fast, and easy to deploy.&lt;/p&gt;
&lt;h2 id=&#34;which-quantization-to-use&#34;&gt;Which Quantization to Use
&lt;/h2&gt;&lt;p&gt;Local models commonly use &lt;code&gt;GGUF&lt;/code&gt;, with quantization types such as Q4, Q5, Q6, and Q8.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th&gt;Traits&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;Small, fast, good enough&lt;/td&gt;
          &lt;td&gt;RTX 3060 first choice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q5_K_M&lt;/td&gt;
          &lt;td&gt;Better quality, higher usage&lt;/td&gt;
          &lt;td&gt;Try with 8B models&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q6 / Q8&lt;/td&gt;
          &lt;td&gt;Closer to original quality, larger&lt;/td&gt;
          &lt;td&gt;Small models or more VRAM&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q2 / Q3&lt;/td&gt;
          &lt;td&gt;Saves VRAM but quality drops&lt;/td&gt;
          &lt;td&gt;Large-model tinkering&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For RTX 3060 12GB, the practical choices are:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B models: Q4_K_M or Q5_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;12B models: Q4_K_M first
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Larger models: not recommended as daily drivers
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;which-tool-to-use&#34;&gt;Which Tool to Use
&lt;/h2&gt;&lt;p&gt;Beginners can start with &lt;code&gt;Ollama&lt;/code&gt;, because installation and running models are simple.&lt;/p&gt;
&lt;p&gt;Common commands:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run qwen3:8b
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run llama3.1:8b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want finer control over GGUF files, GPU layers, and context length, use &lt;code&gt;llama.cpp&lt;/code&gt; or GUI tools based on it.&lt;/p&gt;
&lt;p&gt;Common choices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Ollama&lt;/code&gt;: easiest, best for beginners.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LM Studio&lt;/code&gt;: friendly GUI, good for downloading and switching models.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama.cpp&lt;/code&gt;: most control, best for performance tuning.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;text-generation-webui&lt;/code&gt;: many features, good for backend testing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For local chat and simple Q&amp;amp;A, Ollama or LM Studio is enough.&lt;/p&gt;
&lt;h2 id=&#34;do-not-set-context-too-high&#34;&gt;Do Not Set Context Too High
&lt;/h2&gt;&lt;p&gt;Many models advertise long-context support, but do not blindly set context to the maximum on an RTX 3060.&lt;/p&gt;
&lt;p&gt;Longer context uses more KV cache and increases VRAM pressure. Even if the model loads, long context can slow generation down.&lt;/p&gt;
&lt;p&gt;Suggested settings:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Normal chat: 4K to 8K
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Document summaries: 8K to 16K
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Long-document RAG: chunk first; do not paste everything at once
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;An RTX 3060 is better suited to &amp;ldquo;moderate context + good model + good retrieval&amp;rdquo; than forcing hundreds of thousands of tokens into one prompt.&lt;/p&gt;
&lt;h2 id=&#34;choose-by-use-case&#34;&gt;Choose by Use Case
&lt;/h2&gt;&lt;p&gt;If you mainly write Chinese:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;First choice: Qwen3 8B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Alternative: DeepSeek R1 Distill Qwen 8B
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you mainly write English:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;First choice: Llama 3.1 8B Instruct Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Alternative: Gemma 3 12B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want speed:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;3B / 4B models
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Keep context at 4K to 8K
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want better quality:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B Q5_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;12B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Accept slower speed
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want coding help:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B coding models can help with explanations and small edits
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;For complex engineering tasks, use stronger cloud models
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Local RTX 3060 models are good for code explanation, function completion, small scripts, and offline assistance. For large refactors, difficult bugs, and cross-file Agent work, do not expect Claude Sonnet or GPT-5-level performance.&lt;/p&gt;
&lt;h2 id=&#34;reasonable-expectations&#34;&gt;Reasonable Expectations
&lt;/h2&gt;&lt;p&gt;The RTX 3060 12GB is good enough to turn local LLMs from toys into daily tools, but it will not recreate top cloud models at home.&lt;/p&gt;
&lt;p&gt;Its strengths:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Low cost.&lt;/li&gt;
&lt;li&gt;More VRAM than 8GB cards.&lt;/li&gt;
&lt;li&gt;Good 8B model experience.&lt;/li&gt;
&lt;li&gt;Offline use.&lt;/li&gt;
&lt;li&gt;Local processing for privacy-sensitive materials.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Its limits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Large models are hard to run smoothly.&lt;/li&gt;
&lt;li&gt;Long context consumes VRAM.&lt;/li&gt;
&lt;li&gt;Slower than high-end GPUs.&lt;/li&gt;
&lt;li&gt;Small local models have limited complex reasoning.&lt;/li&gt;
&lt;li&gt;Multimodal and Agent workflows need more resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The stable route is: use 8B models as everyday local assistants, try 12B models for quality, and leave complex tasks to cloud models.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Recommended local LLM choices for RTX 3060 12GB:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chinese general use: &lt;code&gt;Qwen3 8B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;English general use: &lt;code&gt;Llama 3.1 8B Instruct Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Higher-quality experiment: &lt;code&gt;Gemma 3 12B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Reasoning experiment: &lt;code&gt;DeepSeek R1 Distill Qwen 8B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Low-VRAM fast use: 3B / 4B small models&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Choose &lt;code&gt;Q4_K_M&lt;/code&gt; first. Try &lt;code&gt;Q5_K_M&lt;/code&gt; for 8B models if you want better quality. Start with Ollama or LM Studio.&lt;/p&gt;
&lt;p&gt;Do not treat the RTX 3060 as a large-model server. Treat it as a local knowledge assistant, privacy document processor, lightweight coding helper, and model experiment card, and it will fit its real capabilities much better.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Qwen3 8B GGUF: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3-8B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/Qwen/Qwen3-8B-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Llama 3.1 8B GGUF: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/macandchiz/Llama-3.1-8B-Instruct-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/macandchiz/Llama-3.1-8B-Instruct-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Gemma 3 12B GGUF: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-3-12b-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/unsloth/gemma-3-12b-it-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;llama.cpp: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama: &lt;a class=&#34;link&#34; href=&#34;https://ollama.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://ollama.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Hermes &#43; Qwen3.6: A Low-Cost Local Agent Deployment</title>
        <link>https://knightli.com/en/2026/05/04/hermes-qwen36-local-agent/</link>
        <pubDate>Mon, 04 May 2026 06:40:30 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/04/hermes-qwen36-local-agent/</guid>
        <description>&lt;p&gt;This article documents a local Agent deployment plan: run a Qwen3.6 GGUF model with &lt;code&gt;llama.cpp&lt;/code&gt; inside WSL2, then connect Hermes Agent to the local OpenAI-compatible API. This gives you a long-running local AI assistant on your own computer, without paying by online service Token usage.&lt;/p&gt;
&lt;p&gt;This setup is suitable for users who want to try local AI Agents while keeping data private and controllable over the long term. It can be used for daily Q&amp;amp;A, writing, coding assistance, document organization, and simple automation tasks. The larger the model, the higher the VRAM requirement. The original example uses Qwen3.6-27B, and 24GB VRAM is more stable. If your VRAM is smaller, choose a smaller model or a lower quantization.&lt;/p&gt;
&lt;h2 id=&#34;architecture&#34;&gt;Architecture
&lt;/h2&gt;&lt;p&gt;The overall chain is simple:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install WSL2 and Ubuntu 24.04 on Windows.&lt;/li&gt;
&lt;li&gt;Install CUDA Toolkit inside WSL2 and compile &lt;code&gt;llama.cpp&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Download the Qwen3.6 GGUF model.&lt;/li&gt;
&lt;li&gt;Start a local model service with &lt;code&gt;llama-server&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Install Hermes Agent and configure it to &lt;code&gt;http://localhost:8080/v1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Optional: write a startup script so the model service starts automatically when WSL2 opens.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Hermes provides the Agent capability, while Qwen3.6 provides the local LLM capability. Together, they turn the computer into a private local AI assistant.&lt;/p&gt;
&lt;h2 id=&#34;install-wsl2-and-ubuntu&#34;&gt;Install WSL2 and Ubuntu
&lt;/h2&gt;&lt;p&gt;Run in an administrator Windows PowerShell window:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;wsl&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-install&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;wsl&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-set-default-version&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After rebooting, install Ubuntu 24.04:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;wsl&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-install&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-d&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Ubuntu&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;24.04&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After installation, Ubuntu prompts you to set a username and password. Once inside Ubuntu, first check whether the NVIDIA GPU is visible in WSL2:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the GPU cannot be detected, update the NVIDIA driver on Windows first. WSL2 inherits the Windows driver, but CUDA Toolkit still needs to be installed separately inside WSL2.&lt;/p&gt;
&lt;h2 id=&#34;install-python-and-basic-tools&#34;&gt;Install Python and Basic Tools
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt update &lt;span class=&#34;o&#34;&gt;&amp;amp;&amp;amp;&lt;/span&gt; sudo apt install -y python3-pip python3-venv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You also need build tools, Git, and CMake:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install -y cmake build-essential git
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;compile-llamacpp&#34;&gt;Compile llama.cpp
&lt;/h2&gt;&lt;p&gt;Clone the repository:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; llama.cpp
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If CUDA is already available in WSL2, compile directly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake -B build -DGGML_CUDA&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;ON -DCMAKE_CUDA_ARCHITECTURES&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;89&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake --build build -j&lt;span class=&#34;k&#34;&gt;$(&lt;/span&gt;nproc&lt;span class=&#34;k&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;CMAKE_CUDA_ARCHITECTURES=89&lt;/code&gt; is suitable for Ada GPUs, such as RTX 40 series cards. Adjust it according to your actual GPU architecture.&lt;/p&gt;
&lt;p&gt;If compilation reports that CUDA Toolkit is missing, install CUDA Toolkit inside WSL2 first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo dpkg -i cuda-keyring_1.1-1_all.deb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt update
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install -y cuda-toolkit-12-8
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Configure environment variables:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;PATH&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;/usr/local/cuda-12.8/bin:&lt;span class=&#34;nv&#34;&gt;$PATH&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;/usr/local/cuda-12.8/lib64:&lt;span class=&#34;nv&#34;&gt;$LD_LIBRARY_PATH&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;export PATH=/usr/local/cuda-12.8/bin:$PATH&amp;#39;&lt;/span&gt; &amp;gt;&amp;gt; ~/.bashrc
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH&amp;#39;&lt;/span&gt; &amp;gt;&amp;gt; ~/.bashrc
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then rebuild:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ~/llama.cpp
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;rm -rf build
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake -B build -DGGML_CUDA&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;ON -DCMAKE_CUDA_ARCHITECTURES&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;89&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake --build build -j&lt;span class=&#34;k&#34;&gt;$(&lt;/span&gt;nproc&lt;span class=&#34;k&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;download-the-qwen36-gguf-model&#34;&gt;Download the Qwen3.6 GGUF Model
&lt;/h2&gt;&lt;p&gt;The example uses &lt;code&gt;Qwen3.6-27B-UD-Q4_K_XL.gguf&lt;/code&gt; from &lt;code&gt;unsloth/Qwen3.6-27B-GGUF&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;hf download unsloth/Qwen3.6-27B-GGUF &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Qwen3.6-27B-UD-Q4_K_XL.gguf &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--local-dir ~/models/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The file is about 17GB. If Hugging Face is slow, use a mirror such as ModelScope. Do not force a 27B model if your VRAM is insufficient; use a smaller model or lower quantization.&lt;/p&gt;
&lt;h2 id=&#34;start-the-local-model-service&#34;&gt;Start the Local Model Service
&lt;/h2&gt;&lt;p&gt;Start &lt;code&gt;llama-server&lt;/code&gt; with your own model file name:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;~/llama.cpp/build/bin/llama-server &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--n-gpu-layers &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--ctx-size &lt;span class=&#34;m&#34;&gt;32768&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--flash-attn on &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--temp 1.0 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--top-p 0.95 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--top-k &lt;span class=&#34;m&#34;&gt;20&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--presence-penalty 1.5 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--port &lt;span class=&#34;m&#34;&gt;8080&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After startup, open this in a Windows browser:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://localhost:8080
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For Hermes Agent or other OpenAI-compatible clients, the API endpoint is usually:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://localhost:8080/v1
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;thinking-mode-tradeoff&#34;&gt;Thinking Mode Tradeoff
&lt;/h2&gt;&lt;p&gt;Qwen3.6 may enable Thinking mode by default. It is suitable for complex reasoning, complicated coding problems, and multi-step analysis, but it is slower.&lt;/p&gt;
&lt;p&gt;To disable Thinking mode, stop the service and add &lt;code&gt;--chat-template-kwargs&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;~/llama.cpp/build/bin/llama-server &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--n-gpu-layers &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--ctx-size &lt;span class=&#34;m&#34;&gt;32768&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--flash-attn on &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--temp 1.0 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--top-p 0.95 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--top-k &lt;span class=&#34;m&#34;&gt;20&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--presence-penalty 1.5 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--chat-template-kwargs &lt;span class=&#34;s1&#34;&gt;&amp;#39;{&amp;#34;enable_thinking&amp;#34;:false}&amp;#39;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--port &lt;span class=&#34;m&#34;&gt;8080&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After disabling Thinking, simple Q&amp;amp;A, writing, code completion, and code explanation become faster. For complex algorithm design, difficult debugging, and architecture analysis, Thinking mode is still recommended.&lt;/p&gt;
&lt;h2 id=&#34;install-hermes-agent&#34;&gt;Install Hermes Agent
&lt;/h2&gt;&lt;p&gt;Keep &lt;code&gt;llama-server&lt;/code&gt; running, then open a new WSL2 terminal and install Hermes Agent:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The installer handles dependencies such as Python, Node.js, ripgrep, and ffmpeg. When configuring the model endpoint, choose a custom endpoint:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;URL: http://localhost:8080/v1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;API Key: 12345678
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Model: auto-detect
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For a local &lt;code&gt;llama-server&lt;/code&gt;, the API Key can be any placeholder value. After configuration, you can connect Telegram, WeChat, QQ, Discord, and other chat tools, allowing Hermes Agent to call the local model and execute tasks from those entry points.&lt;/p&gt;
&lt;h2 id=&#34;auto-start-the-model-service&#34;&gt;Auto-Start the Model Service
&lt;/h2&gt;&lt;p&gt;You can write a startup script so the model service starts automatically when a WSL2 terminal opens.&lt;/p&gt;
&lt;p&gt;Create the script:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cat &amp;gt; ~/start-llm.sh &lt;span class=&#34;s&#34;&gt;&amp;lt;&amp;lt; &amp;#39;EOF&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;#!/bin/bash
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;echo &amp;#34;Starting Qwen3.6-27B llama-server...&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;~/llama.cpp/build/bin/llama-server \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--n-gpu-layers 99 \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--ctx-size 65536 \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--flash-attn on \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--temp 1.0 \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--top-p 0.95 \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--top-k 20 \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--presence-penalty 1.5 \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--port 8080 \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;--host 0.0.0.0 &amp;amp;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;echo &amp;#34;llama-server started, PID: $!&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;echo &amp;#34;API: http://localhost:8080/v1&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;echo &amp;#34;Chat UI: http://localhost:8080&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;EOF&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;chmod +x ~/start-llm.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Write it into &lt;code&gt;.bashrc&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;# Auto-start llama-server&amp;#39;&lt;/span&gt; &amp;gt;&amp;gt; ~/.bashrc
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;if ! pgrep -f &amp;#34;llama-server&amp;#34; &amp;gt; /dev/null 2&amp;gt;&amp;amp;1; then&amp;#39;&lt;/span&gt; &amp;gt;&amp;gt; ~/.bashrc
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;    ~/start-llm.sh&amp;#39;&lt;/span&gt; &amp;gt;&amp;gt; ~/.bashrc
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;fi&amp;#39;&lt;/span&gt; &amp;gt;&amp;gt; ~/.bashrc
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Each time you open a WSL2 terminal, it will start &lt;code&gt;llama-server&lt;/code&gt; if it is not already running. If it is running, it skips startup and avoids duplicate processes.&lt;/p&gt;
&lt;h2 id=&#34;notes&#34;&gt;Notes
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;27B models require substantial VRAM; 24GB VRAM is more stable. Use a smaller model if VRAM is limited.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--ctx-size 65536&lt;/code&gt; significantly increases VRAM and RAM pressure. If unstable, reduce it to &lt;code&gt;32768&lt;/code&gt; or lower.&lt;/li&gt;
&lt;li&gt;Both CUDA Toolkit in WSL2 and the Windows GPU driver must work properly. Either side can cause CUDA compilation or runtime failures.&lt;/li&gt;
&lt;li&gt;Hermes Agent calls the local service through an OpenAI-compatible API. The key is that &lt;code&gt;http://localhost:8080/v1&lt;/code&gt; responds correctly.&lt;/li&gt;
&lt;li&gt;If accessing from a phone or another device, handle Windows Firewall, LAN addresses, and security isolation. Do not expose the local model service directly to the public internet.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;related-links&#34;&gt;Related Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Original article: &lt;a class=&#34;link&#34; href=&#34;https://www.freedidi.com/24036.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hermes + Qwen3.6：本地最强 Agent 组合！零成本、无限 Token，太香了！&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;llama.cpp: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggerganov/llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ggerganov/llama.cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hermes Agent: &lt;a class=&#34;link&#34; href=&#34;https://github.com/NousResearch/hermes-agent&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NousResearch/hermes-agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Qwen3.6 GGUF example: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/Qwen3.6-27B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/Qwen3.6-27B-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>NVIDIA Releases Nemotron 3 Nano Omni: An Open Omnimodal Reasoning Model for Agents</title>
        <link>https://knightli.com/en/2026/05/01/nvidia-nemotron-3-nano-omni-multimodal-agents/</link>
        <pubDate>Fri, 01 May 2026 12:07:15 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/nvidia-nemotron-3-nano-omni-multimodal-agents/</guid>
        <description>&lt;p&gt;NVIDIA has released &lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt;, an open omnimodal reasoning model designed for agent workflows.
Its focus is not simply text question answering, but putting language, vision, and audio into the same reasoning framework so the model can handle inputs that are closer to real work.&lt;/p&gt;
&lt;p&gt;In positioning, &lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt; looks more like a foundation model prepared for AI Agents.
It can understand information from screens, documents, images, speech, and video, then turn that information into actionable reasoning results.
This kind of capability fits computer operation, document intelligence, video understanding, voice interaction, customer service, education, and enterprise process automation.&lt;/p&gt;
&lt;h2 id=&#34;model-specs&#34;&gt;Model Specs
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt; uses a MoE architecture.
The key specs NVIDIA lists are:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Item&lt;/th&gt;
          &lt;th&gt;Information&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Model name&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Architecture&lt;/td&gt;
          &lt;td&gt;MoE&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Parameter scale&lt;/td&gt;
          &lt;td&gt;30B total / 3B active&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Modalities&lt;/td&gt;
          &lt;td&gt;Text, image, audio, video&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Context length&lt;/td&gt;
          &lt;td&gt;256K tokens&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;License&lt;/td&gt;
          &lt;td&gt;Apache 2.0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Main deployment direction&lt;/td&gt;
          &lt;td&gt;AI Agents, multimodal reasoning, enterprise agents&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The most notable point here is &lt;code&gt;30B-A3B&lt;/code&gt;.
It means the model has about 30B total parameters, but only activates about 3B parameters during each inference step.
This is a tradeoff between capability and inference cost: the model keeps a larger expert capacity while using only part of it at runtime.&lt;/p&gt;
&lt;p&gt;That said, MoE &lt;code&gt;active params&lt;/code&gt; does not mean VRAM can be estimated as if this were only a 3B model.
A full deployment still needs to account for expert weights, KV cache, vision and audio encoder modules, context length, and inference framework overhead.&lt;/p&gt;
&lt;h2 id=&#34;it-is-not-solving-a-single-modality-problem&#34;&gt;It Is Not Solving a Single-Modality Problem
&lt;/h2&gt;&lt;p&gt;Traditional large language models mainly process text.
Multimodal models add image understanding.
&lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt; has a broader target: it emphasizes omnimodal input, meaning text, images, audio, and video are all brought into a unified reasoning process.&lt;/p&gt;
&lt;p&gt;This matters a lot for agents.
Real agent tasks are often not &amp;ldquo;take a piece of text and generate another piece of text&amp;rdquo;; they are more like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reading buttons, tables, and windows on a screen;&lt;/li&gt;
&lt;li&gt;parsing PDFs, screenshots, charts, and webpages;&lt;/li&gt;
&lt;li&gt;listening to spoken instructions or meeting recordings;&lt;/li&gt;
&lt;li&gt;understanding actions, scenes, and timing in video;&lt;/li&gt;
&lt;li&gt;combining those signals into the next operation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If a model can only handle one modality, an Agent needs extra glue between multiple specialized models.
The value of an omnimodal model is reducing that integration cost and letting the same model directly process more complex environmental inputs.&lt;/p&gt;
&lt;h2 id=&#34;built-for-computer-operation-and-document-intelligence&#34;&gt;Built for Computer Operation and Document Intelligence
&lt;/h2&gt;&lt;p&gt;NVIDIA specifically notes that &lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt; can be used for computer-operation tasks.
These tasks usually require the model to understand user interfaces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;what controls are on the screen;&lt;/li&gt;
&lt;li&gt;what state the current window is in;&lt;/li&gt;
&lt;li&gt;which button or menu is the next target;&lt;/li&gt;
&lt;li&gt;what the content in tables, dialogs, and input boxes means.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is also one of the hard-to-avoid capabilities when AI Agents move into real deployment.
If an agent is going to help people operate office software, browsers, enterprise backends, or developer tools, it has to understand the interface, not just read API docs.&lt;/p&gt;
&lt;p&gt;Document intelligence follows a similar logic.
Enterprise materials often mix text, tables, images, scanned pages, and charts.
An omnimodal model can put all of that content into the same context for understanding, making it suitable for contract review, report analysis, invoice processing, knowledge-base QA, and process automation.&lt;/p&gt;
&lt;h2 id=&#34;audio-and-video-bring-agents-closer-to-real-scenarios&#34;&gt;Audio and Video Bring Agents Closer to Real Scenarios
&lt;/h2&gt;&lt;p&gt;Audio and video inputs can noticeably expand the range of agent applications.&lt;/p&gt;
&lt;p&gt;Audio scenarios include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;meeting recording summaries;&lt;/li&gt;
&lt;li&gt;customer service call analysis;&lt;/li&gt;
&lt;li&gt;voice command understanding;&lt;/li&gt;
&lt;li&gt;education and training content organization.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Video scenarios include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;instructional video understanding;&lt;/li&gt;
&lt;li&gt;security and industrial inspection;&lt;/li&gt;
&lt;li&gt;screen recording analysis;&lt;/li&gt;
&lt;li&gt;operation workflow review;&lt;/li&gt;
&lt;li&gt;temporal reasoning in multi-step tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If these tasks rely only on text transcription, a lot of visual and timing information is lost.
An omnimodal model can directly combine voice, frames, and textual clues, giving Agents a more complete sense of their environment.&lt;/p&gt;
&lt;h2 id=&#34;deployment-and-ecosystem&#34;&gt;Deployment and Ecosystem
&lt;/h2&gt;&lt;p&gt;NVIDIA is placing &lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt; inside an open ecosystem, and the model uses the Apache 2.0 license.
That matters for developers and enterprises because it lowers the licensing barrier for experimentation, integration, and secondary development.&lt;/p&gt;
&lt;p&gt;From NVIDIA&amp;rsquo;s introduction, this model is also closely tied to its inference ecosystem.
For enterprise users, real deployment usually raises questions like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether it can run efficiently on NVIDIA GPUs;&lt;/li&gt;
&lt;li&gt;whether it supports long context and multimodal input;&lt;/li&gt;
&lt;li&gt;whether it can connect to existing Agent frameworks;&lt;/li&gt;
&lt;li&gt;whether it can process internal documents, audio/video, and UI screenshots;&lt;/li&gt;
&lt;li&gt;whether it can be deployed in private environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;NVIDIA emphasizes that the model has a clear throughput advantage and says it can reach up to 9x the throughput of comparable open omnimodal reasoning models.
The real value of that number still depends on the specific hardware, context length, input modalities, and inference framework.
But the direction is clear: NVIDIA wants to bring open multimodal models and its inference infrastructure together into enterprise Agent scenarios.&lt;/p&gt;
&lt;h2 id=&#34;suitable-use-cases&#34;&gt;Suitable Use Cases
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt; is better suited to tasks such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Agents that need to understand text, images, audio, and video at the same time;&lt;/li&gt;
&lt;li&gt;enterprise document intelligence and knowledge-base QA;&lt;/li&gt;
&lt;li&gt;computer operation based on screenshots or web interfaces;&lt;/li&gt;
&lt;li&gt;multimodal analysis of meetings, customer service, and teaching content;&lt;/li&gt;
&lt;li&gt;video understanding, workflow review, and temporal reasoning;&lt;/li&gt;
&lt;li&gt;teams that require open licensing and private deployment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is not necessarily a fit for every regular user.
If the task is local chat, code completion, or simple QA, a single-modality language model may be lighter, faster, and more resource-efficient.
The value of &lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt; mainly appears in complex input and multimodal Agent workflows.&lt;/p&gt;
&lt;h2 id=&#34;what-this-means-for-ai-agents&#34;&gt;What This Means for AI Agents
&lt;/h2&gt;&lt;p&gt;For AI Agents to truly enter work scenarios, they cannot only write text.
They need to understand interfaces, speech, documents, and changes in video, then turn that information into the next action.&lt;/p&gt;
&lt;p&gt;That is where &lt;code&gt;Nemotron 3 Nano Omni&lt;/code&gt; matters.
It is not simply making the model larger; it is unifying the many kinds of input Agents face into one reasoning model.
This can make it easier for developers to build agents for real tasks instead of building only around chat windows.&lt;/p&gt;
&lt;p&gt;From this angle, the point of NVIDIA&amp;rsquo;s release is not just &amp;ldquo;another multimodal model&amp;rdquo;.
It is part of a continuing effort to connect open models, GPU inference, enterprise Agents, and private deployment.
What will be worth watching next is how it performs in concrete Agent frameworks, enterprise workflows, and local deployments.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://blogs.nvidia.cn/blog/nemotron-3-nano-omni-multimodal-ai-agents/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA Technical Blog: NVIDIA Nemotron 3 Nano Omni&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models</title>
        <link>https://knightli.com/en/2026/05/01/qwen3-6-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 12:02:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/qwen3-6-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;The Qwen3.6 open-weight models that are most relevant for local deployment are mainly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt;: a 27B dense model.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt;: a 35B total / 3B active MoE model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are also online product or API model names such as &lt;code&gt;Qwen3.6-Plus&lt;/code&gt; and &lt;code&gt;Qwen3.6-Max&lt;/code&gt;.
If a model does not have public full weights and stable quantized files, it is not suitable for a local VRAM table.
This article only covers versions that can be deployed around Hugging Face weights and GGUF quantized files.&lt;/p&gt;
&lt;p&gt;As with the Gemma 4 table in &lt;code&gt;/05/10&lt;/code&gt;, two concepts need to be separated first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GGUF file size&lt;/strong&gt;: how large the model weight file is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actual VRAM usage&lt;/strong&gt;: affected by weights, KV cache, context length, runtime backend, multimodal modules, and batch size.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Qwen3.6 has a very long default context. The official model card states native support for &lt;code&gt;262,144&lt;/code&gt; tokens and extension to &lt;code&gt;1,010,000&lt;/code&gt; tokens.
So the “minimum VRAM” column below only applies to short or medium context.
If you really want 128K, 256K, or longer context, reserve much more room for KV cache.&lt;/p&gt;
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM&lt;/th&gt;
          &lt;th&gt;Good Fit&lt;/th&gt;
          &lt;th&gt;Avoid&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;8GB&lt;/td&gt;
          &lt;td&gt;Extreme 2-bit tests for 27B / 35B-A3B, with clear quality risk&lt;/td&gt;
          &lt;td&gt;Q4 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;12GB&lt;/td&gt;
          &lt;td&gt;27B Q2/Q3, 35B-A3B Q2/Q3 with short context&lt;/td&gt;
          &lt;td&gt;27B Q4 with long context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16GB&lt;/td&gt;
          &lt;td&gt;27B Q3/Q4, 35B-A3B Q3/IQ4_XS&lt;/td&gt;
          &lt;td&gt;35B-A3B Q4 with long context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;27B Q4/Q5/Q6, 35B-A3B Q4&lt;/td&gt;
          &lt;td&gt;35B-A3B Q8, BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32GB&lt;/td&gt;
          &lt;td&gt;27B Q8, 35B-A3B Q5/Q6&lt;/td&gt;
          &lt;td&gt;BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;35B-A3B Q8, 27B with longer context more comfortably&lt;/td&gt;
          &lt;td&gt;35B-A3B BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB+&lt;/td&gt;
          &lt;td&gt;27B / 35B-A3B BF16&lt;/td&gt;
          &lt;td&gt;No need to chase BF16 for ordinary local chat&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you have a 24GB GPU, focus on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-27B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-27B Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-35B-A3B UD-Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only have 16GB VRAM, start with low-bit variants and do not enable very long context right away.&lt;/p&gt;
&lt;h2 id=&#34;official-weight-sizes&#34;&gt;Official Weight Sizes
&lt;/h2&gt;&lt;p&gt;The following BF16 weight sizes come from &lt;code&gt;model.safetensors.index.json&lt;/code&gt; in the official Hugging Face repositories.
They are useful as a reference for the original model scale.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Architecture&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Official BF16 Weight Size&lt;/th&gt;
          &lt;th&gt;Official Context&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;27B dense&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;55.56GB&lt;/td&gt;
          &lt;td&gt;Native 262K, extendable to 1,010K&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;35B total / 3B active MoE&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;71.90GB&lt;/td&gt;
          &lt;td&gt;Native 262K, extendable to 1,010K&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Although &lt;code&gt;35B-A3B&lt;/code&gt; activates about 3B parameters per step, it still needs to load the full MoE weights.
So it should not be estimated like a 3B small model.&lt;/p&gt;
&lt;h2 id=&#34;qwen36-27b-vram-table&#34;&gt;Qwen3.6-27B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt; is a dense model. Its advantage is stable behavior, while its inference cost is closer to a traditional 27B model.
For local deployment, it is more compute-heavy than 35B-A3B, but its VRAM requirements are easier to estimate.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.39GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.85GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.85GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;Low-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ3_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.99GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;VRAM-saving 3-bit&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.36GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;3-bit entry point&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.59GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;Common 3-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;15.44GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Near-Q4, more VRAM efficient&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_NL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.07GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.82GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Recommended 27B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;19.51GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.52GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28.60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td&gt;Research, evaluation, precision comparison&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For ordinary local coding and chat, &lt;code&gt;Q4_K_M&lt;/code&gt; is the easiest starting point to recommend.
A 24GB GPU can run &lt;code&gt;Q4_K_M&lt;/code&gt; fairly comfortably, but for long context, reduce quantization size or context length.&lt;/p&gt;
&lt;h2 id=&#34;qwen36-35b-a3b-vram-table&#34;&gt;Qwen3.6-35B-A3B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt; is an MoE model with 35B total parameters and about 3B active parameters per step.
Its advantage is a strong balance between speed and capability, especially for local agents, tool use, and coding workflows.&lt;/p&gt;
&lt;p&gt;But note that MoE &lt;code&gt;3B active&lt;/code&gt; mainly affects compute. It does not mean VRAM usage is comparable to a 3B model.
Full operation still needs the expert weights.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.76GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.52GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.29GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;Low-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ3_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.21GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;VRAM-saving 3-bit&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q3_K_S&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;15.36GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;3-bit entry point&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Common 3-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;17.73GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ4_NL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18.04GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Near-Q4 recommended option&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.13GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Recommended 35B-A3B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;26.46GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;29.31GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;36.90GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;69.37GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Research, evaluation, precision comparison&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;With 24GB VRAM, &lt;code&gt;UD-Q4_K_M&lt;/code&gt; is a key option, but do not set the context too high.
If you want room for 128K+ context, &lt;code&gt;UD-IQ4_XS&lt;/code&gt;, &lt;code&gt;UD-IQ4_NL&lt;/code&gt;, or 3-bit versions are more realistic.&lt;/p&gt;
&lt;h2 id=&#34;27b-vs-35b-a3b&#34;&gt;27B vs 35B-A3B
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Need&lt;/th&gt;
          &lt;th&gt;Better Choice&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Stable dense-model behavior&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Faster response, agents, and tool use&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Daily local use on 24GB VRAM&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;35B-A3B UD-Q4_K_M&lt;/code&gt; or &lt;code&gt;27B Q4_K_M&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Testing on 16GB VRAM&lt;/td&gt;
          &lt;td&gt;Use 2-bit/3-bit for both; avoid long context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Long context first&lt;/td&gt;
          &lt;td&gt;Use lower-bit quantization and leave more KV cache room&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quality first with 32GB+ VRAM&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;27B Q5/Q6&lt;/code&gt; or &lt;code&gt;35B-A3B Q5/Q6&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you mainly write code, run agents, or use tools, &lt;code&gt;35B-A3B&lt;/code&gt; is worth trying first.
If you care more about dense-model stability and consistency, &lt;code&gt;27B&lt;/code&gt; is more straightforward.&lt;/p&gt;
&lt;h2 id=&#34;why-long-context-uses-so-much-vram&#34;&gt;Why Long Context Uses So Much VRAM
&lt;/h2&gt;&lt;p&gt;The Qwen3.6 model card recommends keeping longer context for complex tasks and even notes that 128K+ context can help reasoning.
But for local deployment, long context means a much larger &lt;code&gt;KV cache&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Actual VRAM usage is affected by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;KV cache&lt;/code&gt;: longer context means higher usage.&lt;/li&gt;
&lt;li&gt;Whether vision input is enabled: Qwen3.6 includes a vision encoder, and multimodal use adds overhead.&lt;/li&gt;
&lt;li&gt;Whether &lt;code&gt;--language-model-only&lt;/code&gt; is used: in runtimes such as vLLM, skipping vision can free memory for KV cache.&lt;/li&gt;
&lt;li&gt;Batch size and concurrency: more concurrency requires more VRAM.&lt;/li&gt;
&lt;li&gt;KV cache quantization: &lt;code&gt;q8_0&lt;/code&gt;, &lt;code&gt;q4_0&lt;/code&gt;, and similar settings can save VRAM, but may affect details.&lt;/li&gt;
&lt;li&gt;Runtime differences: llama.cpp, vLLM, SGLang, KTransformers, and LM Studio do not use exactly the same amount of memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So do not look only at GGUF file size.
If the file is already close to the VRAM limit, the model may load but still OOM when generating long outputs or using long context.&lt;/p&gt;
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you just want to try Qwen3.6 locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;12GB VRAM: try &lt;code&gt;27B UD-IQ2_M&lt;/code&gt; or &lt;code&gt;35B-A3B UD-IQ2_M&lt;/code&gt;, with short context.&lt;/li&gt;
&lt;li&gt;16GB VRAM: try &lt;code&gt;27B Q3_K_M&lt;/code&gt; or &lt;code&gt;35B-A3B UD-IQ3_XXS&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;24GB VRAM: prefer &lt;code&gt;27B Q4_K_M&lt;/code&gt;, &lt;code&gt;35B-A3B UD-IQ4_NL&lt;/code&gt;, or &lt;code&gt;35B-A3B UD-Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;32GB VRAM: consider &lt;code&gt;27B Q5/Q6&lt;/code&gt; or &lt;code&gt;35B-A3B Q5/Q6&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;48GB and above: try &lt;code&gt;Q8_0&lt;/code&gt;, or reserve more room for long context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most users do not need BF16.
The point of local Qwen3.6 deployment is not to choose the largest file, but to balance VRAM, context length, speed, and output quality.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-27B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-27B - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-35B-A3B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-35B-A3B - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-27B-FP8&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-27B-FP8 - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-35B-A3B-FP8 - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/Qwen3.6-27B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/Qwen3.6-27B-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/Qwen3.6-35B-A3B-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running DeepSeek V4 Locally: VRAM Estimates for Pro, Flash, and Base Versions</title>
        <link>https://knightli.com/en/2026/05/01/deepseek-v4-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 11:55:25 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/deepseek-v4-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;DeepSeek V4 and Gemma 4 are not in the same class for local deployment.
With Gemma 4, it still makes sense to discuss how to run 26B or 31B models on 24GB or 32GB GPUs. DeepSeek V4 is a huge MoE model, and full local deployment quickly moves into multi-GPU workstation or server territory.&lt;/p&gt;
&lt;p&gt;The official DeepSeek V4 Preview release mainly includes two inference models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Pro&lt;/code&gt;: &lt;code&gt;1.6T total / 49B active params&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Flash&lt;/code&gt;: &lt;code&gt;284B total / 13B active params&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The official Hugging Face collection also includes two Base models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Pro-Base&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Flash-Base&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This article only discusses rough VRAM requirements when the full model weights are loaded.
For MoE models, &lt;code&gt;active params&lt;/code&gt; mainly affects per-token compute. It does not mean only those parameters need to be loaded.
Without expert-on-demand loading, CPU/NVMe offload, distributed inference, or specialized runtime optimizations, VRAM should still be estimated from the full weight size.&lt;/p&gt;
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM Scale&lt;/th&gt;
          &lt;th&gt;What Is Realistic&lt;/th&gt;
          &lt;th&gt;Do Not Expect&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;Cannot fully run DeepSeek V4; use smaller distilled models or API&lt;/td&gt;
          &lt;td&gt;Full V4-Flash / V4-Pro local loading&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;Still not suitable for full loading; good for small models or remote API clients&lt;/td&gt;
          &lt;td&gt;Stable V4-Flash Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB&lt;/td&gt;
          &lt;td&gt;Theoretically try V4-Flash Q2/Q3 or heavy offload&lt;/td&gt;
          &lt;td&gt;V4-Pro&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;128GB&lt;/td&gt;
          &lt;td&gt;V4-Flash Q4 becomes more realistic; Q5/Q6 still tight&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;192GB&lt;/td&gt;
          &lt;td&gt;V4-Flash FP8/Q6 is more comfortable; Pro Q2 enters experimental range&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;256GB&lt;/td&gt;
          &lt;td&gt;V4-Flash FP8 is fairly comfortable; Pro Q2/Q3 can be tested&lt;/td&gt;
          &lt;td&gt;V4-Pro Q5 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;512GB&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4 starts to become discussable&lt;/td&gt;
          &lt;td&gt;V4-Pro FP8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;1TB+&lt;/td&gt;
          &lt;td&gt;V4-Pro FP8 and low-bit Pro-Base are more realistic&lt;/td&gt;
          &lt;td&gt;Low-cost single-machine deployment&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;2TB+&lt;/td&gt;
          &lt;td&gt;Pro-Base FP8 class&lt;/td&gt;
          &lt;td&gt;Ordinary workstation deployment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your goal is to run a model on a personal computer, DeepSeek V4 is not the right target.
More realistic options are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the official DeepSeek API or compatible services.&lt;/li&gt;
&lt;li&gt;Wait for stable community GGUF/EXL2/MLX quantizations and inference support.&lt;/li&gt;
&lt;li&gt;Use smaller DeepSeek distilled models.&lt;/li&gt;
&lt;li&gt;Use local models in the 7B to 70B range from Qwen, Gemma, Llama, and similar families.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;official-weight-sizes&#34;&gt;Official Weight Sizes
&lt;/h2&gt;&lt;p&gt;The following figures come from &lt;code&gt;model.safetensors.index.json&lt;/code&gt; in the official Hugging Face repositories.
They reflect current public weight file sizes, not full runtime VRAM use under long context.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Parameter Scale&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Official Weight Size&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Flash&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;284B total / 13B active&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;159.61GB&lt;/td&gt;
          &lt;td&gt;Inference model, smallest in this group&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Pro&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;1.6T total / 49B active&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;864.70GB&lt;/td&gt;
          &lt;td&gt;Inference model, stronger but enormous&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Flash-Base&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;284B total&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;294.67GB&lt;/td&gt;
          &lt;td&gt;Base model, closer to full FP8 weight size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Pro-Base&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;1.6T total&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1606.03GB&lt;/td&gt;
          &lt;td&gt;Base model, about 1.6TB&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even the smallest &lt;code&gt;V4-Flash&lt;/code&gt; is already close to 160GB of official weights.
That is why it should not be treated like a 13B model just because it has &lt;code&gt;13B active params&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-flash-vram-estimate&#34;&gt;DeepSeek V4 Flash VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Flash&lt;/code&gt; is the most approachable DeepSeek V4 variant for local experiments.
But that only means “more approachable than Pro”; it is still not a consumer single-GPU model.&lt;/p&gt;
&lt;p&gt;The table below uses the official 159.61GB weight size as the baseline.
Q4/Q3/Q2 rows are bit-width estimates and do not imply that stable official GGUF versions currently exist.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;159.61GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td&gt;Multi-GPU servers, inference service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;120GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td&gt;Quality-first quantization tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td&gt;More realistic starting point for Flash&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Large-VRAM single GPU or multi-GPU tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td&gt;Extreme low-bit experiments with clear quality risk&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If mature &lt;code&gt;V4-Flash Q4&lt;/code&gt; builds appear later, it still probably will not be a 24GB GPU model.
A more realistic starting point is 96GB to 128GB total VRAM, or CPU/offload setups that trade speed for capacity.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-pro-vram-estimate&#34;&gt;DeepSeek V4 Pro VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Pro&lt;/code&gt; is the flagship inference model, with official weights around 864.70GB.
Even at 4-bit quantization, the full weights remain in the hundreds of GB.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;864.70GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB+&lt;/td&gt;
          &lt;td&gt;Multi-node or multi-GPU inference service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;648GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td&gt;High-quality quantized service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;540GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td&gt;Quality/cost balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;432GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td&gt;Lowest practical quality line for Pro&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;324GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;384GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td&gt;Low-bit experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;216GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;320GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments with high quality and stability risk&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For individual users, &lt;code&gt;V4-Pro&lt;/code&gt; is better consumed through an API.
If the goal is full local deployment, treat it as a multi-GPU server model, not a 4090, 5090, or RTX PRO single-GPU model.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-flash-base-vram-estimate&#34;&gt;DeepSeek V4 Flash-Base VRAM Estimate
&lt;/h2&gt;&lt;p&gt;Base models are usually for research, fine-tuning, or continued training, not ordinary chat deployment.
&lt;code&gt;V4-Flash-Base&lt;/code&gt; has official weights of about 294.67GB.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;294.67GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;384GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td&gt;Research, preprocessing, evaluation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;221GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;320GB&lt;/td&gt;
          &lt;td&gt;High-quality quantization research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;184GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;224GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;147GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;224GB&lt;/td&gt;
          &lt;td&gt;Lower-cost Base experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;111GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td&gt;Low-bit experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you only want to use DeepSeek V4 capabilities, do not start with the Base model.
Base models cost more to deploy and tune; most applications should use the inference model or API.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-pro-base-vram-estimate&#34;&gt;DeepSeek V4 Pro-Base VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Pro-Base&lt;/code&gt; is the heaviest variant, with official weights around 1606.03GB.
That is already a 1.6TB-class model file.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1606.03GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.4TB+&lt;/td&gt;
          &lt;td&gt;Large-scale research clusters&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1205GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.5TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2TB&lt;/td&gt;
          &lt;td&gt;High-quality quantization research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1004GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.5TB&lt;/td&gt;
          &lt;td&gt;Research and evaluation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;803GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB&lt;/td&gt;
          &lt;td&gt;Low-bit research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;602GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td&gt;Extreme low-bit research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;402GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This kind of model should not be discussed in the framework of “can a home GPU run it?”
Even Q4 is already beyond the comfortable range of most single-machine workstations.&lt;/p&gt;
&lt;h2 id=&#34;why-active-params-are-not-enough&#34;&gt;Why Active Params Are Not Enough
&lt;/h2&gt;&lt;p&gt;DeepSeek V4 is an MoE model.
MoE means each token activates only part of the experts, so compute is much lower than the total parameter count.
But this does not mean VRAM only needs to hold the active parameters.&lt;/p&gt;
&lt;p&gt;Full local inference also depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether all expert weights must stay resident on GPU.&lt;/li&gt;
&lt;li&gt;Whether on-demand expert loading is supported.&lt;/li&gt;
&lt;li&gt;CPU memory to GPU memory transfer costs.&lt;/li&gt;
&lt;li&gt;NVMe offload latency.&lt;/li&gt;
&lt;li&gt;KV cache growth under long context.&lt;/li&gt;
&lt;li&gt;Extra runtime overhead under 1M context.&lt;/li&gt;
&lt;li&gt;Multi-node and multi-GPU communication cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;V4-Pro&lt;/code&gt; with &lt;code&gt;49B active&lt;/code&gt; should not be deployed like a 49B model.
&lt;code&gt;V4-Flash&lt;/code&gt; with &lt;code&gt;13B active&lt;/code&gt; should not be treated like a 13B small model either.&lt;/p&gt;
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you are an ordinary individual user:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do not try to fully self-host DeepSeek V4.&lt;/li&gt;
&lt;li&gt;Use the official API when you need DeepSeek V4 capabilities.&lt;/li&gt;
&lt;li&gt;For private local deployment, first check whether you have mature inference infrastructure or internal multi-GPU servers.&lt;/li&gt;
&lt;li&gt;With only 24GB to 48GB VRAM, 7B, 14B, 32B, or 70B quantized models are more practical.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have 128GB to 256GB total VRAM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Watch for stable community implementations of &lt;code&gt;V4-Flash Q4/Q5&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Do not treat &lt;code&gt;V4-Pro&lt;/code&gt; as your main local model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have 512GB+ total VRAM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;V4-Pro Q4&lt;/code&gt; starts to become an engineering validation target.&lt;/li&gt;
&lt;li&gt;You still need to care about inference framework support, expert scheduling, KV cache, throughput, and concurrency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key question for DeepSeek V4 local deployment is not “which quantized file should I download?”
It is “do I have the system-level inference capacity for this model?”
It is closer to a server model than a desktop model.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://api-docs.deepseek.com/news/news260424&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek V4 Preview Release - DeepSeek API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/deepseek-ai/deepseek-v4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek-V4 collection - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Pro - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Flash - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-Base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Pro-Base - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Flash-Base - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models</title>
        <link>https://knightli.com/en/2026/05/01/gemma-4-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 11:42:34 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/gemma-4-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;Gemma 4 currently has four main sizes for local deployment: &lt;code&gt;E2B&lt;/code&gt;, &lt;code&gt;E4B&lt;/code&gt;, &lt;code&gt;26B A4B&lt;/code&gt;, and &lt;code&gt;31B&lt;/code&gt;.
&lt;code&gt;E2B&lt;/code&gt; and &lt;code&gt;E4B&lt;/code&gt; target lightweight and edge devices, &lt;code&gt;26B A4B&lt;/code&gt; uses an MoE architecture, and &lt;code&gt;31B&lt;/code&gt; is the larger dense model.&lt;/p&gt;
&lt;p&gt;The easiest mistake in local inference is mixing up two numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GGUF file size&lt;/strong&gt;: how large the model weight file is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actual VRAM usage&lt;/strong&gt;: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tables below estimate VRAM requirements based on GGUF file size.
The default assumption is local text inference with &lt;code&gt;llama.cpp&lt;/code&gt;, LM Studio, Ollama, or similar runtimes, using short to medium context.
If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.&lt;/p&gt;
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM&lt;/th&gt;
          &lt;th&gt;Good Fit&lt;/th&gt;
          &lt;th&gt;Avoid&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;4GB&lt;/td&gt;
          &lt;td&gt;Low-bit E2B quantizations&lt;/td&gt;
          &lt;td&gt;E4B and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;6GB&lt;/td&gt;
          &lt;td&gt;E2B Q4/Q5, low-bit E4B&lt;/td&gt;
          &lt;td&gt;26B, 31B&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;8GB&lt;/td&gt;
          &lt;td&gt;E2B Q8, E4B Q4/Q5&lt;/td&gt;
          &lt;td&gt;26B Q4, 31B Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;12GB&lt;/td&gt;
          &lt;td&gt;E4B Q8, low-quality 2-bit/3-bit 26B or 31B tests&lt;/td&gt;
          &lt;td&gt;26B Q4 with long context, 31B Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16GB&lt;/td&gt;
          &lt;td&gt;Low-bit 26B, low-bit 31B&lt;/td&gt;
          &lt;td&gt;31B Q4 with long context, 26B Q5 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;26B Q4/Q5, 31B Q4&lt;/td&gt;
          &lt;td&gt;31B Q8, BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32GB&lt;/td&gt;
          &lt;td&gt;26B Q6/Q8, 31B Q5/Q6&lt;/td&gt;
          &lt;td&gt;BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;31B Q8 more comfortably, 26B Q8 with longer context&lt;/td&gt;
          &lt;td&gt;31B BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB+&lt;/td&gt;
          &lt;td&gt;26B/31B BF16&lt;/td&gt;
          &lt;td&gt;Single consumer GPU deployment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you just want something usable locally, start with &lt;code&gt;E4B Q4_K_M&lt;/code&gt; or &lt;code&gt;E2B Q4_K_M&lt;/code&gt;.
With 24GB VRAM, &lt;code&gt;26B A4B Q4_K_M&lt;/code&gt; and &lt;code&gt;31B Q4_K_M&lt;/code&gt; start to become realistic choices.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-e2b-vram-table&#34;&gt;Gemma 4 E2B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;E2B&lt;/code&gt; is the lightest version, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing.
It is easy to run, but complex reasoning, coding, and long tasks are limited.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.29GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.54GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td&gt;Lightweight chat and summaries&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.98GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Balance of quality and size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.11GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Recommended E2B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.36GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Slightly steadier than Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.50GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td&gt;Higher-quality small model&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5.05GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td&gt;Near-original precision for lightweight deployment&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.31GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Debugging, comparison, research&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For daily use, &lt;code&gt;E2B Q4_K_M&lt;/code&gt; is already enough.
With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-e4b-vram-table&#34;&gt;Gemma 4 E4B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;E4B&lt;/code&gt; is the more practical lightweight model.
Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.53GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.06GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td&gt;Lightweight local assistant&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.72GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td&gt;Balance of quality and speed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.98GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td&gt;Recommended E4B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5.48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td&gt;Steadier everyday use&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;7.07GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.19GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;15.05GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Research, evaluation, precision comparison&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your GPU has 8GB VRAM, &lt;code&gt;E4B Q4_K_M&lt;/code&gt; is a realistic starting point.
With 12GB or 16GB VRAM, &lt;code&gt;E4B Q8_0&lt;/code&gt; is also worth considering.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-26b-a4b-vram-table&#34;&gt;Gemma 4 26B A4B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;26B A4B&lt;/code&gt; is the MoE version. It has a larger total parameter count, but activates only part of the experts during inference.
It is better suited to more complex Q&amp;amp;A, coding, tool use, and agent workflows.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.97GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme 16GB GPU tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.55GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Running 26B with low VRAM&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.53GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;Better quality while still VRAM-conscious&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.42GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Balance of quality and size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.87GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Recommended 26B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;21.15GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;23.17GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;26.86GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.51GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td&gt;Not realistic for most single consumer GPUs&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;24GB VRAM is the comfortable dividing line for 26B A4B.
A 16GB GPU can try low-bit versions, but context length, concurrency, and multimodal input should be kept modest.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-31b-vram-table&#34;&gt;Gemma 4 31B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;31B&lt;/code&gt; is the larger dense model.
Its strength is stronger overall capability, but its VRAM pressure is more direct than 26B A4B.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.53GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests with clear quality loss&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.75GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.77GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;16GB GPU experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.21GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;More VRAM-efficient 3-bit&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14.74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Common 3-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.37GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Near-Q4 compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18.32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Recommended 31B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;21.66GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;25.20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32.64GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;61.41GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Server or large-VRAM workstation&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point.
&lt;code&gt;Q4_K_M&lt;/code&gt; is the balanced choice, while &lt;code&gt;Q5_K_M&lt;/code&gt; and above make more sense with 32GB+ VRAM.&lt;/p&gt;
&lt;h2 id=&#34;why-actual-usage-is-higher-than-file-size&#34;&gt;Why Actual Usage Is Higher Than File Size
&lt;/h2&gt;&lt;p&gt;The GGUF file size is only the weight size.
Runtime usage also includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;KV cache&lt;/code&gt;: longer context means higher memory use.&lt;/li&gt;
&lt;li&gt;Batch size and concurrency: processing more tokens or more users increases VRAM.&lt;/li&gt;
&lt;li&gt;Multimodal components: image, audio, or video input often requires &lt;code&gt;mmproj&lt;/code&gt; or extra modules.&lt;/li&gt;
&lt;li&gt;Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.&lt;/li&gt;
&lt;li&gt;KV cache quantization: &lt;code&gt;q8_0&lt;/code&gt;, &lt;code&gt;q4_0&lt;/code&gt;, and similar modes can save VRAM, but may affect detail.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the “minimum VRAM” column should be read as the threshold for startup and short-context inference.
For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.&lt;/p&gt;
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you just want to try Gemma 4 locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4GB to 6GB VRAM: choose &lt;code&gt;E2B Q3_K_M&lt;/code&gt; or &lt;code&gt;E2B Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;8GB VRAM: prefer &lt;code&gt;E4B Q4_K_M&lt;/code&gt;; &lt;code&gt;E2B Q8_0&lt;/code&gt; is also fine.&lt;/li&gt;
&lt;li&gt;12GB VRAM: choose &lt;code&gt;E4B Q8_0&lt;/code&gt;, or try low-bit 26B/31B variants.&lt;/li&gt;
&lt;li&gt;16GB VRAM: try &lt;code&gt;26B A4B UD-Q3_K_M&lt;/code&gt; or &lt;code&gt;31B Q3_K_S&lt;/code&gt;, but do not expect long context to feel comfortable.&lt;/li&gt;
&lt;li&gt;24GB VRAM: focus on &lt;code&gt;26B A4B UD-Q4_K_M&lt;/code&gt; and &lt;code&gt;31B Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;32GB and above: consider &lt;code&gt;Q5_K_M&lt;/code&gt;, &lt;code&gt;Q6_K&lt;/code&gt;, or longer context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most users do not need BF16.
Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-4-E2B-it&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/gemma-4-E2B-it - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-4-E4B-it&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/gemma-4-E4B-it - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ggml-org/gemma-4-26B-A4B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-E2B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-E4B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-26B-A4B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-31B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-31B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Tune llama.cpp on 8GB VRAM: Why 32K Is Safer and 64K Needs KV Cache Quantization</title>
        <link>https://knightli.com/en/2026/04/23/llama-cpp-8g-vram-32k-64k-kv-cache-tuning/</link>
        <pubDate>Thu, 23 Apr 2026 12:13:04 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/23/llama-cpp-8g-vram-32k-64k-kv-cache-tuning/</guid>
        <description>&lt;p&gt;Whether &lt;code&gt;8GB&lt;/code&gt; of VRAM is enough to run local LLMs smoothly, especially under long-context workloads, is one of the most common questions people run into when using &lt;code&gt;llama.cpp&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There are three key takeaways worth remembering first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On &lt;code&gt;8GB&lt;/code&gt; VRAM, &lt;code&gt;32K&lt;/code&gt; context is usually the safer balance point&lt;/li&gt;
&lt;li&gt;If you really want to run &lt;code&gt;64K&lt;/code&gt;, &lt;code&gt;KV Cache&lt;/code&gt; quantization is often essential&lt;/li&gt;
&lt;li&gt;In full-GPU inference, blindly increasing &lt;code&gt;CPU&lt;/code&gt; thread count can actually make performance worse&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;1-first-what-do-32k-64k-and-kv-cache-actually-mean&#34;&gt;1. First, what do 32K, 64K, and KV Cache actually mean?
&lt;/h2&gt;&lt;p&gt;For many readers, these are the three terms that cause the most confusion.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;32K&lt;/code&gt; and &lt;code&gt;64K&lt;/code&gt; refer to context length, meaning how many &lt;code&gt;tokens&lt;/code&gt; the model can process at one time. Here, &lt;code&gt;K&lt;/code&gt; means thousand, so &lt;code&gt;32K&lt;/code&gt; is about &lt;code&gt;32000 tokens&lt;/code&gt;, and &lt;code&gt;64K&lt;/code&gt; is about &lt;code&gt;64000 tokens&lt;/code&gt;. The longer the context, the more prior content the model can see at once, which is useful for long-document QA, long conversations, and multi-step analysis.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;KV Cache&lt;/code&gt; is an intermediate-result cache that the model keeps in order to speed up autoregressive generation. You can think of it like this: once the model has already read and computed part of the context, it does not need to recompute everything from scratch every time. Instead, it stores key intermediate information and reuses it. The &lt;code&gt;K&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt; come from &lt;code&gt;Key&lt;/code&gt; and &lt;code&gt;Value&lt;/code&gt; in the Transformer architecture.&lt;/p&gt;
&lt;p&gt;Why do these three terms always appear together? Because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32K&lt;/code&gt; and &lt;code&gt;64K&lt;/code&gt; define how much content you want the model to remember at once&lt;/li&gt;
&lt;li&gt;&lt;code&gt;KV Cache&lt;/code&gt; determines how much extra VRAM is needed to maintain that memory&lt;/li&gt;
&lt;li&gt;The longer the context, the larger the &lt;code&gt;KV Cache&lt;/code&gt; usually becomes, and the higher the VRAM pressure gets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So when long-context inference slows down, the root problem is often not that the model is &amp;ldquo;bad at computing&amp;rdquo;, but that the cache has grown large enough to push VRAM to its limit.&lt;/p&gt;
&lt;h2 id=&#34;2-why-does-32k-perform-so-differently-from-64k&#34;&gt;2. Why does 32K perform so differently from 64K?
&lt;/h2&gt;&lt;p&gt;Using roughly &lt;code&gt;30000&lt;/code&gt; Chinese characters from &lt;em&gt;The Three-Body Problem&lt;/em&gt; as a stress-test input, the comparison between &lt;code&gt;32K&lt;/code&gt; and &lt;code&gt;64K&lt;/code&gt; context can look dramatic: with similar document size, &lt;code&gt;64K&lt;/code&gt; can become much slower and total runtime can increase significantly.&lt;/p&gt;
&lt;p&gt;The reason is not that the model suddenly becomes worse. The real issue is hitting the VRAM boundary.&lt;/p&gt;
&lt;p&gt;At &lt;code&gt;32K&lt;/code&gt;, model weights plus cache may still fit within &lt;code&gt;8GB&lt;/code&gt; VRAM, so most data traffic stays on the GPU&amp;rsquo;s own memory bandwidth. But once you move to &lt;code&gt;64K&lt;/code&gt;, the cache grows further, total memory use approaches or exceeds the VRAM ceiling, and part of the data gets pushed into shared or system memory.&lt;/p&gt;
&lt;p&gt;At that point, what collapses is not raw compute, but bandwidth.&lt;/p&gt;
&lt;p&gt;In other words, what looks like &amp;ldquo;context doubled and performance crashed&amp;rdquo; is often really a case of the data path falling out of VRAM and into much slower memory.&lt;/p&gt;
&lt;h2 id=&#34;3-if-you-want-64k-kv-cache-quantization-matters-a-lot&#34;&gt;3. If you want 64K, KV Cache quantization matters a lot
&lt;/h2&gt;&lt;p&gt;One of the most important conclusions for &lt;code&gt;8GB&lt;/code&gt; VRAM users is that &lt;code&gt;KV Cache&lt;/code&gt; quantization matters a great deal.&lt;/p&gt;
&lt;p&gt;Without changing the model itself, quantizing only the cache can directly reduce cache memory usage under long context. That means some of the data that previously spilled out of VRAM can move back into VRAM. As a result, &lt;code&gt;64K&lt;/code&gt; is still heavier than &lt;code&gt;32K&lt;/code&gt;, but it is less likely to fall into the slowest performance zone.&lt;/p&gt;
&lt;p&gt;Put simply:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32K&lt;/code&gt; is the more practical default range for &lt;code&gt;8GB&lt;/code&gt; VRAM&lt;/li&gt;
&lt;li&gt;&lt;code&gt;64K&lt;/code&gt; is not impossible&lt;/li&gt;
&lt;li&gt;But without cache quantization, performance can drop from &amp;ldquo;usable&amp;rdquo; to &amp;ldquo;hard to use&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your goal is stable long-context inference, the usual priority should be:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Check whether VRAM is already near its ceiling&lt;/li&gt;
&lt;li&gt;Decide whether to enable &lt;code&gt;KV Cache&lt;/code&gt; quantization&lt;/li&gt;
&lt;li&gt;Only then continue experimenting with more aggressive throughput settings&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;4-low-gpu-utilization-does-not-mean-the-gpu-is-idle&#34;&gt;4. Low GPU utilization does not mean the GPU is idle
&lt;/h2&gt;&lt;p&gt;This is a point that often breaks intuition.&lt;/p&gt;
&lt;p&gt;When people see only 20% or 30% &lt;code&gt;GPU&lt;/code&gt; usage in Task Manager, they often assume:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the parameters must be wrong&lt;/li&gt;
&lt;li&gt;the model is not really running on the GPU&lt;/li&gt;
&lt;li&gt;the GPU is not being used fully&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the more likely explanation in &lt;code&gt;llama.cpp&lt;/code&gt; inference is that the bottleneck is not core compute, but memory reads and writes.&lt;/p&gt;
&lt;p&gt;That means GPU cores may finish a batch of computation quickly, then spend the rest of the time waiting for the next batch of weights or cached data to arrive.&lt;/p&gt;
&lt;p&gt;So what you see becomes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;core utilization is not especially high&lt;/li&gt;
&lt;li&gt;but end-to-end speed still fails to improve&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is not the GPU being lazy. It is the data path being too narrow.&lt;/p&gt;
&lt;p&gt;That is why you should not look only at &lt;code&gt;GPU Usage&lt;/code&gt; when judging local LLM performance. VRAM capacity, memory bandwidth, and cache spillover often matter more.&lt;/p&gt;
&lt;h2 id=&#34;5-increasing-throughput-parameters-can-help-but-only-if-vram-can-handle-it&#34;&gt;5. Increasing throughput parameters can help, but only if VRAM can handle it
&lt;/h2&gt;&lt;p&gt;Another useful idea is this: if GPU cores are not fully saturated, maybe you can increase throughput-related parameters so the GPU processes more data at once and uses its parallelism more effectively.&lt;/p&gt;
&lt;p&gt;This can indeed improve speed.&lt;/p&gt;
&lt;p&gt;But there is an important condition: VRAM must still have headroom.&lt;/p&gt;
&lt;p&gt;Because once you increase throughput-related settings, you often also increase VRAM usage. If you are already in a &lt;code&gt;64K&lt;/code&gt; scenario with large cache and VRAM near exhaustion, pushing those parameters further can lead to two outcomes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a crash&lt;/li&gt;
&lt;li&gt;or a fallback into much slower shared-memory behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the safer sequence is usually not &amp;ldquo;max out the knobs first&amp;rdquo;, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;protect the VRAM boundary first&lt;/li&gt;
&lt;li&gt;then try throughput optimization&lt;/li&gt;
&lt;li&gt;after every change, check both speed and stability again&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;6-more-cpu-threads-are-not-always-better&#34;&gt;6. More CPU threads are not always better
&lt;/h2&gt;&lt;p&gt;This is one of the easiest traps to remember.&lt;/p&gt;
&lt;p&gt;It is very natural to assume that more threads should mean better speed. But in practice, once the model is already running mostly on the GPU, forcing &lt;code&gt;CPU&lt;/code&gt; thread count higher can make performance noticeably worse.&lt;/p&gt;
&lt;p&gt;The reason is straightforward.&lt;/p&gt;
&lt;p&gt;In full-GPU inference, the &lt;code&gt;CPU&lt;/code&gt; is more of a scheduler and preprocessing helper than the main compute engine. If you open too many threads, CPU-side thread contention, scheduling overhead, and context-switching costs all become heavier, which can disrupt the data flow that should have stayed smooth.&lt;/p&gt;
&lt;p&gt;The result is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;CPU&lt;/code&gt; looks busier&lt;/li&gt;
&lt;li&gt;but overall speed gets slower&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So in this kind of setup, default settings or lower thread counts are often more reliable than simply maxing everything out.&lt;/p&gt;
&lt;h2 id=&#34;7-a-more-practical-approach-for-8gb-vram-users&#34;&gt;7. A more practical approach for 8GB VRAM users
&lt;/h2&gt;&lt;p&gt;If we compress the conclusions above into a practical workflow, it looks roughly like this:&lt;/p&gt;
&lt;h3 id=&#34;1-treat-32k-as-the-default-goal&#34;&gt;1. Treat 32K as the default goal
&lt;/h3&gt;&lt;p&gt;If you only have an &lt;code&gt;8GB&lt;/code&gt; GPU, do not rush to chase &lt;code&gt;64K&lt;/code&gt;. &lt;code&gt;32K&lt;/code&gt; is usually the more realistic balance between speed, stability, and memory usage.&lt;/p&gt;
&lt;h3 id=&#34;2-if-you-want-64k-deal-with-the-cache-first&#34;&gt;2. If you want 64K, deal with the cache first
&lt;/h3&gt;&lt;p&gt;Do not start by asking whether you can squeeze out a little more speed. First confirm whether &lt;code&gt;KV Cache&lt;/code&gt; is quantized and whether VRAM is already near the limit.&lt;/p&gt;
&lt;h3 id=&#34;3-do-not-judge-everything-by-gpu-utilization&#34;&gt;3. Do not judge everything by GPU utilization
&lt;/h3&gt;&lt;p&gt;Low utilization does not necessarily mean the settings are wrong. It may simply mean memory bandwidth is the real bottleneck.&lt;/p&gt;
&lt;h3 id=&#34;4-throughput-optimization-is-valid-but-do-not-cross-the-vram-boundary&#34;&gt;4. Throughput optimization is valid, but do not cross the VRAM boundary
&lt;/h3&gt;&lt;p&gt;These parameters can help, but only if there is still enough VRAM headroom.&lt;/p&gt;
&lt;h3 id=&#34;5-be-conservative-with-cpu-threads-first&#34;&gt;5. Be conservative with CPU threads first
&lt;/h3&gt;&lt;p&gt;If the model is already running mostly on the GPU, higher CPU thread counts are not automatically better. Start with defaults or lower thread counts, then test gradually.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The most valuable part of this whole discussion is not just a few benchmark numbers, but the fact that it makes one easily overlooked truth much clearer:&lt;/p&gt;
&lt;p&gt;Local LLM tuning is often not about pushing every setting to the maximum. It is about understanding whether your real bottleneck is compute, VRAM capacity, memory bandwidth, or &lt;code&gt;CPU&lt;/code&gt; scheduling.&lt;/p&gt;
&lt;p&gt;For &lt;code&gt;8GB&lt;/code&gt; VRAM users, the safer strategy is usually not to force the longest possible context, but to protect the VRAM boundary first and only then decide how far to push further.&lt;/p&gt;
&lt;p&gt;If you only remember one sentence, make it this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;32K&lt;/code&gt; is often the more stable working range for &lt;code&gt;8GB&lt;/code&gt; VRAM; &lt;code&gt;64K&lt;/code&gt; is possible, but only if you have already brought &lt;code&gt;KV Cache&lt;/code&gt; and VRAM usage under control.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>A 16GB GPU Can Still Run 35B Models: VRAM Compression Strategies for MoE Models in LM Studio</title>
        <link>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</link>
        <pubDate>Wed, 22 Apr 2026 21:47:34 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</guid>
        <description>&lt;p&gt;Many people think of 16GB VRAM as the point where local LLM deployment more or less tops out at 12B to 14B models, and anything larger becomes too painful even with quantization. That view is understandable, but it is not the true ceiling of a 16GB GPU.&lt;/p&gt;
&lt;p&gt;If your model choice and parameter setup are good enough, a 16GB GPU does not have to stay limited to “small-parameter” models. One representative approach is to use &lt;code&gt;MoE&lt;/code&gt; models inside &lt;code&gt;LM Studio&lt;/code&gt; with a sensible unloading strategy, so that 35B-class models can still run at a genuinely usable speed.&lt;/p&gt;
&lt;h2 id=&#34;01-why-a-16gb-gpu-is-not-necessarily-limited-to-12b-to-14b&#34;&gt;01 Why a 16GB GPU is not necessarily limited to 12B to 14B
&lt;/h2&gt;&lt;p&gt;The core idea is straightforward: VRAM size matters, but model architecture matters just as much.&lt;/p&gt;
&lt;p&gt;If you try to cram a standard dense model into a 16GB GPU, you will hit the wall quickly. These models usually involve all parameters during inference, so VRAM pressure and bandwidth pressure rise immediately.&lt;/p&gt;
&lt;p&gt;But &lt;code&gt;MoE&lt;/code&gt; models are different. Their total parameter count can be large, while only part of the expert parameters are activated in a single inference step. Take a 35B-class model as an example: although the total parameter count is high, the actual number of parameters participating in each inference step is much smaller, so its real VRAM requirement is not as extreme as many people assume.&lt;/p&gt;
&lt;p&gt;That is exactly why a 16GB GPU still leaves some room to work with.&lt;/p&gt;
&lt;h2 id=&#34;02-key-practical-takeaway-35b-moe-models-can-run-surprisingly-fast&#34;&gt;02 Key practical takeaway: 35B MoE models can run surprisingly fast
&lt;/h2&gt;&lt;p&gt;One representative case is a quantized &lt;code&gt;MoE&lt;/code&gt; model such as &lt;code&gt;Qwen 3.5 35B A3B&lt;/code&gt;. With a 16GB GPU and the right settings in &lt;code&gt;LM Studio&lt;/code&gt;, &lt;code&gt;Q6&lt;/code&gt; quantization can reach something above 30 &lt;code&gt;tokens/s&lt;/code&gt;, and &lt;code&gt;Q4&lt;/code&gt; can sometimes test even higher.&lt;/p&gt;
&lt;p&gt;That result matters not just because the model “runs,” but because the speed is already in a clearly usable range.&lt;/p&gt;
&lt;p&gt;As a comparison, large models of a similar scale that are not &lt;code&gt;MoE&lt;/code&gt; often run into VRAM overflow and sharply lower speed on a 16GB GPU. In other words, the outcome is not determined by parameter count alone. What matters is how those parameters are actually used during inference.&lt;/p&gt;
&lt;h2 id=&#34;03-in-lm-studio-the-key-is-not-just-one-parameter&#34;&gt;03 In LM Studio, the key is not just one parameter
&lt;/h2&gt;&lt;p&gt;If you want this kind of model to run smoothly on a 16GB GPU, the real trick is not luck. It is tuning two parameters correctly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the setting that forces part of the expert layers into CPU memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first one is easy to understand. &lt;code&gt;GPU Offload&lt;/code&gt; is basically something you push as high as possible, so the model prioritizes GPU computation.&lt;/p&gt;
&lt;p&gt;The second one is the real key here. It is not the traditional “borrow system memory after VRAM overflows” approach. Instead, it proactively places part of the expert layers into CPU memory to reduce VRAM usage in advance. Since &lt;code&gt;MoE&lt;/code&gt; models do not activate every expert on every step anyway, moving some experts into memory does not hurt overall inference speed as much as many people would expect.&lt;/p&gt;
&lt;p&gt;A safer way to tune it is to start within a range and then adjust gradually for your machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;start with related values somewhere between &lt;code&gt;20&lt;/code&gt; and &lt;code&gt;35&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;then fine-tune based on VRAM usage and memory pressure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At its core, this method is using system memory to buy back VRAM headroom.&lt;/p&gt;
&lt;h2 id=&#34;04-it-can-still-run-at-128k-context-and-smaller-contexts-reduce-vram-further&#34;&gt;04 It can still run at 128K context, and smaller contexts reduce VRAM further
&lt;/h2&gt;&lt;p&gt;Another interesting point is that even with the context length pushed to &lt;code&gt;128K&lt;/code&gt;, a 35B-class &lt;code&gt;MoE&lt;/code&gt; model can still maintain a relatively high speed.&lt;/p&gt;
&lt;p&gt;That tells us something important: the bottleneck of a 16GB GPU is not as rigid as many people imagine. Especially inside a local inference tool like &lt;code&gt;LM Studio&lt;/code&gt;, the real question is often not simply “can it run or not,” but rather:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;are you willing to trade more system memory for less VRAM usage&lt;/li&gt;
&lt;li&gt;are you willing to shorten the context length&lt;/li&gt;
&lt;li&gt;are you willing to accept different capability tradeoffs across quantization levels&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the context is reduced further from &lt;code&gt;128K&lt;/code&gt; to &lt;code&gt;64K&lt;/code&gt; or &lt;code&gt;32K&lt;/code&gt;, VRAM pressure can drop even more. That means some 35B-class &lt;code&gt;MoE&lt;/code&gt; models may even run, barely, on GPUs with less VRAM, though speed and memory pressure will need to be rebalanced.&lt;/p&gt;
&lt;h2 id=&#34;05-the-cost-of-this-approach-much-higher-demands-on-ram-and-virtual-memory&#34;&gt;05 The cost of this approach: much higher demands on RAM and virtual memory
&lt;/h2&gt;&lt;p&gt;This kind of setup is not free performance.&lt;/p&gt;
&lt;p&gt;What you need to watch is that once VRAM pressure is compressed further, system RAM usage rises noticeably, and virtual memory pressure rises too. In other words, you are not removing the cost. You are shifting pressure from the GPU to RAM and disk swap space.&lt;/p&gt;
&lt;p&gt;So if you want to try it yourself, it is worth checking a few things first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether your system RAM is large enough&lt;/li&gt;
&lt;li&gt;whether your virtual memory allocation is large enough&lt;/li&gt;
&lt;li&gt;whether too many background applications are already consuming resources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those conditions are not in place, what you may get is not “35B running fast,” but an overall machine that becomes slow everywhere.&lt;/p&gt;
&lt;h2 id=&#34;06-more-aggressive-quantization-is-not-always-better&#34;&gt;06 More aggressive quantization is not always better
&lt;/h2&gt;&lt;p&gt;There is another practical tradeoff here. Lower-bit quantization often saves more VRAM, but that does not automatically make it the best choice.&lt;/p&gt;
&lt;p&gt;The practical takeaway is that some models do run faster under &lt;code&gt;Q4&lt;/code&gt;, but their original capability can also degrade more. By comparison, &lt;code&gt;Q6&lt;/code&gt; tends to strike a better balance between speed and capability retention. So the right choice depends on what you care about more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;maximum speed and fitting into VRAM&lt;/li&gt;
&lt;li&gt;or preserving more of the model’s original capability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those two priorities do not necessarily lead to the same quantization choice.&lt;/p&gt;
&lt;h2 id=&#34;07-what-kinds-of-models-are-worth-trying&#34;&gt;07 What kinds of models are worth trying
&lt;/h2&gt;&lt;p&gt;From this angle, the best thing to try is not “blindly chase bigger parameter counts,” but to first look for models that fit this strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;models built on &lt;code&gt;MoE&lt;/code&gt; architecture&lt;/li&gt;
&lt;li&gt;models that are well supported in &lt;code&gt;LM Studio&lt;/code&gt; and have complete quantized variants&lt;/li&gt;
&lt;li&gt;models with clear advantages in long context or instruction following&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And the idea does not stop at one 35B &lt;code&gt;MoE&lt;/code&gt; model. It also extends naturally to other directions, such as experimental models with stronger long-context memory, better instruction following, or lighter quantized versions with strong speed performance.&lt;/p&gt;
&lt;p&gt;The logic behind this is very consistent: first find models whose architecture fits the “trade memory for VRAM” strategy, and then talk about tuning. Do not start from parameter count alone and decide from there.&lt;/p&gt;
&lt;h2 id=&#34;08-short-conclusion&#34;&gt;08 Short conclusion
&lt;/h2&gt;&lt;p&gt;If you happen to have a 16GB GPU and assume local LLMs stop at 12B to 14B, that assumption is worth updating.&lt;/p&gt;
&lt;p&gt;A more accurate way to put it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a 16GB GPU is not automatically ruled out for larger models&lt;/li&gt;
&lt;li&gt;dense models and &lt;code&gt;MoE&lt;/code&gt; models need to be considered separately&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt; and expert-layer transfer to CPU memory inside &lt;code&gt;LM Studio&lt;/code&gt; can significantly change VRAM usage&lt;/li&gt;
&lt;li&gt;in practice, you are trading higher memory pressure for larger model scale and better usable speed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach will not fit every machine, but it does show one important thing: in local LLM deployment, VRAM is not the only limit. Model architecture and inference configuration matter just as much.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings</title>
        <link>https://knightli.com/en/2026/04/19/ollama-multiple-gpu-notes/</link>
        <pubDate>Sun, 19 Apr 2026 00:18:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/19/ollama-multiple-gpu-notes/</guid>
        <description>&lt;p&gt;When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?&lt;/p&gt;
&lt;p&gt;This note summarizes how Ollama behaves with multiple GPUs. The short version:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ollama supports multiple GPUs.&lt;/li&gt;
&lt;li&gt;The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.&lt;/li&gt;
&lt;li&gt;By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.&lt;/li&gt;
&lt;li&gt;If a model does not fit on one GPU, Ollama can spread it across available GPUs.&lt;/li&gt;
&lt;li&gt;Mixed GPU models may be visible to Ollama, but performance and placement may not be ideal.&lt;/li&gt;
&lt;li&gt;SLI / NVLink is not required for multi-GPU use.&lt;/li&gt;
&lt;li&gt;To limit which GPUs Ollama can use, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt;, &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt;, or &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;official-behavior-single-gpu-first-multi-gpu-when-needed&#34;&gt;Official Behavior: Single GPU First, Multi-GPU When Needed
&lt;/h2&gt;&lt;p&gt;Ollama&amp;rsquo;s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.&lt;/p&gt;
&lt;p&gt;The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.&lt;/p&gt;
&lt;p&gt;So do not think of Ollama multi-GPU as &amp;ldquo;more cards automatically means several times faster.&amp;rdquo; A more accurate model is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small model fits on one GPU: usually runs on one GPU.&lt;/li&gt;
&lt;li&gt;Large model does not fit on one GPU: split across multiple GPUs.&lt;/li&gt;
&lt;li&gt;Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use this command to see where the model is loaded:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The &lt;code&gt;PROCESSOR&lt;/code&gt; column may show something like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;100% GPU
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;48%/52% CPU/GPU
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;100% CPU
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you see &lt;code&gt;48%/52% CPU/GPU&lt;/code&gt;, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.&lt;/p&gt;
&lt;h2 id=&#34;multi-gpu-is-not-simple-compute-stacking&#34;&gt;Multi-GPU Is Not Simple Compute Stacking
&lt;/h2&gt;&lt;p&gt;Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.&lt;/p&gt;
&lt;p&gt;So multi-GPU benefits usually fall into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.&lt;/li&gt;
&lt;li&gt;Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama&amp;rsquo;s default &amp;ldquo;use one GPU when it fits&amp;rdquo; strategy avoids that unnecessary PCIe cost.&lt;/p&gt;
&lt;h2 id=&#34;sli-or-nvlink-is-not-required&#34;&gt;SLI or NVLink Is Not Required
&lt;/h2&gt;&lt;p&gt;Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.&lt;/p&gt;
&lt;p&gt;NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.&lt;/p&gt;
&lt;p&gt;What you should pay attention to is PCIe bandwidth. The difference between &lt;code&gt;x1&lt;/code&gt;, &lt;code&gt;x4&lt;/code&gt;, &lt;code&gt;x8&lt;/code&gt;, and &lt;code&gt;x16&lt;/code&gt; affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.&lt;/p&gt;
&lt;p&gt;Safer rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prefer x16 / x8 over mining-style x1 risers.&lt;/li&gt;
&lt;li&gt;PCIe bandwidth matters more when switching large models frequently.&lt;/li&gt;
&lt;li&gt;If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.&lt;/li&gt;
&lt;li&gt;For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;limit-which-nvidia-gpus-ollama-uses&#34;&gt;Limit Which NVIDIA GPUs Ollama Uses
&lt;/h2&gt;&lt;p&gt;On NVIDIA multi-GPU systems, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; to control which GPUs Ollama can see.&lt;/p&gt;
&lt;p&gt;Temporary run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Use only the second GPU:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Force Ollama not to use NVIDIA GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi -L
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Example output:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then specify the UUID:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If Ollama is installed as a Linux systemd service, put the variable into the service environment:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl edit ollama.service
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Add:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-ini&#34; data-lang=&#34;ini&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Service]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;CUDA_VISIBLE_DEVICES=0,1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Reload and restart:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl daemon-reload
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl restart ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;amd-and-vulkan-device-selection&#34;&gt;AMD and Vulkan Device Selection
&lt;/h2&gt;&lt;p&gt;For AMD ROCm, use &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; to control visible GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;ROCR_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To force Ollama not to use ROCm GPUs, use an invalid ID:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;ROCR_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Ollama&amp;rsquo;s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OLLAMA_VULKAN&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;GGML_VK_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If Vulkan devices cause problems, disable them:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;GGML_VK_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt;. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.&lt;/p&gt;
&lt;h2 id=&#34;exposing-multiple-gpus-in-docker&#34;&gt;Exposing Multiple GPUs in Docker
&lt;/h2&gt;&lt;p&gt;If you run Ollama in Docker, NVIDIA setups usually require &lt;code&gt;nvidia-container-toolkit&lt;/code&gt;, then &lt;code&gt;--gpus&lt;/code&gt; to expose devices.&lt;/p&gt;
&lt;p&gt;Expose all GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Expose specific GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus &lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#34;device=0,1&amp;#34;&amp;#39;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You can also combine this with environment variables:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -e &lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If &lt;code&gt;nvidia-smi&lt;/code&gt; cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.&lt;/p&gt;
&lt;h2 id=&#34;what-is-ollama_sched_spread&#34;&gt;What Is &lt;code&gt;OLLAMA_SCHED_SPREAD&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;In some multi-GPU configuration discussions, you may see &lt;code&gt;OLLAMA_SCHED_SPREAD=1&lt;/code&gt; or &lt;code&gt;OLLAMA_SCHED_SPREAD=true&lt;/code&gt;. It is related to Ollama&amp;rsquo;s scheduler and is often used when people want models or requests to be spread more broadly across GPUs.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OLLAMA_SCHED_SPREAD&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Or with systemd:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-ini&#34; data-lang=&#34;ini&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Service]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;OLLAMA_SCHED_SPREAD=true&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.&lt;/p&gt;
&lt;p&gt;Treat &lt;code&gt;OLLAMA_SCHED_SPREAD&lt;/code&gt; as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on &lt;code&gt;ollama ps&lt;/code&gt;, logs, and &lt;code&gt;nvidia-smi&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-to-check-whether-multiple-gpus-are-being-used&#34;&gt;How to Check Whether Multiple GPUs Are Being Used
&lt;/h2&gt;&lt;p&gt;Useful commands:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;watch -n 0.5 nvidia-smi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;View the Ollama service logs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;journalctl -u ollama -f
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If using Docker:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker logs -f ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Watch for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether Ollama discovers compatible GPUs.&lt;/li&gt;
&lt;li&gt;Whether the model shows &lt;code&gt;100% GPU&lt;/code&gt; or a CPU/GPU split.&lt;/li&gt;
&lt;li&gt;Whether each GPU has VRAM allocated.&lt;/li&gt;
&lt;li&gt;Whether VRAM grows on multiple GPUs during model loading.&lt;/li&gt;
&lt;li&gt;Whether generation token/s improves compared with CPU/RAM spillover.&lt;/li&gt;
&lt;li&gt;Whether OOM or model unloading happens frequently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.&lt;/p&gt;
&lt;h2 id=&#34;common-misunderstandings&#34;&gt;Common Misunderstandings
&lt;/h2&gt;&lt;h3 id=&#34;misunderstanding-1-two-12gb-gpus-equal-one-24gb-gpu&#34;&gt;Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU
&lt;/h3&gt;&lt;p&gt;Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the &amp;ldquo;does not fit&amp;rdquo; problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-2-different-gpu-models-cannot-be-mixed&#34;&gt;Misunderstanding 2: Different GPU Models Cannot Be Mixed
&lt;/h3&gt;&lt;p&gt;Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, smaller VRAM, and PCIe topology. The most predictable setup is still same model, same VRAM size, and well-supported same-generation drivers.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-3-multi-gpu-is-always-faster-than-single-gpu&#34;&gt;Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU
&lt;/h3&gt;&lt;p&gt;Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-4-nvlink--sli-is-required&#34;&gt;Misunderstanding 4: NVLink / SLI Is Required
&lt;/h3&gt;&lt;p&gt;No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-5-adding-a-gpu-does-not-require-restarting-services&#34;&gt;Misunderstanding 5: Adding a GPU Does Not Require Restarting Services
&lt;/h3&gt;&lt;p&gt;Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.&lt;/p&gt;
&lt;h2 id=&#34;gpu-selection-suggestions&#34;&gt;GPU Selection Suggestions
&lt;/h2&gt;&lt;p&gt;For Ollama local inference, the rough priority is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Larger single-GPU VRAM is usually easier to manage.&lt;/li&gt;
&lt;li&gt;Identical GPUs are easier to troubleshoot than mixed GPUs.&lt;/li&gt;
&lt;li&gt;More complete PCIe lanes make large-model loading smoother.&lt;/li&gt;
&lt;li&gt;Older cards should be checked for CUDA compute capability or ROCm support first.&lt;/li&gt;
&lt;li&gt;Multi-GPU power, cooling, and chassis airflow must be planned ahead.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For budget second-hand platforms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dual RTX 3090 remains a common high-VRAM option.&lt;/li&gt;
&lt;li&gt;Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all need trade-offs.&lt;/li&gt;
&lt;li&gt;Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.&lt;/li&gt;
&lt;li&gt;Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Ollama multi-GPU support is best understood as &amp;ldquo;VRAM expansion first, performance acceleration second.&amp;rdquo; If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.&lt;/p&gt;
&lt;p&gt;In practice, use &lt;code&gt;ollama ps&lt;/code&gt; to check where the model is loaded, then use &lt;code&gt;nvidia-smi&lt;/code&gt; or ROCm tools to observe VRAM allocation. For GPU selection, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; on NVIDIA, &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; on AMD ROCm, and &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt; for Vulkan. If running in Docker, first make sure the container can see the GPUs.&lt;/p&gt;
&lt;p&gt;Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Ollama FAQ: How does Ollama load models on multiple GPUs?: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama/blob/main/docs/faq.mdx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ollama/ollama/blob/main/docs/faq.mdx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama GPU docs: Hardware support / GPU Selection: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama/blob/main/docs/gpu.mdx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ollama/ollama/blob/main/docs/gpu.mdx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama Docker Hub: &lt;a class=&#34;link&#34; href=&#34;https://hub.docker.com/r/ollama/ollama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://hub.docker.com/r/ollama/ollama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;NVIDIA Container Toolkit: &lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/nvidia-container-toolkit&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/NVIDIA/nvidia-container-toolkit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Gemma 4 E4B Uncensored vs Official: What Actually Changes</title>
        <link>https://knightli.com/en/2026/04/18/gemma-4-e4b-uncensored-vs-official/</link>
        <pubDate>Sat, 18 Apr 2026 10:20:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/18/gemma-4-e4b-uncensored-vs-official/</guid>
        <description>&lt;p&gt;If you see a model like &lt;code&gt;HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive&lt;/code&gt;, the most important point is this: it is &lt;strong&gt;not a new Google base model&lt;/strong&gt;. It is a derivative release built on top of the official &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt;, but with alignment behavior intentionally pushed toward fewer refusals.&lt;/p&gt;
&lt;p&gt;That means the real difference is usually &lt;strong&gt;behavioral policy and response style&lt;/strong&gt;, not a brand-new architecture.&lt;/p&gt;
&lt;h2 id=&#34;what-the-derivative-model-explicitly-claims&#34;&gt;What the derivative model explicitly claims
&lt;/h2&gt;&lt;p&gt;According to its Hugging Face model card, the HauhauCS release says:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it is based on &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;it makes &amp;ldquo;no changes to datasets or capabilities&amp;rdquo;&lt;/li&gt;
&lt;li&gt;it is &amp;ldquo;just without the refusals&amp;rdquo;&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;Aggressive&lt;/code&gt; variant is &amp;ldquo;fully unlocked and won&amp;rsquo;t refuse prompts&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those are the creator&amp;rsquo;s claims, not an independent benchmark. Still, they tell you the intended positioning very clearly: this is an unofficial derivative optimized to reduce safety refusals.&lt;/p&gt;
&lt;h2 id=&#34;official-model-vs-uncensored-derivative&#34;&gt;Official model vs &amp;ldquo;uncensored&amp;rdquo; derivative
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Dimension&lt;/th&gt;
          &lt;th&gt;Official &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt;&lt;/th&gt;
          &lt;th&gt;&lt;code&gt;Gemma-4-E4B-Uncensored-HauhauCS-Aggressive&lt;/code&gt;&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Source&lt;/td&gt;
          &lt;td&gt;Official Google release&lt;/td&gt;
          &lt;td&gt;Third-party derivative on Hugging Face&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Base architecture&lt;/td&gt;
          &lt;td&gt;Gemma 4 E4B instruction-tuned model&lt;/td&gt;
          &lt;td&gt;Same base family, explicitly described as based on &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Main goal&lt;/td&gt;
          &lt;td&gt;General-purpose helpful assistant with responsible-use framing&lt;/td&gt;
          &lt;td&gt;Reduce refusals and keep answering even when the official model might decline&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Safety posture&lt;/td&gt;
          &lt;td&gt;Aligned with Gemma family safety docs and prohibited-use policy&lt;/td&gt;
          &lt;td&gt;Intentionally weakened refusal behavior&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Response style&lt;/td&gt;
          &lt;td&gt;More likely to refuse, redirect, or soften certain requests&lt;/td&gt;
          &lt;td&gt;More likely to answer directly, including prompts the official model may block&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Risk profile&lt;/td&gt;
          &lt;td&gt;Lower misuse risk by default, but still not risk-free&lt;/td&gt;
          &lt;td&gt;Higher misuse risk, higher chance of unsafe or non-compliant output&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Predictability in products&lt;/td&gt;
          &lt;td&gt;Easier to justify in normal apps and enterprise environments&lt;/td&gt;
          &lt;td&gt;Harder to justify in public-facing, business, or policy-sensitive deployments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Compliance burden&lt;/td&gt;
          &lt;td&gt;Still requires application-level safeguards&lt;/td&gt;
          &lt;td&gt;Requires even stronger downstream safeguards because the model itself is less restrictive&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;the-core-difference-is-alignment-not-raw-capability&#34;&gt;The core difference is alignment, not raw capability
&lt;/h2&gt;&lt;p&gt;Many users mistakenly treat &amp;ldquo;uncensored&amp;rdquo; as if it means &amp;ldquo;smarter.&amp;rdquo; That is usually the wrong frame.&lt;/p&gt;
&lt;p&gt;For a derivative like this, what changes first is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how often the model refuses&lt;/li&gt;
&lt;li&gt;how strongly it follows harmful or policy-sensitive instructions&lt;/li&gt;
&lt;li&gt;how much filtering remains in its final answers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What does &lt;strong&gt;not&lt;/strong&gt; automatically change:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the underlying Gemma 4 family architecture&lt;/li&gt;
&lt;li&gt;context window class&lt;/li&gt;
&lt;li&gt;multimodal support class&lt;/li&gt;
&lt;li&gt;general reasoning ceiling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, an uncensored derivative is often better described as a &lt;strong&gt;different behavioral tuning&lt;/strong&gt; of the same model family, not a higher-tier model.&lt;/p&gt;
&lt;h2 id=&#34;why-the-official-version-behaves-differently&#34;&gt;Why the official version behaves differently
&lt;/h2&gt;&lt;p&gt;Google&amp;rsquo;s official Gemma materials frame the family as being built for responsible AI development. The Gemma model card highlights misuse, harmful content, privacy, and bias risks, and Google&amp;rsquo;s Gemma Prohibited Use Policy explicitly forbids using Gemma or model derivatives to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;facilitate dangerous, illegal, or malicious activities&lt;/li&gt;
&lt;li&gt;generate harmful or deceptive content&lt;/li&gt;
&lt;li&gt;override or circumvent safety filters&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the official model is not just &amp;ldquo;more conservative&amp;rdquo; by accident. Its surrounding policy and intended deployment posture are deliberately different.&lt;/p&gt;
&lt;h2 id=&#34;when-the-official-model-is-the-better-choice&#34;&gt;When the official model is the better choice
&lt;/h2&gt;&lt;p&gt;Use the official &lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt; path if you care about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;product deployment&lt;/li&gt;
&lt;li&gt;enterprise or team use&lt;/li&gt;
&lt;li&gt;lower legal and policy exposure&lt;/li&gt;
&lt;li&gt;fewer obviously unsafe outputs&lt;/li&gt;
&lt;li&gt;easier documentation and review&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For most normal applications, this is the safer default.&lt;/p&gt;
&lt;h2 id=&#34;when-people-choose-the-uncensored-derivative&#34;&gt;When people choose the uncensored derivative
&lt;/h2&gt;&lt;p&gt;Users usually choose an uncensored derivative for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;local private experimentation&lt;/li&gt;
&lt;li&gt;testing where the official model refuses too early&lt;/li&gt;
&lt;li&gt;roleplay or open-ended creative prompting&lt;/li&gt;
&lt;li&gt;comparing alignment behavior across variants&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But this comes with a real trade-off: you are moving more safety responsibility from the model provider to yourself.&lt;/p&gt;
&lt;h2 id=&#34;practical-conclusion&#34;&gt;Practical conclusion
&lt;/h2&gt;&lt;p&gt;The difference between a so-called &amp;ldquo;jailbroken&amp;rdquo; Gemma 4 E4B and the ordinary official version is mostly this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the official version is optimized for usable capability &lt;strong&gt;with guardrails&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;the uncensored derivative is optimized for fewer refusals &lt;strong&gt;with weaker guardrails&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That does &lt;strong&gt;not&lt;/strong&gt; automatically make the uncensored model stronger. It mainly makes it more permissive.&lt;/p&gt;
&lt;p&gt;If your goal is stable, explainable, and lower-risk deployment, use the official model first. If your goal is local experimentation and you understand the compliance and safety trade-offs, then an uncensored derivative is a behavior variant worth testing separately, not a drop-in &amp;ldquo;better&amp;rdquo; replacement.&lt;/p&gt;
&lt;h2 id=&#34;sources&#34;&gt;Sources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Hugging Face: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hugging Face: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-4-E4B-it&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/gemma-4-E4B-it&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google AI for Developers: &lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/prohibited_use_policy&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma Prohibited Use Policy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google AI for Developers: &lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/docs/core/model_card&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma model card&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Use llama-quantize for GGUF Models</title>
        <link>https://knightli.com/en/2026/04/12/llama-quantize-gguf-guide/</link>
        <pubDate>Sun, 12 Apr 2026 09:42:36 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/12/llama-quantize-gguf-guide/</guid>
        <description>&lt;p&gt;&lt;code&gt;llama-quantize&lt;/code&gt; is the quantization tool in &lt;code&gt;llama.cpp&lt;/code&gt;. It is used to convert high-precision &lt;code&gt;GGUF&lt;/code&gt; models into smaller quantized versions.&lt;/p&gt;
&lt;p&gt;Its most common use is turning formats such as &lt;code&gt;F32&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, or &lt;code&gt;FP16&lt;/code&gt; into versions like &lt;code&gt;Q4_K_M&lt;/code&gt;, &lt;code&gt;Q5_K_M&lt;/code&gt;, or &lt;code&gt;Q8_0&lt;/code&gt; that are easier to run locally. After quantization, models usually become much smaller and often faster at inference, but some quality loss is expected.&lt;/p&gt;
&lt;h2 id=&#34;basic-workflow&#34;&gt;Basic workflow
&lt;/h2&gt;&lt;p&gt;A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# install Python dependencies&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python3 -m pip install -r requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# convert the model to ggml FP16 format&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python3 convert_hf_to_gguf.py ./models/mymodel/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# quantize the model to 4-bits (using Q4_K_M method)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After that, you can run the quantized model with &lt;code&gt;llama-cli&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# start inference on a gguf model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p &lt;span class=&#34;s2&#34;&gt;&amp;#34;You are a helpful assistant&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;common-options&#34;&gt;Common options
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--allow-requantize&lt;/code&gt;: allows requantizing an already quantized model, usually not ideal for quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--leave-output-tensor&lt;/code&gt;: keeps the output layer unquantized, increasing size but sometimes helping quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--pure&lt;/code&gt;: disables mixed quantization and uses a more uniform quant type&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--imatrix&lt;/code&gt;: uses an importance matrix to improve quantization quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--keep-split&lt;/code&gt;: keeps the original shard layout instead of producing one merged file&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you just want a practical starting point, this is often enough:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;how-to-choose-a-quant&#34;&gt;How to choose a quant
&lt;/h2&gt;&lt;p&gt;You can think of quant levels as a tradeoff between size, speed, and quality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q8_0&lt;/code&gt;: larger, but usually safer for quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K&lt;/code&gt; / &lt;code&gt;Q5_K_M&lt;/code&gt;: common balanced choices&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;: a very common default with a good size-quality balance&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3&lt;/code&gt; / &lt;code&gt;Q2&lt;/code&gt;: useful when hardware is very limited, but quality loss is more visible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.&lt;/p&gt;
&lt;h2 id=&#34;practical-takeaway&#34;&gt;Practical takeaway
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;start with &lt;code&gt;Q4_K_M&lt;/code&gt; or &lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;move up to &lt;code&gt;Q6_K&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt; if quality matters more&lt;/li&gt;
&lt;li&gt;move down to &lt;code&gt;Q3&lt;/code&gt; or &lt;code&gt;Q2&lt;/code&gt; if memory is tight&lt;/li&gt;
&lt;li&gt;compare versions with the same prompt set&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, &lt;code&gt;llama-quantize&lt;/code&gt; is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Get GGUF Models from Hugging Face with llama.cpp</title>
        <link>https://knightli.com/en/2026/04/12/llama-cpp-hugging-face-gguf-models/</link>
        <pubDate>Sun, 12 Apr 2026 09:31:38 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/12/llama-cpp-hugging-face-gguf-models/</guid>
        <description>&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; can work directly with GGUF models hosted on Hugging Face, so you do not always need to download model files manually first.&lt;/p&gt;
&lt;p&gt;If a model repository already provides GGUF files, you can use the &lt;code&gt;-hf&lt;/code&gt; argument in the CLI, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;By default, this downloads from Hugging Face.&lt;br&gt;
If you use another service that exposes a Hugging Face compatible API, you can switch the download endpoint with the &lt;code&gt;MODEL_ENDPOINT&lt;/code&gt; environment variable.&lt;/p&gt;
&lt;p&gt;One important detail is that &lt;code&gt;llama.cpp&lt;/code&gt; only works directly with the &lt;code&gt;GGUF&lt;/code&gt; format.&lt;br&gt;
If your model is in another format, you need to convert it first with the &lt;code&gt;convert_*.py&lt;/code&gt; scripts provided in the repository.&lt;/p&gt;
&lt;p&gt;Hugging Face also offers several online tools related to &lt;code&gt;llama.cpp&lt;/code&gt;, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;converting models to &lt;code&gt;GGUF&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;quantizing weights to reduce size&lt;/li&gt;
&lt;li&gt;converting LoRA adapters&lt;/li&gt;
&lt;li&gt;editing GGUF metadata in the browser&lt;/li&gt;
&lt;li&gt;hosting &lt;code&gt;llama.cpp&lt;/code&gt; inference endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only want the practical takeaway, start with repositories that already provide &lt;code&gt;GGUF&lt;/code&gt;, then use &lt;code&gt;llama-cli -hf &amp;lt;user&amp;gt;/&amp;lt;model&amp;gt;&lt;/code&gt;. In most cases, that is the simplest path.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>What Does `it` Mean in Gemma-4-31B-it</title>
        <link>https://knightli.com/en/2026/04/11/gemma-4-31b-it-meaning/</link>
        <pubDate>Sat, 11 Apr 2026 20:45:34 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/11/gemma-4-31b-it-meaning/</guid>
        <description>&lt;p&gt;In &lt;code&gt;gemma-4-31B-it&lt;/code&gt;, &lt;code&gt;it&lt;/code&gt; stands for &lt;code&gt;Instruction Tuned&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For most users, that means this version is designed for chat, Q&amp;amp;A, coding help, and other instruction-following tasks.&lt;/p&gt;
&lt;h2 id=&#34;what-it-means&#34;&gt;What &lt;code&gt;it&lt;/code&gt; means
&lt;/h2&gt;&lt;p&gt;Models often come in two common forms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Base / Pre-trained: closer to a raw next-token predictor&lt;/li&gt;
&lt;li&gt;&lt;code&gt;it&lt;/code&gt;: tuned to follow user instructions more reliably&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you ask something like &amp;ldquo;translate this text&amp;rdquo; or &amp;ldquo;write a Python script&amp;rdquo;, the &lt;code&gt;it&lt;/code&gt; version usually behaves more like an assistant.&lt;/p&gt;
&lt;h2 id=&#34;what-31b-means&#34;&gt;What &lt;code&gt;31B&lt;/code&gt; means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;31B&lt;/code&gt; means the model has about 31 billion parameters.&lt;/p&gt;
&lt;p&gt;In general:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;more parameters often mean stronger capability&lt;/li&gt;
&lt;li&gt;but also higher VRAM or RAM requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;31B&lt;/code&gt; is a relatively large model and needs stronger hardware.&lt;/p&gt;
&lt;h2 id=&#34;what-gemma-4-means&#34;&gt;What &lt;code&gt;Gemma-4&lt;/code&gt; means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Gemma-4&lt;/code&gt; identifies the model family and generation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Gemma&lt;/code&gt;: Google&amp;rsquo;s open model family&lt;/li&gt;
&lt;li&gt;&lt;code&gt;4&lt;/code&gt;: the fourth generation in that family&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;which-one-to-choose&#34;&gt;Which one to choose
&lt;/h2&gt;&lt;p&gt;If your goal is chat, Q&amp;amp;A, translation, or coding, the &lt;code&gt;-it&lt;/code&gt; version is usually the better choice.&lt;/p&gt;
&lt;p&gt;The base version is more relevant for lower-level research, fine-tuning, or custom training workflows.&lt;/p&gt;
&lt;h2 id=&#34;one-line-summary&#34;&gt;One-line summary
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;gemma-4-31B-it&lt;/code&gt; means: Gemma 4 family, 31 billion parameters, instruction-tuned for conversation and task execution.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Choosing Llama GGUF Quantization on Hugging Face: Practical Advice from Q8 to Q2</title>
        <link>https://knightli.com/en/2026/04/11/llama-gguf-quantization-selection/</link>
        <pubDate>Sat, 11 Apr 2026 20:07:29 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/11/llama-gguf-quantization-selection/</guid>
        <description>&lt;p&gt;When selecting a Llama GGUF model on Hugging Face, you can think of quantization levels like resolution: lower levels need less VRAM/RAM, but quality drops gradually.&lt;/p&gt;
&lt;h2 id=&#34;understand-32-16-and-q-levels-first&#34;&gt;Understand 32, 16, and Q levels first
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32&lt;/code&gt;: closest to original/uncompressed quality, but hardware demand is extreme.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;16&lt;/code&gt;: still very close to original quality, around half the size of &lt;code&gt;32&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q8&lt;/code&gt;: common entry point for quantized models (&lt;code&gt;Q8_0&lt;/code&gt; or &lt;code&gt;Q8&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6&lt;/code&gt;, &lt;code&gt;Q5&lt;/code&gt;, &lt;code&gt;Q4&lt;/code&gt;, &lt;code&gt;Q3&lt;/code&gt;, &lt;code&gt;Q2&lt;/code&gt;: lower number means lower resource use and higher quality loss risk.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-k_m--k_s-means&#34;&gt;What &lt;code&gt;K_M&lt;/code&gt; / &lt;code&gt;K_S&lt;/code&gt; means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;K_M&lt;/code&gt; and &lt;code&gt;K_S&lt;/code&gt; are mixed quantization variants:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;most weights stay at the target quantization level&lt;/li&gt;
&lt;li&gt;important parts keep higher precision&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So at the same level, &lt;code&gt;Qx_K_M&lt;/code&gt; or &lt;code&gt;Qx_K_S&lt;/code&gt; is usually slightly better than plain &lt;code&gt;Qx&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;practical-picking-strategy&#34;&gt;Practical picking strategy
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;If hardware allows, start with &lt;code&gt;Q8&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If memory is tight, step down through &lt;code&gt;Q6&lt;/code&gt; / &lt;code&gt;Q5&lt;/code&gt; / &lt;code&gt;Q4&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Try not to go below &lt;code&gt;Q4&lt;/code&gt;; &lt;code&gt;Q4_K_M&lt;/code&gt; is a common lower bound.&lt;/li&gt;
&lt;li&gt;Below &lt;code&gt;Q4&lt;/code&gt;, quality degradation becomes increasingly visible.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;quality-order-best-to-worst&#34;&gt;Quality order (best to worst)
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;16&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; Above this point, quality is effectively the same, but hardware requirements are extreme &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q8&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; This is the typical sweet spot &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; Below this point, quality loss becomes visible &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want one short rule: start with &lt;code&gt;Q8&lt;/code&gt; or &lt;code&gt;Q6_K_M&lt;/code&gt;, then move down to &lt;code&gt;Q5&lt;/code&gt; or &lt;code&gt;Q4_K_M&lt;/code&gt; only when needed.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Access a Local Ollama API Over LAN on Windows</title>
        <link>https://knightli.com/en/2026/04/11/ollama-api-lan-access-windows/</link>
        <pubDate>Sat, 11 Apr 2026 16:43:52 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/11/ollama-api-lan-access-windows/</guid>
        <description>&lt;p&gt;If you want other devices in the same LAN to access your local Ollama API, follow these steps.&lt;/p&gt;
&lt;h2 id=&#34;set-the-listening-host&#34;&gt;Set the listening host
&lt;/h2&gt;&lt;p&gt;First, set Ollama to listen on all network interfaces:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;OLLAMA_HOST=0.0.0.0:11434&lt;/code&gt;&lt;/p&gt;
&lt;h2 id=&#34;open-the-firewall&#34;&gt;Open the firewall
&lt;/h2&gt;&lt;p&gt;In Windows Firewall advanced settings, create an inbound rule and allow the target port (for example &lt;code&gt;8080&lt;/code&gt;):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Press Win + S, search and open &amp;ldquo;Windows Defender Firewall&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Click &amp;ldquo;Advanced settings&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Select &amp;ldquo;Inbound Rules&amp;rdquo; -&amp;gt; &amp;ldquo;New Rule&amp;hellip;&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Choose &amp;ldquo;Port&amp;rdquo;, then click &amp;ldquo;Next&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Select protocol (usually TCP), enter the target port in &amp;ldquo;Specific local ports&amp;rdquo; (for example &lt;code&gt;8080&lt;/code&gt;), then click &amp;ldquo;Next&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Choose &amp;ldquo;Allow the connection&amp;rdquo;, then click &amp;ldquo;Next&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;In &amp;ldquo;Profile&amp;rdquo;, select Domain, Private, and Public, then click &amp;ldquo;Next&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Name the rule (for example &lt;code&gt;OpenPort8080&lt;/code&gt;) and click &amp;ldquo;Finish&amp;rdquo;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;run-ollama&#34;&gt;Run Ollama
&lt;/h2&gt;&lt;p&gt;Ollama run 模型&lt;/p&gt;
&lt;h2 id=&#34;access-the-model-through-api&#34;&gt;Access the model through API
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl http://192.168.x.xxx:11434/api/generate -d &lt;span class=&#34;s1&#34;&gt;&amp;#39;{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  &amp;#34;model&amp;#34;: &amp;#34;gemma4&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  &amp;#34;prompt&amp;#34;: &amp;#34;这个是什么模型?&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
        </item>
        <item>
        <title>Gemma 4 Local Runtime Guide: From One-Command Start to Dev Integration</title>
        <link>https://knightli.com/en/2026/04/10/gemma4-local-runtime-options/</link>
        <pubDate>Fri, 10 Apr 2026 22:54:17 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/10/gemma4-local-runtime-options/</guid>
        <description>&lt;p&gt;If you want to run Gemma 4 locally, you can choose from four practical paths depending on your goal and hardware.&lt;/p&gt;
&lt;h2 id=&#34;1-fastest-start-ollama-recommended&#34;&gt;1) Fastest start: Ollama (recommended)
&lt;/h2&gt;&lt;p&gt;This is the lowest-friction option for quick testing, daily chat, and local API usage.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gemma4
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works on Windows, macOS, and Linux&lt;/li&gt;
&lt;li&gt;Handles hardware acceleration automatically&lt;/li&gt;
&lt;li&gt;Offers OpenAI-style local API compatibility&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2-gui-workflow-lm-studio--unsloth-studio&#34;&gt;2) GUI workflow: LM Studio / Unsloth Studio
&lt;/h2&gt;&lt;p&gt;If you prefer a desktop UI instead of terminal commands:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LM Studio: browse and run Gemma 4 quantized variants from Hugging Face (for example 4-bit, 8-bit), with resource visibility.&lt;/li&gt;
&lt;li&gt;Unsloth Studio: supports both inference and low-VRAM fine-tuning, often friendlier on 6GB-8GB GPUs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;3-low-spec-and-maximum-control-llamacpp&#34;&gt;3) Low-spec and maximum control: llama.cpp
&lt;/h2&gt;&lt;p&gt;Good for older hardware, CPU-focused setups, or users who want deeper runtime control.&lt;/p&gt;
&lt;p&gt;With &lt;code&gt;.gguf&lt;/code&gt; model files and quantization, Gemma 4 can be made practical on much smaller hardware budgets.&lt;/p&gt;
&lt;h2 id=&#34;4-developer-integration-transformers--vllm&#34;&gt;4) Developer integration: Transformers / vLLM
&lt;/h2&gt;&lt;p&gt;If you need Gemma 4 inside your own application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transformers: straightforward Python integration&lt;/li&gt;
&lt;li&gt;vLLM: high-throughput inference for stronger GPU environments&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;quick-selection&#34;&gt;Quick selection
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Need&lt;/th&gt;
          &lt;th&gt;Recommended tools&lt;/th&gt;
          &lt;th&gt;Hardware bar&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;I just want it running now&lt;/td&gt;
          &lt;td&gt;Ollama&lt;/td&gt;
          &lt;td&gt;Low&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;I want a ChatGPT-like UI&lt;/td&gt;
          &lt;td&gt;LM Studio&lt;/td&gt;
          &lt;td&gt;Medium&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;My VRAM is limited (6GB-8GB)&lt;/td&gt;
          &lt;td&gt;Unsloth / llama.cpp&lt;/td&gt;
          &lt;td&gt;Low&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;I am building local AI apps&lt;/td&gt;
          &lt;td&gt;Ollama / Transformers / vLLM&lt;/td&gt;
          &lt;td&gt;Medium to high&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;I need fine-tuning&lt;/td&gt;
          &lt;td&gt;Unsloth Studio&lt;/td&gt;
          &lt;td&gt;Medium to high&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;model-size-suggestion&#34;&gt;Model size suggestion
&lt;/h2&gt;&lt;p&gt;Gemma 4 comes in multiple sizes (for example E2B, E4B, 31B).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start with quantized E2B/E4B on mainstream laptops&lt;/li&gt;
&lt;li&gt;Move to larger variants only after your baseline pipeline is stable&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>What are Ollama cloud models and how do you use them</title>
        <link>https://knightli.com/en/2026/04/09/ollama-cloud-models-guide/</link>
        <pubDate>Thu, 09 Apr 2026 18:42:32 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/09/ollama-cloud-models-guide/</guid>
        <description>&lt;p&gt;If you already use &lt;code&gt;Ollama&lt;/code&gt; to run local models, cloud models are easy to understand.&lt;/p&gt;
&lt;p&gt;There is only one core difference:&lt;br&gt;
local models run on your own machine, while cloud models run on Ollama&amp;rsquo;s cloud infrastructure and return the result to you.&lt;/p&gt;
&lt;h2 id=&#34;what-are-ollama-cloud-models&#34;&gt;What are Ollama cloud models
&lt;/h2&gt;&lt;p&gt;Ollama cloud models keep the Ollama workflow, but move the actual computation from your local machine to the cloud.&lt;/p&gt;
&lt;p&gt;The main benefits are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less pressure on local hardware&lt;/li&gt;
&lt;li&gt;Easier access to larger models that your machine cannot run well&lt;/li&gt;
&lt;li&gt;You can keep using the familiar Ollama workflow&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-they-differ-from-local-models&#34;&gt;How they differ from local models
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Item&lt;/th&gt;
          &lt;th&gt;Local models&lt;/th&gt;
          &lt;th&gt;Cloud models&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Runtime location&lt;/td&gt;
          &lt;td&gt;Your machine&lt;/td&gt;
          &lt;td&gt;Cloud&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Hardware requirements&lt;/td&gt;
          &lt;td&gt;High&lt;/td&gt;
          &lt;td&gt;Low&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Latency&lt;/td&gt;
          &lt;td&gt;Usually lower&lt;/td&gt;
          &lt;td&gt;Affected by network&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Privacy&lt;/td&gt;
          &lt;td&gt;Stronger&lt;/td&gt;
          &lt;td&gt;Requests are sent to the cloud&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you care more about privacy, low latency, and offline use, local models are a better fit.&lt;br&gt;
If your hardware is limited but you still want to use larger models, cloud models are more convenient.&lt;/p&gt;
&lt;h2 id=&#34;how-to-identify-a-cloud-model&#34;&gt;How to identify a cloud model
&lt;/h2&gt;&lt;p&gt;At the moment, Ollama cloud models are typically labeled with a &lt;code&gt;-cloud&lt;/code&gt; suffix, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;gpt-oss:120b-cloud
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The available model list may change over time, so the official Ollama pages should be treated as the source of truth.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use-them&#34;&gt;How to use them
&lt;/h2&gt;&lt;p&gt;First, sign in:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama signin
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After that, run a cloud model directly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gpt-oss:120b-cloud
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you are calling it from code, you can also configure an API key:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;OLLAMA_API_KEY&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;your_api_key
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Python example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;os&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ollama&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Client&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;client&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Client&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;host&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;https://ollama.com&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Authorization&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Bearer &amp;#34;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;environ&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;OLLAMA_API_KEY&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Why is the sky blue?&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;part&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;client&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;gpt-oss:120b-cloud&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;stream&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;part&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;message&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;end&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;flush&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Ollama cloud models can be summarized in one sentence:&lt;/p&gt;
&lt;p&gt;the commands are almost the same, but the model is no longer running on your local machine.&lt;/p&gt;
&lt;p&gt;If your computer cannot handle large models well, but you still want to keep the Ollama workflow, cloud models are a very direct option.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Download a GGUF Model from Hugging Face and Import It into Ollama</title>
        <link>https://knightli.com/en/2026/04/09/import-huggingface-gguf-into-ollama/</link>
        <pubDate>Thu, 09 Apr 2026 11:00:07 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/09/import-huggingface-gguf-into-ollama/</guid>
        <description>&lt;p&gt;If a model is not available in the official Ollama library, or if you want to use a specific &lt;code&gt;GGUF&lt;/code&gt; file from Hugging Face, you can download it manually and then import it into Ollama.&lt;/p&gt;
&lt;h2 id=&#34;step-1-download-the-gguf-file-from-hugging-face&#34;&gt;Step 1: Download the GGUF file from Hugging Face
&lt;/h2&gt;&lt;p&gt;First, find the target model&amp;rsquo;s &lt;code&gt;GGUF&lt;/code&gt; file on Hugging Face. You will usually see multiple quantized versions, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Which version you choose depends on your VRAM, RAM, and your tradeoff between speed and quality. After downloading, place the &lt;code&gt;.gguf&lt;/code&gt; file in a fixed directory so you can reference it from the &lt;code&gt;Modelfile&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;step-2-write-the-modelfile&#34;&gt;Step 2: Write the Modelfile
&lt;/h2&gt;&lt;p&gt;Create a &lt;code&gt;Modelfile&lt;/code&gt; in the same directory as the model file. The most basic version looks like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;FROM ./model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the filename is different, replace it with the actual filename, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;FROM ./gemma-3-12b-it-q4_k_m.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If your goal is just to get it running, this single &lt;code&gt;FROM&lt;/code&gt; line is usually enough.&lt;/p&gt;
&lt;h2 id=&#34;step-3-import-it-into-ollama&#34;&gt;Step 3: Import it into Ollama
&lt;/h2&gt;&lt;p&gt;Then run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama create myModelName -f Modelfile
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;myModelName&lt;/code&gt; is the local model name you want to use inside Ollama&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-f Modelfile&lt;/code&gt; tells Ollama to create the model from that file&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once the creation succeeds, the GGUF file becomes a local model that you can call directly.&lt;/p&gt;
&lt;h2 id=&#34;step-4-run-the-model&#34;&gt;Step 4: Run the model
&lt;/h2&gt;&lt;p&gt;After creation, run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run myModelName
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;From that point on, it works much like a model pulled with &lt;code&gt;ollama pull&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-to-inspect-an-existing-models-modelfile&#34;&gt;How to inspect an existing model&amp;rsquo;s Modelfile
&lt;/h2&gt;&lt;p&gt;If you are not sure how to write a &lt;code&gt;Modelfile&lt;/code&gt;, you can inspect the configuration of an existing model directly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama show --modelfile llama3.2
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This command prints the &lt;code&gt;Modelfile&lt;/code&gt; for &lt;code&gt;llama3.2&lt;/code&gt;, which is useful as a reference for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;FROM&lt;/code&gt; should be written&lt;/li&gt;
&lt;li&gt;How the template and system prompt are structured&lt;/li&gt;
&lt;li&gt;How parameters are declared&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;when-this-approach-makes-sense&#34;&gt;When this approach makes sense
&lt;/h2&gt;&lt;p&gt;This manual Hugging Face import flow is useful when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model you want is not available in Ollama&amp;rsquo;s official library&lt;/li&gt;
&lt;li&gt;You want a specific quantized variant&lt;/li&gt;
&lt;li&gt;You have already downloaded the &lt;code&gt;GGUF&lt;/code&gt; file manually&lt;/li&gt;
&lt;li&gt;You want finer control over how the model is packaged&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If Ollama already provides an official version, using &lt;code&gt;pull&lt;/code&gt; is usually simpler. But when you need a specific quantization or a custom wrapper, &lt;code&gt;GGUF + Modelfile&lt;/code&gt; gives you more flexibility.&lt;/p&gt;
&lt;h2 id=&#34;common-notes&#34;&gt;Common notes
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;The path after &lt;code&gt;FROM&lt;/code&gt; must match the actual location of the &lt;code&gt;.gguf&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;If the filename contains spaces or special characters, it is better to rename it first.&lt;/li&gt;
&lt;li&gt;Different &lt;code&gt;GGUF&lt;/code&gt; quantization levels can greatly affect memory use and speed, so successful import does not guarantee smooth runtime performance.&lt;/li&gt;
&lt;li&gt;If the model is a chat model, you may still need to adjust the prompt template later for better results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Downloading a &lt;code&gt;GGUF&lt;/code&gt; file from Hugging Face and importing it into Ollama is not complicated. Prepare the model file, write a minimal &lt;code&gt;Modelfile&lt;/code&gt;, then run &lt;code&gt;ollama create&lt;/code&gt;, and you can bring a third-party &lt;code&gt;GGUF&lt;/code&gt; model into your Ollama workflow.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Troubleshoot Slow `ollama pull` Model Downloads</title>
        <link>https://knightli.com/en/2026/04/09/ollama-download-slow-troubleshooting/</link>
        <pubDate>Thu, 09 Apr 2026 10:42:39 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/09/ollama-download-slow-troubleshooting/</guid>
        <description>&lt;p&gt;&lt;code&gt;ollama pull model_name:tag&lt;/code&gt; can be very slow in some regions, and the download process is not always stable.&lt;/p&gt;
&lt;p&gt;If your issue looks like repeated interruptions halfway through a large model download, with errors such as &lt;code&gt;TLS handshake timeout&lt;/code&gt; or &lt;code&gt;unexpected EOF&lt;/code&gt;, the bottleneck may not be &lt;code&gt;registry.ollama.ai&lt;/code&gt; itself, but the actual download path after the redirect.&lt;/p&gt;
&lt;p&gt;This article walks through a simple troubleshooting approach: first get the real model file URLs, then confirm where the traffic actually ends up, and finally optimize only the domains that matter.&lt;/p&gt;
&lt;h2 id=&#34;get-the-model-file-download-urls&#34;&gt;Get the model file download URLs
&lt;/h2&gt;&lt;p&gt;You can use the following project to extract the manifest and blob download URLs for an Ollama model directly:&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Gholamrezadar/ollama-direct-downloader&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/Gholamrezadar/ollama-direct-downloader&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Using &lt;code&gt;gemma4:latest&lt;/code&gt; as an example, you can extract links like the following.&lt;/p&gt;
&lt;h3 id=&#34;manifest-url&#34;&gt;Manifest URL
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/manifests/latest
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;blob-urls&#34;&gt;Blob URLs
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you only want a quick verification, you can also download the manifest and blobs directly with &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -L &lt;span class=&#34;s2&#34;&gt;&amp;#34;https://registry.ollama.ai/v2/library/gemma4/manifests/latest&amp;#34;&lt;/span&gt; -o &lt;span class=&#34;s2&#34;&gt;&amp;#34;latest&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -L &lt;span class=&#34;s2&#34;&gt;&amp;#34;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11&amp;#34;&lt;/span&gt; -o &lt;span class=&#34;s2&#34;&gt;&amp;#34;sha256-f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -L &lt;span class=&#34;s2&#34;&gt;&amp;#34;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a&amp;#34;&lt;/span&gt; -o &lt;span class=&#34;s2&#34;&gt;&amp;#34;sha256-4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -L &lt;span class=&#34;s2&#34;&gt;&amp;#34;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2&amp;#34;&lt;/span&gt; -o &lt;span class=&#34;s2&#34;&gt;&amp;#34;sha256-7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;the-real-download-url-after-the-redirect&#34;&gt;The real download URL after the redirect
&lt;/h2&gt;&lt;p&gt;If you try downloading one of the blobs with &lt;code&gt;wget&lt;/code&gt;, you will notice that the request does not stay on &lt;code&gt;registry.ollama.ai&lt;/code&gt;. It gets redirected to a &lt;code&gt;Cloudflare R2&lt;/code&gt; object storage URL:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;There are a few key details in the log:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;registry.ollama.ai&lt;/code&gt; returns &lt;code&gt;307 Temporary Redirect&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The final download URL lands on &lt;code&gt;*.r2.cloudflarestorage.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The large file transfer is actually being served by the object storage domain behind the redirect&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This matters because if your proxy or routing rules only cover &lt;code&gt;registry.ollama.ai&lt;/code&gt; but not &lt;code&gt;*.r2.cloudflarestorage.com&lt;/code&gt;, downloads can still be slow or repeatedly interrupted.&lt;/p&gt;
&lt;p&gt;Here is one example of an actual redirect log:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--2026-04-09 09:22:04--  https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Resolving registry.ollama.ai (registry.ollama.ai)... 104.21.75.227, 172.67.182.229, 2606:4700:3034::ac43:b6e5, ...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Connecting to registry.ollama.ai (registry.ollama.ai)|104.21.75.227|:443... connected.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;HTTP request sent, awaiting response... 307 Temporary Redirect
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Location: https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?... [following]
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--2026-04-09 09:22:05--  https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Resolving dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com (dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com)... 172.64.66.1, 2606:4700:2ff9::1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Connecting to dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com|172.64.66.1|:443... connected.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;HTTP request sent, awaiting response... 200 OK
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Length: 9608338848 (8.9G) [application/octet-stream]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;adjust-your-network-settings&#34;&gt;Adjust your network settings
&lt;/h2&gt;&lt;p&gt;Once you confirm the real download path, the troubleshooting direction becomes much clearer.&lt;/p&gt;
&lt;p&gt;If you are using a proxy, split routing, or custom DNS, check these first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether &lt;code&gt;registry.ollama.ai&lt;/code&gt; and &lt;code&gt;*.r2.cloudflarestorage.com&lt;/code&gt; are using the same stable route&lt;/li&gt;
&lt;li&gt;Whether your proxy rules cover only the former but miss the latter&lt;/li&gt;
&lt;li&gt;Whether your current outbound path is suitable for sustained multi-GB downloads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key issue here is not simply whether the official site opens, but whether the redirected object storage path is stable enough for long-running large-file transfers. In many cases, the real bottleneck is the &lt;code&gt;Cloudflare R2&lt;/code&gt; layer rather than the registry domain in front of it.&lt;/p&gt;
&lt;h2 id=&#34;before-and-after-comparison&#34;&gt;Before-and-after comparison
&lt;/h2&gt;&lt;p&gt;Here is one real-world example while downloading &lt;code&gt;gemma4:31b-it-q8_0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Before adjusting the network path, the download was slow and failed midway:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;PS C:\Users\knightli&amp;gt; ollama run gemma4:31b-it-q8_0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pulling manifest
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pulling a0feadb736f5:  38% ▕██████████████████████                                    ▏  12 GB/ 33 GB  1.2 MB/s   4h40m
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Error: max retries exceeded: unexpected EOF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After the adjustment, the same model download became noticeably faster and more stable:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;PS C:\Users\knightli&amp;gt; ollama run gemma4:31b-it-q8_0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pulling manifest
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pulling a0feadb736f5:  46% ▕████████████████████████████████████████████████████████████████▏ 15 GB/ 33 GB  8.5 MB/s  35m23s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This does not mean every network environment will see the same improvement, but it does support one useful conclusion: the bottleneck may be the actual large-file download path rather than the Ollama client itself.&lt;/p&gt;
&lt;h2 id=&#34;a-more-practical-troubleshooting-order&#34;&gt;A more practical troubleshooting order
&lt;/h2&gt;&lt;p&gt;If you run into the same issue, this order usually works well:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;ollama pull&lt;/code&gt; or &lt;code&gt;ollama run&lt;/code&gt; once and confirm the issue is reproducible.&lt;/li&gt;
&lt;li&gt;Test a blob URL with &lt;code&gt;wget&lt;/code&gt; or &lt;code&gt;curl -L&lt;/code&gt; and confirm whether it redirects to &lt;code&gt;*.r2.cloudflarestorage.com&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Adjust your proxy or routing only for the real download domain, then test speed and stability again.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The benefit of this order is that each step validates one clear hypothesis, so you do not have to troubleshoot blindly.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;When &lt;code&gt;ollama pull&lt;/code&gt; is slow, the problem is often not that &lt;code&gt;registry.ollama.ai&lt;/code&gt; is unreachable, but that the &lt;code&gt;Cloudflare R2&lt;/code&gt; path actually serving the large files is unstable.&lt;/p&gt;
&lt;p&gt;So instead of retrying over and over, a better approach is to identify the real download path first and optimize the network route where the traffic actually lands.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Gemma 4 on Raspberry Pi 5: It Works, But Responses Are Slow</title>
        <link>https://knightli.com/en/2026/04/08/gemma4-on-raspberry-pi5-benchmark/</link>
        <pubDate>Wed, 08 Apr 2026 18:42:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/08/gemma4-on-raspberry-pi5-benchmark/</guid>
        <description>&lt;p&gt;I ran a near-limit experiment: running Gemma 4 on a &lt;code&gt;Raspberry Pi 5 (8GB RAM)&lt;/code&gt;. I was not targeting larger variants, only the smallest &lt;code&gt;E2B&lt;/code&gt; model.&lt;/p&gt;
&lt;p&gt;Conclusion first: it runs and it is usable, but it fits low-interaction workflows better than real-time chat.&lt;/p&gt;
&lt;h2 id=&#34;test-environment&#34;&gt;Test Environment
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Device: Raspberry Pi 5 (4-core CPU, 8GB RAM)&lt;/li&gt;
&lt;li&gt;OS: Ubuntu Server (no GUI)&lt;/li&gt;
&lt;li&gt;Access method: SSH&lt;/li&gt;
&lt;li&gt;Runtime: LM Studio CLI (command-line-only mode)&lt;/li&gt;
&lt;li&gt;Model: Gemma 4 E2B (about 4.5GB)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;step-1-install-and-start-lm-studio-cli&#34;&gt;Step 1: Install and Start LM Studio CLI
&lt;/h2&gt;&lt;p&gt;I installed the LM Studio CLI build on the Pi, then started the service and checked available commands.&lt;/p&gt;
&lt;p&gt;For a terminal-only setup, this deployment mode is a good fit for Raspberry Pi.&lt;/p&gt;
&lt;h2 id=&#34;step-2-move-model-storage-to-ssd&#34;&gt;Step 2: Move Model Storage to SSD
&lt;/h2&gt;&lt;p&gt;To avoid heavy SD card writes, I switched model download storage to an external SSD.&lt;/p&gt;
&lt;p&gt;On Raspberry Pi 5, SSD usage is much more practical than on older models. For long-term local model runs, SSD is strongly recommended.&lt;/p&gt;
&lt;h2 id=&#34;step-3-download-and-load-gemma-4-e2b&#34;&gt;Step 3: Download and Load Gemma 4 E2B
&lt;/h2&gt;&lt;p&gt;After download, the model loaded into memory successfully.&lt;/p&gt;
&lt;p&gt;According to official information, Gemma 4 includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tool-calling support for agent-style workflows (function calling)&lt;/li&gt;
&lt;li&gt;Multimodal capabilities (image/video; smaller models also include audio-related capability)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;128K&lt;/code&gt; context window&lt;/li&gt;
&lt;li&gt;Apache 2.0 license (commercial use allowed)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Given Raspberry Pi hardware limits, E2B is the most practical tier to start with.&lt;/p&gt;
&lt;h2 id=&#34;step-4-start-api-and-enable-lan-access&#34;&gt;Step 4: Start API and Enable LAN Access
&lt;/h2&gt;&lt;p&gt;After loading, I started the API on local port &lt;code&gt;4000&lt;/code&gt; and confirmed model listing works via HTTP.&lt;/p&gt;
&lt;p&gt;The issue: by default, it only listens on localhost, so other LAN devices cannot access it directly.&lt;/p&gt;
&lt;p&gt;Since host binding was not exposed by the startup options, I used &lt;code&gt;socat&lt;/code&gt; for port forwarding, bridging an external Pi port to LM Studio&amp;rsquo;s internal port.&lt;/p&gt;
&lt;p&gt;Result: successful. I could query the model list from a MacBook on the same LAN.&lt;/p&gt;
&lt;h2 id=&#34;step-5-connect-to-editor-zed&#34;&gt;Step 5: Connect to Editor (Zed)
&lt;/h2&gt;&lt;p&gt;LM Studio&amp;rsquo;s local server is OpenAI-API-compatible, so most tools that support custom &lt;code&gt;base_url&lt;/code&gt; can connect.&lt;/p&gt;
&lt;p&gt;I added a new LLM provider in Zed pointing to the Pi-hosted Gemma 4 instance, and in-editor chat worked.&lt;/p&gt;
&lt;h2 id=&#34;practical-usability&#34;&gt;Practical Usability
&lt;/h2&gt;&lt;p&gt;This setup is suitable for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Local automation scripts&lt;/li&gt;
&lt;li&gt;Low-concurrency, low-real-time assistant tasks&lt;/li&gt;
&lt;li&gt;Personal learning and edge-device experimentation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Less suitable for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High-frequency interactive chat&lt;/li&gt;
&lt;li&gt;Development collaboration scenarios sensitive to response latency&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Running Gemma 4 (E2B) on &lt;code&gt;Raspberry Pi 5&lt;/code&gt; is feasible, and the practical output quality is better than expected.&lt;/p&gt;
&lt;p&gt;If your goal is offline operation, tool integration, and lightweight-to-mid tasks, this setup is worth trying. If your goal is smooth real-time interaction, stronger hardware is still the better choice.&lt;/p&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/&#34; &gt;How to Install and Run Gemma 4 on Android: Complete Getting-Started Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/&#34; &gt;How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/openclaw-connect-gemma4-local/&#34; &gt;Connect OpenClaw to Local Gemma 4: Complete Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Connect OpenClaw to Local Gemma 4: Complete Setup Guide</title>
        <link>https://knightli.com/en/2026/04/08/openclaw-connect-gemma4-local/</link>
        <pubDate>Wed, 08 Apr 2026 18:18:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/08/openclaw-connect-gemma4-local/</guid>
        <description>&lt;p&gt;This guide shows how to connect &lt;code&gt;OpenClaw&lt;/code&gt; to a local &lt;code&gt;Gemma 4&lt;/code&gt; model through &lt;code&gt;Ollama&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you have not deployed Gemma 4 locally yet, start here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/&#34; &gt;How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;step-1-start-the-ollama-api-service&#34;&gt;Step 1: Start the Ollama API Service
&lt;/h2&gt;&lt;p&gt;Start Ollama first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then verify the API quickly with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl http://localhost:11434/api/generate -d &lt;span class=&#34;s1&#34;&gt;&amp;#39;{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  &amp;#34;model&amp;#34;: &amp;#34;gemma4:12b&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  &amp;#34;prompt&amp;#34;: &amp;#34;Hello&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you get a model response, your local API is ready.&lt;/p&gt;
&lt;h2 id=&#34;step-2-configure-openclaw-to-use-ollama&#34;&gt;Step 2: Configure OpenClaw to Use Ollama
&lt;/h2&gt;&lt;p&gt;The OpenClaw config file is usually located at:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;~/.openclaw/config.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Edit &lt;code&gt;config.yaml&lt;/code&gt; and add a local model entry under &lt;code&gt;models&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;models&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;c&#34;&gt;# Your existing model config...&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;gemma4-local&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;provider&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;ollama&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;base_url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;http://localhost:11434&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;gemma4:12b&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;timeout&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;120s&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;step-3-set-default-model-optional&#34;&gt;Step 3: Set Default Model (Optional)
&lt;/h2&gt;&lt;p&gt;If you want Gemma 4 as the default model:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;default_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;gemma4-local&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;step-4-restart-and-verify-openclaw&#34;&gt;Step 4: Restart and Verify OpenClaw
&lt;/h2&gt;&lt;p&gt;Restart OpenClaw:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;openclaw restart
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;List available models:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;openclaw models list
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Run a quick chat test:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;openclaw chat --model gemma4-local &lt;span class=&#34;s2&#34;&gt;&amp;#34;Hello&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the chat returns normally, OpenClaw is successfully connected to local Gemma 4.&lt;/p&gt;
&lt;h2 id=&#34;common-troubleshooting&#34;&gt;Common Troubleshooting
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;connection refused&lt;/code&gt;: make sure &lt;code&gt;ollama serve&lt;/code&gt; is running.&lt;/li&gt;
&lt;li&gt;Model not found: check model name with &lt;code&gt;ollama list&lt;/code&gt; (for example &lt;code&gt;gemma4:12b&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Timeout: increase &lt;code&gt;timeout&lt;/code&gt; and test a smaller model first.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/&#34; &gt;How to Install and Run Gemma 4 on Android: Complete Getting-Started Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/&#34; &gt;How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide</title>
        <link>https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/</link>
        <pubDate>Wed, 08 Apr 2026 18:06:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/</guid>
        <description>&lt;p&gt;If you want to run Gemma 4 locally on a laptop, &lt;code&gt;Ollama&lt;/code&gt; is one of the fastest and simplest options. Even without complex setup, you can usually get it running in about five minutes.&lt;/p&gt;
&lt;h2 id=&#34;step-1-install-ollama&#34;&gt;Step 1: Install Ollama
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Open &lt;code&gt;https://ollama.com&lt;/code&gt; and download the installer for your OS.&lt;/li&gt;
&lt;li&gt;Complete installation based on your system:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;macOS: drag it to &lt;code&gt;Applications&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Windows: run the &lt;code&gt;.exe&lt;/code&gt; installer.&lt;/li&gt;
&lt;li&gt;Linux: use the install script from the official site.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After installation, Ollama runs as a background service. Beyond initial setup, daily usage is mostly simple commands.&lt;/p&gt;
&lt;h2 id=&#34;step-2-download-a-gemma-4-model&#34;&gt;Step 2: Download a Gemma 4 Model
&lt;/h2&gt;&lt;p&gt;Open a terminal and run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama pull gemma4:4b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If your machine is stronger, you can switch to &lt;code&gt;12b&lt;/code&gt; or &lt;code&gt;27b&lt;/code&gt;. Once downloaded, the model is stored locally.&lt;/p&gt;
&lt;p&gt;Check downloaded models with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama list
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;step-3-run-the-model&#34;&gt;Step 3: Run the Model
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gemma4:4b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This opens an interactive chat session in your terminal. Type your prompt and press Enter. To exit, type:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;/bye
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you prefer a browser chat UI, you can pair it with &lt;code&gt;Open WebUI&lt;/code&gt;. It wraps Ollama with a local web interface and is usually quick to set up with Docker.&lt;/p&gt;
&lt;h2 id=&#34;laptop-performance-tips&#34;&gt;Laptop Performance Tips
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Apple Silicon (M2/M3/M4): Metal acceleration is enabled by default, and &lt;code&gt;12B&lt;/code&gt; can run well.&lt;/li&gt;
&lt;li&gt;NVIDIA GPU: CUDA is used automatically when a compatible GPU is detected. Keep drivers updated.&lt;/li&gt;
&lt;li&gt;CPU-only inference: works, but larger models will be slower. For most CPU-only setups, &lt;code&gt;4B&lt;/code&gt; is the practical default.&lt;/li&gt;
&lt;li&gt;Free memory before loading large models: as a rough rule, each billion parameters needs about &lt;code&gt;0.5GB to 1GB&lt;/code&gt; RAM.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-to-choose-a-model&#34;&gt;How to Choose a Model
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Gemma 4 1B&lt;/code&gt;: good for lightweight Q&amp;amp;A, simple summarization, and quick lookups; limited on complex reasoning.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma 4 4B&lt;/code&gt;: best for most daily tasks (writing help, coding help, document summarization) with strong speed/quality balance.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma 4 12B&lt;/code&gt;: better for longer context and more complex tasks, especially coding and reasoning.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma 4 27B&lt;/code&gt;: better for high-demand workloads and closer to frontier-cloud quality, but needs significantly stronger hardware.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/&#34; &gt;How to Install and Run Gemma 4 on Android: Complete Getting-Started Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Install and Run Gemma 4 on Android: Complete Getting-Started Guide</title>
        <link>https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/</link>
        <pubDate>Wed, 08 Apr 2026 17:55:53 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/</guid>
        <description>&lt;p&gt;If you want to run Gemma 4 offline on your phone, this guide walks you through the full process from setup to practical usage.&lt;/p&gt;
&lt;h2 id=&#34;step-1-get-the-app&#34;&gt;Step 1: Get the App
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Google AI Edge Gallery&lt;/code&gt; is currently not available on Google Play, so you need to install it via APK sideloading.&lt;/p&gt;
&lt;p&gt;On your Android device, go to:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Settings -&amp;gt; Apps -&amp;gt; Special app access -&amp;gt; Install unknown apps&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Find your browser (for example, Chrome or Firefox) and enable &amp;ldquo;Allow from this source.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Open the &lt;code&gt;Google AI Edge Gallery&lt;/code&gt; GitHub Releases page in your mobile browser.&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;URL: &lt;a class=&#34;link&#34; href=&#34;https://github.com/google-ai-edge/gallery/releases&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/google-ai-edge/gallery/releases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&#34;3&#34;&gt;
&lt;li&gt;Download the latest &lt;code&gt;.apk&lt;/code&gt; package.&lt;/li&gt;
&lt;li&gt;After the download completes, open the file from notifications or your file manager and follow the prompts.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With a stable connection, this step usually takes around 2 minutes.&lt;/p&gt;
&lt;h2 id=&#34;step-2-open-the-app-and-grant-permissions&#34;&gt;Step 2: Open the App and Grant Permissions
&lt;/h2&gt;&lt;p&gt;When you first open &lt;code&gt;AI Edge Gallery&lt;/code&gt;, it will request storage permission to save model files. It&amp;rsquo;s best to allow this; otherwise, the app cannot download or load models.&lt;/p&gt;
&lt;p&gt;You will typically see these sections on the home screen:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Ask Image&lt;/code&gt;: Vision tasks (describe images, answer questions about photos)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AI Chat&lt;/code&gt;: Standard text chat&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Summarize&lt;/code&gt;: Paste text and generate summaries&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Smart Reply&lt;/code&gt;: Generate reply suggestions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For most users, &lt;code&gt;AI Chat&lt;/code&gt; is the primary entry point.&lt;/p&gt;
&lt;h2 id=&#34;step-3-download-a-gemma-4-model&#34;&gt;Step 3: Download a Gemma 4 Model
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Enter &lt;code&gt;AI Chat&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Tap &lt;code&gt;Get Models&lt;/code&gt; when prompted.&lt;/li&gt;
&lt;li&gt;Choose a Gemma 4 model from the list (model size is shown).&lt;/li&gt;
&lt;li&gt;Pick based on your device capability; if your phone has &lt;code&gt;8GB RAM&lt;/code&gt;, start with &lt;code&gt;Gemma 4 4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Tap &lt;code&gt;Download&lt;/code&gt; and let it run in the background.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Note: Larger models take longer to download. You can download multiple models and switch between them later. Downloaded models stay on your device, so you do not need to re-download them.&lt;/p&gt;
&lt;h2 id=&#34;step-4-start-chatting&#34;&gt;Step 4: Start Chatting
&lt;/h2&gt;&lt;p&gt;After the model download is finished:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Tap the model name to load it (the first load usually takes 10 to 30 seconds depending on model size and device performance).&lt;/li&gt;
&lt;li&gt;Enter your prompt in the chat box and send it.&lt;/li&gt;
&lt;li&gt;The model generates responses locally, and your data does not leave the phone.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The first reply is often slower due to model warm-up. Later messages in the same session are usually faster.&lt;/p&gt;
&lt;h2 id=&#34;step-5-try-vision-features-gemma-4-multimodal&#34;&gt;Step 5: Try Vision Features (Gemma 4 Multimodal)
&lt;/h2&gt;&lt;p&gt;If you downloaded a Gemma 4 multimodal variant:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go back to the main menu and open &lt;code&gt;Ask Image&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Select an image or take a photo.&lt;/li&gt;
&lt;li&gt;Ask a question (for example, &amp;ldquo;What&amp;rsquo;s in this image?&amp;rdquo; or &amp;ldquo;Is there any text I should pay attention to?&amp;rdquo;).&lt;/li&gt;
&lt;li&gt;Wait for the model to analyze the image locally and return a result.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This feature works offline, and your image is not sent to external servers.&lt;/p&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/&#34; &gt;How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B</title>
        <link>https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</link>
        <pubDate>Sun, 05 Apr 2026 08:30:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</guid>
        <description>&lt;p&gt;Gemma 4 focuses on &lt;code&gt;multimodality&lt;/code&gt; and &lt;code&gt;local offline inference&lt;/code&gt;, with a full range from lightweight to high-performance models. For most local deployment users, the key is not choosing the largest model, but choosing the one that best matches hardware and task needs.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-model-comparison&#34;&gt;Gemma 4 Model Comparison
&lt;/h2&gt;&lt;blockquote&gt;
&lt;p&gt;The table below is for quick model selection. Actual performance and resource usage should be validated in your own environment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Parameter Size&lt;/th&gt;
          &lt;th&gt;Positioning&lt;/th&gt;
          &lt;th&gt;Key Strengths&lt;/th&gt;
          &lt;th&gt;Main Limitations&lt;/th&gt;
          &lt;th&gt;Recommended Scenarios&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 2B&lt;/td&gt;
          &lt;td&gt;2B&lt;/td&gt;
          &lt;td&gt;Ultra-lightweight&lt;/td&gt;
          &lt;td&gt;Low latency, low resource usage, lowest deployment barrier&lt;/td&gt;
          &lt;td&gt;Limited performance on complex reasoning and long task chains&lt;/td&gt;
          &lt;td&gt;Mobile, IoT, lightweight Q&amp;amp;A, simple automation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 4B&lt;/td&gt;
          &lt;td&gt;4B&lt;/td&gt;
          &lt;td&gt;Lightweight enhanced&lt;/td&gt;
          &lt;td&gt;Stronger understanding and generation than 2B, still easy to deploy locally&lt;/td&gt;
          &lt;td&gt;Limited ceiling for heavy coding and complex agent tasks&lt;/td&gt;
          &lt;td&gt;Local assistant, basic document work, multilingual daily tasks&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 26B&lt;/td&gt;
          &lt;td&gt;26B&lt;/td&gt;
          &lt;td&gt;High-performance (MoE)&lt;/td&gt;
          &lt;td&gt;Better reasoning and tool use, suitable for production workflows&lt;/td&gt;
          &lt;td&gt;Significantly higher VRAM requirement and hardware threshold&lt;/td&gt;
          &lt;td&gt;Coding assistant, complex workflows, enterprise internal agents&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 31B&lt;/td&gt;
          &lt;td&gt;31B&lt;/td&gt;
          &lt;td&gt;High-performance (dense)&lt;/td&gt;
          &lt;td&gt;Best overall capability and stronger stability on complex tasks&lt;/td&gt;
          &lt;td&gt;Highest resource cost and tuning complexity&lt;/td&gt;
          &lt;td&gt;Advanced reasoning, complex coding tasks, heavy automation&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;how-to-choose-start-from-hardware-and-tasks&#34;&gt;How to Choose: Start from Hardware and Tasks
&lt;/h2&gt;&lt;p&gt;If your top concern is whether it runs smoothly, use this guideline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;12GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;4B&lt;/code&gt; or quantized variants of larger models.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;24GB&lt;/code&gt; VRAM: focus on &lt;code&gt;26B&lt;/code&gt;, and evaluate quantized &lt;code&gt;31B&lt;/code&gt; based on workload.&lt;/li&gt;
&lt;li&gt;Higher VRAM or multi-GPU: consider high-precision &lt;code&gt;31B&lt;/code&gt; setups.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prioritize stability and inference speed first, then scale up model size gradually.&lt;/p&gt;
&lt;h2 id=&#34;four-typical-use-cases&#34;&gt;Four Typical Use Cases
&lt;/h2&gt;&lt;h3 id=&#34;1-local-general-assistant&#34;&gt;1) Local General Assistant
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;4B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: strong balance between cost and quality, suitable for long-running local use.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-coding-and-automation&#34;&gt;2) Coding and Automation
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;26B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: more stable in multi-step tasks, tool calls, and script generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;3-advanced-reasoning-and-complex-agents&#34;&gt;3) Advanced Reasoning and Complex Agents
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;31B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: stronger robustness under complex context.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;4-edge-devices-and-lightweight-offline-use&#34;&gt;4) Edge Devices and Lightweight Offline Use
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;2B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: easiest to deploy on resource-constrained devices.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;deployment-suggestions-ollama&#34;&gt;Deployment Suggestions (Ollama)
&lt;/h2&gt;&lt;p&gt;A practical approach is to iterate in small steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;4B&lt;/code&gt; to establish a baseline (latency, memory, quality).&lt;/li&gt;
&lt;li&gt;Build a fixed test set from real tasks (for example, 20 common questions + 10 automation tasks).&lt;/li&gt;
&lt;li&gt;Compare &lt;code&gt;26B/31B&lt;/code&gt; against that set for accuracy, latency, and VRAM cost.&lt;/li&gt;
&lt;li&gt;Upgrade only when the gain is clear.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This avoids jumping to a large model too early and running into lag, low throughput, and maintenance overhead.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The real value of Gemma 4 is not just larger parameter counts, but a practical model ladder from lightweight to high-performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For low-cost fast rollout: start with &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For production-grade local AI workflows: prioritize &lt;code&gt;26B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For advanced reasoning and heavy automation: move to &lt;code&gt;31B&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In most cases, the best Gemma 4 choice is not the biggest model, but the one with the best fit for your hardware and task goals.&lt;/p&gt;
&lt;!-- ollama-related-links:start --&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/&#34; &gt;LLM Quantization Guide (FP16/Q8/Q5/Q4/Q2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/uninstall-ollama-on-linux/&#34; &gt;Completely Uninstall Ollama on Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/ollama-model-storage-path-and-migration/&#34; &gt;Ollama Model Storage Path and Migration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/check-ollama-model-loaded-on-gpu/&#34; &gt;How to Check Whether Ollama Uses GPU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/&#34; &gt;How to Install and Run Gemma 4 on Android (Chinese)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/&#34; &gt;How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- ollama-related-links:end --&gt;
</description>
        </item>
        
    </channel>
</rss>
