<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Ollama on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/ollama/</link>
        <description>Recent content in Ollama on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Fri, 15 May 2026 23:27:50 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/ollama/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Claude Code &#43; Ollama Local Deployment Guide: Build a Free AI Coding Assistant with CC Switch</title>
        <link>https://knightli.com/en/2026/05/15/claude-code-ollama-cc-switch-local-agent/</link>
        <pubDate>Fri, 15 May 2026 23:27:50 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/15/claude-code-ollama-cc-switch-local-agent/</guid>
        <description>&lt;p&gt;&lt;code&gt;Claude Code&lt;/code&gt; has become a popular AI coding assistant recently. Its appeal is not just that it can chat about code, but that it can read a project, modify files, run commands, install dependencies, and keep fixing errors in an agent-like workflow.&lt;/p&gt;
&lt;p&gt;The hard part is cost. Once a project grows, long context and repeated agent turns can burn through API quota quickly. If you just want to experiment, refactor small utilities, generate scripts, or work on a private local project, it is natural to ask: can Claude Code&amp;rsquo;s workflow be kept while the model runs locally?&lt;/p&gt;
&lt;p&gt;The key tool in this setup is &lt;code&gt;CC Switch&lt;/code&gt;. It lets Claude Code connect to the local &lt;code&gt;Ollama&lt;/code&gt; service through an OpenAI-compatible API endpoint, so requests can be forwarded to a local model instead of the official Claude API.&lt;/p&gt;
&lt;h2 id=&#34;what-this-setup-solves&#34;&gt;What This Setup Solves
&lt;/h2&gt;&lt;p&gt;You can think of the whole setup as:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Claude Code desktop
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;+ CC Switch API forwarding layer
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;+ Ollama local model
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Claude Code is still responsible for the coding workflow and project operations. CC Switch handles model provider configuration and API compatibility. Ollama runs the model locally.&lt;/p&gt;
&lt;p&gt;This does not make a local model suddenly become Claude. Its real value is that it makes Claude Code&amp;rsquo;s agent workflow usable in lower-cost, offline, and private local scenarios.&lt;/p&gt;
&lt;h2 id=&#34;basic-preparation&#34;&gt;Basic Preparation
&lt;/h2&gt;&lt;p&gt;Before you start, prepare these pieces:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install &lt;code&gt;Git&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;Ollama&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pull a local model suitable for coding.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;CC Switch&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Have Claude Code available on your machine.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For the model side, you can start with coding-oriented models, such as Qwen Coder, DeepSeek Coder, or other models with decent tool-calling and code generation behavior. The larger the model, the better the result may be, but memory and GPU pressure will also rise.&lt;/p&gt;
&lt;p&gt;If your machine only has limited memory, start with a smaller model first. Confirm that the workflow runs smoothly before trying a larger one.&lt;/p&gt;
&lt;h2 id=&#34;key-cc-switch-configuration&#34;&gt;Key CC Switch Configuration
&lt;/h2&gt;&lt;p&gt;After Ollama starts, its default local API address is usually:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://127.0.0.1:11434/v1
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;In CC Switch, choose an OpenAI-compatible provider type, commonly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;OpenAI Chat Completions
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then point the base URL to Ollama&amp;rsquo;s local address.&lt;/p&gt;
&lt;p&gt;For the API key field, local Ollama normally does not need a real key, but many tools still require an environment variable or placeholder. You can use:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ANTHROPIC_API_KEY
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;or another placeholder variable accepted by your local setup.&lt;/p&gt;
&lt;p&gt;One configuration item is worth special attention:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&amp;#34;inferenceModels&amp;#34;=&amp;#34;[\&amp;#34;haiku\&amp;#34;,\&amp;#34;sonnet\&amp;#34;,\&amp;#34;opus\&amp;#34;]&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This means mapping Claude Code&amp;rsquo;s expected model roles to the local provider. In practice, you need to bind &lt;code&gt;haiku&lt;/code&gt;, &lt;code&gt;sonnet&lt;/code&gt;, and &lt;code&gt;opus&lt;/code&gt; to the model names exposed by Ollama or CC Switch. If this mapping is wrong, Claude Code may fail to call the model or may keep falling back to an unexpected configuration.&lt;/p&gt;
&lt;h2 id=&#34;where-claude-code-is-strong&#34;&gt;Where Claude Code Is Strong
&lt;/h2&gt;&lt;p&gt;Claude Code&amp;rsquo;s biggest advantage is not raw completion. It is the full coding workflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reading and understanding project structure;&lt;/li&gt;
&lt;li&gt;locating related files based on a task;&lt;/li&gt;
&lt;li&gt;editing code directly;&lt;/li&gt;
&lt;li&gt;running commands and tests;&lt;/li&gt;
&lt;li&gt;observing errors and iterating;&lt;/li&gt;
&lt;li&gt;completing multi-step tasks in one session.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is why many people want to keep Claude Code even when switching to a local model. A normal chat UI can generate code snippets, but it does not naturally operate inside a repository. Claude Code is closer to an executable development assistant.&lt;/p&gt;
&lt;h2 id=&#34;what-role-ollama-plays-here&#34;&gt;What Role Ollama Plays Here
&lt;/h2&gt;&lt;p&gt;Ollama is responsible for local model runtime and management. It handles model downloading, loading, and local inference.&lt;/p&gt;
&lt;p&gt;The advantage is clear: requests stay on your machine, repeated use does not create API bills, and you can use it when the network is limited. For private code, this is also easier to accept than sending every context window to a cloud model.&lt;/p&gt;
&lt;p&gt;The trade-off is also clear. Local models depend heavily on your hardware and on model quality. A smaller model can handle simple edits, explanations, and script generation, but it may struggle with large cross-file refactors or subtle architectural decisions.&lt;/p&gt;
&lt;h2 id=&#34;where-the-experience-has-boundaries&#34;&gt;Where The Experience Has Boundaries
&lt;/h2&gt;&lt;p&gt;This setup should not be treated as a full replacement for Claude&amp;rsquo;s strongest cloud models.&lt;/p&gt;
&lt;p&gt;You may run into these issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;weaker long-context understanding;&lt;/li&gt;
&lt;li&gt;unstable tool-calling behavior in complex tasks;&lt;/li&gt;
&lt;li&gt;slower inference on CPU-only machines;&lt;/li&gt;
&lt;li&gt;more hallucinated file paths or APIs;&lt;/li&gt;
&lt;li&gt;less reliable multi-round planning;&lt;/li&gt;
&lt;li&gt;lower success rate on large repository refactors.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the better expectation is: use it as a free local development assistant, not as a perfect substitute for a top-tier cloud model.&lt;/p&gt;
&lt;h2 id=&#34;multimodal-compatibility-is-still-unstable&#34;&gt;Multimodal Compatibility Is Still Unstable
&lt;/h2&gt;&lt;p&gt;Some users want Claude Code to handle screenshots, UI images, diagrams, or other multimodal inputs. This part depends on the local model and the forwarding layer.&lt;/p&gt;
&lt;p&gt;If the selected Ollama model does not support vision, or CC Switch does not translate the request format correctly, multimodal features may fail. Even with a vision model, behavior may differ from Claude&amp;rsquo;s official API.&lt;/p&gt;
&lt;p&gt;For now, this setup is more suitable for text and code workflows. Treat multimodal support as experimental.&lt;/p&gt;
&lt;h2 id=&#34;who-should-try-it&#34;&gt;Who Should Try It
&lt;/h2&gt;&lt;p&gt;This setup is suitable for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;developers who want to try Claude Code&amp;rsquo;s workflow at low cost;&lt;/li&gt;
&lt;li&gt;users who frequently write scripts, small tools, and automation snippets;&lt;/li&gt;
&lt;li&gt;teams that want to keep code on local machines;&lt;/li&gt;
&lt;li&gt;learners who want an AI coding assistant without constant API spend;&lt;/li&gt;
&lt;li&gt;people testing different local coding models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is less suitable if you rely heavily on long context, large monorepos, strict code review quality, or complex full-project refactors.&lt;/p&gt;
&lt;h2 id=&#34;usage-advice&#34;&gt;Usage Advice
&lt;/h2&gt;&lt;p&gt;Start with small tasks.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;explain a single file;&lt;/li&gt;
&lt;li&gt;refactor a small function;&lt;/li&gt;
&lt;li&gt;generate a shell script;&lt;/li&gt;
&lt;li&gt;fix a simple error;&lt;/li&gt;
&lt;li&gt;add a small feature;&lt;/li&gt;
&lt;li&gt;write unit tests for a narrow module.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After each change, run tests or at least review the diff yourself. A local model can be useful, but you should not blindly accept every generated edit.&lt;/p&gt;
&lt;p&gt;If the model keeps losing context, reduce the task scope. Instead of asking it to &amp;ldquo;refactor the whole project&amp;rdquo;, ask it to &amp;ldquo;refactor this function&amp;rdquo; or &amp;ldquo;add validation in this file&amp;rdquo;.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Claude Code + CC Switch + Ollama&lt;/code&gt; is an interesting combination. It keeps Claude Code&amp;rsquo;s agent-style development workflow while moving inference to a local model.&lt;/p&gt;
&lt;p&gt;Its biggest strengths are lower cost, local privacy, and a smooth development workflow. Its limits are also obvious: model quality, hardware performance, long context, and tool-calling stability all affect the final experience.&lt;/p&gt;
&lt;p&gt;If you already use Ollama and want a more practical local AI coding workflow, this setup is worth trying. Just remember to start small, verify every change, and treat the local model as an assistant rather than an automatic engineer.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Local LLM Models Recommended for an RTX 3060 GPU</title>
        <link>https://knightli.com/en/2026/05/08/rtx-3060-local-llm-models/</link>
        <pubDate>Fri, 08 May 2026 09:25:24 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/08/rtx-3060-local-llm-models/</guid>
        <description>&lt;p&gt;The most common RTX 3060 variant has 12GB of VRAM. It is not a top-tier AI GPU, but it is a very usable card for local LLMs, especially 7B, 8B, 9B, and 12B models.&lt;/p&gt;
&lt;p&gt;If you only want a quick rule of thumb, remember this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On an RTX 3060 12GB, prioritize around-8B models in Q4_K_M or Q5_K_M quantization. Choose Q4 for stability, and try Q5 if you want better quality.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Do not start by chasing 32B or 70B models. Even if they can run with low-bit quantization and CPU offloading, their speed and experience are usually not suitable for daily use.&lt;/p&gt;
&lt;h2 id=&#34;start-with-the-vram-limit&#34;&gt;Start With the VRAM Limit
&lt;/h2&gt;&lt;p&gt;For local LLMs on an RTX 3060 12GB, the real limit is VRAM.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model Size&lt;/th&gt;
          &lt;th&gt;Recommended Quantization&lt;/th&gt;
          &lt;th&gt;RTX 3060 12GB Experience&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;3B / 4B&lt;/td&gt;
          &lt;td&gt;Q4, Q5, Q8&lt;/td&gt;
          &lt;td&gt;Very easy, fast&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;7B / 8B / 9B&lt;/td&gt;
          &lt;td&gt;Q4_K_M, Q5_K_M&lt;/td&gt;
          &lt;td&gt;Best balance of quality and speed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;12B / 14B&lt;/td&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;Usable, but avoid huge context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;30B+&lt;/td&gt;
          &lt;td&gt;Q2 / Q3 or partial offload&lt;/td&gt;
          &lt;td&gt;Possible to tinker with, not recommended daily&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;70B+&lt;/td&gt;
          &lt;td&gt;Very low quantization or heavy CPU/RAM use&lt;/td&gt;
          &lt;td&gt;More like an experiment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Local LLMs do not only consume VRAM for the model file. Context length, KV cache, batch size, inference framework, and drivers all consume resources.&lt;/p&gt;
&lt;p&gt;So 12GB of VRAM does not mean you can load a 12GB model file directly. It is better to leave room for the system and context.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-1-qwen3-8b&#34;&gt;Recommendation 1: Qwen3 8B
&lt;/h2&gt;&lt;p&gt;If you mainly use Chinese, &lt;code&gt;Qwen3 8B&lt;/code&gt; is one of the first models worth trying on an RTX 3060.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chinese Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Summarization and rewriting.&lt;/li&gt;
&lt;li&gt;Everyday knowledge assistant work.&lt;/li&gt;
&lt;li&gt;Simple code explanation.&lt;/li&gt;
&lt;li&gt;Local RAG.&lt;/li&gt;
&lt;li&gt;Lightweight Agent flows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Qwen3 8B GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M: first choice
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q5_K_M: better quality, more VRAM pressure
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Qwen models are friendly to Chinese usage. For daily writing, information organization, and Chinese instruction following, Qwen3 8B is a good first model.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-2-llama-31-8b-instruct&#34;&gt;Recommendation 2: Llama 3.1 8B Instruct
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Llama 3.1 8B Instruct&lt;/code&gt; is a stable general-purpose model with mature English capability and ecosystem support.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;English Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Lightweight coding help.&lt;/li&gt;
&lt;li&gt;General chat.&lt;/li&gt;
&lt;li&gt;Document summarization.&lt;/li&gt;
&lt;li&gt;Prompt testing.&lt;/li&gt;
&lt;li&gt;Comparing different inference tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Llama 3.1 8B Instruct GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M: better speed and VRAM stability
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q5_K_M: better answer quality
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you mainly process English materials, or want a model with many tutorials and broad compatibility, Llama 3.1 8B is still a good baseline.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-3-gemma-3-12b&#34;&gt;Recommendation 3: Gemma 3 12B
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Gemma 3 12B&lt;/code&gt; is closer to the upper practical limit for an RTX 3060 12GB.&lt;/p&gt;
&lt;p&gt;It uses more VRAM than 8B models, but Q4 quantization can still make it usable on a 12GB card. It is a good option if you want to try a slightly larger model on one GPU.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Higher-quality general Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;English content processing.&lt;/li&gt;
&lt;li&gt;More complex summarization and analysis.&lt;/li&gt;
&lt;li&gt;Trying an upgrade over 8B models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Gemma 3 12B GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M or official QAT Q4
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Keep context modest
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you run out of VRAM, reduce context length first, or return to an 8B model. For an RTX 3060, 12B is &amp;ldquo;worth trying,&amp;rdquo; not a no-brainer default.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-4-deepseek-r1-distill-qwen-8b&#34;&gt;Recommendation 4: DeepSeek R1 Distill Qwen 8B
&lt;/h2&gt;&lt;p&gt;If you want to experience reasoning-style local models, try models like &lt;code&gt;DeepSeek R1 Distill Qwen 8B&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple reasoning tasks.&lt;/li&gt;
&lt;li&gt;Step-by-step analysis.&lt;/li&gt;
&lt;li&gt;Learning reasoning-model output style.&lt;/li&gt;
&lt;li&gt;Low-cost local experiments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended choice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;DeepSeek R1 Distill Qwen 8B GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;These models may produce longer reasoning-style outputs, so speed and context usage can be heavier than ordinary instruction models. They are not always more comfortable for daily chat, but they are useful for reasoning experiments.&lt;/p&gt;
&lt;h2 id=&#34;recommendation-5-phi--minicpm--smaller-models&#34;&gt;Recommendation 5: Phi / MiniCPM / Smaller Models
&lt;/h2&gt;&lt;p&gt;If your RTX 3060 is an 8GB variant, or your system RAM is limited, consider 3B and 4B models first.&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fast Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Simple summaries.&lt;/li&gt;
&lt;li&gt;Embedding into local tools.&lt;/li&gt;
&lt;li&gt;Low-latency chat.&lt;/li&gt;
&lt;li&gt;Testing on older machines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These models may not match 8B or 12B quality, but they are light, fast, and easy to deploy.&lt;/p&gt;
&lt;h2 id=&#34;which-quantization-to-use&#34;&gt;Which Quantization to Use
&lt;/h2&gt;&lt;p&gt;Local models commonly use &lt;code&gt;GGUF&lt;/code&gt;, with quantization types such as Q4, Q5, Q6, and Q8.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th&gt;Traits&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;Small, fast, good enough&lt;/td&gt;
          &lt;td&gt;RTX 3060 first choice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q5_K_M&lt;/td&gt;
          &lt;td&gt;Better quality, higher usage&lt;/td&gt;
          &lt;td&gt;Try with 8B models&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q6 / Q8&lt;/td&gt;
          &lt;td&gt;Closer to original quality, larger&lt;/td&gt;
          &lt;td&gt;Small models or more VRAM&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q2 / Q3&lt;/td&gt;
          &lt;td&gt;Saves VRAM but quality drops&lt;/td&gt;
          &lt;td&gt;Large-model tinkering&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For RTX 3060 12GB, the practical choices are:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B models: Q4_K_M or Q5_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;12B models: Q4_K_M first
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Larger models: not recommended as daily drivers
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;which-tool-to-use&#34;&gt;Which Tool to Use
&lt;/h2&gt;&lt;p&gt;Beginners can start with &lt;code&gt;Ollama&lt;/code&gt;, because installation and running models are simple.&lt;/p&gt;
&lt;p&gt;Common commands:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run qwen3:8b
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run llama3.1:8b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want finer control over GGUF files, GPU layers, and context length, use &lt;code&gt;llama.cpp&lt;/code&gt; or GUI tools based on it.&lt;/p&gt;
&lt;p&gt;Common choices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Ollama&lt;/code&gt;: easiest, best for beginners.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LM Studio&lt;/code&gt;: friendly GUI, good for downloading and switching models.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama.cpp&lt;/code&gt;: most control, best for performance tuning.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;text-generation-webui&lt;/code&gt;: many features, good for backend testing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For local chat and simple Q&amp;amp;A, Ollama or LM Studio is enough.&lt;/p&gt;
&lt;h2 id=&#34;do-not-set-context-too-high&#34;&gt;Do Not Set Context Too High
&lt;/h2&gt;&lt;p&gt;Many models advertise long-context support, but do not blindly set context to the maximum on an RTX 3060.&lt;/p&gt;
&lt;p&gt;Longer context uses more KV cache and increases VRAM pressure. Even if the model loads, long context can slow generation down.&lt;/p&gt;
&lt;p&gt;Suggested settings:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Normal chat: 4K to 8K
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Document summaries: 8K to 16K
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Long-document RAG: chunk first; do not paste everything at once
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;An RTX 3060 is better suited to &amp;ldquo;moderate context + good model + good retrieval&amp;rdquo; than forcing hundreds of thousands of tokens into one prompt.&lt;/p&gt;
&lt;h2 id=&#34;choose-by-use-case&#34;&gt;Choose by Use Case
&lt;/h2&gt;&lt;p&gt;If you mainly write Chinese:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;First choice: Qwen3 8B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Alternative: DeepSeek R1 Distill Qwen 8B
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you mainly write English:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;First choice: Llama 3.1 8B Instruct Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Alternative: Gemma 3 12B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want speed:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;3B / 4B models
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Keep context at 4K to 8K
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want better quality:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B Q5_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;12B Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Accept slower speed
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want coding help:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;8B coding models can help with explanations and small edits
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;For complex engineering tasks, use stronger cloud models
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Local RTX 3060 models are good for code explanation, function completion, small scripts, and offline assistance. For large refactors, difficult bugs, and cross-file Agent work, do not expect Claude Sonnet or GPT-5-level performance.&lt;/p&gt;
&lt;h2 id=&#34;reasonable-expectations&#34;&gt;Reasonable Expectations
&lt;/h2&gt;&lt;p&gt;The RTX 3060 12GB is good enough to turn local LLMs from toys into daily tools, but it will not recreate top cloud models at home.&lt;/p&gt;
&lt;p&gt;Its strengths:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Low cost.&lt;/li&gt;
&lt;li&gt;More VRAM than 8GB cards.&lt;/li&gt;
&lt;li&gt;Good 8B model experience.&lt;/li&gt;
&lt;li&gt;Offline use.&lt;/li&gt;
&lt;li&gt;Local processing for privacy-sensitive materials.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Its limits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Large models are hard to run smoothly.&lt;/li&gt;
&lt;li&gt;Long context consumes VRAM.&lt;/li&gt;
&lt;li&gt;Slower than high-end GPUs.&lt;/li&gt;
&lt;li&gt;Small local models have limited complex reasoning.&lt;/li&gt;
&lt;li&gt;Multimodal and Agent workflows need more resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The stable route is: use 8B models as everyday local assistants, try 12B models for quality, and leave complex tasks to cloud models.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Recommended local LLM choices for RTX 3060 12GB:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chinese general use: &lt;code&gt;Qwen3 8B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;English general use: &lt;code&gt;Llama 3.1 8B Instruct Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Higher-quality experiment: &lt;code&gt;Gemma 3 12B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Reasoning experiment: &lt;code&gt;DeepSeek R1 Distill Qwen 8B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Low-VRAM fast use: 3B / 4B small models&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Choose &lt;code&gt;Q4_K_M&lt;/code&gt; first. Try &lt;code&gt;Q5_K_M&lt;/code&gt; for 8B models if you want better quality. Start with Ollama or LM Studio.&lt;/p&gt;
&lt;p&gt;Do not treat the RTX 3060 as a large-model server. Treat it as a local knowledge assistant, privacy document processor, lightweight coding helper, and model experiment card, and it will fit its real capabilities much better.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Qwen3 8B GGUF: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3-8B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/Qwen/Qwen3-8B-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Llama 3.1 8B GGUF: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/macandchiz/Llama-3.1-8B-Instruct-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/macandchiz/Llama-3.1-8B-Instruct-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Gemma 3 12B GGUF: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-3-12b-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/unsloth/gemma-3-12b-it-GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;llama.cpp: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama: &lt;a class=&#34;link&#34; href=&#34;https://ollama.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://ollama.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Fix Ollama Using CPU Instead of GPU</title>
        <link>https://knightli.com/en/2026/04/24/fix-ollama-using-cpu-instead-of-gpu/</link>
        <pubDate>Fri, 24 Apr 2026 18:30:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/24/fix-ollama-using-cpu-instead-of-gpu/</guid>
        <description>&lt;p&gt;When running local LLMs, one of the most frustrating problems is this: your machine clearly has a GPU, yet &lt;code&gt;Ollama&lt;/code&gt; still leans heavily on the &lt;code&gt;CPU&lt;/code&gt;, and performance is painfully slow.&lt;/p&gt;
&lt;p&gt;The short version is that this is usually not caused by one single issue. The most common causes are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Ollama&lt;/code&gt; is not detecting any usable GPU&lt;/li&gt;
&lt;li&gt;The driver, &lt;code&gt;ROCm&lt;/code&gt;, or &lt;code&gt;CUDA&lt;/code&gt; environment is not set up correctly&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Ollama&lt;/code&gt; service was started without the right environment variables&lt;/li&gt;
&lt;li&gt;The model is too large and has fallen back to &lt;code&gt;CPU&lt;/code&gt; or mixed &lt;code&gt;CPU/GPU&lt;/code&gt; loading&lt;/li&gt;
&lt;li&gt;On AMD platforms, there may be extra compatibility issues such as &lt;code&gt;ROCm&lt;/code&gt; version mismatch, &lt;code&gt;gfx&lt;/code&gt; settings, or device visibility problems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The fastest way to troubleshoot it is to go through the checks below in order.&lt;/p&gt;
&lt;h2 id=&#34;1-first-confirm-whether-ollama-is-really-not-using-the-gpu&#34;&gt;1. First, confirm whether Ollama is really not using the GPU
&lt;/h2&gt;&lt;p&gt;The most direct check is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Focus on the &lt;code&gt;PROCESSOR&lt;/code&gt; column.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;100% GPU&lt;/code&gt;: the model is fully running on the GPU&lt;/li&gt;
&lt;li&gt;&lt;code&gt;100% CPU&lt;/code&gt;: the GPU is not being used at all&lt;/li&gt;
&lt;li&gt;Results like &lt;code&gt;48%/52% CPU/GPU&lt;/code&gt;: part of the model is in VRAM, and part has spilled into system memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you see &lt;code&gt;100% CPU&lt;/code&gt;, the next step is to focus on environment and service configuration.&lt;br&gt;
If you see mixed loading, that does not necessarily mean the GPU is broken. In many cases, it simply means VRAM is not enough.&lt;/p&gt;
&lt;h2 id=&#34;2-rule-out-the-most-common-misunderstanding-first-the-model-does-not-fit-into-vram&#34;&gt;2. Rule out the most common misunderstanding first: the model does not fit into VRAM
&lt;/h2&gt;&lt;p&gt;Many people assume that once a GPU is installed, &lt;code&gt;Ollama&lt;/code&gt; will always run fully on it. That is not how it works.&lt;/p&gt;
&lt;p&gt;If the model is too large, the context is too long, or some other loaded model is already occupying VRAM, &lt;code&gt;Ollama&lt;/code&gt; may fall back to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partial GPU + partial CPU&lt;/li&gt;
&lt;li&gt;Full &lt;code&gt;100% CPU&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At this point, the two simplest tests are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Try a smaller model first&lt;br&gt;
For example, test with a &lt;code&gt;4B&lt;/code&gt; or &lt;code&gt;7B&lt;/code&gt; model before jumping straight to much larger ones.&lt;/li&gt;
&lt;li&gt;Unload other active models and test again&lt;br&gt;
Run &lt;code&gt;ollama ps&lt;/code&gt; first and make sure nothing else is occupying VRAM.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If smaller models use the GPU but larger ones do not, the real problem is usually VRAM capacity rather than the driver.&lt;/p&gt;
&lt;h2 id=&#34;3-check-whether-the-gpu-driver-and-the-lower-level-runtime-are-actually-working&#34;&gt;3. Check whether the GPU driver and the lower-level runtime are actually working
&lt;/h2&gt;&lt;p&gt;If even small models run only on &lt;code&gt;CPU&lt;/code&gt;, the next step is to check the underlying environment.&lt;/p&gt;
&lt;h3 id=&#34;nvidia&#34;&gt;NVIDIA
&lt;/h3&gt;&lt;p&gt;First confirm that the driver is working and the system can see the GPU. A common check is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If this already fails, &lt;code&gt;Ollama&lt;/code&gt; is very unlikely to use the GPU correctly.&lt;/p&gt;
&lt;h3 id=&#34;amd--rocm&#34;&gt;AMD / ROCm
&lt;/h3&gt;&lt;p&gt;If you are using an &lt;code&gt;AMD GPU&lt;/code&gt;, especially with &lt;code&gt;ROCm&lt;/code&gt;, start with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;rocminfo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;rocm-smi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If these tools cannot list the device properly, the problem is still below &lt;code&gt;Ollama&lt;/code&gt;, so there is no point debugging the application layer yet.&lt;/p&gt;
&lt;p&gt;On AMD, the most common issue is not simply “is the driver installed,” but rather:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;ROCm&lt;/code&gt; version does not match the OS version&lt;/li&gt;
&lt;li&gt;The current GPU architecture has incomplete support&lt;/li&gt;
&lt;li&gt;The device exists, but the runtime is not being exposed correctly to &lt;code&gt;Ollama&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;4-restart-the-ollama-service-not-just-your-terminal&#34;&gt;4. Restart the Ollama service, not just your terminal
&lt;/h2&gt;&lt;p&gt;This is a very common trap.&lt;/p&gt;
&lt;p&gt;Many people install drivers, change environment variables, fix &lt;code&gt;ROCm&lt;/code&gt;, then just open a new terminal and continue with &lt;code&gt;ollama run&lt;/code&gt;. But if &lt;code&gt;Ollama&lt;/code&gt; is running as a background service, it may still be using the old environment.&lt;/p&gt;
&lt;p&gt;So the safer approach is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fully restart the &lt;code&gt;Ollama&lt;/code&gt; service&lt;/li&gt;
&lt;li&gt;Reboot the machine if necessary&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are running it as a service on Linux, make sure the service process was actually restarted instead of reusing the old one.&lt;/p&gt;
&lt;h2 id=&#34;5-check-whether-the-environment-variables-are-really-reaching-the-service&#34;&gt;5. Check whether the environment variables are really reaching the service
&lt;/h2&gt;&lt;p&gt;This matters especially on &lt;code&gt;AMD ROCm&lt;/code&gt; systems.&lt;/p&gt;
&lt;p&gt;Some machines work fine when commands are run manually in a shell, but the &lt;code&gt;Ollama&lt;/code&gt; service still uses only &lt;code&gt;CPU&lt;/code&gt;. In that case, the usual reason is that the service process never received the variables you set in your shell.&lt;/p&gt;
&lt;p&gt;Common variables to look at include:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ROCR_VISIBLE_DEVICES
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;HSA_OVERRIDE_GFX_VERSION
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Specifically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; limits or selects which GPUs &lt;code&gt;ROCm&lt;/code&gt; can see&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt; is often used as a compatibility workaround on some AMD platforms&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only &lt;code&gt;export&lt;/code&gt; these variables in the current terminal, but &lt;code&gt;Ollama&lt;/code&gt; is started by systemd, a desktop background service, or another daemon, they may not take effect.&lt;/p&gt;
&lt;p&gt;In other words, “it looks set in my terminal” does not mean &lt;code&gt;Ollama&lt;/code&gt; is actually using it.&lt;/p&gt;
&lt;h2 id=&#34;6-on-amd-platforms-focus-on-rocm-compatibility&#34;&gt;6. On AMD platforms, focus on ROCm compatibility
&lt;/h2&gt;&lt;p&gt;Based on the public page metadata, the original video for this topic is tied to &lt;code&gt;AMD Max+ 395&lt;/code&gt;, &lt;code&gt;strix halo&lt;/code&gt;, and &lt;code&gt;AMD ROCm&lt;/code&gt;.&lt;br&gt;
In setups like these, &lt;code&gt;Ollama&lt;/code&gt; failing to use the GPU is often more dependent on version matching than on NVIDIA systems.&lt;/p&gt;
&lt;p&gt;Start by checking these:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Whether the installed &lt;code&gt;ROCm&lt;/code&gt; version fits the current OS and GPU&lt;/li&gt;
&lt;li&gt;Whether the GPU belongs to an architecture with solid &lt;code&gt;ROCm&lt;/code&gt; support&lt;/li&gt;
&lt;li&gt;Whether you need to set &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Whether an older &lt;code&gt;Ollama&lt;/code&gt; build or older inference runtime is causing compatibility issues&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If &lt;code&gt;rocminfo&lt;/code&gt; works and the GPU is visible to the system, but &lt;code&gt;Ollama&lt;/code&gt; still runs only on &lt;code&gt;CPU&lt;/code&gt;, the issue is often in the version combination rather than in model parameters.&lt;/p&gt;
&lt;h2 id=&#34;7-in-docker-wsl-or-remote-environments-also-check-device-mapping&#34;&gt;7. In Docker, WSL, or remote environments, also check device mapping
&lt;/h2&gt;&lt;p&gt;If you are not running on bare metal but inside:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;WSL&lt;/li&gt;
&lt;li&gt;Remote containers&lt;/li&gt;
&lt;li&gt;Virtualized environments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then you need to check one more layer: whether the GPU device is actually being exposed inside that environment.&lt;/p&gt;
&lt;p&gt;A typical symptom looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The host machine can see the GPU&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ollama&lt;/code&gt; inside the container or subsystem still uses only &lt;code&gt;CPU&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In that case, the issue may not be &lt;code&gt;Ollama&lt;/code&gt; itself. The container or subsystem may simply not have GPU access.&lt;/p&gt;
&lt;h2 id=&#34;8-check-logs-last-but-check-them-for-the-right-reason&#34;&gt;8. Check logs last, but check them for the right reason
&lt;/h2&gt;&lt;p&gt;If you have already gone through the earlier steps, the most effective next move is not endless reinstalling, but looking directly at the &lt;code&gt;Ollama&lt;/code&gt; startup and runtime logs.&lt;/p&gt;
&lt;p&gt;Focus on two kinds of messages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether a GPU was detected at all&lt;/li&gt;
&lt;li&gt;Whether there are driver, library loading, or device initialization errors&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the logs clearly say something like “no compatible GPU found” or “failed to initialize ROCm/CUDA,” the troubleshooting direction becomes much clearer immediately.&lt;/p&gt;
&lt;h2 id=&#34;troubleshooting-order&#34;&gt;Troubleshooting Order
&lt;/h2&gt;&lt;p&gt;If you only want the shortest path, use this order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;ollama ps&lt;/code&gt; and confirm whether it is &lt;code&gt;GPU&lt;/code&gt;, &lt;code&gt;CPU&lt;/code&gt;, or mixed loading&lt;/li&gt;
&lt;li&gt;Try a smaller model to rule out VRAM limits&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;nvidia-smi&lt;/code&gt;, &lt;code&gt;rocminfo&lt;/code&gt;, and &lt;code&gt;rocm-smi&lt;/code&gt; to verify the lower-level environment first&lt;/li&gt;
&lt;li&gt;Fully restart the &lt;code&gt;Ollama&lt;/code&gt; service&lt;/li&gt;
&lt;li&gt;Check service environment variables, especially &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; and &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt; on AMD&lt;/li&gt;
&lt;li&gt;If you are in Docker or WSL, verify device mapping&lt;/li&gt;
&lt;li&gt;Finally, inspect logs for the exact error&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;When &lt;code&gt;Ollama&lt;/code&gt; uses &lt;code&gt;CPU&lt;/code&gt; instead of &lt;code&gt;GPU&lt;/code&gt;, the root cause usually falls into one of three groups:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The GPU is not being detected at all&lt;/li&gt;
&lt;li&gt;The GPU is detectable, but the runtime environment is not reaching &lt;code&gt;Ollama&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The GPU is working, but the model is too large and falls back to &lt;code&gt;CPU&lt;/code&gt; or mixed memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once you separate those three cases, troubleshooting becomes much faster.&lt;br&gt;
If you are on an AMD platform, pay special attention to &lt;code&gt;ROCm&lt;/code&gt; version matching, device visibility, and compatibility variables instead of focusing only on the &lt;code&gt;Ollama&lt;/code&gt; command itself.&lt;/p&gt;
&lt;p&gt;Original video: &lt;a class=&#34;link&#34; href=&#34;https://www.bilibili.com/video/BV1cHoYBqE8k/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://www.bilibili.com/video/BV1cHoYBqE8k/&lt;/a&gt;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings</title>
        <link>https://knightli.com/en/2026/04/19/ollama-multiple-gpu-notes/</link>
        <pubDate>Sun, 19 Apr 2026 00:18:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/19/ollama-multiple-gpu-notes/</guid>
        <description>&lt;p&gt;When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?&lt;/p&gt;
&lt;p&gt;This note summarizes how Ollama behaves with multiple GPUs. The short version:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ollama supports multiple GPUs.&lt;/li&gt;
&lt;li&gt;The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.&lt;/li&gt;
&lt;li&gt;By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.&lt;/li&gt;
&lt;li&gt;If a model does not fit on one GPU, Ollama can spread it across available GPUs.&lt;/li&gt;
&lt;li&gt;Mixed GPU models may be visible to Ollama, but performance and placement may not be ideal.&lt;/li&gt;
&lt;li&gt;SLI / NVLink is not required for multi-GPU use.&lt;/li&gt;
&lt;li&gt;To limit which GPUs Ollama can use, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt;, &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt;, or &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;official-behavior-single-gpu-first-multi-gpu-when-needed&#34;&gt;Official Behavior: Single GPU First, Multi-GPU When Needed
&lt;/h2&gt;&lt;p&gt;Ollama&amp;rsquo;s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.&lt;/p&gt;
&lt;p&gt;The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.&lt;/p&gt;
&lt;p&gt;So do not think of Ollama multi-GPU as &amp;ldquo;more cards automatically means several times faster.&amp;rdquo; A more accurate model is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small model fits on one GPU: usually runs on one GPU.&lt;/li&gt;
&lt;li&gt;Large model does not fit on one GPU: split across multiple GPUs.&lt;/li&gt;
&lt;li&gt;Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use this command to see where the model is loaded:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The &lt;code&gt;PROCESSOR&lt;/code&gt; column may show something like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;100% GPU
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;48%/52% CPU/GPU
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;100% CPU
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you see &lt;code&gt;48%/52% CPU/GPU&lt;/code&gt;, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.&lt;/p&gt;
&lt;h2 id=&#34;multi-gpu-is-not-simple-compute-stacking&#34;&gt;Multi-GPU Is Not Simple Compute Stacking
&lt;/h2&gt;&lt;p&gt;Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.&lt;/p&gt;
&lt;p&gt;So multi-GPU benefits usually fall into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.&lt;/li&gt;
&lt;li&gt;Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama&amp;rsquo;s default &amp;ldquo;use one GPU when it fits&amp;rdquo; strategy avoids that unnecessary PCIe cost.&lt;/p&gt;
&lt;h2 id=&#34;sli-or-nvlink-is-not-required&#34;&gt;SLI or NVLink Is Not Required
&lt;/h2&gt;&lt;p&gt;Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.&lt;/p&gt;
&lt;p&gt;NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.&lt;/p&gt;
&lt;p&gt;What you should pay attention to is PCIe bandwidth. The difference between &lt;code&gt;x1&lt;/code&gt;, &lt;code&gt;x4&lt;/code&gt;, &lt;code&gt;x8&lt;/code&gt;, and &lt;code&gt;x16&lt;/code&gt; affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.&lt;/p&gt;
&lt;p&gt;Safer rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prefer x16 / x8 over mining-style x1 risers.&lt;/li&gt;
&lt;li&gt;PCIe bandwidth matters more when switching large models frequently.&lt;/li&gt;
&lt;li&gt;If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.&lt;/li&gt;
&lt;li&gt;For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;limit-which-nvidia-gpus-ollama-uses&#34;&gt;Limit Which NVIDIA GPUs Ollama Uses
&lt;/h2&gt;&lt;p&gt;On NVIDIA multi-GPU systems, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; to control which GPUs Ollama can see.&lt;/p&gt;
&lt;p&gt;Temporary run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Use only the second GPU:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Force Ollama not to use NVIDIA GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi -L
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Example output:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then specify the UUID:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If Ollama is installed as a Linux systemd service, put the variable into the service environment:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl edit ollama.service
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Add:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-ini&#34; data-lang=&#34;ini&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Service]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;CUDA_VISIBLE_DEVICES=0,1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Reload and restart:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl daemon-reload
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl restart ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;amd-and-vulkan-device-selection&#34;&gt;AMD and Vulkan Device Selection
&lt;/h2&gt;&lt;p&gt;For AMD ROCm, use &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; to control visible GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;ROCR_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To force Ollama not to use ROCm GPUs, use an invalid ID:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;ROCR_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Ollama&amp;rsquo;s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OLLAMA_VULKAN&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;GGML_VK_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If Vulkan devices cause problems, disable them:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;GGML_VK_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt;. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.&lt;/p&gt;
&lt;h2 id=&#34;exposing-multiple-gpus-in-docker&#34;&gt;Exposing Multiple GPUs in Docker
&lt;/h2&gt;&lt;p&gt;If you run Ollama in Docker, NVIDIA setups usually require &lt;code&gt;nvidia-container-toolkit&lt;/code&gt;, then &lt;code&gt;--gpus&lt;/code&gt; to expose devices.&lt;/p&gt;
&lt;p&gt;Expose all GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Expose specific GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus &lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#34;device=0,1&amp;#34;&amp;#39;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You can also combine this with environment variables:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -e &lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If &lt;code&gt;nvidia-smi&lt;/code&gt; cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.&lt;/p&gt;
&lt;h2 id=&#34;what-is-ollama_sched_spread&#34;&gt;What Is &lt;code&gt;OLLAMA_SCHED_SPREAD&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;In some multi-GPU configuration discussions, you may see &lt;code&gt;OLLAMA_SCHED_SPREAD=1&lt;/code&gt; or &lt;code&gt;OLLAMA_SCHED_SPREAD=true&lt;/code&gt;. It is related to Ollama&amp;rsquo;s scheduler and is often used when people want models or requests to be spread more broadly across GPUs.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OLLAMA_SCHED_SPREAD&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Or with systemd:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-ini&#34; data-lang=&#34;ini&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Service]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;OLLAMA_SCHED_SPREAD=true&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.&lt;/p&gt;
&lt;p&gt;Treat &lt;code&gt;OLLAMA_SCHED_SPREAD&lt;/code&gt; as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on &lt;code&gt;ollama ps&lt;/code&gt;, logs, and &lt;code&gt;nvidia-smi&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-to-check-whether-multiple-gpus-are-being-used&#34;&gt;How to Check Whether Multiple GPUs Are Being Used
&lt;/h2&gt;&lt;p&gt;Useful commands:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;watch -n 0.5 nvidia-smi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;View the Ollama service logs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;journalctl -u ollama -f
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If using Docker:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker logs -f ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Watch for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether Ollama discovers compatible GPUs.&lt;/li&gt;
&lt;li&gt;Whether the model shows &lt;code&gt;100% GPU&lt;/code&gt; or a CPU/GPU split.&lt;/li&gt;
&lt;li&gt;Whether each GPU has VRAM allocated.&lt;/li&gt;
&lt;li&gt;Whether VRAM grows on multiple GPUs during model loading.&lt;/li&gt;
&lt;li&gt;Whether generation token/s improves compared with CPU/RAM spillover.&lt;/li&gt;
&lt;li&gt;Whether OOM or model unloading happens frequently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.&lt;/p&gt;
&lt;h2 id=&#34;common-misunderstandings&#34;&gt;Common Misunderstandings
&lt;/h2&gt;&lt;h3 id=&#34;misunderstanding-1-two-12gb-gpus-equal-one-24gb-gpu&#34;&gt;Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU
&lt;/h3&gt;&lt;p&gt;Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the &amp;ldquo;does not fit&amp;rdquo; problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-2-different-gpu-models-cannot-be-mixed&#34;&gt;Misunderstanding 2: Different GPU Models Cannot Be Mixed
&lt;/h3&gt;&lt;p&gt;Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, smaller VRAM, and PCIe topology. The most predictable setup is still same model, same VRAM size, and well-supported same-generation drivers.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-3-multi-gpu-is-always-faster-than-single-gpu&#34;&gt;Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU
&lt;/h3&gt;&lt;p&gt;Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-4-nvlink--sli-is-required&#34;&gt;Misunderstanding 4: NVLink / SLI Is Required
&lt;/h3&gt;&lt;p&gt;No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-5-adding-a-gpu-does-not-require-restarting-services&#34;&gt;Misunderstanding 5: Adding a GPU Does Not Require Restarting Services
&lt;/h3&gt;&lt;p&gt;Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.&lt;/p&gt;
&lt;h2 id=&#34;gpu-selection-suggestions&#34;&gt;GPU Selection Suggestions
&lt;/h2&gt;&lt;p&gt;For Ollama local inference, the rough priority is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Larger single-GPU VRAM is usually easier to manage.&lt;/li&gt;
&lt;li&gt;Identical GPUs are easier to troubleshoot than mixed GPUs.&lt;/li&gt;
&lt;li&gt;More complete PCIe lanes make large-model loading smoother.&lt;/li&gt;
&lt;li&gt;Older cards should be checked for CUDA compute capability or ROCm support first.&lt;/li&gt;
&lt;li&gt;Multi-GPU power, cooling, and chassis airflow must be planned ahead.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For budget second-hand platforms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dual RTX 3090 remains a common high-VRAM option.&lt;/li&gt;
&lt;li&gt;Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all need trade-offs.&lt;/li&gt;
&lt;li&gt;Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.&lt;/li&gt;
&lt;li&gt;Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Ollama multi-GPU support is best understood as &amp;ldquo;VRAM expansion first, performance acceleration second.&amp;rdquo; If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.&lt;/p&gt;
&lt;p&gt;In practice, use &lt;code&gt;ollama ps&lt;/code&gt; to check where the model is loaded, then use &lt;code&gt;nvidia-smi&lt;/code&gt; or ROCm tools to observe VRAM allocation. For GPU selection, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; on NVIDIA, &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; on AMD ROCm, and &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt; for Vulkan. If running in Docker, first make sure the container can see the GPUs.&lt;/p&gt;
&lt;p&gt;Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Ollama FAQ: How does Ollama load models on multiple GPUs?: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama/blob/main/docs/faq.mdx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ollama/ollama/blob/main/docs/faq.mdx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama GPU docs: Hardware support / GPU Selection: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama/blob/main/docs/gpu.mdx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ollama/ollama/blob/main/docs/gpu.mdx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama Docker Hub: &lt;a class=&#34;link&#34; href=&#34;https://hub.docker.com/r/ollama/ollama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://hub.docker.com/r/ollama/ollama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;NVIDIA Container Toolkit: &lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/nvidia-container-toolkit&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/NVIDIA/nvidia-container-toolkit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Deploy Hermes Agent Locally on Windows with WSL &#43; Ollama and Connect Telegram</title>
        <link>https://knightli.com/en/2026/04/18/windows-wsl-ollama-hermes-agent-telegram/</link>
        <pubDate>Sat, 18 Apr 2026 00:48:22 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/18/windows-wsl-ollama-hermes-agent-telegram/</guid>
        <description>&lt;p&gt;If you want to run &lt;code&gt;Hermes Agent&lt;/code&gt; on &lt;code&gt;Windows&lt;/code&gt; with as little friction as possible, a practical path is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;keep Windows as the host system&lt;/li&gt;
&lt;li&gt;run &lt;code&gt;Ubuntu&lt;/code&gt; inside &lt;code&gt;WSL&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;use &lt;code&gt;Ollama&lt;/code&gt; to serve the local model&lt;/li&gt;
&lt;li&gt;let &lt;code&gt;Hermes Agent&lt;/code&gt; connect directly to the local Ollama endpoint&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach keeps the environment relatively clean, lets you run most commands in a Linux-style workflow, and avoids preparing a separate Linux machine.&lt;/p&gt;
&lt;h2 id=&#34;overall-flow&#34;&gt;Overall flow
&lt;/h2&gt;&lt;p&gt;You can split the setup into 4 steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Enable &lt;code&gt;WSL&lt;/code&gt; and install &lt;code&gt;Ubuntu&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Install Python, Node.js, Git, and other basics inside Ubuntu&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;Ollama&lt;/code&gt; and pull a local model&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;Hermes Agent&lt;/code&gt;, then connect &lt;code&gt;Telegram&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your goal is simply to get Hermes Agent running first, by the end of step 3 you are already close.&lt;/p&gt;
&lt;h2 id=&#34;1-install-wsl-and-ubuntu&#34;&gt;1. Install WSL and Ubuntu
&lt;/h2&gt;&lt;p&gt;Run this in PowerShell with administrator privileges:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;wsl&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-install&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After the installation finishes, restart the PC, then continue with Ubuntu:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;wsl&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;-install&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;-d&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Ubuntu&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After that, open Ubuntu in WSL. Most of the remaining commands are run there.&lt;/p&gt;
&lt;h2 id=&#34;2-update-ubuntu-and-install-the-base-environment&#34;&gt;2. Update Ubuntu and install the base environment
&lt;/h2&gt;&lt;p&gt;Update the system first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt update
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt upgrade -y
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then install Python, extraction tools, Node.js, and Git.&lt;/p&gt;
&lt;h3 id=&#34;install-python&#34;&gt;Install Python
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install python3-pip python3-venv -y
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;install-zstd&#34;&gt;Install zstd
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install -y zstd
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;install-nodejs&#34;&gt;Install Node.js
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -fsSL https://deb.nodesource.com/setup_22.x &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; sudo -E bash -
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install -y nodejs
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;install-git&#34;&gt;Install Git
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt update
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install -y git
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You can quickly verify the installation with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;node -v
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;npm -v
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git --version
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;3-install-ollama-and-pull-gemma-4&#34;&gt;3. Install Ollama and pull Gemma 4
&lt;/h2&gt;&lt;p&gt;Install Ollama:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -fsSL https://ollama.com/install.sh &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want a local model for Hermes Agent, starting with &lt;code&gt;Gemma 4&lt;/code&gt; is reasonable.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gemma4:e4b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If your machine is weaker, you can also try:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gemma4:e2b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Larger variants include:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gemma4:26b
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gemma4:31b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For most normal &lt;code&gt;Windows + WSL&lt;/code&gt; setups, &lt;code&gt;gemma4:e4b&lt;/code&gt; is usually the more practical starting point.&lt;/p&gt;
&lt;h2 id=&#34;4-install-and-configure-hermes-agent&#34;&gt;4. Install and configure Hermes Agent
&lt;/h2&gt;&lt;p&gt;Install it with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After installation, point it to the local Ollama endpoint:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://127.0.0.1:11434
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Use the local model name you actually installed, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;gemma4:e4b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the installer asks you to refresh the shell, run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;source&lt;/span&gt; ~/.bashrc
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;common-hermes-agent-commands&#34;&gt;Common Hermes Agent commands
&lt;/h2&gt;&lt;p&gt;These are the commands you will use most often:&lt;/p&gt;
&lt;h3 id=&#34;start&#34;&gt;Start
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;hermes
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;re-enter-setup&#34;&gt;Re-enter setup
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;hermes setup
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;configure-the-chat-gateway&#34;&gt;Configure the chat gateway
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;hermes setup gateway
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;update&#34;&gt;Update
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;hermes update
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;basic-telegram-connection-steps&#34;&gt;Basic Telegram connection steps
&lt;/h2&gt;&lt;p&gt;If you want Hermes Agent to send and receive messages through Telegram, the core step is still:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;hermes setup gateway
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then prepare the two Telegram-side items you need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;create a bot with &lt;code&gt;BotFather&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;get your &lt;code&gt;User ID&lt;/code&gt; with &lt;code&gt;@userinfobot&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once you have those basics, continue filling them into the Hermes Agent gateway setup.&lt;/p&gt;
&lt;h2 id=&#34;who-this-setup-fits&#34;&gt;Who this setup fits
&lt;/h2&gt;&lt;p&gt;This workflow is a good fit if:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Windows is your main desktop system&lt;/li&gt;
&lt;li&gt;you do not want to maintain a separate Linux host&lt;/li&gt;
&lt;li&gt;you want to get a local Agent running first, then expand to chat platforms&lt;/li&gt;
&lt;li&gt;you prefer local models instead of depending on cloud APIs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you mainly want to experience a local Agent rather than build a full production deployment immediately, this path is already practical enough.&lt;/p&gt;
&lt;h2 id=&#34;a-few-things-to-keep-in-mind&#34;&gt;A few things to keep in mind
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;WSL&lt;/code&gt; is still a compatibility layer, so in extreme cases it may not behave exactly like native Linux&lt;/li&gt;
&lt;li&gt;whether a large model runs smoothly still depends on your RAM, VRAM, and CPU / GPU&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma4:e4b&lt;/code&gt; is a realistic starting point, but actual experience still depends on the machine&lt;/li&gt;
&lt;li&gt;Hermes Agent platform integration is an extension step; getting the local model path working first, then adding Telegram, is usually more stable&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;If you want to deploy Hermes Agent locally on Windows with as little friction as possible, the smoother order is:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;WSL -&amp;gt; Ubuntu -&amp;gt; Ollama -&amp;gt; Gemma 4 -&amp;gt; Hermes Agent -&amp;gt; Telegram&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Get the local model running first, then add the gateway integration. That usually gives you a much higher success rate. For most users, this is easier to troubleshoot than piling on every component at the beginning, and it also leaves room for later expansion.&lt;/p&gt;
&lt;h2 id=&#34;original-reference&#34;&gt;Original reference
&lt;/h2&gt;&lt;p&gt;This post is rewritten and organized based on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Xchaoge Blog: &lt;a class=&#34;link&#34; href=&#34;https://www.xchaoge.com/21.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;太简单了！Hermes Agent 本地部署（无需API）接入 Telegram + 微信&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Access a Local Ollama API Over LAN on Windows</title>
        <link>https://knightli.com/en/2026/04/11/ollama-api-lan-access-windows/</link>
        <pubDate>Sat, 11 Apr 2026 16:43:52 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/11/ollama-api-lan-access-windows/</guid>
        <description>&lt;p&gt;If you want other devices in the same LAN to access your local Ollama API, follow these steps.&lt;/p&gt;
&lt;h2 id=&#34;set-the-listening-host&#34;&gt;Set the listening host
&lt;/h2&gt;&lt;p&gt;First, set Ollama to listen on all network interfaces:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;OLLAMA_HOST=0.0.0.0:11434&lt;/code&gt;&lt;/p&gt;
&lt;h2 id=&#34;open-the-firewall&#34;&gt;Open the firewall
&lt;/h2&gt;&lt;p&gt;In Windows Firewall advanced settings, create an inbound rule and allow the target port (for example &lt;code&gt;8080&lt;/code&gt;):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Press Win + S, search and open &amp;ldquo;Windows Defender Firewall&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Click &amp;ldquo;Advanced settings&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Select &amp;ldquo;Inbound Rules&amp;rdquo; -&amp;gt; &amp;ldquo;New Rule&amp;hellip;&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Choose &amp;ldquo;Port&amp;rdquo;, then click &amp;ldquo;Next&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Select protocol (usually TCP), enter the target port in &amp;ldquo;Specific local ports&amp;rdquo; (for example &lt;code&gt;8080&lt;/code&gt;), then click &amp;ldquo;Next&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Choose &amp;ldquo;Allow the connection&amp;rdquo;, then click &amp;ldquo;Next&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;In &amp;ldquo;Profile&amp;rdquo;, select Domain, Private, and Public, then click &amp;ldquo;Next&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Name the rule (for example &lt;code&gt;OpenPort8080&lt;/code&gt;) and click &amp;ldquo;Finish&amp;rdquo;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;run-ollama&#34;&gt;Run Ollama
&lt;/h2&gt;&lt;p&gt;Ollama run 模型&lt;/p&gt;
&lt;h2 id=&#34;access-the-model-through-api&#34;&gt;Access the model through API
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl http://192.168.x.xxx:11434/api/generate -d &lt;span class=&#34;s1&#34;&gt;&amp;#39;{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  &amp;#34;model&amp;#34;: &amp;#34;gemma4&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  &amp;#34;prompt&amp;#34;: &amp;#34;这个是什么模型?&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
        </item>
        <item>
        <title>Gemma 4 Local Runtime Guide: From One-Command Start to Dev Integration</title>
        <link>https://knightli.com/en/2026/04/10/gemma4-local-runtime-options/</link>
        <pubDate>Fri, 10 Apr 2026 22:54:17 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/10/gemma4-local-runtime-options/</guid>
        <description>&lt;p&gt;If you want to run Gemma 4 locally, you can choose from four practical paths depending on your goal and hardware.&lt;/p&gt;
&lt;h2 id=&#34;1-fastest-start-ollama-recommended&#34;&gt;1) Fastest start: Ollama (recommended)
&lt;/h2&gt;&lt;p&gt;This is the lowest-friction option for quick testing, daily chat, and local API usage.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gemma4
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works on Windows, macOS, and Linux&lt;/li&gt;
&lt;li&gt;Handles hardware acceleration automatically&lt;/li&gt;
&lt;li&gt;Offers OpenAI-style local API compatibility&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2-gui-workflow-lm-studio--unsloth-studio&#34;&gt;2) GUI workflow: LM Studio / Unsloth Studio
&lt;/h2&gt;&lt;p&gt;If you prefer a desktop UI instead of terminal commands:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LM Studio: browse and run Gemma 4 quantized variants from Hugging Face (for example 4-bit, 8-bit), with resource visibility.&lt;/li&gt;
&lt;li&gt;Unsloth Studio: supports both inference and low-VRAM fine-tuning, often friendlier on 6GB-8GB GPUs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;3-low-spec-and-maximum-control-llamacpp&#34;&gt;3) Low-spec and maximum control: llama.cpp
&lt;/h2&gt;&lt;p&gt;Good for older hardware, CPU-focused setups, or users who want deeper runtime control.&lt;/p&gt;
&lt;p&gt;With &lt;code&gt;.gguf&lt;/code&gt; model files and quantization, Gemma 4 can be made practical on much smaller hardware budgets.&lt;/p&gt;
&lt;h2 id=&#34;4-developer-integration-transformers--vllm&#34;&gt;4) Developer integration: Transformers / vLLM
&lt;/h2&gt;&lt;p&gt;If you need Gemma 4 inside your own application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transformers: straightforward Python integration&lt;/li&gt;
&lt;li&gt;vLLM: high-throughput inference for stronger GPU environments&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;quick-selection&#34;&gt;Quick selection
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Need&lt;/th&gt;
          &lt;th&gt;Recommended tools&lt;/th&gt;
          &lt;th&gt;Hardware bar&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;I just want it running now&lt;/td&gt;
          &lt;td&gt;Ollama&lt;/td&gt;
          &lt;td&gt;Low&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;I want a ChatGPT-like UI&lt;/td&gt;
          &lt;td&gt;LM Studio&lt;/td&gt;
          &lt;td&gt;Medium&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;My VRAM is limited (6GB-8GB)&lt;/td&gt;
          &lt;td&gt;Unsloth / llama.cpp&lt;/td&gt;
          &lt;td&gt;Low&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;I am building local AI apps&lt;/td&gt;
          &lt;td&gt;Ollama / Transformers / vLLM&lt;/td&gt;
          &lt;td&gt;Medium to high&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;I need fine-tuning&lt;/td&gt;
          &lt;td&gt;Unsloth Studio&lt;/td&gt;
          &lt;td&gt;Medium to high&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;model-size-suggestion&#34;&gt;Model size suggestion
&lt;/h2&gt;&lt;p&gt;Gemma 4 comes in multiple sizes (for example E2B, E4B, 31B).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start with quantized E2B/E4B on mainstream laptops&lt;/li&gt;
&lt;li&gt;Move to larger variants only after your baseline pipeline is stable&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>What are Ollama cloud models and how do you use them</title>
        <link>https://knightli.com/en/2026/04/09/ollama-cloud-models-guide/</link>
        <pubDate>Thu, 09 Apr 2026 18:42:32 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/09/ollama-cloud-models-guide/</guid>
        <description>&lt;p&gt;If you already use &lt;code&gt;Ollama&lt;/code&gt; to run local models, cloud models are easy to understand.&lt;/p&gt;
&lt;p&gt;There is only one core difference:&lt;br&gt;
local models run on your own machine, while cloud models run on Ollama&amp;rsquo;s cloud infrastructure and return the result to you.&lt;/p&gt;
&lt;h2 id=&#34;what-are-ollama-cloud-models&#34;&gt;What are Ollama cloud models
&lt;/h2&gt;&lt;p&gt;Ollama cloud models keep the Ollama workflow, but move the actual computation from your local machine to the cloud.&lt;/p&gt;
&lt;p&gt;The main benefits are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less pressure on local hardware&lt;/li&gt;
&lt;li&gt;Easier access to larger models that your machine cannot run well&lt;/li&gt;
&lt;li&gt;You can keep using the familiar Ollama workflow&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-they-differ-from-local-models&#34;&gt;How they differ from local models
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Item&lt;/th&gt;
          &lt;th&gt;Local models&lt;/th&gt;
          &lt;th&gt;Cloud models&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Runtime location&lt;/td&gt;
          &lt;td&gt;Your machine&lt;/td&gt;
          &lt;td&gt;Cloud&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Hardware requirements&lt;/td&gt;
          &lt;td&gt;High&lt;/td&gt;
          &lt;td&gt;Low&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Latency&lt;/td&gt;
          &lt;td&gt;Usually lower&lt;/td&gt;
          &lt;td&gt;Affected by network&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Privacy&lt;/td&gt;
          &lt;td&gt;Stronger&lt;/td&gt;
          &lt;td&gt;Requests are sent to the cloud&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you care more about privacy, low latency, and offline use, local models are a better fit.&lt;br&gt;
If your hardware is limited but you still want to use larger models, cloud models are more convenient.&lt;/p&gt;
&lt;h2 id=&#34;how-to-identify-a-cloud-model&#34;&gt;How to identify a cloud model
&lt;/h2&gt;&lt;p&gt;At the moment, Ollama cloud models are typically labeled with a &lt;code&gt;-cloud&lt;/code&gt; suffix, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;gpt-oss:120b-cloud
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The available model list may change over time, so the official Ollama pages should be treated as the source of truth.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use-them&#34;&gt;How to use them
&lt;/h2&gt;&lt;p&gt;First, sign in:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama signin
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After that, run a cloud model directly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gpt-oss:120b-cloud
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you are calling it from code, you can also configure an API key:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;OLLAMA_API_KEY&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;your_api_key
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Python example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;os&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;ollama&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Client&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;client&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Client&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;host&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;https://ollama.com&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;headers&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Authorization&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Bearer &amp;#34;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;os&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;environ&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;OLLAMA_API_KEY&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Why is the sky blue?&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;part&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;client&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;gpt-oss:120b-cloud&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;messages&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;stream&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;part&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;message&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;end&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;flush&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Ollama cloud models can be summarized in one sentence:&lt;/p&gt;
&lt;p&gt;the commands are almost the same, but the model is no longer running on your local machine.&lt;/p&gt;
&lt;p&gt;If your computer cannot handle large models well, but you still want to keep the Ollama workflow, cloud models are a very direct option.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Download a GGUF Model from Hugging Face and Import It into Ollama</title>
        <link>https://knightli.com/en/2026/04/09/import-huggingface-gguf-into-ollama/</link>
        <pubDate>Thu, 09 Apr 2026 11:00:07 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/09/import-huggingface-gguf-into-ollama/</guid>
        <description>&lt;p&gt;If a model is not available in the official Ollama library, or if you want to use a specific &lt;code&gt;GGUF&lt;/code&gt; file from Hugging Face, you can download it manually and then import it into Ollama.&lt;/p&gt;
&lt;h2 id=&#34;step-1-download-the-gguf-file-from-hugging-face&#34;&gt;Step 1: Download the GGUF file from Hugging Face
&lt;/h2&gt;&lt;p&gt;First, find the target model&amp;rsquo;s &lt;code&gt;GGUF&lt;/code&gt; file on Hugging Face. You will usually see multiple quantized versions, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Which version you choose depends on your VRAM, RAM, and your tradeoff between speed and quality. After downloading, place the &lt;code&gt;.gguf&lt;/code&gt; file in a fixed directory so you can reference it from the &lt;code&gt;Modelfile&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;step-2-write-the-modelfile&#34;&gt;Step 2: Write the Modelfile
&lt;/h2&gt;&lt;p&gt;Create a &lt;code&gt;Modelfile&lt;/code&gt; in the same directory as the model file. The most basic version looks like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;FROM ./model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the filename is different, replace it with the actual filename, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;FROM ./gemma-3-12b-it-q4_k_m.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If your goal is just to get it running, this single &lt;code&gt;FROM&lt;/code&gt; line is usually enough.&lt;/p&gt;
&lt;h2 id=&#34;step-3-import-it-into-ollama&#34;&gt;Step 3: Import it into Ollama
&lt;/h2&gt;&lt;p&gt;Then run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama create myModelName -f Modelfile
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;myModelName&lt;/code&gt; is the local model name you want to use inside Ollama&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-f Modelfile&lt;/code&gt; tells Ollama to create the model from that file&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once the creation succeeds, the GGUF file becomes a local model that you can call directly.&lt;/p&gt;
&lt;h2 id=&#34;step-4-run-the-model&#34;&gt;Step 4: Run the model
&lt;/h2&gt;&lt;p&gt;After creation, run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run myModelName
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;From that point on, it works much like a model pulled with &lt;code&gt;ollama pull&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-to-inspect-an-existing-models-modelfile&#34;&gt;How to inspect an existing model&amp;rsquo;s Modelfile
&lt;/h2&gt;&lt;p&gt;If you are not sure how to write a &lt;code&gt;Modelfile&lt;/code&gt;, you can inspect the configuration of an existing model directly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama show --modelfile llama3.2
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This command prints the &lt;code&gt;Modelfile&lt;/code&gt; for &lt;code&gt;llama3.2&lt;/code&gt;, which is useful as a reference for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;FROM&lt;/code&gt; should be written&lt;/li&gt;
&lt;li&gt;How the template and system prompt are structured&lt;/li&gt;
&lt;li&gt;How parameters are declared&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;when-this-approach-makes-sense&#34;&gt;When this approach makes sense
&lt;/h2&gt;&lt;p&gt;This manual Hugging Face import flow is useful when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model you want is not available in Ollama&amp;rsquo;s official library&lt;/li&gt;
&lt;li&gt;You want a specific quantized variant&lt;/li&gt;
&lt;li&gt;You have already downloaded the &lt;code&gt;GGUF&lt;/code&gt; file manually&lt;/li&gt;
&lt;li&gt;You want finer control over how the model is packaged&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If Ollama already provides an official version, using &lt;code&gt;pull&lt;/code&gt; is usually simpler. But when you need a specific quantization or a custom wrapper, &lt;code&gt;GGUF + Modelfile&lt;/code&gt; gives you more flexibility.&lt;/p&gt;
&lt;h2 id=&#34;common-notes&#34;&gt;Common notes
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;The path after &lt;code&gt;FROM&lt;/code&gt; must match the actual location of the &lt;code&gt;.gguf&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;If the filename contains spaces or special characters, it is better to rename it first.&lt;/li&gt;
&lt;li&gt;Different &lt;code&gt;GGUF&lt;/code&gt; quantization levels can greatly affect memory use and speed, so successful import does not guarantee smooth runtime performance.&lt;/li&gt;
&lt;li&gt;If the model is a chat model, you may still need to adjust the prompt template later for better results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Downloading a &lt;code&gt;GGUF&lt;/code&gt; file from Hugging Face and importing it into Ollama is not complicated. Prepare the model file, write a minimal &lt;code&gt;Modelfile&lt;/code&gt;, then run &lt;code&gt;ollama create&lt;/code&gt;, and you can bring a third-party &lt;code&gt;GGUF&lt;/code&gt; model into your Ollama workflow.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Troubleshoot Slow `ollama pull` Model Downloads</title>
        <link>https://knightli.com/en/2026/04/09/ollama-download-slow-troubleshooting/</link>
        <pubDate>Thu, 09 Apr 2026 10:42:39 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/09/ollama-download-slow-troubleshooting/</guid>
        <description>&lt;p&gt;&lt;code&gt;ollama pull model_name:tag&lt;/code&gt; can be very slow in some regions, and the download process is not always stable.&lt;/p&gt;
&lt;p&gt;If your issue looks like repeated interruptions halfway through a large model download, with errors such as &lt;code&gt;TLS handshake timeout&lt;/code&gt; or &lt;code&gt;unexpected EOF&lt;/code&gt;, the bottleneck may not be &lt;code&gt;registry.ollama.ai&lt;/code&gt; itself, but the actual download path after the redirect.&lt;/p&gt;
&lt;p&gt;This article walks through a simple troubleshooting approach: first get the real model file URLs, then confirm where the traffic actually ends up, and finally optimize only the domains that matter.&lt;/p&gt;
&lt;h2 id=&#34;get-the-model-file-download-urls&#34;&gt;Get the model file download URLs
&lt;/h2&gt;&lt;p&gt;You can use the following project to extract the manifest and blob download URLs for an Ollama model directly:&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Gholamrezadar/ollama-direct-downloader&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/Gholamrezadar/ollama-direct-downloader&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Using &lt;code&gt;gemma4:latest&lt;/code&gt; as an example, you can extract links like the following.&lt;/p&gt;
&lt;h3 id=&#34;manifest-url&#34;&gt;Manifest URL
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/manifests/latest
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;blob-urls&#34;&gt;Blob URLs
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you only want a quick verification, you can also download the manifest and blobs directly with &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -L &lt;span class=&#34;s2&#34;&gt;&amp;#34;https://registry.ollama.ai/v2/library/gemma4/manifests/latest&amp;#34;&lt;/span&gt; -o &lt;span class=&#34;s2&#34;&gt;&amp;#34;latest&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -L &lt;span class=&#34;s2&#34;&gt;&amp;#34;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11&amp;#34;&lt;/span&gt; -o &lt;span class=&#34;s2&#34;&gt;&amp;#34;sha256-f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -L &lt;span class=&#34;s2&#34;&gt;&amp;#34;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a&amp;#34;&lt;/span&gt; -o &lt;span class=&#34;s2&#34;&gt;&amp;#34;sha256-4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl -L &lt;span class=&#34;s2&#34;&gt;&amp;#34;https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2&amp;#34;&lt;/span&gt; -o &lt;span class=&#34;s2&#34;&gt;&amp;#34;sha256-7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;the-real-download-url-after-the-redirect&#34;&gt;The real download URL after the redirect
&lt;/h2&gt;&lt;p&gt;If you try downloading one of the blobs with &lt;code&gt;wget&lt;/code&gt;, you will notice that the request does not stay on &lt;code&gt;registry.ollama.ai&lt;/code&gt;. It gets redirected to a &lt;code&gt;Cloudflare R2&lt;/code&gt; object storage URL:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;There are a few key details in the log:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;registry.ollama.ai&lt;/code&gt; returns &lt;code&gt;307 Temporary Redirect&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The final download URL lands on &lt;code&gt;*.r2.cloudflarestorage.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The large file transfer is actually being served by the object storage domain behind the redirect&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This matters because if your proxy or routing rules only cover &lt;code&gt;registry.ollama.ai&lt;/code&gt; but not &lt;code&gt;*.r2.cloudflarestorage.com&lt;/code&gt;, downloads can still be slow or repeatedly interrupted.&lt;/p&gt;
&lt;p&gt;Here is one example of an actual redirect log:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--2026-04-09 09:22:04--  https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Resolving registry.ollama.ai (registry.ollama.ai)... 104.21.75.227, 172.67.182.229, 2606:4700:3034::ac43:b6e5, ...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Connecting to registry.ollama.ai (registry.ollama.ai)|104.21.75.227|:443... connected.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;HTTP request sent, awaiting response... 307 Temporary Redirect
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Location: https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?... [following]
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--2026-04-09 09:22:05--  https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Resolving dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com (dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com)... 172.64.66.1, 2606:4700:2ff9::1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Connecting to dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com|172.64.66.1|:443... connected.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;HTTP request sent, awaiting response... 200 OK
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Length: 9608338848 (8.9G) [application/octet-stream]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;adjust-your-network-settings&#34;&gt;Adjust your network settings
&lt;/h2&gt;&lt;p&gt;Once you confirm the real download path, the troubleshooting direction becomes much clearer.&lt;/p&gt;
&lt;p&gt;If you are using a proxy, split routing, or custom DNS, check these first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether &lt;code&gt;registry.ollama.ai&lt;/code&gt; and &lt;code&gt;*.r2.cloudflarestorage.com&lt;/code&gt; are using the same stable route&lt;/li&gt;
&lt;li&gt;Whether your proxy rules cover only the former but miss the latter&lt;/li&gt;
&lt;li&gt;Whether your current outbound path is suitable for sustained multi-GB downloads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key issue here is not simply whether the official site opens, but whether the redirected object storage path is stable enough for long-running large-file transfers. In many cases, the real bottleneck is the &lt;code&gt;Cloudflare R2&lt;/code&gt; layer rather than the registry domain in front of it.&lt;/p&gt;
&lt;h2 id=&#34;before-and-after-comparison&#34;&gt;Before-and-after comparison
&lt;/h2&gt;&lt;p&gt;Here is one real-world example while downloading &lt;code&gt;gemma4:31b-it-q8_0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Before adjusting the network path, the download was slow and failed midway:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;PS C:\Users\knightli&amp;gt; ollama run gemma4:31b-it-q8_0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pulling manifest
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pulling a0feadb736f5:  38% ▕██████████████████████                                    ▏  12 GB/ 33 GB  1.2 MB/s   4h40m
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Error: max retries exceeded: unexpected EOF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After the adjustment, the same model download became noticeably faster and more stable:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;PS C:\Users\knightli&amp;gt; ollama run gemma4:31b-it-q8_0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pulling manifest
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pulling a0feadb736f5:  46% ▕████████████████████████████████████████████████████████████████▏ 15 GB/ 33 GB  8.5 MB/s  35m23s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This does not mean every network environment will see the same improvement, but it does support one useful conclusion: the bottleneck may be the actual large-file download path rather than the Ollama client itself.&lt;/p&gt;
&lt;h2 id=&#34;a-more-practical-troubleshooting-order&#34;&gt;A more practical troubleshooting order
&lt;/h2&gt;&lt;p&gt;If you run into the same issue, this order usually works well:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;ollama pull&lt;/code&gt; or &lt;code&gt;ollama run&lt;/code&gt; once and confirm the issue is reproducible.&lt;/li&gt;
&lt;li&gt;Test a blob URL with &lt;code&gt;wget&lt;/code&gt; or &lt;code&gt;curl -L&lt;/code&gt; and confirm whether it redirects to &lt;code&gt;*.r2.cloudflarestorage.com&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Adjust your proxy or routing only for the real download domain, then test speed and stability again.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The benefit of this order is that each step validates one clear hypothesis, so you do not have to troubleshoot blindly.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;When &lt;code&gt;ollama pull&lt;/code&gt; is slow, the problem is often not that &lt;code&gt;registry.ollama.ai&lt;/code&gt; is unreachable, but that the &lt;code&gt;Cloudflare R2&lt;/code&gt; path actually serving the large files is unstable.&lt;/p&gt;
&lt;p&gt;So instead of retrying over and over, a better approach is to identify the real download path first and optimize the network route where the traffic actually lands.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Connect OpenClaw to Local Gemma 4: Complete Setup Guide</title>
        <link>https://knightli.com/en/2026/04/08/openclaw-connect-gemma4-local/</link>
        <pubDate>Wed, 08 Apr 2026 18:18:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/08/openclaw-connect-gemma4-local/</guid>
        <description>&lt;p&gt;This guide shows how to connect &lt;code&gt;OpenClaw&lt;/code&gt; to a local &lt;code&gt;Gemma 4&lt;/code&gt; model through &lt;code&gt;Ollama&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you have not deployed Gemma 4 locally yet, start here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/&#34; &gt;How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;step-1-start-the-ollama-api-service&#34;&gt;Step 1: Start the Ollama API Service
&lt;/h2&gt;&lt;p&gt;Start Ollama first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then verify the API quickly with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl http://localhost:11434/api/generate -d &lt;span class=&#34;s1&#34;&gt;&amp;#39;{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  &amp;#34;model&amp;#34;: &amp;#34;gemma4:12b&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  &amp;#34;prompt&amp;#34;: &amp;#34;Hello&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you get a model response, your local API is ready.&lt;/p&gt;
&lt;h2 id=&#34;step-2-configure-openclaw-to-use-ollama&#34;&gt;Step 2: Configure OpenClaw to Use Ollama
&lt;/h2&gt;&lt;p&gt;The OpenClaw config file is usually located at:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;~/.openclaw/config.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Edit &lt;code&gt;config.yaml&lt;/code&gt; and add a local model entry under &lt;code&gt;models&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;models&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;c&#34;&gt;# Your existing model config...&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;gemma4-local&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;provider&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;ollama&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;base_url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;http://localhost:11434&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;gemma4:12b&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;timeout&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;120s&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;step-3-set-default-model-optional&#34;&gt;Step 3: Set Default Model (Optional)
&lt;/h2&gt;&lt;p&gt;If you want Gemma 4 as the default model:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;default_model&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;gemma4-local&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;step-4-restart-and-verify-openclaw&#34;&gt;Step 4: Restart and Verify OpenClaw
&lt;/h2&gt;&lt;p&gt;Restart OpenClaw:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;openclaw restart
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;List available models:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;openclaw models list
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Run a quick chat test:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;openclaw chat --model gemma4-local &lt;span class=&#34;s2&#34;&gt;&amp;#34;Hello&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the chat returns normally, OpenClaw is successfully connected to local Gemma 4.&lt;/p&gt;
&lt;h2 id=&#34;common-troubleshooting&#34;&gt;Common Troubleshooting
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;connection refused&lt;/code&gt;: make sure &lt;code&gt;ollama serve&lt;/code&gt; is running.&lt;/li&gt;
&lt;li&gt;Model not found: check model name with &lt;code&gt;ollama list&lt;/code&gt; (for example &lt;code&gt;gemma4:12b&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Timeout: increase &lt;code&gt;timeout&lt;/code&gt; and test a smaller model first.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/&#34; &gt;How to Install and Run Gemma 4 on Android: Complete Getting-Started Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/&#34; &gt;How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide</title>
        <link>https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/</link>
        <pubDate>Wed, 08 Apr 2026 18:06:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/</guid>
        <description>&lt;p&gt;If you want to run Gemma 4 locally on a laptop, &lt;code&gt;Ollama&lt;/code&gt; is one of the fastest and simplest options. Even without complex setup, you can usually get it running in about five minutes.&lt;/p&gt;
&lt;h2 id=&#34;step-1-install-ollama&#34;&gt;Step 1: Install Ollama
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Open &lt;code&gt;https://ollama.com&lt;/code&gt; and download the installer for your OS.&lt;/li&gt;
&lt;li&gt;Complete installation based on your system:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;macOS: drag it to &lt;code&gt;Applications&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Windows: run the &lt;code&gt;.exe&lt;/code&gt; installer.&lt;/li&gt;
&lt;li&gt;Linux: use the install script from the official site.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After installation, Ollama runs as a background service. Beyond initial setup, daily usage is mostly simple commands.&lt;/p&gt;
&lt;h2 id=&#34;step-2-download-a-gemma-4-model&#34;&gt;Step 2: Download a Gemma 4 Model
&lt;/h2&gt;&lt;p&gt;Open a terminal and run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama pull gemma4:4b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If your machine is stronger, you can switch to &lt;code&gt;12b&lt;/code&gt; or &lt;code&gt;27b&lt;/code&gt;. Once downloaded, the model is stored locally.&lt;/p&gt;
&lt;p&gt;Check downloaded models with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama list
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;step-3-run-the-model&#34;&gt;Step 3: Run the Model
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gemma4:4b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This opens an interactive chat session in your terminal. Type your prompt and press Enter. To exit, type:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;/bye
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you prefer a browser chat UI, you can pair it with &lt;code&gt;Open WebUI&lt;/code&gt;. It wraps Ollama with a local web interface and is usually quick to set up with Docker.&lt;/p&gt;
&lt;h2 id=&#34;laptop-performance-tips&#34;&gt;Laptop Performance Tips
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Apple Silicon (M2/M3/M4): Metal acceleration is enabled by default, and &lt;code&gt;12B&lt;/code&gt; can run well.&lt;/li&gt;
&lt;li&gt;NVIDIA GPU: CUDA is used automatically when a compatible GPU is detected. Keep drivers updated.&lt;/li&gt;
&lt;li&gt;CPU-only inference: works, but larger models will be slower. For most CPU-only setups, &lt;code&gt;4B&lt;/code&gt; is the practical default.&lt;/li&gt;
&lt;li&gt;Free memory before loading large models: as a rough rule, each billion parameters needs about &lt;code&gt;0.5GB to 1GB&lt;/code&gt; RAM.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-to-choose-a-model&#34;&gt;How to Choose a Model
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Gemma 4 1B&lt;/code&gt;: good for lightweight Q&amp;amp;A, simple summarization, and quick lookups; limited on complex reasoning.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma 4 4B&lt;/code&gt;: best for most daily tasks (writing help, coding help, document summarization) with strong speed/quality balance.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma 4 12B&lt;/code&gt;: better for longer context and more complex tasks, especially coding and reasoning.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma 4 27B&lt;/code&gt;: better for high-demand workloads and closer to frontier-cloud quality, but needs significantly stronger hardware.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/&#34; &gt;How to Install and Run Gemma 4 on Android: Complete Getting-Started Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Check Whether an Ollama Model Is Loaded on GPU</title>
        <link>https://knightli.com/en/2026/04/06/check-ollama-model-loaded-on-gpu/</link>
        <pubDate>Mon, 06 Apr 2026 10:15:18 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/06/check-ollama-model-loaded-on-gpu/</guid>
        <description>&lt;p&gt;If you want to confirm whether an Ollama model is actually running on GPU, the most direct way is checking processor allocation for currently loaded models.&lt;/p&gt;
&lt;h2 id=&#34;command&#34;&gt;Command
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;example-output&#34;&gt;Example Output
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;NAME        ID            SIZE    PROCESSOR   UNTIL
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama3:70b  bcfb190ca3a7  42 GB   100% GPU    4 minutes from now
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;how-to-read-the-processor-column&#34;&gt;How to Read the &lt;code&gt;PROCESSOR&lt;/code&gt; Column
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;100% GPU&lt;/code&gt;: The model is fully loaded into GPU VRAM.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;100% CPU&lt;/code&gt;: The model is fully loaded in system memory (no GPU inference).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;48%/52% CPU/GPU&lt;/code&gt;: The model is split between system memory and GPU VRAM.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;practical-tips&#34;&gt;Practical Tips
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;If you expect GPU usage but see &lt;code&gt;100% CPU&lt;/code&gt;, first check GPU drivers, CUDA/ROCm environment, and Ollama runtime settings.&lt;/li&gt;
&lt;li&gt;With larger models and limited VRAM, CPU/GPU mixed loading is common.&lt;/li&gt;
&lt;li&gt;For performance troubleshooting, run &lt;code&gt;ollama ps&lt;/code&gt; before checking speed metrics to locate bottlenecks faster.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;ollama ps&lt;/code&gt; is the first step to verify real GPU usage. Focus on the &lt;code&gt;PROCESSOR&lt;/code&gt; column to quickly identify where the model is loaded and decide your next optimization action.&lt;/p&gt;
&lt;!-- ollama-related-links:start --&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Gemma 4 Model Comparison and Selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/&#34; &gt;LLM Quantization Guide (FP16/Q8/Q5/Q4/Q2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/uninstall-ollama-on-linux/&#34; &gt;Completely Uninstall Ollama on Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/ollama-model-storage-path-and-migration/&#34; &gt;Ollama Model Storage Path and Migration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- ollama-related-links:end --&gt;
</description>
        </item>
        <item>
        <title>Ollama Default Model Storage Path and Migration Guide (Avoid Filling Up C Drive)</title>
        <link>https://knightli.com/en/2026/04/06/ollama-model-storage-path-and-migration/</link>
        <pubDate>Mon, 06 Apr 2026 09:38:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/06/ollama-model-storage-path-and-migration/</guid>
        <description>&lt;p&gt;When running local LLMs, the system drive is often the first thing to run out of space. Ollama stores models in user or system directories by default, so your C drive can fill up quickly without path planning.&lt;/p&gt;
&lt;h2 id=&#34;common-default-ollama-model-directories&#34;&gt;Common Default Ollama Model Directories
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Windows: &lt;code&gt;C:\Users\&amp;lt;username&amp;gt;\.ollama\models&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;macOS: &lt;code&gt;~/.ollama/models&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Linux: &lt;code&gt;/usr/share/ollama/.ollama/models&lt;/code&gt; (may vary by installation method)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;windows-move-the-model-directory-to-a-non-system-drive&#34;&gt;Windows: Move the Model Directory to a Non-System Drive
&lt;/h2&gt;&lt;p&gt;A practical choice is moving model storage to a path like &lt;code&gt;D:\OllamaModels&lt;/code&gt;. The key is setting the &lt;code&gt;OLLAMA_MODELS&lt;/code&gt; system environment variable.&lt;/p&gt;
&lt;h2 id=&#34;1-create-the-target-directory&#34;&gt;1. Create the Target Directory
&lt;/h2&gt;&lt;p&gt;For example, create: &lt;code&gt;D:\OllamaModels&lt;/code&gt;&lt;/p&gt;
&lt;h2 id=&#34;2-configure-the-system-environment-variable&#34;&gt;2. Configure the System Environment Variable
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Variable name: &lt;code&gt;OLLAMA_MODELS&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Variable value: &lt;code&gt;D:\OllamaModels&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can set it in &amp;ldquo;System Properties -&amp;gt; Advanced -&amp;gt; Environment Variables&amp;rdquo;, or with an admin PowerShell command:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-powershell&#34; data-lang=&#34;powershell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;no&#34;&gt;System.Environment&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]::&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SetEnvironmentVariable&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;OLLAMA_MODELS&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;D:\OllamaModels&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Machine&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;3-restart-ollama-or-reboot-the-system&#34;&gt;3. Restart Ollama (or Reboot the System)
&lt;/h2&gt;&lt;p&gt;After setting the variable, restart the Ollama service/app. If you&amp;rsquo;re unsure whether it has taken effect, rebooting the PC is the most reliable option.&lt;/p&gt;
&lt;h2 id=&#34;4-verify-the-new-path-is-active&#34;&gt;4. Verify the New Path Is Active
&lt;/h2&gt;&lt;p&gt;Pull any model and check whether new files appear under &lt;code&gt;D:\OllamaModels&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;5-clean-up-the-old-directory-after-confirmation&#34;&gt;5. Clean Up the Old Directory (After Confirmation)
&lt;/h2&gt;&lt;p&gt;Once models work correctly in the new location, remove old files to reclaim C drive space.&lt;/p&gt;
&lt;h2 id=&#34;faq&#34;&gt;FAQ
&lt;/h2&gt;&lt;h3 id=&#34;still-writing-to-c-drive-after-configuration&#34;&gt;Still Writing to C Drive After Configuration
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Confirm the variable is a system variable, not a temporary session variable.&lt;/li&gt;
&lt;li&gt;Confirm the Ollama process was restarted.&lt;/li&gt;
&lt;li&gt;Verify the variable name is exactly &lt;code&gt;OLLAMA_MODELS&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;do-i-need-to-migrate-existing-model-files&#34;&gt;Do I Need to Migrate Existing Model Files
&lt;/h3&gt;&lt;p&gt;If you want to avoid re-downloading, stop Ollama, copy existing model files to the new directory, then restart Ollama and verify.&lt;/p&gt;
&lt;!-- ollama-related-links:start --&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Gemma 4 Model Comparison and Selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/&#34; &gt;LLM Quantization Guide (FP16/Q8/Q5/Q4/Q2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/uninstall-ollama-on-linux/&#34; &gt;Completely Uninstall Ollama on Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/check-ollama-model-loaded-on-gpu/&#34; &gt;How to Check Whether Ollama Uses GPU&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- ollama-related-links:end --&gt;
</description>
        </item>
        <item>
        <title>Completely Uninstall Ollama on Linux (Including Leftover Cleanup)</title>
        <link>https://knightli.com/en/2026/04/06/uninstall-ollama-on-linux/</link>
        <pubDate>Mon, 06 Apr 2026 09:16:29 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/06/uninstall-ollama-on-linux/</guid>
        <description>&lt;p&gt;If you need to remove Ollama completely from Linux, follow the steps below in order. This guide cleans up the service, executable, model directory, and the &lt;code&gt;ollama&lt;/code&gt; user/group.&lt;/p&gt;
&lt;h2 id=&#34;before-you-uninstall&#34;&gt;Before You Uninstall
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;The commands below will delete local Ollama model files (usually in &lt;code&gt;/usr/share/ollama&lt;/code&gt;). Back up first if needed.&lt;/li&gt;
&lt;li&gt;These commands use &lt;code&gt;sudo&lt;/code&gt; by default. Make sure your account has administrator privileges.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;1-stop-and-remove-the-systemd-service&#34;&gt;1. Stop and Remove the systemd Service
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl stop ollama
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl disable ollama
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo rm -f /etc/systemd/system/ollama.service
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl daemon-reload
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;2-remove-the-ollama-binary&#34;&gt;2. Remove the Ollama Binary
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OLLAMA_BIN&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;$(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;command&lt;/span&gt; -v ollama&lt;span class=&#34;k&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;[&lt;/span&gt; -n &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;$OLLAMA_BIN&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;then&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  sudo rm -f &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;$OLLAMA_BIN&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;fi&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;3-remove-ollama-library-directories-if-present&#34;&gt;3. Remove Ollama Library Directories (If Present)
&lt;/h2&gt;&lt;p&gt;If your installation wrote Ollama files into a &lt;code&gt;lib&lt;/code&gt; directory, clean them up with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; d in /usr/local/lib/ollama /usr/lib/ollama /lib/ollama&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;do&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;o&#34;&gt;[&lt;/span&gt; -d &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;$d&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;amp;&amp;amp;&lt;/span&gt; sudo rm -rf &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;$d&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;done&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;4-remove-model-and-data-directory&#34;&gt;4. Remove Model and Data Directory
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo rm -rf /usr/share/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;5-remove-system-user-and-group-if-present&#34;&gt;5. Remove System User and Group (If Present)
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;id -u ollama &amp;gt;/dev/null 2&amp;gt;&lt;span class=&#34;p&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;amp;&amp;amp;&lt;/span&gt; sudo userdel ollama
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;getent group ollama &amp;gt;/dev/null 2&amp;gt;&lt;span class=&#34;p&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;amp;&amp;amp;&lt;/span&gt; sudo groupdel ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;6-verify-uninstall-completion&#34;&gt;6. Verify Uninstall Completion
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;command&lt;/span&gt; -v ollama &lt;span class=&#34;o&#34;&gt;||&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;ollama binary not found&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;systemctl status ollama &lt;span class=&#34;o&#34;&gt;||&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If &lt;code&gt;ollama&lt;/code&gt; is no longer found in the checks above, the uninstall is complete.&lt;/p&gt;
&lt;!-- ollama-related-links:start --&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Gemma 4 Model Comparison and Selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/&#34; &gt;LLM Quantization Guide (FP16/Q8/Q5/Q4/Q2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/ollama-model-storage-path-and-migration/&#34; &gt;Ollama Model Storage Path and Migration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/check-ollama-model-loaded-on-gpu/&#34; &gt;How to Check Whether Ollama Uses GPU&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- ollama-related-links:end --&gt;
</description>
        </item>
        <item>
        <title>LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2</title>
        <link>https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</link>
        <pubDate>Sun, 05 Apr 2026 22:09:11 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</guid>
        <description>&lt;p&gt;The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.&lt;br&gt;
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.&lt;/p&gt;
&lt;h2 id=&#34;what-is-quantization&#34;&gt;What Is Quantization
&lt;/h2&gt;&lt;p&gt;Quantization means compressing model parameters from higher-precision formats (such as &lt;code&gt;FP16&lt;/code&gt;) into lower-bit formats (such as &lt;code&gt;Q8&lt;/code&gt; and &lt;code&gt;Q4&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;A simple analogy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Original model: like a high-quality photo, clear but large.&lt;/li&gt;
&lt;li&gt;Quantized model: like a compressed photo, slightly less detail but lighter and faster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;common-quantization-formats&#34;&gt;Common Quantization Formats
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th&gt;Precision / Bit Width&lt;/th&gt;
          &lt;th&gt;Size&lt;/th&gt;
          &lt;th&gt;Quality Loss&lt;/th&gt;
          &lt;th&gt;Recommended Use&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;FP16&lt;/td&gt;
          &lt;td&gt;16-bit float&lt;/td&gt;
          &lt;td&gt;Largest&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;Research, evaluation, max quality&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q8_0&lt;/td&gt;
          &lt;td&gt;8-bit integer&lt;/td&gt;
          &lt;td&gt;Larger&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;High-end PCs, quality + performance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q5_K_M&lt;/td&gt;
          &lt;td&gt;5-bit mixed&lt;/td&gt;
          &lt;td&gt;Medium&lt;/td&gt;
          &lt;td&gt;Slight&lt;/td&gt;
          &lt;td&gt;Daily driver, balanced choice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;4-bit mixed&lt;/td&gt;
          &lt;td&gt;Smaller&lt;/td&gt;
          &lt;td&gt;Acceptable&lt;/td&gt;
          &lt;td&gt;General default, strong value&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q3_K_M&lt;/td&gt;
          &lt;td&gt;3-bit mixed&lt;/td&gt;
          &lt;td&gt;Very small&lt;/td&gt;
          &lt;td&gt;Noticeable&lt;/td&gt;
          &lt;td&gt;Low-spec devices, run-first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q2_K&lt;/td&gt;
          &lt;td&gt;2-bit mixed&lt;/td&gt;
          &lt;td&gt;Smallest&lt;/td&gt;
          &lt;td&gt;Significant&lt;/td&gt;
          &lt;td&gt;Extreme resource limits, fallback&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;quantization-naming-rules&#34;&gt;Quantization Naming Rules
&lt;/h2&gt;&lt;p&gt;Take &lt;code&gt;gemma-4:4b-q4_k_m&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemma-4:4b&lt;/code&gt;: model name and parameter scale.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;q4&lt;/code&gt;: 4-bit quantization.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k&lt;/code&gt;: K-quants (an improved quantization method).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt;: medium level (common options also include &lt;code&gt;s&lt;/code&gt;/small and &lt;code&gt;l&lt;/code&gt;/large).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;quick-selection-by-vram&#34;&gt;Quick Selection by VRAM
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;RAM / VRAM&lt;/th&gt;
          &lt;th&gt;Recommended Quantization&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;4 GB&lt;/td&gt;
          &lt;td&gt;Q3_K_M / Q2_K&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;8 GB&lt;/td&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16 GB&lt;/td&gt;
          &lt;td&gt;Q5_K_M / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32 GB+&lt;/td&gt;
          &lt;td&gt;FP16 / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.&lt;/p&gt;
&lt;h2 id=&#34;practical-tips&#34;&gt;Practical Tips
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;Q4_K_M&lt;/code&gt; by default and test real tasks first.&lt;/li&gt;
&lt;li&gt;If response quality is not enough, move up to &lt;code&gt;Q5_K_M&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If VRAM or speed is the main bottleneck, move down to &lt;code&gt;Q3_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use the same test set every time you switch quantization formats.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Quality first: &lt;code&gt;FP16&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Balance first: &lt;code&gt;Q5_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;General default: &lt;code&gt;Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Low-spec fallback: &lt;code&gt;Q3_K_M&lt;/code&gt; or &lt;code&gt;Q2_K&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key is not &amp;ldquo;bigger is always better&amp;rdquo;, but &amp;ldquo;the most stable and usable result under your hardware limits.&amp;rdquo;&lt;/p&gt;
&lt;!-- ollama-related-links:start --&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Gemma 4 Model Comparison and Selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/uninstall-ollama-on-linux/&#34; &gt;Completely Uninstall Ollama on Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/ollama-model-storage-path-and-migration/&#34; &gt;Ollama Model Storage Path and Migration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/check-ollama-model-loaded-on-gpu/&#34; &gt;How to Check Whether Ollama Uses GPU&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- ollama-related-links:end --&gt;
</description>
        </item>
        <item>
        <title>Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B</title>
        <link>https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</link>
        <pubDate>Sun, 05 Apr 2026 08:30:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</guid>
        <description>&lt;p&gt;Gemma 4 focuses on &lt;code&gt;multimodality&lt;/code&gt; and &lt;code&gt;local offline inference&lt;/code&gt;, with a full range from lightweight to high-performance models. For most local deployment users, the key is not choosing the largest model, but choosing the one that best matches hardware and task needs.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-model-comparison&#34;&gt;Gemma 4 Model Comparison
&lt;/h2&gt;&lt;blockquote&gt;
&lt;p&gt;The table below is for quick model selection. Actual performance and resource usage should be validated in your own environment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Parameter Size&lt;/th&gt;
          &lt;th&gt;Positioning&lt;/th&gt;
          &lt;th&gt;Key Strengths&lt;/th&gt;
          &lt;th&gt;Main Limitations&lt;/th&gt;
          &lt;th&gt;Recommended Scenarios&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 2B&lt;/td&gt;
          &lt;td&gt;2B&lt;/td&gt;
          &lt;td&gt;Ultra-lightweight&lt;/td&gt;
          &lt;td&gt;Low latency, low resource usage, lowest deployment barrier&lt;/td&gt;
          &lt;td&gt;Limited performance on complex reasoning and long task chains&lt;/td&gt;
          &lt;td&gt;Mobile, IoT, lightweight Q&amp;amp;A, simple automation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 4B&lt;/td&gt;
          &lt;td&gt;4B&lt;/td&gt;
          &lt;td&gt;Lightweight enhanced&lt;/td&gt;
          &lt;td&gt;Stronger understanding and generation than 2B, still easy to deploy locally&lt;/td&gt;
          &lt;td&gt;Limited ceiling for heavy coding and complex agent tasks&lt;/td&gt;
          &lt;td&gt;Local assistant, basic document work, multilingual daily tasks&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 26B&lt;/td&gt;
          &lt;td&gt;26B&lt;/td&gt;
          &lt;td&gt;High-performance (MoE)&lt;/td&gt;
          &lt;td&gt;Better reasoning and tool use, suitable for production workflows&lt;/td&gt;
          &lt;td&gt;Significantly higher VRAM requirement and hardware threshold&lt;/td&gt;
          &lt;td&gt;Coding assistant, complex workflows, enterprise internal agents&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 31B&lt;/td&gt;
          &lt;td&gt;31B&lt;/td&gt;
          &lt;td&gt;High-performance (dense)&lt;/td&gt;
          &lt;td&gt;Best overall capability and stronger stability on complex tasks&lt;/td&gt;
          &lt;td&gt;Highest resource cost and tuning complexity&lt;/td&gt;
          &lt;td&gt;Advanced reasoning, complex coding tasks, heavy automation&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;how-to-choose-start-from-hardware-and-tasks&#34;&gt;How to Choose: Start from Hardware and Tasks
&lt;/h2&gt;&lt;p&gt;If your top concern is whether it runs smoothly, use this guideline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;12GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;4B&lt;/code&gt; or quantized variants of larger models.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;24GB&lt;/code&gt; VRAM: focus on &lt;code&gt;26B&lt;/code&gt;, and evaluate quantized &lt;code&gt;31B&lt;/code&gt; based on workload.&lt;/li&gt;
&lt;li&gt;Higher VRAM or multi-GPU: consider high-precision &lt;code&gt;31B&lt;/code&gt; setups.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prioritize stability and inference speed first, then scale up model size gradually.&lt;/p&gt;
&lt;h2 id=&#34;four-typical-use-cases&#34;&gt;Four Typical Use Cases
&lt;/h2&gt;&lt;h3 id=&#34;1-local-general-assistant&#34;&gt;1) Local General Assistant
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;4B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: strong balance between cost and quality, suitable for long-running local use.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-coding-and-automation&#34;&gt;2) Coding and Automation
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;26B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: more stable in multi-step tasks, tool calls, and script generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;3-advanced-reasoning-and-complex-agents&#34;&gt;3) Advanced Reasoning and Complex Agents
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;31B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: stronger robustness under complex context.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;4-edge-devices-and-lightweight-offline-use&#34;&gt;4) Edge Devices and Lightweight Offline Use
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;2B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: easiest to deploy on resource-constrained devices.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;deployment-suggestions-ollama&#34;&gt;Deployment Suggestions (Ollama)
&lt;/h2&gt;&lt;p&gt;A practical approach is to iterate in small steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;4B&lt;/code&gt; to establish a baseline (latency, memory, quality).&lt;/li&gt;
&lt;li&gt;Build a fixed test set from real tasks (for example, 20 common questions + 10 automation tasks).&lt;/li&gt;
&lt;li&gt;Compare &lt;code&gt;26B/31B&lt;/code&gt; against that set for accuracy, latency, and VRAM cost.&lt;/li&gt;
&lt;li&gt;Upgrade only when the gain is clear.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This avoids jumping to a large model too early and running into lag, low throughput, and maintenance overhead.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The real value of Gemma 4 is not just larger parameter counts, but a practical model ladder from lightweight to high-performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For low-cost fast rollout: start with &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For production-grade local AI workflows: prioritize &lt;code&gt;26B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For advanced reasoning and heavy automation: move to &lt;code&gt;31B&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In most cases, the best Gemma 4 choice is not the biggest model, but the one with the best fit for your hardware and task goals.&lt;/p&gt;
&lt;!-- ollama-related-links:start --&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/&#34; &gt;LLM Quantization Guide (FP16/Q8/Q5/Q4/Q2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/uninstall-ollama-on-linux/&#34; &gt;Completely Uninstall Ollama on Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/ollama-model-storage-path-and-migration/&#34; &gt;Ollama Model Storage Path and Migration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/check-ollama-model-loaded-on-gpu/&#34; &gt;How to Check Whether Ollama Uses GPU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/&#34; &gt;How to Install and Run Gemma 4 on Android (Chinese)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/&#34; &gt;How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- ollama-related-links:end --&gt;
</description>
        </item>
        
    </channel>
</rss>
