Ollama on KnightLi Blog

Claude Code + Ollama Local Deployment Guide: Build a Free AI Coding Assistant with CC Switch

Fri, 15 May 2026 23:27:50 +0800

Claude Code has recently become a popular AI coding assistant. Its appeal is not just that it can chat about code, but that it can read a project, modify files, run commands, install dependencies, and keep fixing errors in an agent-like workflow.

The hard part is cost. Once a project grows, long context and repeated agent turns can burn through API quota quickly. If you just want to experiment, refactor small utilities, generate scripts, or work on a private local project, it is natural to ask: can Claude Code’s workflow be kept while the model runs locally?

The key tool in this setup is CC Switch. It lets Claude Code connect to the local Ollama service through an OpenAI-compatible API endpoint, so requests can be forwarded to a local model instead of the official Claude API.

What This Setup Solves

You can think of the whole setup as:

Claude Code desktop
+ CC Switch API forwarding layer
+ Ollama local model

Claude Code is still responsible for the coding workflow and project operations. CC Switch handles model provider configuration and API compatibility. Ollama runs the model locally.

This does not make a local model suddenly become Claude. Its real value is that it makes Claude Code’s agent workflow usable in lower-cost, offline, and private local scenarios.

Basic Preparation

Before you start, prepare these pieces:

Install Git.
Install Ollama.
Pull a local model suitable for coding.
Install CC Switch.
Have Claude Code available on your machine.

For the model side, you can start with coding-oriented models, such as Qwen Coder, DeepSeek Coder, or other models with decent tool-calling and code generation behavior. Larger models tend to give better results, but they also need more memory and put more pressure on the GPU.

If your machine only has limited memory, start with a smaller model first. Confirm that the workflow runs smoothly before trying a larger one.

Key CC Switch Configuration

After Ollama starts, its default local API address is usually:

`http://127.0.0.1:11434/v1`

In CC Switch, choose an OpenAI-compatible provider type, commonly:

`OpenAI Chat Completions`

Then point the base URL to Ollama’s local address.
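
Before wiring up CC Switch, it can help to confirm that the local endpoint answers OpenAI-style chat requests. The following is a minimal sketch, assuming Ollama is listening on the default address above and that a model such as `qwen3:8b` has already been pulled. The function names and the `BASE_URL` override are hypothetical helpers for this example, not part of Ollama or CC Switch.

```shell
# Build a minimal OpenAI-style chat request body.
# Assumption: the model name and prompt contain no double quotes or backslashes.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

# Send the request to the OpenAI-compatible chat completions endpoint.
# BASE_URL is a hypothetical override used only in this sketch.
ollama_chat_check() {
  base_url="${BASE_URL:-http://127.0.0.1:11434/v1}"
  curl -sS "$base_url/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "$(chat_payload "$1" "$2")"
}

# Usage (requires a running Ollama service):
#   ollama_chat_check qwen3:8b 'Say hello in one word.'
```

If this returns a JSON response containing a `choices` array, the endpoint is reachable and the remaining work is on the CC Switch side.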

For the API key field, local Ollama normally does not need a real key, but many tools still require an environment variable or placeholder. You can use:

`ANTHROPIC_API_KEY`

or another placeholder variable accepted by your local setup.

One configuration item is worth special attention:

`"inferenceModels"="[\"haiku\",\"sonnet\",\"opus\"]"`

This means mapping Claude Code’s expected model roles to the local provider. In practice, you need to bind haiku, sonnet, and opus to the model names exposed by Ollama or CC Switch. If this mapping is wrong, Claude Code may fail to call the model or may keep falling back to an unexpected configuration.

Where Claude Code Is Strong

Claude Code’s biggest advantage is not raw completion. It is the full coding workflow:

reading and understanding project structure;
locating related files based on a task;
editing code directly;
running commands and tests;
observing errors and iterating;
completing multi-step tasks in one session.

This is why many people want to keep Claude Code even when switching to a local model. A normal chat UI can generate code snippets, but it does not naturally operate inside a repository. Claude Code is closer to an executable development assistant.

What Role Ollama Plays Here

Ollama is responsible for local model runtime and management. It handles model downloading, loading, and local inference.

The advantage is clear: requests stay on your machine, repeated use does not create API bills, and you can use it when the network is limited. For private code, this is also easier to accept than sending every context window to a cloud model.

The trade-off is also clear. Local models depend heavily on your hardware and on model quality. A smaller model can handle simple edits, explanations, and script generation, but it may struggle with large cross-file refactors or subtle architectural decisions.

Where The Experience Has Boundaries

This setup should not be treated as a full replacement for Claude’s strongest cloud models.

You may run into these issues:

weaker long-context understanding;
unstable tool-calling behavior in complex tasks;
slower inference on CPU-only machines;
more hallucinated file paths or APIs;
less reliable multi-round planning;
lower success rate on large repository refactors.

So the better expectation is: use it as a free local development assistant, not as a perfect substitute for a top-tier cloud model.

Multimodal Compatibility Is Still Unstable

Some users want Claude Code to handle screenshots, UI images, diagrams, or other multimodal inputs. This part depends on the local model and the forwarding layer.

If the selected Ollama model does not support vision, or CC Switch does not translate the request format correctly, multimodal features may fail. Even with a vision model, behavior may differ from Claude’s official API.

For now, this setup is more suitable for text and code workflows. Treat multimodal support as experimental.

Who Should Try It

This setup is suitable for:

developers who want to try Claude Code’s workflow at low cost;
users who frequently write scripts, small tools, and automation snippets;
teams that want to keep code on local machines;
learners who want an AI coding assistant without constant API spend;
people testing different local coding models.

It is less suitable if you rely heavily on long context, large monorepos, strict code review quality, or complex full-project refactors.

Usage Advice

Start with small tasks.

For example:

explain a single file;
refactor a small function;
generate a shell script;
fix a simple error;
add a small feature;
write unit tests for a narrow module.

After each change, run tests or at least review the diff yourself. A local model can be useful, but you should not blindly accept every generated edit.

If the model keeps losing context, reduce the task scope. Instead of asking it to “refactor the whole project”, ask it to “refactor this function” or “add validation in this file”.

Summary

Claude Code + CC Switch + Ollama is an interesting combination. It keeps Claude Code’s agent-style development workflow while moving inference to a local model.

Its biggest strengths are lower cost, local privacy, and a smooth development workflow. Its limits are also obvious: model quality, hardware performance, long context, and tool-calling stability all affect the final experience.

If you already use Ollama and want a more practical local AI coding workflow, this setup is worth trying. Just remember to start small, verify every change, and treat the local model as an assistant rather than an automatic engineer.

Local LLM Models Recommended for an RTX 3060 GPU

Fri, 08 May 2026 09:25:24 +0800

The most common RTX 3060 variant has 12GB of VRAM. It is not a top-tier AI GPU, but it is a very usable card for local LLMs, especially 7B, 8B, 9B, and 12B models.

If you only want a quick rule of thumb, remember this:

On an RTX 3060 12GB, prioritize models around 8B in Q4_K_M or Q5_K_M quantization. Choose Q4 for stability, and try Q5 if you want better quality.

Do not start by chasing 32B or 70B models. Even if they can run with low-bit quantization and CPU offloading, their speed and experience are usually not suitable for daily use.

Start With the VRAM Limit

For local LLMs on an RTX 3060 12GB, the real limit is VRAM.

Model Size	Recommended Quantization	RTX 3060 12GB Experience
3B / 4B	Q4, Q5, Q8	Very easy, fast
7B / 8B / 9B	Q4_K_M, Q5_K_M	Best balance of quality and speed
12B / 14B	Q4_K_M	Usable, but avoid huge context
30B+	Q2 / Q3 or partial offload	Possible to tinker with, not recommended daily
70B+	Very low quantization or heavy CPU/RAM use	More like an experiment

Local LLMs do not only consume VRAM for the model file. Context length, KV cache, batch size, inference framework, and drivers all consume resources.

So 12GB of VRAM does not mean you can load a 12GB model file directly. It is better to leave room for the system and context.
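
To make the budget concrete, the weight size can be roughly estimated as parameter count times bits per weight divided by eight. The sketch below assumes about 4.85 bits per weight for a Q4_K_M-style quantization. That figure is an approximation for illustration, and real GGUF files vary by model. The function name is hypothetical.

```shell
# Rough size of quantized weights in MB (1 MB = 10^6 bytes).
# $1 = parameters in millions, $2 = bits per weight multiplied by 100.
weights_mb() {
  echo $(( $1 * $2 / 800 ))
}

# Assumption: about 4.85 bits per weight for a Q4_K_M-style quantization.
weights_mb 8000 485    # 8B model  -> 4850
weights_mb 12000 485   # 12B model -> 7275
```

Under these assumptions an 8B model leaves several gigabytes for context and overhead on a 12GB card, while a 12B model leaves much less, which matches the table above.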

Recommendation 1: Qwen3 8B

If you mainly use Chinese, Qwen3 8B is one of the first models worth trying on an RTX 3060.

Good for:

Chinese Q&A.
Summarization and rewriting.
Everyday knowledge assistant work.
Simple code explanation.
Local RAG.
Lightweight Agent flows.

Recommended choice:

Qwen3 8B GGUF
Q4_K_M: first choice
Q5_K_M: better quality, more VRAM pressure

Qwen models are friendly to Chinese usage. For daily writing, information organization, and Chinese instruction following, Qwen3 8B is a good first model.

Recommendation 2: Llama 3.1 8B Instruct

Llama 3.1 8B Instruct is a stable general-purpose model with mature English capability and ecosystem support.

Good for:

English Q&A.
Lightweight coding help.
General chat.
Document summarization.
Prompt testing.
Comparing different inference tools.

Recommended choice:

Llama 3.1 8B Instruct GGUF
Q4_K_M: better speed and VRAM stability
Q5_K_M: better answer quality

If you mainly process English materials, or want a model with many tutorials and broad compatibility, Llama 3.1 8B is still a good baseline.

Recommendation 3: Gemma 3 12B

Gemma 3 12B is closer to the upper practical limit for an RTX 3060 12GB.

It uses more VRAM than 8B models, but Q4 quantization can still make it usable on a 12GB card. It is a good option if you want to try a slightly larger model on one GPU.

Good for:

Higher-quality general Q&A.
English content processing.
More complex summarization and analysis.
Trying an upgrade over 8B models.

Recommended choice:

Gemma 3 12B GGUF
Q4_K_M or official QAT Q4
Keep context modest

If you run out of VRAM, reduce context length first, or return to an 8B model. For an RTX 3060, 12B is “worth trying,” not a no-brainer default.

Recommendation 4: DeepSeek R1 Distill Qwen 8B

If you want to experience reasoning-style local models, try models like DeepSeek R1 Distill Qwen 8B.

Good for:

Simple reasoning tasks.
Step-by-step analysis.
Learning reasoning-model output style.
Low-cost local experiments.

Recommended choice:

DeepSeek R1 Distill Qwen 8B GGUF
Q4_K_M

These models may produce longer reasoning-style outputs, so speed and context usage can be heavier than ordinary instruction models. They are not always more comfortable for daily chat, but they are useful for reasoning experiments.

Recommendation 5: Phi / MiniCPM / Smaller Models

If your RTX 3060 is an 8GB variant, or your system RAM is limited, consider 3B and 4B models first.

Good for:

Fast Q&A.
Simple summaries.
Embedding into local tools.
Low-latency chat.
Testing on older machines.

These models may not match 8B or 12B quality, but they are light, fast, and easy to deploy.

Which Quantization to Use

Local models commonly use GGUF, with quantization types such as Q4, Q5, Q6, and Q8.

Quantization	Traits	Best For
Q4_K_M	Small, fast, good enough	RTX 3060 first choice
Q5_K_M	Better quality, higher usage	Try with 8B models
Q6 / Q8	Closer to original quality, larger	Small models or more VRAM
Q2 / Q3	Saves VRAM but quality drops	Large-model tinkering

For RTX 3060 12GB, the practical choices are:

8B models: Q4_K_M or Q5_K_M
12B models: Q4_K_M first
Larger models: not recommended as daily drivers

Which Tool to Use

Beginners can start with Ollama, because installation and running models are simple.

Common commands:

ollama run qwen3:8b
ollama run llama3.1:8b

If you want finer control over GGUF files, GPU layers, and context length, use llama.cpp or GUI tools based on it.

Common choices:

Ollama: easiest, best for beginners.
LM Studio: friendly GUI, good for downloading and switching models.
llama.cpp: most control, best for performance tuning.
text-generation-webui: many features, good for backend testing.

For local chat and simple Q&A, Ollama or LM Studio is enough.

Do Not Set Context Too High

Many models advertise long-context support, but do not blindly set context to the maximum on an RTX 3060.

Longer context uses more KV cache and increases VRAM pressure. Even if the model loads, long context can slow generation down.

Suggested settings:

Normal chat: 4K to 8K
Document summaries: 8K to 16K
Long-document RAG: chunk first; do not paste everything at once

An RTX 3060 is better suited to “moderate context + good model + good retrieval” than forcing hundreds of thousands of tokens into one prompt.
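
The effect of context length on VRAM can be estimated with simple arithmetic: the KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below uses an assumed model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries), which is typical of an 8B model with grouped-query attention. Check the actual model card before relying on these numbers. The function name is hypothetical.

```shell
# Approximate KV cache size in MiB for a given context length.
# Args: layers, KV heads, head dimension, context tokens, bytes per element.
# The leading 2 accounts for storing both keys and values.
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
kv_cache_mib 32 8 128 8192 2    # 8K context  -> 1024
kv_cache_mib 32 8 128 32768 2   # 32K context -> 4096
```

With these assumptions, going from 8K to 32K context costs about 3 GiB of extra VRAM before any other overhead, which is why modest context settings fit a 12GB card more comfortably.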

Choose by Use Case

If you mainly write Chinese:

First choice: Qwen3 8B Q4_K_M
Alternative: DeepSeek R1 Distill Qwen 8B

If you mainly write English:

First choice: Llama 3.1 8B Instruct Q4_K_M
Alternative: Gemma 3 12B Q4_K_M

If you want speed:

3B / 4B models
8B Q4_K_M
Keep context at 4K to 8K

If you want better quality:

8B Q5_K_M
12B Q4_K_M
Accept slower speed

If you want coding help:

8B coding models can help with explanations and small edits
For complex engineering tasks, use stronger cloud models

Local RTX 3060 models are good for code explanation, function completion, small scripts, and offline assistance. For large refactors, difficult bugs, and cross-file Agent work, do not expect Claude Sonnet or GPT-5-level performance.

Reasonable Expectations

The RTX 3060 12GB is good enough to turn local LLMs from toys into daily tools, but it will not recreate top cloud models at home.

Its strengths:

Low cost.
More VRAM than 8GB cards.
Good 8B model experience.
Offline use.
Local processing for privacy-sensitive materials.

Its limits:

Large models are hard to run smoothly.
Long context consumes VRAM.
Slower than high-end GPUs.
Small local models have limited complex reasoning.
Multimodal and Agent workflows need more resources.

The stable route is: use 8B models as everyday local assistants, try 12B models for quality, and leave complex tasks to cloud models.

Summary

Recommended local LLM choices for RTX 3060 12GB:

Chinese general use: Qwen3 8B Q4_K_M
English general use: Llama 3.1 8B Instruct Q4_K_M
Higher-quality experiment: Gemma 3 12B Q4_K_M
Reasoning experiment: DeepSeek R1 Distill Qwen 8B Q4_K_M
Low-VRAM fast use: 3B / 4B small models

Choose Q4_K_M first. Try Q5_K_M for 8B models if you want better quality. Start with Ollama or LM Studio.

Do not treat the RTX 3060 as a large-model server. Treat it as a local knowledge assistant, privacy document processor, lightweight coding helper, and model experiment card, and it will fit its real capabilities much better.

References

Qwen3 8B GGUF: https://huggingface.co/Qwen/Qwen3-8B-GGUF
Llama 3.1 8B GGUF: https://huggingface.co/macandchiz/Llama-3.1-8B-Instruct-GGUF
Gemma 3 12B GGUF: https://huggingface.co/unsloth/gemma-3-12b-it-GGUF
llama.cpp: https://github.com/ggml-org/llama.cpp
Ollama: https://ollama.com

How to Fix Ollama Using CPU Instead of GPU

Fri, 24 Apr 2026 18:30:00 +0800

When running local LLMs, one of the most frustrating problems is this: your machine clearly has a GPU, yet Ollama still leans heavily on the CPU, and performance is painfully slow.

The short version is that this is usually not caused by one single issue. The most common causes are:

Ollama is not detecting any usable GPU
The driver, ROCm, or CUDA environment is not set up correctly
The Ollama service was started without the right environment variables
The model is too large and has fallen back to CPU or mixed CPU/GPU loading
On AMD platforms, there may be extra compatibility issues such as ROCm version mismatch, gfx settings, or device visibility problems

The fastest way to troubleshoot it is to go through the checks below in order.

1. First, confirm whether Ollama is really not using the GPU

The most direct check is:

`ollama ps`

Focus on the PROCESSOR column.

100% GPU: the model is fully running on the GPU
100% CPU: the GPU is not being used at all
Results like 48%/52% CPU/GPU: part of the model is in VRAM, and part has spilled into system memory

If you see 100% CPU, the next step is to focus on environment and service configuration.
If you see mixed loading, that does not necessarily mean the GPU is broken. In many cases, it simply means VRAM is not enough.
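
The three cases above can be expressed as a small classifier, which is handy when checking several machines or scripting a health check. This is a sketch that only recognizes the PROCESSOR values described in this section. The function name is hypothetical.

```shell
# Classify the PROCESSOR value reported by `ollama ps`.
classify_processor() {
  case "$1" in
    "100% GPU") echo "gpu" ;;
    "100% CPU") echo "cpu" ;;
    *CPU/GPU*)  echo "mixed" ;;
    *)          echo "unknown" ;;
  esac
}

classify_processor "100% GPU"          # prints: gpu
classify_processor "48%/52% CPU/GPU"   # prints: mixed
```

A result of `cpu` points toward driver or service configuration, while `mixed` points toward VRAM capacity.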

2. Rule out the most common misunderstanding first: the model does not fit into VRAM

Many people assume that once a GPU is installed, Ollama will always run fully on it. That is not how it works.

If the model is too large, the context is too long, or some other loaded model is already occupying VRAM, Ollama may fall back to:

Partial GPU + partial CPU
Full 100% CPU

At this point, the two simplest tests are:

Try a smaller model first. For example, test with a 4B or 7B model before jumping straight to much larger ones.
Unload other active models and test again. Run ollama ps first and make sure nothing else is occupying VRAM.

If smaller models use the GPU but larger ones do not, the real problem is usually VRAM capacity rather than the driver.

3. Check whether the GPU driver and the lower-level runtime are actually working

If even small models run only on CPU, the next step is to check the underlying environment.

NVIDIA

First confirm that the driver is working and the system can see the GPU. A common check is:

`nvidia-smi`

If this already fails, Ollama is very unlikely to use the GPU correctly.

AMD / ROCm

If you are using an AMD GPU, especially with ROCm, start with:

rocminfo
rocm-smi

If these tools cannot list the device properly, the problem is still below Ollama, so there is no point debugging the application layer yet.

On AMD, the most common issue is not simply “is the driver installed,” but rather:

The ROCm version does not match the OS version
The current GPU architecture has incomplete support
The device exists, but the runtime is not being exposed correctly to Ollama

4. Restart the Ollama service, not just your terminal

This is a very common trap.

Many people install drivers, change environment variables, fix ROCm, then just open a new terminal and continue with ollama run. But if Ollama is running as a background service, it may still be using the old environment.

So the safer approach is:

Fully restart the Ollama service
Reboot the machine if necessary

If you are running it as a service on Linux, make sure the service process was actually restarted instead of reusing the old one.

5. Check whether the environment variables are really reaching the service

This matters especially on AMD ROCm systems.

Some machines work fine when commands are run manually in a shell, but the Ollama service still uses only CPU. In that case, the usual reason is that the service process never received the variables you set in your shell.

Common variables to look at include:

ROCR_VISIBLE_DEVICES
HSA_OVERRIDE_GFX_VERSION

Specifically:

ROCR_VISIBLE_DEVICES limits or selects which GPUs ROCm can see
HSA_OVERRIDE_GFX_VERSION is often used as a compatibility workaround on some AMD platforms

If you only export these variables in the current terminal, but Ollama is started by systemd, a desktop background service, or another daemon, they may not take effect.

In other words, “it looks set in my terminal” does not mean Ollama is actually using it.
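
One way to verify this on Linux is to read the environment of the running Ollama process rather than the shell's. The sketch below assumes a Linux system where the process is named `ollama`. The function name is hypothetical, and reading another user's process environment may require sudo.

```shell
# Print GPU-related variables from a NUL-separated environ file,
# such as /proc/<pid>/environ on Linux. Prints nothing if none are set.
gpu_env_from() {
  tr '\0' '\n' < "$1" \
    | grep -E '^(CUDA_VISIBLE_DEVICES|ROCR_VISIBLE_DEVICES|HSA_OVERRIDE_GFX_VERSION|GGML_VK_VISIBLE_DEVICES)=' \
    || true
}

# Usage on Linux:
#   pid=$(pgrep -o -x ollama)
#   gpu_env_from "/proc/$pid/environ"
# For a systemd unit, the configured values can also be listed with:
#   systemctl show ollama --property=Environment
```

If the variable shows up in your terminal with `echo` but not in this output, the service never received it.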

6. On AMD platforms, focus on ROCm compatibility

Based on the public page metadata, the original video for this topic covers the AMD Max+ 395, Strix Halo, and AMD ROCm.
In setups like these, whether Ollama can use the GPU often depends more on version matching than it does on NVIDIA systems.

Start by checking these:

Whether the installed ROCm version fits the current OS and GPU
Whether the GPU belongs to an architecture with solid ROCm support
Whether you need to set HSA_OVERRIDE_GFX_VERSION
Whether an older Ollama build or older inference runtime is causing compatibility issues

If rocminfo works and the GPU is visible to the system, but Ollama still runs only on CPU, the issue is often in the version combination rather than in model parameters.

7. In Docker, WSL, or remote environments, also check device mapping

If you are not running on bare metal but inside:

Docker
WSL
Remote containers
Virtualized environments

then you need to check one more layer: whether the GPU device is actually being exposed inside that environment.

A typical symptom looks like this:

The host machine can see the GPU
Ollama inside the container or subsystem still uses only CPU

In that case, the issue may not be Ollama itself. The container or subsystem may simply not have GPU access.

8. Check logs last, but check them for the right reason

If you have already gone through the earlier steps, the most effective next move is not endless reinstalling, but looking directly at the Ollama startup and runtime logs.

Focus on two kinds of messages:

Whether a GPU was detected at all
Whether there are driver, library loading, or device initialization errors

If the logs clearly say something like “no compatible GPU found” or “failed to initialize ROCm/CUDA,” the troubleshooting direction becomes much clearer immediately.
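
A simple keyword filter narrows the logs to the lines that matter here. This is a sketch, assuming the service is managed by systemd under the unit name `ollama` or runs in a Docker container named `ollama`. The keyword list is a starting point, not an exhaustive pattern, and the function name is hypothetical.

```shell
# Keep only log lines that mention GPU discovery or runtime initialization.
# Reads log text on stdin and prints nothing if no line matches.
gpu_log_lines() {
  grep -iE 'gpu|cuda|rocm|vulkan' || true
}

# Usage with a systemd-managed service:
#   journalctl -u ollama --no-pager | gpu_log_lines
# Usage with Docker:
#   docker logs ollama 2>&1 | gpu_log_lines
```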

Troubleshooting Order

If you only want the shortest path, use this order:

Run ollama ps and confirm whether it is GPU, CPU, or mixed loading
Try a smaller model to rule out VRAM limits
Use nvidia-smi, rocminfo, and rocm-smi to verify the lower-level environment first
Fully restart the Ollama service
Check service environment variables, especially ROCR_VISIBLE_DEVICES and HSA_OVERRIDE_GFX_VERSION on AMD
If you are in Docker or WSL, verify device mapping
Finally, inspect logs for the exact error

Conclusion

When Ollama uses CPU instead of GPU, the root cause usually falls into one of three groups:

The GPU is not being detected at all
The GPU is detectable, but the runtime environment is not reaching Ollama
The GPU is working, but the model is too large and falls back to CPU or mixed memory

Once you separate those three cases, troubleshooting becomes much faster.
If you are on an AMD platform, pay special attention to ROCm version matching, device visibility, and compatibility variables instead of focusing only on the Ollama command itself.

Original video: https://www.bilibili.com/video/BV1cHoYBqE8k/

Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings

Sun, 19 Apr 2026 00:18:00 +0800

When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?

This note summarizes how Ollama behaves with multiple GPUs. The short version:

Ollama supports multiple GPUs.
The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.
By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.
If a model does not fit on one GPU, Ollama can spread it across available GPUs.
Mixed GPU models may be visible to Ollama, but performance and placement may not be ideal.
SLI / NVLink is not required for multi-GPU use.
To limit which GPUs Ollama can use, use CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, or GGML_VK_VISIBLE_DEVICES.

Official Behavior: Single GPU First, Multi-GPU When Needed

Ollama’s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.

The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.

So do not think of Ollama multi-GPU as “more cards automatically means several times faster.” A more accurate model is:

Small model fits on one GPU: usually runs on one GPU.
Large model does not fit on one GPU: split across multiple GPUs.
Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.
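
The three cases can be written down as a toy decision function. This is a simplified model of the rule described above, not Ollama's actual scheduler: the real estimate also accounts for context, KV cache, and runtime overhead. The function name and the example numbers are hypothetical.

```shell
# Simplified model of the placement rule described above.
# $1 = estimated model size in MB; remaining args = free VRAM per GPU in MB.
placement() (
  need=$1
  shift
  total=0
  best=0
  for free in "$@"; do
    total=$(( total + free ))
    if [ "$free" -gt "$best" ]; then best=$free; fi
  done
  if [ "$need" -le "$best" ]; then
    echo "single GPU"
  elif [ "$need" -le "$total" ]; then
    echo "split across GPUs"
  else
    echo "partial CPU/RAM fallback"
  fi
)

placement 5000 12000 12000    # prints: single GPU
placement 18000 12000 12000   # prints: split across GPUs
placement 40000 12000 12000   # prints: partial CPU/RAM fallback
```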

Use this command to see where the model is loaded:

`ollama ps`

The PROCESSOR column may show something like:

100% GPU
48%/52% CPU/GPU
100% CPU

If you see 48%/52% CPU/GPU, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.

Multi-GPU Is Not Simple Compute Stacking

Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.

So multi-GPU benefits usually fall into two categories:

VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.
Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.

If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama’s default “use one GPU when it fits” strategy avoids that unnecessary PCIe cost.

SLI or NVLink Is Not Required

Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.

NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.

What you should pay attention to is PCIe bandwidth. The difference between x1, x4, x8, and x16 affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.

Safer rules:

Prefer x16 / x8 over mining-style x1 risers.
PCIe bandwidth matters more when switching large models frequently.
If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.
For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.

Limit Which NVIDIA GPUs Ollama Uses

On NVIDIA multi-GPU systems, use CUDA_VISIBLE_DEVICES to control which GPUs Ollama can see.

Temporary run:

`CUDA_VISIBLE_DEVICES=0,1 ollama serve`

Use only the second GPU:

`CUDA_VISIBLE_DEVICES=1 ollama serve`

Force Ollama not to use NVIDIA GPUs:

`CUDA_VISIBLE_DEVICES=-1 ollama serve`

The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:

`nvidia-smi -L`

Example output:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)

Then specify the UUID:

`CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve`
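
Copying UUIDs by hand is error-prone, so the list can be extracted from the `nvidia-smi -L` output instead. The sketch below assumes the output format shown in the example above. The function name is hypothetical.

```shell
# Extract GPU UUIDs from `nvidia-smi -L` output (read on stdin)
# and join them with commas for use in CUDA_VISIBLE_DEVICES.
gpu_uuids() {
  sed -n 's/.*(UUID: \(GPU-[^)]*\)).*/\1/p' | paste -sd, -
}

# Usage (all detected GPUs):
#   CUDA_VISIBLE_DEVICES=$(nvidia-smi -L | gpu_uuids) ollama serve
# To pick specific cards, filter the lines first, for example:
#   nvidia-smi -L | grep 'RTX 3090' | gpu_uuids
```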

If Ollama is installed as a Linux systemd service, put the variable into the service environment:

`sudo systemctl edit ollama.service`

Add:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

AMD and Vulkan Device Selection

For AMD ROCm, use ROCR_VISIBLE_DEVICES to control visible GPUs:

`ROCR_VISIBLE_DEVICES=0,1 ollama serve`

To force Ollama not to use ROCm GPUs, use an invalid ID:

`ROCR_VISIBLE_DEVICES=-1 ollama serve`

Ollama’s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use GGML_VK_VISIBLE_DEVICES:

`OLLAMA_VULKAN=1 GGML_VK_VISIBLE_DEVICES=0 ollama serve`

If Vulkan devices cause problems, disable them:

`GGML_VK_VISIBLE_DEVICES=-1 ollama serve`

AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as HSA_OVERRIDE_GFX_VERSION. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.

Exposing Multiple GPUs in Docker

If you run Ollama in Docker, NVIDIA setups usually require nvidia-container-toolkit, then --gpus to expose devices.

Expose all GPUs:

docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Expose specific GPUs:

docker run -d \
  --gpus '"device=0,1"' \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

You can also combine this with environment variables:

docker run -d \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

If nvidia-smi cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.
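
That check can be scripted so it runs right after the container starts. This is a sketch, assuming an NVIDIA setup where the container toolkit makes `nvidia-smi` available inside the container, and a container named `ollama` as in the examples above. The function name is hypothetical.

```shell
# Report whether a running container can see any NVIDIA GPU.
# $1 = container name (the examples above use "ollama").
container_sees_gpu() {
  if docker exec "$1" nvidia-smi -L 2>/dev/null | grep -q '^GPU '; then
    echo "GPU visible in $1"
  else
    echo "no GPU visible in $1"
    return 1
  fi
}

# Usage:
#   container_sees_gpu ollama
```

A "no GPU visible" result means the problem sits in Docker GPU passthrough, so there is no point tuning Ollama variables yet.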

What Is `OLLAMA_SCHED_SPREAD`

In some multi-GPU configuration discussions, you may see OLLAMA_SCHED_SPREAD=1 or OLLAMA_SCHED_SPREAD=true. It is related to Ollama’s scheduler and is often used when people want models or requests to be spread more broadly across GPUs.

Example:

`OLLAMA_SCHED_SPREAD=1 ollama serve`

Or with systemd:

[Service]
Environment="OLLAMA_SCHED_SPREAD=true"

But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.

Treat OLLAMA_SCHED_SPREAD as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on ollama ps, logs, and nvidia-smi.

How to Check Whether Multiple GPUs Are Being Used

Useful commands:

`ollama ps`

`watch -n 0.5 nvidia-smi`

View the Ollama service logs:

`journalctl -u ollama -f`

If using Docker:

`docker logs -f ollama`

Watch for:

Whether Ollama discovers compatible GPUs.
Whether the model shows 100% GPU or a CPU/GPU split.
Whether each GPU has VRAM allocated.
Whether VRAM grows on multiple GPUs during model loading.
Whether generation token/s improves compared with CPU/RAM spillover.
Whether OOM or model unloading happens frequently.

GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.
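
Because utilization can mislead, per-GPU memory allocation is often the clearer signal of whether a model was actually placed on several cards. The sketch below summarizes the CSV output of an `nvidia-smi` query. The function name is hypothetical, and it expects the three columns listed in the comment.

```shell
# Summarize per-GPU memory from nvidia-smi CSV output read on stdin.
# Expected columns: index, memory.used (MiB), memory.total (MiB).
vram_summary() {
  awk -F', *' '{ printf "GPU %s: %d of %d MiB used (%d%%)\n", $1, $2, $3, $2 * 100 / $3 }'
}

# Usage:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | vram_summary
```

If only one GPU shows significant memory in use after loading a model, the model fit on a single card and was not split.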

Common Misunderstandings

Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU

Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the “does not fit” problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.

Misunderstanding 2: Different GPU Models Cannot Be Mixed

Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, the smaller VRAM, and the PCIe topology. The most predictable setup is still identical cards with the same VRAM size and a well-supported driver for that generation.

Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU

Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.

Misunderstanding 4: NVLink / SLI Is Required

No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.

Misunderstanding 5: Adding a GPU Does Not Require Restarting Services

Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.

GPU Selection Suggestions

For Ollama local inference, the rough priority is:

Larger single-GPU VRAM is usually easier to manage.
Identical GPUs are easier to troubleshoot than mixed GPUs.
More complete PCIe lanes make large-model loading smoother.
Older cards should be checked for CUDA compute capability or ROCm support first.
Multi-GPU power, cooling, and chassis airflow must be planned ahead.

For budget second-hand platforms:

Dual RTX 3090 remains a common high-VRAM option.
Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all need trade-offs.
Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.
Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.

Summary

Ollama multi-GPU support is best understood as “VRAM expansion first, performance acceleration second.” If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.

In practice, use ollama ps to check where the model is loaded, then use nvidia-smi or ROCm tools to observe VRAM allocation. For GPU selection, use CUDA_VISIBLE_DEVICES on NVIDIA, ROCR_VISIBLE_DEVICES on AMD ROCm, and GGML_VK_VISIBLE_DEVICES for Vulkan. If running in Docker, first make sure the container can see the GPUs.
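As a minimal sketch of the GPU selection step, the helper below builds an environment that restricts which devices a process can see. The variable names come from the text above; the backend keys (`cuda`, `rocm`, `vulkan`) and the function name are labels invented for this example.

```python
import os

# Environment variable per backend, as listed in the text.
VISIBLE_DEVICE_VARS = {
    "cuda": "CUDA_VISIBLE_DEVICES",
    "rocm": "ROCR_VISIBLE_DEVICES",
    "vulkan": "GGML_VK_VISIBLE_DEVICES",
}

def env_for_gpus(backend, device_ids, base_env=None):
    """Return a copy of the environment limited to the given device IDs."""
    env = dict(os.environ if base_env is None else base_env)
    env[VISIBLE_DEVICE_VARS[backend]] = ",".join(str(d) for d in device_ids)
    return env
```

The resulting dictionary can be passed as `env=` to `subprocess.Popen(["ollama", "serve"], ...)`. For a systemd service or a Docker container, the same variable has to be set in the service or container configuration instead.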

Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.

References

Ollama FAQ: How does Ollama load models on multiple GPUs?: https://github.com/ollama/ollama/blob/main/docs/faq.mdx
Ollama GPU docs: Hardware support / GPU Selection: https://github.com/ollama/ollama/blob/main/docs/gpu.mdx
Ollama Docker Hub: https://hub.docker.com/r/ollama/ollama
NVIDIA Container Toolkit: https://github.com/NVIDIA/nvidia-container-toolkit

Deploy Hermes Agent Locally on Windows with WSL + Ollama and Connect Telegram

Sat, 18 Apr 2026 00:48:22 +0800

If you want to run Hermes Agent on Windows with as little friction as possible, a practical path is:

keep Windows as the host system
run Ubuntu inside WSL
use Ollama to serve the local model
let Hermes Agent connect directly to the local Ollama endpoint

This approach keeps the environment relatively clean, lets you run most commands in a Linux-style workflow, and avoids preparing a separate Linux machine.

Overall flow

You can split the setup into 4 steps:

Enable WSL and install Ubuntu
Install Python, Node.js, Git, and other basics inside Ubuntu
Install Ollama and pull a local model
Install Hermes Agent, then connect Telegram

If your goal is simply to get Hermes Agent running first, by the end of step 3 you are already close.

1. Install WSL and Ubuntu

Run this in PowerShell with administrator privileges:

wsl --install

After the installation finishes, restart the PC. wsl --install installs Ubuntu by default; if Ubuntu is not present after the reboot, install it explicitly:

wsl --install -d Ubuntu

After that, open Ubuntu in WSL. Most of the remaining commands are run there.

2. Update Ubuntu and install the base environment

Update the system first:

sudo apt update
sudo apt upgrade -y

Then install Python, the zstd compression tool, Node.js, and Git.

Install Python

sudo apt install python3-pip python3-venv -y

Install zstd

sudo apt install -y zstd

Install Node.js

curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs

Install Git

sudo apt update
sudo apt install -y git

You can quickly verify the installation with:

node -v
npm -v
git --version
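The same check can be scripted. A minimal sketch that reports which of the tools installed in this step are missing from PATH; the default tool list and the function name are assumptions for illustration.

```python
import shutil

def missing_tools(tools=("python3", "node", "npm", "git", "zstd")):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means every tool was found.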

3. Install Ollama and pull Gemma 4

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

If you want a local model for Hermes Agent, starting with Gemma 4 is reasonable.

For example:

ollama run gemma4:e4b

If your machine is weaker, you can also try:

ollama run gemma4:e2b

Larger variants include:

ollama run gemma4:26b
ollama run gemma4:31b

For most normal Windows + WSL setups, gemma4:e4b is usually the more practical starting point.

4. Install and configure Hermes Agent

Install it with:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

After installation, point it to the local Ollama endpoint:

http://127.0.0.1:11434

Use the local model name you actually installed, for example:

gemma4:e4b
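Before entering these values, it can help to confirm that the endpoint answers and that the model tag is actually installed. This is a minimal sketch using Ollama's `GET /api/tags` endpoint, which lists local models. The function names are illustrative and not part of Hermes Agent.

```python
import json
import urllib.request

def installed_models(tags_payload):
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]

def fetch_installed_models(base_url="http://127.0.0.1:11434", timeout=5):
    """Ask a running Ollama server which models it has locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return installed_models(json.load(resp))
```

If `"gemma4:e4b" in fetch_installed_models()` is false, pull or run the model first, or use the exact tag that the list returns.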

If the installer asks you to refresh the shell, run:

source ~/.bashrc

Common Hermes Agent commands

These are the commands you will use most often:

Start

hermes

Re-enter setup

hermes setup

Configure the chat gateway

hermes setup gateway

Update

hermes update

Basic Telegram connection steps

If you want Hermes Agent to send and receive messages through Telegram, the core step is still:

hermes setup gateway

Then prepare the two Telegram-side items you need:

create a bot with BotFather
get your User ID with @userinfobot

Once you have those basics, continue filling them into the Hermes Agent gateway setup.

Who this setup fits

This workflow is a good fit if:

Windows is your main desktop system
you do not want to maintain a separate Linux host
you want to get a local Agent running first, then expand to chat platforms
you prefer local models instead of depending on cloud APIs

If you mainly want to experience a local Agent rather than build a full production deployment immediately, this path is already practical enough.

A few things to keep in mind

WSL is still a compatibility layer, so in extreme cases it may not behave exactly like native Linux
whether a large model runs smoothly still depends on your RAM, VRAM, and CPU / GPU
gemma4:e4b is a realistic starting point, but actual experience still depends on the machine
Hermes Agent platform integration is an extension step; getting the local model path working first, then adding Telegram, is usually more stable

Conclusion

If you want to deploy Hermes Agent locally on Windows with as little friction as possible, the smoother order is:

WSL -> Ubuntu -> Ollama -> Gemma 4 -> Hermes Agent -> Telegram

Get the local model running first, then add the gateway integration. That usually gives you a much higher success rate. For most users, this is easier to troubleshoot than piling on every component at the beginning, and it also leaves room for later expansion.

Original reference

This post is rewritten and organized based on:

Xchaoge Blog: 太简单了！Hermes Agent 本地部署（无需API）接入 Telegram + 微信 ("Too Easy! Deploy Hermes Agent Locally (No API Needed) and Connect It to Telegram + WeChat")

How to Access a Local Ollama API Over LAN on Windows

Sat, 11 Apr 2026 16:43:52 +0800

If you want other devices on the same LAN to access your local Ollama API, follow these steps.

Set the listening host

First, set the OLLAMA_HOST environment variable so that Ollama listens on all network interfaces, then restart Ollama so the change takes effect:

OLLAMA_HOST=0.0.0.0:11434

Open the firewall

In Windows Firewall advanced settings, create an inbound rule that allows the port Ollama listens on (11434 in the setting above):

Press Win + S, search and open “Windows Defender Firewall”.
Click “Advanced settings”.
Select “Inbound Rules” -> “New Rule…”.
Choose “Port”, then click “Next”.
Select TCP, enter the port in “Specific local ports” (11434 here), then click “Next”.
Choose “Allow the connection”, then click “Next”.
In “Profile”, select Domain, Private, and Public, then click “Next”.
Name the rule (for example Ollama11434) and click “Finish”.

Run Ollama

ollama run <model-name>

Access the model through API

curl http://192.168.x.xxx:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "What model is this?"
}'
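The same request can be sent from Python on another device in the LAN. This is a minimal sketch using only the standard library. It sets `"stream": false` so the server returns one JSON object with a `response` field instead of a stream of chunks. The host address used in the usage note is a placeholder, and the function names are assumptions.

```python
import json
import urllib.request

def build_generate_request(host, model, prompt, port=11434):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"http://{host}:{port}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(host, model, prompt, timeout=120):
    """Send the request and return the generated text."""
    request = build_generate_request(host, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

For example, `generate("192.168.1.50", "gemma4", "What model is this?")`, with the address replaced by the Windows machine's actual LAN IP.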

Gemma 4 Local Runtime Guide: From One-Command Start to Dev Integration

Fri, 10 Apr 2026 22:54:17 +0800

If you want to run Gemma 4 locally, you can choose from four practical paths depending on your goal and hardware.

1) Fastest start: Ollama (recommended)

This is the lowest-friction option for quick testing, daily chat, and local API usage.

ollama run gemma4

Highlights:

Works on Windows, macOS, and Linux
Handles hardware acceleration automatically
Offers OpenAI-style local API compatibility

2) GUI workflow: LM Studio / Unsloth Studio

If you prefer a desktop UI instead of terminal commands:

LM Studio: browse and run Gemma 4 quantized variants from Hugging Face (for example 4-bit, 8-bit), with resource visibility.
Unsloth Studio: supports both inference and low-VRAM fine-tuning, often friendlier on 6GB-8GB GPUs.

3) Low-spec and maximum control: llama.cpp

Good for older hardware, CPU-focused setups, or users who want deeper runtime control.

With .gguf model files and quantization, Gemma 4 can be made practical on much smaller hardware budgets.

4) Developer integration: Transformers / vLLM

If you need Gemma 4 inside your own application:

Transformers: straightforward Python integration
vLLM: high-throughput inference for stronger GPU environments

Quick selection

Need	Recommended tools	Hardware bar
I just want it running now	Ollama	Low
I want a ChatGPT-like UI	LM Studio	Medium
My VRAM is limited (6GB-8GB)	Unsloth / llama.cpp	Low
I am building local AI apps	Ollama / Transformers / vLLM	Medium to high
I need fine-tuning	Unsloth Studio	Medium to high

Model size suggestion

Gemma 4 comes in multiple sizes (for example E2B, E4B, 31B).

Start with quantized E2B/E4B on mainstream laptops
Move to larger variants only after your baseline pipeline is stable

What are Ollama cloud models and how do you use them

Thu, 09 Apr 2026 18:42:32 +0800

If you already use Ollama to run local models, cloud models are easy to understand.

There is only one core difference:
local models run on your own machine, while cloud models run on Ollama’s cloud infrastructure and return the result to you.

What are Ollama cloud models

Ollama cloud models keep the Ollama workflow, but move the actual computation from your local machine to the cloud.

The main benefits are:

Less pressure on local hardware
Easier access to larger models that your machine cannot run well
You can keep using the familiar Ollama workflow

How they differ from local models

Item	Local models	Cloud models
Runtime location	Your machine	Cloud
Hardware requirements	High	Low
Latency	Usually lower	Affected by network
Privacy	Stronger	Requests are sent to the cloud

If you care more about privacy, low latency, and offline use, local models are a better fit.
If your hardware is limited but you still want to use larger models, cloud models are more convenient.

How to identify a cloud model

At the moment, Ollama cloud models are typically labeled with a -cloud suffix, for example:

gpt-oss:120b-cloud

The available model list may change over time, so the official Ollama pages should be treated as the source of truth.

How to use them

First, sign in:

ollama signin

After that, run a cloud model directly:

ollama run gpt-oss:120b-cloud

If you are calling it from code, you can also configure an API key:

export OLLAMA_API_KEY=your_api_key

Python example:

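# Requires the ollama Python package (pip install ollama).
import os
from ollama import Client

# Point the client at Ollama's hosted API and authenticate with the API key.
client = Client(
    host="https://ollama.com",
    headers={"Authorization": "Bearer " + os.environ["OLLAMA_API_KEY"]},
)

messages = [
    {"role": "user", "content": "Why is the sky blue?"}
]

# Stream the reply and print each chunk as it arrives.
for part in client.chat("gpt-oss:120b-cloud", messages=messages, stream=True):
    print(part["message"]["content"], end="", flush=True)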

Summary

Ollama cloud models can be summarized in one sentence:

the commands are almost the same, but the model is no longer running on your local machine.

If your computer cannot handle large models well, but you still want to keep the Ollama workflow, cloud models are a very direct option.

How to Download a GGUF Model from Hugging Face and Import It into Ollama

Thu, 09 Apr 2026 11:00:07 +0800

If a model is not available in the official Ollama library, or if you want to use a specific GGUF file from Hugging Face, you can download it manually and then import it into Ollama.

Step 1: Download the GGUF file from Hugging Face

First, find the target model’s GGUF file on Hugging Face. You will usually see multiple quantized versions, such as:

Q4_K_M
Q5_K_M
Q8_0

Which version you choose depends on your VRAM, RAM, and your tradeoff between speed and quality. After downloading, place the .gguf file in a fixed directory so you can reference it from the Modelfile.

Step 2: Write the Modelfile

Create a Modelfile in the same directory as the model file. The most basic version looks like this:

FROM ./model.gguf

If the filename is different, replace it with the actual filename, for example:

FROM ./gemma-3-12b-it-q4_k_m.gguf

If your goal is just to get it running, this single FROM line is usually enough.
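If you import several GGUF files, the Modelfile can be generated instead of typed by hand. This is a minimal sketch: `FROM`, `PARAMETER temperature`, and `SYSTEM` are standard Modelfile instructions, while the function names and the optional arguments are assumptions for illustration.

```python
from pathlib import Path

def render_modelfile(gguf_name, system_prompt=None, temperature=None):
    """Return Modelfile text that points at a GGUF file in the same directory."""
    lines = [f"FROM ./{gguf_name}"]
    if temperature is not None:
        lines.append(f"PARAMETER temperature {temperature}")
    if system_prompt:
        lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"

def write_modelfile(directory, gguf_name, **options):
    """Write the Modelfile next to the GGUF file and return its path."""
    path = Path(directory) / "Modelfile"
    path.write_text(render_modelfile(gguf_name, **options), encoding="utf-8")
    return path
```

With no optional arguments, the output is the single FROM line described above.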

Step 3: Import it into Ollama

Then run:

ollama create myModelName -f Modelfile

myModelName is the local model name you want to use inside Ollama
-f Modelfile tells Ollama to create the model from that file

Once the creation succeeds, the GGUF file becomes a local model that you can call directly.

Step 4: Run the model

After creation, run:

ollama run myModelName

From that point on, it works much like a model pulled with ollama pull.

How to inspect an existing model’s Modelfile

If you are not sure how to write a Modelfile, you can inspect the configuration of an existing model directly:

ollama show --modelfile llama3.2

This command prints the Modelfile for llama3.2, which is useful as a reference for:

How FROM should be written
How the template and system prompt are structured
How parameters are declared

When this approach makes sense

This manual Hugging Face import flow is useful when:

The model you want is not available in Ollama’s official library
You want a specific quantized variant
You have already downloaded the GGUF file manually
You want finer control over how the model is packaged

If Ollama already provides an official version, using pull is usually simpler. But when you need a specific quantization or a custom wrapper, GGUF + Modelfile gives you more flexibility.

Common notes

The path after FROM must match the actual location of the .gguf file.
If the filename contains spaces or special characters, it is better to rename it first.
Different GGUF quantization levels can greatly affect memory use and speed, so successful import does not guarantee smooth runtime performance.
If the model is a chat model, you may still need to adjust the prompt template later for better results.

Conclusion

Downloading a GGUF file from Hugging Face and importing it into Ollama is not complicated. Prepare the model file, write a minimal Modelfile, then run ollama create, and you can bring a third-party GGUF model into your Ollama workflow.

How to Troubleshoot Slow `ollama pull` Model Downloads

Thu, 09 Apr 2026 10:42:39 +0800

ollama pull model_name:tag can be very slow in some regions, and the download process is not always stable.

If your issue looks like repeated interruptions halfway through a large model download, with errors such as TLS handshake timeout or unexpected EOF, the bottleneck may not be registry.ollama.ai itself, but the actual download path after the redirect.

This article walks through a simple troubleshooting approach: first get the real model file URLs, then confirm where the traffic actually ends up, and finally optimize only the domains that matter.

Get the model file download URLs

You can use the following project to extract the manifest and blob download URLs for an Ollama model directly:

https://github.com/Gholamrezadar/ollama-direct-downloader

Using gemma4:latest as an example, you can extract links like the following.

Manifest URL

https://registry.ollama.ai/v2/library/gemma4/manifests/latest

Blob URLs

https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11
https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2
https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3
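The URL pattern visible in these links can be reproduced with a few lines of code. This is a sketch only: the URL layout is taken from the examples above, while the manifest structure (a `config` entry plus a `layers` list, each with a `digest`) is an assumption based on the registry-style format and should be checked against a real manifest.

```python
REGISTRY = "https://registry.ollama.ai/v2/library"

def manifest_url(model, tag="latest"):
    """Build the manifest URL for a library model and tag."""
    return f"{REGISTRY}/{model}/manifests/{tag}"

def blob_url(model, digest):
    """Build the blob URL for one digest such as 'sha256:...'."""
    return f"{REGISTRY}/{model}/blobs/{digest}"

def blob_urls_from_manifest(model, manifest):
    """Collect blob URLs, assuming 'config' and 'layers' entries carry digests."""
    entries = [manifest.get("config", {})] + manifest.get("layers", [])
    return [blob_url(model, entry["digest"]) for entry in entries if "digest" in entry]
```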

If you only want a quick verification, you can also download the manifest and blobs directly with curl:

curl -L "https://registry.ollama.ai/v2/library/gemma4/manifests/latest" -o "latest"
curl -L "https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11" -o "sha256-f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11"
curl -L "https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a" -o "sha256-4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a"
curl -L "https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2" -o "sha256-7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2"

The real download URL after the redirect

If you try downloading one of the blobs with wget, you will notice that the request does not stay on registry.ollama.ai. It gets redirected to a Cloudflare R2 object storage URL:

wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a

There are a few key details in the log:

registry.ollama.ai returns 307 Temporary Redirect
The final download URL lands on *.r2.cloudflarestorage.com
The large file transfer is actually being served by the object storage domain behind the redirect

This matters because if your proxy or routing rules only cover registry.ollama.ai but not *.r2.cloudflarestorage.com, downloads can still be slow or repeatedly interrupted.

Here is one example of an actual redirect log:

wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
--2026-04-09 09:22:04--  https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
Resolving registry.ollama.ai (registry.ollama.ai)... 104.21.75.227, 172.67.182.229, 2606:4700:3034::ac43:b6e5, ...
Connecting to registry.ollama.ai (registry.ollama.ai)|104.21.75.227|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?... [following]
--2026-04-09 09:22:05--  https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?...
Resolving dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com (dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com)... 172.64.66.1, 2606:4700:2ff9::1
Connecting to dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com|172.64.66.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9608338848 (8.9G) [application/octet-stream]

Adjust your network settings

Once you confirm the real download path, the troubleshooting direction becomes much clearer.

If you are using a proxy, split routing, or custom DNS, check these first:

Whether registry.ollama.ai and *.r2.cloudflarestorage.com are using the same stable route
Whether your proxy rules cover only the former but miss the latter
Whether your current outbound path is suitable for sustained multi-GB downloads

The key issue here is not simply whether the official site opens, but whether the redirected object storage path is stable enough for long-running large-file transfers. In many cases, the real bottleneck is the Cloudflare R2 layer rather than the registry domain in front of it.
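The rule-coverage check can be sketched in code. The helper below pulls redirect hostnames out of a wget log and reports which ones are not matched by a list of domain patterns. The glob-style patterns are a stand-in: real proxy tools use their own rule syntax, so treat the matching logic and the function names as assumptions.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def redirect_hosts(wget_log):
    """Collect hostnames from 'Location:' lines in a wget log."""
    hosts = []
    for line in wget_log.splitlines():
        if line.startswith("Location: "):
            hosts.append(urlparse(line.split()[1]).hostname)
    return hosts

def uncovered_hosts(hosts, proxy_patterns):
    """Return hosts that match none of the glob-style patterns."""
    return [
        host for host in hosts
        if not any(fnmatch(host, pattern) for pattern in proxy_patterns)
    ]
```

With only `registry.ollama.ai` in the pattern list, the R2 hostname comes back as uncovered. Adding `*.r2.cloudflarestorage.com` clears it.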

Before-and-after comparison

Here is one real-world example while downloading gemma4:31b-it-q8_0.

Before adjusting the network path, the download was slow and failed midway:

PS C:\Users\knightli> ollama run gemma4:31b-it-q8_0
pulling manifest
pulling a0feadb736f5:  38% ▕██████████████████████                                    ▏  12 GB/ 33 GB  1.2 MB/s   4h40m
Error: max retries exceeded: unexpected EOF

After the adjustment, the same model download became noticeably faster and more stable:

PS C:\Users\knightli> ollama run gemma4:31b-it-q8_0
pulling manifest
pulling a0feadb736f5:  46% ▕████████████████████████████████████████████████████████████████▏ 15 GB/ 33 GB  8.5 MB/s  35m23s

This does not mean every network environment will see the same improvement, but it does support one useful conclusion: the bottleneck may be the actual large-file download path rather than the Ollama client itself.

A more practical troubleshooting order

If you run into the same issue, this order usually works well:

Run ollama pull or ollama run once and confirm the issue is reproducible.
Test a blob URL with wget or curl -L and confirm whether it redirects to *.r2.cloudflarestorage.com.
Adjust your proxy or routing only for the real download domain, then test speed and stability again.

The benefit of this order is that each step validates one clear hypothesis, so you do not have to troubleshoot blindly.

Conclusion

When ollama pull is slow, the problem is often not that registry.ollama.ai is unreachable, but that the Cloudflare R2 path actually serving the large files is unstable.

So instead of retrying over and over, a better approach is to identify the real download path first and optimize the network route where the traffic actually lands.

Connect OpenClaw to Local Gemma 4: Complete Setup Guide

Wed, 08 Apr 2026 18:18:00 +0800

This guide shows how to connect OpenClaw to a local Gemma 4 model through Ollama.

If you have not deployed Gemma 4 locally yet, start here:

How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide

Step 1: Start the Ollama API Service

Start Ollama first:

ollama serve

Then verify the API quickly with:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:12b",
  "prompt": "Hello"
}'

If you get a model response, your local API is ready.

Step 2: Configure OpenClaw to Use Ollama

The OpenClaw config file is usually located at:

~/.openclaw/config.yaml

Edit config.yaml and add a local model entry under models:

models:
  # Your existing model config...

  gemma4-local:
    provider: ollama
    base_url: http://localhost:11434
    model: gemma4:12b
    timeout: 120s

Step 3: Set Default Model (Optional)

If you want Gemma 4 as the default model:

default_model: gemma4-local

Step 4: Restart and Verify OpenClaw

Restart OpenClaw:

openclaw restart

List available models:

openclaw models list

Run a quick chat test:

openclaw chat --model gemma4-local "Hello"

If the chat returns normally, OpenClaw is successfully connected to local Gemma 4.

Common Troubleshooting

connection refused: make sure ollama serve is running.
Model not found: check model name with ollama list (for example gemma4:12b).
Timeout: increase timeout and test a smaller model first.

How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide

Wed, 08 Apr 2026 18:06:00 +0800

If you want to run Gemma 4 locally on a laptop, Ollama is one of the fastest and simplest options. Even without complex setup, you can usually get it running in about five minutes.

Step 1: Install Ollama

Open https://ollama.com and download the installer for your OS.
Complete installation based on your system:

macOS: drag it to Applications.
Windows: run the .exe installer.
Linux: use the install script from the official site.

After installation, Ollama runs as a background service. Beyond initial setup, daily usage is mostly simple commands.

Step 2: Download a Gemma 4 Model

Open a terminal and run:

ollama pull gemma4:4b

If your machine is stronger, you can switch to 12b or 27b. Once downloaded, the model is stored locally.

Check downloaded models with:

ollama list

Step 3: Run the Model

ollama run gemma4:4b

This opens an interactive chat session in your terminal. Type your prompt and press Enter. To exit, type:

/bye

If you prefer a browser chat UI, you can pair it with Open WebUI. It wraps Ollama with a local web interface and is usually quick to set up with Docker.

Laptop Performance Tips

Apple Silicon (M2/M3/M4): Metal acceleration is enabled by default, and 12B can run well.
NVIDIA GPU: CUDA is used automatically when a compatible GPU is detected. Keep drivers updated.
CPU-only inference: works, but larger models will be slower. For most CPU-only setups, 4B is the practical default.
Free memory before loading large models: as a rough rule, each billion parameters needs about 0.5GB of RAM at 4-bit quantization and about 1GB at 8-bit, plus extra for context.
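The memory rule of thumb is simple arithmetic: parameters times bits per weight, divided by 8. This is a minimal sketch. The 1.2 overhead factor for context and runtime buffers is an assumption, and nominal bit widths slightly understate real file sizes for mixed quantization formats.

```python
def estimate_weight_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8

def fits(params_billion, bits_per_weight, available_gb, overhead=1.2):
    """Rough check with an assumed overhead factor for context and buffers."""
    return estimate_weight_gb(params_billion, bits_per_weight) * overhead <= available_gb
```

For example, a 4B model at 4 bits needs about 2 GB for weights and a 12B model at 8 bits about 12 GB.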

How to Choose a Model

Gemma 4 1B: good for lightweight Q&A, simple summarization, and quick lookups; limited on complex reasoning.
Gemma 4 4B: best for most daily tasks (writing help, coding help, document summarization) with strong speed/quality balance.
Gemma 4 12B: better for longer context and more complex tasks, especially coding and reasoning.
Gemma 4 27B: better for high-demand workloads and closer to frontier-cloud quality, but needs significantly stronger hardware.

How to Check Whether an Ollama Model Is Loaded on GPU

Mon, 06 Apr 2026 10:15:18 +0800

If you want to confirm whether an Ollama model is actually running on GPU, the most direct way is checking processor allocation for currently loaded models.

Command

ollama ps

Example Output

NAME        ID            SIZE    PROCESSOR   UNTIL
llama3:70b  bcfb190ca3a7  42 GB   100% GPU    4 minutes from now

How to Read the `PROCESSOR` Column

100% GPU: The model is fully loaded into GPU VRAM.
100% CPU: The model is fully loaded in system memory (no GPU inference).
48%/52% CPU/GPU: The model is split between system memory and GPU VRAM.
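The three forms above can be told apart mechanically. This is a minimal sketch that converts a PROCESSOR cell into a (CPU percent, GPU percent) pair. It only handles the formats shown in this article, and the function name is an assumption.

```python
import re

def parse_processor(value):
    """Return (cpu_percent, gpu_percent) from an `ollama ps` PROCESSOR cell."""
    value = value.strip()
    split = re.fullmatch(r"(\d+)%/(\d+)% CPU/GPU", value)
    if split:
        return int(split.group(1)), int(split.group(2))
    single = re.fullmatch(r"(\d+)% (CPU|GPU)", value)
    if single:
        percent = int(single.group(1))
        return (percent, 0) if single.group(2) == "CPU" else (0, percent)
    raise ValueError(f"unrecognized PROCESSOR value: {value!r}")
```

A non-zero CPU share on a machine where you expected full GPU loading is the signal to check VRAM headroom or model size.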

Practical Tips

If you expect GPU usage but see 100% CPU, first check GPU drivers, CUDA/ROCm environment, and Ollama runtime settings.
With larger models and limited VRAM, CPU/GPU mixed loading is common.
For performance troubleshooting, run ollama ps before checking speed metrics to locate bottlenecks faster.

Summary

ollama ps is the first step to verify real GPU usage. Focus on the PROCESSOR column to quickly identify where the model is loaded and decide your next optimization action.

Ollama Default Model Storage Path and Migration Guide (Avoid Filling Up C Drive)

Mon, 06 Apr 2026 09:38:00 +0800

When running local LLMs, the system drive is often the first thing to run out of space. Ollama stores models in user or system directories by default, so your C drive can fill up quickly without path planning.

Common Default Ollama Model Directories

Windows: C:\Users\<username>\.ollama\models
macOS: ~/.ollama/models
Linux: /usr/share/ollama/.ollama/models (may vary by installation method)

Windows: Move the Model Directory to a Non-System Drive

A practical choice is moving model storage to a path like D:\OllamaModels. The key is setting the OLLAMA_MODELS system environment variable.

1. Create the Target Directory

For example, create: D:\OllamaModels

2. Configure the System Environment Variable

Variable name: OLLAMA_MODELS
Variable value: D:\OllamaModels

You can set it in “System Properties -> Advanced -> Environment Variables”, or with an admin PowerShell command:

[System.Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\OllamaModels", "Machine")

3. Restart Ollama (or Reboot the System)

After setting the variable, restart the Ollama service/app. If you’re unsure whether it has taken effect, rebooting the PC is the most reliable option.

4. Verify the New Path Is Active

Pull any model and check whether new files appear under D:\OllamaModels.
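The check can also be scripted. This is a minimal sketch that resolves the directory from OLLAMA_MODELS, falling back to the per-user default, and measures how much data it holds. It reads the environment of the Python process, which may differ from the environment the Ollama service actually runs with. The function names are assumptions.

```python
import os
from pathlib import Path

def models_dir(env=None):
    """Return OLLAMA_MODELS if set, else the per-user default ~/.ollama/models."""
    env = os.environ if env is None else env
    custom = env.get("OLLAMA_MODELS")
    return Path(custom) if custom else Path.home() / ".ollama" / "models"

def dir_size_gb(path):
    """Sum file sizes under path in GB (0.0 if the directory does not exist)."""
    root = Path(path)
    if not root.is_dir():
        return 0.0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file()) / 1e9
```

After pulling a model, `dir_size_gb(models_dir())` should grow for the new location while the old directory stays the same size.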

5. Clean Up the Old Directory (After Confirmation)

Once models work correctly in the new location, remove old files to reclaim C drive space.

FAQ

Still Writing to C Drive After Configuration

Confirm the variable is a system variable, not a temporary session variable.
Confirm the Ollama process was restarted.
Verify the variable name is exactly OLLAMA_MODELS.

Do I Need to Migrate Existing Model Files

If you want to avoid re-downloading, stop Ollama, copy existing model files to the new directory, then restart Ollama and verify.

Completely Uninstall Ollama on Linux (Including Leftover Cleanup)

Mon, 06 Apr 2026 09:16:29 +0800

If you need to remove Ollama completely from Linux, follow the steps below in order. This guide cleans up the service, executable, model directory, and the ollama user/group.

Before You Uninstall

The commands below will delete local Ollama model files (usually in /usr/share/ollama). Back up first if needed.
These commands use sudo by default. Make sure your account has administrator privileges.

1. Stop and Remove the systemd Service

sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm -f /etc/systemd/system/ollama.service
sudo systemctl daemon-reload

2. Remove the Ollama Binary

OLLAMA_BIN="$(command -v ollama)"
if [ -n "$OLLAMA_BIN" ]; then
  sudo rm -f "$OLLAMA_BIN"
fi

3. Remove Ollama Library Directories (If Present)

If your installation wrote Ollama files into a lib directory, clean them up with:

for d in /usr/local/lib/ollama /usr/lib/ollama /lib/ollama; do
  [ -d "$d" ] && sudo rm -rf "$d"
done

4. Remove Model and Data Directory

sudo rm -rf /usr/share/ollama

5. Remove System User and Group (If Present)

id -u ollama >/dev/null 2>&1 && sudo userdel ollama
getent group ollama >/dev/null 2>&1 && sudo groupdel ollama

6. Verify Uninstall Completion

command -v ollama || echo "ollama binary not found"
systemctl status ollama || true

If ollama is no longer found in the checks above, the uninstall is complete.
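For a fuller check, the paths touched in the steps above can be scanned in one pass. This is a minimal sketch. The path list mirrors the locations used in this guide, and the function name and injectable `exists`/`which` parameters are assumptions added to keep the helper testable.

```python
import os
import shutil

# Locations removed in steps 1 to 4 of this guide.
LEFTOVER_PATHS = [
    "/etc/systemd/system/ollama.service",
    "/usr/local/lib/ollama",
    "/usr/lib/ollama",
    "/lib/ollama",
    "/usr/share/ollama",
]

def remaining_leftovers(paths=LEFTOVER_PATHS, exists=os.path.exists, which=shutil.which):
    """Return paths that still exist, plus the ollama binary if still on PATH."""
    found = [path for path in paths if exists(path)]
    binary = which("ollama")
    if binary:
        found.append(binary)
    return found
```

An empty list from `remaining_leftovers()` means none of the locations covered by this guide remain.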

LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2

Sun, 05 Apr 2026 22:09:11 +0800

The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.

What Is Quantization

Quantization means compressing model parameters from higher-precision formats (such as FP16) into lower-bit formats (such as Q8 and Q4).

A simple analogy:

Original model: like a high-quality photo, clear but large.
Quantized model: like a compressed photo, slightly less detail but lighter and faster.

Common Quantization Formats

Quantization	Precision / Bit Width	Size	Quality Loss	Recommended Use
FP16	16-bit float	Largest	Almost none	Research, evaluation, max quality
Q8_0	8-bit integer	Larger	Almost none	High-end PCs, quality + performance
Q5_K_M	5-bit mixed	Medium	Slight	Daily driver, balanced choice
Q4_K_M	4-bit mixed	Smaller	Acceptable	General default, strong value
Q3_K_M	3-bit mixed	Very small	Noticeable	Low-spec devices, run-first
Q2_K	2-bit mixed	Smallest	Significant	Extreme resource limits, fallback

Quantization Naming Rules

Take gemma-4:4b-q4_k_m as an example:

gemma-4:4b: model name and parameter scale.
q4: 4-bit quantization.
k: K-quants (an improved quantization method).
m: medium level (common options also include s/small and l/large).
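The naming rule can be expressed as a small parser. This is a minimal sketch that only covers the suffix styles shown in this article (for example q4_k_m, q2_k, q8_0). The function name and the returned keys are assumptions.

```python
import re

SIZE_NAMES = {"s": "small", "m": "medium", "l": "large"}

def parse_quant(tag):
    """Split a suffix such as 'q4_k_m' into bit width, K-quant flag, and level."""
    parts = tag.lower().split("_")
    if not re.fullmatch(r"q\d+", parts[0]):
        raise ValueError(f"not a quantization suffix: {tag!r}")
    rest = parts[1:]
    return {
        "bits": int(parts[0][1:]),
        "k_quant": "k" in rest,
        "level": next((SIZE_NAMES[p] for p in rest if p in SIZE_NAMES), None),
    }
```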

Quick Selection by VRAM

RAM / VRAM	Recommended Quantization
4 GB	Q3_K_M / Q2_K
8 GB	Q4_K_M
16 GB	Q5_K_M / Q8_0
32 GB+	FP16 / Q8_0

Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.
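The table above can be encoded as a lookup. This is a minimal sketch: the thresholds are copied from the table, values between rows fall back to the lower row, and the below-4 GB fallback is an assumption.

```python
# (minimum memory in GB, recommended formats), highest tier first.
RECOMMENDATIONS = [
    (32, ["FP16", "Q8_0"]),
    (16, ["Q5_K_M", "Q8_0"]),
    (8, ["Q4_K_M"]),
    (4, ["Q3_K_M", "Q2_K"]),
]

def recommend_quant(memory_gb):
    """Return the formats for the highest tier the given memory reaches."""
    for threshold, formats in RECOMMENDATIONS:
        if memory_gb >= threshold:
            return formats
    return ["Q2_K"]  # assumption: below 4 GB, use the smallest format
```

For instance, `recommend_quant(12)` returns the 8 GB row, because 12 GB does not reach the 16 GB tier.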

Practical Tips

Start with Q4_K_M by default and test real tasks first.
If response quality is not good enough, move up to Q5_K_M or Q8_0.
If VRAM or speed is the main bottleneck, move down to Q3_K_M.
Use the same test set every time you switch quantization formats.

Conclusion

Quality first: FP16 or Q8_0.
Balance first: Q5_K_M.
General default: Q4_K_M.
Low-spec fallback: Q3_K_M or Q2_K.

The key is not “bigger is always better”, but “the most stable and usable result under your hardware limits.”

Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B

Sun, 05 Apr 2026 08:30:00 +0800

Gemma 4 focuses on multimodality and local offline inference, with a full range from lightweight to high-performance models. For most local deployment users, the key is not choosing the largest model, but choosing the one that best matches hardware and task needs.

Gemma 4 Model Comparison

The table below is for quick model selection. Actual performance and resource usage should be validated in your own environment.

Model	Parameter Size	Positioning	Key Strengths	Main Limitations	Recommended Scenarios
Gemma 4 2B	2B	Ultra-lightweight	Low latency, low resource usage, lowest deployment barrier	Limited performance on complex reasoning and long task chains	Mobile, IoT, lightweight Q&A, simple automation
Gemma 4 4B	4B	Lightweight enhanced	Stronger understanding and generation than 2B, still easy to deploy locally	Limited ceiling for heavy coding and complex agent tasks	Local assistant, basic document work, multilingual daily tasks
Gemma 4 26B	26B	High-performance (MoE)	Better reasoning and tool use, suitable for production workflows	Significantly higher VRAM requirement and hardware threshold	Coding assistant, complex workflows, enterprise internal agents
Gemma 4 31B	31B	High-performance (dense)	Best overall capability and stronger stability on complex tasks	Highest resource cost and tuning complexity	Advanced reasoning, complex coding tasks, heavy automation

How to Choose: Start from Hardware and Tasks

If your top concern is whether it runs smoothly, use this guideline:

8GB VRAM: prioritize 2B/4B.
12GB VRAM: prioritize 4B or quantized variants of larger models.
24GB VRAM: focus on 26B, and evaluate quantized 31B based on workload.
Higher VRAM or multi-GPU: consider high-precision 31B setups.

Prioritize stability and inference speed first, then scale up model size gradually.

Four Typical Use Cases

1) Local General Assistant

Preferred model: 4B
Why: strong balance between cost and quality, suitable for long-running local use.

2) Coding and Automation

Preferred model: 26B
Why: more stable in multi-step tasks, tool calls, and script generation.

3) Advanced Reasoning and Complex Agents

Preferred model: 31B
Why: stronger robustness under complex context.

4) Edge Devices and Lightweight Offline Use

Preferred model: 2B
Why: easiest to deploy on resource-constrained devices.

Deployment Suggestions (Ollama)

A practical approach is to iterate in small steps:

Start with 4B to establish a baseline (latency, memory, quality).
Build a fixed test set from real tasks (for example, 20 common questions + 10 automation tasks).
Compare 26B/31B against that set for accuracy, latency, and VRAM cost.
Upgrade only when the gain is clear.

This avoids jumping to a large model too early and running into lag, low throughput, and maintenance overhead.
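The steps above can be sketched as a small comparison harness. Everything here is illustrative: `evaluate` and `should_upgrade` are hypothetical helper names, the 10-point accuracy threshold is an arbitrary assumption, and the two stand-in "models" are plain functions so the sketch runs without a model server. In practice, `generate` would wrap a call to your local model, and VRAM cost would need to be measured separately.

```python
import time


def evaluate(generate, test_set):
    """Run every (prompt, check) pair; return accuracy and mean latency."""
    passed = 0
    total_seconds = 0.0
    for prompt, check in test_set:
        start = time.perf_counter()
        answer = generate(prompt)
        total_seconds += time.perf_counter() - start
        if check(answer):
            passed += 1
    return {
        "accuracy": passed / len(test_set),
        "mean_latency_s": total_seconds / len(test_set),
    }


def should_upgrade(baseline, candidate, min_gain=0.10):
    """Upgrade only when the accuracy gain is clear (threshold is an assumption)."""
    return candidate["accuracy"] - baseline["accuracy"] >= min_gain


# Fixed test set: each entry is a prompt plus a function that checks the answer.
test_set = [
    ("What is 2+2?", lambda a: "4" in a),
    ("What is the capital of France?", lambda a: "Paris" in a),
]


# Stand-ins for a smaller and a larger model (not real model calls).
def small_model(prompt):
    return "4" if "2+2" in prompt else "I am not sure"


def large_model(prompt):
    return "4" if "2+2" in prompt else "Paris"


baseline = evaluate(small_model, test_set)
candidate = evaluate(large_model, test_set)
print("baseline:", baseline)
print("candidate:", candidate)
print("upgrade:", should_upgrade(baseline, candidate))
```

Keeping the test set fixed is what makes the comparison meaningful. If the prompts change between runs, differences in accuracy or latency cannot be attributed to the model.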

Conclusion

The real value of Gemma 4 is not just larger parameter counts, but a practical model ladder from lightweight to high-performance:

For low-cost fast rollout: start with 2B/4B.
For production-grade local AI workflows: prioritize 26B.
For advanced reasoning and heavy automation: move to 31B.

In most cases, the best Gemma 4 choice is not the biggest model, but the one with the best fit for your hardware and task goals.

Claude Code + Ollama Local Deployment Guide: Build a Free AI Coding Assistant with CC Switch

What This Setup Solves

Basic Preparation

Key CC Switch Configuration

Where Claude Code Is Strong

What Role Ollama Plays Here

Where The Experience Has Boundaries

Multimodal Compatibility Is Still Unstable

Who Should Try It

Usage Advice

Summary

Local LLM Models Recommended for an RTX 3060 GPU

Start With the VRAM Limit

Recommendation 1: Qwen3 8B

Recommendation 2: Llama 3.1 8B Instruct

Recommendation 3: Gemma 3 12B

Recommendation 4: DeepSeek R1 Distill Qwen 8B

Recommendation 5: Phi / MiniCPM / Smaller Models

Which Quantization to Use

Which Tool to Use

Do Not Set Context Too High

Choose by Use Case

Reasonable Expectations

Summary

References

How to Fix Ollama Using CPU Instead of GPU

1. First, confirm whether Ollama is really not using the GPU

2. Rule out the most common misunderstanding first: the model does not fit into VRAM

3. Check whether the GPU driver and the lower-level runtime are actually working

NVIDIA

AMD / ROCm

4. Restart the Ollama service, not just your terminal

5. Check whether the environment variables are really reaching the service

6. On AMD platforms, focus on ROCm compatibility

7. In Docker, WSL, or remote environments, also check device mapping

8. Check logs last, but check them for the right reason

Troubleshooting Order

Conclusion

Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings

Official Behavior: Single GPU First, Multi-GPU When Needed

Multi-GPU Is Not Simple Compute Stacking

SLI or NVLink Is Not Required

Limit Which NVIDIA GPUs Ollama Uses

AMD and Vulkan Device Selection

Exposing Multiple GPUs in Docker

What Is OLLAMA_SCHED_SPREAD

How to Check Whether Multiple GPUs Are Being Used

Common Misunderstandings

Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU

Misunderstanding 2: Different GPU Models Cannot Be Mixed

Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU

Misunderstanding 4: NVLink / SLI Is Required

Misunderstanding 5: Adding a GPU Does Not Require Restarting Services

GPU Selection Suggestions

Summary

References

Deploy Hermes Agent Locally on Windows with WSL + Ollama and Connect Telegram

Overall flow

1. Install WSL and Ubuntu

2. Update Ubuntu and install the base environment

Install Python

Install zstd

Install Node.js

Install Git

3. Install Ollama and pull Gemma 4

4. Install and configure Hermes Agent

Common Hermes Agent commands

Start

Re-enter setup

Configure the chat gateway

Update

Basic Telegram connection steps

Who this setup fits

A few things to keep in mind

Conclusion

Original reference

How to Access a Local Ollama API Over LAN on Windows

Set the listening host

Open the firewall

What Is `OLLAMA_SCHED_SPREAD`

How to Read the `PROCESSOR` Column