Gemma 4 on KnightLi Blog

Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models

Fri, 01 May 2026 11:42:34 +0800

Gemma 4 currently has four main sizes for local deployment: E2B, E4B, 26B A4B, and 31B. E2B and E4B target lightweight and edge devices, 26B A4B uses an MoE architecture, and 31B is the larger dense model.

The easiest mistake in local inference is mixing up two numbers:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.

The tables below estimate VRAM requirements based on GGUF file size. The default assumption is local text inference with llama.cpp, LM Studio, Ollama, or similar runtimes, using short to medium context. If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.

Quick Summary

VRAM	Good Fit	Avoid
4GB	Low-bit E2B quantizations	E4B and above
6GB	E2B Q4/Q5, low-bit E4B	26B, 31B
8GB	E2B Q8, E4B Q4/Q5	26B Q4, 31B Q4
12GB	E4B Q8, low-quality 2-bit/3-bit 26B or 31B tests	26B Q4 with long context, 31B Q4
16GB	Low-bit 26B, low-bit 31B	31B Q4 with long context, 26B Q5 and above
24GB	26B Q4/Q5, 31B Q4	31B Q8, BF16
32GB	26B Q6/Q8, 31B Q5/Q6	BF16
48GB	31B Q8 more comfortably, 26B Q8 with longer context	31B BF16
80GB+	26B/31B BF16	Single consumer GPU deployment

If you just want something usable locally, start with E4B Q4_K_M or E2B Q4_K_M. With 24GB VRAM, 26B A4B Q4_K_M and 31B Q4_K_M start to become realistic choices.

Gemma 4 E2B VRAM Table

E2B is the lightest version, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing. It is easy to run, but complex reasoning, coding, and long tasks are limited.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	2.29GB	4GB	6GB	Extreme low-VRAM tests
`UD-Q2_K_XL`	2.40GB	4GB	6GB	Low-VRAM usability
`Q3_K_M`	2.54GB	4GB	6GB	Lightweight chat and summaries
`IQ4_XS`	2.98GB	6GB	8GB	Balance of quality and size
`Q4_K_M`	3.11GB	6GB	8GB	Recommended E2B default
`Q5_K_M`	3.36GB	6GB	8GB	Slightly steadier than Q4
`Q6_K`	4.50GB	8GB	10GB	Higher-quality small model
`Q8_0`	5.05GB	8GB	10GB	Near-original precision for lightweight deployment
`BF16`	9.31GB	12GB	16GB	Debugging, comparison, research

For daily use, E2B Q4_K_M is already enough. With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.

Gemma 4 E4B VRAM Table

E4B is the more practical lightweight model. Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	3.53GB	6GB	8GB	Low-VRAM tests
`UD-Q2_K_XL`	3.74GB	6GB	8GB	Low-VRAM usability
`Q3_K_M`	4.06GB	6GB	10GB	Lightweight local assistant
`IQ4_XS`	4.72GB	8GB	12GB	Balance of quality and speed
`Q4_K_M`	4.98GB	8GB	12GB	Recommended E4B default
`Q5_K_M`	5.48GB	8GB	12GB	Steadier everyday use
`Q6_K`	7.07GB	10GB	16GB	Quality first
`Q8_0`	8.19GB	12GB	16GB	Near-original precision
`BF16`	15.05GB	20GB	24GB	Research, evaluation, precision comparison

If your GPU has 8GB VRAM, E4B Q4_K_M is a realistic starting point. With 12GB or 16GB VRAM, E4B Q8_0 is also worth considering.

Gemma 4 26B A4B VRAM Table

26B A4B is the MoE version. It has a larger total parameter count, but activates only part of the experts during inference. It is better suited to more complex Q&A, coding, tool use, and agent workflows.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	9.97GB	14GB	16GB	Extreme 16GB GPU tests
`UD-Q2_K_XL`	10.55GB	14GB	16GB	Running 26B with low VRAM
`UD-Q3_K_M`	12.53GB	16GB	20GB	Better quality while still VRAM-conscious
`UD-IQ4_XS`	13.42GB	16GB	24GB	Balance of quality and size
`UD-Q4_K_M`	16.87GB	20GB	24GB	Recommended 26B default
`UD-Q5_K_M`	21.15GB	24GB	32GB	Higher-quality quantization
`UD-Q6_K`	23.17GB	28GB	32GB	Quality first
`Q8_0`	26.86GB	32GB	40GB	Near-original precision
`BF16`	50.51GB	64GB	80GB	Not realistic for most single consumer GPUs

24GB VRAM is the comfortable dividing line for 26B A4B. A 16GB GPU can try low-bit versions, but context length, concurrency, and multimodal input should be kept modest.

Gemma 4 31B VRAM Table

31B is the larger dense model. Its strength is stronger overall capability, but its VRAM pressure is more direct than 26B A4B.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	8.53GB	12GB	16GB	Extreme low-VRAM tests with clear quality loss
`UD-IQ2_M`	10.75GB	14GB	18GB	Low-VRAM tests
`UD-Q2_K_XL`	11.77GB	16GB	20GB	16GB GPU experiments
`Q3_K_S`	13.21GB	16GB	24GB	More VRAM-efficient 3-bit
`Q3_K_M`	14.74GB	20GB	24GB	Common 3-bit compromise
`IQ4_XS`	16.37GB	20GB	24GB	Near-Q4 compromise
`Q4_K_M`	18.32GB	24GB	32GB	Recommended 31B default
`Q5_K_M`	21.66GB	28GB	32GB	Higher-quality quantization
`Q6_K`	25.20GB	32GB	40GB	Quality first
`Q8_0`	32.64GB	40GB	48GB	Near-original precision
`BF16`	61.41GB	80GB	96GB	Server or large-VRAM workstation

Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point. Q4_K_M is the balanced choice, while Q5_K_M and above make more sense with 32GB+ VRAM.

Why Actual Usage Is Higher Than File Size

The GGUF file size is only the weight size. Runtime usage also includes:

KV cache: longer context means higher memory use.
Batch size and concurrency: processing more tokens or more users increases VRAM.
Multimodal components: image, audio, or video input often requires mmproj or extra modules.
Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.
KV cache quantization: q8_0, q4_0, and similar modes can save VRAM, but may affect detail.

So the “minimum VRAM” column should be read as the threshold for startup and short-context inference. For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.

How to Choose

If you just want to try Gemma 4 locally:

4GB to 6GB VRAM: choose E2B Q3_K_M or E2B Q4_K_M.
8GB VRAM: prefer E4B Q4_K_M; E2B Q8_0 is also fine.
12GB VRAM: choose E4B Q8_0, or try low-bit 26B/31B variants.
16GB VRAM: try 26B A4B UD-Q3_K_M or 31B Q3_K_S, but do not expect long context to feel comfortable.
24GB VRAM: focus on 26B A4B UD-Q4_K_M and 31B Q4_K_M.
32GB and above: consider Q5_K_M, Q6_K, or longer context.

Most users do not need BF16. Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.

References

Gemma 4 E4B Uncensored vs Official: What Actually Changes

Sat, 18 Apr 2026 10:20:00 +0800

If you see a model like HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive, the most important point is this: it is not a new Google base model. It is a derivative release built on top of the official google/gemma-4-E4B-it, but with alignment behavior intentionally pushed toward fewer refusals.

That means the real difference is usually behavioral policy and response style, not a brand-new architecture.

What the derivative model explicitly claims

According to its Hugging Face model card, the HauhauCS release says:

it is based on google/gemma-4-E4B-it
it makes “no changes to datasets or capabilities”
it is “just without the refusals”
the Aggressive variant is “fully unlocked and won’t refuse prompts”

Those are the creator’s claims, not an independent benchmark. Still, they tell you the intended positioning very clearly: this is an unofficial derivative optimized to reduce safety refusals.

Official model vs “uncensored” derivative

Dimension	Official `google/gemma-4-E4B-it`	`Gemma-4-E4B-Uncensored-HauhauCS-Aggressive`
Source	Official Google release	Third-party derivative on Hugging Face
Base architecture	Gemma 4 E4B instruction-tuned model	Same base family, explicitly described as based on `google/gemma-4-E4B-it`
Main goal	General-purpose helpful assistant with responsible-use framing	Reduce refusals and keep answering even when the official model might decline
Safety posture	Aligned with Gemma family safety docs and prohibited-use policy	Intentionally weakened refusal behavior
Response style	More likely to refuse, redirect, or soften certain requests	More likely to answer directly, including prompts the official model may block
Risk profile	Lower misuse risk by default, but still not risk-free	Higher misuse risk, higher chance of unsafe or non-compliant output
Predictability in products	Easier to justify in normal apps and enterprise environments	Harder to justify in public-facing, business, or policy-sensitive deployments
Compliance burden	Still requires application-level safeguards	Requires even stronger downstream safeguards because the model itself is less restrictive

The core difference is alignment, not raw capability

Many users mistakenly treat “uncensored” as if it means “smarter.” That is usually the wrong frame.

For a derivative like this, what changes first is:

how often the model refuses
how strongly it follows harmful or policy-sensitive instructions
how much filtering remains in its final answers

What does not automatically change:

the underlying Gemma 4 family architecture
context window class
multimodal support class
general reasoning ceiling

In other words, an uncensored derivative is often better described as a different behavioral tuning of the same model family, not a higher-tier model.

Why the official version behaves differently

Google’s official Gemma materials frame the family as being built for responsible AI development. The Gemma model card highlights misuse, harmful content, privacy, and bias risks, and Google’s Gemma Prohibited Use Policy explicitly forbids using Gemma or model derivatives to:

facilitate dangerous, illegal, or malicious activities
generate harmful or deceptive content
override or circumvent safety filters

So the official model is not just “more conservative” by accident. Its surrounding policy and intended deployment posture are deliberately different.

When the official model is the better choice

Use the official google/gemma-4-E4B-it path if you care about:

product deployment
enterprise or team use
lower legal and policy exposure
fewer obviously unsafe outputs
easier documentation and review

For most normal applications, this is the safer default.

When people choose the uncensored derivative

Users usually choose an uncensored derivative for:

local private experimentation
testing where the official model refuses too early
roleplay or open-ended creative prompting
comparing alignment behavior across variants

But this comes with a real trade-off: you are moving more safety responsibility from the model provider to yourself.

Practical conclusion

The difference between a so-called “jailbroken” Gemma 4 E4B and the ordinary official version is mostly this:

the official version is optimized for usable capability with guardrails
the uncensored derivative is optimized for fewer refusals with weaker guardrails

That does not automatically make the uncensored model stronger. It mainly makes it more permissive.

If your goal is stable, explainable, and lower-risk deployment, use the official model first. If your goal is local experimentation and you understand the compliance and safety trade-offs, then an uncensored derivative is a behavior variant worth testing separately, not a drop-in “better” replacement.

Sources

Hugging Face: HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
Hugging Face: google/gemma-4-E4B-it
Google AI for Developers: Gemma Prohibited Use Policy
Google AI for Developers: Gemma model card

Deploy Hermes Agent Locally on Windows with WSL + Ollama and Connect Telegram

Sat, 18 Apr 2026 00:48:22 +0800

If you want to run Hermes Agent on Windows with as little friction as possible, a practical path is:

keep Windows as the host system
run Ubuntu inside WSL
use Ollama to serve the local model
let Hermes Agent connect directly to the local Ollama endpoint

This approach keeps the environment relatively clean, lets you run most commands in a Linux-style workflow, and avoids preparing a separate Linux machine.

Overall flow

You can split the setup into 4 steps:

Enable WSL and install Ubuntu
Install Python, Node.js, Git, and other basics inside Ubuntu
Install Ollama and pull a local model
Install Hermes Agent, then connect Telegram

If your goal is simply to get Hermes Agent running first, by the end of step 3 you are already close.

1. Install WSL and Ubuntu

Run this in PowerShell with administrator privileges:

`1`	`wsl --install`

After the installation finishes, restart the PC, then continue with Ubuntu:

`1`	`wsl --install -d Ubuntu`

After that, open Ubuntu in WSL. Most of the remaining commands are run there.

2. Update Ubuntu and install the base environment

Update the system first:

1
2

sudo apt update
sudo apt upgrade -y

Then install Python, extraction tools, Node.js, and Git.

Install Python

`1`	`sudo apt install python3-pip python3-venv -y`

Install zstd

`1`	`sudo apt install -y zstd`

Install Node.js

1
2

curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs

Install Git

1
2

sudo apt update
sudo apt install -y git

You can quickly verify the installation with:

1
2
3

node -v
npm -v
git --version

3. Install Ollama and pull Gemma 4

Install Ollama:

`1`	`curl -fsSL https://ollama.com/install.sh \| sh`

If you want a local model for Hermes Agent, starting with Gemma 4 is reasonable.

For example:

`1`	`ollama run gemma4:e4b`

If your machine is weaker, you can also try:

`1`	`ollama run gemma4:e2b`

Larger variants include:

1
2

ollama run gemma4:26b
ollama run gemma4:31b

For most normal Windows + WSL setups, gemma4:e4b is usually the more practical starting point.

4. Install and configure Hermes Agent

Install it with:

`1`	`curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh \| bash`

After installation, point it to the local Ollama endpoint:

`1`	`http://127.0.0.1:11434`

Use the local model name you actually installed, for example:

`1`	`gemma4:e4b`

If the installer asks you to refresh the shell, run:

`1`	`source ~/.bashrc`

Common Hermes Agent commands

These are the commands you will use most often:

Start

hermes

Re-enter setup

`1`	`hermes setup`

Configure the chat gateway

`1`	`hermes setup gateway`

Update

`1`	`hermes update`

Basic Telegram connection steps

If you want Hermes Agent to send and receive messages through Telegram, the core step is still:

`1`	`hermes setup gateway`

Then prepare the two Telegram-side items you need:

create a bot with BotFather
get your User ID with @userinfobot

Once you have those basics, continue filling them into the Hermes Agent gateway setup.

Who this setup fits

This workflow is a good fit if:

Windows is your main desktop system
you do not want to maintain a separate Linux host
you want to get a local Agent running first, then expand to chat platforms
you prefer local models instead of depending on cloud APIs

If you mainly want to experience a local Agent rather than build a full production deployment immediately, this path is already practical enough.

A few things to keep in mind

WSL is still a compatibility layer, so in extreme cases it may not behave exactly like native Linux
whether a large model runs smoothly still depends on your RAM, VRAM, and CPU / GPU
gemma4:e4b is a realistic starting point, but actual experience still depends on the machine
Hermes Agent platform integration is an extension step; getting the local model path working first, then adding Telegram, is usually more stable

Conclusion

If you want to deploy Hermes Agent locally on Windows with as little friction as possible, the smoother order is:

WSL -> Ubuntu -> Ollama -> Gemma 4 -> Hermes Agent -> Telegram

Get the local model running first, then add the gateway integration. That usually gives you a much higher success rate. For most users, this is easier to troubleshoot than piling on every component at the beginning, and it also leaves room for later expansion.

Original reference

This post is rewritten and organized based on:

Xchaoge Blog: 太简单了！Hermes Agent 本地部署（无需API）接入 Telegram + 微信

How to Fix SSL Certificate Verification Failed When llama-cli Downloads from Hugging Face on Windows

Fri, 17 Apr 2026 14:20:29 +0800

If you run this command on Windows:

`1`	`llama-cli -hf unsloth/gemma-4-E4B-it-GGUF`

and see an error like this:

1
2

get_repo_commit: error: HTTPLIB failed: SSL server verification failed
error: failed to download model from Hugging Face

the problem is usually not CUDA or llama.cpp itself. More often, the program cannot correctly access the system certificate chain in the current environment, so HTTPS verification fails.

From the log, ggml-rpc.dll and ggml-cpu-alderlake.dll were loaded successfully, which means the runtime environment is mostly fine. The issue is mainly in the model download step.

The easiest workaround: download the model manually

If you just want to get it running quickly, downloading the model manually is usually the most stable option.

Open the matching Hugging Face repository page.
Download the required .gguf file from Files and versions.
After the download finishes, run it with the local file path:

`1`	`llama-cli -m C:\Users\knightli\Downloads\gemma-4-e4b-it.gguf`

This bypasses SSL verification during the -hf download step and is useful when you only want to verify that the model can run locally.

If you still want to use `-hf` automatic download

You can manually specify a certificate file path so the program can find a usable CA bundle in the current session.

cacert.pem can be obtained from the CA Extract page maintained by the curl project:

Page: https://curl.se/docs/caextract.html
Direct download: https://curl.se/ca/cacert.pem

If you download it in a browser, open the direct download link and save it as cacert.pem. You can also download it to a fixed directory with PowerShell:

1
2

New-Item -ItemType Directory -Force C:\certs
Invoke-WebRequest -Uri https://curl.se/ca/cacert.pem -OutFile C:\certs\cacert.pem

After the download finishes, set these variables in the command line:

1
2

set SSL_CERT_FILE=C:\certs\cacert.pem
set CURL_CA_BUNDLE=C:\certs\cacert.pem

Then run the original command again:

`1`	`llama-cli -hf unsloth/gemma-4-E4B-it-GGUF`

If the issue really comes from the certificate chain, this usually fixes it directly.

What Does `it` Mean in Gemma-4-31B-it

Sat, 11 Apr 2026 20:45:34 +0800

In gemma-4-31B-it, it stands for Instruction Tuned.

For most users, that means this version is designed for chat, Q&A, coding help, and other instruction-following tasks.

What `it` means

Models often come in two common forms:

Base / Pre-trained: closer to a raw next-token predictor
it: tuned to follow user instructions more reliably

If you ask something like “translate this text” or “write a Python script”, the it version usually behaves more like an assistant.

What `31B` means

31B means the model has about 31 billion parameters.

In general:

more parameters often mean stronger capability
but also higher VRAM or RAM requirements

So 31B is a relatively large model and needs stronger hardware.

What `Gemma-4` means

Gemma-4 identifies the model family and generation:

Gemma: Google’s open model family
4: the fourth generation in that family

Which one to choose

If your goal is chat, Q&A, translation, or coding, the -it version is usually the better choice.

The base version is more relevant for lower-level research, fine-tuning, or custom training workflows.

One-line summary

gemma-4-31B-it means: Gemma 4 family, 31 billion parameters, instruction-tuned for conversation and task execution.

Gemma 4 Local Runtime Guide: From One-Command Start to Dev Integration

Fri, 10 Apr 2026 22:54:17 +0800

If you want to run Gemma 4 locally, you can choose from four practical paths depending on your goal and hardware.

1) Fastest start: Ollama (recommended)

This is the lowest-friction option for quick testing, daily chat, and local API usage.

`1`	`ollama run gemma4`

Highlights:

Works on Windows, macOS, and Linux
Handles hardware acceleration automatically
Offers OpenAI-style local API compatibility

2) GUI workflow: LM Studio / Unsloth Studio

If you prefer a desktop UI instead of terminal commands:

LM Studio: browse and run Gemma 4 quantized variants from Hugging Face (for example 4-bit, 8-bit), with resource visibility.
Unsloth Studio: supports both inference and low-VRAM fine-tuning, often friendlier on 6GB-8GB GPUs.

3) Low-spec and maximum control: llama.cpp

Good for older hardware, CPU-focused setups, or users who want deeper runtime control.

With .gguf model files and quantization, Gemma 4 can be made practical on much smaller hardware budgets.

4) Developer integration: Transformers / vLLM

If you need Gemma 4 inside your own application:

Transformers: straightforward Python integration
vLLM: high-throughput inference for stronger GPU environments

Quick selection

Need	Recommended tools	Hardware bar
I just want it running now	Ollama	Low
I want a ChatGPT-like UI	LM Studio	Medium
My VRAM is limited (6GB-8GB)	Unsloth / llama.cpp	Low
I am building local AI apps	Ollama / Transformers / vLLM	Medium to high
I need fine-tuning	Unsloth Studio	Medium to high

Model size suggestion

Gemma 4 comes in multiple sizes (for example E2B, E4B, 31B).

Start with quantized E2B/E4B on mainstream laptops
Move to larger variants only after your baseline pipeline is stable

How to Troubleshoot Slow `ollama pull` Model Downloads

Thu, 09 Apr 2026 10:42:39 +0800

ollama pull model_name:tag can be very slow in some regions, and the download process is not always stable.

If your issue looks like repeated interruptions halfway through a large model download, with errors such as TLS handshake timeout or unexpected EOF, the bottleneck may not be registry.ollama.ai itself, but the actual download path after the redirect.

This article walks through a simple troubleshooting approach: first get the real model file URLs, then confirm where the traffic actually ends up, and finally optimize only the domains that matter.

Get the model file download URLs

You can use the following project to extract the manifest and blob download URLs for an Ollama model directly:

https://github.com/Gholamrezadar/ollama-direct-downloader

Using gemma4:latest as an example, you can extract links like the following.

Manifest URL

`1`	`https://registry.ollama.ai/v2/library/gemma4/manifests/latest`

Blob URLs

https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11
https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2
https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3

If you only want a quick verification, you can also download the manifest and blobs directly with curl:

curl -L "https://registry.ollama.ai/v2/library/gemma4/manifests/latest" -o "latest"
curl -L "https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11" -o "sha256-f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11"
curl -L "https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a" -o "sha256-4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a"
curl -L "https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2" -o "sha256-7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2"

The real download URL after the redirect

If you try downloading one of the blobs with wget, you will notice that the request does not stay on registry.ollama.ai. It gets redirected to a Cloudflare R2 object storage URL:

`1`	`wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a`

There are a few key details in the log:

registry.ollama.ai returns 307 Temporary Redirect
The final download URL lands on *.r2.cloudflarestorage.com
The large file transfer is actually being served by the object storage domain behind the redirect

This matters because if your proxy or routing rules only cover registry.ollama.ai but not *.r2.cloudflarestorage.com, downloads can still be slow or repeatedly interrupted.

Here is one example of an actual redirect log:

wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
--2026-04-09 09:22:04--  https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
Resolving registry.ollama.ai (registry.ollama.ai)... 104.21.75.227, 172.67.182.229, 2606:4700:3034::ac43:b6e5, ...
Connecting to registry.ollama.ai (registry.ollama.ai)|104.21.75.227|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?... [following]
--2026-04-09 09:22:05--  https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?...
Resolving dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com (dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com)... 172.64.66.1, 2606:4700:2ff9::1
Connecting to dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com|172.64.66.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9608338848 (8.9G) [application/octet-stream]

Adjust your network settings

Once you confirm the real download path, the troubleshooting direction becomes much clearer.

If you are using a proxy, split routing, or custom DNS, check these first:

Whether registry.ollama.ai and *.r2.cloudflarestorage.com are using the same stable route
Whether your proxy rules cover only the former but miss the latter
Whether your current outbound path is suitable for sustained multi-GB downloads

The key issue here is not simply whether the official site opens, but whether the redirected object storage path is stable enough for long-running large-file transfers. In many cases, the real bottleneck is the Cloudflare R2 layer rather than the registry domain in front of it.

Before-and-after comparison

Here is one real-world example while downloading gemma4:31b-it-q8_0.

Before adjusting the network path, the download was slow and failed midway:

PS C:\Users\knightli> ollama run gemma4:31b-it-q8_0
pulling manifest
pulling a0feadb736f5:  38% ▕██████████████████████                                    ▏  12 GB/ 33 GB  1.2 MB/s   4h40m
Error: max retries exceeded: unexpected EOF

After the adjustment, the same model download became noticeably faster and more stable:

1
2
3

PS C:\Users\knightli> ollama run gemma4:31b-it-q8_0
pulling manifest
pulling a0feadb736f5:  46% ▕████████████████████████████████████████████████████████████████▏ 15 GB/ 33 GB  8.5 MB/s  35m23s

This does not mean every network environment will see the same improvement, but it does support one useful conclusion: the bottleneck may be the actual large-file download path rather than the Ollama client itself.

A more practical troubleshooting order

If you run into the same issue, this order usually works well:

Run ollama pull or ollama run once and confirm the issue is reproducible.
Test a blob URL with wget or curl -L and confirm whether it redirects to *.r2.cloudflarestorage.com.
Adjust your proxy or routing only for the real download domain, then test speed and stability again.

The benefit of this order is that each step validates one clear hypothesis, so you do not have to troubleshoot blindly.

Conclusion

When ollama pull is slow, the problem is often not that registry.ollama.ai is unreachable, but that the Cloudflare R2 path actually serving the large files is unstable.

So instead of retrying over and over, a better approach is to identify the real download path first and optimize the network route where the traffic actually lands.

Gemma 4 on Raspberry Pi 5: It Works, But Responses Are Slow

Wed, 08 Apr 2026 18:42:00 +0800

I ran a near-limit experiment: running Gemma 4 on a Raspberry Pi 5 (8GB RAM). I was not targeting larger variants, only the smallest E2B model.

Conclusion first: it runs and it is usable, but it fits low-interaction workflows better than real-time chat.

Test Environment

Device: Raspberry Pi 5 (4-core CPU, 8GB RAM)
OS: Ubuntu Server (no GUI)
Access method: SSH
Runtime: LM Studio CLI (command-line-only mode)
Model: Gemma 4 E2B (about 4.5GB)

Step 1: Install and Start LM Studio CLI

I installed the LM Studio CLI build on the Pi, then started the service and checked available commands.

For a terminal-only setup, this deployment mode is a good fit for Raspberry Pi.

Step 2: Move Model Storage to SSD

To avoid heavy SD card writes, I switched model download storage to an external SSD.

On Raspberry Pi 5, SSD usage is much more practical than on older models. For long-term local model runs, SSD is strongly recommended.

Step 3: Download and Load Gemma 4 E2B

After download, the model loaded into memory successfully.

According to official information, Gemma 4 includes:

Tool-calling support for agent-style workflows (function calling)
Multimodal capabilities (image/video; smaller models also include audio-related capability)
128K context window
Apache 2.0 license (commercial use allowed)

Given Raspberry Pi hardware limits, E2B is the most practical tier to start with.

Step 4: Start API and Enable LAN Access

After loading, I started the API on local port 4000 and confirmed model listing works via HTTP.

The issue: by default, it only listens on localhost, so other LAN devices cannot access it directly.

Since host binding was not exposed by the startup options, I used socat for port forwarding, bridging an external Pi port to LM Studio’s internal port.

Result: successful. I could query the model list from a MacBook on the same LAN.

Step 5: Connect to Editor (Zed)

LM Studio’s local server is OpenAI-API-compatible, so most tools that support custom base_url can connect.

I added a new LLM provider in Zed pointing to the Pi-hosted Gemma 4 instance, and in-editor chat worked.

Practical Usability

This setup is suitable for:

Local automation scripts
Low-concurrency, low-real-time assistant tasks
Personal learning and edge-device experimentation

Less suitable for:

High-frequency interactive chat
Development collaboration scenarios sensitive to response latency

Conclusion

Running Gemma 4 (E2B) on Raspberry Pi 5 is feasible, and the practical output quality is better than expected.

If your goal is offline operation, tool integration, and lightweight-to-mid tasks, this setup is worth trying. If your goal is smooth real-time interaction, stronger hardware is still the better choice.

Connect OpenClaw to Local Gemma 4: Complete Setup Guide

Wed, 08 Apr 2026 18:18:00 +0800

This guide shows how to connect OpenClaw to a local Gemma 4 model through Ollama.

If you have not deployed Gemma 4 locally yet, start here:

How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide

Step 1: Start the Ollama API Service

Start Ollama first:

`1`	`ollama serve`

Then verify the API quickly with:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:12b",
  "prompt": "Hello"
}'

If you get a model response, your local API is ready.

Step 2: Configure OpenClaw to Use Ollama

The OpenClaw config file is usually located at:

`1`	`~/.openclaw/config.yaml`

Edit config.yaml and add a local model entry under models:

models:
  # Your existing model config...

  gemma4-local:
    provider: ollama
    base_url: http://localhost:11434
    model: gemma4:12b
    timeout: 120s

Step 3: Set Default Model (Optional)

If you want Gemma 4 as the default model:

`1`	`default_model: gemma4-local`

Step 4: Restart and Verify OpenClaw

Restart OpenClaw:

`1`	`openclaw restart`

List available models:

`1`	`openclaw models list`

Run a quick chat test:

`1`	`openclaw chat --model gemma4-local "Hello"`

If the chat returns normally, OpenClaw is successfully connected to local Gemma 4.

Common Troubleshooting

connection refused: make sure ollama serve is running.
Model not found: check model name with ollama list (for example gemma4:12b).
Timeout: increase timeout and test a smaller model first.

How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide

Wed, 08 Apr 2026 18:06:00 +0800

If you want to run Gemma 4 locally on a laptop, Ollama is one of the fastest and simplest options. Even without complex setup, you can usually get it running in about five minutes.

Step 1: Install Ollama

Open https://ollama.com and download the installer for your OS.
Complete installation based on your system:

macOS: drag it to Applications.
Windows: run the .exe installer.
Linux: use the install script from the official site.

After installation, Ollama runs as a background service. Beyond initial setup, daily usage is mostly simple commands.

Step 2: Download a Gemma 4 Model

Open a terminal and run:

`1`	`ollama pull gemma4:4b`

If your machine is stronger, you can switch to 12b or 27b. Once downloaded, the model is stored locally.

Check downloaded models with:

`1`	`ollama list`

Step 3: Run the Model

`1`	`ollama run gemma4:4b`

This opens an interactive chat session in your terminal. Type your prompt and press Enter. To exit, type:

/bye

If you prefer a browser chat UI, you can pair it with Open WebUI. It wraps Ollama with a local web interface and is usually quick to set up with Docker.

Laptop Performance Tips

Apple Silicon (M2/M3/M4): Metal acceleration is enabled by default, and 12B can run well.
NVIDIA GPU: CUDA is used automatically when a compatible GPU is detected. Keep drivers updated.
CPU-only inference: works, but larger models will be slower. For most CPU-only setups, 4B is the practical default.
Free memory before loading large models: as a rough rule, each billion parameters needs about 0.5GB to 1GB RAM.

How to Choose a Model

Gemma 4 1B: good for lightweight Q&A, simple summarization, and quick lookups; limited on complex reasoning.
Gemma 4 4B: best for most daily tasks (writing help, coding help, document summarization) with strong speed/quality balance.
Gemma 4 12B: better for longer context and more complex tasks, especially coding and reasoning.
Gemma 4 27B: better for high-demand workloads and closer to frontier-cloud quality, but needs significantly stronger hardware.

How to Install and Run Gemma 4 on Android: Complete Getting-Started Guide

Wed, 08 Apr 2026 17:55:53 +0800

If you want to run Gemma 4 offline on your phone, this guide walks you through the full process from setup to practical usage.

Step 1: Get the App

Google AI Edge Gallery is currently not available on Google Play, so you need to install it via APK sideloading.

On your Android device, go to:

Settings -> Apps -> Special app access -> Install unknown apps

Then:

Find your browser (for example, Chrome or Firefox) and enable “Allow from this source.”
Open the Google AI Edge Gallery GitHub Releases page in your mobile browser.

URL: https://github.com/google-ai-edge/gallery/releases

Download the latest .apk package.
After the download completes, open the file from notifications or your file manager and follow the prompts.

With a stable connection, this step usually takes around 2 minutes.

Step 2: Open the App and Grant Permissions

When you first open AI Edge Gallery, it will request storage permission to save model files. It’s best to allow this; otherwise, the app cannot download or load models.

You will typically see these sections on the home screen:

Ask Image: Vision tasks (describe images, answer questions about photos)
AI Chat: Standard text chat
Summarize: Paste text and generate summaries
Smart Reply: Generate reply suggestions

For most users, AI Chat is the primary entry point.

Step 3: Download a Gemma 4 Model

Enter AI Chat.
Tap Get Models when prompted.
Choose a Gemma 4 model from the list (model size is shown).
Pick based on your device capability; if your phone has 8GB RAM, start with Gemma 4 4B.
Tap Download and let it run in the background.

Note: Larger models take longer to download. You can download multiple models and switch between them later. Downloaded models stay on your device, so you do not need to re-download them.

Step 4: Start Chatting

After the model download is finished:

Tap the model name to load it (the first load usually takes 10 to 30 seconds depending on model size and device performance).
Enter your prompt in the chat box and send it.
The model generates responses locally, and your data does not leave the phone.

The first reply is often slower due to model warm-up. Later messages in the same session are usually faster.

Step 5: Try Vision Features (Gemma 4 Multimodal)

If you downloaded a Gemma 4 multimodal variant:

Go back to the main menu and open Ask Image.
Select an image or take a photo.
Ask a question (for example, “What’s in this image?” or “Is there any text I should pay attention to?”).
Wait for the model to analyze the image locally and return a result.

This feature works offline, and your image is not sent to external servers.

Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B

Sun, 05 Apr 2026 08:30:00 +0800

Gemma 4 focuses on multimodality and local offline inference, with a full range from lightweight to high-performance models. For most local deployment users, the key is not choosing the largest model, but choosing the one that best matches hardware and task needs.

Gemma 4 Model Comparison

The table below is for quick model selection. Actual performance and resource usage should be validated in your own environment.

Model	Parameter Size	Positioning	Key Strengths	Main Limitations	Recommended Scenarios
Gemma 4 2B	2B	Ultra-lightweight	Low latency, low resource usage, lowest deployment barrier	Limited performance on complex reasoning and long task chains	Mobile, IoT, lightweight Q&A, simple automation
Gemma 4 4B	4B	Lightweight enhanced	Stronger understanding and generation than 2B, still easy to deploy locally	Limited ceiling for heavy coding and complex agent tasks	Local assistant, basic document work, multilingual daily tasks
Gemma 4 26B	26B	High-performance (MoE)	Better reasoning and tool use, suitable for production workflows	Significantly higher VRAM requirement and hardware threshold	Coding assistant, complex workflows, enterprise internal agents
Gemma 4 31B	31B	High-performance (dense)	Best overall capability and stronger stability on complex tasks	Highest resource cost and tuning complexity	Advanced reasoning, complex coding tasks, heavy automation

How to Choose: Start from Hardware and Tasks

If your top concern is whether it runs smoothly, use this guideline:

8GB VRAM: prioritize 2B/4B.
12GB VRAM: prioritize 4B or quantized variants of larger models.
24GB VRAM: focus on 26B, and evaluate quantized 31B based on workload.
Higher VRAM or multi-GPU: consider high-precision 31B setups.

Prioritize stability and inference speed first, then scale up model size gradually.

Four Typical Use Cases

1) Local General Assistant

Preferred model: 4B
Why: strong balance between cost and quality, suitable for long-running local use.

2) Coding and Automation

Preferred model: 26B
Why: more stable in multi-step tasks, tool calls, and script generation.

3) Advanced Reasoning and Complex Agents

Preferred model: 31B
Why: stronger robustness under complex context.

4) Edge Devices and Lightweight Offline Use

Preferred model: 2B
Why: easiest to deploy on resource-constrained devices.

Deployment Suggestions (Ollama)

A practical approach is to iterate in small steps:

Start with 4B to establish a baseline (latency, memory, quality).
Build a fixed test set from real tasks (for example, 20 common questions + 10 automation tasks).
Compare 26B/31B against that set for accuracy, latency, and VRAM cost.
Upgrade only when the gain is clear.

This avoids jumping to a large model too early and running into lag, low throughput, and maintenance overhead.

Conclusion

The real value of Gemma 4 is not just larger parameter counts, but a practical model ladder from lightweight to high-performance:

For low-cost fast rollout: start with 2B/4B.
For production-grade local AI workflows: prioritize 26B.
For advanced reasoning and heavy automation: move to 31B.

In most cases, the best Gemma 4 choice is not the biggest model, but the one with the best fit for your hardware and task goals.

Gemma 4 on KnightLi Blog

Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models

Quick Summary

Gemma 4 E2B VRAM Table

Gemma 4 E4B VRAM Table

Gemma 4 26B A4B VRAM Table

Gemma 4 31B VRAM Table

Why Actual Usage Is Higher Than File Size

How to Choose

References

Gemma 4 E4B Uncensored vs Official: What Actually Changes

What the derivative model explicitly claims

Official model vs “uncensored” derivative

The core difference is alignment, not raw capability

Why the official version behaves differently

When the official model is the better choice

When people choose the uncensored derivative

Practical conclusion

Sources

Deploy Hermes Agent Locally on Windows with WSL + Ollama and Connect Telegram

Overall flow

1. Install WSL and Ubuntu

2. Update Ubuntu and install the base environment

Install Python

Install zstd

Install Node.js

Install Git

3. Install Ollama and pull Gemma 4

4. Install and configure Hermes Agent

Common Hermes Agent commands

Start

Re-enter setup

Configure the chat gateway

Update

Basic Telegram connection steps

Who this setup fits

A few things to keep in mind

Conclusion

Original reference

How to Fix SSL Certificate Verification Failed When llama-cli Downloads from Hugging Face on Windows

The easiest workaround: download the model manually

If you still want to use -hf automatic download

What Does `it` Mean in Gemma-4-31B-it

What it means

What 31B means

What Gemma-4 means

Which one to choose

One-line summary

Gemma 4 Local Runtime Guide: From One-Command Start to Dev Integration

1) Fastest start: Ollama (recommended)

2) GUI workflow: LM Studio / Unsloth Studio

3) Low-spec and maximum control: llama.cpp

4) Developer integration: Transformers / vLLM

Quick selection

Model size suggestion

How to Troubleshoot Slow `ollama pull` Model Downloads

Get the model file download URLs

Manifest URL

Blob URLs

The real download URL after the redirect

Adjust your network settings

Before-and-after comparison

A more practical troubleshooting order

Conclusion

Gemma 4 on Raspberry Pi 5: It Works, But Responses Are Slow

Test Environment

Step 1: Install and Start LM Studio CLI

Step 2: Move Model Storage to SSD

Step 3: Download and Load Gemma 4 E2B

Step 4: Start API and Enable LAN Access

Step 5: Connect to Editor (Zed)

Practical Usability

Conclusion

Related Posts

Connect OpenClaw to Local Gemma 4: Complete Setup Guide

Step 1: Start the Ollama API Service

Step 2: Configure OpenClaw to Use Ollama

Step 3: Set Default Model (Optional)

Step 4: Restart and Verify OpenClaw

Common Troubleshooting

If you still want to use `-hf` automatic download

What `it` means

What `31B` means

What `Gemma-4` means