Local LLM on KnightLi Blog

llama.cpp b9196 Update: Windows Prebuilt Binaries Support CUDA 13.1, Vulkan, HIP, and SYCL

Mon, 18 May 2026 23:20:00 +0800

The recent Windows release of llama.cpp is much friendlier for local LLM users. In the past, running GGUF models on Windows often meant dealing with environment issues: CUDA version mismatches, missing DLLs, incompatible drivers, failed CMake builds, wrong environment variables, or complicated Vulkan / HIP / SYCL setup.

Now the official Release page provides several Windows prebuilt packages. In many cases, users no longer need to compile from source. Download the right build, unzip it, place the model file, and you can start a local inference service directly.

What llama.cpp Is Good For

llama.cpp is one of the most widely used frameworks for running GGUF models locally. It is lightweight, cross-platform, runs on CPU or GPU, and has a large ecosystem of GGUF model resources.

Common model families include:

Qwen
Llama
DeepSeek
Gemma
Mistral
Mixtral
Hermes

As GGUF quantized models become more common, many open source models now provide GGUF versions suitable for local deployment. For regular users, the value of llama.cpp is simple: you do not need a full complex inference stack to run a usable chat service on your own machine.

How to Choose a Windows Prebuilt Build

Windows users can choose different builds based on their hardware:

Windows x64 CPU
Windows x64 CUDA 12.4
Windows x64 CUDA 13.1
Windows x64 Vulkan
Windows x64 HIP Radeon
Windows x64 SYCL
Windows ARM64 CPU

If you use an NVIDIA GPU, the CUDA build is usually the first choice. Cards such as RTX 3060, 4060, 4070, 4080, and 4090 are better suited to the CUDA route.

If you use an AMD GPU, try HIP or Vulkan. In practice, Vulkan can sometimes be easier than HIP, especially if you do not want to set up a full ROCm environment.

If you use Intel integrated graphics or an Arc GPU, try SYCL or Vulkan. Performance is usually behind NVIDIA CUDA, but it is already enough to test many small and medium GGUF models.

The CPU build is suitable for users without a discrete GPU, or for those who only want to verify a model or run small models. It will not be fast, but deployment is the simplest.
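The selection rules above can be written down as a small lookup. This is only a sketch in Python: the name suggest_builds and the table are made up for illustration, the labels are build families rather than exact release file names, and the Vulkan-before-HIP order for AMD is the "often easier" order suggested above, not a hard rule.

```python
# Preference order per GPU vendor, taken from the guidance in this article.
# The labels are build families, not exact release file names.
BUILD_PREFERENCE = {
    "nvidia": ["CUDA"],
    "amd": ["Vulkan", "HIP"],
    "intel": ["SYCL", "Vulkan"],
}


def suggest_builds(gpu_vendor):
    """Return build families to try, in order; fall back to CPU if unknown."""
    return BUILD_PREFERENCE.get(gpu_vendor.strip().lower(), ["CPU"])
```

For example, suggest_builds("AMD") returns the Vulkan build first and HIP second, and any unrecognized vendor string falls back to the CPU build.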

Start a Regular GGUF Model

Assume you have downloaded the llama.cpp Windows prebuilt package and placed your model in the models directory. Enter the extracted llama.cpp directory and run:

llama-server.exe -m models\your-model.gguf -ngl 999

Here, -m points to the GGUF model file. -ngl sets how many model layers are offloaded to the GPU; a value like 999 is larger than any model's layer count, so it simply requests every layer. Whether all of them actually fit depends on VRAM size, model size, and quantization format.

After startup succeeds, open this address in your browser:

http://127.0.0.1:8080

You will enter the local web chat interface.
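Besides the web page, llama-server also serves an OpenAI-compatible HTTP API on the same port, which is handy for scripts. Below is a minimal Python sketch, assuming the server started by the command above is listening on 127.0.0.1:8080. The helper names build_chat_request and ask are made up for this example, and max_tokens is an arbitrary value.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8080"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,  # arbitrary cap for a quick test
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(ask("Say hello in one sentence.")) should print the model's reply. If the call fails with a connection error, the server is probably not up yet or is using a different port.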

If VRAM is not enough, switch to a smaller model or a more heavily quantized version, such as Q4 or Q5 GGUF files. Do not only look at parameter count; also check quantization format and context length settings, because both change how much memory the model needs.
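To see why quantization matters as much as parameter count, a rough weight-size estimate helps. This is a back-of-the-envelope sketch in Python, not how llama.cpp accounts for memory: it multiplies parameters by bits per weight and ignores KV cache and runtime overhead. The function names and the reserve_gb default are assumptions for illustration.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def weights_fit(params_billion, bits_per_weight, vram_gb, reserve_gb=2.0):
    """Rough check: do the weights fit while leaving reserve_gb free?

    reserve_gb is an assumed allowance for KV cache and runtime buffers;
    real needs grow with context length.
    """
    return weight_size_gb(params_billion, bits_per_weight) + reserve_gb <= vram_gb
```

For instance, an 8B model at 4 bits per weight is about 4 GB of weights, while the same model at 16 bits is about 16 GB. Real GGUF files are usually somewhat larger than the nominal bit width suggests, so treat the result as a lower bound.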

Start a Multimodal Vision Model

Multimodal vision models usually need more than the main model file. They also need an mmproj vision projection file. Start them by specifying both:

llama-server.exe -m "models\main-model.gguf" --mmproj "models\mmproj-model.gguf" -ngl 999

Common uses include:

OCR recognition
Screenshot understanding
Webpage screenshot analysis
Image Q&A
Simple visual content judgment

For example, Qwen2-VL / Qwen2.5-VL models are useful for Chinese screenshot understanding, OCR, and image-text Q&A. Make sure the main model and mmproj file match; version mismatches can easily cause loading failures or abnormal output.
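Once a vision model is running, images can also be sent from a script. The sketch below builds an OpenAI-style chat payload with an inline base64 image in Python. It assumes your llama-server build accepts image_url content parts when started with --mmproj; the function name build_vision_payload and the max_tokens value are made up for this example.

```python
import base64


def build_vision_payload(image_bytes, question, mime="image/png"):
    """Build a chat payload holding one text part and one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
        "max_tokens": 256,  # arbitrary cap for a quick test
    }
```

Read a screenshot with open("shot.png", "rb").read(), pass the bytes to build_vision_payload, and POST the JSON-encoded result to the server's /v1/chat/completions endpoint. If the reply ignores the image, check the mmproj pairing first.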

Use a bat Script to Manage Multiple Models

If you keep multiple models locally, you can write a simple .bat script to switch between them. The following example needs your own path and model names:

@echo off
chcp 65001 >nul
cd /d C:\path\to\llama-b9196-bin-win-cuda-13.1-x64

echo Choose a model:
echo 1. Gemma
echo 2. Qwen VL (multimodal)
echo 3. DeepSeek

set /p choice=Enter a number: 

if "%choice%"=="1" llama-server.exe -m "models\gemma.gguf" -ngl 999
if "%choice%"=="2" llama-server.exe -m "models\qwen-vl.gguf" --mmproj "models\mmproj.gguf" -ngl 999
if "%choice%"=="3" llama-server.exe -m "models\deepseek.gguf" -ngl 999

pause

Save it as UTF-8, then change the extension to .bat. Double-clicking the script lets you choose different models by number.

Three Things to Check When Choosing Models

First, check hardware. More VRAM means you can run larger models. If VRAM is limited, do not force a large model; start with 7B, 8B, or a lower quantization version.

Second, check the use case. For everyday Q&A, summarization, and rewriting, small models or medium quantization are often enough. For coding, long-document analysis, or multimodal understanding, you need stronger models and more VRAM.

Third, check licenses and safety boundaries. Many community-modified models have different capabilities, restrictions, and licenses. Before downloading, confirm the source, license, intended use, and risks. Do not hand production work directly to models from unclear sources.

Common Issues

If startup reports missing DLLs, first confirm that the downloaded package matches your GPU route. NVIDIA users should not download the HIP build by mistake, and AMD users should not download the CUDA build.

If model loading is slow, the model may be too large, the disk may be slow, or part of the model may be falling back to CPU due to insufficient VRAM.

If the web page does not open, check whether the command line service started successfully, then confirm the port is 8080. If the port is occupied, start llama-server on a different port with the --port parameter, for example --port 8081, and open that port in the browser instead.
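A quick way to tell "server not started" from "port already taken" is to probe the port before launching. The following Python sketch uses only the standard library; the names port_in_use and first_free_port are made up for this example, and the 20-port search range is arbitrary.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds
        return s.connect_ex((host, port)) == 0


def first_free_port(start=8080, tries=20):
    """Return the first port in [start, start + tries) with no listener."""
    for port in range(start, start + tries):
        if not port_in_use(port):
            return port
    raise RuntimeError("no free port found in the tested range")
```

If port_in_use(8080) is True before you start llama-server, another program holds the port; pass the value from first_free_port() to the server's port setting and use that port in the browser address.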

If a multimodal model behaves incorrectly, first check whether the mmproj file matches the main model instead of only changing prompts.

Summary

The value of these Windows prebuilt packages is that they lower the entry barrier for local AI. Many users previously got stuck at compilation and dependency setup. Now they can move faster into downloading models, starting a service, and testing results.

For Windows users, the route can be summarized simply:

NVIDIA: prefer CUDA.
AMD: try Vulkan first, then HIP.
Intel: try SYCL or Vulkan.
No discrete GPU: use the CPU build for small models.

Before real use, still confirm model source, license, VRAM needs, and actual results. Local AI gives you control, offline operation, and low latency, but it is not free of cost: model management, hardware resources, and output quality are still your responsibility.

Source: https://www.freedidi.com/24211.html

Claude Code + Ollama Local Deployment Guide: Build a Free AI Coding Assistant with CC Switch

Fri, 15 May 2026 23:27:50 +0800

Claude Code has become a popular AI coding assistant recently. Its appeal is not just that it can chat about code, but that it can read a project, modify files, run commands, install dependencies, and keep fixing errors in an agent-like workflow.

The hard part is cost. Once a project grows, long context and repeated agent turns can burn through API quota quickly. If you just want to experiment, refactor small utilities, generate scripts, or work on a private local project, it is natural to ask: can Claude Code’s workflow be kept while the model runs locally?

The key tool in this setup is CC Switch. It lets Claude Code connect to the local Ollama service through an OpenAI-compatible API endpoint, so requests can be forwarded to a local model instead of the official Claude API.

What This Setup Solves

You can think of the whole setup as:

Claude Code desktop
+ CC Switch API forwarding layer
+ Ollama local model

Claude Code is still responsible for the coding workflow and project operations. CC Switch handles model provider configuration and API compatibility. Ollama runs the model locally.

This does not make a local model suddenly become Claude. Its real value is that it makes Claude Code’s agent workflow usable in lower-cost, offline, and private local scenarios.

Basic Preparation

Before you start, prepare these pieces:

Install Git.
Install Ollama.
Pull a local model suitable for coding.
Install CC Switch.
Have Claude Code available on your machine.

For the model side, you can start with coding-oriented models, such as Qwen Coder, DeepSeek Coder, or other models with decent tool-calling and code generation behavior. The larger the model, the better the result may be, but memory and GPU pressure will also rise.

If your machine only has limited memory, start with a smaller model first. Confirm that the workflow runs smoothly before trying a larger one.

Key CC Switch Configuration

After Ollama starts, its default local API address is usually:

http://127.0.0.1:11434/v1

In CC Switch, choose an OpenAI-compatible provider type, commonly:

OpenAI Chat Completions

Then point the base URL to Ollama’s local address.

For the API key field, local Ollama normally does not need a real key, but many tools still require an environment variable or placeholder. You can use:

ANTHROPIC_API_KEY

or another placeholder variable accepted by your local setup.

One configuration item is worth special attention:

"inferenceModels"="[\"haiku\",\"sonnet\",\"opus\"]"

This means mapping Claude Code’s expected model roles to the local provider. In practice, you need to bind haiku, sonnet, and opus to the model names exposed by Ollama or CC Switch. If this mapping is wrong, Claude Code may fail to call the model or may keep falling back to an unexpected configuration.
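Before filling in the mapping, it helps to confirm which model names the local endpoint actually exposes. The sketch below is plain Python and says nothing about CC Switch's own config format: model_ids parses an OpenAI-style model list response such as the one Ollama returns from /v1/models, and map_roles checks that each role points at a name that exists. Both function names and the model names used with them are placeholders.

```python
def model_ids(models_response):
    """Extract model names from an OpenAI-style model list response."""
    return [item["id"] for item in models_response.get("data", [])]


def map_roles(available, wanted):
    """Return the role-to-model mapping after checking every name exists.

    available: list of model names exposed by the local endpoint.
    wanted: dict such as {"haiku": "...", "sonnet": "...", "opus": "..."}.
    """
    missing = {role: name for role, name in wanted.items() if name not in available}
    if missing:
        raise ValueError(f"models not found locally: {missing}")
    return dict(wanted)
```

A typical choice on a small machine is to point all three roles at the same local model; on a larger one, haiku can go to a smaller, faster model and sonnet or opus to a bigger one.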

Where Claude Code Is Strong

Claude Code’s biggest advantage is not raw completion. It is the full coding workflow:

reading and understanding project structure;
locating related files based on a task;
editing code directly;
running commands and tests;
observing errors and iterating;
completing multi-step tasks in one session.

This is why many people want to keep Claude Code even when switching to a local model. A normal chat UI can generate code snippets, but it does not naturally operate inside a repository. Claude Code is closer to an executable development assistant.

What Role Ollama Plays Here

Ollama is responsible for local model runtime and management. It handles model downloading, loading, and local inference.

The advantage is clear: requests stay on your machine, repeated use does not create API bills, and you can use it when the network is limited. For private code, this is also easier to accept than sending every context window to a cloud model.

The trade-off is also clear. Local models depend heavily on your hardware and on model quality. A smaller model can handle simple edits, explanations, and script generation, but it may struggle with large cross-file refactors or subtle architectural decisions.

Where The Experience Has Boundaries

This setup should not be treated as a full replacement for Claude’s strongest cloud models.

You may run into these issues:

weaker long-context understanding;
unstable tool-calling behavior in complex tasks;
slower inference on CPU-only machines;
more hallucinated file paths or APIs;
less reliable multi-round planning;
lower success rate on large repository refactors.

So the better expectation is: use it as a free local development assistant, not as a perfect substitute for a top-tier cloud model.

Multimodal Compatibility Is Still Unstable

Some users want Claude Code to handle screenshots, UI images, diagrams, or other multimodal inputs. This part depends on the local model and the forwarding layer.

If the selected Ollama model does not support vision, or CC Switch does not translate the request format correctly, multimodal features may fail. Even with a vision model, behavior may differ from Claude’s official API.

For now, this setup is more suitable for text and code workflows. Treat multimodal support as experimental.

Who Should Try It

This setup is suitable for:

developers who want to try Claude Code’s workflow at low cost;
users who frequently write scripts, small tools, and automation snippets;
teams that want to keep code on local machines;
learners who want an AI coding assistant without constant API spend;
people testing different local coding models.

It is less suitable if you rely heavily on long context, large monorepos, strict code review quality, or complex full-project refactors.

Usage Advice

Start with small tasks.

For example:

explain a single file;
refactor a small function;
generate a shell script;
fix a simple error;
add a small feature;
write unit tests for a narrow module.

After each change, run tests or at least review the diff yourself. A local model can be useful, but you should not blindly accept every generated edit.

If the model keeps losing context, reduce the task scope. Instead of asking it to “refactor the whole project”, ask it to “refactor this function” or “add validation in this file”.

Summary

Claude Code + CC Switch + Ollama is an interesting combination. It keeps Claude Code’s agent-style development workflow while moving inference to a local model.

Its biggest strengths are lower cost, local privacy, and a smooth development workflow. Its limits are also obvious: model quality, hardware performance, long context, and tool-calling stability all affect the final experience.

If you already use Ollama and want a more practical local AI coding workflow, this setup is worth trying. Just remember to start small, verify every change, and treat the local model as an assistant rather than an automatic engineer.

Running DeepSeek 4 Locally: Antirez's ds4 Experiment on Apple Silicon Mac

Mon, 11 May 2026 08:51:37 +0800

Antirez has open sourced a new project: ds4. It is not a general-purpose LLM framework, but a local inference engine for DeepSeek V4 Flash, with a focus on Apple Silicon and the Metal backend.

Project URL: https://github.com/antirez/ds4

What is ds4?

ds4 has a clear goal: running DeepSeek V4 Flash locally on a Mac.

It currently provides three ways to use it:

Interactive CLI.
HTTP server.
An experimental Agent mode.

Judging from its positioning, it is more like an inference project deeply optimized for one specific model than a replacement for general-purpose tools such as llama.cpp, Ollama, or vLLM.

Why it is worth watching

There are three main reasons this kind of project is worth following.

First, the author is Antirez, the creator of Redis. He has long focused on low-level systems, performance, and simple tools, and his projects are usually quite direct in style.

Second, DeepSeek V4 Flash points toward efficient inference. If the local running experience is good enough, it could be very attractive for Mac users.

Third, ds4 directly targets Apple Metal. Compared with the route of supporting every platform first and optimizing later, it feels more like a project trying to go deep on one well-defined scenario.

Who should try it

ds4 is better suited for users who:

Use an Apple Silicon Mac.
Want to run DeepSeek V4 Flash locally.
Care about Metal inference performance.
Are willing to try an alpha-stage project.
Want to study lightweight inference engines and model runtime details.

If your goal is stable deployment, cross-platform operation, or OpenAI API-compatible infrastructure, it may not be the first choice at this stage. It is better treated as an experimental tool and a technical project to watch.

How to use it

The basic workflow in the project README is to build it first, then run it.

git clone https://github.com/antirez/ds4.git
cd ds4
make

Run it interactively:

./ds4

Start the HTTP server:

./ds4 --server

Agent mode:

./ds4 --agent

For exact parameters and model file preparation, follow the repository README, because the project is still changing quickly.

Current risks

ds4 is still at an early stage, so set expectations before using it:

Features may be incomplete.
Parameters, model formats, and command-line behavior may change.
Compatibility mainly revolves around Apple Silicon and Metal.
Agent mode is more experimental and is not suitable for direct production use.
When something breaks, you may need to read the README, issues, or source code yourself.

In other words, it is currently more of an open source experiment worth trying than a one-click tool for ordinary users.

How it differs from general inference tools

General-purpose inference tools usually aim for broad compatibility across model formats, platforms, backends, and APIs. ds4 takes a narrower path: local DeepSeek V4 Flash inference on Metal.

That choice has both benefits and trade-offs.

The benefit is that the implementation can stay focused, making performance and user experience easier to optimize around a single target. The trade-off is a limited scope: it is not meant to run every possible model, nor to replace a complete deployment platform.

If you already use llama.cpp or Ollama, ds4 is better treated as a supplementary testing tool, not an immediate replacement for your existing workflow.

Summary

The interesting part of ds4 is not that it is yet another local LLM tool. It is that its scope is intentionally narrow: DeepSeek V4 Flash, Apple Silicon, Metal, and local inference.

If you have a suitable Mac and are willing to tinker with an early-stage project, it is worth watching its performance, model support approach, and server/agent capabilities. For production environments, it is better to keep observing until the interfaces and usage patterns become more stable.

References

GitHub project: https://github.com/antirez/ds4

A Practical llama.cpp Multi-GPU Benchmarking Approach: Is 2x V100 16GB Faster Than One 32GB Card?

Sat, 09 May 2026 15:05:41 +0800

Short version: llama.cpp multi-GPU offload is not free performance just because you add a second card. If the model already fits fully on one 32GB GPU, 2x V100 16GB is often less convenient than a single 32GB card and may even be slower. If the model does not fit on one 16GB card, the main value of dual GPUs is that the model can stay on GPU, and the benefit can be obvious.

First, Understand split mode

llama.cpp multi-GPU usage mainly revolves around --split-mode and --tensor-split. When discussing performance, distinguish these modes first:

layer: splits layers across GPUs. It is usually the most compatible starting point.
tensor: splits tensor computation across multiple GPUs. It is closer to true parallel compute, but depends more heavily on inter-GPU bandwidth and backend support.
row: an older row-splitting mode that still appears in some setups, but is usually not the first choice for new deployments.

In simple terms, layer is like putting different floors on different cards. During single-token generation, it may not keep both cards fully busy at the same time. tensor is more like letting both cards work on the same layer together. It has more theoretical parallelism, but inter-GPU communication can become the bottleneck.

If One 32GB Card Can Fit the Model, Dual 16GB Is Not Always Faster

If the model and KV cache fit fully on one 32GB GPU, a single card is usually steadier and often faster. For hardware in the same generation, such as 1x V100 32GB versus 2x V100 16GB, the dual-card setup does not necessarily win.

A conservative expectation is that 2x V100 16GB may be 10% to 40% slower than one V100 32GB, especially for single-user chat, Continue Agent, and code Q&A workloads where one request is mainly generating one answer.

The reason is straightforward: multi-GPU does not simply merge VRAM into one fast pool. With layer splitting, inference moves across GPUs and one card may wait for the other during token generation. With tensor splitting, both cards can compute together, but intermediate results need cross-GPU synchronization, so bandwidth and latency directly affect throughput.

So if your choice is:

1x V100 32GB
2x V100 16GB

and the target model already fits fully on one 32GB card, the single 32GB card is often the more comfortable option.

If One 16GB Card Cannot Fit the Model, Dual Cards Matter

The situation changes completely when the model does not fit on one 16GB card but can fit across two 16GB cards.

In that case, the value of dual GPUs is very direct:

One 16GB card: may require heavy CPU offload, which can slow things down a lot.
2x 16GB cards: weights can stay mostly on GPU, which may be much faster than mixed CPU/GPU execution.

In this scenario, 2x V100 16GB is not guaranteed to beat one 32GB card, but it may be several times faster than a single 16GB card with heavy system-memory offload. In other words, the first value of dual cards is not acceleration. It is avoiding the need to push model weights into slower system RAM.

V100 PCIe and V100 SXM2 Are Very Different

The easiest thing to overlook in multi-GPU inference is the interconnect.

If you have V100 SXM2 with NVLink, cross-GPU communication bandwidth is much higher. NVIDIA’s V100 material lists NVLink interconnect bandwidth up to 300GB/s. In that environment, tensor mode or higher-batch workloads have a better chance of approaching or exceeding single-card performance.

If you have V100 PCIe, expectations should be more conservative. V100 PCIe mainly uses PCIe Gen3, and the listed interconnect bandwidth is 32GB/s. That is a very different class from NVLink, which is why dual PCIe cards often provide enough VRAM without doubling speed.
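To get a feel for the gap, compare the ideal time to move the same amount of data over each link. This Python sketch uses the two peak figures quoted above and ignores latency and protocol overhead, so real transfers are slower on both; the function name and the 64 MB example size are arbitrary choices for illustration.

```python
def transfer_time_us(num_bytes, bandwidth_gb_per_s):
    """Ideal time in microseconds to move num_bytes at the given GB/s."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


def link_slowdown(fast_gb_per_s, slow_gb_per_s):
    """How many times longer the slower link takes for the same payload."""
    return fast_gb_per_s / slow_gb_per_s
```

With the listed peaks, 64 MB takes about 2000 microseconds at 32GB/s and roughly 213 microseconds at 300GB/s, a factor of a little over nine. When a synchronization like that happens many times per generated token, the difference shows up directly in token speed.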

So when judging whether 2x V100 16GB is worthwhile, do not only add the VRAM to 32GB. Also check whether the cards are PCIe or SXM2/NVLink.

A Practical Buying Rule

If the model fits on one 32GB GPU, choose the single card first. Its latency, stability, and tuning cost are usually better.

If the model does not fit on one 16GB GPU but can fit on two 16GB GPUs, dual cards are worth using. At that point, the goal is to keep weights on GPU as much as possible, not to expect linear performance scaling.

If you have dual V100 PCIe cards, start with --split-mode layer and aim for stable execution with less CPU fallback.

If you have V100 SXM2/NVLink, it is more worth testing tensor-related modes, especially for prefill, larger batches, or concurrent serving.

When to Buy 2x16GB and When to Buy 1x32GB

If you serve only one user and mainly do chat, code completion, Continue Agent, or long-context Q&A, and the target model fits within 32GB, 1x32GB is usually the better choice. It avoids cross-GPU scheduling, has steadier latency, and is easier to debug.

If you already own one 16GB card and want a lower-cost path to run 30B or 32B models, or higher-precision quantizations that no longer fit in 16GB, 2x16GB makes sense. It may not double tokens/s, but it can keep weights on GPU that would otherwise require CPU offload.

If you are buying from scratch, the priority can look like this:

Single model, single user, latency-sensitive: prefer 1x32GB.
Model does not fit on one card and budget is limited: consider 2x16GB.
Machine has NVLink or SXM2: 2x16GB is much more interesting than ordinary PCIe dual cards.
You want longer context later: do not only count model weights; reserve VRAM for KV cache too.

Practical Advice for layer split and tensor split

The practical rule is: start with layer, then benchmark tensor.

layer is the default starting point. It splits the model by layer, has better compatibility, and is friendlier to PCIe dual-card systems. The downside is that generation can behave more like a pipeline: at certain moments one card is busy while the other waits.

tensor is better suited to machines with strong interconnects, such as V100 SXM2/NVLink. It splits part of the same layer’s computation across GPUs, so it has more parallelism in theory, but it also synchronizes across cards more often. On PCIe dual cards, communication overhead may eat the benefit.

You can start with these tests:

llama-bench -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1
llama-bench -m model.gguf -ngl 99 --split-mode tensor --tensor-split 1,1
llama-bench -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,0

The third command is not meant as the long-term configuration. It gives you a single-card reference, so you can see whether dual GPUs are actually faster or only distributing VRAM pressure.

Why prefill and decode Behave Differently

Local LLM performance should usually be viewed in two stages:

prefill: processes the input prompt. A typical metric is prompt-processing throughput such as pp512.
decode: generates the response token by token. A typical metric is token-generation throughput such as tg128.

prefill is more like large-batch matrix computation. With larger batches, it is easier to keep GPUs busy and more likely to benefit from multi-GPU parallelism. decode generates one token after another. The batch is smaller and synchronization is more frequent, so cross-card communication and scheduling latency are easier to notice.

That is why you may see dual GPUs improve pp512 while tg128 barely improves or even gets worse. For chat and agent workflows, user experience is closer to tg128. For long document ingestion, batch prefill, or concurrent serving, pp512 also matters.
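A simple way to connect the two metrics to perceived latency is to add the two stages. The Python sketch below treats pp and tg throughput as constant rates, which is a simplification, and the function name is made up for this example.

```python
def response_time_s(prompt_tokens, gen_tokens, pp_tok_s, tg_tok_s):
    """Estimated seconds for one request: prefill time plus decode time."""
    return prompt_tokens / pp_tok_s + gen_tokens / tg_tok_s
```

With invented numbers: a 2000-token prompt and a 300-token answer at 1000 tokens/s prefill and 30 tokens/s decode take about 12 seconds. If a dual-card setup raises prefill to 1500 tokens/s but drops decode to 25 tokens/s, the same request takes about 13.3 seconds, so the "faster" prefill still loses for chat-style use.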

Can KV cache Become a Second VRAM Bottleneck?

Yes. Many people only count model weights and forget KV cache.

Model weights decide whether the model can load. KV cache decides whether you can use the context length you want. The longer the context, the higher the concurrency, and the larger the batch, the more visible KV cache usage becomes. You may find that the model itself fits in 32GB, but 32K or 64K context pushes VRAM over the limit.

At minimum, leave VRAM headroom for:

KV cache
CUDA graph or backend runtime overhead
prompt batch and ubatch
desktop, driver, and other process usage

If you use 2x16GB, VRAM is not a fully equivalent 32GB pool. Some buffers, KV cache, or intermediate tensors may still be limited by remaining memory on a single card. When testing long context, use the target --ctx-size and target concurrency directly instead of only checking whether the model starts.
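For a rough sense of scale, KV cache size can be estimated from the model's attention shape. The Python sketch below uses the common formula for standard attention with grouped KV heads and one sequence; it does not cover models with other cache layouts, and a quantized KV cache changes bytes_per_elem. The function name and the example model shape are assumptions, not values from a specific model.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """Approximate KV cache size in GB (1 GB = 1e9 bytes) for one sequence.

    bytes_per_elem=2 corresponds to a 16-bit cache.
    """
    # Factor 2 covers keys and values, stored per layer and per token.
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return elems * bytes_per_elem / 1e9
```

For a hypothetical model with 32 layers, 8 KV heads, and a head size of 128, a 32K context costs about 4.3 GB, and 64K doubles that to about 8.6 GB. Multiply again by the number of parallel sequences if you serve several requests at once.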

How to Benchmark Dual Cards with llama-bench

llama-bench is better than direct chatting for hardware comparison because it separates prompt processing and token generation into comparable metrics. The default example in the official README is:

llama-bench -m model.gguf

For dual V100 cards, test at least these sets:

# Single-card baseline
CUDA_VISIBLE_DEVICES=0 llama-bench -m model.gguf -ngl 99

# Dual-card layer split
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1

# Dual-card tensor split
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m model.gguf -ngl 99 --split-mode tensor --tensor-split 1,1

Focus on two columns:

pp512: prompt processing, more relevant to long inputs and batch prefill.
tg128: token generation, more relevant to single-user chat and agent responsiveness.

Keep the model, quantization, context length, batch settings, driver version, and llama.cpp version fixed. Run each group several times and compare medians rather than one-off results. Finally, test your real workflow too, such as Continue Agent, an OpenAI-compatible server, or your own RAG requests, because a good benchmark does not always mean better interactive experience.
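Comparing medians across repeated runs takes only a few lines. The Python sketch below assumes you have already copied the tokens/s values from each run into lists by hand; it does not parse llama-bench output, and the function names and configuration labels are made up for this example.

```python
from statistics import median


def summarize(runs):
    """Map each configuration label to the median of its tokens/s values."""
    return {label: median(values) for label, values in runs.items()}


def relative_to(baseline_label, medians):
    """Express every median as a ratio of the baseline configuration."""
    base = medians[baseline_label]
    return {label: value / base for label, value in medians.items()}
```

For example, summarize({"single": [30.0, 31.0, 29.0], "layer": [24.0, 26.0, 25.0]}) gives medians of 30.0 and 25.0, and relative_to("single", ...) then reports the layer split at about 0.83 of the single-card speed. Do this separately for the pp512 and tg128 columns.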

One-Sentence Conclusion

The main advantage of 2x V100 16GB is VRAM capacity, not guaranteed generation speed. If the model fits on one card, a single 32GB card is usually faster and steadier. If the model does not fit on one 16GB card, dual 16GB cards become valuable because they avoid heavy CPU offload. Whether they are faster depends on split mode, batch size, model size, and whether the two V100 cards are connected through PCIe or NVLink.

RTX 5090 / 5080 AI Inference Benchmarks: Choosing for Local LLMs, 4K Video, and Real-Time 3D

Fri, 08 May 2026 10:07:19 +0800

For local AI users, the RTX 50 series is exciting not only because of gaming performance, but because Blackwell, GDDR7 memory, and fifth-generation Tensor Cores change what a desktop AI workstation can do. If you run local LLMs, image generation, video enhancement, or real-time 3D workflows, the GPU is no longer just a rendering device.

RTX 5090 and RTX 5080 should not be judged by model name alone. Both use Blackwell, support DLSS 4, fifth-generation Tensor Cores, and FP4, but local AI experience is usually decided by VRAM capacity, memory bandwidth, software support, and model compatibility.

The short version: RTX 5090 is the better single-card flagship for local AI, large models, long context, image generation, and video AI. RTX 5080 is better for smaller models, tighter budgets, and workflows that fit inside 16GB of VRAM. Both improve on the previous generation, but not every AI app can immediately use all Blackwell features.

Start With The Hardware Gap

RTX 5090 has 32GB GDDR7, a 512-bit memory bus, 21760 CUDA cores, and 3352 AI TOPS. Public testing from Puget Systems also highlights about 1.79TB/s of memory bandwidth, compared with RTX 4090’s 24GB and about 1.01TB/s. That matters for AI workloads.

RTX 5080 is more restrained: 16GB GDDR7, a 256-bit memory bus, 10752 CUDA cores, and 1801 AI TOPS. Its bandwidth is about 960GB/s, a clear jump over RTX 4080-class cards, but VRAM stays at 16GB.

That gives the two cards very different roles:

RTX 5090 is stronger for larger models, longer context, and heavier multimodal workloads because of 32GB VRAM and high bandwidth.
RTX 5080 is more cost- and power-conscious, and fits small to medium models, image generation, lighter video work, and development.
If a workload is already VRAM-limited, RTX 5080 cannot solve that with compute alone.
If a workload is software-limited, RTX 5090 may not always pull far ahead of RTX 4090 in proportion to its specs.

Local AI inference often follows a simple rule: VRAM decides whether it runs, bandwidth decides how fast it feels. That is why RTX 5090 is more attractive for local LLM users.
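The "bandwidth decides how fast it feels" part can be made concrete with a rough ceiling. Single-user decode on a dense model has to read roughly all the weights once per generated token, so memory bandwidth divided by model size gives an upper bound on tokens per second. This Python sketch is an estimate under that assumption only; real speeds are lower, and mixture-of-experts models behave differently. The function name is made up for this example.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    """Rough upper bound on decode tokens/s for a dense model.

    Assumes every generated token reads all weights from VRAM once.
    """
    return bandwidth_gb_s / model_size_gb
```

For a model that occupies 8 GB, the ceiling is about 120 tokens/s at 960GB/s and about 224 tokens/s at 1.79TB/s. The ratio between the two cards, not the absolute numbers, is the useful part of this estimate.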

Local LLMs: 32GB Matters More

When running LLMs, VRAM is mainly used by model weights, KV cache, and runtime overhead. Larger models, longer context, and higher concurrency all increase pressure.

RTX 5080’s 16GB can cover many 7B, 8B, and 14B models, and can run some larger models with 4-bit quantization. But if you want 30B-class models, longer context, or WebUI, RAG, voice, and tool calls at the same time, 16GB becomes a limit quickly.

RTX 5090’s 32GB gives local inference much more room. It is better for:

Running quantized models around the 30B level.
Keeping longer context on 7B and 14B models.
Local coding assistants, knowledge-base Q&A, and Agent debugging.
Loading embedding, reranker, or multimodal components alongside the main model.
Reducing model switching and context compromises on a single machine.

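Still, 32GB is not magic. A 70B-class model at 4-bit quantization needs roughly 35GB or more for the weights alone, so it cannot sit entirely in 32GB of VRAM; it needs partial CPU offload or a lower-bit quantization, plus careful context and runtime settings. For high-concurrency service, multi-GPU or server GPUs remain more suitable.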

For personal use, RTX 5090’s biggest benefit is less friction: more model choices, more comfortable context length, and enough room for GUI tools and companion components.

FP4 Is Potential, Not Instant Acceleration Everywhere

One major Blackwell change is FP4 support in fifth-generation Tensor Cores. NVIDIA’s TensorRT materials note that FP4 can reduce model memory use and data movement, and can help local inference for generative models such as FLUX.

That is important for image generation and future LLM inference. Lower precision means less VRAM pressure and less bandwidth pressure. On a high-bandwidth GPU such as RTX 5090, FP4 can theoretically amplify the advantage if frameworks and models support it well.

But FP4 gains depend on the software path:

Whether the model has a suitable FP4 quantized version.
Whether the inference framework supports the needed operators.
Whether TensorRT, ComfyUI, PyTorch, ONNX, or plugins are adapted.
Whether the task can accept the precision tradeoff.
Whether the user is willing to adjust the workflow for speed.

So RTX 50 AI performance should not be judged only by FP4 peak numbers. Blackwell provides the hardware base, but the real experience depends on app updates. Early adopters will see some benefits first; mainstream users may need to wait for the ecosystem.

Image Generation And 4K Video: Bandwidth And VRAM Work Together

Stable Diffusion, FLUX, video super-resolution, frame interpolation, denoising, matting, and generative video all care about VRAM. Higher resolution costs more memory; more nodes add runtime overhead; ControlNet, LoRA, high-res fix, and batch generation increase pressure further.

RTX 5080 can handle many image-generation jobs inside 16GB. For 1024px images, light LoRA use, and normal ComfyUI workflows, it is already fast enough. Problems appear with larger canvases, more complex node graphs, higher batch sizes, or long-sequence video generation.

RTX 5090 has clearer advantages in 4K video workflows:

32GB VRAM is better for high-resolution frames, long sequences, and complex node graphs.
Around 1.79TB/s bandwidth helps reduce data-movement bottlenecks.
Three ninth-generation NVENC encoders are useful for export, transcoding, and creator workflows.
Once FP4 and TensorRT support matures, image generation models may benefit more.

Public video AI benchmarks also call for caution: application optimization has not fully caught up. Puget Systems found that RTX 5090 does not always dramatically beat RTX 4090 in DaVinci Resolve AI and Topaz Video AI, and RTX 5080 does not always create a large gap over RTX 4080-class cards. Video AI is not just about specs; plugins, drivers, and model implementations matter.

In other words, RTX 50 is more compelling if your workflow already supports Blackwell, TensorRT, or FP4. If you mostly rely on commercial software that has not been optimized yet, the upgrade value depends on the exact version.

Real-Time 3D And AI Modeling: RTX 5090 Fits Heavier Scenes

Real-time 3D modeling, neural rendering, 3D asset generation, and viewport AI acceleration use CUDA, RT Cores, Tensor Cores, and VRAM at the same time. Unlike pure LLM work, the goal is not only token speed. Scene complexity, materials, geometry, ray tracing, AI denoising, and viewport frame rate all matter.

RTX 5080 can handle many 4K gaming, real-time preview, and medium-scale creative projects. For independent creators, it is a realistic high-performance option.

RTX 5090 is a better fit for:

Complex 3D scene preview.
High-resolution materials and large asset libraries.
AI denoising, upscaling, and generative modeling assistance running together.
Heavy D5 Render, Blender, Unreal Engine, and similar workloads.
Modeling while also running a local AI assistant or reference-image generator.

NVIDIA says RTX 50 can improve generative AI, video editing, and 3D rendering in creative apps, but production projects still depend on whether the software uses the new hardware paths. The reliable method is to test with your own project files, not only marketing charts.

How To Choose

If your main goal is local LLMs, start with VRAM. RTX 5080’s 16GB can run many lightweight models, but it is closer to an entry-level high-performance local AI card. RTX 5090’s 32GB is closer to a single-card local LLM workstation.

For image generation, RTX 5080 covers many daily workflows. If you often use high resolution, complex node graphs, batch generation, FLUX, or video generation, RTX 5090’s VRAM headroom matters more.

For 4K video AI, RTX 5090 is safer, but check the exact software version. Topaz, DaVinci Resolve, ComfyUI, TensorRT plugins, and drivers can all affect results.

For real-time 3D, RTX 5080 can satisfy many creators. RTX 5090 is better for heavier scenes, parallel apps, and long production sessions.

If you already own an RTX 4090, upgrade carefully. RTX 5090 has more VRAM and bandwidth, but some AI software has not fully unlocked Blackwell yet. Unless you clearly need 32GB, higher bandwidth, or the new encoders, waiting for the ecosystem is reasonable.

If you are still on RTX 30 series or older, RTX 50 will feel much more meaningful. Moving from 8GB, 10GB, or 12GB to 16GB or 32GB directly expands what local AI can run.

Summary

RTX 5090 and RTX 5080 both push consumer GPUs further into local AI, but they serve different users.

RTX 5090 is about 32GB GDDR7, very high memory bandwidth, and a stronger creative hardware stack. It suits users who want larger local models, more complex image generation, heavier video AI, and real-time 3D on one machine.

RTX 5080 is about entering Blackwell at a lower cost. It suits small and medium models, daily image generation, development tests, and high-performance creative work that fits in 16GB.

The buying rule is simple: first check whether your models and projects fit in VRAM, then check whether your software is optimized for Blackwell, and only then look at theoretical AI TOPS. For local AI, finishing reliably matters more than peak numbers.

DeepSeek V4 Local Private Deployment: Choosing Domestic Chips or Consumer GPU Clusters

Fri, 08 May 2026 09:39:35 +0800

After DeepSeek V4 was released, many enterprises started asking one question: can we avoid external APIs and deploy the model in our own data center, private cloud, or dedicated cluster?

This is a very practical need. Finance, healthcare, government, manufacturing, legal, and R&D teams often cannot send internal documents, code, contracts, tickets, or customer data directly to public cloud models. For these scenarios, DeepSeek V4 is attractive not only because of model capability, but because it gives enterprises an option closer to controllable LLM infrastructure.

However, local deployment of DeepSeek V4 is not as simple as downloading a model and finding a few GPUs. Especially for very large MoE models such as Pro, total parameter size, active parameters, context length, KV cache, concurrency, and inference framework all directly affect hardware cost. Rather than blindly chasing the full version, enterprises should first decide what deployment shape the business actually needs.

Clarify the Deployment Goal First

Enterprise local private deployment usually has three goals:

Keep data inside the domain: internal documents, code, customer materials, logs, and knowledge bases do not leave the enterprise environment.
Make operations stable and controllable: model services, permissions, audit, logs, and upgrade cadence are controlled by the enterprise.
Reduce long-term cost: for high-frequency calls, local inference may be more controllable than long-term external API purchases.

If only a few employees ask occasional questions, local deployment may not be cost-effective. Private deployment is truly suitable for high-frequency, stable, data-sensitive, and workflow-defined scenarios, such as:

Internal knowledge-base Q&A.
Code review and development assistants.
Customer-service ticket summarization.
Contract, medical-record, and report analysis.
Database query assistants.
Agent workflow automation.

These scenarios share the same traits: sensitive data, stable call patterns, and the ability to fit into enterprise governance through permissions and logs.

Do Not Chase Full Pro From Day One

Common DeepSeek V4 versions include Pro and Flash. In public materials, Pro targets stronger reasoning and complex Agent tasks, while Flash emphasizes cost and response speed. Enterprises should not assume every workload needs Pro.

You can split tasks by complexity:

Simple Q&A, summarization, classification, and tag generation: prioritize Flash or smaller models.
Internal knowledge-base retrieval augmentation: Flash is enough for many cases; RAG, permissions, and retrieval quality matter more.
Code Agents, complex reasoning, and long-context analysis: then evaluate Pro.
High-value, low-frequency tasks: Pro can be used, but high concurrency may not be necessary.
Regular office assistants: there is no need to occupy the most expensive inference resources for long periods.
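The split above can be expressed as a small lookup table. This is a minimal sketch: the task labels and the tier names "flash" and "pro" are hypothetical placeholders for whatever identifiers your own gateway uses, and unknown task types fall back to the cheaper tier.

```python
# Hypothetical task labels and model tiers; adjust to your own deployment.
ROUTES = {
    "simple_qa": "flash",
    "summarization": "flash",
    "classification": "flash",
    "tagging": "flash",
    "knowledge_base": "flash",
    "code_agent": "pro",
    "complex_reasoning": "pro",
    "long_context": "pro",
}


def pick_model(task_type: str, default: str = "flash") -> str:
    """Return the model tier for a task, falling back to the cheaper tier."""
    return ROUTES.get(task_type, default)
```

Defaulting to the lighter tier keeps unclassified traffic from silently occupying the most expensive inference resources.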

The advantage of MoE models is that each inference only activates part of the parameters, but this does not mean the hardware pressure is small. Weight storage, expert parallelism, network communication, context cache, and concurrent scheduling are still heavy. With 1M-token-level long context in particular, the real resource consumer is often not a single answer, but long context, multi-user concurrency, and persistent sessions.

Domestic Chip Route: Better for Enterprise Private Cloud

If an enterprise already has a domestic compute pool, or has requirements around Xinchuang (China’s domestic IT innovation program), compliance, or supply-chain control, it can first evaluate domestic chips such as Ascend and Cambricon.

The advantages of this route are:

Better alignment with localization and supply-chain control requirements.
Suitable for enterprise data centers, dedicated clouds, and government/enterprise projects.
Easier to unify permissions, audit, resource isolation, and operations.
Friendlier to long-term stable services.

But the domestic chip route also has three practical issues.

First, framework adaptation. Whether the model can run depends not only on chip compute power, but also on the maturity of the inference framework, operators, communication libraries, quantization formats, MoE expert parallelism, and long-context optimization.

Second, engineering experience. Enterprises need more than “it starts successfully”; they need stable services: multi-tenancy, rate limiting, monitoring, failure recovery, gray releases, log audit, and permission isolation all need to be built.

Third, ecosystem differences. The same model will not have identical performance, accuracy, quantization support, or deployment tools on NVIDIA, Ascend, Cambricon, and other platforms. Before launch, real stress testing is required instead of relying only on nominal compute.

Therefore, domestic chips are more suitable for enterprises with clear budgets, high compliance requirements, and a willingness to invest in platform engineering. This is not the easiest route, but it may be the one that best fits long-term governance.

Consumer GPU Clusters: Better for Pilots and Small Teams

If the goal is to validate business value first, a consumer GPU cluster is easier to start with. GPUs such as RTX 4090, RTX 5090, RTX 3090, and RTX 3060 12GB have more community tools, quantized models, and local inference references, so trial-and-error cost is lower.

The consumer GPU route fits:

Internal pilots by R&D teams.
Knowledge-base Q&A for small and medium businesses.
Low-concurrency code assistants.
Offline document processing.
Internal tools without strict SLA requirements.

But it also has obvious limits:

VRAM is small, making it hard to host a full large model directly.
Multi-GPU communication is weak, and cross-machine communication is more troublesome.
Long-term full-load stability is weaker than server-grade solutions.
Chassis, power, cooling, drivers, and operations become hidden costs.
It is not suitable for promising enterprise-grade high availability from the start.

A more realistic approach is to first run Flash, distilled versions, quantized versions, or smaller models on consumer GPUs, get the business workflow working, and then decide whether to migrate to server GPUs or a domestic compute platform after call volume, quality, and data governance have been validated.

A Possible Deployment Architecture

A relatively stable enterprise private architecture can be divided into six layers:

Model layer: DeepSeek V4 Pro, V4 Flash, or smaller distilled models selected by task.
Inference layer: SGLang, vLLM, llama.cpp, vendor NPU inference stacks, or enterprise self-developed services.
Gateway layer: unified authentication, rate limiting, audit, model routing, and call logs.
Knowledge layer: vector database, full-text search, document parsing, permission filtering, and RAG.
Application layer: customer service, code assistants, document analysis, report Q&A, and Agent workflows.
Operations layer: monitoring, alerts, cost statistics, gray releases, rollback, and security audit.

The gateway layer and knowledge layer are the easiest to underestimate. Many projects fail not because the model is completely unusable, but because permissions, retrieval, logs, context management, prompt templates, and business workflows were not done well.
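To make the gateway's rate-limiting role concrete, here is a minimal token-bucket sketch. It is one common way to implement rate limiting, not a description of any particular gateway product; the class name and parameters are illustrative, and the clock is injectable so the behavior can be checked without real waiting.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an idle client can burst
        self._now = now
        self._last = now()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In practice a gateway would keep one bucket per user, department, or API key, and could set `cost` from the requested token count rather than counting every call as 1.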

When deploying LLMs internally, enterprises should treat the model as infrastructure, not as an isolated chat page. The real value appears only when the model enters workflows and can stably process the enterprise’s own data and tasks.

Hardware Selection

Hardware selection should not only ask “can it run”; it should also ask “can it serve stably”.

You can choose by stage:

Validation Stage

The goal is to prove whether the business is worth doing.

Use 1-4 consumer GPUs.
Prioritize Flash, smaller models, distilled models, or quantized models.
Keep concurrency low and focus on task completion rate.
Do not promise high availability.

Do not buy large-scale hardware too early at this stage. First confirm whether employees actually use it, whether the business really saves time, and whether answers can enter real workflows.

Pilot Stage

The goal is to let one department or one business line use it steadily.

Use 4-16 GPUs or a set of domestic NPU nodes.
Add a unified gateway, logs, and permission controls.
Build RAG, document parsing, model routing, and caching.
Start tracking tokens, concurrency, latency, and failure rate.

At this stage, operations begin to matter. Model quality is only one part; stability, cost, and data governance are equally important.
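The metrics tracked at this stage can start as a simple summary over gateway logs. The sketch below assumes each log record has been reduced to a (latency in seconds, success flag) pair; that record shape and the function name are assumptions made for illustration.

```python
def summarize(records):
    """Summarize (latency_seconds, ok) tuples into count, failure rate, and p95 latency."""
    if not records:
        raise ValueError("no records to summarize")
    latencies = sorted(latency for latency, _ in records)
    failures = sum(1 for _, ok in records if not ok)
    n = len(latencies)
    # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
    rank = (95 * n + 99) // 100
    return {
        "count": n,
        "failure_rate": failures / n,
        "p95_latency_s": latencies[rank - 1],
    }
```

Tracking a high percentile rather than the mean matters here because a few very long generations can dominate the experience for real users.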

Production Stage

The goal is to enter enterprise-grade service.

Use server GPUs, domestic compute clusters, or private-cloud resource pools.
Build multi-replica deployment, rate limiting, failover, and capacity planning.
Route models by task: simple tasks use lightweight models, complex tasks use Pro.
Connect to enterprise identity systems, audit systems, and security policies.

In production, it is not recommended to send every request to the strongest model. Proper model routing usually saves more money than simply adding hardware.

Choosing an Inference Framework

Models such as DeepSeek V4 have high requirements for inference frameworks. When MoE, long context, sparse attention, quantization, and multi-GPU parallelism are involved, framework maturity directly affects speed and stability.

Common choices can be understood this way:

SGLang: suitable for teams focused on high-performance inference, Agents, multi-turn tool calls, and complex service orchestration.
vLLM: mature ecosystem, suitable for general LLM services, but actual support depends on version and model adaptation progress.
llama.cpp: better for small models, quantized models, and edge deployment; not suitable for directly hosting a full very large MoE model.
Domestic NPU inference stacks: suitable for Xinchuang and domestic compute environments, but operator, quantization, and long-context support must be carefully verified.

Do not choose a framework only by benchmark. Enterprises should test their own real inputs: internal document length, concurrency, average output length, RAG hit rate, number of Agent tool calls, and retry count after failures.

Data Security Must Be Built Outside the Model

Private deployment does not automatically mean security. Running the model locally only solves part of the question of whether data leaves the enterprise.

You still need:

Accounts and permissions: different departments can only access their own knowledge bases.
Log audit: who asked what, which model was called, and which documents were accessed.
Data masking: customer information, ID numbers, phone numbers, contract amounts, and other sensitive fields must be handled.
Prompt security: prevent users from bypassing permissions or leaking system prompts through prompts.
Output review: important scenarios need human review or rule-based review.
Data lifecycle: uploaded documents, vector indexes, caches, and session records must be deletable.
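As one example of the data-masking item, sensitive fields can be masked before text reaches prompts or logs. The patterns below are illustrative assumptions only (11-digit mobile numbers starting with 1, and 18-character ID numbers); a real deployment needs patterns and review rules matched to its own data.

```python
import re

# Illustrative patterns, not a complete detector.
PHONE = re.compile(r"(?<!\d)(1\d{2})\d{4}(\d{4})(?!\d)")
ID_NUMBER = re.compile(r"(?<!\d)(\d{6})\d{8}(\d{3}[\dXx])(?!\d)")


def mask(text: str) -> str:
    """Mask the middle digits of ID numbers and phone numbers."""
    text = ID_NUMBER.sub(r"\1********\2", text)
    text = PHONE.sub(r"\1****\2", text)
    return text
```

Regex masking only covers well-structured fields; names, addresses, and contract amounts usually need dictionary- or model-based detection on top.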

Enterprise local LLM deployment cannot involve only the algorithm team. Security, legal, operations, and business owners should all participate; otherwise, risks will be exposed after launch.

Cost Is More Than GPUs

The cost of local deployment is often underestimated. Beyond GPUs or NPUs, you also need to count:

Servers, racks, power, cooling, and networking.
Storage and backup.
Inference framework adaptation and engineering development.
Operations monitoring and incident handling.
Model upgrades, rollback, and compatibility tests.
Security audit and permission systems.
Business-side prompts, RAG, and workflow construction.

If call volume is very low, external APIs may be cheaper. If call volume is high, data is sensitive, and workflows are stable, local deployment is more likely to amortize cost.
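The API-versus-local tradeoff can be framed as a break-even volume. This is a simplified sketch: it assumes one fixed monthly cost for the local deployment (amortized hardware plus operations) and flat per-million-token prices, and every number you feed it is your own estimate rather than a published figure.

```python
def break_even_mtok(fixed_monthly_cost: float,
                    api_price_per_mtok: float,
                    local_cost_per_mtok: float) -> float:
    """Monthly volume, in millions of tokens, above which local deployment is cheaper."""
    saving = api_price_per_mtok - local_cost_per_mtok
    if saving <= 0:
        return float("inf")  # local never pays off at these prices
    return fixed_monthly_cost / saving
```

For example, with a hypothetical fixed cost of 3000 per month, an API price of 2.0 per million tokens, and a local marginal cost of 0.5, the break-even point is 2000 million tokens per month.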

A more reasonable strategy is hybrid deployment:

Highly sensitive data goes to local models.
Low-sensitivity general tasks can use external APIs.
Simple tasks use small models.
Complex tasks use DeepSeek V4 Pro.
High-frequency tasks prioritize caching, retrieval, and model routing optimization.

Recommended Rollout Path

Enterprises can proceed in this order:

Choose 2-3 high-value scenarios first; do not roll out company-wide.
Use consumer GPUs or small-scale compute for a PoC.
Run Flash, distilled models, or quantized models first, and connect RAG and permissions.
Introduce Pro for comparison tests on complex tasks.
Record real call volume, latency, failure rate, and time saved by humans.
Then decide whether to purchase domestic chip clusters or server GPUs.
Before production, complete gateway, audit, monitoring, rate limiting, and rollback.

This path is more stable than buying a large cluster from the start. The biggest enterprise risk is not that the model is not strong enough, but that a lot of money is spent before the business workflow is ready to absorb the model capability.

Summary

DeepSeek V4 gives enterprises more room to imagine local private deployment, but it is not simply a “local ChatGPT”. The real difficulty is engineering: hardware, frameworks, model routing, permissions, RAG, audit, monitoring, and cost control all need to be considered together.

The domestic chip route better fits enterprises with high compliance requirements and long-term private cloud plans. Consumer GPU clusters are better for pilots and quick validation by small and medium teams. Pro fits complex reasoning and Agent tasks; Flash or smaller models fit many ordinary tasks.

If you only remember one sentence: DeepSeek V4 private deployment should not start with hardware procurement, but with business scenarios, data boundaries, and call volume. First get the scenario working, then decide whether to use a large model, how large it should be, and what compute platform to use.

Local LLM Models Recommended for an RTX 3060 GPU

Fri, 08 May 2026 09:25:24 +0800

The most common RTX 3060 variant has 12GB of VRAM. It is not a top-tier AI GPU, but it is a very usable card for local LLMs, especially 7B, 8B, 9B, and 12B models.

If you only want a quick rule of thumb, remember this:

On an RTX 3060 12GB, prioritize models of around 8B in Q4_K_M or Q5_K_M quantization. Choose Q4 for stability, and try Q5 if you want better quality.

Do not start by chasing 32B or 70B models. Even if they can run with low-bit quantization and CPU offloading, their speed and experience are usually not suitable for daily use.

Start With the VRAM Limit

For local LLMs on an RTX 3060 12GB, the real limit is VRAM.

Model Size	Recommended Quantization	RTX 3060 12GB Experience
3B / 4B	Q4, Q5, Q8	Very easy, fast
7B / 8B / 9B	Q4_K_M, Q5_K_M	Best balance of quality and speed
12B / 14B	Q4_K_M	Usable, but avoid huge context
30B+	Q2 / Q3 or partial offload	Possible to tinker with, not recommended daily
70B+	Very low quantization or heavy CPU/RAM use	More like an experiment

Local LLMs do not only consume VRAM for the model file. Context length, KV cache, batch size, inference framework, and drivers all consume resources.

So 12GB of VRAM does not mean you can load a 12GB model file directly. It is better to leave room for the system and context.
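A simple budget check captures this rule of thumb. The sketch below is an assumption-laden estimate: the 1.5GB reserve for runtime buffers and desktop use is a placeholder, and the function name is illustrative.

```python
def fits_in_vram(model_file_gb: float, kv_cache_gb: float,
                 vram_gb: float = 12.0, reserve_gb: float = 1.5) -> bool:
    """True if weights plus KV cache still leave `reserve_gb` free on the card."""
    return model_file_gb + kv_cache_gb + reserve_gb <= vram_gb
```

With these placeholder numbers, a model file of roughly 5GB plus 1GB of KV cache fits comfortably, while an 11.5GB file does not, even though it is smaller than the card's nominal 12GB.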

Recommendation 1: Qwen3 8B

If you mainly use Chinese, Qwen3 8B is one of the first models worth trying on an RTX 3060.

Good for:

Chinese Q&A.
Summarization and rewriting.
Everyday knowledge assistant work.
Simple code explanation.
Local RAG.
Lightweight Agent flows.

Recommended choice:

Qwen3 8B GGUF
Q4_K_M: first choice
Q5_K_M: better quality, more VRAM pressure

Qwen models are friendly to Chinese usage. For daily writing, information organization, and Chinese instruction following, Qwen3 8B is a good first model.

Recommendation 2: Llama 3.1 8B Instruct

Llama 3.1 8B Instruct is a stable general-purpose model with mature English capability and ecosystem support.

Good for:

English Q&A.
Lightweight coding help.
General chat.
Document summarization.
Prompt testing.
Comparing different inference tools.

Recommended choice:

Llama 3.1 8B Instruct GGUF
Q4_K_M: better speed and VRAM stability
Q5_K_M: better answer quality

If you mainly process English materials, or want a model with many tutorials and broad compatibility, Llama 3.1 8B is still a good baseline.

Recommendation 3: Gemma 3 12B

Gemma 3 12B is closer to the upper practical limit for an RTX 3060 12GB.

It uses more VRAM than 8B models, but Q4 quantization can still make it usable on a 12GB card. It is a good option if you want to try a slightly larger model on one GPU.

Good for:

Higher-quality general Q&A.
English content processing.
More complex summarization and analysis.
Trying an upgrade over 8B models.

Recommended choice:

Gemma 3 12B GGUF
Q4_K_M or official QAT Q4
Keep context modest

If you run out of VRAM, reduce context length first, or return to an 8B model. For an RTX 3060, 12B is “worth trying,” not a no-brainer default.

Recommendation 4: DeepSeek R1 Distill Qwen 8B

If you want to experience reasoning-style local models, try models like DeepSeek R1 Distill Qwen 8B.

Good for:

Simple reasoning tasks.
Step-by-step analysis.
Learning reasoning-model output style.
Low-cost local experiments.

Recommended choice:

DeepSeek R1 Distill Qwen 8B GGUF
Q4_K_M

These models may produce longer reasoning-style outputs, so speed and context usage can be heavier than ordinary instruction models. They are not always more comfortable for daily chat, but they are useful for reasoning experiments.

Recommendation 5: Phi / MiniCPM / Smaller Models

If your RTX 3060 is an 8GB variant, or your system RAM is limited, consider 3B and 4B models first.

Good for:

Fast Q&A.
Simple summaries.
Embedding into local tools.
Low-latency chat.
Testing on older machines.

These models may not match 8B or 12B quality, but they are light, fast, and easy to deploy.

Which Quantization to Use

Local models commonly use GGUF, with quantization types such as Q4, Q5, Q6, and Q8.

Quantization	Traits	Best For
Q4_K_M	Small, fast, good enough	RTX 3060 first choice
Q5_K_M	Better quality, higher usage	Try with 8B models
Q6 / Q8	Closer to original quality, larger	Small models or more VRAM
Q2 / Q3	Saves VRAM but quality drops	Large-model tinkering

For RTX 3060 12GB, the practical choices are:

8B models: Q4_K_M or Q5_K_M
12B models: Q4_K_M first
Larger models: not recommended as daily drivers

Which Tool to Use

Beginners can start with Ollama, because installing it and running models are both simple.

Common commands:

ollama run qwen3:8b
ollama run llama3.1:8b

If you want finer control over GGUF files, GPU layers, and context length, use llama.cpp or GUI tools based on it.

Common choices:

Ollama: easiest, best for beginners.
LM Studio: friendly GUI, good for downloading and switching models.
llama.cpp: most control, best for performance tuning.
text-generation-webui: many features, good for backend testing.

For local chat and simple Q&A, Ollama or LM Studio is enough.

Do Not Set Context Too High

Many models advertise long-context support, but do not blindly set context to the maximum on an RTX 3060.

Longer context uses more KV cache and increases VRAM pressure. Even if the model loads, long context can slow generation down.

Suggested settings:

Normal chat: 4K to 8K
Document summaries: 8K to 16K
Long-document RAG: chunk first; do not paste everything at once

An RTX 3060 is better suited to “moderate context + good model + good retrieval” than forcing hundreds of thousands of tokens into one prompt.
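To see why context length matters, the KV cache can be estimated from the model shape. The sketch below assumes an unquantized 16-bit cache and uses the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) as the example; the function name is illustrative, and runtimes that quantize the KV cache will use less.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: keys and values for every layer, KV head, and token."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total / 2**30
```

Under these assumptions an 8K context costs about 1 GiB, while a 128K context would cost about 16 GiB, which is more than the entire card before any weights are loaded.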

Choose by Use Case

If you mainly write Chinese:

First choice: Qwen3 8B Q4_K_M
Alternative: DeepSeek R1 Distill Qwen 8B

If you mainly write English:

First choice: Llama 3.1 8B Instruct Q4_K_M
Alternative: Gemma 3 12B Q4_K_M

If you want speed:

3B / 4B models
8B Q4_K_M
Keep context at 4K to 8K

If you want better quality:

8B Q5_K_M
12B Q4_K_M
Accept slower speed

If you want coding help:

8B coding models can help with explanations and small edits
For complex engineering tasks, use stronger cloud models

Local RTX 3060 models are good for code explanation, function completion, small scripts, and offline assistance. For large refactors, difficult bugs, and cross-file Agent work, do not expect Claude Sonnet or GPT-5-level performance.

Reasonable Expectations

The RTX 3060 12GB is good enough to turn local LLMs from toys into daily tools, but it will not recreate top cloud models at home.

Its strengths:

Low cost.
More VRAM than 8GB cards.
Good 8B model experience.
Offline use.
Local processing for privacy-sensitive materials.

Its limits:

Large models are hard to run smoothly.
Long context consumes VRAM.
Slower than high-end GPUs.
Small local models have limited complex reasoning.
Multimodal and Agent workflows need more resources.

The stable route is: use 8B models as everyday local assistants, try 12B models for quality, and leave complex tasks to cloud models.

Summary

Recommended local LLM choices for RTX 3060 12GB:

Chinese general use: Qwen3 8B Q4_K_M
English general use: Llama 3.1 8B Instruct Q4_K_M
Higher-quality experiment: Gemma 3 12B Q4_K_M
Reasoning experiment: DeepSeek R1 Distill Qwen 8B Q4_K_M
Low-VRAM fast use: 3B / 4B small models

Choose Q4_K_M first. Try Q5_K_M for 8B models if you want better quality. Start with Ollama or LM Studio.

Do not treat the RTX 3060 as a large-model server. Treat it as a local knowledge assistant, privacy document processor, lightweight coding helper, and model experiment card, and it will fit its real capabilities much better.

References

Qwen3 8B GGUF: https://huggingface.co/Qwen/Qwen3-8B-GGUF
Llama 3.1 8B GGUF: https://huggingface.co/macandchiz/Llama-3.1-8B-Instruct-GGUF
Gemma 3 12B GGUF: https://huggingface.co/unsloth/gemma-3-12b-it-GGUF
llama.cpp: https://github.com/ggml-org/llama.cpp
Ollama: https://ollama.com

Hermes + Qwen3.6: A Low-Cost Local Agent Deployment

Mon, 04 May 2026 06:40:30 +0800

This article documents a local Agent deployment plan: run a Qwen3.6 GGUF model with llama.cpp inside WSL2, then connect Hermes Agent to the local OpenAI-compatible API. The result is a long-running local AI assistant on your own computer, with no per-token fees to an online service.

This setup suits users who want to try local AI Agents while keeping data private and under their own control over the long term. It can be used for daily Q&A, writing, coding assistance, document organization, and simple automation tasks. The larger the model, the higher the VRAM requirement. The original example uses Qwen3.6-27B, which runs more reliably with 24GB of VRAM. If your VRAM is smaller, choose a smaller model or a lower-bit quantization.

Architecture

The overall chain is simple:

Install WSL2 and Ubuntu 24.04 on Windows.
Install CUDA Toolkit inside WSL2 and compile llama.cpp.
Download the Qwen3.6 GGUF model.
Start a local model service with llama-server.
Install Hermes Agent and point it at http://localhost:8080/v1.
Optional: write a startup script so the model service starts automatically when WSL2 opens.

Hermes provides the Agent capability, while Qwen3.6 provides the local LLM capability. Together, they turn the computer into a private local AI assistant.

Install WSL2 and Ubuntu

Run in an administrator Windows PowerShell window:

wsl --install
wsl --set-default-version 2

After rebooting, install Ubuntu 24.04:

wsl --install -d Ubuntu-24.04

After installation, Ubuntu prompts you to set a username and password. Once inside Ubuntu, first check whether the NVIDIA GPU is visible in WSL2:

nvidia-smi

If the GPU cannot be detected, update the NVIDIA driver on Windows first. WSL2 inherits the Windows driver, but CUDA Toolkit still needs to be installed separately inside WSL2.

Install Python and Basic Tools

sudo apt update && sudo apt install -y python3-pip python3-venv

You also need build tools, Git, and CMake:

sudo apt install -y cmake build-essential git

Compile llama.cpp

Clone the repository:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

If CUDA is already available in WSL2, compile directly:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build -j$(nproc)

CMAKE_CUDA_ARCHITECTURES=89 is suitable for Ada GPUs, such as RTX 40 series cards. Adjust it according to your actual GPU architecture.
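If you are unsure which value to use, the CMake setting is simply the GPU's compute capability with the dot removed. The helper below is a small illustrative sketch; it assumes you already have the compute capability as a string such as "8.9". On recent drivers, nvidia-smi --query-gpu=compute_cap --format=csv,noheader should print it, though availability of that query field depends on the driver version.

```python
def cuda_arch_from_capability(compute_cap: str) -> str:
    """Convert a compute capability such as '8.9' into the CMake value '89'."""
    major, minor = compute_cap.strip().split(".")
    return f"{int(major)}{int(minor)}"
```

For example, 8.6 (RTX 30 series) becomes 86, 8.9 (RTX 40 series) becomes 89, and 12.0 (RTX 50 series) becomes 120.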

If compilation reports that CUDA Toolkit is missing, install CUDA Toolkit inside WSL2 first:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-8

Configure environment variables:

export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc

Then rebuild:

cd ~/llama.cpp
rm -rf build
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build -j$(nproc)

Download the Qwen3.6 GGUF Model

The example uses Qwen3.6-27B-UD-Q4_K_XL.gguf from unsloth/Qwen3.6-27B-GGUF:

hf download unsloth/Qwen3.6-27B-GGUF \
Qwen3.6-27B-UD-Q4_K_XL.gguf \
--local-dir ~/models/

The file is about 17GB. If Hugging Face is slow, use a mirror such as ModelScope. Do not force a 27B model if your VRAM is insufficient; use a smaller model or lower quantization.

Start the Local Model Service

Start llama-server with your own model file name:

~/llama.cpp/build/bin/llama-server \
--model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--n-gpu-layers 99 \
--ctx-size 32768 \
--flash-attn on \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--presence-penalty 1.5 \
--port 8080

After startup, open this in a Windows browser:

http://localhost:8080

For Hermes Agent or other OpenAI-compatible clients, the API endpoint is usually:

http://localhost:8080/v1
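Before connecting an Agent, it can help to confirm the endpoint answers a plain chat request. The sketch below uses only the Python standard library and assumes the server from the previous step is listening on port 8080; the function names and the "local" model label are placeholders, and a single-model llama-server typically does not need the label to match anything.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8080/v1",
                       model: str = "local") -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,  # placeholder label, see the note above
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling ask("Say hello in one sentence.") from a Python shell inside WSL2 should return a short reply if the service is healthy; a connection error usually means llama-server is not running or is listening on a different port.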

Thinking Mode Tradeoff

Qwen3.6 may enable Thinking mode by default. It is suitable for complex reasoning, complicated coding problems, and multi-step analysis, but it is slower.

To disable Thinking mode, stop the service and add --chat-template-kwargs:

~/llama.cpp/build/bin/llama-server \
--model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--n-gpu-layers 99 \
--ctx-size 32768 \
--flash-attn on \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--presence-penalty 1.5 \
--chat-template-kwargs '{"enable_thinking":false}' \
--port 8080

After disabling Thinking, simple Q&A, writing, code completion, and code explanation become faster. For complex algorithm design, difficult debugging, and architecture analysis, Thinking mode is still recommended.

Install Hermes Agent

Keep llama-server running, then open a new WSL2 terminal and install Hermes Agent:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

The installer handles dependencies such as Python, Node.js, ripgrep, and ffmpeg. When configuring the model endpoint, choose a custom endpoint:

URL: http://localhost:8080/v1
API Key: 12345678
Model: auto-detect

For a local llama-server, the API Key can be any placeholder value. After configuration, you can connect Telegram, WeChat, QQ, Discord, and other chat tools, allowing Hermes Agent to call the local model and execute tasks from those entry points.

Auto-Start the Model Service

You can write a startup script so the model service starts automatically when a WSL2 terminal opens.

Create the script:

cat > ~/start-llm.sh << 'EOF'
#!/bin/bash
echo "Starting Qwen3.6-27B llama-server..."
~/llama.cpp/build/bin/llama-server \
--model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--n-gpu-layers 99 \
--ctx-size 65536 \
--flash-attn on \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--presence-penalty 1.5 \
--port 8080 \
--host 0.0.0.0 &
echo "llama-server started, PID: $!"
echo "API: http://localhost:8080/v1"
echo "Chat UI: http://localhost:8080"
EOF
chmod +x ~/start-llm.sh

Write it into .bashrc:

echo '# Auto-start llama-server' >> ~/.bashrc
echo 'if ! pgrep -f "llama-server" > /dev/null 2>&1; then' >> ~/.bashrc
echo '    ~/start-llm.sh' >> ~/.bashrc
echo 'fi' >> ~/.bashrc

Each time you open a WSL2 terminal, it will start llama-server if it is not already running. If it is running, it skips startup and avoids duplicate processes.
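
The pgrep check only tells you that a process with that name exists. A complementary check is whether something is actually accepting connections on the port. The following is a small sketch of that idea in Python; the function name is_port_open is an illustrative choice, and a successful connection does not prove the listener is llama-server.

```python
import socket


def is_port_open(host="127.0.0.1", port=8080, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("port 8080 listening:", is_port_open())
```

Because large models can take a while to load, the port may stay closed for some time after the process has started.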

Notes

27B models require substantial VRAM; a 24GB GPU is the more stable choice. Use a smaller model if VRAM is limited.
--ctx-size 65536 significantly increases VRAM and RAM pressure. If unstable, reduce it to 32768 or lower.
Both CUDA Toolkit in WSL2 and the Windows GPU driver must work properly. Either side can cause CUDA compilation or runtime failures.
Hermes Agent calls the local service through an OpenAI-compatible API. The key is that http://localhost:8080/v1 responds correctly.
If accessing from a phone or another device, handle Windows Firewall, LAN addresses, and security isolation. Do not expose the local model service directly to the public internet.

Original article: Hermes + Qwen3.6: The Strongest Local Agent Combo! Zero Cost, Unlimited Tokens, Too Good! (Hermes + Qwen3.6：本地最强 Agent 组合！零成本、无限 Token，太香了！)
llama.cpp: ggerganov/llama.cpp
Hermes Agent: NousResearch/hermes-agent
Qwen3.6 GGUF example: unsloth/Qwen3.6-27B-GGUF

NVIDIA Releases Nemotron 3 Nano Omni: An Open Omnimodal Reasoning Model for Agents

Fri, 01 May 2026 12:07:15 +0800

NVIDIA has released Nemotron 3 Nano Omni, an open omnimodal reasoning model designed for agent workflows. Its focus is not simply text question answering, but putting language, vision, and audio into the same reasoning framework so the model can handle inputs that are closer to real work.

In positioning, Nemotron 3 Nano Omni looks more like a foundation model prepared for AI Agents. It can understand information from screens, documents, images, speech, and video, then turn that information into actionable reasoning results. This kind of capability fits computer operation, document intelligence, video understanding, voice interaction, customer service, education, and enterprise process automation.

Model Specs

Nemotron 3 Nano Omni uses a MoE architecture. The key specs NVIDIA lists are:

Item	Information
Model name	`Nemotron 3 Nano Omni`
Architecture	MoE
Parameter scale	30B total / 3B active
Modalities	Text, image, audio, video
Context length	256K tokens
License	Apache 2.0
Main deployment direction	AI Agents, multimodal reasoning, enterprise agents

The most notable point here is 30B-A3B. It means the model has about 30B total parameters, but only activates about 3B parameters during each inference step. This is a tradeoff between capability and inference cost: the model keeps a larger expert capacity while using only part of it at runtime.

That said, the active parameter count does not mean VRAM can be estimated as if this were only a 3B model. A full deployment still needs to account for all expert weights, KV cache, vision and audio encoder modules, context length, and inference framework overhead.
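
A rough back-of-the-envelope sketch makes the gap concrete. It only counts weight storage from the 30B total and 3B active figures above, takes 1 GB as 1e9 bytes, and ignores encoders, KV cache, and runtime overhead; the helper name weight_gb is illustrative.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


TOTAL_B, ACTIVE_B = 30, 3  # figures from the spec table above

if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit: all experts ~{weight_gb(TOTAL_B, bits):.1f} GB, "
              f"active slice only ~{weight_gb(ACTIVE_B, bits):.1f} GB")
```

The "all experts" column is the one that matters for loading the model; the "active slice" column only hints at per-token compute.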

It Is Not Solving a Single-Modality Problem

Traditional large language models mainly process text. Multimodal models add image understanding. Nemotron 3 Nano Omni has a broader target: it emphasizes omnimodal input, meaning text, images, audio, and video are all brought into a unified reasoning process.

This matters a lot for agents. Real agent tasks are often not “take a piece of text and generate another piece of text”; they are more like:

reading buttons, tables, and windows on a screen;
parsing PDFs, screenshots, charts, and webpages;
listening to spoken instructions or meeting recordings;
understanding actions, scenes, and timing in video;
combining those signals into the next operation.

If a model can only handle one modality, an Agent needs extra glue between multiple specialized models. The value of an omnimodal model is reducing that integration cost and letting the same model directly process more complex environmental inputs.

Built for Computer Operation and Document Intelligence

NVIDIA specifically notes that Nemotron 3 Nano Omni can be used for computer-operation tasks. These tasks usually require the model to understand user interfaces:

what controls are on the screen;
what state the current window is in;
which button or menu is the next target;
what the content in tables, dialogs, and input boxes means.

This is also one of the hard-to-avoid capabilities when AI Agents move into real deployment. If an agent is going to help people operate office software, browsers, enterprise backends, or developer tools, it has to understand the interface, not just read API docs.

Document intelligence follows a similar logic. Enterprise materials often mix text, tables, images, scanned pages, and charts. An omnimodal model can put all of that content into the same context for understanding, making it suitable for contract review, report analysis, invoice processing, knowledge-base QA, and process automation.

Audio and Video Bring Agents Closer to Real Scenarios

Audio and video inputs can noticeably expand the range of agent applications.

Audio scenarios include:

meeting recording summaries;
customer service call analysis;
voice command understanding;
education and training content organization.

Video scenarios include:

instructional video understanding;
security and industrial inspection;
screen recording analysis;
operation workflow review;
temporal reasoning in multi-step tasks.

If these tasks rely only on text transcription, a lot of visual and timing information is lost. An omnimodal model can directly combine voice, frames, and textual clues, giving Agents a more complete sense of their environment.

Deployment and Ecosystem

NVIDIA is placing Nemotron 3 Nano Omni inside an open ecosystem, and the model uses the Apache 2.0 license. That matters for developers and enterprises because it lowers the licensing barrier for experimentation, integration, and secondary development.

From NVIDIA’s introduction, this model is also closely tied to its inference ecosystem. For enterprise users, real deployment usually raises questions like:

whether it can run efficiently on NVIDIA GPUs;
whether it supports long context and multimodal input;
whether it can connect to existing Agent frameworks;
whether it can process internal documents, audio/video, and UI screenshots;
whether it can be deployed in private environments.

NVIDIA emphasizes that the model has a clear throughput advantage and says it can reach up to 9x the throughput of comparable open omnimodal reasoning models. The real value of that number still depends on the specific hardware, context length, input modalities, and inference framework. But the direction is clear: NVIDIA wants to bring open multimodal models and its inference infrastructure together into enterprise Agent scenarios.

Suitable Use Cases

Nemotron 3 Nano Omni is better suited to tasks such as:

Agents that need to understand text, images, audio, and video at the same time;
enterprise document intelligence and knowledge-base QA;
computer operation based on screenshots or web interfaces;
multimodal analysis of meetings, customer service, and teaching content;
video understanding, workflow review, and temporal reasoning;
teams that require open licensing and private deployment.

It is not necessarily a fit for every regular user. If the task is local chat, code completion, or simple QA, a single-modality language model may be lighter, faster, and more resource-efficient. The value of Nemotron 3 Nano Omni mainly appears in complex input and multimodal Agent workflows.

What This Means for AI Agents

For AI Agents to truly enter work scenarios, they cannot only write text. They need to understand interfaces, speech, documents, and changes in video, then turn that information into the next action.

That is where Nemotron 3 Nano Omni matters. It is not simply making the model larger; it is unifying the many kinds of input Agents face into one reasoning model. This can make it easier for developers to build agents for real tasks instead of building only around chat windows.

From this angle, the point of NVIDIA’s release is not just “another multimodal model”. It is part of a continuing effort to connect open models, GPU inference, enterprise Agents, and private deployment. What will be worth watching next is how it performs in concrete Agent frameworks, enterprise workflows, and local deployments.

References:

NVIDIA Technical Blog: NVIDIA Nemotron 3 Nano Omni

Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models

Fri, 01 May 2026 12:02:00 +0800

The Qwen3.6 open-weight models that are most relevant for local deployment are mainly:

Qwen3.6-27B: a 27B dense model.
Qwen3.6-35B-A3B: a 35B total / 3B active MoE model.

There are also online product or API model names such as Qwen3.6-Plus and Qwen3.6-Max. If a model does not have public full weights and stable quantized files, it is not suitable for a local VRAM table. This article only covers versions that can be deployed around Hugging Face weights and GGUF quantized files.

As with the Gemma 4 VRAM tables, two concepts need to be separated first:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by weights, KV cache, context length, runtime backend, multimodal modules, and batch size.

Qwen3.6 has a very long default context. The official model card states native support for 262,144 tokens and extension to 1,010,000 tokens. So the “minimum VRAM” column below only applies to short or medium context. If you really want 128K, 256K, or longer context, reserve much more room for KV cache.

Quick Summary

VRAM	Good Fit	Avoid
8GB	Extreme 2-bit tests for 27B / 35B-A3B, with clear quality risk	Q4 and above
12GB	27B Q2/Q3, 35B-A3B Q2/Q3 with short context	27B Q4 with long context
16GB	27B Q3/Q4, 35B-A3B Q3/IQ4_XS	35B-A3B Q4 with long context
24GB	27B Q4/Q5/Q6, 35B-A3B Q4	35B-A3B Q8, BF16
32GB	27B Q8, 35B-A3B Q5/Q6	BF16
48GB	35B-A3B Q8, 27B with longer context more comfortably	35B-A3B BF16
80GB+	27B / 35B-A3B BF16	No need to chase BF16 for ordinary local chat

If you have a 24GB GPU, focus on:

Qwen3.6-27B Q4_K_M
Qwen3.6-27B Q5_K_M
Qwen3.6-35B-A3B UD-Q4_K_M

If you only have 16GB VRAM, start with low-bit variants and do not enable very long context right away.

Official Weight Sizes

The following BF16 weight sizes come from model.safetensors.index.json in the official Hugging Face repositories. They are useful as a reference for the original model scale.

Model	Architecture	Official BF16 Weight Size	Official Context
`Qwen3.6-27B`	27B dense	55.56GB	Native 262K, extendable to 1,010K
`Qwen3.6-35B-A3B`	35B total / 3B active MoE	71.90GB	Native 262K, extendable to 1,010K

Although 35B-A3B activates about 3B parameters per step, it still needs to load the full MoE weights. So it should not be estimated like a 3B small model.

Qwen3.6-27B VRAM Table

Qwen3.6-27B is a dense model. Its advantage is stable behavior, while its inference cost is closer to a traditional 27B model. For local deployment, it is more compute-heavy than 35B-A3B, but its VRAM requirements are easier to estimate.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	9.39GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	10.85GB	12GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	11.85GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	11.99GB	14GB	18GB	VRAM-saving 3-bit
`Q3_K_S`	12.36GB	16GB	20GB	3-bit entry point
`Q3_K_M`	13.59GB	16GB	20GB	Common 3-bit compromise
`IQ4_XS`	15.44GB	20GB	24GB	Near-Q4, more VRAM efficient
`IQ4_NL`	16.07GB	20GB	24GB	Quality/size balance
`Q4_K_M`	16.82GB	20GB	24GB	Recommended 27B default
`Q5_K_M`	19.51GB	24GB	32GB	Higher-quality quantization
`Q6_K`	22.52GB	28GB	32GB	Quality first
`Q8_0`	28.60GB	32GB	40GB	Near-original precision
`BF16`	53.80GB	64GB	80GB	Research, evaluation, precision comparison

For ordinary local coding and chat, Q4_K_M is the easiest starting point to recommend. A 24GB GPU can run Q4_K_M fairly comfortably, but for long context, reduce quantization size or context length.

Qwen3.6-35B-A3B VRAM Table

Qwen3.6-35B-A3B is an MoE model with 35B total parameters and about 3B active parameters per step. Its advantage is a strong balance between speed and capability, especially for local agents, tool use, and coding workflows.

Note, however, that the 3B active figure mainly affects compute. It does not mean VRAM usage is comparable to a 3B model. Full operation still needs the expert weights.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	10.76GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	11.52GB	14GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	12.29GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	13.21GB	16GB	20GB	VRAM-saving 3-bit
`UD-Q3_K_S`	15.36GB	18GB	24GB	3-bit entry point
`UD-Q3_K_M`	16.60GB	20GB	24GB	Common 3-bit compromise
`UD-IQ4_XS`	17.73GB	20GB	24GB	Quality/size balance
`UD-IQ4_NL`	18.04GB	20GB	24GB	Near-Q4 recommended option
`UD-Q4_K_M`	22.13GB	24GB	32GB	Recommended 35B-A3B default
`UD-Q5_K_M`	26.46GB	32GB	40GB	Higher-quality quantization
`UD-Q6_K`	29.31GB	32GB	48GB	Quality first
`Q8_0`	36.90GB	48GB	64GB	Near-original precision
`BF16`	69.37GB	80GB	96GB	Research, evaluation, precision comparison

With 24GB VRAM, UD-Q4_K_M is a key option, but do not set the context too high. If you want room for 128K+ context, UD-IQ4_XS, UD-IQ4_NL, or 3-bit versions are more realistic.

27B vs 35B-A3B

Need	Better Choice
Stable dense-model behavior	`Qwen3.6-27B`
Faster response, agents, and tool use	`Qwen3.6-35B-A3B`
Daily local use on 24GB VRAM	`35B-A3B UD-Q4_K_M` or `27B Q4_K_M`
Testing on 16GB VRAM	Use 2-bit/3-bit for both; avoid long context
Long context first	Use lower-bit quantization and leave more KV cache room
Quality first with 32GB+ VRAM	`27B Q5/Q6` or `35B-A3B Q5/Q6`

If you mainly write code, run agents, or use tools, 35B-A3B is worth trying first. If you care more about dense-model stability and consistency, 27B is more straightforward.

Why Long Context Uses So Much VRAM

The Qwen3.6 model card recommends keeping longer context for complex tasks and even notes that 128K+ context can help reasoning. But for local deployment, long context means a much larger KV cache.

Actual VRAM usage is affected by:

KV cache: longer context means higher usage.
Whether vision input is enabled: Qwen3.6 includes a vision encoder, and multimodal use adds overhead.
Whether --language-model-only is used: in runtimes such as vLLM, skipping vision can free memory for KV cache.
Batch size and concurrency: more concurrency requires more VRAM.
KV cache quantization: q8_0, q4_0, and similar settings can save VRAM, but may affect details.
Runtime differences: llama.cpp, vLLM, SGLang, KTransformers, and LM Studio do not use exactly the same amount of memory.

So do not look only at GGUF file size. If the file is already close to the VRAM limit, the model may load but still OOM when generating long outputs or using long context.
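
The check described above can be sketched as simple arithmetic. The KV cache sizes and the 1.5GB overhead below are placeholders, not measurements; only the 16.82GB file size comes from the 27B table. The function name vram_margin_gb is illustrative.

```python
def vram_margin_gb(vram_gb, gguf_gb, kv_cache_gb, overhead_gb=1.5):
    """Remaining VRAM after weights, KV cache, and runtime overhead.

    overhead_gb is a rough placeholder for compute buffers and the
    runtime footprint; measure it on your own setup.
    """
    return vram_gb - (gguf_gb + kv_cache_gb + overhead_gb)


if __name__ == "__main__":
    # 27B Q4_K_M (16.82GB file) on a 24GB GPU.
    for kv_gb in (2.0, 4.0, 8.0):  # hypothetical KV cache sizes
        margin = vram_margin_gb(24, 16.82, kv_gb)
        status = "fits" if margin >= 0 else "likely OOM"
        print(f"KV cache {kv_gb:.0f} GB -> margin {margin:+.2f} GB ({status})")
```

A negative margin is the situation where the model loads but fails once the context fills up.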

How to Choose

If you just want to try Qwen3.6 locally:

12GB VRAM: try 27B UD-IQ2_M or 35B-A3B UD-IQ2_M, with short context.
16GB VRAM: try 27B Q3_K_M or 35B-A3B UD-IQ3_XXS.
24GB VRAM: prefer 27B Q4_K_M, 35B-A3B UD-IQ4_NL, or 35B-A3B UD-Q4_K_M.
32GB VRAM: consider 27B Q5/Q6 or 35B-A3B Q5/Q6.
48GB and above: try Q8_0, or reserve more room for long context.

Most users do not need BF16. The point of local Qwen3.6 deployment is not to choose the largest file, but to balance VRAM, context length, speed, and output quality.
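
As an illustration of how the tables can be used, the sketch below encodes the Qwen3.6-27B rows and picks the largest file whose VRAM threshold fits a given GPU. The data is copied from the 27B table in this article; the function name and the choice to default to the "Safer VRAM" column are assumptions made for the example.

```python
# (quantization, GGUF size GB, minimum VRAM GB, safer VRAM GB)
QWEN36_27B = [
    ("UD-IQ2_XXS", 9.39, 12, 16), ("UD-IQ2_M", 10.85, 12, 16),
    ("UD-Q2_K_XL", 11.85, 14, 18), ("UD-IQ3_XXS", 11.99, 14, 18),
    ("Q3_K_S", 12.36, 16, 20), ("Q3_K_M", 13.59, 16, 20),
    ("IQ4_XS", 15.44, 20, 24), ("IQ4_NL", 16.07, 20, 24),
    ("Q4_K_M", 16.82, 20, 24), ("Q5_K_M", 19.51, 24, 32),
    ("Q6_K", 22.52, 28, 32), ("Q8_0", 28.60, 32, 40),
    ("BF16", 53.80, 64, 80),
]


def largest_quant_for(vram_gb, safer=True, table=QWEN36_27B):
    """Return the largest-file quantization that fits, or None."""
    col = 3 if safer else 2  # pick the threshold column to compare against
    fitting = [row for row in table if row[col] <= vram_gb]
    return max(fitting, key=lambda row: row[1])[0] if fitting else None


if __name__ == "__main__":
    for vram in (12, 16, 24, 32):
        print(vram, "GB:", largest_quant_for(vram), "/",
              largest_quant_for(vram, safer=False))
```

With safer=True a 24GB GPU lands on Q4_K_M, which matches the recommendation above; safer=False reflects the short-context minimum only.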

Running DeepSeek V4 Locally: VRAM Estimates for Pro, Flash, and Base Versions

Fri, 01 May 2026 11:55:25 +0800

DeepSeek V4 and Gemma 4 are not in the same class for local deployment. With Gemma 4, it still makes sense to discuss how to run 26B or 31B models on 24GB or 32GB GPUs. DeepSeek V4 is a huge MoE model, and full local deployment quickly moves into multi-GPU workstation or server territory.

The official DeepSeek V4 Preview release mainly includes two inference models:

DeepSeek-V4-Pro: 1.6T total / 49B active params
DeepSeek-V4-Flash: 284B total / 13B active params

The official Hugging Face collection also includes two Base models:

DeepSeek-V4-Pro-Base
DeepSeek-V4-Flash-Base

This article only discusses rough VRAM requirements when the full model weights are loaded. For MoE models, the active parameter count mainly affects per-token compute. It does not mean only those parameters need to be loaded. Without expert-on-demand loading, CPU/NVMe offload, distributed inference, or specialized runtime optimizations, VRAM should still be estimated from the full weight size.

Quick Summary

VRAM Scale	What Is Realistic	Do Not Expect
24GB	Cannot fully run DeepSeek V4; use smaller distilled models or API	Full V4-Flash / V4-Pro local loading
48GB	Still not suitable for full loading; good for small models or remote API clients	Stable V4-Flash Q4
80GB	Theoretically try V4-Flash Q2/Q3 or heavy offload	V4-Pro
128GB	V4-Flash Q4 becomes more realistic; Q5/Q6 still tight	V4-Pro Q4
192GB	V4-Flash FP8/Q6 is more comfortable; Pro Q2 enters experimental range	V4-Pro Q4
256GB	V4-Flash FP8 is fairly comfortable; Pro Q2/Q3 can be tested	V4-Pro Q5 and above
512GB	V4-Pro Q4 starts to become discussable	V4-Pro FP8
1TB+	V4-Pro FP8 and low-bit Pro-Base are more realistic	Low-cost single-machine deployment
2TB+	Pro-Base FP8 class	Ordinary workstation deployment

If your goal is to run a model on a personal computer, DeepSeek V4 is not the right target. More realistic options are:

Use the official DeepSeek API or compatible services.
Wait for stable community GGUF/EXL2/MLX quantizations and inference support.
Use smaller DeepSeek distilled models.
Use local models in the 7B to 70B range from Qwen, Gemma, Llama, and similar families.

Official Weight Sizes

The following figures come from model.safetensors.index.json in the official Hugging Face repositories. They reflect current public weight file sizes, not full runtime VRAM use under long context.

Model	Parameter Scale	Official Weight Size	Notes
`DeepSeek-V4-Flash`	284B total / 13B active	159.61GB	Inference model, smallest in this group
`DeepSeek-V4-Pro`	1.6T total / 49B active	864.70GB	Inference model, stronger but enormous
`DeepSeek-V4-Flash-Base`	284B total	294.67GB	Base model, closer to full FP8 weight size
`DeepSeek-V4-Pro-Base`	1.6T total	1606.03GB	Base model, about 1.6TB

Even the smallest V4-Flash is already close to 160GB of official weights. That is why it should not be treated like a 13B model just because it has 13B active params.
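
One way to read the table is to divide each file size by the total parameter count. The sketch below does only that arithmetic, treating 1.6T as 1600B and 1 GB as 1e9 bytes (both assumptions); the resulting averages do not identify the actual storage format of any tensor.

```python
# (total parameters in billions, official weight size in GB), from the table above
OFFICIAL = {
    "DeepSeek-V4-Flash": (284, 159.61),
    "DeepSeek-V4-Pro": (1600, 864.70),
    "DeepSeek-V4-Flash-Base": (284, 294.67),
    "DeepSeek-V4-Pro-Base": (1600, 1606.03),
}


def bits_per_param(params_billion, size_gb):
    """Average stored bits per parameter, taking 1 GB as 1e9 bytes."""
    return size_gb * 8 / params_billion


if __name__ == "__main__":
    for name, (params_b, size_gb) in OFFICIAL.items():
        print(f"{name}: ~{bits_per_param(params_b, size_gb):.1f} bits/param")
```

Under these assumptions the Base files come out near 8 bits per parameter, in line with the table's note, while the inference-model files average noticeably less.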

DeepSeek V4 Flash VRAM Estimate

V4-Flash is the most approachable DeepSeek V4 variant for local experiments. But that only means “more approachable than Pro”; it is still not a consumer single-GPU model.

The table below uses the official 159.61GB weight size as the baseline. Q4/Q3/Q2 rows are bit-width estimates and do not imply that stable official GGUF versions currently exist.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	159.61GB	192GB	256GB	Multi-GPU servers, inference service
`Q6`	120GB	160GB	192GB	Quality-first quantization tests
`Q5`	100GB	128GB	160GB	Quality/size balance
`Q4`	80GB	96GB	128GB	More realistic starting point for Flash
`Q3`	60GB	80GB	96GB	Large-VRAM single GPU or multi-GPU tests
`Q2`	40GB	48GB	64GB	Extreme low-bit experiments with clear quality risk

If mature V4-Flash Q4 builds appear later, it still probably will not be a 24GB GPU model. A more realistic starting point is 96GB to 128GB total VRAM, or CPU/offload setups that trade speed for capacity.
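
The bit-width rows in the table can be reproduced with a one-line scaling rule. This sketch assumes, as the table appears to, that the official weights act as an 8-bit baseline and that size scales linearly with bit width; real quantized files also carry scales and metadata, so actual sizes would differ. The function name scaled_size_gb is illustrative.

```python
def scaled_size_gb(baseline_gb, target_bits, baseline_bits=8):
    """Scale a weight size linearly with bit width."""
    return baseline_gb * target_bits / baseline_bits


FLASH_OFFICIAL_GB = 159.61  # from the official weight table

if __name__ == "__main__":
    for bits in (6, 5, 4, 3, 2):
        print(f"Q{bits}: ~{scaled_size_gb(FLASH_OFFICIAL_GB, bits):.0f} GB")
```

The same function applied to the other official sizes gives values close to the estimates in the Pro and Base tables.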

DeepSeek V4 Pro VRAM Estimate

V4-Pro is the flagship inference model, with official weights around 864.70GB. Even at 4-bit quantization, the full weights remain in the hundreds of GB.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	864.70GB	1TB	1.2TB+	Multi-node or multi-GPU inference service
`Q6`	648GB	768GB	1TB	High-quality quantized service
`Q5`	540GB	640GB	768GB	Quality/cost balance
`Q4`	432GB	512GB	640GB	Lowest practical quality line for Pro
`Q3`	324GB	384GB	512GB	Low-bit experiments
`Q2`	216GB	256GB	320GB	Extreme experiments with high quality and stability risk

For individual users, V4-Pro is better consumed through an API. If the goal is full local deployment, treat it as a multi-GPU server model, not a 4090, 5090, or RTX PRO single-GPU model.

DeepSeek V4 Flash-Base VRAM Estimate

Base models are usually for research, fine-tuning, or continued training, not ordinary chat deployment. V4-Flash-Base has official weights of about 294.67GB.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	294.67GB	384GB	512GB	Research, preprocessing, evaluation
`Q6`	221GB	256GB	320GB	High-quality quantization research
`Q5`	184GB	224GB	256GB	Quality/size balance
`Q4`	147GB	192GB	224GB	Lower-cost Base experiments
`Q3`	111GB	128GB	160GB	Low-bit experiments
`Q2`	74GB	96GB	128GB	Extreme experiments

If you only want to use DeepSeek V4 capabilities, do not start with the Base model. Base models cost more to deploy and tune; most applications should use the inference model or API.

DeepSeek V4 Pro-Base VRAM Estimate

V4-Pro-Base is the heaviest variant, with official weights around 1606.03GB. That is already a 1.6TB-class model file.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	1606.03GB	2TB	2.4TB+	Large-scale research clusters
`Q6`	1205GB	1.5TB	2TB	High-quality quantization research
`Q5`	1004GB	1.2TB	1.5TB	Research and evaluation
`Q4`	803GB	1TB	1.2TB	Low-bit research
`Q3`	602GB	768GB	1TB	Extreme low-bit research
`Q2`	402GB	512GB	640GB	Extreme experiments

This kind of model should not be discussed in the framework of “can a home GPU run it?” Even Q4 is already beyond the comfortable range of most single-machine workstations.

Why Active Params Are Not Enough

DeepSeek V4 is an MoE model. MoE means each token activates only part of the experts, so compute is much lower than the total parameter count. But this does not mean VRAM only needs to hold the active parameters.

Full local inference also depends on:

Whether all expert weights must stay resident on GPU.
Whether on-demand expert loading is supported.
CPU memory to GPU memory transfer costs.
NVMe offload latency.
KV cache growth under long context.
Extra runtime overhead under 1M context.
Multi-node and multi-GPU communication cost.

So V4-Pro with 49B active should not be deployed like a 49B model. V4-Flash with 13B active should not be treated like a 13B small model either.

How to Choose

If you are an ordinary individual user:

Do not try to fully self-host DeepSeek V4.
Use the official API when you need DeepSeek V4 capabilities.
For private local deployment, first check whether you have mature inference infrastructure or internal multi-GPU servers.
With only 24GB to 48GB VRAM, 7B, 14B, 32B, or 70B quantized models are more practical.

If you have 128GB to 256GB total VRAM:

Watch for stable community implementations of V4-Flash Q4/Q5.
Do not treat V4-Pro as your main local model.

If you have 512GB+ total VRAM:

V4-Pro Q4 starts to become an engineering validation target.
You still need to care about inference framework support, expert scheduling, KV cache, throughput, and concurrency.

The key question for DeepSeek V4 local deployment is not “which quantized file should I download?” It is “do I have the system-level inference capacity for this model?” It is closer to a server model than a desktop model.

Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models

Fri, 01 May 2026 11:42:34 +0800

Gemma 4 currently has four main sizes for local deployment: E2B, E4B, 26B A4B, and 31B. E2B and E4B target lightweight and edge devices, 26B A4B uses an MoE architecture, and 31B is the larger dense model.

The easiest mistake in local inference is mixing up two numbers:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.

The tables below estimate VRAM requirements based on GGUF file size. The default assumption is local text inference with llama.cpp, LM Studio, Ollama, or similar runtimes, using short to medium context. If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.

Quick Summary

VRAM	Good Fit	Avoid
4GB	Low-bit E2B quantizations	E4B and above
6GB	E2B Q4/Q5, low-bit E4B	26B, 31B
8GB	E2B Q8, E4B Q4/Q5	26B Q4, 31B Q4
12GB	E4B Q8, low-quality 2-bit/3-bit 26B or 31B tests	26B Q4 with long context, 31B Q4
16GB	Low-bit 26B, low-bit 31B	31B Q4 with long context, 26B Q5 and above
24GB	26B Q4/Q5, 31B Q4	31B Q8, BF16
32GB	26B Q6/Q8, 31B Q5/Q6	BF16
48GB	31B Q8 more comfortably, 26B Q8 with longer context	31B BF16
80GB+	26B/31B BF16	Single consumer GPU deployment

If you just want something usable locally, start with E4B Q4_K_M or E2B Q4_K_M. With 24GB VRAM, 26B A4B Q4_K_M and 31B Q4_K_M start to become realistic choices.

Gemma 4 E2B VRAM Table

E2B is the lightest version, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing. It is easy to run, but complex reasoning, coding, and long tasks are limited.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	2.29GB	4GB	6GB	Extreme low-VRAM tests
`UD-Q2_K_XL`	2.40GB	4GB	6GB	Low-VRAM usability
`Q3_K_M`	2.54GB	4GB	6GB	Lightweight chat and summaries
`IQ4_XS`	2.98GB	6GB	8GB	Balance of quality and size
`Q4_K_M`	3.11GB	6GB	8GB	Recommended E2B default
`Q5_K_M`	3.36GB	6GB	8GB	Slightly steadier than Q4
`Q6_K`	4.50GB	8GB	10GB	Higher-quality small model
`Q8_0`	5.05GB	8GB	10GB	Near-original precision for lightweight deployment
`BF16`	9.31GB	12GB	16GB	Debugging, comparison, research

For daily use, E2B Q4_K_M is already enough. With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.

Gemma 4 E4B VRAM Table

E4B is the more practical lightweight model. Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	3.53GB	6GB	8GB	Low-VRAM tests
`UD-Q2_K_XL`	3.74GB	6GB	8GB	Low-VRAM usability
`Q3_K_M`	4.06GB	6GB	10GB	Lightweight local assistant
`IQ4_XS`	4.72GB	8GB	12GB	Balance of quality and speed
`Q4_K_M`	4.98GB	8GB	12GB	Recommended E4B default
`Q5_K_M`	5.48GB	8GB	12GB	Steadier everyday use
`Q6_K`	7.07GB	10GB	16GB	Quality first
`Q8_0`	8.19GB	12GB	16GB	Near-original precision
`BF16`	15.05GB	20GB	24GB	Research, evaluation, precision comparison

If your GPU has 8GB VRAM, E4B Q4_K_M is a realistic starting point. With 12GB or 16GB VRAM, E4B Q8_0 is also worth considering.

Gemma 4 26B A4B VRAM Table

26B A4B is the MoE version. It has a larger total parameter count, but activates only part of the experts during inference. It is better suited to more complex Q&A, coding, tool use, and agent workflows.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	9.97GB	14GB	16GB	Extreme 16GB GPU tests
`UD-Q2_K_XL`	10.55GB	14GB	16GB	Running 26B with low VRAM
`UD-Q3_K_M`	12.53GB	16GB	20GB	Better quality while still VRAM-conscious
`UD-IQ4_XS`	13.42GB	16GB	24GB	Balance of quality and size
`UD-Q4_K_M`	16.87GB	20GB	24GB	Recommended 26B default
`UD-Q5_K_M`	21.15GB	24GB	32GB	Higher-quality quantization
`UD-Q6_K`	23.17GB	28GB	32GB	Quality first
`Q8_0`	26.86GB	32GB	40GB	Near-original precision
`BF16`	50.51GB	64GB	80GB	Not realistic for most single consumer GPUs

24GB VRAM is the comfortable dividing line for 26B A4B. A 16GB GPU can try low-bit versions, but context length, concurrency, and multimodal input should be kept modest.

Gemma 4 31B VRAM Table

31B is the larger dense model. Its strength is stronger overall capability, but its VRAM pressure is more direct than 26B A4B.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	8.53GB	12GB	16GB	Extreme low-VRAM tests with clear quality loss
`UD-IQ2_M`	10.75GB	14GB	18GB	Low-VRAM tests
`UD-Q2_K_XL`	11.77GB	16GB	20GB	16GB GPU experiments
`Q3_K_S`	13.21GB	16GB	24GB	More VRAM-efficient 3-bit
`Q3_K_M`	14.74GB	20GB	24GB	Common 3-bit compromise
`IQ4_XS`	16.37GB	20GB	24GB	Near-Q4 compromise
`Q4_K_M`	18.32GB	24GB	32GB	Recommended 31B default
`Q5_K_M`	21.66GB	28GB	32GB	Higher-quality quantization
`Q6_K`	25.20GB	32GB	40GB	Quality first
`Q8_0`	32.64GB	40GB	48GB	Near-original precision
`BF16`	61.41GB	80GB	96GB	Server or large-VRAM workstation

Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point. Q4_K_M is the balanced choice, while Q5_K_M and above make more sense with 32GB+ VRAM.

Why Actual Usage Is Higher Than File Size

The GGUF file size is only the weight size. Runtime usage also includes:

KV cache: longer context means higher memory use.
Batch size and concurrency: processing more tokens or more users increases VRAM.
Multimodal components: image, audio, or video input often requires mmproj or extra modules.
Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.
KV cache quantization: q8_0, q4_0, and similar modes can save VRAM, but may affect detail.

So the “minimum VRAM” column should be read as the threshold for startup and short-context inference. For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.
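
To see why the threshold moves with context length, the sketch below turns leftover VRAM into a rough token budget. The 18.32GB file size is the 31B Q4_K_M row from the table; the per-token KV size and the 1.5GB overhead are made-up placeholders, not measured Gemma 4 values, and the function name is illustrative.

```python
def max_context_tokens(vram_gb, gguf_gb, kv_bytes_per_token, overhead_gb=1.5):
    """Rough number of context tokens whose KV cache fits in leftover VRAM."""
    headroom_bytes = (vram_gb - gguf_gb - overhead_gb) * 1e9
    if headroom_bytes <= 0:
        return 0  # the weights alone already exceed the budget
    return int(headroom_bytes // kv_bytes_per_token)


KV_BYTES_PER_TOKEN = 200_000  # hypothetical value for illustration

if __name__ == "__main__":
    for vram in (16, 24, 32, 48):
        tokens = max_context_tokens(vram, 18.32, KV_BYTES_PER_TOKEN)
        print(f"{vram} GB -> roughly {tokens} tokens of context")
```

Replacing the placeholder with a value measured on your own runtime gives a more useful estimate than the file size alone.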

How to Choose

If you just want to try Gemma 4 locally:

4GB to 6GB VRAM: choose E2B Q3_K_M or E2B Q4_K_M.
8GB VRAM: prefer E4B Q4_K_M; E2B Q8_0 is also fine.
12GB VRAM: choose E4B Q8_0, or try low-bit 26B/31B variants.
16GB VRAM: try 26B A4B UD-Q3_K_M or 31B Q3_K_S, but do not expect long context to feel comfortable.
24GB VRAM: focus on 26B A4B UD-Q4_K_M and 31B Q4_K_M.
32GB and above: consider Q5_K_M, Q6_K, or longer context.

Most users do not need BF16. Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.

How to Tune llama.cpp on 8GB VRAM: Why 32K Is Safer and 64K Needs KV Cache Quantization

Thu, 23 Apr 2026 12:13:04 +0800

Whether 8GB of VRAM is enough to run local LLMs smoothly, especially under long-context workloads, is one of the most common questions people run into when using llama.cpp.

There are three key takeaways worth remembering first:

On 8GB VRAM, 32K context is usually the safer balance point
If you really want to run 64K, KV Cache quantization is often essential
In full-GPU inference, blindly increasing CPU thread count can actually make performance worse

1. First, what do 32K, 64K, and KV Cache actually mean?

For many readers, these are the three terms that cause the most confusion.

32K and 64K refer to context length, meaning how many tokens the model can process at one time. Here, K means thousand, so 32K is about 32,000 tokens and 64K is about 64,000 tokens (in practice the values are usually set as powers of two, for example --ctx-size 32768 and --ctx-size 65536). The longer the context, the more prior content the model can see at once, which is useful for long-document QA, long conversations, and multi-step analysis.

KV Cache is an intermediate-result cache that the model keeps in order to speed up autoregressive generation. You can think of it like this: once the model has already read and computed part of the context, it does not need to recompute everything from scratch every time. Instead, it stores key intermediate information and reuses it. The K and V come from Key and Value in the Transformer architecture.

Why do these three terms always appear together? Because:

32K and 64K define how much content you want the model to remember at once
KV Cache determines how much extra VRAM is needed to maintain that memory
The longer the context, the larger the KV Cache usually becomes, and the higher the VRAM pressure gets

So when long-context inference slows down, the root problem is often not that the model is “bad at computing”, but that the cache has grown large enough to push VRAM to its limit.
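To make the relationship between context length and cache size concrete, here is a minimal sketch of the usual back-of-the-envelope estimate: one Key and one Value vector per layer per token. The model shape used below (32 layers, 8 KV heads, head size 128) is a hypothetical example rather than a measurement from this article, and `kv_cache_bytes` is just an illustrative helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2.0):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return elems_per_token * n_ctx * bytes_per_elem

GIB = 1024 ** 3

# Hypothetical model shape (an assumption, not taken from the article).
for n_ctx in (32 * 1024, 64 * 1024):
    size = kv_cache_bytes(32, 8, 128, n_ctx)
    print(f"{n_ctx} tokens -> {size / GIB:.1f} GiB of f16 KV cache")
```

With these assumed dimensions the f16 cache alone is 4 GiB at 32K and 8 GiB at 64K, which is why doubling the context can be the difference between fitting in an 8GB card and spilling out of it.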

2. Why does 32K perform so differently from 64K?

Using roughly 30000 Chinese characters from The Three-Body Problem as a stress-test input, the comparison between 32K and 64K context can look dramatic: with similar document size, 64K can become much slower and total runtime can increase significantly.

The reason is not that the model suddenly becomes worse. The real issue is hitting the VRAM boundary.

At 32K, model weights plus cache may still fit within 8GB VRAM, so most data traffic stays on the GPU’s own memory bandwidth. But once you move to 64K, the cache grows further, total memory use approaches or exceeds the VRAM ceiling, and part of the data gets pushed into shared or system memory.

At that point, what collapses is not raw compute, but bandwidth.

In other words, what looks like “context doubled and performance crashed” is often really a case of the data path falling out of VRAM and into much slower memory.

3. If you want 64K, KV Cache quantization matters a lot

One of the most important conclusions for 8GB VRAM users is that KV Cache quantization matters a great deal.

Without changing the model itself, quantizing only the cache can directly reduce cache memory usage under long context. That means some of the data that previously spilled out of VRAM can move back into VRAM. As a result, 64K is still heavier than 32K, but it is less likely to fall into the slowest performance zone.

Put simply:

32K is the more practical default range for 8GB VRAM
64K is not impossible
But without cache quantization, performance can drop from “usable” to “hard to use”

If your goal is stable long-context inference, the usual priority should be:

Check whether VRAM is already near its ceiling
Decide whether to enable KV Cache quantization
Only then continue experimenting with more aggressive throughput settings
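As a rough illustration of how much cache quantization can save, the sketch below compares f16, q8_0, and q4_0 at 64K. The bytes-per-element figures follow from the ggml block layout (32 values plus a 2-byte scale per block). The elements-per-token figure is an assumed model shape, not a number from this article. In llama.cpp the cache type is chosen with `--cache-type-k` and `--cache-type-v`.

```python
# Bytes per cached element. The quantized types store blocks of 32 values
# plus one 2-byte scale: q8_0 = 34 bytes per block, q4_0 = 18 bytes per block.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_gib(elems_per_token, n_ctx, cache_type):
    """KV cache size in GiB for a given context length and cache type."""
    return elems_per_token * n_ctx * BYTES_PER_ELEM[cache_type] / 1024 ** 3

# Assumption: 65536 cached elements per token
# (for example 32 layers x 8 KV heads x 128 dims x 2 for K and V).
ELEMS_PER_TOKEN = 65536

for cache_type in BYTES_PER_ELEM:
    size = kv_cache_gib(ELEMS_PER_TOKEN, 64 * 1024, cache_type)
    print(f"{cache_type}: {size:.2f} GiB at 64K")
```

Under these assumptions q8_0 roughly halves the 64K cache (8 GiB down to 4.25 GiB), which is about the size of the f16 cache at 32K.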

4. Low GPU utilization does not mean the GPU is idle

This is a point that often breaks intuition.

When people see only 20% or 30% GPU usage in Task Manager, they often assume:

the parameters must be wrong
the model is not really running on the GPU
the GPU is not being used fully

But the more likely explanation in llama.cpp inference is that the bottleneck is not core compute, but memory reads and writes.

That means GPU cores may finish a batch of computation quickly, then spend the rest of the time waiting for the next batch of weights or cached data to arrive.

So what you see becomes:

core utilization is not especially high
but end-to-end speed still fails to improve

This is not the GPU being lazy. It is the data path being too narrow.

That is why you should not look only at GPU Usage when judging local LLM performance. VRAM capacity, memory bandwidth, and cache spillover often matter more.
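One way to see why bandwidth dominates is a simple ceiling estimate: if every generated token has to read the weights and the active cache once, tokens per second cannot exceed bandwidth divided by bytes read. The sketch below is a deliberately crude model, and all numbers in it (272 GB/s for VRAM, 25 GB/s for a path through system memory, 4.5 GB of weights, 1.5 GB of cache) are assumptions chosen only for illustration.

```python
def max_tokens_per_second(bandwidth_gb_s, weights_gb, kv_cache_gb=0.0):
    """Upper bound when each token must read all weights and the cache once."""
    return bandwidth_gb_s / (weights_gb + kv_cache_gb)

# Hypothetical numbers, not measurements.
in_vram = max_tokens_per_second(272.0, 4.5, 1.5)  # assumed GPU memory bandwidth
spilled = max_tokens_per_second(25.0, 4.5, 1.5)   # assumed path through system memory

print(f"VRAM ceiling: {in_vram:.1f} tok/s, spilled ceiling: {spilled:.1f} tok/s")
```

Real spillover is usually partial, so actual numbers land somewhere between the two ceilings, but the order-of-magnitude gap explains why utilization can look low while speed still collapses.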

5. Increasing throughput parameters can help, but only if VRAM can handle it

Another useful idea is this: if GPU cores are not fully saturated, maybe you can increase throughput-related parameters so the GPU processes more data at once and uses its parallelism more effectively.

This can indeed improve speed.

But there is an important condition: VRAM must still have headroom.

Because once you increase throughput-related settings, you often also increase VRAM usage. If you are already in a 64K scenario with large cache and VRAM near exhaustion, pushing those parameters further can lead to two outcomes:

a crash
or a fallback into much slower shared-memory behavior

So the safer sequence is usually not “max out the knobs first”, but:

protect the VRAM boundary first
then try throughput optimization
after every change, check both speed and stability again

6. More CPU threads are not always better

This is one of the easiest traps to fall into.

It is very natural to assume that more threads should mean better speed. But in practice, once the model is already running mostly on the GPU, forcing CPU thread count higher can make performance noticeably worse.

The reason is straightforward.

In full-GPU inference, the CPU is more of a scheduler and preprocessing helper than the main compute engine. If you open too many threads, CPU-side thread contention, scheduling overhead, and context-switching costs all become heavier, which can disrupt the data flow that should have stayed smooth.

The result is:

the CPU looks busier
but overall speed gets slower

So in this kind of setup, default settings or lower thread counts are often more reliable than simply maxing everything out.

7. A more practical approach for 8GB VRAM users

If we compress the conclusions above into a practical workflow, it looks roughly like this:

1. Treat 32K as the default goal

If you only have an 8GB GPU, do not rush to chase 64K. 32K is usually the more realistic balance between speed, stability, and memory usage.

2. If you want 64K, deal with the cache first

Do not start by asking whether you can squeeze out a little more speed. First confirm whether KV Cache is quantized and whether VRAM is already near the limit.

3. Do not judge everything by GPU utilization

Low utilization does not necessarily mean the settings are wrong. It may simply mean memory bandwidth is the real bottleneck.

4. Throughput optimization is valid, but do not cross the VRAM boundary

These parameters can help, but only if there is still enough VRAM headroom.

5. Be conservative with CPU threads first

If the model is already running mostly on the GPU, higher CPU thread counts are not automatically better. Start with defaults or lower thread counts, then test gradually.
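The first two steps of this workflow can be sketched as a small fit check: for a target context length, try cache types from highest to lowest precision and keep the first one that leaves the total under the VRAM limit. The function name, the 1 GiB overhead allowance, and the example model figures are all assumptions for illustration. Real usage also depends on compute buffers and the backend.

```python
GIB = 1024 ** 3
CACHE_BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def plan_cache_type(vram_gib, weights_gib, kv_elems_per_token, n_ctx, overhead_gib=1.0):
    """Return the highest-precision cache type that fits in VRAM, or None."""
    for cache_type in ("f16", "q8_0", "q4_0"):
        kv_gib = kv_elems_per_token * n_ctx * CACHE_BYTES_PER_ELEM[cache_type] / GIB
        if weights_gib + kv_gib + overhead_gib <= vram_gib:
            return cache_type
    return None  # nothing fits: shorten the context or pick a smaller model

# Hypothetical 8 GiB card, 2.5 GiB of weights, 65536 cached elements per token.
print("32K:", plan_cache_type(8, 2.5, 65536, 32 * 1024))
print("64K:", plan_cache_type(8, 2.5, 65536, 64 * 1024))
```

With these assumed figures, 32K fits with an f16 cache while 64K only fits once the cache is quantized, which mirrors the rule of thumb above.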

Conclusion

The most valuable part of this discussion is not the benchmark numbers themselves, but one easily overlooked point that they make much clearer:

Local LLM tuning is often not about pushing every setting to the maximum. It is about understanding whether your real bottleneck is compute, VRAM capacity, memory bandwidth, or CPU scheduling.

For 8GB VRAM users, the safer strategy is usually not to force the longest possible context, but to protect the VRAM boundary first and only then decide how far to push further.

If you only remember one sentence, make it this:

32K is often the more stable working range for 8GB VRAM; 64K is possible, but only if you have already brought KV Cache and VRAM usage under control.

A 16GB GPU Can Still Run 35B Models: VRAM Compression Strategies for MoE Models in LM Studio

Wed, 22 Apr 2026 21:47:34 +0800

Many people think of 16GB VRAM as the point where local LLM deployment more or less tops out at 12B to 14B models, and anything larger becomes too painful even with quantization. That view is understandable, but it is not the true ceiling of a 16GB GPU.

If your model choice and parameter setup are good enough, a 16GB GPU does not have to stay limited to “small-parameter” models. One representative approach is to use MoE models inside LM Studio with a sensible offloading strategy, so that 35B-class models can still run at a genuinely usable speed.

01 Why a 16GB GPU is not necessarily limited to 12B to 14B

The core idea is straightforward: VRAM size matters, but model architecture matters just as much.

If you try to cram a standard dense model into a 16GB GPU, you will hit the wall quickly. These models usually involve all parameters during inference, so VRAM pressure and bandwidth pressure rise immediately.

But MoE models are different. Their total parameter count can be large, but only a subset of the experts is activated for each token. Take a 35B-class model as an example: all of its weights still have to be stored somewhere, but the number of parameters actually read and computed per token is much smaller, so the experts that do not fit in VRAM can sit in system memory without hurting speed as much as many people assume.

That is exactly why a 16GB GPU still leaves some room to work with.
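A quick calculation shows the gap between what an MoE model has to store and what it touches per token. This is a sketch under stated assumptions: 35B total parameters, 3B active per token (reading the "A3B" suffix that way), and roughly 6.56 bits per weight for a Q6_K-style quantization. Actual file sizes vary by model.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes) at an average bits per weight."""
    return params_billion * bits_per_weight / 8

# Assumptions: 35B total, 3B active per token, about 6.56 bits per weight.
total = weights_gb(35, 6.56)
active = weights_gb(3, 6.56)

print(f"stored: {total:.1f} GB, read per token: about {active:.1f} GB")
```

The stored size (close to 29 GB here) still exceeds 16GB of VRAM, so part of the model has to live in system memory. What makes this tolerable is that only a few GB of weights are actually read for each token.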

02 Key practical takeaway: 35B MoE models can run surprisingly fast

One representative case is a quantized MoE model such as Qwen 3.5 35B A3B. With a 16GB GPU and the right settings in LM Studio, Q6 quantization can reach something above 30 tokens/s, and Q4 can sometimes test even higher.

That result matters not just because the model “runs,” but because the speed is already in a clearly usable range.

As a comparison, large models of a similar scale that are not MoE often run into VRAM overflow and sharply lower speed on a 16GB GPU. In other words, the outcome is not determined by parameter count alone. What matters is how those parameters are actually used during inference.

03 In LM Studio, the key is not just one parameter

If you want this kind of model to run smoothly on a 16GB GPU, the real trick is not luck. It is tuning two parameters correctly:

GPU Offload
the setting that forces part of the expert layers into CPU memory

The first one is easy to understand: set GPU Offload as high as it will go, so that as many layers as possible run on the GPU.

The second one is the real key here. It is not the traditional “borrow system memory after VRAM overflows” approach. Instead, it proactively places part of the expert layers into CPU memory to reduce VRAM usage in advance. Since MoE models do not activate every expert on every step anyway, moving some experts into system memory does not hurt overall inference speed as much as many people would expect.

A safer way to tune it is to start within a range and then adjust gradually for your machine:

start with a value somewhere between 20 and 35 for the expert-layer setting
then fine-tune based on VRAM usage and memory pressure

At its core, this method is using system memory to buy back VRAM headroom.
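The tuning loop can be approximated on paper before touching the slider. The sketch below finds the smallest number of layers whose experts must move to system memory for the rest to fit a VRAM budget. Every figure in it (40 layers, 0.5 GB of experts per layer, 5 GB for attention weights, cache, and buffers, 14 GB of usable VRAM) is hypothetical, and `min_cpu_expert_layers` is an illustrative helper, not an LM Studio API.

```python
def min_cpu_expert_layers(n_layers, expert_gb_per_layer, other_gb, vram_budget_gb):
    """Smallest number of layers whose experts must live in system RAM to fit."""
    for n_cpu in range(n_layers + 1):
        vram_needed = other_gb + expert_gb_per_layer * (n_layers - n_cpu)
        if vram_needed <= vram_budget_gb:
            return n_cpu
    return None  # even with all experts in RAM, the remainder does not fit

# Hypothetical split of a 25 GB model on a card with 14 GB of usable VRAM.
n_cpu = min_cpu_expert_layers(40, 0.5, 5.0, 14.0)
ram_gb = 0.5 * n_cpu  # expert weights that end up in system memory

print(f"experts of {n_cpu} layers on CPU, about {ram_gb:.0f} GB moved to RAM")
```

With these assumed numbers the answer is 22 layers, which happens to fall inside the 20 to 35 starting range. The same arithmetic shows where the cost goes: about 11 GB of expert weights now sit in system memory.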

04 It can still run at 128K context, and smaller contexts reduce VRAM further

Another interesting point is that even with the context length pushed to 128K, a 35B-class MoE model can still maintain a relatively high speed.

That tells us something important: the bottleneck of a 16GB GPU is not as rigid as many people imagine. Especially inside a local inference tool like LM Studio, the real question is often not simply “can it run or not,” but rather:

are you willing to trade more system memory for less VRAM usage
are you willing to shorten the context length
are you willing to accept different capability tradeoffs across quantization levels

If the context is reduced further from 128K to 64K or 32K, VRAM pressure can drop even more. That means some 35B-class MoE models may even run, barely, on GPUs with less VRAM, though speed and memory pressure will need to be rebalanced.

05 The cost of this approach: much higher demands on RAM and virtual memory

This kind of setup is not free performance.

What you need to watch is that once VRAM pressure is compressed further, system RAM usage rises noticeably, and virtual memory pressure rises too. In other words, you are not removing the cost. You are shifting pressure from the GPU to RAM and disk swap space.

So if you want to try it yourself, it is worth checking a few things first:

whether your system RAM is large enough
whether your virtual memory allocation is large enough
whether too many background applications are already consuming resources

If those conditions are not in place, what you may get is not “35B running fast,” but an overall machine that becomes slow everywhere.

06 More aggressive quantization is not always better

There is another practical tradeoff here. Lower-bit quantization often saves more VRAM, but that does not automatically make it the best choice.

The practical takeaway is that some models do run faster under Q4, but their original capability can also degrade more. By comparison, Q6 tends to strike a better balance between speed and capability retention. So the right choice depends on what you care about more:

maximum speed and fitting into VRAM
or preserving more of the model’s original capability

Those two priorities do not necessarily lead to the same quantization choice.

07 What kinds of models are worth trying

From this angle, the goal is not to blindly chase bigger parameter counts, but to look first for models that fit this strategy:

models built on MoE architecture
models that are well supported in LM Studio and have complete quantized variants
models with clear advantages in long context or instruction following

And the idea does not stop at one 35B MoE model. It also extends naturally to other directions, such as experimental models with stronger long-context memory, better instruction following, or lighter quantized versions with strong speed performance.

The logic behind this is very consistent: first find models whose architecture fits the “trade memory for VRAM” strategy, and then talk about tuning. Do not start from parameter count alone and decide from there.

08 Short conclusion

If you happen to have a 16GB GPU and assume local LLMs stop at 12B to 14B, that assumption is worth updating.

A more accurate way to put it is:

a 16GB GPU is not automatically ruled out for larger models
dense models and MoE models need to be considered separately
GPU Offload and expert-layer transfer to CPU memory inside LM Studio can significantly change VRAM usage
in practice, you are trading higher memory pressure for larger model scale and better usable speed

This approach will not fit every machine, but it does show one important thing: in local LLM deployment, VRAM is not the only limit. Model architecture and inference configuration matter just as much.

Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings

Sun, 19 Apr 2026 00:18:00 +0800

When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?

This note summarizes how Ollama behaves with multiple GPUs. The short version:

Ollama supports multiple GPUs.
The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.
By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.
If a model does not fit on one GPU, Ollama can spread it across available GPUs.
Mixed GPU models may be visible to Ollama, but performance and placement may not be ideal.
SLI / NVLink is not required for multi-GPU use.
To limit which GPUs Ollama can use, use CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, or GGML_VK_VISIBLE_DEVICES.

Official Behavior: Single GPU First, Multi-GPU When Needed

Ollama’s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.

The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.

So do not think of Ollama multi-GPU as “more cards automatically means several times faster.” A more accurate model is:

Small model fits on one GPU: usually runs on one GPU.
Large model does not fit on one GPU: split across multiple GPUs.
Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.
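The three cases above can be written as a tiny decision function. This is a simplified reading of the documented behavior, not Ollama's scheduler code. The function name and the GiB figures are illustrative assumptions, and the real estimate also accounts for context size and per-GPU overhead.

```python
def placement(model_gib, free_gib_per_gpu):
    """Simplified model of where a new model ends up, given free VRAM per GPU."""
    if any(free >= model_gib for free in free_gib_per_gpu):
        return "single GPU"
    if sum(free_gib_per_gpu) >= model_gib:
        return "split across GPUs"
    return "partial CPU/RAM offload"

# Hypothetical machine with 24 GiB and 8 GiB free on two cards.
for size in (9, 30, 40):
    print(f"{size} GiB model -> {placement(size, [24, 8])}")
```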

Use this command to see where the model is loaded:

`ollama ps`

The PROCESSOR column may show something like:

100% GPU
48%/52% CPU/GPU
100% CPU

If you see 48%/52% CPU/GPU, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.
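If you want to check this from a script, the PROCESSOR value is easy to parse. The sketch below only handles the three shapes shown above. The helper name is hypothetical, and other Ollama versions may format the column differently.

```python
import re

def parse_processor(value):
    """Parse an `ollama ps` PROCESSOR value into (cpu_percent, gpu_percent)."""
    value = value.strip()
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", value)
    if split:
        return int(split.group(1)), int(split.group(2))
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", value)
    if single:
        pct = int(single.group(1))
        return (pct, 0) if single.group(2) == "CPU" else (0, pct)
    raise ValueError(f"unrecognised PROCESSOR value: {value!r}")

cpu, gpu = parse_processor("48%/52% CPU/GPU")
if cpu > 0:
    print(f"{cpu}% of the model is in system memory; expect slower generation")
```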

Multi-GPU Is Not Simple Compute Stacking

Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.

So multi-GPU benefits usually fall into two categories:

VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.
Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.

If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama’s default “use one GPU when it fits” strategy avoids that unnecessary PCIe cost.

SLI or NVLink Is Not Required

Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.

NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.

What you should pay attention to is PCIe bandwidth. The difference between x1, x4, x8, and x16 affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.

Safer rules:

Prefer x16 / x8 over mining-style x1 risers.
PCIe bandwidth matters more when switching large models frequently.
If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.
For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.
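To get a feel for why lane count matters when switching models, here is a best-case load-time estimate. The per-lane figures are the commonly quoted usable rates for PCIe 3.0 and 4.0. The 20 GB model size is an assumption, and real loads are often limited by disk speed instead.

```python
# Approximate usable bandwidth per lane in GB/s (after 128b/130b encoding).
GB_S_PER_LANE = {"3.0": 0.985, "4.0": 1.969}

def load_seconds(model_gb, pcie_gen, lanes):
    """Best-case time to copy a model over PCIe, ignoring disk and driver overhead."""
    return model_gb / (GB_S_PER_LANE[pcie_gen] * lanes)

for lanes in (1, 4, 8, 16):
    seconds = load_seconds(20, "3.0", lanes)
    print(f"PCIe 3.0 x{lanes}: {seconds:.1f} s for a 20 GB model")
```

On an x1 riser the copy alone takes around 20 seconds for a model of that size, compared with a little over one second on x16.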

Limit Which NVIDIA GPUs Ollama Uses

On NVIDIA multi-GPU systems, use CUDA_VISIBLE_DEVICES to control which GPUs Ollama can see.

Temporary run:

`CUDA_VISIBLE_DEVICES=0,1 ollama serve`

Use only the second GPU:

`CUDA_VISIBLE_DEVICES=1 ollama serve`

Force Ollama not to use NVIDIA GPUs:

`CUDA_VISIBLE_DEVICES=-1 ollama serve`

The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:

`nvidia-smi -L`

Example output:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)

Then specify the UUID:

`CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve`
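If you select GPUs by UUID often, the lookup can be scripted. The sketch below parses text in the `nvidia-smi -L` format shown above and builds the variable value. The helper name is hypothetical, and the sample string stands in for real command output, which you could capture with `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)`.

```python
import re

def uuids_for(listing, name_substring):
    """Return UUIDs of GPUs whose `nvidia-smi -L` line contains name_substring."""
    pattern = re.compile(r"^GPU \d+: (.+?) \(UUID: (GPU-[^)]+)\)$")
    found = []
    for line in listing.splitlines():
        match = pattern.match(line.strip())
        if match and name_substring in match.group(1):
            found.append(match.group(2))
    return found

# Placeholder output in the same shape as the example above.
sample = """GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)"""

print("CUDA_VISIBLE_DEVICES=" + ",".join(uuids_for(sample, "RTX 3090")))
```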

If Ollama is installed as a Linux systemd service, put the variable into the service environment:

`sudo systemctl edit ollama.service`

Add:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

AMD and Vulkan Device Selection

For AMD ROCm, use ROCR_VISIBLE_DEVICES to control visible GPUs:

`ROCR_VISIBLE_DEVICES=0,1 ollama serve`

To force Ollama not to use ROCm GPUs, use an invalid ID:

`ROCR_VISIBLE_DEVICES=-1 ollama serve`

Ollama’s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use GGML_VK_VISIBLE_DEVICES:

`OLLAMA_VULKAN=1 GGML_VK_VISIBLE_DEVICES=0 ollama serve`

If Vulkan devices cause problems, disable them:

`GGML_VK_VISIBLE_DEVICES=-1 ollama serve`

AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as HSA_OVERRIDE_GFX_VERSION. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.

Exposing Multiple GPUs in Docker

If you run Ollama in Docker, NVIDIA setups usually require nvidia-container-toolkit, then --gpus to expose devices.

Expose all GPUs:

docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Expose specific GPUs:

docker run -d \
  --gpus '"device=0,1"' \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

You can also combine this with environment variables:

docker run -d \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

If nvidia-smi cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.

What Is `OLLAMA_SCHED_SPREAD`

In some multi-GPU configuration discussions, you may see OLLAMA_SCHED_SPREAD=1 or OLLAMA_SCHED_SPREAD=true. It is related to Ollama’s scheduler and is often used when people want models or requests to be spread more broadly across GPUs.

Example:

`OLLAMA_SCHED_SPREAD=1 ollama serve`

Or with systemd:

[Service]
Environment="OLLAMA_SCHED_SPREAD=true"

But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.

Treat OLLAMA_SCHED_SPREAD as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on ollama ps, logs, and nvidia-smi.

How to Check Whether Multiple GPUs Are Being Used

Useful commands:

`ollama ps`

`watch -n 0.5 nvidia-smi`

View the Ollama service logs:

`journalctl -u ollama -f`

If using Docker:

`docker logs -f ollama`

Watch for:

Whether Ollama discovers compatible GPUs.
Whether the model shows 100% GPU or a CPU/GPU split.
Whether each GPU has VRAM allocated.
Whether VRAM grows on multiple GPUs during model loading.
Whether generation token/s improves compared with CPU/RAM spillover.
Whether OOM or model unloading happens frequently.

GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.

Common Misunderstandings

Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU

Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the “does not fit” problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.

Misunderstanding 2: Different GPU Models Cannot Be Mixed

Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, smaller VRAM, and PCIe topology. The most predictable setup is still same model, same VRAM size, and well-supported same-generation drivers.

Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU

Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.

Misunderstanding 4: NVLink / SLI Is Required

No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.

Misunderstanding 5: Adding a GPU Does Not Require Restarting Services

Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.

GPU Selection Suggestions

For Ollama local inference, the rough priority is:

Larger single-GPU VRAM is usually easier to manage.
Identical GPUs are easier to troubleshoot than mixed GPUs.
More complete PCIe lanes make large-model loading smoother.
Older cards should be checked for CUDA compute capability or ROCm support first.
Multi-GPU power, cooling, and chassis airflow must be planned ahead.

For budget second-hand platforms:

Dual RTX 3090 remains a common high-VRAM option.
Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all need trade-offs.
Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.
Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.

Summary

Ollama multi-GPU support is best understood as “VRAM expansion first, performance acceleration second.” If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.

In practice, use ollama ps to check where the model is loaded, then use nvidia-smi or ROCm tools to observe VRAM allocation. For GPU selection, use CUDA_VISIBLE_DEVICES on NVIDIA, ROCR_VISIBLE_DEVICES on AMD ROCm, and GGML_VK_VISIBLE_DEVICES for Vulkan. If running in Docker, first make sure the container can see the GPUs.

Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.

References

Ollama FAQ: How does Ollama load models on multiple GPUs?: https://github.com/ollama/ollama/blob/main/docs/faq.mdx
Ollama GPU docs: Hardware support / GPU Selection: https://github.com/ollama/ollama/blob/main/docs/gpu.mdx
Ollama Docker Hub: https://hub.docker.com/r/ollama/ollama
NVIDIA Container Toolkit: https://github.com/NVIDIA/nvidia-container-toolkit

Gemma 4 E4B Uncensored vs Official: What Actually Changes

Sat, 18 Apr 2026 10:20:00 +0800

If you see a model like HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive, the most important point is this: it is not a new Google base model. It is a derivative release built on top of the official google/gemma-4-E4B-it, but with alignment behavior intentionally pushed toward fewer refusals.

That means the real difference is usually behavioral policy and response style, not a brand-new architecture.

What the derivative model explicitly claims

According to its Hugging Face model card, the HauhauCS release says:

it is based on google/gemma-4-E4B-it
it makes “no changes to datasets or capabilities”
it is “just without the refusals”
the Aggressive variant is “fully unlocked and won’t refuse prompts”

Those are the creator’s claims, not an independent benchmark. Still, they tell you the intended positioning very clearly: this is an unofficial derivative optimized to reduce safety refusals.

Official model vs “uncensored” derivative

Dimension	Official `google/gemma-4-E4B-it`	`Gemma-4-E4B-Uncensored-HauhauCS-Aggressive`
Source	Official Google release	Third-party derivative on Hugging Face
Base architecture	Gemma 4 E4B instruction-tuned model	Same base family, explicitly described as based on `google/gemma-4-E4B-it`
Main goal	General-purpose helpful assistant with responsible-use framing	Reduce refusals and keep answering even when the official model might decline
Safety posture	Aligned with Gemma family safety docs and prohibited-use policy	Intentionally weakened refusal behavior
Response style	More likely to refuse, redirect, or soften certain requests	More likely to answer directly, including prompts the official model may block
Risk profile	Lower misuse risk by default, but still not risk-free	Higher misuse risk, higher chance of unsafe or non-compliant output
Predictability in products	Easier to justify in normal apps and enterprise environments	Harder to justify in public-facing, business, or policy-sensitive deployments
Compliance burden	Still requires application-level safeguards	Requires even stronger downstream safeguards because the model itself is less restrictive

The core difference is alignment, not raw capability

Many users mistakenly treat “uncensored” as if it means “smarter.” That is usually the wrong frame.

For a derivative like this, what changes first is:

how often the model refuses
how strongly it follows harmful or policy-sensitive instructions
how much filtering remains in its final answers

What does not automatically change:

the underlying Gemma 4 family architecture
context window class
multimodal support class
general reasoning ceiling

In other words, an uncensored derivative is often better described as a different behavioral tuning of the same model family, not a higher-tier model.

Why the official version behaves differently

Google’s official Gemma materials frame the family as being built for responsible AI development. The Gemma model card highlights misuse, harmful content, privacy, and bias risks, and Google’s Gemma Prohibited Use Policy explicitly forbids using Gemma or model derivatives to:

facilitate dangerous, illegal, or malicious activities
generate harmful or deceptive content
override or circumvent safety filters

So the official model is not just “more conservative” by accident. Its surrounding policy and intended deployment posture are deliberately different.

When the official model is the better choice

Use the official google/gemma-4-E4B-it path if you care about:

product deployment
enterprise or team use
lower legal and policy exposure
fewer obviously unsafe outputs
easier documentation and review

For most normal applications, this is the safer default.

When people choose the uncensored derivative

Users usually choose an uncensored derivative for:

local private experimentation
testing where the official model refuses too early
roleplay or open-ended creative prompting
comparing alignment behavior across variants

But this comes with a real trade-off: you are moving more safety responsibility from the model provider to yourself.

Practical conclusion

The difference between a so-called “jailbroken” Gemma 4 E4B and the ordinary official version is mostly this:

the official version is optimized for usable capability with guardrails
the uncensored derivative is optimized for fewer refusals with weaker guardrails

That does not automatically make the uncensored model stronger. It mainly makes it more permissive.

If your goal is stable, explainable, and lower-risk deployment, use the official model first. If your goal is local experimentation and you understand the compliance and safety trade-offs, then an uncensored derivative is a behavior variant worth testing separately, not a drop-in “better” replacement.

Sources

Hugging Face: HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
Hugging Face: google/gemma-4-E4B-it
Google AI for Developers: Gemma Prohibited Use Policy
Google AI for Developers: Gemma model card

How to Use llama-quantize for GGUF Models

Sun, 12 Apr 2026 09:42:36 +0800

llama-quantize is the quantization tool in llama.cpp. It is used to convert high-precision GGUF models into smaller quantized versions.

Its most common use is turning formats such as F32, BF16, or FP16 into versions like Q4_K_M, Q5_K_M, or Q8_0 that are easier to run locally. After quantization, models usually become much smaller and often faster at inference, but some quality loss is expected.

Basic workflow

A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.

# install Python dependencies (run from the llama.cpp repository root)
python3 -m pip install -r requirements.txt

# convert the Hugging Face model to a GGUF file in FP16
python3 convert_hf_to_gguf.py ./models/mymodel/ --outtype f16 --outfile ./models/mymodel/ggml-model-f16.gguf

# quantize the model to 4 bits (using the Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

After that, you can run the quantized model with llama-cli:

# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"

Common options

--allow-requantize: allows requantizing an already quantized model, usually not ideal for quality
--leave-output-tensor: keeps the output layer unquantized, increasing size but sometimes helping quality
--pure: disables mixed quantization and uses a more uniform quant type
--imatrix: uses an importance matrix to improve quantization quality
--keep-split: keeps the original shard layout instead of producing one merged file

If you just want a practical starting point, this is often enough:

./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
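
When several quant variants are produced from a script, the options listed above can be assembled programmatically. The following is a minimal Python sketch, assuming the llama-quantize binary sits in the current directory; build_quantize_cmd and quantize are hypothetical helper names, not part of llama.cpp, and only flags mentioned in this article are used.

```python
import subprocess
from typing import List, Optional


def build_quantize_cmd(src: str, dst: str, quant: str,
                       imatrix: Optional[str] = None,
                       leave_output_tensor: bool = False,
                       binary: str = "./llama-quantize") -> List[str]:
    """Build the argv list for one llama-quantize run (hypothetical helper)."""
    cmd = [binary]
    # Options go before the input path, as in the tool's usage text.
    if imatrix:
        cmd += ["--imatrix", imatrix]
    if leave_output_tensor:
        cmd.append("--leave-output-tensor")
    cmd += [src, dst, quant]
    return cmd


def quantize(src: str, dst: str, quant: str, **options) -> None:
    """Run llama-quantize and raise if it exits with an error."""
    subprocess.run(build_quantize_cmd(src, dst, quant, **options), check=True)
```

Calling quantize("./models/mymodel/ggml-model-f16.gguf", "./models/mymodel/ggml-model-Q5_K_M.gguf", "Q5_K_M") in a loop over quant names is usually enough to produce a comparison set.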

How to choose a quant

You can think of quant levels as a tradeoff between size, speed, and quality:

Q8_0: larger, but usually safer for quality
Q6_K / Q5_K_M: common balanced choices
Q4_K_M: a very common default with a good size-quality balance
Q3 / Q2: useful when hardware is very limited, but quality loss is more visible

The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.
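
To make the size side of that tradeoff concrete, here is a rough Python sketch that estimates file size from parameter count. The bits-per-weight figures are approximations I am assuming for illustration (mixed quants vary by model architecture), and estimate_size_gb is a hypothetical helper, not a llama.cpp function.

```python
# Approximate bits per weight for a few common types (assumed values;
# real files differ slightly because some tensors keep higher precision).
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
}


def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in GB: parameters * bits per weight / 8."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # (params_billion * 1e9 weights) * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits / 8


if __name__ == "__main__":
    for name in APPROX_BITS_PER_WEIGHT:
        print(f"8B model at {name}: about {estimate_size_gb(8, name):.1f} GB")
```

The estimate covers weights only; context (KV cache) needs additional memory at runtime.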

Practical takeaway

start with Q4_K_M or Q5_K_M
move up to Q6_K or Q8_0 if quality matters more
move down to Q3 or Q2 if memory is tight
compare versions with the same prompt set

In short, llama-quantize is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.

How to Get GGUF Models from Hugging Face with llama.cpp

Sun, 12 Apr 2026 09:31:38 +0800

llama.cpp can work directly with GGUF models hosted on Hugging Face, so you do not always need to download model files manually first.

If a model repository already provides GGUF files, you can use the -hf argument in the CLI, for example:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

By default, this downloads from Hugging Face.
If you use another service that exposes a Hugging Face compatible API, you can switch the download endpoint with the MODEL_ENDPOINT environment variable.
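
If you launch llama-cli from a script, the endpoint switch can be scoped to that one process instead of being set globally. A minimal Python sketch, assuming llama-cli is on PATH; hf_run_spec is a hypothetical helper and the mirror URL in the usage note is a placeholder, not a real service.

```python
import os
import subprocess
from typing import Dict, List, Optional, Tuple


def hf_run_spec(repo: str, endpoint: Optional[str] = None) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for running llama-cli against a Hugging Face repo."""
    env = dict(os.environ)  # copy, so the parent environment is not modified
    if endpoint:
        env["MODEL_ENDPOINT"] = endpoint
    return ["llama-cli", "-hf", repo], env


def run_from_hf(repo: str, endpoint: Optional[str] = None) -> None:
    """Start llama-cli; the model is downloaded on first use."""
    argv, env = hf_run_spec(repo, endpoint)
    subprocess.run(argv, env=env, check=True)
```

For example, run_from_hf("ggml-org/gemma-3-1b-it-GGUF", endpoint="https://hf-mirror.example/") would use the placeholder mirror for this run only.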

One important detail is that llama.cpp only works directly with the GGUF format.
If your model is in another format, you need to convert it first with the convert_*.py scripts provided in the repository.

Hugging Face also offers several online tools related to llama.cpp, including:

converting models to GGUF
quantizing weights to reduce size
converting LoRA adapters
editing GGUF metadata in the browser
hosting llama.cpp inference endpoints

If you only want the practical takeaway, start with repositories that already provide GGUF, then use llama-cli -hf <user>/<model>. In most cases, that is the simplest path.

What Does `it` Mean in Gemma-4-31B-it

Sat, 11 Apr 2026 20:45:34 +0800

In gemma-4-31B-it, it stands for Instruction Tuned.

For most users, that means this version is designed for chat, Q&A, coding help, and other instruction-following tasks.

What `it` means

Models often come in two common forms:

Base / Pre-trained: closer to a raw next-token predictor
it: tuned to follow user instructions more reliably

If you ask something like “translate this text” or “write a Python script”, the it version usually behaves more like an assistant.

What `31B` means

31B means the model has about 31 billion parameters.

In general:

more parameters often mean stronger capability
but also higher VRAM or RAM requirements

So 31B is a relatively large model and needs stronger hardware.

What `Gemma-4` means

Gemma-4 identifies the model family and generation:

Gemma: Google’s open model family
4: the fourth generation in that family

Which one to choose

If your goal is chat, Q&A, translation, or coding, the -it version is usually the better choice.

The base version is more relevant for lower-level research, fine-tuning, or custom training workflows.

One-line summary

gemma-4-31B-it means: Gemma 4 family, 31 billion parameters, instruction-tuned for conversation and task execution.
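
The naming scheme is regular enough to parse mechanically. The sketch below is a hypothetical Python helper that splits names of the form family-generation-sizeB[-variant]; it is an illustration of the breakdown above, not an official naming parser, and it will not match names that use other size labels.

```python
import re
from typing import Optional

# family-generation-<size>B[-variant], e.g. gemma-4-31B-it
NAME_PATTERN = re.compile(
    r"(?P<family>[a-z]+)-(?P<generation>\d+)-(?P<size>\d+(?:\.\d+)?)b(?:-(?P<variant>[a-z0-9]+))?",
    re.IGNORECASE,
)


def parse_model_name(name: str) -> Optional[dict]:
    """Split a model name into family, generation, size, and tuning flag."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    return {
        "family": match["family"].lower(),
        "generation": int(match["generation"]),
        "params_billion": float(match["size"]),
        # A missing suffix is treated as the base (pre-trained) model.
        "instruction_tuned": (match["variant"] or "").lower() == "it",
    }


if __name__ == "__main__":
    print(parse_model_name("gemma-4-31B-it"))
```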

Choosing Llama GGUF Quantization on Hugging Face: Practical Advice from Q8 to Q2

Sat, 11 Apr 2026 20:07:29 +0800

When selecting a Llama GGUF model on Hugging Face, you can think of quantization levels like resolution: lower levels need less VRAM/RAM, but quality drops gradually.

Understand 32, 16, and Q levels first

32 (F32): closest to original/uncompressed quality, but hardware demand is extreme.
16 (F16 or BF16): still very close to original quality, at about half the size of 32.
Q8: common entry point for quantized models (Q8_0 or Q8).
Q6, Q5, Q4, Q3, Q2: lower number means lower resource use and higher quality loss risk.

What `K_M` / `K_S` means

K_M and K_S are mixed quantization variants:

most weights stay at the target quantization level
important parts keep higher precision

So at the same level, Qx_K_M or Qx_K_S is usually slightly better than plain Qx.

Practical picking strategy

If hardware allows, start with Q8.
If memory is tight, step down through Q6 / Q5 / Q4.
Try not to go below Q4; Q4_K_M is a common lower bound.
Below Q4, quality degradation becomes increasingly visible.

Quality order (best to worst)

32
16

– Above this point, quality is effectively the same, but hardware requirements are extreme –

Q8
Q6_K_M
Q6_K_S
Q6
Q5_K_M
Q5_K_S
Q5

– This is the typical sweet spot –

Q4_K_M
Q4_K_S
Q4

– Below this point, quality loss becomes visible –

Q3_K_M
Q3_K_S
Q3
Q2_K_M
Q2_K_S
Q2

If you want one short rule: start with Q8 or Q6_K_M, then move down to Q5 or Q4_K_M only when needed.
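
That rule can be written as a small selection function: walk the quality order from best to worst and take the first file that fits. This is a Python sketch under my own assumptions: the preference list only includes quant names commonly seen in GGUF repositories, the file sizes are supplied by you from the repository's file list, and the 1.5 GB headroom for context is an arbitrary placeholder.

```python
from typing import Dict, Optional

# Best-to-worst preference order (a subset of the list above).
PREFERENCE = ["Q8_0", "Q6_K", "Q5_K_M", "Q5_K_S", "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K"]


def pick_quant(file_sizes_gb: Dict[str, float], memory_gb: float,
               headroom_gb: float = 1.5) -> Optional[str]:
    """Return the highest-quality quant whose file plus headroom fits in memory."""
    for quant in PREFERENCE:
        size = file_sizes_gb.get(quant)
        if size is not None and size + headroom_gb <= memory_gb:
            return quant
    return None  # nothing fits; consider a smaller model instead


if __name__ == "__main__":
    sizes = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.9}  # example sizes in GB
    print(pick_quant(sizes, memory_gb=8.0))
```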

How to Access a Local Ollama API Over LAN on Windows

Sat, 11 Apr 2026 16:43:52 +0800

If you want other devices in the same LAN to access your local Ollama API, follow these steps.

Set the listening host

First, set the OLLAMA_HOST environment variable so that Ollama listens on all network interfaces, then restart Ollama so the change takes effect:

OLLAMA_HOST=0.0.0.0:11434

Open the firewall

In Windows Firewall advanced settings, create an inbound rule that allows the port Ollama listens on (11434 in this setup):

Press Win + S, then search for and open “Windows Defender Firewall”.
Click “Advanced settings”.
Select “Inbound Rules” -> “New Rule…”.
Choose “Port”, then click “Next”.
Select the protocol (TCP for Ollama), enter the port in “Specific local ports” (11434), then click “Next”.
Choose “Allow the connection”, then click “Next”.
In “Profile”, select Domain, Private, and Public, then click “Next”.
Name the rule (for example Ollama11434) and click “Finish”.

Run Ollama

ollama run <model>

Access the model through API

curl http://192.168.x.xxx:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "What model is this?"
}'
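
The same request can be sent from Python on another machine in the LAN. This is a minimal sketch using only the standard library; build_generate_request and generate are hypothetical helper names, the IP address in the usage note is a placeholder, and "stream": False is set so that the server returns a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request


def build_generate_request(host: str, model: str, prompt: str,
                           port: int = 11434) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"http://{host}:{port}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(host: str, model: str, prompt: str) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(host, model, prompt)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

For example, generate("192.168.1.50", "gemma4", "What model is this?") would query a server at that placeholder address. If the call times out, the firewall rule or the listening host is the first thing to recheck.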

Gemma 4 Local Runtime Guide: From One-Command Start to Dev Integration

Fri, 10 Apr 2026 22:54:17 +0800

If you want to run Gemma 4 locally, you can choose from four practical paths depending on your goal and hardware.

1) Fastest start: Ollama (recommended)

This is the lowest-friction option for quick testing, daily chat, and local API usage.

ollama run gemma4

Highlights:

Works on Windows, macOS, and Linux
Handles hardware acceleration automatically
Offers OpenAI-style local API compatibility

2) GUI workflow: LM Studio / Unsloth Studio

If you prefer a desktop UI instead of terminal commands:

LM Studio: browse and run Gemma 4 quantized variants from Hugging Face (for example 4-bit, 8-bit), with resource visibility.
Unsloth Studio: supports both inference and low-VRAM fine-tuning, often friendlier on 6GB-8GB GPUs.

3) Low-spec and maximum control: llama.cpp

Good for older hardware, CPU-focused setups, or users who want deeper runtime control.

With .gguf model files and quantization, Gemma 4 can be made practical on much smaller hardware budgets.

4) Developer integration: Transformers / vLLM

If you need Gemma 4 inside your own application:

Transformers: straightforward Python integration
vLLM: high-throughput inference for stronger GPU environments

Quick selection

Need	Recommended tools	Hardware bar
I just want it running now	Ollama	Low
I want a ChatGPT-like UI	LM Studio	Medium
My VRAM is limited (6GB-8GB)	Unsloth / llama.cpp	Low
I am building local AI apps	Ollama / Transformers / vLLM	Medium to high
I need fine-tuning	Unsloth Studio	Medium to high

Model size suggestion

Gemma 4 comes in multiple sizes (for example E2B, E4B, 31B).

Start with quantized E2B/E4B on mainstream laptops
Move to larger variants only after your baseline pipeline is stable

What are Ollama cloud models and how do you use them

Thu, 09 Apr 2026 18:42:32 +0800

If you already use Ollama to run local models, cloud models are easy to understand.

There is only one core difference:
local models run on your own machine, while cloud models run on Ollama’s cloud infrastructure and return the result to you.

What are Ollama cloud models

Ollama cloud models keep the Ollama workflow, but move the actual computation from your local machine to the cloud.

The main benefits are:

Less pressure on local hardware
Easier access to larger models that your machine cannot run well
You can keep using the familiar Ollama workflow

How they differ from local models

Item	Local models	Cloud models
Runtime location	Your machine	Cloud
Hardware requirements	High	Low
Latency	Usually lower	Affected by network
Privacy	Stronger	Requests are sent to the cloud

If you care more about privacy, low latency, and offline use, local models are a better fit.
If your hardware is limited but you still want to use larger models, cloud models are more convenient.

How to identify a cloud model

At the moment, Ollama cloud models are typically labeled with a -cloud suffix, for example:

gpt-oss:120b-cloud

The available model list may change over time, so the official Ollama pages should be treated as the source of truth.

How to use them

First, sign in:

ollama signin

After that, run a cloud model directly:

ollama run gpt-oss:120b-cloud

If you are calling it from code, you can also configure an API key:

export OLLAMA_API_KEY=your_api_key

Python example:

import os
from ollama import Client

client = Client(
    host="https://ollama.com",
    headers={"Authorization": "Bearer " + os.environ["OLLAMA_API_KEY"]},
)

messages = [
    {"role": "user", "content": "Why is the sky blue?"}
]

for part in client.chat("gpt-oss:120b-cloud", messages=messages, stream=True):
    print(part["message"]["content"], end="", flush=True)

Summary

Ollama cloud models can be summarized in one sentence:

the commands are almost the same, but the model is no longer running on your local machine.

If your computer cannot handle large models well, but you still want to keep the Ollama workflow, cloud models are a very direct option.

How to Download a GGUF Model from Hugging Face and Import It into Ollama

Thu, 09 Apr 2026 11:00:07 +0800

If a model is not available in the official Ollama library, or if you want to use a specific GGUF file from Hugging Face, you can download it manually and then import it into Ollama.

Step 1: Download the GGUF file from Hugging Face

First, find the target model’s GGUF file on Hugging Face. You will usually see multiple quantized versions, such as:

Q4_K_M
Q5_K_M
Q8_0

Which version you choose depends on your VRAM, RAM, and your tradeoff between speed and quality. After downloading, place the .gguf file in a fixed directory so you can reference it from the Modelfile.

Step 2: Write the Modelfile

Create a Modelfile in the same directory as the model file. The most basic version looks like this:

FROM ./model.gguf

If the filename is different, replace it with the actual filename, for example:

FROM ./gemma-3-12b-it-q4_k_m.gguf

If your goal is just to get it running, this single FROM line is usually enough.
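
If you import many GGUF files, the Modelfile can be generated rather than written by hand. The sketch below is a hypothetical Python helper that emits the FROM line plus optional PARAMETER and SYSTEM instructions; the parameter names and system prompt in the usage note are examples, not recommendations.

```python
from typing import Dict, Optional


def render_modelfile(gguf_path: str, system: Optional[str] = None,
                     parameters: Optional[Dict[str, object]] = None) -> str:
    """Render a minimal Modelfile as text."""
    lines = [f"FROM {gguf_path}"]
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    if system:
        # Triple quotes allow multi-line system prompts.
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile("./model.gguf", system="You are concise.",
                            parameters={"temperature": 0.7})
    with open("Modelfile", "w", encoding="utf-8") as handle:
        handle.write(text)
```

With no optional arguments, the output is exactly the single FROM line shown above.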

Step 3: Import it into Ollama

Then run:

ollama create myModelName -f Modelfile

myModelName is the local model name you want to use inside Ollama
-f Modelfile tells Ollama to create the model from that file

Once the creation succeeds, the GGUF file becomes a local model that you can call directly.

Step 4: Run the model

After creation, run:

ollama run myModelName

From that point on, it works much like a model pulled with ollama pull.

How to inspect an existing model’s Modelfile

If you are not sure how to write a Modelfile, you can inspect the configuration of an existing model directly:

ollama show --modelfile llama3.2

This command prints the Modelfile for llama3.2, which is useful as a reference for:

How FROM should be written
How the template and system prompt are structured
How parameters are declared

When this approach makes sense

This manual Hugging Face import flow is useful when:

The model you want is not available in Ollama’s official library
You want a specific quantized variant
You have already downloaded the GGUF file manually
You want finer control over how the model is packaged

If Ollama already provides an official version, using pull is usually simpler. But when you need a specific quantization or a custom wrapper, GGUF + Modelfile gives you more flexibility.

Common notes

The path after FROM must match the actual location of the .gguf file.
If the filename contains spaces or special characters, it is better to rename it first.
Different GGUF quantization levels can greatly affect memory use and speed, so successful import does not guarantee smooth runtime performance.
If the model is a chat model, you may still need to adjust the prompt template later for better results.

Conclusion

Downloading a GGUF file from Hugging Face and importing it into Ollama is not complicated. Prepare the model file, write a minimal Modelfile, then run ollama create, and you can bring a third-party GGUF model into your Ollama workflow.

How to Troubleshoot Slow `ollama pull` Model Downloads

Thu, 09 Apr 2026 10:42:39 +0800

ollama pull model_name:tag can be very slow in some regions, and the download process is not always stable.

If your issue looks like repeated interruptions halfway through a large model download, with errors such as TLS handshake timeout or unexpected EOF, the bottleneck may not be registry.ollama.ai itself, but the actual download path after the redirect.

This article walks through a simple troubleshooting approach: first get the real model file URLs, then confirm where the traffic actually ends up, and finally optimize only the domains that matter.

Get the model file download URLs

You can use the following project to extract the manifest and blob download URLs for an Ollama model directly:

https://github.com/Gholamrezadar/ollama-direct-downloader

Using gemma4:latest as an example, you can extract links like the following.

Manifest URL

https://registry.ollama.ai/v2/library/gemma4/manifests/latest

Blob URLs

https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11
https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2
https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3

If you only want a quick verification, you can also download the manifest and blobs directly with curl:

curl -L "https://registry.ollama.ai/v2/library/gemma4/manifests/latest" -o "latest"
curl -L "https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11" -o "sha256-f0988ff50a2458c598ff6b1b87b94d0f5c44d73061c2795391878b00b2285e11"
curl -L "https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a" -o "sha256-4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a"
curl -L "https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2" -o "sha256-7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2"

The real download URL after the redirect

If you try downloading one of the blobs with wget, you will notice that the request does not stay on registry.ollama.ai. It gets redirected to a Cloudflare R2 object storage URL:

wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a

There are a few key details in the log:

registry.ollama.ai returns 307 Temporary Redirect
The final download URL lands on *.r2.cloudflarestorage.com
The large file transfer is actually being served by the object storage domain behind the redirect

This matters because if your proxy or routing rules only cover registry.ollama.ai but not *.r2.cloudflarestorage.com, downloads can still be slow or repeatedly interrupted.

Here is one example of an actual redirect log:

wget https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
--2026-04-09 09:22:04--  https://registry.ollama.ai/v2/library/gemma4/blobs/sha256:4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
Resolving registry.ollama.ai (registry.ollama.ai)... 104.21.75.227, 172.67.182.229, 2606:4700:3034::ac43:b6e5, ...
Connecting to registry.ollama.ai (registry.ollama.ai)|104.21.75.227|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?... [following]
--2026-04-09 09:22:05--  https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/4c/4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a/data?...
Resolving dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com (dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com)... 172.64.66.1, 2606:4700:2ff9::1
Connecting to dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com|172.64.66.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9608338848 (8.9G) [application/octet-stream]

Adjust your network settings

Once you confirm the real download path, the troubleshooting direction becomes much clearer.

If you are using a proxy, split routing, or custom DNS, check these first:

Whether registry.ollama.ai and *.r2.cloudflarestorage.com are using the same stable route
Whether your proxy rules cover only the former but miss the latter
Whether your current outbound path is suitable for sustained multi-GB downloads

The key issue here is not simply whether the official site opens, but whether the redirected object storage path is stable enough for long-running large-file transfers. In many cases, the real bottleneck is the Cloudflare R2 layer rather than the registry domain in front of it.
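
The first two checks amount to asking which hosts in the download path are not matched by your routing rules. Here is a small Python sketch of that check; uncovered_hosts is a hypothetical helper, and the glob-style patterns are an assumption about how your rules are written, since every proxy tool has its own rule syntax.

```python
from fnmatch import fnmatch
from typing import Iterable, List
from urllib.parse import urlsplit


def uncovered_hosts(urls: Iterable[str], rule_patterns: Iterable[str]) -> List[str]:
    """Return hostnames from urls that match none of the glob-style patterns."""
    patterns = list(rule_patterns)
    missing = []
    for url in urls:
        host = urlsplit(url).hostname
        if host and not any(fnmatch(host, pattern) for pattern in patterns):
            missing.append(host)
    return missing


if __name__ == "__main__":
    download_path = [
        "https://registry.ollama.ai/v2/library/gemma4/manifests/latest",
        "https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/",
    ]
    # Rules that only cover the registry leave the storage host unrouted.
    print(uncovered_hosts(download_path, ["registry.ollama.ai"]))
```

An empty result means every host in the path is covered by at least one rule.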

Before-and-after comparison

Here is one real-world example while downloading gemma4:31b-it-q8_0.

Before adjusting the network path, the download was slow and failed midway:

PS C:\Users\knightli> ollama run gemma4:31b-it-q8_0
pulling manifest
pulling a0feadb736f5:  38% ▕██████████████████████                                    ▏  12 GB/ 33 GB  1.2 MB/s   4h40m
Error: max retries exceeded: unexpected EOF

After the adjustment, the same model download became noticeably faster and more stable:

PS C:\Users\knightli> ollama run gemma4:31b-it-q8_0
pulling manifest
pulling a0feadb736f5:  46% ▕████████████████████████████████████████████████████████████████▏ 15 GB/ 33 GB  8.5 MB/s  35m23s

This does not mean every network environment will see the same improvement, but it does support one useful conclusion: the bottleneck may be the actual large-file download path rather than the Ollama client itself.
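
The remaining time in those two progress lines can be sanity-checked with simple arithmetic: remaining size divided by speed. The Python sketch below assumes decimal units (1 GB = 1000 MB); the results land close to what the client displayed, with small differences expected because the displayed sizes and speeds are rounded.

```python
def eta_seconds(total_gb: float, done_gb: float, speed_mb_s: float) -> float:
    """Remaining download time in seconds at a constant speed."""
    remaining_mb = (total_gb - done_gb) * 1000
    return remaining_mb / speed_mb_s


def format_eta(seconds: float) -> str:
    """Format seconds as XhYYmZZs."""
    minutes, secs = divmod(int(round(seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


if __name__ == "__main__":
    print("before:", format_eta(eta_seconds(33, 12, 1.2)))  # 21 GB left at 1.2 MB/s
    print("after: ", format_eta(eta_seconds(33, 15, 8.5)))  # 18 GB left at 8.5 MB/s
```

At 1.2 MB/s the remaining 21 GB needs close to five hours, which is a long window for any connection to stay stable; at 8.5 MB/s the remaining 18 GB takes roughly 35 minutes.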

A more practical troubleshooting order

If you run into the same issue, this order usually works well:

Run ollama pull or ollama run once and confirm the issue is reproducible.
Test a blob URL with wget or curl -L and confirm whether it redirects to *.r2.cloudflarestorage.com.
Adjust your proxy or routing only for the real download domain, then test speed and stability again.

The benefit of this order is that each step validates one clear hypothesis, so you do not have to troubleshoot blindly.

Conclusion

When ollama pull is slow, the problem is often not that registry.ollama.ai is unreachable, but that the Cloudflare R2 path actually serving the large files is unstable.

So instead of retrying over and over, a better approach is to identify the real download path first and optimize the network route where the traffic actually lands.

Gemma 4 on Raspberry Pi 5: It Works, But Responses Are Slow

Wed, 08 Apr 2026 18:42:00 +0800

I ran a near-limit experiment: running Gemma 4 on a Raspberry Pi 5 (8GB RAM). I was not targeting larger variants, only the smallest E2B model.

Conclusion first: it runs and it is usable, but it fits low-interaction workflows better than real-time chat.

Test Environment

Device: Raspberry Pi 5 (4-core CPU, 8GB RAM)
OS: Ubuntu Server (no GUI)
Access method: SSH
Runtime: LM Studio CLI (command-line-only mode)
Model: Gemma 4 E2B (about 4.5GB)

Step 1: Install and Start LM Studio CLI

I installed the LM Studio CLI build on the Pi, then started the service and checked available commands.

For a terminal-only setup, this deployment mode is a good fit for Raspberry Pi.

Step 2: Move Model Storage to SSD

To avoid heavy SD card writes, I switched model download storage to an external SSD.

On Raspberry Pi 5, SSD usage is much more practical than on older models. For long-term local model runs, SSD is strongly recommended.

Step 3: Download and Load Gemma 4 E2B

After download, the model loaded into memory successfully.

According to official information, Gemma 4 includes:

Tool-calling support for agent-style workflows (function calling)
Multimodal capabilities (image/video; smaller models also include audio-related capability)
128K context window
Apache 2.0 license (commercial use allowed)

Given Raspberry Pi hardware limits, E2B is the most practical tier to start with.

Step 4: Start API and Enable LAN Access

After loading, I started the API on local port 4000 and confirmed model listing works via HTTP.

The issue: by default, it only listens on localhost, so other LAN devices cannot access it directly.

Since host binding was not exposed by the startup options, I used socat for port forwarding, bridging an external Pi port to LM Studio’s internal port.

Result: successful. I could query the model list from a MacBook on the same LAN.

Step 5: Connect to Editor (Zed)

LM Studio’s local server is OpenAI-API-compatible, so most tools that support custom base_url can connect.

I added a new LLM provider in Zed pointing to the Pi-hosted Gemma 4 instance, and in-editor chat worked.

Practical Usability

This setup is suitable for:

Local automation scripts
Low-concurrency, low-real-time assistant tasks
Personal learning and edge-device experimentation

Less suitable for:

High-frequency interactive chat
Development collaboration scenarios sensitive to response latency

Conclusion

Running Gemma 4 (E2B) on Raspberry Pi 5 is feasible, and the practical output quality is better than expected.

If your goal is offline operation, tool integration, and lightweight-to-mid tasks, this setup is worth trying. If your goal is smooth real-time interaction, stronger hardware is still the better choice.

Connect OpenClaw to Local Gemma 4: Complete Setup Guide

Wed, 08 Apr 2026 18:18:00 +0800

This guide shows how to connect OpenClaw to a local Gemma 4 model through Ollama.

If you have not deployed Gemma 4 locally yet, start here:

How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide

Step 1: Start the Ollama API Service

Start Ollama first:

ollama serve

Then verify the API quickly with:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:12b",
  "prompt": "Hello"
}'

If you get a model response, your local API is ready.

Step 2: Configure OpenClaw to Use Ollama

The OpenClaw config file is usually located at:

~/.openclaw/config.yaml

Edit config.yaml and add a local model entry under models:

models:
  # Your existing model config...

  gemma4-local:
    provider: ollama
    base_url: http://localhost:11434
    model: gemma4:12b
    timeout: 120s

Step 3: Set Default Model (Optional)

If you want Gemma 4 as the default model:

default_model: gemma4-local

Step 4: Restart and Verify OpenClaw

Restart OpenClaw:

openclaw restart

List available models:

openclaw models list

Run a quick chat test:

openclaw chat --model gemma4-local "Hello"

If the chat returns normally, OpenClaw is successfully connected to local Gemma 4.

Common Troubleshooting

connection refused: make sure ollama serve is running.
Model not found: check model name with ollama list (for example gemma4:12b).
Timeout: increase timeout and test a smaller model first.

How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide

Wed, 08 Apr 2026 18:06:00 +0800

If you want to run Gemma 4 locally on a laptop, Ollama is one of the fastest and simplest options. Even without complex setup, you can usually get it running in about five minutes.

Step 1: Install Ollama

Open https://ollama.com and download the installer for your OS.
Complete installation based on your system:

macOS: drag it to Applications.
Windows: run the .exe installer.
Linux: use the install script from the official site.

After installation, Ollama runs as a background service. Beyond initial setup, daily usage is mostly simple commands.

Step 2: Download a Gemma 4 Model

Open a terminal and run:

ollama pull gemma4:4b

If your machine is stronger, you can switch to 12b or 27b. Once downloaded, the model is stored locally.

Check downloaded models with:

ollama list

Step 3: Run the Model

ollama run gemma4:4b

This opens an interactive chat session in your terminal. Type your prompt and press Enter. To exit, type:

/bye

If you prefer a browser chat UI, you can pair it with Open WebUI. It wraps Ollama with a local web interface and is usually quick to set up with Docker.

Laptop Performance Tips

Apple Silicon (M2/M3/M4): Metal acceleration is enabled by default, and 12B can run well.
NVIDIA GPU: CUDA is used automatically when a compatible GPU is detected. Keep drivers updated.
CPU-only inference: works, but larger models will be slower. For most CPU-only setups, 4B is the practical default.
Free memory before loading large models: as a rough rule, each billion parameters needs about 0.5GB of RAM at 4-bit quantization and about 1GB at 8-bit, plus some overhead for context.
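
The memory rule of thumb in the last tip is easy to turn into a quick check before pulling a model. This is a rough Python sketch; ram_range_gb and fit_label are hypothetical helpers, and the three labels are my own shorthand rather than anything Ollama reports.

```python
from typing import Tuple


def ram_range_gb(params_billion: float) -> Tuple[float, float]:
    """Rough RAM range: about 0.5 GB to 1 GB per billion parameters."""
    return 0.5 * params_billion, 1.0 * params_billion


def fit_label(params_billion: float, free_ram_gb: float) -> str:
    """Classify how well a model is likely to fit in the free memory."""
    low, high = ram_range_gb(params_billion)
    if free_ram_gb >= high:
        return "comfortable"
    if free_ram_gb >= low:
        return "tight"
    return "unlikely"


if __name__ == "__main__":
    for size in (4, 12, 27):
        print(f"{size}B with 8 GB free: {fit_label(size, 8)}")
```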

How to Choose a Model

Gemma 4 1B: good for lightweight Q&A, simple summarization, and quick lookups; limited on complex reasoning.
Gemma 4 4B: best for most daily tasks (writing help, coding help, document summarization) with strong speed/quality balance.
Gemma 4 12B: better for longer context and more complex tasks, especially coding and reasoning.
Gemma 4 27B: better for high-demand workloads and closer to frontier-cloud quality, but needs significantly stronger hardware.

How to Install and Run Gemma 4 on Android: Complete Getting-Started Guide

Wed, 08 Apr 2026 17:55:53 +0800

If you want to run Gemma 4 offline on your phone, this guide walks you through the full process from setup to practical usage.

Step 1: Get the App

Google AI Edge Gallery is currently not available on Google Play, so you need to install it via APK sideloading.

On your Android device, go to:

Settings -> Apps -> Special app access -> Install unknown apps

Then:

Find your browser (for example, Chrome or Firefox) and enable “Allow from this source.”
Open the Google AI Edge Gallery GitHub Releases page in your mobile browser.

URL: https://github.com/google-ai-edge/gallery/releases

Download the latest .apk package.
After the download completes, open the file from notifications or your file manager and follow the prompts.

With a stable connection, this step usually takes around 2 minutes.

Step 2: Open the App and Grant Permissions

When you first open AI Edge Gallery, it will request storage permission to save model files. It’s best to allow this; otherwise, the app cannot download or load models.

You will typically see these sections on the home screen:

Ask Image: Vision tasks (describe images, answer questions about photos)
AI Chat: Standard text chat
Summarize: Paste text and generate summaries
Smart Reply: Generate reply suggestions

For most users, AI Chat is the primary entry point.

Step 3: Download a Gemma 4 Model

Enter AI Chat.
Tap Get Models when prompted.
Choose a Gemma 4 model from the list (model size is shown).
Pick based on your device capability; if your phone has 8GB RAM, start with Gemma 4 4B.
Tap Download and let it run in the background.

Note: Larger models take longer to download. You can download multiple models and switch between them later. Downloaded models stay on your device, so you do not need to re-download them.

Step 4: Start Chatting

After the model download is finished:

Tap the model name to load it (the first load usually takes 10 to 30 seconds depending on model size and device performance).
Enter your prompt in the chat box and send it.
The model generates responses locally, and your data does not leave the phone.

The first reply is often slower due to model warm-up. Later messages in the same session are usually faster.

Step 5: Try Vision Features (Gemma 4 Multimodal)

If you downloaded a Gemma 4 multimodal variant:

Go back to the main menu and open Ask Image.
Select an image or take a photo.
Ask a question (for example, “What’s in this image?” or “Is there any text I should pay attention to?”).
Wait for the model to analyze the image locally and return a result.

This feature works offline, and your image is not sent to external servers.

Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B

Sun, 05 Apr 2026 08:30:00 +0800

Gemma 4 focuses on multimodality and local offline inference, with a full range from lightweight to high-performance models. For most local deployment users, the key is not choosing the largest model, but choosing the one that best matches hardware and task needs.

Gemma 4 Model Comparison

The table below is for quick model selection. Actual performance and resource usage should be validated in your own environment.

Model	Parameter Size	Positioning	Key Strengths	Main Limitations	Recommended Scenarios
Gemma 4 2B	2B	Ultra-lightweight	Low latency, low resource usage, lowest deployment barrier	Limited performance on complex reasoning and long task chains	Mobile, IoT, lightweight Q&A, simple automation
Gemma 4 4B	4B	Lightweight enhanced	Stronger understanding and generation than 2B, still easy to deploy locally	Limited ceiling for heavy coding and complex agent tasks	Local assistant, basic document work, multilingual daily tasks
Gemma 4 26B	26B	High-performance (MoE)	Better reasoning and tool use, suitable for production workflows	Significantly higher VRAM requirement and hardware threshold	Coding assistant, complex workflows, enterprise internal agents
Gemma 4 31B	31B	High-performance (dense)	Best overall capability and stronger stability on complex tasks	Highest resource cost and tuning complexity	Advanced reasoning, complex coding tasks, heavy automation

How to Choose: Start from Hardware and Tasks

If your top concern is whether it runs smoothly, use this guideline:

8GB VRAM: prioritize 2B/4B.
12GB VRAM: prioritize 4B or quantized variants of larger models.
24GB VRAM: focus on 26B, and evaluate quantized 31B based on workload.
Higher VRAM or multi-GPU: consider high-precision 31B setups.

Prioritize stability and inference speed first, then scale up model size gradually.
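
The guideline above can be expressed as a simple lookup. In this Python sketch, suggest_model is a hypothetical helper; the tiers come from the list above, while the fallback for less than 8GB of VRAM is my own assumption, since the guideline does not cover that case.

```python
# (minimum VRAM in GB, suggestion), checked from the largest tier down.
TIERS = [
    (24, "26B, or quantized 31B depending on workload"),
    (12, "4B, or quantized variants of larger models"),
    (8, "2B/4B"),
]


def suggest_model(vram_gb: float) -> str:
    """Map available VRAM to the starting point suggested in the guideline."""
    if vram_gb > 24:
        return "high-precision 31B"
    for minimum, suggestion in TIERS:
        if vram_gb >= minimum:
            return suggestion
    # Assumption: below 8 GB, fall back to the smallest model.
    return "quantized 2B, possibly on CPU"


if __name__ == "__main__":
    for vram in (6, 8, 16, 24, 48):
        print(f"{vram} GB -> {suggest_model(vram)}")
```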

Four Typical Use Cases

1) Local General Assistant

Preferred model: 4B
Why: strong balance between cost and quality, suitable for long-running local use.

2) Coding and Automation

Preferred model: 26B
Why: more stable in multi-step tasks, tool calls, and script generation.

3) Advanced Reasoning and Complex Agents

Preferred model: 31B
Why: stronger robustness under complex context.

4) Edge Devices and Lightweight Offline Use

Preferred model: 2B
Why: easiest to deploy on resource-constrained devices.

Deployment Suggestions (Ollama)

A practical approach is to iterate in small steps:

Start with 4B to establish a baseline (latency, memory, quality).
Build a fixed test set from real tasks (for example, 20 common questions + 10 automation tasks).
Compare 26B/31B against that set for accuracy, latency, and VRAM cost.
Upgrade only when the gain is clear.

This avoids jumping to a large model too early and running into lag, low throughput, and maintenance overhead.
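
The steps above can be sketched as a small comparison harness. This Python example is a sketch under stated assumptions: generate is any function you supply that takes a model name and a prompt and returns text (for example a wrapper around your local Ollama API), accuracy is approximated by a crude substring check, and the 5-point minimum gain is an arbitrary threshold.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def evaluate(model: str, cases: List[Dict[str, str]],
             generate: Callable[[str, str], str]) -> Dict[str, float]:
    """Run a fixed test set against one model and collect accuracy and latency."""
    latencies = []
    hits = 0
    for case in cases:
        start = time.perf_counter()
        answer = generate(model, case["prompt"])
        latencies.append(time.perf_counter() - start)
        # Crude scoring: the expected keyword must appear in the answer.
        if case["expect"].lower() in answer.lower():
            hits += 1
    return {"accuracy": hits / len(cases), "mean_latency_s": mean(latencies)}


def worth_upgrading(baseline: Dict[str, float], candidate: Dict[str, float],
                    min_gain: float = 0.05) -> bool:
    """Upgrade only when the accuracy gain is clear."""
    return candidate["accuracy"] - baseline["accuracy"] >= min_gain
```

Keeping the test set fixed is the important part: the same prompts and expectations are reused for every model, so differences in the numbers reflect the model and not the questions.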

Conclusion

The real value of Gemma 4 is not just larger parameter counts, but a practical model ladder from lightweight to high-performance:

For low-cost fast rollout: start with 2B/4B.
For production-grade local AI workflows: prioritize 26B.
For advanced reasoning and heavy automation: move to 31B.

In most cases, the best Gemma 4 choice is not the biggest model, but the one with the best fit for your hardware and task goals.