GGUF on KnightLi Blog

Qwen3.6-35B-A3B jailbreak local deployment: uncensored GGUF, llama.cpp, and safety boundaries

Sun, 24 May 2026 23:52:16 +0800

Freedidi recently introduced a popular local model: Qwen3.6-35B-A3B Uncensored HauhauCS Aggressive. The original article describes it as a jailbreak or uncensored open model, and gives GGUF quantized files, a llama.cpp launch method, and ideas for connecting it to agents.

This kind of model is worth watching, but it should be understood calmly. The point is not only that it has fewer restrictions. It brings several important local AI capabilities together:

A 35B-class model with a MoE architecture.
GGUF quantization that can run on consumer GPUs.
An OpenAI-compatible local API through llama.cpp.
Multimodal vision input through mmproj.
Integration with local agent tools such as Hermes and OpenClaw.

If you care about local models, the more important trend is not the jailbreak label. It is that local models are moving from “can chat” toward “can use tools, understand images, and serve as agent backends.”

What this model is

The model name mentioned in the original article is:

`1`	`Qwen3.6-35B-A3B Uncensored HauhauCS Aggressive`

The name contains several key pieces:

Qwen3.6: based on the Qwen model family.
35B: around 35B total parameters.
A3B: roughly 3B active parameters per inference step, following a MoE-style design.
Uncensored / Aggressive: fewer safety restrictions or a more aggressive tuning style.
GGUF: a quantized format for local inference tools such as llama.cpp.

One important note: Uncensored does not mean more reliable. It usually means the model refuses less often, but it may also generate unconstrained, unverified, or risky content more easily. It can be useful for technical experiments, but it should not be connected directly to public services, production systems, or unattended workflows.

Why a 35B model can run locally

Many people see 35B and assume it requires a server or high-end multi-GPU machine. The key point in the original article is that this model uses a MoE architecture.

MoE can be understood simply: the model has many total parameters, but each inference step activates only part of the experts. The original article says it activates roughly 3B parameters per run, so with quantization it can have much lower speed and VRAM pressure than a traditional dense 35B model.

After GGUF quantization, it becomes possible to run it on consumer GPUs. The article says the smallest quantized version is around 11GB, and 6GB/8GB GPUs can try it, though at least 8GB VRAM is recommended.

A more realistic expectation:

6GB VRAM: possible with low-bit quantization, but reduce expectations for context length and speed.
8GB VRAM: better for entry-level testing with smaller quantization.
16GB VRAM: more comfortable for longer context and more GPU offload.
24GB VRAM: better for higher-quality quantizations such as Q4_K_M and Q4_K_P.

Whether a local model is usable is not only about whether it starts. Context length, generation speed, remaining VRAM, KV cache, multimodal mode, concurrency, and task type all matter.

How to read the quantization choices

The original article roughly recommends:

Q4_K_P: better for RTX 4090 or other 24GB VRAM machines.
Q4_K_M: more stable and higher quality.
IQ4_NL: strong compression while preserving quality as much as possible.
IQ2_M: for 6GB/8GB VRAM users.

Think of this as a trade-off between quality and resource usage:

Q4 quantizations are usually more stable, but use more VRAM.
IQ2 / IQ3 quantizations save resources, but may reduce answer quality, long-text stability, and detail handling.
If you only want to test agent calls and a local API, low quantization can help you get the flow running.
If you plan to write code, analyze images, or do complex reasoning for long periods, choose higher-quality quantization when possible.

Do not treat “it starts” as “it is good enough for long-term use.” Low-VRAM startup and stable task completion are different things.

llama.cpp deployment approach

The original article recommends llama.cpp because it supports Windows, Linux, macOS, and backends such as NVIDIA CUDA, AMD, Intel, Vulkan, and CPU.

A typical launch command looks like:

llama-server.exe ^
  -m "model-path.gguf" ^
  --mmproj "mmproj.gguf" ^
  -ngl 999 ^
  -c 131072 ^
  -n 8192 ^
  --host 127.0.0.1 ^
  --port 8080 ^
  --jinja

Several parameters are worth understanding:

-m: path to the main GGUF model.
--mmproj: multimodal projection file required for vision input.
-ngl: offload layers to GPU as much as possible, depending on VRAM and backend.
-c: context length; higher values use more memory and VRAM.
-n: maximum generated tokens per response.
--host 127.0.0.1: listen only locally, safer than exposing publicly.
--port 8080: local API port.
--jinja: important for newer Qwen chat templates; without it, formatting issues, repetition, or Chinese output problems may occur.

The easiest trap is context length. -c 131072 looks attractive, but long context significantly increases KV cache usage. On low-VRAM machines, start smaller and increase gradually.

How multimodal support works

The article says this build supports multimodal vision, including image analysis, screenshots, OCR, complex UI analysis, and code screenshots.

In llama.cpp, multimodal support usually requires both the main model and the matching mmproj file. If --mmproj is not loaded correctly, image upload may be unavailable or the model may not understand images correctly.

Useful local multimodal scenarios include:

Analyzing UI screenshots.
OCR on image text.
Reading code screenshots or error screenshots.
Providing visual input to local agents.
Processing private images without uploading them to the cloud.

But vision understanding is not strict OCR or a guaranteed source of truth. For invoices, contracts, IDs, medical images, and other high-risk material, human review is still required.

OpenAI-compatible API

llama-server in llama.cpp can expose a local interface similar to the OpenAI API. The local base URL from the original article is:

`1`	`http://127.0.0.1:8080/v1`

This means many tools that support custom OpenAI-compatible providers can send requests to the local model. The API key can often be any placeholder value, depending on whether the client enforces validation.

This is useful because:

No cloud API key is needed.
There is no per-token billing.
Data can remain on the local machine.
It can connect to local agents, coding assistants, or chat frontends.
It can be used as a local OpenAI API replacement for experiments.

Do not expose the local API directly to the public internet. Even when the model runs locally, an open API can be abused, consume machine resources, or produce content you did not intend to generate.

Why Hermes and OpenClaw matter

The original article says the value becomes clearer when connecting this local model to Hermes or OpenClaw.

The meaning is that the model itself is only the inference core. Agent tools connect it to real tasks, such as:

Writing code.
Calling tools.
Reading files.
Analyzing images.
Searching the web.
Executing multi-step tasks.
Maintaining long-context workflows.

A local model used only for chat has limited value. If it can act as a stable agent backend, it becomes closer to a local AI workstation.

However, connecting an uncensored model to an agent requires extra caution. When the agent can operate files, run commands, visit web pages, and call tools, model output turns into real actions. The fewer restrictions the model has, the more important external permissions, human confirmation, and audit logs become.

Safety boundaries for uncensored models

The main selling point of these models is often that they refuse less. But fewer refusals also mean higher risk.

Keep in mind:

It may more easily produce illegal, dangerous, or misleading content.
It may not actively remind you of safety boundaries.
It may give overconfident advice on high-risk topics.
It may be induced by prompts to perform inappropriate tasks.
It is not suitable for direct public exposure.

A safer approach:

Test only on a local machine or controlled LAN.
Do not connect it to high-privilege tools.
Do not let it automatically delete, pay, publish, or bulk-submit.
Put file, command, network, and browser permission boundaries around agent tools.
Keep human review for high-risk outputs.

The freer the model is, the more external system constraints it needs.

Who should try it

This kind of model fits users who:

Want to study local LLM deployment.
Have at least 8GB VRAM and are willing to tune GGUF and llama.cpp.
Want to connect local models to OpenAI-compatible clients.
Care about local multimodal input, screenshot analysis, and agent backends.
Want to process some private data offline.

It is less suitable for:

Beginners who do not want to tune parameters.
Services that require stable production SLA.
Teams with strict security and compliance requirements.
Business workflows that require strict factual reliability.
People who want to expose the model directly to external users.

Conclusion

Models like Qwen3.6-35B-A3B Uncensored HauhauCS Aggressive show that local AI capabilities are moving quickly. Consumer GPUs can run larger models, GGUF quantization lowers deployment barriers, llama.cpp gives local models OpenAI-compatible APIs, and multimodal plus agent tools push them from chat toward task execution.

But it should not be understood only as a jailbreak model. The more valuable angle is that local AI is becoming composable infrastructure. The final experience depends on the model, inference engine, API server, frontend, agent tools, and permission controls together.

If you try it, start with low-risk local testing: choose an appropriate quantization, reduce context length, verify --jinja and --mmproj, then connect a client. After it is stable, consider connecting agent workflows.

References:

Freedidi article: https://www.freedidi.com/24284.html
llama.cpp GitHub: https://github.com/ggml-org/llama.cpp

Running Qwen3.6-35B Locally on an RTX 3070 8GB: llama.cpp Deployment Notes and Tuning Parameters

Fri, 22 May 2026 22:44:16 +0800

Whether an 8GB GPU can run a 35B-class model depends on more than the total parameter count. Model architecture, quantization format, and the way the inference framework schedules work all matter.

The core idea in this setup is to use a GGUF quantized version of an MoE model such as Qwen3.6-35B-A3B, then use llama.cpp with CUDA acceleration, CPU Offload, MoE parameter scheduling, and KV Cache quantization to split memory pressure between the GPU and system RAM. With that approach, an older GPU such as the RTX 3070 8GB can still have a chance to run a 35B-class local multimodal model.

One point needs to be clear first: this is not “fitting a full 35B model entirely into 8GB of VRAM.” A more accurate way to understand it is that the GPU handles the compute that benefits most from GPU acceleration, while some expert layers and cache pressure are carried by system memory. The real experience depends on RAM capacity, CPU performance, quantization format, context length, and parameter choices.

Test environment

This kind of setup is sensitive to system memory. A reference configuration is:

CPU: Intel Core i7-12700 class
GPU: NVIDIA RTX 3070 8GB
RAM: 64GB
OS: Windows 11
Inference framework: llama.cpp CUDA build
Model format: GGUF

If you only have 16GB or 32GB of RAM, it is not necessarily impossible to try, but a 35B MoE model is more likely to create memory pressure during loading and long-context inference. For stable use, 64GB of RAM is a safer target.

Why 8GB VRAM can still run a 35B model

The key to Qwen3.6-35B-A3B is its MoE architecture. Its total parameter scale is 35B, but not all parameters are activated during each inference step; only part of the expert parameters are active.

That leads to two consequences:

The full model file is still large and requires enough disk space and system memory.
The active compute per inference step is lower than a full 35B Dense model.

llama.cpp’s CPU Offload and MoE-related parameters can further reduce the VRAM threshold. The GPU mainly handles attention and some high-value compute, while the CPU and system memory carry part of the expert-layer weights. The tradeoff is that speed, response latency, and stability depend more on the whole machine, not only the GPU model.

Preparing llama.cpp

Windows users can download a prebuilt CUDA version of llama.cpp directly. Pay attention to three points:

The GPU driver should be new enough, and the CUDA runtime should match the llama.cpp package you download.
After downloading, place it in a path without Chinese characters or special characters so batch scripts are easier to run.
Put model files under a unified models directory to avoid very long paths in commands.

If you use AMD, Intel graphics, or a CPU-only environment, you can also choose Vulkan, HIP, SYCL, or CPU builds, but the parameters and performance will be different. This article focuses on the CUDA route for NVIDIA GPUs.

Download the model and multimodal projection file

The model used here is:

Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

The Q4_K_M quantization format is chosen mainly to balance accuracy, file size, and speed. On low-VRAM machines, it is not a good idea to start with a higher-precision version, because loading failures or frequent system paging become much more likely.

If you want image understanding, you also need the multimodal projection file, for example:

mmproj-BF16.gguf

This file is important. Downloading only the main model usually gives you text inference only. Without mmproj, the web UI may not expose a usable image upload feature, or uploaded images may not be processed correctly.

Keep the directory structure simple:

llama.cpp/
├─ llama-server.exe
└─ models/
   ├─ Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
   └─ mmproj-BF16.gguf

RTX 3070 8GB startup parameters

Below is an example startup script for an RTX 3070 8GB. Change the path to your own llama.cpp directory.

@echo off
chcp 65001 >nul
cd /d D:\AI\llama.cpp

llama-server.exe ^
  -m "models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
  --mmproj "models\mmproj-BF16.gguf" ^
  -ngl 99 ^
  --n-cpu-moe 999 ^
  --flash-attn on ^
  --jinja ^
  -c 32768 ^
  -t 12 ^
  -b 512 ^
  -ub 128 ^
  --cache-type-k q4_0 ^
  --cache-type-v q4_0 ^
  --mlock ^
  --host 127.0.0.1 ^
  --port 8080

pause

After startup, open this address in your browser:

`1`	`http://127.0.0.1:8080`

If the page opens and the model replies normally, the service has started successfully. The first model load can be slow. Avoid launching multiple instances repeatedly during loading, because that can fill system memory more easily.

Understanding the key parameters

-ngl 99 tries to place as many layers as possible on the GPU. How many layers actually fit depends on the model structure, quantization format, and VRAM usage.

--n-cpu-moe 999 pushes more MoE expert layers to the CPU side, reducing VRAM pressure. It is one of the key parameters for running large MoE models on low-VRAM hardware.

--flash-attn on enables Flash Attention, which can reduce the cost of attention computation. Whether it is available depends on the current llama.cpp version and GPU support.

-c 32768 sets the context length. Long context significantly increases KV Cache pressure. If startup fails or inference is very slow, try lowering it to 8192 or 16384.

--cache-type-k q4_0 and --cache-type-v q4_0 quantize the KV Cache, saving memory and VRAM, though they may have a small impact on output quality and speed.

-b 512 and -ub 128 control batching-related parameters. In a low-VRAM environment, do not start with overly aggressive batch settings.

Common issues

If startup reports insufficient VRAM, first reduce the context length, for example changing -c 32768 to -c 8192, then try lowering -b and -ub.

If the image upload button is unavailable, first check whether the --mmproj path is correct and whether the mmproj file matches the model.

If the model responds slowly after loading, it usually does not mean the GPU is idle. Large amounts of weights or expert layers may be handled by the CPU and system memory. Use Task Manager to observe GPU, CPU, memory, and disk usage to identify the bottleneck.

If the output format looks wrong, confirm that --jinja is enabled and check whether the model requires the corresponding chat template.

If the browser cannot open the service after startup, check the --host and --port settings, and make sure port 8080 is not occupied by another program.

Who should try this

This setup is suitable for users who already have 8GB VRAM devices such as RTX 3070, RTX 4060 Laptop, or RTX 3060 8GB, but want to experiment with larger MoE models.

It is not suitable for people who need maximum speed. Running a 35B MoE model on low VRAM essentially trades CPU and system memory for a lower VRAM requirement. Being able to run it is one thing; whether it feels smooth enough is another.

If your goal is high-frequency daily chatting, 7B, 8B, or 14B models may feel better. If your goal is to explore larger MoE models, multimodal capability, and the boundary of local deployment, an RTX 3070 8GB with 64GB of RAM is still worth trying.

Summary

The reason an RTX 3070 8GB can run Qwen3.6-35B-A3B is not that the GPU suddenly has more VRAM. It is the combination of MoE architecture, GGUF quantization, llama.cpp CPU Offload, and KV Cache optimization that lowers the threshold.

The most interesting part of this setup is that it lets older GPUs still participate in local large-model experiments. As long as you accept tradeoffs in speed and stability, an 8GB VRAM machine can still be a local AI model testing platform, not only an entry-level device for small models.

References:

Original article: https://www.freedidi.com/24267.html

llama.cpp b9196 Update: Windows Prebuilt Binaries Support CUDA 13.1, Vulkan, HIP, and SYCL

Mon, 18 May 2026 23:20:00 +0800

The recent Windows release of llama.cpp is much friendlier for local LLM users. In the past, running GGUF models on Windows often meant dealing with environment issues: CUDA version mismatches, missing DLLs, incompatible drivers, failed CMake builds, wrong environment variables, or complicated Vulkan / HIP / SYCL setup.

Now the official Release page provides several Windows prebuilt packages. In many cases, users no longer need to compile from source. Download the right build, unzip it, place the model file, and you can start a local inference service directly.

What llama.cpp Is Good For

llama.cpp is one of the most commonly used local GGUF model inference frameworks. It is lightweight, cross-platform, can run on CPU or GPU, and has a large ecosystem of GGUF model resources.

Common model families include:

Qwen
Llama
DeepSeek
Gemma
Mistral
Mixtral
Hermes

As GGUF quantized models become more common, many open source models now provide GGUF versions suitable for local deployment. For regular users, the value of llama.cpp is simple: you do not need a full complex inference stack to run a usable chat service on your own machine.

How to Choose a Windows Prebuilt Build

Windows users can choose different builds based on their hardware:

Windows x64 CPU
Windows x64 CUDA 12.4
Windows x64 CUDA 13.1
Windows x64 Vulkan
Windows x64 HIP Radeon
Windows x64 SYCL
Windows ARM64 CPU

If you use an NVIDIA GPU, the CUDA build is usually the first choice. Cards such as RTX 3060, 4060, 4070, 4080, and 4090 are better suited to the CUDA route.

If you use an AMD GPU, try HIP or Vulkan. In practice, Vulkan can sometimes be easier than HIP, especially if you do not want to set up a full ROCm environment.

If you use Intel integrated graphics or an Arc GPU, try SYCL or Vulkan. Performance is usually behind NVIDIA CUDA, but it is already enough to test many small and medium GGUF models.

The CPU build is suitable for users without a discrete GPU, or for those who only want to verify a model or run small models. It will not be fast, but deployment is the simplest.

Start a Regular GGUF Model

Assume you have downloaded the llama.cpp Windows prebuilt package and placed your model in the models directory. Enter the extracted llama.cpp directory and run:

`1`	`llama-server.exe -m models\your-model.gguf -ngl 999`

Here, -m points to the GGUF model file, and -ngl 999 tells llama.cpp to load as many layers as possible onto the GPU. The actual number depends on VRAM size, model size, and quantization format.

After startup succeeds, open this address in your browser:

`1`	`http://127.0.0.1:8080`

You will enter the local web chat interface.

If VRAM is not enough, switch to a smaller model or a lower quantization version, such as Q4 or Q5 GGUF files. Do not only look at parameter count; also check quantization format and context length settings.

Start a Multimodal Vision Model

Multimodal vision models usually need more than the main model file. They also need an mmproj vision projection file. Start them by specifying both:

`1`	`llama-server.exe -m "models\main-model.gguf" --mmproj "models\mmproj-model.gguf" -ngl 999`

Common uses include:

OCR recognition
Screenshot understanding
Webpage screenshot analysis
Image Q&A
Simple visual content judgment

For example, Qwen2-VL / Qwen2.5-VL models are useful for Chinese screenshot understanding, OCR, and image-text Q&A. Make sure the main model and mmproj file match; version mismatches can easily cause loading failures or abnormal output.

Use a bat Script to Manage Multiple Models

If you keep multiple models locally, you can write a simple .bat script to switch between them. The following example needs your own path and model names:

@echo off
chcp 65001 >nul
cd /d C:\path\to\llama-b9196-bin-win-cuda-13.1-x64

echo 请选择模型：
echo 1. Gemma
echo 2. Qwen VL 多模态
echo 3. DeepSeek

set /p choice=输入数字：

if "%choice%"=="1" llama-server.exe -m "models\gemma.gguf" -ngl 999
if "%choice%"=="2" llama-server.exe -m "models\qwen-vl.gguf" --mmproj "models\mmproj.gguf" -ngl 999
if "%choice%"=="3" llama-server.exe -m "models\deepseek.gguf" -ngl 999

pause

Save it as UTF-8, then change the extension to .bat. Double-clicking the script lets you choose different models by number.

Three Things to Check When Choosing Models

First, check hardware. More VRAM means you can run larger models. If VRAM is limited, do not force a large model; start with 7B, 8B, or a lower quantization version.

Second, check the use case. For everyday Q&A, summarization, and rewriting, small models or medium quantization are often enough. For coding, long-document analysis, or multimodal understanding, you need stronger models and more VRAM.

Third, check licenses and safety boundaries. Many community-modified models have different capabilities, restrictions, and licenses. Before downloading, confirm the source, license, intended use, and risks. Do not hand production work directly to models from unclear sources.

Common Issues

If startup reports missing DLLs, first confirm that the downloaded package matches your GPU route. NVIDIA users should not download the HIP build by mistake, and AMD users should not download the CUDA build.

If model loading is slow, the model may be too large, the disk may be slow, or part of the model may be falling back to CPU due to insufficient VRAM.

If the web page does not open, check whether the command line service started successfully, then confirm the port is 8080. If the port is occupied, check llama-server parameters and change the port.

If a multimodal model behaves incorrectly, first check whether the mmproj file matches the main model instead of only changing prompts.

Summary

The value of these Windows prebuilt packages is that they lower the entry barrier for local AI. Many users previously got stuck at compilation and dependency setup. Now they can move faster into downloading models, starting a service, and testing results.

For Windows users, the route can be summarized simply:

NVIDIA: prefer CUDA.
AMD: try Vulkan first, then HIP.
Intel: try SYCL or Vulkan.
No discrete GPU: use the CPU build for small models.

Before real use, still confirm model source, license, VRAM needs, and actual results. Local AI gives you control, offline operation, and low latency, but it is not free of cost: model management, hardware resources, and output quality are still your responsibility.

Source: https://www.freedidi.com/24211.html

Local LLM Models Recommended for an RTX 3060 GPU

Fri, 08 May 2026 09:25:24 +0800

The most common RTX 3060 variant has 12GB of VRAM. It is not a top-tier AI GPU, but it is a very usable card for local LLMs, especially 7B, 8B, 9B, and 12B models.

If you only want a quick rule of thumb, remember this:

On an RTX 3060 12GB, prioritize around-8B models in Q4_K_M or Q5_K_M quantization. Choose Q4 for stability, and try Q5 if you want better quality.

Do not start by chasing 32B or 70B models. Even if they can run with low-bit quantization and CPU offloading, their speed and experience are usually not suitable for daily use.

Start With the VRAM Limit

For local LLMs on an RTX 3060 12GB, the real limit is VRAM.

Model Size	Recommended Quantization	RTX 3060 12GB Experience
3B / 4B	Q4, Q5, Q8	Very easy, fast
7B / 8B / 9B	Q4_K_M, Q5_K_M	Best balance of quality and speed
12B / 14B	Q4_K_M	Usable, but avoid huge context
30B+	Q2 / Q3 or partial offload	Possible to tinker with, not recommended daily
70B+	Very low quantization or heavy CPU/RAM use	More like an experiment

Local LLMs do not only consume VRAM for the model file. Context length, KV cache, batch size, inference framework, and drivers all consume resources.

So 12GB of VRAM does not mean you can load a 12GB model file directly. It is better to leave room for the system and context.

Recommendation 1: Qwen3 8B

If you mainly use Chinese, Qwen3 8B is one of the first models worth trying on an RTX 3060.

Good for:

Chinese Q&A.
Summarization and rewriting.
Everyday knowledge assistant work.
Simple code explanation.
Local RAG.
Lightweight Agent flows.

Recommended choice:

1
2
3

Qwen3 8B GGUF
Q4_K_M: first choice
Q5_K_M: better quality, more VRAM pressure

Qwen models are friendly to Chinese usage. For daily writing, information organization, and Chinese instruction following, Qwen3 8B is a good first model.

Recommendation 2: Llama 3.1 8B Instruct

Llama 3.1 8B Instruct is a stable general-purpose model with mature English capability and ecosystem support.

Good for:

English Q&A.
Lightweight coding help.
General chat.
Document summarization.
Prompt testing.
Comparing different inference tools.

Recommended choice:

1
2
3

Llama 3.1 8B Instruct GGUF
Q4_K_M: better speed and VRAM stability
Q5_K_M: better answer quality

If you mainly process English materials, or want a model with many tutorials and broad compatibility, Llama 3.1 8B is still a good baseline.

Recommendation 3: Gemma 3 12B

Gemma 3 12B is closer to the upper practical limit for an RTX 3060 12GB.

It uses more VRAM than 8B models, but Q4 quantization can still make it usable on a 12GB card. It is a good option if you want to try a slightly larger model on one GPU.

Good for:

Higher-quality general Q&A.
English content processing.
More complex summarization and analysis.
Trying an upgrade over 8B models.

Recommended choice:

1
2
3

Gemma 3 12B GGUF
Q4_K_M or official QAT Q4
Keep context modest

If you run out of VRAM, reduce context length first, or return to an 8B model. For an RTX 3060, 12B is “worth trying,” not a no-brainer default.

Recommendation 4: DeepSeek R1 Distill Qwen 8B

If you want to experience reasoning-style local models, try models like DeepSeek R1 Distill Qwen 8B.

Good for:

Simple reasoning tasks.
Step-by-step analysis.
Learning reasoning-model output style.
Low-cost local experiments.

Recommended choice:

1
2

DeepSeek R1 Distill Qwen 8B GGUF
Q4_K_M

These models may produce longer reasoning-style outputs, so speed and context usage can be heavier than ordinary instruction models. They are not always more comfortable for daily chat, but they are useful for reasoning experiments.

Recommendation 5: Phi / MiniCPM / Smaller Models

If your RTX 3060 is an 8GB variant, or your system RAM is limited, consider 3B and 4B models first.

Good for:

Fast Q&A.
Simple summaries.
Embedding into local tools.
Low-latency chat.
Testing on older machines.

These models may not match 8B or 12B quality, but they are light, fast, and easy to deploy.

Which Quantization to Use

Local models commonly use GGUF, with quantization types such as Q4, Q5, Q6, and Q8.

Quantization	Traits	Best For
Q4_K_M	Small, fast, good enough	RTX 3060 first choice
Q5_K_M	Better quality, higher usage	Try with 8B models
Q6 / Q8	Closer to original quality, larger	Small models or more VRAM
Q2 / Q3	Saves VRAM but quality drops	Large-model tinkering

For RTX 3060 12GB, the practical choices are:

1
2
3

8B models: Q4_K_M or Q5_K_M
12B models: Q4_K_M first
Larger models: not recommended as daily drivers

Which Tool to Use

Beginners can start with Ollama, because installation and running models are simple.

Common commands:

1
2

ollama run qwen3:8b
ollama run llama3.1:8b

If you want finer control over GGUF files, GPU layers, and context length, use llama.cpp or GUI tools based on it.

Common choices:

Ollama: easiest, best for beginners.
LM Studio: friendly GUI, good for downloading and switching models.
llama.cpp: most control, best for performance tuning.
text-generation-webui: many features, good for backend testing.

For local chat and simple Q&A, Ollama or LM Studio is enough.

Do Not Set Context Too High

Many models advertise long-context support, but do not blindly set context to the maximum on an RTX 3060.

Longer context uses more KV cache and increases VRAM pressure. Even if the model loads, long context can slow generation down.

Suggested settings:

1
2
3

Normal chat: 4K to 8K
Document summaries: 8K to 16K
Long-document RAG: chunk first; do not paste everything at once

An RTX 3060 is better suited to “moderate context + good model + good retrieval” than forcing hundreds of thousands of tokens into one prompt.

Choose by Use Case

If you mainly write Chinese:

1
2

First choice: Qwen3 8B Q4_K_M
Alternative: DeepSeek R1 Distill Qwen 8B

If you mainly write English:

1
2

First choice: Llama 3.1 8B Instruct Q4_K_M
Alternative: Gemma 3 12B Q4_K_M

If you want speed:

1
2
3

3B / 4B models
8B Q4_K_M
Keep context at 4K to 8K

If you want better quality:

1
2
3

8B Q5_K_M
12B Q4_K_M
Accept slower speed

If you want coding help:

1
2

8B coding models can help with explanations and small edits
For complex engineering tasks, use stronger cloud models

Local RTX 3060 models are good for code explanation, function completion, small scripts, and offline assistance. For large refactors, difficult bugs, and cross-file Agent work, do not expect Claude Sonnet or GPT-5-level performance.

Reasonable Expectations

The RTX 3060 12GB is good enough to turn local LLMs from toys into daily tools, but it will not recreate top cloud models at home.

Its strengths:

Low cost.
More VRAM than 8GB cards.
Good 8B model experience.
Offline use.
Local processing for privacy-sensitive materials.

Its limits:

Large models are hard to run smoothly.
Long context consumes VRAM.
Slower than high-end GPUs.
Small local models have limited complex reasoning.
Multimodal and Agent workflows need more resources.

The stable route is: use 8B models as everyday local assistants, try 12B models for quality, and leave complex tasks to cloud models.

Summary

Recommended local LLM choices for RTX 3060 12GB:

Chinese general use: Qwen3 8B Q4_K_M
English general use: Llama 3.1 8B Instruct Q4_K_M
Higher-quality experiment: Gemma 3 12B Q4_K_M
Reasoning experiment: DeepSeek R1 Distill Qwen 8B Q4_K_M
Low-VRAM fast use: 3B / 4B small models

Choose Q4_K_M first. Try Q5_K_M for 8B models if you want better quality. Start with Ollama or LM Studio.

Do not treat the RTX 3060 as a large-model server. Treat it as a local knowledge assistant, privacy document processor, lightweight coding helper, and model experiment card, and it will fit its real capabilities much better.

References

Qwen3 8B GGUF: https://huggingface.co/Qwen/Qwen3-8B-GGUF
Llama 3.1 8B GGUF: https://huggingface.co/macandchiz/Llama-3.1-8B-Instruct-GGUF
Gemma 3 12B GGUF: https://huggingface.co/unsloth/gemma-3-12b-it-GGUF
llama.cpp: https://github.com/ggml-org/llama.cpp
Ollama: https://ollama.com

Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models

Fri, 01 May 2026 12:02:00 +0800

The Qwen3.6 open-weight models that are most relevant for local deployment are mainly:

Qwen3.6-27B: a 27B dense model.
Qwen3.6-35B-A3B: a 35B total / 3B active MoE model.

There are also online product or API model names such as Qwen3.6-Plus and Qwen3.6-Max. If a model does not have public full weights and stable quantized files, it is not suitable for a local VRAM table. This article only covers versions that can be deployed around Hugging Face weights and GGUF quantized files.

As with the Gemma 4 table in /05/10, two concepts need to be separated first:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by weights, KV cache, context length, runtime backend, multimodal modules, and batch size.

Qwen3.6 has a very long default context. The official model card states native support for 262,144 tokens and extension to 1,010,000 tokens. So the “minimum VRAM” column below only applies to short or medium context. If you really want 128K, 256K, or longer context, reserve much more room for KV cache.

Quick Summary

VRAM	Good Fit	Avoid
8GB	Extreme 2-bit tests for 27B / 35B-A3B, with clear quality risk	Q4 and above
12GB	27B Q2/Q3, 35B-A3B Q2/Q3 with short context	27B Q4 with long context
16GB	27B Q3/Q4, 35B-A3B Q3/IQ4_XS	35B-A3B Q4 with long context
24GB	27B Q4/Q5/Q6, 35B-A3B Q4	35B-A3B Q8, BF16
32GB	27B Q8, 35B-A3B Q5/Q6	BF16
48GB	35B-A3B Q8, 27B with longer context more comfortably	35B-A3B BF16
80GB+	27B / 35B-A3B BF16	No need to chase BF16 for ordinary local chat

If you have a 24GB GPU, focus on:

Qwen3.6-27B Q4_K_M
Qwen3.6-27B Q5_K_M
Qwen3.6-35B-A3B UD-Q4_K_M

If you only have 16GB VRAM, start with low-bit variants and do not enable very long context right away.

Official Weight Sizes

The following BF16 weight sizes come from model.safetensors.index.json in the official Hugging Face repositories. They are useful as a reference for the original model scale.

Model	Architecture	Official BF16 Weight Size	Official Context
`Qwen3.6-27B`	27B dense	55.56GB	Native 262K, extendable to 1,010K
`Qwen3.6-35B-A3B`	35B total / 3B active MoE	71.90GB	Native 262K, extendable to 1,010K

Although 35B-A3B activates about 3B parameters per step, it still needs to load the full MoE weights. So it should not be estimated like a 3B small model.

Qwen3.6-27B VRAM Table

Qwen3.6-27B is a dense model. Its advantage is stable behavior, while its inference cost is closer to a traditional 27B model. For local deployment, it is more compute-heavy than 35B-A3B, but its VRAM requirements are easier to estimate.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	9.39GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	10.85GB	12GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	11.85GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	11.99GB	14GB	18GB	VRAM-saving 3-bit
`Q3_K_S`	12.36GB	16GB	20GB	3-bit entry point
`Q3_K_M`	13.59GB	16GB	20GB	Common 3-bit compromise
`IQ4_XS`	15.44GB	20GB	24GB	Near-Q4, more VRAM efficient
`IQ4_NL`	16.07GB	20GB	24GB	Quality/size balance
`Q4_K_M`	16.82GB	20GB	24GB	Recommended 27B default
`Q5_K_M`	19.51GB	24GB	32GB	Higher-quality quantization
`Q6_K`	22.52GB	28GB	32GB	Quality first
`Q8_0`	28.60GB	32GB	40GB	Near-original precision
`BF16`	53.80GB	64GB	80GB	Research, evaluation, precision comparison

For ordinary local coding and chat, Q4_K_M is the easiest starting point to recommend. A 24GB GPU can run Q4_K_M fairly comfortably, but for long context, reduce quantization size or context length.

Qwen3.6-35B-A3B VRAM Table

Qwen3.6-35B-A3B is an MoE model with 35B total parameters and about 3B active parameters per step. Its advantage is a strong balance between speed and capability, especially for local agents, tool use, and coding workflows.

But note that MoE 3B active mainly affects compute. It does not mean VRAM usage is comparable to a 3B model. Full operation still needs the expert weights.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	10.76GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	11.52GB	14GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	12.29GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	13.21GB	16GB	20GB	VRAM-saving 3-bit
`UD-Q3_K_S`	15.36GB	18GB	24GB	3-bit entry point
`UD-Q3_K_M`	16.60GB	20GB	24GB	Common 3-bit compromise
`UD-IQ4_XS`	17.73GB	20GB	24GB	Quality/size balance
`UD-IQ4_NL`	18.04GB	20GB	24GB	Near-Q4 recommended option
`UD-Q4_K_M`	22.13GB	24GB	32GB	Recommended 35B-A3B default
`UD-Q5_K_M`	26.46GB	32GB	40GB	Higher-quality quantization
`UD-Q6_K`	29.31GB	32GB	48GB	Quality first
`Q8_0`	36.90GB	48GB	64GB	Near-original precision
`BF16`	69.37GB	80GB	96GB	Research, evaluation, precision comparison

With 24GB VRAM, UD-Q4_K_M is a key option, but do not set the context too high. If you want room for 128K+ context, UD-IQ4_XS, UD-IQ4_NL, or 3-bit versions are more realistic.

27B vs 35B-A3B

Need	Better Choice
Stable dense-model behavior	`Qwen3.6-27B`
Faster response, agents, and tool use	`Qwen3.6-35B-A3B`
Daily local use on 24GB VRAM	`35B-A3B UD-Q4_K_M` or `27B Q4_K_M`
Testing on 16GB VRAM	Use 2-bit/3-bit for both; avoid long context
Long context first	Use lower-bit quantization and leave more KV cache room
Quality first with 32GB+ VRAM	`27B Q5/Q6` or `35B-A3B Q5/Q6`

If you mainly write code, run agents, or use tools, 35B-A3B is worth trying first. If you care more about dense-model stability and consistency, 27B is more straightforward.

Why Long Context Uses So Much VRAM

The Qwen3.6 model card recommends keeping longer context for complex tasks and even notes that 128K+ context can help reasoning. But for local deployment, long context means a much larger KV cache.

Actual VRAM usage is affected by:

KV cache: longer context means higher usage.
Whether vision input is enabled: Qwen3.6 includes a vision encoder, and multimodal use adds overhead.
Whether --language-model-only is used: in runtimes such as vLLM, skipping vision can free memory for KV cache.
Batch size and concurrency: more concurrency requires more VRAM.
KV cache quantization: q8_0, q4_0, and similar settings can save VRAM, but may affect details.
Runtime differences: llama.cpp, vLLM, SGLang, KTransformers, and LM Studio do not use exactly the same amount of memory.

So do not look only at GGUF file size. If the file is already close to the VRAM limit, the model may load but still OOM when generating long outputs or using long context.

How to Choose

If you just want to try Qwen3.6 locally:

12GB VRAM: try 27B UD-IQ2_M or 35B-A3B UD-IQ2_M, with short context.
16GB VRAM: try 27B Q3_K_M or 35B-A3B UD-IQ3_XXS.
24GB VRAM: prefer 27B Q4_K_M, 35B-A3B UD-IQ4_NL, or 35B-A3B UD-Q4_K_M.
32GB VRAM: consider 27B Q5/Q6 or 35B-A3B Q5/Q6.
48GB and above: try Q8_0, or reserve more room for long context.

Most users do not need BF16. The point of local Qwen3.6 deployment is not to choose the largest file, but to balance VRAM, context length, speed, and output quality.

References

Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models

Fri, 01 May 2026 11:42:34 +0800

Gemma 4 currently has four main sizes for local deployment: E2B, E4B, 26B A4B, and 31B. E2B and E4B target lightweight and edge devices, 26B A4B uses an MoE architecture, and 31B is the larger dense model.

The easiest mistake in local inference is mixing up two numbers:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.

The tables below estimate VRAM requirements based on GGUF file size. The default assumption is local text inference with llama.cpp, LM Studio, Ollama, or similar runtimes, using short to medium context. If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.

Quick Summary

VRAM	Good Fit	Avoid
4GB	Low-bit E2B quantizations	E4B and above
6GB	E2B Q4/Q5, low-bit E4B	26B, 31B
8GB	E2B Q8, E4B Q4/Q5	26B Q4, 31B Q4
12GB	E4B Q8, low-quality 2-bit/3-bit 26B or 31B tests	26B Q4 with long context, 31B Q4
16GB	Low-bit 26B, low-bit 31B	31B Q4 with long context, 26B Q5 and above
24GB	26B Q4/Q5, 31B Q4	31B Q8, BF16
32GB	26B Q6/Q8, 31B Q5/Q6	BF16
48GB	31B Q8 more comfortably, 26B Q8 with longer context	31B BF16
80GB+	26B/31B BF16	Single consumer GPU deployment

If you just want something usable locally, start with E4B Q4_K_M or E2B Q4_K_M. With 24GB VRAM, 26B A4B Q4_K_M and 31B Q4_K_M start to become realistic choices.

Gemma 4 E2B VRAM Table

E2B is the lightest version, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing. It is easy to run, but complex reasoning, coding, and long tasks are limited.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	2.29GB	4GB	6GB	Extreme low-VRAM tests
`UD-Q2_K_XL`	2.40GB	4GB	6GB	Low-VRAM usability
`Q3_K_M`	2.54GB	4GB	6GB	Lightweight chat and summaries
`IQ4_XS`	2.98GB	6GB	8GB	Balance of quality and size
`Q4_K_M`	3.11GB	6GB	8GB	Recommended E2B default
`Q5_K_M`	3.36GB	6GB	8GB	Slightly steadier than Q4
`Q6_K`	4.50GB	8GB	10GB	Higher-quality small model
`Q8_0`	5.05GB	8GB	10GB	Near-original precision for lightweight deployment
`BF16`	9.31GB	12GB	16GB	Debugging, comparison, research

For daily use, E2B Q4_K_M is already enough. With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.

Gemma 4 E4B VRAM Table

E4B is the more practical lightweight model. Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	3.53GB	6GB	8GB	Low-VRAM tests
`UD-Q2_K_XL`	3.74GB	6GB	8GB	Low-VRAM usability
`Q3_K_M`	4.06GB	6GB	10GB	Lightweight local assistant
`IQ4_XS`	4.72GB	8GB	12GB	Balance of quality and speed
`Q4_K_M`	4.98GB	8GB	12GB	Recommended E4B default
`Q5_K_M`	5.48GB	8GB	12GB	Steadier everyday use
`Q6_K`	7.07GB	10GB	16GB	Quality first
`Q8_0`	8.19GB	12GB	16GB	Near-original precision
`BF16`	15.05GB	20GB	24GB	Research, evaluation, precision comparison

If your GPU has 8GB VRAM, E4B Q4_K_M is a realistic starting point. With 12GB or 16GB VRAM, E4B Q8_0 is also worth considering.

Gemma 4 26B A4B VRAM Table

26B A4B is the MoE version. It has a larger total parameter count, but activates only part of the experts during inference. It is better suited to more complex Q&A, coding, tool use, and agent workflows.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	9.97GB	14GB	16GB	Extreme 16GB GPU tests
`UD-Q2_K_XL`	10.55GB	14GB	16GB	Running 26B with low VRAM
`UD-Q3_K_M`	12.53GB	16GB	20GB	Better quality while still VRAM-conscious
`UD-IQ4_XS`	13.42GB	16GB	24GB	Balance of quality and size
`UD-Q4_K_M`	16.87GB	20GB	24GB	Recommended 26B default
`UD-Q5_K_M`	21.15GB	24GB	32GB	Higher-quality quantization
`UD-Q6_K`	23.17GB	28GB	32GB	Quality first
`Q8_0`	26.86GB	32GB	40GB	Near-original precision
`BF16`	50.51GB	64GB	80GB	Not realistic for most single consumer GPUs

24GB VRAM is the comfortable dividing line for 26B A4B. A 16GB GPU can try low-bit versions, but context length, concurrency, and multimodal input should be kept modest.

Gemma 4 31B VRAM Table

31B is the larger dense model. Its strength is stronger overall capability, but its VRAM pressure is more direct than 26B A4B.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	8.53GB	12GB	16GB	Extreme low-VRAM tests with clear quality loss
`UD-IQ2_M`	10.75GB	14GB	18GB	Low-VRAM tests
`UD-Q2_K_XL`	11.77GB	16GB	20GB	16GB GPU experiments
`Q3_K_S`	13.21GB	16GB	24GB	More VRAM-efficient 3-bit
`Q3_K_M`	14.74GB	20GB	24GB	Common 3-bit compromise
`IQ4_XS`	16.37GB	20GB	24GB	Near-Q4 compromise
`Q4_K_M`	18.32GB	24GB	32GB	Recommended 31B default
`Q5_K_M`	21.66GB	28GB	32GB	Higher-quality quantization
`Q6_K`	25.20GB	32GB	40GB	Quality first
`Q8_0`	32.64GB	40GB	48GB	Near-original precision
`BF16`	61.41GB	80GB	96GB	Server or large-VRAM workstation

Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point. Q4_K_M is the balanced choice, while Q5_K_M and above make more sense with 32GB+ VRAM.

Why Actual Usage Is Higher Than File Size

The GGUF file size is only the weight size. Runtime usage also includes:

KV cache: longer context means higher memory use.
Batch size and concurrency: processing more tokens or more users increases VRAM.
Multimodal components: image, audio, or video input often requires mmproj or extra modules.
Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.
KV cache quantization: q8_0, q4_0, and similar modes can save VRAM, but may affect detail.

So the “minimum VRAM” column should be read as the threshold for startup and short-context inference. For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.

How to Choose

If you just want to try Gemma 4 locally:

4GB to 6GB VRAM: choose E2B Q3_K_M or E2B Q4_K_M.
8GB VRAM: prefer E4B Q4_K_M; E2B Q8_0 is also fine.
12GB VRAM: choose E4B Q8_0, or try low-bit 26B/31B variants.
16GB VRAM: try 26B A4B UD-Q3_K_M or 31B Q3_K_S, but do not expect long context to feel comfortable.
24GB VRAM: focus on 26B A4B UD-Q4_K_M and 31B Q4_K_M.
32GB and above: consider Q5_K_M, Q6_K, or longer context.

Most users do not need BF16. Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.

References

How to Use llama-quantize for GGUF Models

Sun, 12 Apr 2026 09:42:36 +0800

llama-quantize is the quantization tool in llama.cpp. It is used to convert high-precision GGUF models into smaller quantized versions.

Its most common use is turning formats such as F32, BF16, or FP16 into versions like Q4_K_M, Q5_K_M, or Q8_0 that are easier to run locally. After quantization, models usually become much smaller and often faster at inference, but some quality loss is expected.

Basic workflow

A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert_hf_to_gguf.py ./models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

After that, you can run the quantized model with llama-cli:

1
2

# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"

Common options

--allow-requantize: allows requantizing an already quantized model, usually not ideal for quality
--leave-output-tensor: keeps the output layer unquantized, increasing size but sometimes helping quality
--pure: disables mixed quantization and uses a more uniform quant type
--imatrix: uses an importance matrix to improve quantization quality
--keep-split: keeps the original shard layout instead of producing one merged file

If you just want a practical starting point, this is often enough:

`1`	`./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M`

How to choose a quant

You can think of quant levels as a tradeoff between size, speed, and quality:

Q8_0: larger, but usually safer for quality
Q6_K / Q5_K_M: common balanced choices
Q4_K_M: a very common default with a good size-quality balance
Q3 / Q2: useful when hardware is very limited, but quality loss is more visible

The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.

Practical takeaway

start with Q4_K_M or Q5_K_M
move up to Q6_K or Q8_0 if quality matters more
move down to Q3 or Q2 if memory is tight
compare versions with the same prompt set

In short, llama-quantize is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.

How to Get GGUF Models from Hugging Face with llama.cpp

Sun, 12 Apr 2026 09:31:38 +0800

llama.cpp can work directly with GGUF models hosted on Hugging Face, so you do not always need to download model files manually first.

If a model repository already provides GGUF files, you can use the -hf argument in the CLI, for example:

`1`	`llama-cli -hf ggml-org/gemma-3-1b-it-GGUF`

By default, this downloads from Hugging Face.
If you use another service that exposes a Hugging Face compatible API, you can switch the download endpoint with the MODEL_ENDPOINT environment variable.

One important detail is that llama.cpp only works directly with the GGUF format.
If your model is in another format, you need to convert it first with the convert_*.py scripts provided in the repository.

Hugging Face also offers several online tools related to llama.cpp, including:

converting models to GGUF
quantizing weights to reduce size
converting LoRA adapters
editing GGUF metadata in the browser
hosting llama.cpp inference endpoints

If you only want the practical takeaway, start with repositories that already provide GGUF, then use llama-cli -hf <user>/<model>. In most cases, that is the simplest path.

Choosing Llama GGUF Quantization on Hugging Face: Practical Advice from Q8 to Q2

Sat, 11 Apr 2026 20:07:29 +0800

When selecting a Llama GGUF model on Hugging Face, you can think of quantization levels like resolution: lower levels need less VRAM/RAM, but quality drops gradually.

Understand 32, 16, and Q levels first

32: closest to original/uncompressed quality, but hardware demand is extreme.
16: still very close to original quality, around half the size of 32.
Q8: common entry point for quantized models (Q8_0 or Q8).
Q6, Q5, Q4, Q3, Q2: lower number means lower resource use and higher quality loss risk.

What `K_M` / `K_S` means

K_M and K_S are mixed quantization variants:

most weights stay at the target quantization level
important parts keep higher precision

So at the same level, Qx_K_M or Qx_K_S is usually slightly better than plain Qx.

Practical picking strategy

If hardware allows, start with Q8.
If memory is tight, step down through Q6 / Q5 / Q4.
Try not to go below Q4; Q4_K_M is a common lower bound.
Below Q4, quality degradation becomes increasingly visible.

Quality order (best to worst)

32
16

– Above this point, quality is effectively the same, but hardware requirements are extreme –

Q8
Q6_K_M
Q6_K_S
Q6
Q5_K_M
Q5_K_S
Q5

– This is the typical sweet spot –

Q4_K_M
Q4_K_S
Q4

– Below this point, quality loss becomes visible –

Q3_K_M
Q3_K_S
Q3
Q2_K_M
Q2_K_S
Q2

If you want one short rule: start with Q8 or Q6_K_M, then move down to Q5 or Q4_K_M only when needed.

How to Download a GGUF Model from Hugging Face and Import It into Ollama

Thu, 09 Apr 2026 11:00:07 +0800

If a model is not available in the official Ollama library, or if you want to use a specific GGUF file from Hugging Face, you can download it manually and then import it into Ollama.

Step 1: Download the GGUF file from Hugging Face

First, find the target model’s GGUF file on Hugging Face. You will usually see multiple quantized versions, such as:

Q4_K_M
Q5_K_M
Q8_0

Which version you choose depends on your VRAM, RAM, and your tradeoff between speed and quality. After downloading, place the .gguf file in a fixed directory so you can reference it from the Modelfile.

Step 2: Write the Modelfile

Create a Modelfile in the same directory as the model file. The most basic version looks like this:

`1`	`FROM ./model.gguf`

If the filename is different, replace it with the actual filename, for example:

`1`	`FROM ./gemma-3-12b-it-q4_k_m.gguf`

If your goal is just to get it running, this single FROM line is usually enough.

Step 3: Import it into Ollama

Then run:

`1`	`ollama create myModelName -f Modelfile`

myModelName is the local model name you want to use inside Ollama
-f Modelfile tells Ollama to create the model from that file

Once the creation succeeds, the GGUF file becomes a local model that you can call directly.

Step 4: Run the model

After creation, run:

`1`	`ollama run myModelName`

From that point on, it works much like a model pulled with ollama pull.

How to inspect an existing model’s Modelfile

If you are not sure how to write a Modelfile, you can inspect the configuration of an existing model directly:

`1`	`ollama show --modelfile llama3.2`

This command prints the Modelfile for llama3.2, which is useful as a reference for:

How FROM should be written
How the template and system prompt are structured
How parameters are declared

When this approach makes sense

This manual Hugging Face import flow is useful when:

The model you want is not available in Ollama’s official library
You want a specific quantized variant
You have already downloaded the GGUF file manually
You want finer control over how the model is packaged

If Ollama already provides an official version, using pull is usually simpler. But when you need a specific quantization or a custom wrapper, GGUF + Modelfile gives you more flexibility.

Common notes

The path after FROM must match the actual location of the .gguf file.
If the filename contains spaces or special characters, it is better to rename it first.
Different GGUF quantization levels can greatly affect memory use and speed, so successful import does not guarantee smooth runtime performance.
If the model is a chat model, you may still need to adjust the prompt template later for better results.

Conclusion

Downloading a GGUF file from Hugging Face and importing it into Ollama is not complicated. Prepare the model file, write a minimal Modelfile, then run ollama create, and you can bring a third-party GGUF model into your Ollama workflow.

GGUF on KnightLi Blog

Qwen3.6-35B-A3B jailbreak local deployment: uncensored GGUF, llama.cpp, and safety boundaries

What this model is

Why a 35B model can run locally

How to read the quantization choices

llama.cpp deployment approach

How multimodal support works

OpenAI-compatible API

Why Hermes and OpenClaw matter

Safety boundaries for uncensored models

Who should try it

Conclusion

Running Qwen3.6-35B Locally on an RTX 3070 8GB: llama.cpp Deployment Notes and Tuning Parameters

Test environment

Why 8GB VRAM can still run a 35B model

Preparing llama.cpp

Download the model and multimodal projection file

RTX 3070 8GB startup parameters

Understanding the key parameters

Common issues

Who should try this

Summary

llama.cpp b9196 Update: Windows Prebuilt Binaries Support CUDA 13.1, Vulkan, HIP, and SYCL

What llama.cpp Is Good For

How to Choose a Windows Prebuilt Build

Start a Regular GGUF Model

Start a Multimodal Vision Model

Use a bat Script to Manage Multiple Models

Three Things to Check When Choosing Models

Common Issues

Summary

Local LLM Models Recommended for an RTX 3060 GPU

Start With the VRAM Limit

Recommendation 1: Qwen3 8B

Recommendation 2: Llama 3.1 8B Instruct

Recommendation 3: Gemma 3 12B

Recommendation 4: DeepSeek R1 Distill Qwen 8B

Recommendation 5: Phi / MiniCPM / Smaller Models

Which Quantization to Use

Which Tool to Use

Do Not Set Context Too High

Choose by Use Case

Reasonable Expectations

Summary

References

Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models

Quick Summary

Official Weight Sizes

Qwen3.6-27B VRAM Table

Qwen3.6-35B-A3B VRAM Table

27B vs 35B-A3B

Why Long Context Uses So Much VRAM

How to Choose

References

Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models

Quick Summary

Gemma 4 E2B VRAM Table

Gemma 4 E4B VRAM Table

Gemma 4 26B A4B VRAM Table

Gemma 4 31B VRAM Table

Why Actual Usage Is Higher Than File Size

How to Choose

References

How to Use llama-quantize for GGUF Models

Basic workflow

Common options

How to choose a quant

Practical takeaway

How to Get GGUF Models from Hugging Face with llama.cpp

Choosing Llama GGUF Quantization on Hugging Face: Practical Advice from Q8 to Q2

Understand 32, 16, and Q levels first

What K_M / K_S means

Practical picking strategy

Quality order (best to worst)

How to Download a GGUF Model from Hugging Face and Import It into Ollama

Step 1: Download the GGUF file from Hugging Face

Step 2: Write the Modelfile

Step 3: Import it into Ollama

Step 4: Run the model

How to inspect an existing model’s Modelfile

What `K_M` / `K_S` means