Quantization on KnightLi Blog

What Is AI-Trader? A Platform Where AI Agents Publish Trading Signals and Run Paper Trading

Tue, 19 May 2026 10:56:50 +0800

HKUDS/AI-Trader is a trading platform project for AI Agents. The README positions it as an “Agent-Native Trading Platform”, aiming to let AI Agents connect to the platform, publish trading signals, join discussions, copy trades, and use market data.

Project URL: https://github.com/HKUDS/AI-Trader

Platform URL: https://ai4trade.ai

At the time of writing, the GitHub API showed about 18k stars and Python as the main language. The repository API did not return a clear license value, so users should confirm licensing terms before formal use.

This article is only an introduction to the open source project and is not investment advice. Automated trading involves real capital risk. No strategy, signal, or agent output can guarantee returns.

Positioning

The core idea of AI-Trader is simple: humans have trading platforms, and AI Agents may also need their own trading platform.

According to the README, any AI Agent can read the platform Skill file and register quickly:

`Read https://ai4trade.ai/skill/ai4trade and register on the platform. Compatibility alias: https://ai4trade.ai/SKILL.md`

After connection, agents can publish trading signals, join community discussions, copy strategies from high-performing traders, sync signals to multiple brokers, and accumulate points through prediction performance.

Main Features

The README lists capabilities including:

Instant Agent Integration: quick access for AI Agents.
Collective Intelligence Trading: multiple agents discuss and collaborate on trading ideas.
Cross-Platform Signal Sync: sync trading signals across platforms.
One-Click Copy Trading: follow selected traders or agents.
Universal Market Access: stocks, crypto, FX, options, futures, and more.
Three Signal Types: strategy, action, and discussion signals.
Reward System: earn points through signals and attention.

From a product perspective, it is not just a local quantitative backtesting framework. It combines agents, signals, discussion, copy trading, and paper trading in one platform layer.

Two Types of Users

The README divides users into two groups.

The first group is Agent Traders. AI Agents read the Skill document, connect to the platform, install required components, and publish signals.

The second group is Human Traders. Regular users can visit the platform, create accounts, browse signals, or follow better-performing traders.

Together, this forms a structure where AI Agents produce signals, and humans or other agents consume those signals.

Architecture

The README shows the project structure as:

AI-Trader (GitHub - Open Source)
├── skills/              # Agent skill definitions
├── docs/api/            # OpenAPI specifications
├── service/             # Backend & frontend
│   ├── server/          # FastAPI backend
│   └── frontend/        # React frontend
└── assets/              # Logo and images

The repository puts agent skills, API documentation, backend, and frontend in one place. The backend uses FastAPI and the frontend uses React. The README update notes also mention that the web service and backend workers have been separated so pricing, historical performance, settlement, and market intelligence jobs can run in the background without affecting pages and health checks.

Why It Is Worth Watching

AI-Trader is worth watching not because “AI can automatically make money”, but because it makes the interface between agents and financial scenarios more explicit.

There are several interesting points:

First, it uses a Skill document as the agent access point. This is close to how Codex, Claude Code, OpenClaw, and other agent tools work.

Second, it places trading signals, discussion, copy trading, and a reward system at the platform layer instead of only providing a local script.

Third, it provides OpenAPI documentation, making the platform interfaces easier for developers to understand.

Fourth, it supports paper trading. For research on agent decision-making, a simulated environment is much safer than giving agents direct access to real money.

Risks and Boundaries

Automated trading is a high-risk scenario.

First, signals generated by agents are not investment advice. Models can hallucinate, overfit, misread news, or fail to understand extreme market conditions.

Second, copy trading has contagion risk. If a wrong signal is widely followed, losses may concentrate.

Third, real capital access must be strictly isolated. Do not give agents unlimited order permissions.

Fourth, licensing and compliance need to be confirmed before commercial or production use, especially when brokers, financial data, and user accounts are involved.
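
The third point, limiting what an agent is allowed to order, can be sketched as a small guard that sits between the agent and any order endpoint. This is only an illustration: `OrderGuard`, its fields, and the limits are hypothetical names and values, not part of the AI-Trader API or of any broker SDK.

```python
from dataclasses import dataclass


@dataclass
class OrderGuard:
    """Hypothetical pre-trade check; not an AI-Trader or broker API."""
    max_order_notional: float   # cap for a single order
    max_daily_notional: float   # cap for all accepted orders in one day
    paper_only: bool = True     # reject anything that targets a live account
    spent_today: float = 0.0

    def check(self, notional: float, live: bool) -> bool:
        """Return True and record the order only if every limit holds."""
        if live and self.paper_only:
            return False
        if notional <= 0 or notional > self.max_order_notional:
            return False
        if self.spent_today + notional > self.max_daily_notional:
            return False
        self.spent_today += notional
        return True


guard = OrderGuard(max_order_notional=1000.0, max_daily_notional=2500.0)
results = [
    guard.check(800.0, live=False),   # within both limits
    guard.check(1200.0, live=False),  # exceeds the per-order cap
    guard.check(800.0, live=True),    # live order while paper_only is set
    guard.check(900.0, live=False),   # still within the daily cap
    guard.check(900.0, live=False),   # would exceed the daily cap
]
```

The design choice is that the guard fails closed: any order that violates a limit is rejected and does not count against the daily budget.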

Who It Is For

AI-Trader is suitable for researchers studying agent decision-making, developers exploring financial agent interfaces, and teams interested in paper trading or signal collaboration. It is not suitable for users looking for guaranteed profit tools.

Summary

AI-Trader is a signal and paper-trading platform designed around AI Agents. The useful way to read it is not “AI helps you earn money”, but how agents should connect to financial workflows, publish signals, and operate inside controlled risk boundaries.

Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models

Fri, 01 May 2026 12:02:00 +0800

The Qwen3.6 open-weight models that are most relevant for local deployment are mainly:

Qwen3.6-27B: a 27B dense model.
Qwen3.6-35B-A3B: a 35B total / 3B active MoE model.

There are also online product or API model names such as Qwen3.6-Plus and Qwen3.6-Max. If a model does not have public full weights and stable quantized files, it is not suitable for a local VRAM table. This article only covers versions that can be deployed around Hugging Face weights and GGUF quantized files.

As with the Gemma 4 table in /05/10, two concepts need to be separated first:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by weights, KV cache, context length, runtime backend, multimodal modules, and batch size.

Qwen3.6 has a very long default context. The official model card states native support for 262,144 tokens and extension to 1,010,000 tokens. So the “minimum VRAM” column below only applies to short or medium context. If you really want 128K, 256K, or longer context, reserve much more room for KV cache.

Quick Summary

VRAM	Good Fit	Avoid
8GB	Extreme 2-bit tests for 27B / 35B-A3B, with clear quality risk	Q4 and above
12GB	27B Q2/Q3, 35B-A3B Q2/Q3 with short context	27B Q4 with long context
16GB	27B Q3/Q4, 35B-A3B Q3/IQ4_XS	35B-A3B Q4 with long context
24GB	27B Q4/Q5/Q6, 35B-A3B Q4	35B-A3B Q8, BF16
32GB	27B Q8, 35B-A3B Q5/Q6	BF16
48GB	35B-A3B Q8, 27B with longer context more comfortably	35B-A3B BF16
80GB+	27B / 35B-A3B BF16	No need to chase BF16 for ordinary local chat

If you have a 24GB GPU, focus on:

Qwen3.6-27B Q4_K_M
Qwen3.6-27B Q5_K_M
Qwen3.6-35B-A3B UD-Q4_K_M

If you only have 16GB VRAM, start with low-bit variants and do not enable very long context right away.

Official Weight Sizes

The following BF16 weight sizes come from model.safetensors.index.json in the official Hugging Face repositories. They are useful as a reference for the original model scale.

Model	Architecture	Official BF16 Weight Size	Official Context
`Qwen3.6-27B`	27B dense	55.56GB	Native 262K, extendable to 1,010K
`Qwen3.6-35B-A3B`	35B total / 3B active MoE	71.90GB	Native 262K, extendable to 1,010K

Although 35B-A3B activates about 3B parameters per step, it still needs to load the full MoE weights. So it should not be estimated like a 3B small model.
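
A back-of-the-envelope check makes the point concrete. The sketch below assumes weight size is simply parameter count times bits per weight, and ignores embeddings, the vision encoder, and file metadata. `weight_size_gb` is a hypothetical helper name, not a library function.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a parameter count."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# BF16 stores 16 bits per weight.
# All experts have to be loaded, so the total parameter count is what matters.
full_moe_gb = weight_size_gb(35, 16)

# What a naive "it is only 3B active" estimate would suggest instead.
naive_active_gb = weight_size_gb(3, 16)

print(f"full weights: {full_moe_gb:.1f} GB, naive active-only: {naive_active_gb:.1f} GB")
```

The estimate from the total parameter count lands close to the official size in the table, while the estimate from active parameters is off by more than a factor of ten.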

Qwen3.6-27B VRAM Table

Qwen3.6-27B is a dense model. Its advantage is stable behavior, while its inference cost is closer to a traditional 27B model. For local deployment, it is more compute-heavy than 35B-A3B, but its VRAM requirements are easier to estimate.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	9.39GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	10.85GB	12GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	11.85GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	11.99GB	14GB	18GB	VRAM-saving 3-bit
`Q3_K_S`	12.36GB	16GB	20GB	3-bit entry point
`Q3_K_M`	13.59GB	16GB	20GB	Common 3-bit compromise
`IQ4_XS`	15.44GB	20GB	24GB	Near-Q4, more VRAM efficient
`IQ4_NL`	16.07GB	20GB	24GB	Quality/size balance
`Q4_K_M`	16.82GB	20GB	24GB	Recommended 27B default
`Q5_K_M`	19.51GB	24GB	32GB	Higher-quality quantization
`Q6_K`	22.52GB	28GB	32GB	Quality first
`Q8_0`	28.60GB	32GB	40GB	Near-original precision
`BF16`	53.80GB	64GB	80GB	Research, evaluation, precision comparison

For ordinary local coding and chat, Q4_K_M is the easiest starting point to recommend. A 24GB GPU can run Q4_K_M fairly comfortably, but for long context, choose a smaller quantization or shorten the context.

Qwen3.6-35B-A3B VRAM Table

Qwen3.6-35B-A3B is an MoE model with 35B total parameters and about 3B active parameters per step. Its advantage is a strong balance between speed and capability, especially for local agents, tool use, and coding workflows.

Note, however, that the 3B active-parameter figure mainly affects compute. It does not mean VRAM usage is comparable to a 3B model. Full operation still needs all of the expert weights.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	10.76GB	12GB	16GB	Extreme low-VRAM tests
`UD-IQ2_M`	11.52GB	14GB	16GB	Low-VRAM usability
`UD-Q2_K_XL`	12.29GB	14GB	18GB	Low-bit compromise
`UD-IQ3_XXS`	13.21GB	16GB	20GB	VRAM-saving 3-bit
`UD-Q3_K_S`	15.36GB	18GB	24GB	3-bit entry point
`UD-Q3_K_M`	16.60GB	20GB	24GB	Common 3-bit compromise
`UD-IQ4_XS`	17.73GB	20GB	24GB	Quality/size balance
`UD-IQ4_NL`	18.04GB	20GB	24GB	Near-Q4 recommended option
`UD-Q4_K_M`	22.13GB	24GB	32GB	Recommended 35B-A3B default
`UD-Q5_K_M`	26.46GB	32GB	40GB	Higher-quality quantization
`UD-Q6_K`	29.31GB	32GB	48GB	Quality first
`Q8_0`	36.90GB	48GB	64GB	Near-original precision
`BF16`	69.37GB	80GB	96GB	Research, evaluation, precision comparison

With 24GB VRAM, UD-Q4_K_M is a key option, but do not set the context too high. If you want room for 128K+ context, UD-IQ4_XS, UD-IQ4_NL, or 3-bit versions are more realistic.
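
To see why the smaller 4-bit variants leave more room for context, subtract the file size from the available VRAM. This is a rough sketch: the 1GB runtime overhead is an assumed placeholder, and `kv_headroom_gb` is a hypothetical helper, not a measured value or a library call.

```python
def kv_headroom_gb(vram_gb: float, gguf_file_gb: float,
                   runtime_overhead_gb: float = 1.0) -> float:
    """VRAM left for KV cache after weights and an assumed runtime overhead."""
    return vram_gb - gguf_file_gb - runtime_overhead_gb


# File sizes taken from the 35B-A3B table above.
candidates = {"UD-Q4_K_M": 22.13, "UD-IQ4_NL": 18.04, "UD-IQ4_XS": 17.73}

headroom = {
    name: round(kv_headroom_gb(24.0, size), 2)
    for name, size in candidates.items()
}

for name, free_gb in headroom.items():
    print(f"{name}: about {free_gb} GB left for KV cache on a 24GB GPU")
```

Under these assumptions UD-Q4_K_M leaves under 1GB for KV cache, while the two IQ4 variants leave roughly 5GB.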

27B vs 35B-A3B

Need	Better Choice
Stable dense-model behavior	`Qwen3.6-27B`
Faster response, agents, and tool use	`Qwen3.6-35B-A3B`
Daily local use on 24GB VRAM	`35B-A3B UD-Q4_K_M` or `27B Q4_K_M`
Testing on 16GB VRAM	Use 2-bit/3-bit for both; avoid long context
Long context first	Use lower-bit quantization and leave more KV cache room
Quality first with 32GB+ VRAM	`27B Q5/Q6` or `35B-A3B Q5/Q6`

If you mainly write code, run agents, or use tools, 35B-A3B is worth trying first. If you care more about dense-model stability and consistency, 27B is more straightforward.

Why Long Context Uses So Much VRAM

The Qwen3.6 model card recommends keeping longer context for complex tasks and even notes that 128K+ context can help reasoning. But for local deployment, long context means a much larger KV cache.

Actual VRAM usage is affected by:

KV cache: longer context means higher usage.
Whether vision input is enabled: Qwen3.6 includes a vision encoder, and multimodal use adds overhead.
Whether --language-model-only is used: in runtimes such as vLLM, skipping vision can free memory for KV cache.
Batch size and concurrency: more concurrency requires more VRAM.
KV cache quantization: q8_0, q4_0, and similar settings can save VRAM, but may affect details.
Runtime differences: llama.cpp, vLLM, SGLang, KTransformers, and LM Studio do not use exactly the same amount of memory.

So do not look only at GGUF file size. If the file is already close to the VRAM limit, the model may load but still OOM when generating long outputs or using long context.
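
The growth of the KV cache can be approximated with the standard formula for attention caches: two tensors (keys and values) per layer, each sized by KV heads, head dimension, and context length. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Qwen3.6 configuration, and models that mix in sliding-window or linear-attention layers will cache less than this estimate.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV cache size in GB for one sequence.

    bytes_per_elem is 2.0 for an f16 cache; roughly 1.0 for an 8-bit cache
    (ignoring the small per-block scale overhead of formats such as q8_0).
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * context_tokens  # K and V
    return elems * bytes_per_elem / 1e9


# Hypothetical architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

kv_32k = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, 32_768)
kv_128k = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, 131_072)
kv_128k_8bit = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, 131_072, bytes_per_elem=1.0)

print(f"32K: {kv_32k:.2f} GB, 128K: {kv_128k:.2f} GB, 128K 8-bit: {kv_128k_8bit:.2f} GB")
```

The cache scales linearly with context length, so going from 32K to 128K multiplies it by four, and an 8-bit cache roughly halves it.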

How to Choose

If you just want to try Qwen3.6 locally:

12GB VRAM: try 27B UD-IQ2_M or 35B-A3B UD-IQ2_M, with short context.
16GB VRAM: try 27B Q3_K_M or 35B-A3B UD-IQ3_XXS.
24GB VRAM: prefer 27B Q4_K_M, 35B-A3B UD-IQ4_NL, or 35B-A3B UD-Q4_K_M.
32GB VRAM: consider 27B Q5/Q6 or 35B-A3B Q5/Q6.
48GB and above: try Q8_0, or reserve more room for long context.

Most users do not need BF16. The point of local Qwen3.6 deployment is not to choose the largest file, but to balance VRAM, context length, speed, and output quality.
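
The selection logic above can be sketched as a lookup over the 27B table: take every quantization whose VRAM threshold fits the card, then keep the largest file. The data is copied from the Qwen3.6-27B table in this article; `pick_quant` is a hypothetical helper, and the thresholds remain short-to-medium-context estimates.

```python
from typing import Optional

# (quantization, GGUF file GB, minimum VRAM GB, safer VRAM GB)
QWEN36_27B = [
    ("UD-IQ2_XXS", 9.39, 12, 16),
    ("UD-IQ2_M", 10.85, 12, 16),
    ("UD-Q2_K_XL", 11.85, 14, 18),
    ("UD-IQ3_XXS", 11.99, 14, 18),
    ("Q3_K_S", 12.36, 16, 20),
    ("Q3_K_M", 13.59, 16, 20),
    ("IQ4_XS", 15.44, 20, 24),
    ("IQ4_NL", 16.07, 20, 24),
    ("Q4_K_M", 16.82, 20, 24),
    ("Q5_K_M", 19.51, 24, 32),
    ("Q6_K", 22.52, 28, 32),
    ("Q8_0", 28.60, 32, 40),
    ("BF16", 53.80, 64, 80),
]


def pick_quant(vram_gb: float, use_safer: bool = True) -> Optional[str]:
    """Largest quantization whose threshold column fits in vram_gb, or None."""
    column = 3 if use_safer else 2
    fitting = [row for row in QWEN36_27B if row[column] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda row: row[1])[0]


print(pick_quant(24))                   # safer column
print(pick_quant(16, use_safer=False))  # minimum column
print(pick_quant(8))                    # nothing fits
```

Using the "safer" column by default is a deliberate choice: it leaves room for KV cache instead of picking a file that only just loads.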

Running DeepSeek V4 Locally: VRAM Estimates for Pro, Flash, and Base Versions

Fri, 01 May 2026 11:55:25 +0800

DeepSeek V4 and Gemma 4 are not in the same class for local deployment. With Gemma 4, it still makes sense to discuss how to run 26B or 31B models on 24GB or 32GB GPUs. DeepSeek V4 is a huge MoE model, and full local deployment quickly moves into multi-GPU workstation or server territory.

The official DeepSeek V4 Preview release mainly includes two inference models:

DeepSeek-V4-Pro: 1.6T total / 49B active params
DeepSeek-V4-Flash: 284B total / 13B active params

The official Hugging Face collection also includes two Base models:

DeepSeek-V4-Pro-Base
DeepSeek-V4-Flash-Base

This article only discusses rough VRAM requirements when the full model weights are loaded. For MoE models, active params mainly affect per-token compute. That does not mean only those parameters need to be loaded. Without expert-on-demand loading, CPU/NVMe offload, distributed inference, or specialized runtime optimizations, VRAM should still be estimated from the full weight size.

Quick Summary

VRAM Scale	What Is Realistic	Do Not Expect
24GB	Cannot fully run DeepSeek V4; use smaller distilled models or API	Full V4-Flash / V4-Pro local loading
48GB	Still not suitable for full loading; good for small models or remote API clients	Stable V4-Flash Q4
80GB	Theoretically try V4-Flash Q2/Q3 or heavy offload	V4-Pro
128GB	V4-Flash Q4 becomes more realistic; Q5/Q6 still tight	V4-Pro Q4
192GB	V4-Flash FP8/Q6 is more comfortable; Pro Q2 enters experimental range	V4-Pro Q4
256GB	V4-Flash FP8 is fairly comfortable; Pro Q2/Q3 can be tested	V4-Pro Q5 and above
512GB	V4-Pro Q4 starts to become discussable	V4-Pro FP8
1TB+	V4-Pro FP8 and low-bit Pro-Base are more realistic	Low-cost single-machine deployment
2TB+	Pro-Base FP8 class	Ordinary workstation deployment

If your goal is to run a model on a personal computer, DeepSeek V4 is not the right target. More realistic options are:

Use the official DeepSeek API or compatible services.
Wait for stable community GGUF/EXL2/MLX quantizations and inference support.
Use smaller DeepSeek distilled models.
Use local models in the 7B to 70B range from Qwen, Gemma, Llama, and similar families.

Official Weight Sizes

The following figures come from model.safetensors.index.json in the official Hugging Face repositories. They reflect current public weight file sizes, not full runtime VRAM use under long context.

Model	Parameter Scale	Official Weight Size	Notes
`DeepSeek-V4-Flash`	284B total / 13B active	159.61GB	Inference model, smallest in this group
`DeepSeek-V4-Pro`	1.6T total / 49B active	864.70GB	Inference model, stronger but enormous
`DeepSeek-V4-Flash-Base`	284B total	294.67GB	Base model, closer to full FP8 weight size
`DeepSeek-V4-Pro-Base`	1.6T total	1606.03GB	Base model, about 1.6TB

Even the smallest V4-Flash is already close to 160GB of official weights. That is why it should not be treated like a 13B model just because it has 13B active params.

DeepSeek V4 Flash VRAM Estimate

V4-Flash is the most approachable DeepSeek V4 variant for local experiments. But that only means “more approachable than Pro”; it is still not a consumer single-GPU model.

The table below uses the official 159.61GB weight size as the baseline. Q4/Q3/Q2 rows are bit-width estimates and do not imply that stable official GGUF versions currently exist.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	159.61GB	192GB	256GB	Multi-GPU servers, inference service
`Q6`	120GB	160GB	192GB	Quality-first quantization tests
`Q5`	100GB	128GB	160GB	Quality/size balance
`Q4`	80GB	96GB	128GB	More realistic starting point for Flash
`Q3`	60GB	80GB	96GB	Large-VRAM single GPU or multi-GPU tests
`Q2`	40GB	48GB	64GB	Extreme low-bit experiments with clear quality risk

If mature V4-Flash Q4 builds appear later, it still probably will not be a 24GB GPU model. A more realistic starting point is 96GB to 128GB total VRAM, or CPU/offload setups that trade speed for capacity.
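
The Q-rows in the table can be reproduced with plain bit-width scaling from the 159.61GB baseline, treating that baseline as 8 bits per weight. This is a sketch of the estimate, not a prediction of real file sizes: actual GGUF formats typically store somewhat more than their nominal bit width, and `scaled_size_gb` is a hypothetical helper name.

```python
def scaled_size_gb(baseline_gb: float, baseline_bits: float,
                   target_bits: float) -> float:
    """Scale a weight size linearly with the nominal bits per weight."""
    return baseline_gb * target_bits / baseline_bits


FLASH_FP8_GB = 159.61  # official V4-Flash weight size from the table above

flash_estimates = {
    f"Q{bits}": round(scaled_size_gb(FLASH_FP8_GB, 8, bits))
    for bits in (6, 5, 4, 3, 2)
}

for name, size_gb in flash_estimates.items():
    print(f"{name}: about {size_gb} GB")
```

The same function with the Pro or Base baselines gives the estimated weight sizes in the later tables to within rounding.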

DeepSeek V4 Pro VRAM Estimate

V4-Pro is the flagship inference model, with official weights around 864.70GB. Even at 4-bit quantization, the full weights remain in the hundreds of GB.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	864.70GB	1TB	1.2TB+	Multi-node or multi-GPU inference service
`Q6`	648GB	768GB	1TB	High-quality quantized service
`Q5`	540GB	640GB	768GB	Quality/cost balance
`Q4`	432GB	512GB	640GB	Lowest practical quality line for Pro
`Q3`	324GB	384GB	512GB	Low-bit experiments
`Q2`	216GB	256GB	320GB	Extreme experiments with high quality and stability risk

For individual users, V4-Pro is better consumed through an API. If the goal is full local deployment, treat it as a multi-GPU server model, not a 4090, 5090, or RTX PRO single-GPU model.

DeepSeek V4 Flash-Base VRAM Estimate

Base models are usually for research, fine-tuning, or continued training, not ordinary chat deployment. V4-Flash-Base has official weights of about 294.67GB.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	294.67GB	384GB	512GB	Research, preprocessing, evaluation
`Q6`	221GB	256GB	320GB	High-quality quantization research
`Q5`	184GB	224GB	256GB	Quality/size balance
`Q4`	147GB	192GB	224GB	Lower-cost Base experiments
`Q3`	111GB	128GB	160GB	Low-bit experiments
`Q2`	74GB	96GB	128GB	Extreme experiments

If you only want to use DeepSeek V4 capabilities, do not start with the Base model. Base models cost more to deploy and tune; most applications should use the inference model or API.

DeepSeek V4 Pro-Base VRAM Estimate

V4-Pro-Base is the heaviest variant, with official weights around 1606.03GB. That is already a 1.6TB-class model file.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	1606.03GB	2TB	2.4TB+	Large-scale research clusters
`Q6`	1205GB	1.5TB	2TB	High-quality quantization research
`Q5`	1004GB	1.2TB	1.5TB	Research and evaluation
`Q4`	803GB	1TB	1.2TB	Low-bit research
`Q3`	602GB	768GB	1TB	Extreme low-bit research
`Q2`	402GB	512GB	640GB	Extreme experiments

This kind of model should not be discussed in the framework of “can a home GPU run it?” Even Q4 is already beyond the comfortable range of most single-machine workstations.

Why Active Params Are Not Enough

DeepSeek V4 is an MoE model. MoE means each token activates only part of the experts, so compute is much lower than the total parameter count. But this does not mean VRAM only needs to hold the active parameters.

Full local inference also depends on:

Whether all expert weights must stay resident on GPU.
Whether on-demand expert loading is supported.
CPU memory to GPU memory transfer costs.
NVMe offload latency.
KV cache growth under long context.
Extra runtime overhead under 1M context.
Multi-node and multi-GPU communication cost.

So V4-Pro with 49B active should not be deployed like a 49B model. V4-Flash with 13B active should not be treated like a 13B small model either.
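
The transfer-cost item in the list can be made concrete with a worst-case bound. The sketch assumes that expert weights are not resident on the GPU and that every active parameter has to cross the CPU-to-GPU link for each token, with no caching or reuse. The 32 GB/s link speed and the 8 bits per weight are assumed round numbers, and real runtimes often compute offloaded experts on the CPU instead of streaming them.

```python
def streamed_tokens_per_second(active_params_billion: float,
                               bits_per_weight: float,
                               link_gb_per_s: float) -> float:
    """Upper bound on tokens/s if all active weights are streamed per token."""
    gb_per_token = active_params_billion * bits_per_weight / 8
    return link_gb_per_s / gb_per_token


LINK_GB_PER_S = 32.0  # assumed host-to-GPU bandwidth, for illustration only

flash_tps = streamed_tokens_per_second(13, 8, LINK_GB_PER_S)  # V4-Flash, 13B active
pro_tps = streamed_tokens_per_second(49, 8, LINK_GB_PER_S)    # V4-Pro, 49B active

print(f"V4-Flash: at most {flash_tps:.2f} tok/s, V4-Pro: at most {pro_tps:.2f} tok/s")
```

Even with a small active parameter count, streaming weights on demand caps throughput at a few tokens per second under these assumptions, which is why weight residency matters more than the active figure suggests.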

How to Choose

If you are an ordinary individual user:

Do not try to fully self-host DeepSeek V4.
Use the official API when you need DeepSeek V4 capabilities.
For private local deployment, first check whether you have mature inference infrastructure or internal multi-GPU servers.
With only 24GB to 48GB VRAM, 7B, 14B, 32B, or 70B quantized models are more practical.

If you have 128GB to 256GB total VRAM:

Watch for stable community implementations of V4-Flash Q4/Q5.
Do not treat V4-Pro as your main local model.

If you have 512GB+ total VRAM:

V4-Pro Q4 starts to become an engineering validation target.
You still need to care about inference framework support, expert scheduling, KV cache, throughput, and concurrency.

The key question for DeepSeek V4 local deployment is not “which quantized file should I download?” It is “do I have the system-level inference capacity for this model?” It is closer to a server model than a desktop model.

Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models

Fri, 01 May 2026 11:42:34 +0800

Gemma 4 currently has four main sizes for local deployment: E2B, E4B, 26B A4B, and 31B. E2B and E4B target lightweight and edge devices, 26B A4B uses an MoE architecture, and 31B is the larger dense model.

The easiest mistake in local inference is mixing up two numbers:

GGUF file size: how large the model weight file is.
Actual VRAM usage: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.

The tables below estimate VRAM requirements based on GGUF file size. The default assumption is local text inference with llama.cpp, LM Studio, Ollama, or similar runtimes, using short to medium context. If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.

Quick Summary

VRAM	Good Fit	Avoid
4GB	Low-bit E2B quantizations	E4B and above
6GB	E2B Q4/Q5, low-bit E4B	26B, 31B
8GB	E2B Q8, E4B Q4/Q5	26B Q4, 31B Q4
12GB	E4B Q8, low-quality 2-bit/3-bit 26B or 31B tests	26B Q4 with long context, 31B Q4
16GB	Low-bit 26B, low-bit 31B	31B Q4 with long context, 26B Q5 and above
24GB	26B Q4/Q5, 31B Q4	31B Q8, BF16
32GB	26B Q6/Q8, 31B Q5/Q6	BF16
48GB	31B Q8 more comfortably, 26B Q8 with longer context	31B BF16
80GB+	26B/31B BF16	Single consumer GPU deployment

If you just want something usable locally, start with E4B Q4_K_M or E2B Q4_K_M. With 24GB VRAM, 26B A4B Q4_K_M and 31B Q4_K_M start to become realistic choices.

Gemma 4 E2B VRAM Table

E2B is the lightest version, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing. It is easy to run, but complex reasoning, coding, and long tasks are limited.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	2.29GB	4GB	6GB	Extreme low-VRAM tests
`UD-Q2_K_XL`	2.40GB	4GB	6GB	Low-VRAM usability
`Q3_K_M`	2.54GB	4GB	6GB	Lightweight chat and summaries
`IQ4_XS`	2.98GB	6GB	8GB	Balance of quality and size
`Q4_K_M`	3.11GB	6GB	8GB	Recommended E2B default
`Q5_K_M`	3.36GB	6GB	8GB	Slightly steadier than Q4
`Q6_K`	4.50GB	8GB	10GB	Higher-quality small model
`Q8_0`	5.05GB	8GB	10GB	Near-original precision for lightweight deployment
`BF16`	9.31GB	12GB	16GB	Debugging, comparison, research

For daily use, E2B Q4_K_M is already enough. With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.

Gemma 4 E4B VRAM Table

E4B is the more practical lightweight model. Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	3.53GB	6GB	8GB	Low-VRAM tests
`UD-Q2_K_XL`	3.74GB	6GB	8GB	Low-VRAM usability
`Q3_K_M`	4.06GB	6GB	10GB	Lightweight local assistant
`IQ4_XS`	4.72GB	8GB	12GB	Balance of quality and speed
`Q4_K_M`	4.98GB	8GB	12GB	Recommended E4B default
`Q5_K_M`	5.48GB	8GB	12GB	Steadier everyday use
`Q6_K`	7.07GB	10GB	16GB	Quality first
`Q8_0`	8.19GB	12GB	16GB	Near-original precision
`BF16`	15.05GB	20GB	24GB	Research, evaluation, precision comparison

If your GPU has 8GB VRAM, E4B Q4_K_M is a realistic starting point. With 12GB or 16GB VRAM, E4B Q8_0 is also worth considering.

Gemma 4 26B A4B VRAM Table

26B A4B is the MoE version. It has a larger total parameter count, but activates only part of the experts during inference. It is better suited to more complex Q&A, coding, tool use, and agent workflows.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_M`	9.97GB	14GB	16GB	Extreme 16GB GPU tests
`UD-Q2_K_XL`	10.55GB	14GB	16GB	Running 26B with low VRAM
`UD-Q3_K_M`	12.53GB	16GB	20GB	Better quality while still VRAM-conscious
`UD-IQ4_XS`	13.42GB	16GB	24GB	Balance of quality and size
`UD-Q4_K_M`	16.87GB	20GB	24GB	Recommended 26B default
`UD-Q5_K_M`	21.15GB	24GB	32GB	Higher-quality quantization
`UD-Q6_K`	23.17GB	28GB	32GB	Quality first
`Q8_0`	26.86GB	32GB	40GB	Near-original precision
`BF16`	50.51GB	64GB	80GB	Not realistic for most single consumer GPUs

24GB VRAM is the comfortable dividing line for 26B A4B. A 16GB GPU can try low-bit versions, but context length, concurrency, and multimodal input should be kept modest.

Gemma 4 31B VRAM Table

31B is the larger dense model. Its strength is stronger overall capability, but its VRAM pressure is more direct than 26B A4B.

Quantization	GGUF File Size	Minimum VRAM	Safer VRAM	Best For
`UD-IQ2_XXS`	8.53GB	12GB	16GB	Extreme low-VRAM tests with clear quality loss
`UD-IQ2_M`	10.75GB	14GB	18GB	Low-VRAM tests
`UD-Q2_K_XL`	11.77GB	16GB	20GB	16GB GPU experiments
`Q3_K_S`	13.21GB	16GB	24GB	More VRAM-efficient 3-bit
`Q3_K_M`	14.74GB	20GB	24GB	Common 3-bit compromise
`IQ4_XS`	16.37GB	20GB	24GB	Near-Q4 compromise
`Q4_K_M`	18.32GB	24GB	32GB	Recommended 31B default
`Q5_K_M`	21.66GB	28GB	32GB	Higher-quality quantization
`Q6_K`	25.20GB	32GB	40GB	Quality first
`Q8_0`	32.64GB	40GB	48GB	Near-original precision
`BF16`	61.41GB	80GB	96GB	Server or large-VRAM workstation

Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point. Q4_K_M is the balanced choice, while Q5_K_M and above make more sense with 32GB+ VRAM.

Why Actual Usage Is Higher Than File Size

The GGUF file size is only the weight size. Runtime usage also includes:

KV cache: longer context means higher memory use.
Batch size and concurrency: processing more tokens or more users increases VRAM.
Multimodal components: image, audio, or video input often requires mmproj or extra modules.
Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.
KV cache quantization: q8_0, q4_0, and similar modes can save VRAM, but may affect detail.

So the “minimum VRAM” column should be read as the threshold for startup and short-context inference. For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.

How to Choose

If you just want to try Gemma 4 locally:

4GB to 6GB VRAM: choose E2B Q3_K_M or E2B Q4_K_M.
8GB VRAM: prefer E4B Q4_K_M; E2B Q8_0 is also fine.
12GB VRAM: choose E4B Q8_0, or try low-bit 26B/31B variants.
16GB VRAM: try 26B A4B UD-Q3_K_M or 31B Q3_K_S, but do not expect long context to feel comfortable.
24GB VRAM: focus on 26B A4B UD-Q4_K_M and 31B Q4_K_M.
32GB and above: consider Q5_K_M, Q6_K, or longer context.

Most users do not need BF16. Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.

A 16GB GPU Can Still Run 35B Models: VRAM Compression Strategies for MoE Models in LM Studio

Wed, 22 Apr 2026 21:47:34 +0800

Many people think of 16GB VRAM as the point where local LLM deployment more or less tops out at 12B to 14B models, and anything larger becomes too painful even with quantization. That view is understandable, but it is not the true ceiling of a 16GB GPU.

If your model choice and parameter setup are good enough, a 16GB GPU does not have to stay limited to “small-parameter” models. One representative approach is to use MoE models inside LM Studio with a sensible unloading strategy, so that 35B-class models can still run at a genuinely usable speed.

01 Why a 16GB GPU is not necessarily limited to 12B to 14B

The core idea is straightforward: VRAM size matters, but model architecture matters just as much.

If you try to cram a standard dense model into a 16GB GPU, you will hit the wall quickly. These models usually involve all parameters during inference, so VRAM pressure and bandwidth pressure rise immediately.

But MoE models are different. Their total parameter count can be large, while only part of the expert parameters are activated in a single inference step. Take a 35B-class model as an example: the total parameter count is high, but far fewer parameters participate in each inference step. The full weights still have to live somewhere, but experts that are idle on a given step can sit in system memory, so the amount that must stay in VRAM is not as extreme as many people assume.

That is exactly why a 16GB GPU still leaves some room to work with.

02 Key practical takeaway: 35B MoE models can run surprisingly fast

One representative case is a quantized MoE model such as Qwen 3.5 35B A3B. With a 16GB GPU and the right settings in LM Studio, Q6 quantization can reach above 30 tokens/s, and Q4 sometimes tests even higher.

That result matters not just because the model “runs,” but because the speed is already in a clearly usable range.

As a comparison, large models of a similar scale that are not MoE often run into VRAM overflow and sharply lower speed on a 16GB GPU. In other words, the outcome is not determined by parameter count alone. What matters is how those parameters are actually used during inference.

03 In LM Studio, the key is not just one parameter

If you want this kind of model to run smoothly on a 16GB GPU, the real trick is not luck. It is tuning two parameters correctly:

GPU Offload
the setting that forces part of the expert layers into CPU memory

The first one is easy to understand. GPU Offload is basically something you push as high as possible, so the model prioritizes GPU computation.

The second one is the real key here. It is not the traditional “borrow system memory after VRAM overflows” approach. Instead, it proactively places part of the expert layers into CPU memory to reduce VRAM usage in advance. Since MoE models do not activate every expert on every step anyway, moving some experts into memory does not hurt overall inference speed as much as many people would expect.

A safer way to tune it is to start within a range and then adjust gradually for your machine:

start the expert-to-CPU setting at a value somewhere between 20 and 35
then fine-tune based on VRAM usage and memory pressure

At its core, this method is using system memory to buy back VRAM headroom.
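
The trade can be sketched numerically. The example assumes expert weights are spread evenly across layers and that moving a layer's experts to CPU memory removes exactly that share from VRAM. The model size, layer count, and expert share are hypothetical placeholders, not measurements from LM Studio or from any specific model, and KV cache is not included.

```python
from typing import Tuple


def split_after_expert_offload(model_gb: float, n_layers: int,
                               expert_share: float,
                               layers_on_cpu: int) -> Tuple[float, float]:
    """Return (weights left in VRAM, weights moved to system RAM) in GB."""
    if not 0 <= layers_on_cpu <= n_layers:
        raise ValueError("layers_on_cpu must be between 0 and n_layers")
    if not 0.0 <= expert_share <= 1.0:
        raise ValueError("expert_share must be between 0 and 1")
    expert_gb_per_layer = model_gb * expert_share / n_layers
    moved_gb = expert_gb_per_layer * layers_on_cpu
    return model_gb - moved_gb, moved_gb


# Hypothetical: a 22 GB quantized MoE file, 40 layers, 90% of weights in experts.
gpu_gb, cpu_gb = split_after_expert_offload(22.0, 40, 0.9, layers_on_cpu=20)

print(f"VRAM for weights: {gpu_gb:.1f} GB, moved to system RAM: {cpu_gb:.1f} GB")
```

Raising `layers_on_cpu` lowers the VRAM figure and raises the system RAM figure by the same amount, which is the sense in which the method buys VRAM headroom with memory.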

04 It can still run at 128K context, and smaller contexts reduce VRAM further

Another interesting point is that even with the context length pushed to 128K, a 35B-class MoE model can still maintain a relatively high speed.

That tells us something important: the bottleneck of a 16GB GPU is not as rigid as many people imagine. Especially inside a local inference tool like LM Studio, the real question is often not simply “can it run or not,” but rather:

are you willing to trade more system memory for less VRAM usage
are you willing to shorten the context length
are you willing to accept different capability tradeoffs across quantization levels

If the context is reduced further from 128K to 64K or 32K, VRAM pressure can drop even more. That means some 35B-class MoE models may even run, barely, on GPUs with less VRAM, though speed and memory pressure will need to be rebalanced.

05 The cost of this approach: much higher demands on RAM and virtual memory

This kind of setup is not free performance.

What you need to watch is that once VRAM pressure is compressed further, system RAM usage rises noticeably, and virtual memory pressure rises too. In other words, you are not removing the cost. You are shifting pressure from the GPU to RAM and disk swap space.

So if you want to try it yourself, it is worth checking a few things first:

whether your system RAM is large enough
whether your virtual memory allocation is large enough
whether too many background applications are already consuming resources

If those conditions are not in place, what you may get is not “35B running fast,” but an overall machine that becomes slow everywhere.

06 More aggressive quantization is not always better

There is another practical tradeoff here. Lower-bit quantization often saves more VRAM, but that does not automatically make it the best choice.

The practical takeaway is that some models do run faster under Q4, but their original capability can also degrade more. By comparison, Q6 tends to strike a better balance between speed and capability retention. So the right choice depends on what you care about more:

maximum speed and fitting into VRAM
or preserving more of the model’s original capability

Those two priorities do not necessarily lead to the same quantization choice.

07 What kinds of models are worth trying

From this angle, the best approach is not to blindly chase bigger parameter counts, but to first look for models that fit this strategy:

models built on MoE architecture
models that are well supported in LM Studio and have complete quantized variants
models with clear advantages in long context or instruction following

And the idea does not stop at one 35B MoE model. It also extends naturally to other directions, such as experimental models with stronger long-context memory, better instruction following, or lighter quantized versions with strong speed performance.

The logic behind this is very consistent: first find models whose architecture fits the “trade memory for VRAM” strategy, and then talk about tuning. Do not start from parameter count alone and decide from there.

08 Short conclusion

If you happen to have a 16GB GPU and assume local LLMs stop at 12B to 14B, that assumption is worth updating.

A more accurate way to put it is:

a 16GB GPU is not automatically ruled out for larger models
dense models and MoE models need to be considered separately
GPU Offload and expert-layer transfer to CPU memory inside LM Studio can significantly change VRAM usage
in practice, you are trading higher memory pressure for larger model scale and better usable speed

This approach will not fit every machine, but it does show one important thing: in local LLM deployment, VRAM is not the only limit. Model architecture and inference configuration matter just as much.

How to Use llama-quantize for GGUF Models

Sun, 12 Apr 2026 09:42:36 +0800

llama-quantize is the quantization tool in llama.cpp. It is used to convert high-precision GGUF models into smaller quantized versions.

Its most common use is turning formats such as F32, BF16, or FP16 into versions like Q4_K_M, Q5_K_M, or Q8_0 that are easier to run locally. After quantization, models usually become much smaller and often faster at inference, but some quality loss is expected.

Basic workflow

A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.

# install Python dependencies
python3 -m pip install -r requirements.txt

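# convert the model to GGUF FP16 format (explicit output path so the next step finds the file)
python3 convert_hf_to_gguf.py ./models/mymodel/ --outfile ./models/mymodel/ggml-model-f16.gguf --outtype f16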

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

After that, you can run the quantized model with llama-cli:

# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"

Common options

--allow-requantize: allows requantizing an already quantized model, usually not ideal for quality
--leave-output-tensor: keeps the output layer unquantized, increasing size but sometimes helping quality
--pure: disables k-quant mixtures and quantizes all tensors to the same type
--imatrix <file>: uses an importance matrix file to improve quantization quality
--keep-split: keeps the original shard layout instead of producing one merged file
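
If you script your quantization runs, a small helper can assemble the command line from the options listed above. This is a minimal sketch: the function name, its parameters, and the imatrix file name are made up for illustration, and it only builds the argument list without running anything.

```python
import shlex


def build_quantize_cmd(src, dst, qtype, imatrix=None, allow_requantize=False,
                       leave_output_tensor=False, pure=False, keep_split=False,
                       binary="./llama-quantize"):
    """Return the argv list for one llama-quantize run (options come first)."""
    cmd = [binary]
    if allow_requantize:
        cmd.append("--allow-requantize")
    if leave_output_tensor:
        cmd.append("--leave-output-tensor")
    if pure:
        cmd.append("--pure")
    if imatrix:
        cmd += ["--imatrix", imatrix]
    if keep_split:
        cmd.append("--keep-split")
    return cmd + [src, dst, qtype]


cmd = build_quantize_cmd(
    "./models/mymodel/ggml-model-f16.gguf",
    "./models/mymodel/ggml-model-Q5_K_M.gguf",
    "Q5_K_M",
    imatrix="./models/mymodel/imatrix.dat",  # hypothetical file name
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems when paths contain spaces.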

If you just want a practical starting point, this is often enough:

./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

How to choose a quant

You can think of quant levels as a tradeoff between size, speed, and quality:

Q8_0: larger, but usually safer for quality
Q6_K / Q5_K_M: common balanced choices
Q4_K_M: a very common default with a good size-quality balance
Q3 / Q2: useful when hardware is very limited, but quality loss is more visible

The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.

Practical takeaway

start with Q4_K_M or Q5_K_M
move up to Q6_K or Q8_0 if quality matters more
move down to Q3 or Q2 if memory is tight
compare versions with the same prompt set
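
Comparing versions with the same prompt set can be as simple as the sketch below. The generators here are stand-ins so the example runs without a model. In practice each callable would wrap your own inference call for one quantized file, and the function name is only illustrative.

```python
def compare_quants(generators, prompts):
    """Run every prompt through every model variant and collect the outputs.

    generators maps a label such as "Q4_K_M" to a callable prompt -> text.
    """
    results = {}
    for prompt in prompts:
        results[prompt] = {label: gen(prompt) for label, gen in generators.items()}
    return results


# Stand-in generators; replace each lambda with a real call to your model.
fake = {
    "Q4_K_M": lambda p: f"[Q4_K_M] {p.upper()}",
    "Q6_K": lambda p: f"[Q6_K] {p.upper()}",
}
prompts = ["summarize this paragraph", "write a haiku about VRAM"]

for prompt, answers in compare_quants(fake, prompts).items():
    print(prompt)
    for label, text in answers.items():
        print(f"  {label}: {text}")
```

Reading the answers side by side for a fixed prompt list makes quality differences between quants much easier to spot than ad hoc chatting.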

In short, llama-quantize is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.

Choosing Llama GGUF Quantization on Hugging Face: Practical Advice from Q8 to Q2

Sat, 11 Apr 2026 20:07:29 +0800

When selecting a Llama GGUF model on Hugging Face, you can think of quantization levels like resolution: lower levels need less VRAM/RAM, but quality drops gradually.

Understand 32, 16, and Q levels first

32 (F32, 32-bit float): closest to original/uncompressed quality, but hardware demand is extreme.
16 (F16 or BF16, 16-bit float): still very close to original quality, at around half the size of 32.
Q8: common entry point for quantized models (Q8_0 or Q8).
Q6, Q5, Q4, Q3, Q2: lower number means lower resource use and higher quality loss risk.

What `K_M` / `K_S` means

K_M and K_S are mixed quantization variants:

most weights stay at the target quantization level
important parts keep higher precision

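So at the same level, Qx_K_M or Qx_K_S is usually slightly better than plain Qx (the older formats such as Q4_0). The trailing letter indicates the mix size: S is small, M is medium.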

Practical picking strategy

If hardware allows, start with Q8.
If memory is tight, step down through Q6 / Q5 / Q4.
Try not to go below Q4; Q4_K_M is a common lower bound.
Below Q4, quality degradation becomes increasingly visible.
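
The step-down strategy above can be sketched as a small helper. The ladder uses llama.cpp type names, and the file sizes and the 2 GB headroom are hypothetical values. Read the real sizes from the repository's file list.

```python
# Preference order taken from the advice above: highest precision first.
LADDER = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M"]


def pick_quant(file_sizes_gb, memory_gb, headroom_gb=2.0):
    """Return the first quant on the ladder whose file fits, or None.

    headroom_gb is an assumed allowance for context and runtime overhead.
    """
    for name in LADDER:
        size = file_sizes_gb.get(name)
        if size is not None and size + headroom_gb <= memory_gb:
            return name
    return None


# Hypothetical file sizes, as you might read them off a repository's file list.
sizes = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.9}
print(pick_quant(sizes, memory_gb=8))
print(pick_quant(sizes, memory_gb=4))
```

A result of None means nothing at Q4 or above fits, which is the point where a smaller model is usually a better idea than a lower quant.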

Quality order (best to worst)

32
16

– Above this point, quality is effectively the same, but hardware requirements are extreme –

Q8
Q6_K_M
Q6_K_S
Q6
Q5_K_M
Q5_K_S
Q5

– This is the typical sweet spot –

Q4_K_M
Q4_K_S
Q4

– Below this point, quality loss becomes visible –

Q3_K_M
Q3_K_S
Q3
Q2_K_M
Q2_K_S
Q2

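If you want one short rule: start with Q8 or Q6, then move down to Q5 or Q4_K_M only when needed. Note that not every name in the ladder above is published for every model. llama.cpp, for example, produces 6-bit files as a single Q6_K type with no _S or _M variants.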

LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2

Sun, 05 Apr 2026 22:09:11 +0800

The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.

What Is Quantization

Quantization means compressing model parameters from higher-precision formats (such as FP16) into lower-bit formats (such as Q8 and Q4).

A simple analogy:

Original model: like a high-quality photo, clear but large.
Quantized model: like a compressed photo, slightly less detail but lighter and faster.

Common Quantization Formats

Quantization	Precision / Bit Width	Size	Quality Loss	Recommended Use
FP16	16-bit float	Largest	Almost none	Research, evaluation, max quality
Q8_0	8-bit integer	Larger	Almost none	High-end PCs, quality + performance
Q5_K_M	5-bit mixed	Medium	Slight	Daily driver, balanced choice
Q4_K_M	4-bit mixed	Smaller	Acceptable	General default, strong value
Q3_K_M	3-bit mixed	Very small	Noticeable	Low-spec devices, run-first
Q2_K	2-bit mixed	Smallest	Significant	Extreme resource limits, fallback
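
To put rough numbers on the Size column, the sketch below estimates weight size from parameter count and bits per weight. The per-format figures are the nominal block sizes as I understand llama.cpp's layouts, so treat them as approximations. Mixed variants such as Q4_K_M keep some tensors at a higher type and come out somewhat larger.

```python
# Nominal bits per weight of each block format (bytes per block * 8 / weights
# per block). Mixed variants such as Q4_K_M keep some tensors at a higher
# type, so real files come out somewhat larger than this estimate.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 34 * 8 / 32,    # 8.5
    "Q5_K": 176 * 8 / 256,  # 5.5
    "Q4_K": 144 * 8 / 256,  # 4.5
    "Q3_K": 110 * 8 / 256,  # 3.4375
    "Q2_K": 84 * 8 / 256,   # 2.625
}


def approx_size_gb(n_params_billion, fmt):
    """Rough weight size in GB (10^9 bytes) for a given parameter count."""
    return n_params_billion * BITS_PER_WEIGHT[fmt] / 8


for fmt in BITS_PER_WEIGHT:
    print(f"8B model at {fmt}: ~{approx_size_gb(8, fmt):.1f} GB")
```

The 8B parameter count is only an example. Actual memory use is higher than the file size once context and runtime buffers are added.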

Quantization Naming Rules

Take gemma-4:4b-q4_k_m as an example:

gemma-4:4b: model name and parameter scale.
q4: 4-bit quantization.
k: K-quants (an improved quantization method).
m: medium level (common options also include s/small and l/large).
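
The naming rule can be expressed as a tiny parser. This is a sketch that only handles suffixes shaped like the example above (such as q4_k_m, q8_0, or q2_k). The regular expression and function name are illustrative, not part of any tool.

```python
import re

# Matches tags shaped like the example above, e.g. "q4_k_m", "q8_0", "q2_k".
QUANT_RE = re.compile(r"^q(?P<bits>\d+)(?:_(?P<family>k|\d))?(?:_(?P<size>[sml]))?$")
SIZES = {"s": "small", "m": "medium", "l": "large"}


def parse_quant(tag):
    """Split a lowercase quant suffix into bit width, family, and size level."""
    m = QUANT_RE.match(tag.lower())
    if m is None:
        raise ValueError(f"unrecognized quant tag: {tag!r}")
    family = m.group("family")
    return {
        "bits": int(m.group("bits")),
        "k_quant": family == "k",
        "size": SIZES.get(m.group("size")),
    }


model, _, quant = "gemma-4:4b-q4_k_m".rpartition("-")
print(model, parse_quant(quant))
```

Splitting on the last hyphen keeps the model name intact even when it contains hyphens itself.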

Quick Selection by VRAM

RAM / VRAM	Recommended Quantization
4 GB	Q3_K_M / Q2_K
8 GB	Q4_K_M
16 GB	Q5_K_M / Q8_0
32 GB+	FP16 / Q8_0

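Treat this table as a rough guide: the right row also depends on parameter count, since a larger model needs a lower-bit format to fit in the same memory. Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.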

Practical Tips

Start with Q4_K_M by default and test real tasks first.
If response quality is not enough, move up to Q5_K_M or Q8_0.
If VRAM or speed is the main bottleneck, move down to Q3_K_M.
Use the same test set every time you switch quantization formats.

Conclusion

Quality first: FP16 or Q8_0.
Balance first: Q5_K_M.
General default: Q4_K_M.
Low-spec fallback: Q3_K_M or Q2_K.

The key is not “bigger is always better”, but “the most stable and usable result under your hardware limits.”
