NVIDIA Qwen3.6-35B-A3B-NVFP4 Guide: FP4 Quantization and vLLM Deployment

nvidia/Qwen3.6-35B-A3B-NVFP4 is NVIDIA’s FP4/NVFP4 quantized release of Alibaba’s Qwen3.6-35B-A3B for vLLM deployment. The main search intent is practical: what the model is, what NVFP4 changes, what hardware it needs, and how to serve it.

NVIDIA has released nvidia/Qwen3.6-35B-A3B-NVFP4 on Hugging Face. It is a quantized version based on Alibaba’s Qwen3.6-35B-A3B, processed with NVIDIA Model Optimizer, with the goal of making it easier for developers to deploy the model in vLLM, Agent, RAG, chatbot, and other inference scenarios.

The model card shows that it uses the Apache-2.0 license and can be used in both commercial and non-commercial settings. One important detail is that NVIDIA explicitly states this is not an NVIDIA-built base model, but a quantized version of the third-party model Qwen3.6-35B-A3B.

NVIDIA Qwen3.6-35B-A3B-NVFP4 Model Information

According to the model card, the key parameters of Qwen3.6-35B-A3B-NVFP4 are as follows:

Base model: Qwen/Qwen3.6-35B-A3B
Publisher: NVIDIA
Quantization tool: NVIDIA Model Optimizer
License: Apache-2.0
Architecture: Transformer
Network structure: MoE with Hybrid Attention
Parameter scale: 35B total parameters, 3B activated parameters
Input: text, images, video
Output: text
Context length: up to 262K
Inference engine: vLLM
Recommended hardware: NVIDIA Hopper, NVIDIA Blackwell
Recommended system: Linux

The Hugging Face page sidebar also shows file size and tensor type information for the model files. When reading it, do not directly treat the sidebar’s file statistics as the architecture parameters of the base model.

What NVFP4 Quantization Does

The focus of this release is NVFP4 quantization. The model card states that NVIDIA applied NVFP4 quantization to the weights of Qwen3.6-35B-A3B so it can be used with vLLM inference.

This quantization does not simply force everything down to 4-bit. Instead, it processes the weights and activations of linear operators in the MoE Transformer block. The official result is that the bit width per parameter is reduced from 16 bit to 4 bit, while disk usage and GPU memory requirements are reduced by about 3.06x.

For deployment, the value of this kind of pre-quantized release is straightforward: you do not need to rerun the quantization workflow yourself, and can directly test throughput, memory usage, and long-context inference behavior.

vLLM Deployment Command

The basic launch command provided by the model card is:

1

vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 --quantization modelopt --max-model-len 262144 --reasoning-parser qwen3

This command keeps the 262K context length and is suitable for first validating the model’s capabilities in a high-memory environment. If GPU memory is tight, you can reduce --max-model-len first and then raise it gradually.

For NVIDIA DGX Spark, the model card provides another set of environment variables and vLLM parameters:

1
2
3
4
5


export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_FP8_MOE_BACKEND=flashinfer_cutlass
export FLASHINFER_DISABLE_VERSION_CHECK=1
export CUTE_DSL_ARCH=sm_121a
vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 --tensor-parallel-size 1 --trust-remote-code --dtype auto --quantization modelopt --kv-cache-dtype fp8 --attention-backend flashinfer --moe-backend marlin --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4 --max-num-batched-tokens 8192 --enable-chunked-prefill --async-scheduling --enable-prefix-caching --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'

This parameter set is closer to practical deployment tuning: it lowers the context length to 65536, enables FP8 KV cache, chunked prefill, prefix caching, and configures speculative decoding. It is not something every machine can copy and run directly. In particular, parameters such as CUTE_DSL_ARCH=sm_121a, FlashInfer, and the MoE backend all depend on the specific GPU, driver, CUDA, and vLLM versions.

How to Read the Benchmark Results

The model card compares the BF16 baseline with the NVFP4 quantized version:

Precision	MMLU Pro	GPQA Diamond	τ²-Bench Telecom	SciCode	AIME 2025	AA-LCR	IFBench	MMMU Pro
BF16	85.6	84.9	95.5	40.8	89.2	62.0	62.3	74.1
NVFP4	85.0	84.8	94.7	40.6	88.8	62.0	62.8	74.5

From the table, NVFP4 shows small fluctuations compared with BF16: some metrics are slightly lower, while IFBench and MMMU Pro are slightly higher. A more cautious interpretation is that this quantized version stays close to BF16 on these public benchmarks, but it still needs to be tested with your own business data before deployment.

This is especially true for scenarios such as Agent workflows, RAG, code generation, and long-context retrieval. Public benchmarks can only provide a reference. Before going into production, you still need to check:

Whether the model follows instructions reliably under long context;
Whether it ignores referenced materials in RAG scenarios;
Whether tool calls tend to produce incorrect parameters;
Whether Chinese, English, and multimodal inputs meet your business requirements;
Whether throughput and latency are acceptable under low-memory configurations.

Suitable Scenarios

This model is better suited for teams that are already preparing to use NVIDIA GPUs and vLLM for inference services. Typical scenarios include:

Local or private chatbot deployments;
RAG knowledge-base question answering;
Planning and tool calling in Agent systems;
Long-document reading and summarization;
Large model inference testing with lower GPU memory usage;
Deployment teams that want to compare BF16 and FP4 quantization results.

If you only want to casually run it on a regular consumer GPU, first confirm the GPU memory, vLLM version, and quantization support. A pre-quantized model can lower the deployment barrier, but it does not mean every piece of hardware can run a 262K context smoothly.

Usage Limits

The model card also notes common limitations: the base model’s training data comes from the internet and may contain harmful content and social biases. As a result, the model may amplify biases under certain prompts, generate inaccurate content, omit key information, or produce inappropriate text.

If it is used in production, it is recommended to add at least several layers of protection:

Run safety evaluations for your business scenarios;
Add result validation for RAG and tool calls;
Add human review for high-risk outputs;
Record the inference version, quantization configuration, and vLLM parameters;
Keep a rollback plan to other models or the BF16 version for important tasks.

Summary

The value of nvidia/Qwen3.6-35B-A3B-NVFP4 is that it turns Qwen3.6-35B-A3B into an NVIDIA quantized version that can be deployed directly with vLLM. NVFP4 reduces GPU memory and disk pressure, and the official benchmarks also show performance close to BF16 across several metrics.

Still, it remains an inference model that requires engineering validation. Before real deployment, do not only look at benchmark scores. Test it against your own hardware, context length, RAG data, Agent toolchain, and safety requirements.

Reference links:

Validate More Than Throughput

Use a fixed evaluation set that includes your real prompts, long-context cases, tool calls, and retrieval data. Record VRAM use, first-token latency, steady-state throughput, and failure behavior separately; a model that looks fast in a short benchmark may not fit the context length or concurrency required by the actual service.