NVIDIA has released nvidia/Qwen3.6-35B-A3B-NVFP4 on Hugging Face. It is a quantized version of Alibaba’s Qwen3.6-35B-A3B, produced with NVIDIA Model Optimizer, and is intended to make the model easier to serve with vLLM for agent, RAG, chatbot, and other inference workloads.
The model card shows that it uses the Apache-2.0 license and can be used in both commercial and non-commercial settings. One important detail is that NVIDIA explicitly states this is not an NVIDIA-built base model, but a quantized version of the third-party model Qwen3.6-35B-A3B.
Basic Model Information
According to the model card, the key parameters of Qwen3.6-35B-A3B-NVFP4 are as follows:
- Base model: Qwen/Qwen3.6-35B-A3B
- Publisher: NVIDIA
- Quantization tool: NVIDIA Model Optimizer
- License: Apache-2.0
- Architecture: Transformer
- Network structure: MoE with Hybrid Attention
- Parameter scale: 35B total parameters, 3B activated parameters
- Input: text, images, video
- Output: text
- Context length: up to 262K
- Inference engine: vLLM
- Recommended hardware: NVIDIA Hopper, NVIDIA Blackwell
- Recommended system: Linux
The Hugging Face page sidebar also shows file size and tensor type information for the model files. These figures describe the quantized checkpoint files, so do not read them as the architecture parameters of the base model.
What NVFP4 Quantization Does
The focus of this release is NVFP4 quantization. The model card states that NVIDIA applied NVFP4 quantization to the weights of Qwen3.6-35B-A3B so it can be used with vLLM inference.
This quantization does not push every tensor down to 4 bits. It targets the weights and activations of the linear operators inside the MoE Transformer blocks. According to the model card, the bit width per quantized parameter drops from 16 bits to 4 bits, and disk usage and GPU memory requirements fall by about 3.06x.
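As a rough sanity check on the reported reduction, the arithmetic can be sketched as follows. The sketch assumes all 35B parameters are stored at 16 bits before quantization and simply applies the model card's 3.06x ratio. The helper name is illustrative, and the real checkpoint size depends on which tensors stay in higher precision and on scale-factor overhead.

```python
# Back-of-the-envelope estimate only; these are not measurements of the
# actual checkpoint files.

TOTAL_PARAMS = 35e9        # total parameters, from the model card
BF16_BITS = 16             # bits per parameter before quantization
REPORTED_REDUCTION = 3.06  # disk/memory reduction reported by the model card


def weight_size_gb(num_params: float, bits_per_param: float) -> float:
    """Return the storage size in decimal gigabytes for a given bit width."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_size_gb(TOTAL_PARAMS, BF16_BITS)
nvfp4_gb = bf16_gb / REPORTED_REDUCTION

# Average bits per parameter implied by the reported ratio. A value above 4
# is consistent with only part of the model being quantized to 4 bits.
effective_bits = BF16_BITS / REPORTED_REDUCTION

print(f"BF16 weights:  ~{bf16_gb:.1f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:.1f} GB")
print(f"Implied average bits per parameter: ~{effective_bits:.2f}")
```

The implied average sits above 4 bits, which matches the point that the release does not quantize everything to 4-bit.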
For deployment, the value of this kind of pre-quantized release is straightforward: you do not need to rerun the quantization workflow yourself, and can directly test throughput, memory usage, and long-context inference behavior.
vLLM Deployment Command
The model card provides a basic vLLM launch command for this checkpoint; the exact invocation is listed on the model card.
This command keeps the 262K context length and is suitable for first validating the model’s capabilities in a high-memory environment. If GPU memory is tight, you can reduce --max-model-len first and then raise it gradually.
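The "reduce first, then raise gradually" approach can be sketched as a small helper that prints one `vllm serve` invocation per context length. This is a minimal sketch, not the model card's command: it uses only the `--max-model-len` flag, assumes 262K means 262144 tokens, and the function names and the 32768 starting point are illustrative choices.

```python
# Sketch: generate `vllm serve` invocations for increasing context lengths.
# Only --max-model-len is set here; a real deployment will need more flags.
import shlex

MODEL_ID = "nvidia/Qwen3.6-35B-A3B-NVFP4"
FULL_CONTEXT = 262144  # assumption: "262K" taken as 2**18 tokens


def build_serve_command(max_model_len: int) -> list[str]:
    """Return the argument list for one `vllm serve` invocation."""
    return ["vllm", "serve", MODEL_ID, "--max-model-len", str(max_model_len)]


def context_ladder(start: int, ceiling: int = FULL_CONTEXT) -> list[int]:
    """Double the context length from `start` until `ceiling` is reached."""
    steps = []
    length = start
    while length < ceiling:
        steps.append(length)
        length *= 2
    steps.append(ceiling)
    return steps


for length in context_ladder(32768):
    # Print the command instead of running it, so each step can be
    # validated by hand before moving to the next one.
    print(shlex.join(build_serve_command(length)))
```

Starting at 32768 yields four steps (32768, 65536, 131072, 262144), so memory pressure can be observed at each stage before committing to the full context.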
For NVIDIA DGX Spark, the model card provides a separate set of environment variables and vLLM parameters; the full command is listed on the model card.
This parameter set is closer to practical deployment tuning: it lowers the context length to 65536, enables FP8 KV cache, chunked prefill, and prefix caching, and configures speculative decoding. It is not something every machine can copy and run directly. In particular, settings such as CUTE_DSL_ARCH=sm_121a, the use of FlashInfer, and the choice of MoE backend all depend on the specific GPU, driver, CUDA, and vLLM versions.
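To see why a shorter context and an FP8 KV cache both relieve memory pressure, the standard per-sequence KV-cache estimate can be sketched as below. The layer count, KV-head count, and head dimension are hypothetical placeholders, not the real Qwen3.6-35B-A3B configuration, so only the ratios between the rows are meaningful.

```python
# Sketch: KV-cache size as a function of context length and element width.
# The architecture numbers are HYPOTHETICAL and chosen only for illustration.

def kv_cache_gib(tokens: int, attn_layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int) -> float:
    """Estimate KV-cache size in GiB for a single sequence."""
    # The factor 2 accounts for storing both keys and values.
    total_bytes = 2 * tokens * attn_layers * kv_heads * head_dim * bytes_per_elem
    return total_bytes / 2**30


HYPOTHETICAL_ARCH = dict(attn_layers=10, kv_heads=2, head_dim=256)

for tokens in (262144, 65536):
    for label, width in (("16-bit KV", 2), ("FP8 KV", 1)):
        size = kv_cache_gib(tokens, bytes_per_elem=width, **HYPOTHETICAL_ARCH)
        print(f"{tokens:>7} tokens, {label:<9}: {size:.3f} GiB per sequence")
```

Under these placeholder values, dropping from 262144 to 65536 tokens cuts the cache by 4x, and switching to a 1-byte element halves it again.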
How to Read the Benchmark Results
The model card compares the BF16 baseline with the NVFP4 quantized version:
| Precision | MMLU Pro | GPQA Diamond | τ²-Bench Telecom | SciCode | AIME 2025 | AA-LCR | IFBench | MMMU Pro |
|---|---|---|---|---|---|---|---|---|
| BF16 | 85.6 | 84.9 | 95.5 | 40.8 | 89.2 | 62.0 | 62.3 | 74.1 |
| NVFP4 | 85.0 | 84.8 | 94.7 | 40.6 | 88.8 | 62.0 | 62.8 | 74.5 |
From the table, NVFP4 shows small fluctuations compared with BF16: five metrics are 0.1 to 0.8 points lower, AA-LCR is unchanged, and IFBench and MMMU Pro are 0.5 and 0.4 points higher. A cautious interpretation is that this quantized version stays close to BF16 on these public benchmarks, but it still needs to be tested with your own business data before deployment.
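The per-benchmark differences can be computed directly from the table. The snippet below is a small sketch that copies the scores from the table above; the variable names are illustrative.

```python
# Scores copied from the model card's table: (BF16, NVFP4).
scores = {
    "MMLU Pro": (85.6, 85.0),
    "GPQA Diamond": (84.9, 84.8),
    "τ²-Bench Telecom": (95.5, 94.7),
    "SciCode": (40.8, 40.6),
    "AIME 2025": (89.2, 88.8),
    "AA-LCR": (62.0, 62.0),
    "IFBench": (62.3, 62.8),
    "MMMU Pro": (74.1, 74.5),
}

# Positive delta means NVFP4 scored higher than BF16.
deltas = {name: round(nvfp4 - bf16, 1) for name, (bf16, nvfp4) in scores.items()}
mean_delta = round(sum(deltas.values()) / len(deltas), 2)
largest_drop = min(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name:<18} {delta:+.1f}")
print(f"Mean delta: {mean_delta:+.2f}")
print(f"Largest drop: {largest_drop} ({deltas[largest_drop]:+.1f})")
```

The mean difference works out to -0.15 points, with the largest single drop on τ²-Bench Telecom (-0.8). The table does not report run-to-run variance, so differences of this size should not be over-interpreted.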
This is especially true for scenarios such as Agent workflows, RAG, code generation, and long-context retrieval. Public benchmarks can only provide a reference. Before going into production, you still need to check:
- Whether the model follows instructions reliably under long context;
- Whether it ignores referenced materials in RAG scenarios;
- Whether tool calls tend to produce incorrect parameters;
- Whether Chinese, English, and multimodal inputs meet your business requirements;
- Whether throughput and latency are acceptable under low-memory configurations.
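A checklist like this is easier to repeat if it is scripted. The sketch below shows one possible shape for such a harness: it takes any `generate(prompt) -> str` callable, checks each answer for a required substring, and records latency. The stand-in `fake_generate` function and the sample cases are hypothetical; in practice `generate` would wrap a call to your own inference endpoint, and substring matching is only a crude pass criterion.

```python
# Sketch of a tiny pre-production check harness with a pluggable backend.
import time
from statistics import median


def run_checks(generate, cases):
    """Run (prompt, required_substring) cases and record pass/fail and latency."""
    results = []
    for prompt, must_contain in cases:
        start = time.perf_counter()
        answer = generate(prompt)
        elapsed = time.perf_counter() - start
        results.append({
            "prompt": prompt,
            "passed": must_contain.lower() in answer.lower(),
            "latency_s": elapsed,
        })
    return results


def summarize(results):
    """Aggregate pass rate and median latency over all cases."""
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "median_latency_s": median(r["latency_s"] for r in results),
    }


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; it just echoes the prompt.
    return f"Echo: {prompt}"


CASES = [
    ("Cite the refund policy from the provided document.", "refund"),
    ("Reply with the word ACK only.", "nack"),  # fails against the echo stub
]

print(summarize(run_checks(fake_generate, CASES)))
```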
Suitable Scenarios
This model is best suited to teams that already plan to use NVIDIA GPUs and vLLM for inference services. Typical scenarios include:
- Local or private chatbot deployments;
- RAG knowledge-base question answering;
- Planning and tool calling in Agent systems;
- Long-document reading and summarization;
- Large-model inference testing with lower GPU memory usage;
- Side-by-side comparison of BF16 and FP4 quantization results.
If you only want to casually run it on a regular consumer GPU, first confirm the GPU memory, vLLM version, and quantization support. A pre-quantized model can lower the deployment barrier, but it does not mean every piece of hardware can run a 262K context smoothly.
Usage Limits
The model card also notes common limitations: the base model’s training data comes from the internet and may contain harmful content and social biases. As a result, the model may amplify biases under certain prompts, generate inaccurate content, omit key information, or produce inappropriate text.
If the model is used in production, add at least the following safeguards:
- Run safety evaluations for your business scenarios;
- Add result validation for RAG and tool calls;
- Add human review for high-risk outputs;
- Record the inference version, quantization configuration, and vLLM parameters;
- Keep a rollback plan to other models or the BF16 version for important tasks.
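Two of these safeguards, validating tool calls and recording the deployment configuration, can be sketched in a few lines. Everything named here is hypothetical: the `search_docs` tool, its schema, and the record fields are placeholders for whatever your own agent stack and deployment actually use.

```python
# Sketch: validate tool-call arguments and record the serving configuration.
import json

# Hypothetical tool registry: tool name -> {argument name: expected type}.
TOOL_SCHEMAS = {
    "search_docs": {"query": str, "top_k": int},
}


def validate_tool_call(name: str, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]
    problems = []
    for key, expected in schema.items():
        if key not in arguments:
            problems.append(f"missing argument: {key}")
        elif not isinstance(arguments[key], expected):
            problems.append(f"wrong type for {key}: expected {expected.__name__}")
    for key in arguments:
        if key not in schema:
            problems.append(f"unexpected argument: {key}")
    return problems


def deployment_record(vllm_version: str, vllm_args: list[str]) -> str:
    """Serialize the details needed to reproduce or roll back a deployment."""
    return json.dumps({
        "model": "nvidia/Qwen3.6-35B-A3B-NVFP4",
        "quantization": "NVFP4",
        "vllm_version": vllm_version,  # supplied by the caller, not detected
        "vllm_args": vllm_args,
    }, sort_keys=True)


print(validate_tool_call("search_docs", {"query": "refund policy", "top_k": "5"}))
print(deployment_record("<your vLLM version>", ["--max-model-len", "65536"]))
```

Rejecting a malformed call before it reaches the tool is cheaper than handling its failure afterwards, and a stored record makes it possible to tell which configuration produced a given output.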
Summary
The value of nvidia/Qwen3.6-35B-A3B-NVFP4 is that it packages Qwen3.6-35B-A3B as an NVIDIA-quantized checkpoint that can be deployed directly with vLLM. NVFP4 reduces GPU memory and disk pressure, and the official benchmarks show accuracy close to BF16 across several metrics.
Still, it remains an inference model that requires engineering validation. Before real deployment, do not only look at benchmark scores. Test it against your own hardware, context length, RAG data, Agent toolchain, and safety requirements.