Nvidia on KnightLi Blog

NVIDIA Releases Qwen3.6-35B-A3B-NVFP4: An FP4 Quantized Version for vLLM Deployment

Sun, 31 May 2026 13:05:55 +0800

NVIDIA has released nvidia/Qwen3.6-35B-A3B-NVFP4 on Hugging Face. It is a quantized version based on Alibaba’s Qwen3.6-35B-A3B, processed with NVIDIA Model Optimizer, with the goal of making it easier for developers to deploy the model in vLLM, Agent, RAG, chatbot, and other inference scenarios.

The model card shows that it uses the Apache-2.0 license and can be used in both commercial and non-commercial settings. One important detail is that NVIDIA explicitly states this is not an NVIDIA-built base model, but a quantized version of the third-party model Qwen3.6-35B-A3B.

Basic Model Information

According to the model card, the key parameters of Qwen3.6-35B-A3B-NVFP4 are as follows:

Base model: Qwen/Qwen3.6-35B-A3B
Publisher: NVIDIA
Quantization tool: NVIDIA Model Optimizer
License: Apache-2.0
Architecture: Transformer
Network structure: MoE with Hybrid Attention
Parameter scale: 35B total parameters, 3B activated parameters
Input: text, images, video
Output: text
Context length: up to 262K
Inference engine: vLLM
Recommended hardware: NVIDIA Hopper, NVIDIA Blackwell
Recommended system: Linux

The Hugging Face page sidebar also shows file size and tensor type information for the model files. When reading it, do not directly treat the sidebar’s file statistics as the architecture parameters of the base model.

What NVFP4 Quantization Does

The focus of this release is NVFP4 quantization. The model card states that NVIDIA applied NVFP4 quantization to the weights of Qwen3.6-35B-A3B so it can be used with vLLM inference.

This quantization does not simply force everything down to 4-bit. Instead, it processes the weights and activations of linear operators in the MoE Transformer block. The official result is that the bit width per parameter is reduced from 16 bit to 4 bit, while disk usage and GPU memory requirements are reduced by about 3.06x.

For deployment, the value of this kind of pre-quantized release is straightforward: you do not need to rerun the quantization workflow yourself, and can directly test throughput, memory usage, and long-context inference behavior.

vLLM Deployment Command

The basic launch command provided by the model card is:

`1`	`vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 --quantization modelopt --max-model-len 262144 --reasoning-parser qwen3`

This command keeps the 262K context length and is suitable for first validating the model’s capabilities in a high-memory environment. If GPU memory is tight, you can reduce --max-model-len first and then raise it gradually.

For NVIDIA DGX Spark, the model card provides another set of environment variables and vLLM parameters:

export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_FP8_MOE_BACKEND=flashinfer_cutlass
export FLASHINFER_DISABLE_VERSION_CHECK=1
export CUTE_DSL_ARCH=sm_121a
vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 --tensor-parallel-size 1 --trust-remote-code --dtype auto --quantization modelopt --kv-cache-dtype fp8 --attention-backend flashinfer --moe-backend marlin --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4 --max-num-batched-tokens 8192 --enable-chunked-prefill --async-scheduling --enable-prefix-caching --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'

This parameter set is closer to practical deployment tuning: it lowers the context length to 65536, enables FP8 KV cache, chunked prefill, prefix caching, and configures speculative decoding. It is not something every machine can copy and run directly. In particular, parameters such as CUTE_DSL_ARCH=sm_121a, FlashInfer, and the MoE backend all depend on the specific GPU, driver, CUDA, and vLLM versions.

How to Read the Benchmark Results

The model card compares the BF16 baseline with the NVFP4 quantized version:

Precision	MMLU Pro	GPQA Diamond	τ²-Bench Telecom	SciCode	AIME 2025	AA-LCR	IFBench	MMMU Pro
BF16	85.6	84.9	95.5	40.8	89.2	62.0	62.3	74.1
NVFP4	85.0	84.8	94.7	40.6	88.8	62.0	62.8	74.5

From the table, NVFP4 shows small fluctuations compared with BF16: some metrics are slightly lower, while IFBench and MMMU Pro are slightly higher. A more cautious interpretation is that this quantized version stays close to BF16 on these public benchmarks, but it still needs to be tested with your own business data before deployment.

This is especially true for scenarios such as Agent workflows, RAG, code generation, and long-context retrieval. Public benchmarks can only provide a reference. Before going into production, you still need to check:

Whether the model follows instructions reliably under long context;
Whether it ignores referenced materials in RAG scenarios;
Whether tool calls tend to produce incorrect parameters;
Whether Chinese, English, and multimodal inputs meet your business requirements;
Whether throughput and latency are acceptable under low-memory configurations.

Suitable Scenarios

This model is better suited for teams that are already preparing to use NVIDIA GPUs and vLLM for inference services. Typical scenarios include:

Local or private chatbot deployments;
RAG knowledge-base question answering;
Planning and tool calling in Agent systems;
Long-document reading and summarization;
Large model inference testing with lower GPU memory usage;
Deployment teams that want to compare BF16 and FP4 quantization results.

If you only want to casually run it on a regular consumer GPU, first confirm the GPU memory, vLLM version, and quantization support. A pre-quantized model can lower the deployment barrier, but it does not mean every piece of hardware can run a 262K context smoothly.

Usage Limits

The model card also notes common limitations: the base model’s training data comes from the internet and may contain harmful content and social biases. As a result, the model may amplify biases under certain prompts, generate inaccurate content, omit key information, or produce inappropriate text.

If it is used in production, it is recommended to add at least several layers of protection:

Run safety evaluations for your business scenarios;
Add result validation for RAG and tool calls;
Add human review for high-risk outputs;
Record the inference version, quantization configuration, and vLLM parameters;
Keep a rollback plan to other models or the BF16 version for important tasks.

Summary

The value of nvidia/Qwen3.6-35B-A3B-NVFP4 is that it turns Qwen3.6-35B-A3B into an NVIDIA quantized version that can be deployed directly with vLLM. NVFP4 reduces GPU memory and disk pressure, and the official benchmarks also show performance close to BF16 across several metrics.

Still, it remains an inference model that requires engineering validation. Before real deployment, do not only look at benchmark scores. Test it against your own hardware, context length, RAG data, Agent toolchain, and safety requirements.

Reference links:

Behind Cerebras' IPO Surge: Can Wafer-Scale AI Chips Challenge Nvidia?

Mon, 18 May 2026 00:19:51 +0800

Cerebras Systems has finally entered the public market.

The company, known for its “wafer-scale AI chips”, began trading on Nasdaq on May 14, 2026 under the ticker CBRS. According to Cerebras’ official announcement, the IPO price was $185 per share, with 34.5 million shares of Class A common stock offered, including the underwriters’ full exercise of a 4.5 million share over-allotment option.

On its first trading day, Cerebras opened sharply higher and briefly approached $386. Based on the IPO price, the company raised more than $5.5 billion, making it one of the most closely watched AI hardware IPOs in the U.S. market in 2026.

That is why many media outlets call it an “Nvidia challenger”. But it is not accurate to simply describe Cerebras as “the next Nvidia”. What makes it unusual is that it has chosen a technical path very different from traditional GPUs.

Cerebras Is Not Building a Normal GPU

Cerebras’ core product is WSE, short for Wafer-Scale Engine.

Traditional chip manufacturing cuts a whole wafer into many small chips, then packages, tests, and ships them. Cerebras takes the opposite approach: it tries to turn an entire wafer directly into one giant chip.

The advantages of this route are straightforward:

Larger chip area.
More on-chip compute units.
On-chip SRAM closer to compute cores.
Shorter data movement inside the chip.
Better fit for certain AI inference and training workloads.

In AI computing, moving data is often harder to optimize than raw computation. Cerebras’ idea is to keep compute and storage on the same piece of silicon as much as possible, reducing the latency and energy cost caused by data repeatedly leaving the chip.

That is the most attractive part of the WSE approach. Instead of scaling along the same GPU path, it uses a much larger single chip to pursue higher on-chip bandwidth and lower data movement cost.

Why the Market Got Excited

The AI chip market is currently highly dependent on Nvidia. Whether companies are training large models, deploying inference services, or building AI data centers, Nvidia GPUs remain the mainstream choice.

That makes the market naturally interested in two kinds of companies:

Companies that can reduce dependence on Nvidia’s supply chain.
Companies that can offer higher performance or lower cost for certain AI workloads.

Cerebras fits both narratives.

It is not building a general-purpose CPU or an ordinary accelerator card. It designs systems directly around AI training and inference. The company has also repeatedly emphasized that its wafer-scale chips and cloud inference platform can deliver very high throughput in certain model inference scenarios.

This kind of story is easy for the market to amplify in 2026. AI infrastructure is still expanding, and enterprises, cloud providers, and model companies are all looking for more compute sources. If a chip company can prove that it is not just “another small GPU” in some scenarios, the market will pay attention.

The OpenAI Partnership Expands the Upside Story

Another reason Cerebras is closely watched is its relationship with OpenAI.

According to media reports, Cerebras signed a cooperation agreement with OpenAI worth more than $20 billion. The original Sohu article noted that, as of the end of 2025, the remaining performance obligations from that agreement reached $24.6 billion.

For a newly listed AI hardware company, such long-term agreements are important. They suggest that the company has not only a technical story, but also demand from major customers.

Still, long-term orders are not the same as realized revenue. AI data center deployment depends on manufacturing capacity, packaging, power supply, delivery schedules, customer budgets, and changes in model strategy. For chip companies, winning orders is only the first step. Delivering on time, scaling reliably, and building margins are harder.

Customer Concentration Remains a Major Risk

Cerebras also has an obvious risk: high customer concentration.

The Sohu article noted that G42 contributed 85% of Cerebras’ revenue in 2024, falling to 24% in 2025, while Mohamed bin Zayed University of Artificial Intelligence contributed 62% of revenue in 2025. This means that even after G42’s share declined, Cerebras’ revenue still depended heavily on a small number of large customers.

For AI infrastructure companies, customer concentration has two sides.

The benefit is that large customers can bring rapid growth, long-term contracts, and order visibility.

The risk is that if customers cut budgets, change technical direction, delay data center construction, or face regulatory changes, revenue volatility can be significant.

That is why Cerebras should not be judged only by its IPO pop. The first-day stock price reflects enthusiasm and expectations. Long-term valuation will still depend on revenue structure, delivery capability, margins, and customer diversification.

The Technical Limitation: Memory Capacity

WSE has clear strengths, but its limitations are also clear.

The Sohu article noted that the WSE-3 chip has 44GB of SRAM, while Nvidia’s B200 has 192GB of memory. Cerebras places a large amount of compute and SRAM on the same wafer, which reduces data movement, but also limits available memory capacity.

For large models, memory capacity directly affects context length, batch size, and deployment architecture. Context windows are getting longer, and flagship models are increasingly moving toward million-token context windows. In that trend, on-chip SRAM capacity becomes a real constraint.

Traditional GPUs can continue expanding memory through HBM stacking, packaging expansion, and multi-GPU interconnects. Cerebras’ wafer-scale approach is harder to expand in a simple way because the wafer area is already occupied by compute units and SRAM. Adding more SRAM may mean sacrificing compute area.

This does not mean the Cerebras architecture has failed. It means it is an architectural choice optimized for specific workloads. It may be very strong in certain inference scenarios, but it does not necessarily cover every AI training and inference need.

Can It Replace Nvidia?

In the short term, Cerebras is unlikely to replace Nvidia.

Nvidia’s advantage is not only GPU performance. It also includes the CUDA ecosystem, developer tools, system integration, networking, full-stack server solutions, cloud provider support, and customer migration costs. AI companies often choose Nvidia not because one chip wins on one metric, but because the entire ecosystem is the most stable.

Cerebras’ more realistic opportunity is to become a complementary option for specific AI workloads:

High-throughput inference.
Specific large-model services.
Tasks sensitive to latency and on-chip bandwidth.
Customers that want to reduce dependence on a single GPU supply chain.
Model companies willing to test new architectures for performance.

In other words, it is not an “Nvidia killer”. It is more like an aggressive alternative path in the AI compute market.

Summary

Cerebras’ IPO surge shows that capital markets are still willing to pay a high premium for AI infrastructure stories.

Its wafer-scale chip architecture is genuinely distinctive, separating it from ordinary AI accelerator companies. Together with major customer relationships such as OpenAI, Cerebras has a strong market narrative.

But the risks are just as real: customer concentration, delivery pressure, memory capacity limits, ecosystem barriers, and the system-level gap with Nvidia will all determine how far it can go.

For ordinary readers, the most interesting part of Cerebras is not how much the stock rose. It is that the company proves AI compute competition will not have only one GPU path. Future large-model infrastructure may include GPUs, wafer-scale chips, in-house accelerators, and cloud-based specialized inference platforms at the same time.

References

The U.S. Clears Nvidia H200 Sales: 10 Chinese Companies Approved, but Delivery Is Still Uncertain

Sat, 16 May 2026 17:12:09 +0800

The U.S. export license process for Nvidia H200 sales to China has finally made concrete progress.

According to Reuters-related reports, the U.S. Commerce Department has approved about 10 Chinese companies to buy Nvidia H200 AI chips. The approved list includes major internet companies and supply-chain firms, such as Alibaba, Tencent, ByteDance, JD.com, Lenovo, and Foxconn. However, as of May 14, 2026, H200 chips had still not been delivered to the Chinese market.

This needs to be read carefully: the U.S. side has granted some licenses, but that does not mean the chips have arrived, nor does it mean Chinese companies can immediately deploy them at scale.

What Was Approved

There are three key points in this approval.

First, the U.S. Commerce Department approved about 10 Chinese companies to purchase H200 chips. According to reports, approved customers may buy directly from Nvidia or through authorized intermediaries and distributors.

Second, each approved customer may buy up to about 75,000 H200 chips. If fully delivered, this volume would significantly improve high-end GPU supply for major cloud providers and large-model companies.

Third, Lenovo has confirmed that it is one of the companies that received Nvidia export licenses and is allowed to sell H200 in China. Companies like Lenovo and Foxconn are not only buyers; they may also handle server systems, rack integration, and distribution.

The most important caveat is that a license is not the same as delivery. Public reports emphasize that no H200 shipments to China have been completed yet.

Why H200 Matters

H200 belongs to Nvidia’s Hopper-generation accelerator lineup and is positioned above the H20, which was previously designed for the Chinese market. H20 was a reduced-spec product built to fit earlier export restrictions, while H200 offers stronger compute and memory capabilities.

Public information shows that H200 comes with 141GB of HBM3e memory, making it valuable for large-model training, inference, long-context services, and enterprise AI deployments. It is not Nvidia’s latest Blackwell-generation product, but for Chinese cloud providers and AI companies, it is still a high-end compute resource.

That is why H200 has remained sensitive in U.S.-China AI chip controls. The U.S. wants to limit China’s access to the most advanced AI compute while avoiding a complete loss of Nvidia’s China business. China, meanwhile, wants to reduce reliance on U.S. GPUs and direct more compute investment toward domestic chips and local ecosystems.

It Has Not Really Landed Yet

The easiest mistake is to read “approved to buy” as “supply has reopened.”

Based on current public information, there are still several variables:

U.S. approval is only the first step; orders, review, shipment, and compliance workflows still need to continue.
Whether China will allow actual import and deployment still requires clearer policy guidance.
Whether approved companies place orders immediately depends on price, delivery time, domestic alternatives, and long-term policy risk.
Nvidia may need to re-coordinate H200 capacity because its focus had already shifted to Blackwell and later products.

In other words, H200 sales to China now look more like an opened license window than a supply chain that is already moving chips into Chinese data centers at scale.

What It Means for Nvidia

For Nvidia, the China market remains too important to ignore.

After export restrictions tightened, Nvidia’s share in China’s high-end AI accelerator market was clearly affected. Jensen Huang has repeatedly argued that the U.S. should not casually give up the Chinese market, because doing so would hurt Nvidia’s revenue and weaken the influence of the U.S. technology ecosystem among global AI developers.

If H200 can eventually be delivered, Nvidia can partially recover Chinese customer orders and keep CUDA in Chinese large-model and cloud-computing workflows.

But this business will not return to the old frictionless state. Licenses, quotas, revenue-sharing arrangements, third-party verification, re-export restrictions, and customer identity review may all become long-term costs. For Nvidia, H200 is not just a product sale; it is a way to maintain market presence in a narrow policy corridor.

What It Means for Chinese Companies

For Chinese companies, H200 is short-term compute supply, not long-term certainty.

If approved companies can actually receive H200 chips, large-model training, inference services, AI cloud, agent platforms, and enterprise private deployments will all benefit. Teams already deeply tied to the CUDA toolchain face far lower migration costs with H200 than with a completely new hardware ecosystem.

But policy uncertainty will make companies cautious. Being able to buy H200 today does not mean stable procurement next year. Buying one batch does not mean a long-term expansion path exists. Even if major companies buy, they will likely continue pushing domestic GPUs, heterogeneous compute, inference optimization, and model compression to avoid being trapped again by a single supply chain.

So H200 is more of a buffer for Chinese AI companies than a final solution.

Pressure on Domestic Chips Will Not Disappear

U.S. approval of H200 does not reduce pressure on domestic AI chips. In some ways, it may make competition more direct.

If H200 really enters the Chinese market, domestic chip vendors will face a stronger benchmark in both performance and ecosystem. Customers will compare training stability, inference throughput, memory capacity, software toolchains, cluster communication, and operations cost.

Domestic chips still have room, however. As long as high-end GPU imports remain policy-sensitive, companies will not put their entire long-term compute base on Nvidia. Domestic solutions still have opportunities if they can provide controllable cost, stable supply, and usable software in specific scenarios.

A more realistic pattern may be: high-end training and critical inference continue to seek Nvidia resources such as H200, while large-scale inference, government and enterprise projects, and controllable supply-chain scenarios shift more toward domestic or mixed compute.

How to Read This

The most accurate reading is that U.S.-China AI chip friction has loosened temporarily, but has not returned to full openness.

The U.S. granted licenses to rebalance controls and commercial interests. Nvidia wants to use H200 to return to China’s high-end AI chip market. Chinese companies want stronger compute, but they also need to evaluate import uncertainty and domestic substitution strategy.

The key questions are not only whether the U.S. “allows” the sale, but what happens next:

Whether the first H200 batch is actually delivered to Chinese customers.
Whether approved companies disclose purchase scale and deployment scenarios.
Whether China provides clearer guidance on import, procurement, and usage.

Until those questions land, H200 remains an opened window for the Chinese market, not a fully restored supply chain.

References

What Jensen Huang Was Really Saying in His CMU Speech

Thu, 14 May 2026 20:59:50 +0800

Jensen Huang’s CMU speech looks, on the surface, like a mix of personal memory and startup storytelling. In reality, it was a cold shower for a group of top university graduates.

His core message was not “everything will become easier”. It was this: the AI era has arrived, and the old stable, respectable, linear career path may no longer hold. Young people need to prepare for hardship again, and they may also need to accept work that once looked less glamorous.

First Layer: I Had a Hard Childhood, and You May Have Hard Times Too

Huang talked about his childhood: waking up at 4 a.m. to deliver newspapers, then later washing dishes at Denny’s.

That story is motivational, of course, but it is not just nostalgia for struggle. He was speaking to Carnegie Mellon students, people who would normally have a clear path into investment banks, software companies, tech giants, and high-paying jobs.

So the real point was: do not assume you can graduate and keep walking along the comfortable path that worked for previous generations.

AI is rewriting the value of many jobs. The old model of rising through credentials, resumes, and big-company pipelines may be compressed. Many people may discover that they also have to go through a rougher, less polished, more foundational period of work.

Second Layer: Take Off the Gown and Do the Work That Is Actually Needed

Huang went from delivering newspapers to washing dishes at Denny’s, and described that as a major career advancement.

That sentence matters. He was saying that career value does not necessarily come from the title. It comes from whether you are inside real demand.

In today’s AI industry, the message may be: stop staring only at investment banks, internet software companies, consulting firms, and traditional white-collar jobs. The places that truly lack talent in the future may be more basic, more engineering-heavy, and more physically demanding.

For example:

building data centers;
working on power and cooling;
operating machine rooms;
handling electrical, plumbing, and infrastructure work;
deploying GPU clusters;
delivering AI factory engineering projects.

These jobs do not sound as polished as “joining a big company to write software”. But in the AI era, they may become the new key positions.

So “become a plumber, electrician, or data center builder” is not just a joke. It is a reminder to graduates: AI is not only models and code. It also needs electricity, land, data centers, networks, cooling, operations, and supply chains. Whoever can actually build those things stands in one of the hardest parts of the industry.

Third Layer: Hard Things Are Always Harder Than They Look

Huang also said that whenever NVIDIA ran into trouble, the team would ask: how hard can this be?

The answer, every time, was that it was harder than they first imagined.

That is a sentence every founder and engineer should hear. Many things look like just a project on a slide deck, just a roadmap item in a meeting, or just a trend inside a strategic narrative. But once you actually do them, you run into supply chains, capital, engineering, customers, organizations, competition, and time pressure.

This is especially true in the AI era.

Training models is hard. Deploying models is also hard. Making a demo is hard. Turning a demo into a reliable product is harder. Buying GPUs is hard. Keeping those GPUs fully utilized, stable, and commercially productive is even harder.

So Huang was not offering easy optimism. He was expressing engineering realism: you can be optimistic, but do not underestimate the difficulty.

The Real Reminder in This Speech

If the speech had to be compressed into one sentence, it would be this:

The AI era will not automatically reward smart people. It will reward people willing to enter real difficulty, real infrastructure, and real engineering work.

CMU students will of course still have many opportunities. But if they simply follow the path of previous graduates, find a stable role at a big company, and wait for career inertia to keep working, being left behind is not impossible.

What Huang was really telling them was: do not only imagine yourself walking from a graduation gown into a polished office. The future opportunities may be in data centers, power systems, cooling pipes, GPU clusters, and jobs that do not look elegant or white-collar at first.

AI will not only change software jobs. It will also redefine what counts as a good job.

NVIDIA Releases Nemotron 3 Nano Omni: An Open Omnimodal Reasoning Model for Agents

Fri, 01 May 2026 12:07:15 +0800

NVIDIA has released Nemotron 3 Nano Omni, an open omnimodal reasoning model designed for agent workflows. Its focus is not simply text question answering, but putting language, vision, and audio into the same reasoning framework so the model can handle inputs that are closer to real work.

In positioning, Nemotron 3 Nano Omni looks more like a foundation model prepared for AI Agents. It can understand information from screens, documents, images, speech, and video, then turn that information into actionable reasoning results. This kind of capability fits computer operation, document intelligence, video understanding, voice interaction, customer service, education, and enterprise process automation.

Model Specs

Nemotron 3 Nano Omni uses a MoE architecture. The key specs NVIDIA lists are:

Item	Information
Model name	`Nemotron 3 Nano Omni`
Architecture	MoE
Parameter scale	30B total / 3B active
Modalities	Text, image, audio, video
Context length	256K tokens
License	Apache 2.0
Main deployment direction	AI Agents, multimodal reasoning, enterprise agents

The most notable point here is 30B-A3B. It means the model has about 30B total parameters, but only activates about 3B parameters during each inference step. This is a tradeoff between capability and inference cost: the model keeps a larger expert capacity while using only part of it at runtime.

That said, MoE active params does not mean VRAM can be estimated as if this were only a 3B model. A full deployment still needs to account for expert weights, KV cache, vision and audio encoder modules, context length, and inference framework overhead.

It Is Not Solving a Single-Modality Problem

Traditional large language models mainly process text. Multimodal models add image understanding. Nemotron 3 Nano Omni has a broader target: it emphasizes omnimodal input, meaning text, images, audio, and video are all brought into a unified reasoning process.

This matters a lot for agents. Real agent tasks are often not “take a piece of text and generate another piece of text”; they are more like:

reading buttons, tables, and windows on a screen;
parsing PDFs, screenshots, charts, and webpages;
listening to spoken instructions or meeting recordings;
understanding actions, scenes, and timing in video;
combining those signals into the next operation.

If a model can only handle one modality, an Agent needs extra glue between multiple specialized models. The value of an omnimodal model is reducing that integration cost and letting the same model directly process more complex environmental inputs.

Built for Computer Operation and Document Intelligence

NVIDIA specifically notes that Nemotron 3 Nano Omni can be used for computer-operation tasks. These tasks usually require the model to understand user interfaces:

what controls are on the screen;
what state the current window is in;
which button or menu is the next target;
what the content in tables, dialogs, and input boxes means.

This is also one of the hard-to-avoid capabilities when AI Agents move into real deployment. If an agent is going to help people operate office software, browsers, enterprise backends, or developer tools, it has to understand the interface, not just read API docs.

Document intelligence follows a similar logic. Enterprise materials often mix text, tables, images, scanned pages, and charts. An omnimodal model can put all of that content into the same context for understanding, making it suitable for contract review, report analysis, invoice processing, knowledge-base QA, and process automation.

Audio and Video Bring Agents Closer to Real Scenarios

Audio and video inputs can noticeably expand the range of agent applications.

Audio scenarios include:

meeting recording summaries;
customer service call analysis;
voice command understanding;
education and training content organization.

Video scenarios include:

instructional video understanding;
security and industrial inspection;
screen recording analysis;
operation workflow review;
temporal reasoning in multi-step tasks.

If these tasks rely only on text transcription, a lot of visual and timing information is lost. An omnimodal model can directly combine voice, frames, and textual clues, giving Agents a more complete sense of their environment.

Deployment and Ecosystem

NVIDIA is placing Nemotron 3 Nano Omni inside an open ecosystem, and the model uses the Apache 2.0 license. That matters for developers and enterprises because it lowers the licensing barrier for experimentation, integration, and secondary development.

From NVIDIA’s introduction, this model is also closely tied to its inference ecosystem. For enterprise users, real deployment usually raises questions like:

whether it can run efficiently on NVIDIA GPUs;
whether it supports long context and multimodal input;
whether it can connect to existing Agent frameworks;
whether it can process internal documents, audio/video, and UI screenshots;
whether it can be deployed in private environments.

NVIDIA emphasizes that the model has a clear throughput advantage and says it can reach up to 9x the throughput of comparable open omnimodal reasoning models. The real value of that number still depends on the specific hardware, context length, input modalities, and inference framework. But the direction is clear: NVIDIA wants to bring open multimodal models and its inference infrastructure together into enterprise Agent scenarios.

Suitable Use Cases

Nemotron 3 Nano Omni is better suited to tasks such as:

Agents that need to understand text, images, audio, and video at the same time;
enterprise document intelligence and knowledge-base QA;
computer operation based on screenshots or web interfaces;
multimodal analysis of meetings, customer service, and teaching content;
video understanding, workflow review, and temporal reasoning;
teams that require open licensing and private deployment.

It is not necessarily a fit for every regular user. If the task is local chat, code completion, or simple QA, a single-modality language model may be lighter, faster, and more resource-efficient. The value of Nemotron 3 Nano Omni mainly appears in complex input and multimodal Agent workflows.

What This Means for AI Agents

For AI Agents to truly enter work scenarios, they cannot only write text. They need to understand interfaces, speech, documents, and changes in video, then turn that information into the next action.

That is where Nemotron 3 Nano Omni matters. It is not simply making the model larger; it is unifying the many kinds of input Agents face into one reasoning model. This can make it easier for developers to build agents for real tasks instead of building only around chat windows.

From this angle, the point of NVIDIA’s release is not just “another multimodal model”. It is part of a continuing effort to connect open models, GPU inference, enterprise Agents, and private deployment. What will be worth watching next is how it performs in concrete Agent frameworks, enterprise workflows, and local deployments.

References:

NVIDIA Technical Blog: NVIDIA Nemotron 3 Nano Omni

How to Pick a GPU in April 2026: Which Models to Avoid and Which Ones Are More Worth Considering

Mon, 27 Apr 2026 08:51:10 +0800

If you are getting ready to build a PC, the GPU is the one part where you really should not look only at whether a card is new. By April 2026, some models are already much harder to justify, while others are not perfect but still feel noticeably more reasonable than the alternatives around the same price.

So this article skips theory and goes straight to specific models.

Models I Would Not Prioritize

1. `RTX 5060 Ti 8GB`

The biggest issue with this card is not that it is unusable. The issue is that 8GB already feels caught in an awkward middle ground at this point.

If you mostly play lighter online games at 1080p medium to high settings, it can still do the job. But once you move into any of these areas, the limitation shows up quickly:

Newer AAA games
Higher texture settings
1440p
Mixed use with AI inference, editing, or productivity work

If you are already looking at the RTX 5060 Ti, the safer move is usually to go straight to the 16GB version instead of saving a bit of budget by taking the 8GB one.

In short:

RTX 5060 Ti 8GB: not recommended
RTX 5060 Ti 16GB: clearly more worth considering

2. Expensive older cards, especially `RTX 3080 10GB` and `RTX 3070 Ti` when they are still priced high

The problem with these cards is not that performance is completely bad. The problem is that, in today’s market, buying them often puts you in an awkward spot:

Power draw is not low
They are no longer new
VRAM is not especially generous
Used-market sources are often messy

RTX 3080 10GB is the clearest example. If it is still priced high, it quickly turns into a card that looks strong on paper but feels less balanced in real use.

RTX 3070 Ti follows the same logic. It is not absolutely unbuyable, but if the price gap is not meaningful, you are usually better off looking at something newer, something with more comfortable VRAM, or something more balanced in power and thermals.

3. Older flagships with unclear history, such as `RTX 3090` and `RTX 3080 Ti`

These two cards are easy to want for obvious reasons:

The names still sound strong
Paper performance is not weak
They are very visible in the used market

What you really need to watch out for is where they came from.

If you are buying:

A pulled card
A repaired card
A used card with unclear history

then the risk is usually much higher than with a normal retail card. A card like the RTX 3090 looks attractive because of the 24GB VRAM, but heat, power delivery, silicon condition, and past usage history all become bigger worries than they would be on a straightforward new card.

If you do not already know exactly what you are buying, and you are not planning to spend time checking the card carefully, these older flagships are generally not something I would touch casually.

4. `RTX 5070` when the price is not right

RTX 5070 is not a card that is automatically bad. The catch is that the price has to make sense.

Its awkwardness shows up when the gap between it and the RTX 5070 Ti is not large enough. In that case, a lot of buyers end up feeling oddly unsatisfied.

The pattern usually looks like this:

Buy the 5070: you keep thinking a little more would have gotten you the 5070 Ti
Do not stretch the budget: you still know you bought the “almost” card

So RTX 5070 is not something to ignore entirely, but it is worth considering only when the price is clearly right. If the pricing sits in an uncomfortable middle zone, it quickly becomes a card that makes theoretical sense but does not feel great in practice.

Models That Make More Sense

1. `RTX 5060 Ti 16GB`

If you are already shopping in the midrange, this card is usually the safer choice compared with the 8GB version.

The reasons are simple:

More headroom within the same product family
Less likely to be boxed in by VRAM over the next few years
Easier to live with if you mix gaming and productivity

It may not be the most explosive card at its price, but it is at least the kind of card you are less likely to regret immediately.

2. `RTX 5070 Ti`

If your budget can stretch, this is usually a more complete answer than the RTX 5070.

Its value is not that it dominates every single scenario. Its value is that it feels more like a card that can balance gaming, resolution, and longer-term use all at once.

It makes sense for people who:

Want 1440p high settings
Want the system to last for years
Do not want to start thinking about upgrades too soon

If you are already stuck between the 5070 and 5070 Ti, and the gap is not absurdly large, going straight to the 5070 Ti is often the less annoying decision.

3. Properly priced new cards are usually a better first stop than older high-end cards

If you are not a veteran used-GPU hunter, a simple and effective rule is this:

Prioritize normal retail new cards
Be cautious with older high-end cards that have messy origins

At this point, the more practical approach is often:

Midrange budget: start with RTX 5060 Ti 16GB
A tier higher: focus on RTX 5070 Ti
Consider RTX 5070 only when pricing is clearly favorable

That is usually a better path than gambling on older cards that sound stronger but come with more baggage.

If You Just Want the Short Version

You can remember it like this:

Not really recommended: RTX 5060 Ti 8GB
Not recommended unless priced well: RTX 5070
Be cautious with: RTX 3080 10GB, RTX 3070 Ti, and unclear-source RTX 3090 / RTX 3080 Ti
More worth considering: RTX 5060 Ti 16GB
Easier long-term pick if budget allows: RTX 5070 Ti

Final Line

At this point in the market, the real mistake is usually not spending a bit more. It is buying a card that looks acceptable on paper but always feels just a little compromised in real use.

If you want to minimize regret, RTX 5060 Ti 16GB and RTX 5070 Ti are generally safer than many cards that seem “good enough,” while RTX 5060 Ti 8GB, badly priced RTX 5070, and older high-end cards with unclear history are usually the first ones to cross off.

What Is NVIDIA nvbandwidth: How to Use This GPU Bandwidth Testing Tool

Fri, 24 Apr 2026 14:41:35 +0800

If you have recently been troubleshooting interconnect performance between multiple NVIDIA GPUs, or you want to verify the real bandwidth between PCIe, NVLink, host memory, and VRAM, NVIDIA/nvbandwidth is a small tool worth knowing about.

It is not a general benchmark utility, and it is not a hidden command inside a large model framework. It is an open-source tool from NVIDIA specifically designed to measure bandwidth and latency for GPU-related memory copies. Instead of only looking at theoretical bandwidth, nvbandwidth is better at answering a practical question: how much bandwidth can this machine and its current GPU interconnects actually deliver right now?

1. What does `nvbandwidth` do

According to the official README, nvbandwidth is a command-line tool for measuring bandwidth on NVIDIA GPUs.

It mainly focuses on transfer performance across different memcpy patterns, such as:

GPU -> GPU
CPU -> GPU
GPU -> CPU
Transfers between GPUs across multiple nodes

These tests are especially useful in scenarios like:

Troubleshooting interconnect bottlenecks in multi-GPU training or inference
Verifying the actual behavior of links such as NVLink, PCIe, and C2C
Comparing transfer differences across servers, topologies, drivers, or CUDA versions
Performing baseline hardware validation before cluster deployment

In short, nvbandwidth is not about model throughput. It is about the lower-level ability to move data.

2. It does not produce just one simple score

Many people think of a bandwidth test as something that ends with a single number, but nvbandwidth provides more detailed output than that.

It reports results as matrices for each test type. For example, in a test like device_to_device_memcpy_write_ce, it shows the bandwidth between each pair of GPUs by row and column. That means you can see more than just a rough system-wide speed estimate. You can also spot:

Which GPU pairs are especially fast
Which paths are clearly limited by PCIe
Whether certain GPU pairs show abnormally low bandwidth
Whether the multi-GPU topology matches your expectations

If you are working with an 8-GPU server, a dual-socket platform, or a multinode system, this matrix-style output is often more useful than a single average number.

3. How to understand `CE` and `SM` copies

The official documentation splits tests into two categories:

CE: copy engine transfers based on memcpy APIs
SM: kernel-based transfers

These two result types are not guaranteed to match exactly, because they represent different copy paths.
If you mainly want to understand regular device-to-device transfer behavior, you will usually look at CE first. If you want to study execution details more closely, then SM is worth checking too.

The README also explains that bandwidth results use the median across multiple test runs by default. Newer versions additionally include variability statistics, which makes it easier to judge how stable the numbers are.

4. What environment does it require

nvbandwidth is not a pure binary utility that you simply download and run. It expects a standard CUDA development environment.

The current README lists these basic requirements:

CUDA Toolkit 11.x or newer
A compiler with C++17 support
CMake 3.20+, with 3.24+ recommended
Boost program_options
A usable CUDA device and a compatible driver

The requirements are higher if you want the multinode version. The current README explicitly states:

Multinode builds require CUDA Toolkit 12.3
The driver must be 550 or newer
MPI is required
The nvidia-imex service must be configured

So this is much more of an engineering tool for Linux GPU servers and clusters than something aimed at casual desktop use.

5. How to build and run the single-node version

The single-node build process is straightforward:

1
2

cmake .
make

On Ubuntu / Debian, the project also provides a debian_install.sh script that installs common dependencies and builds the project.

After building, you can check the help output first:

`1`	`./nvbandwidth -h`

Some commonly used options include:

-l: list available tests
-t: run a specific test by name or index
-p: run tests by prefix
-b: set the memcpy buffer size, default 512 MiB
-i: set the number of benchmark iterations
-j: output JSON
-H: enable huge pages for host memory allocation

If you just want to run the default test suite once, use:

`1`	`./nvbandwidth`

If you only want to test one specific item, such as a device-to-device copy:

`1`	`./nvbandwidth -t device_to_device_memcpy_read_ce`

6. Multinode support is one of its standout features

nvbandwidth is not only for single-node multi-GPU testing. It also supports multinode scenarios.

According to the README, the multinode build is done like this:

1
2

cmake -DMULTINODE=1 .
make

At runtime, it is typically used together with mpirun, with one process launched per GPU.
The documentation also requires all participating ranks to belong to the same multinode clique, and it recommends mainly running tests with the multinode prefix under MPI.

That makes its positioning much closer to high-performance computing and large GPU systems than to simple workstation self-checks.

If you are working with NVLink multinode deployments or more complex platforms such as GB200 / Grace Hopper, the value of nvbandwidth is much higher than it would be on a typical consumer GPU setup.

7. What changed in `v0.9`

As of April 24, 2026, the GitHub Releases page shows that the latest version of nvbandwidth is v0.9, released on April 8, 2026.

The most notable updates in this release include:

Added variability statistics to bandwidth output
Added huge page support for host memory (Windows excluded)
Added pair sampling for device-to-device tests
Added a troubleshooting guide
Unified single-node and multinode execution paths

Two engineering-oriented changes are also worth noting:

Improved CUDA architecture detection without relying as much on direct GPU access
Deprecated Volta (sm_70 / sm_72) support in CUDA Toolkit 13.0+ environments

So if you only looked at early versions before, v0.9 is no longer just a basic bandwidth tester. It is clearly moving toward better automation, troubleshooting, and large-scale system validation.

8. When is it a good fit

nvbandwidth is especially suitable when:

You want to verify real interconnect bandwidth between multiple NVIDIA GPUs
You suspect one GPU is installed in a bandwidth-limited PCIe slot
You want to compare NVLink paths against non-NVLink paths
You are deploying a multinode GPU cluster and need to validate the links
You want test results in JSON for automation pipelines

But if your goal is only to answer questions like “how fast is training” or “how many tokens per second can inference reach,” this tool is not the whole answer.
In that case, you still need workload-level testing with your training framework, inference engine, or real application.

9. How to think about its value

Many GPU performance problems are not really caused by insufficient compute. They happen because the data path is not working as expected.

For example:

GPUs are not using the intended interconnect path
Cross-NUMA access is reducing speed
Certain GPU pairs have abnormal bandwidth
Multinode communication is only partially configured

These issues are often hard to diagnose if you only look at nvidia-smi or model throughput.
A lower-level, matrix-oriented tool like nvbandwidth is useful precisely because it exposes what is happening at the interconnect layer.

So a simple way to think about it is: nvbandwidth is a command-line health check tool for bandwidth on NVIDIA GPU systems.

GitHub project: https://github.com/NVIDIA/nvbandwidth
Releases: https://github.com/NVIDIA/nvbandwidth/releases

Nvidia on KnightLi Blog

NVIDIA Releases Qwen3.6-35B-A3B-NVFP4: An FP4 Quantized Version for vLLM Deployment

Basic Model Information

What NVFP4 Quantization Does

vLLM Deployment Command

How to Read the Benchmark Results

Suitable Scenarios

Usage Limits

Summary

Behind Cerebras' IPO Surge: Can Wafer-Scale AI Chips Challenge Nvidia?

Cerebras Is Not Building a Normal GPU

Why the Market Got Excited

The OpenAI Partnership Expands the Upside Story

Customer Concentration Remains a Major Risk

The Technical Limitation: Memory Capacity

Can It Replace Nvidia?

Summary

References

The U.S. Clears Nvidia H200 Sales: 10 Chinese Companies Approved, but Delivery Is Still Uncertain

What Was Approved

Why H200 Matters

It Has Not Really Landed Yet

What It Means for Nvidia

What It Means for Chinese Companies

Pressure on Domestic Chips Will Not Disappear

How to Read This

References

What Jensen Huang Was Really Saying in His CMU Speech

First Layer: I Had a Hard Childhood, and You May Have Hard Times Too

Second Layer: Take Off the Gown and Do the Work That Is Actually Needed

Third Layer: Hard Things Are Always Harder Than They Look

The Real Reminder in This Speech

NVIDIA Releases Nemotron 3 Nano Omni: An Open Omnimodal Reasoning Model for Agents

Model Specs

It Is Not Solving a Single-Modality Problem

Built for Computer Operation and Document Intelligence

Audio and Video Bring Agents Closer to Real Scenarios

Deployment and Ecosystem

Suitable Use Cases

What This Means for AI Agents

How to Pick a GPU in April 2026: Which Models to Avoid and Which Ones Are More Worth Considering

Models I Would Not Prioritize

1. RTX 5060 Ti 8GB

2. Expensive older cards, especially RTX 3080 10GB and RTX 3070 Ti when they are still priced high

3. Older flagships with unclear history, such as RTX 3090 and RTX 3080 Ti

4. RTX 5070 when the price is not right

Models That Make More Sense

1. RTX 5060 Ti 16GB

2. RTX 5070 Ti

3. Properly priced new cards are usually a better first stop than older high-end cards

If You Just Want the Short Version

Final Line

What Is NVIDIA nvbandwidth: How to Use This GPU Bandwidth Testing Tool

1. What does nvbandwidth do

2. It does not produce just one simple score

3. How to understand CE and SM copies

4. What environment does it require

5. How to build and run the single-node version

6. Multinode support is one of its standout features

7. What changed in v0.9

8. When is it a good fit

9. How to think about its value

Related links

1. `RTX 5060 Ti 8GB`

2. Expensive older cards, especially `RTX 3080 10GB` and `RTX 3070 Ti` when they are still priced high

3. Older flagships with unclear history, such as `RTX 3090` and `RTX 3080 Ti`

4. `RTX 5070` when the price is not right

1. `RTX 5060 Ti 16GB`

2. `RTX 5070 Ti`

1. What does `nvbandwidth` do

3. How to understand `CE` and `SM` copies

7. What changed in `v0.9`