NVIDIA Nemotron 3 Ultra Released: An Open Reasoning Model for Long-Running Agents

NVIDIA released Nemotron 3 Ultra on June 4, 2026. It is an open reasoning model for long-running agents, using a 550B-parameter Mixture-of-Experts architecture with about 55B active parameters per inference.

The release is not just another large chat model. Its direction is more specific: helping multi-turn, multi-tool, multi-agent workflows run faster, cheaper, and more reliably.

Why NVIDIA emphasizes long-running agents

Regular chat models handle one question and one answer. Agents handle chains of tasks.

A real long-running agent may:

make a plan;
call search, code, database, or enterprise tools;
delegate work to sub-agents;
receive tool results;
reason about the next step;
validate outputs;
recover from errors.

This process makes token counts grow quickly. The longer the task runs, the more history, tool output, reasoning steps, and intermediate results accumulate. Model-call cost rises, and the risk of goal drift also increases.

NVIDIA’s approach is to solve this with a system of models: stronger frontier reasoning models for key reasoning and orchestration, and efficient models for high-frequency execution, validation, and tool calls. Nemotron 3 Ultra sits in the high-capability orchestration layer.

Core positioning of Nemotron 3 Ultra

Nemotron 3 Ultra is a 550B-parameter MoE model, but each inference activates about 55B parameters. It is not aimed at lightweight chat. It is aimed at the hard calls inside agent workflows.

NVIDIA’s examples include:

maintaining architectural decisions across long coding sessions;
synthesizing conflicting evidence from hundreds of research sources;
verifying chip designs against thousands of constraints;
planning, calling tools, recovering from errors, and continuing across many turns.

In other words, Ultra is closer to a “chief orchestrator + deep reasoning” component in an agent system, not a cheap execution model for every small tool call.

Performance and efficiency

NVIDIA provides several benchmark figures in the official blog. Nemotron 3 Ultra performs competitively across agent and long-context evaluations:

PinchBench: 91%
EnterpriseOps-Gym: 33%
Terminal-Bench 2.0: 54%
IFBench: 82%
Ruler @1M: 95%

NVIDIA also says it can deliver up to 5x higher throughput compared with open models in its class. For long-running agents, that matters more than a single-turn benchmark because agent tasks usually require continuous calls across many turns.

Cost is another key point. NVIDIA says that in SWE-bench and Terminal-Bench 2.0 experiments, Nemotron 3 Ultra used fewer total tokens and fewer tokens per turn to complete tasks, lowering agentic task cost by up to 30%.

For developers, this means Nemotron 3 Ultra is not only trying to answer correctly. It is also optimizing how many tokens, how much time, and how much money it takes to finish the whole task.

Hybrid Mamba-Transformer for long-context efficiency

Long-context agents have two conflicting needs.

They need efficient handling of very long sequences because tool outputs and action traces keep growing. They also need precise recall of specific facts inside the context, such as a tool result, a file path, or a constraint.

Nemotron 3 Ultra uses a Hybrid Mamba-Transformer architecture to balance the two:

Mamba layers improve long-sequence efficiency;
Transformer layers preserve precise recall of contextual facts.

This is well matched to agent workflows. Agents do not only read long documents; they continuously write their own action traces into context. If long-context efficiency is poor, tasks slow down over time. If precise recall is weak, the agent may forget critical constraints late in the run.

NVFP4: one checkpoint across multiple NVIDIA GPU generations

NVIDIA also highlights NVFP4 precision.

According to NVIDIA, the same NVFP4 checkpoint can run on NVIDIA Hopper, Blackwell, and Ampere GPUs. With specialized NVFP4 quantization kernels, developers can use one checkpoint across multiple NVIDIA GPU architectures.

On Blackwell, NVIDIA says NVFP4 can provide up to 5x higher per-GPU throughput than BF16 at the same interactivity.

This is practical for enterprise deployment. Many companies do not have just one GPU generation. They may run Ampere, Hopper, and Blackwell at the same time. Maintaining different model versions for each hardware generation increases deployment and validation cost.

LatentMoE and MTP

Nemotron 3 Ultra also uses LatentMoE and Multi-token prediction.

LatentMoE handles more efficient expert routing. One core issue for MoE models is deciding which experts should handle each request. Agent workflows may include reasoning, code generation, tool calls, and domain-specific logic, so routing efficiency directly affects throughput and capability.

Multi-token prediction, or MTP, improves generation speed. Instead of predicting only the next token each time, it attempts to predict multiple future tokens in one forward pass, reducing wait time for long outputs and multi-turn tasks.

Together, these features show that NVIDIA is optimizing not just isolated model capability, but overall throughput, latency, and cost during long agent runs.

MOPD: Multi-Teacher On-Policy Distillation

One important training method in this release is Multi-Teacher On-Policy Distillation, or MOPD.

In simple terms, Ultra does not learn from a single teacher model. It learns from more than ten domain-specialized teacher models. Each teacher has its own domain-specific training pipeline and scores Ultra within its area of expertise.

MOPD has several traits:

the student model generates its own attempts;
domain-specific teacher models provide dense reward signals;
student rollouts, teacher scoring, and student optimization are asynchronously pipelined;
the training process is iterative, and new student checkpoints can become the starting point for later teacher training.

The goal is to improve across multiple domains, not just become better at general chat. For enterprise agents, that matters because real tasks often mix law, code, knowledge work, business processes, and safety rules.

Training data and open recipes

NVIDIA again emphasizes open data and training recipes.

On top of a 10T token pretraining foundation, Nemotron 3 Ultra adds 212B new tokens targeting three high-value gaps:

4B synthetic legal tokens;
35B synthesized Wiki-based tokens;
173B refreshed GitHub tokens through September 30, 2025.

For post-training, the release also adds:

10M new SFT samples;
1M new RL tasks;
15 new RL environments.

Cumulatively, Nemotron open data now reaches 50M SFT samples, 2M RL tasks, and 55 RL environments.

This matters for enterprise and sovereign AI projects. Capability is only one dimension. Training data transparency, provenance, and traceability also affect whether a model can enter production.

How developers can use it

Nemotron 3 Ultra is an open model, and NVIDIA says weights, data, and recipes are open so developers can adapt it to domain-specific workflows.

Officially mentioned usage paths include:

downloading weights from Hugging Face;
deploying with NVIDIA NIM microservice;
trying it on build.nvidia.com;
accessing it through OpenRouter, Anaconda, Perplexity Pro, and other entry points;
running inference with SGLang, TRT-LLM, vLLM, and similar tools;
fine-tuning with LoRA, SFT, and reinforcement learning through NeMo libraries.

If you are building enterprise agents, Nemotron 3 Ultra fits best in roles such as:

complex task planning;
multi-tool orchestration;
long-context evidence synthesis;
key decisions inside coding agents;
top-level control in multi-agent systems;
high-difficulty reasoning for domain agents.

It does not need to be called for every small request. A more realistic architecture is to use Ultra for key reasoning and smaller, cheaper models for high-frequency simple steps.

Safer agent execution: NemoClaw and OpenShell

NVIDIA also emphasizes safe agent runtime.

The official stack includes:

Hermes Agent and OpenClaw: agent harnesses for multi-turn workflows, providing orchestration loops, memory, and tools;
NVIDIA OpenShell: a secure runtime environment where autonomous agents and generated code execute under control;
NVIDIA NemoClaw: an open-source blueprint that installs the OpenShell runtime with a single command and connects the agent harness, runtime, and open models.

This is critical. The stronger an agent becomes, the less it should run directly on production machines without controls. If a model can write code, call tools, or operate files, it needs sandboxing, permission boundaries, logs, and human confirmation.

Nemotron 3.5 Content Safety and ASR

Alongside Nemotron 3 Ultra, NVIDIA also released two related models.

The first is Nemotron 3.5 Content Safety, an open 4B guardrail model for identifying unsafe, disallowed, or policy-violating content across text, images, and mixed inputs. It covers 23 safety categories and 12 languages, and can be used as an inference-time guardrail, an LLM safety evaluation judge, or with training data for safety post-training.

The second is Nemotron 3.5 ASR, an automatic speech recognition model for voice-native agents. It uses a cache-aware streaming architecture to process audio deltas with low latency. NVIDIA says it supports 40+ languages and continues the real-time voice design of Nemotron 3 ASR.

This shows NVIDIA is not only releasing a reasoning model. It is filling out an agent stack: reasoning, voice input, safety guardrails, runtime sandboxing, and deployment tools all in one ecosystem.

Open licensing and deployment ecosystem

Nemotron releases are moving to OpenMDW-1.1, a permissive Linux Foundation license designed for open AI model distributions. NVIDIA says it covers architecture, parameters, documentation, software, and related materials, reducing licensing ambiguity during evaluation and adoption.

For enterprises, license clarity matters. Many models are capable enough, but unclear terms around weights, data, recipes, commercial use, and redistribution slow down legal and compliance review.

NVIDIA also lists a large partner ecosystem across inference software, cloud services, model customization, and inference providers. The goal is clear: Nemotron 3 Ultra is intended not just as a research model, but as something that can enter real agent production pipelines.

Keep expectations realistic

Nemotron 3 Ultra is powerful, but it is not a model for casual local use on a personal computer.

550B MoE and 55B active parameters mean it is better suited to enterprise GPU clusters, cloud services, NIM, or professional inference platforms. For ordinary developers, the realistic entry points are APIs, managed services, build.nvidia.com, or deployment routes in the Hugging Face ecosystem.

Official benchmarks are useful, but they do not automatically equal your business outcome. Whether an agent system works well also depends on:

agent harness design;
tool permissions and reliability;
long-context trimming strategy;
task decomposition;
error recovery;
security sandboxing and audit.

A strong model is only one layer of an agent system. Production quality is usually determined by the combination of model, tools, context management, runtime, and evaluation.

Summary

Nemotron 3 Ultra pushes open reasoning models toward the real needs of long-running agents: longer context, higher throughput, lower cost to completion, clearer training data, and customizable deployment paths.

It is not an ordinary chat model launch. It is NVIDIA packaging a broader agent infrastructure move: Ultra handles hard reasoning and orchestration, Content Safety provides guardrails, ASR provides voice input, OpenShell and NemoClaw provide runtime, and NIM plus inference platforms provide deployment.

If you are building enterprise agents, coding agents, research automation, multi-tool orchestration, or sovereign AI projects, Nemotron 3 Ultra is worth watching closely. Its real competition is not single-turn Q&A, but whether long-chain tasks can finish faster, more reliably, and at lower cost.

Sources

NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents