Qwythos-9B-Claude-Mythos-5-1M is a 9B reasoning model published by Empero AI on Hugging Face.
Model page: empero-ai/Qwythos-9B-Claude-Mythos-5-1M
Its most notable points are straightforward:
- Based on
Qwen3.5-9B; - 9B parameter scale;
- Apache-2.0 license;
- Default 1,048,576 token context;
- Supports Qwen3.5-style function calling;
- Built for long-text reasoning, tool use, and agentic workflows;
- The model card provides vLLM, SGLang, and Transformers examples.
If you are looking for a relatively small open-weight model with long context and tool calling, Qwythos-9B is worth a look.
Who it is for
Qwythos-9B is not quite a normal chat model.
It is more suitable for:
- Long document analysis;
- Reading multi-file codebases;
- Long agent workflows;
- Tool-assisted question answering;
- Tasks that need a Python executor or search tool for verification;
- Research, reasoning, math, code, and technical document processing;
- Testing 1M context in local or private deployments.
It is less suitable if you:
- Only want lightweight chatting;
- Do not have GPU resources;
- Do not want to handle
<think>reasoning blocks; - Want an out-of-the-box consumer chat experience;
- Do not have application-level safety controls.
The model card explicitly describes it as a reasoning model. Its answers first include a <think> reasoning block, then the final answer. If you connect it to a user-facing product, you need to handle or hide that part yourself.
Core model information
According to the Hugging Face model card, the basics are:
| Item | Information |
|---|---|
| Model name | empero-ai/Qwythos-9B-Claude-Mythos-5-1M |
| Publisher | Empero AI |
| Base model | Qwen/Qwen3.5-9B |
| Size | 9B |
| Format | Safetensors |
| License | Apache-2.0 |
| Context | 1,048,576 tokens |
| Features | reasoning, function calling, long-context, agentic |
It is not just prompt wrapping. It is a full-parameter fine-tune. The model card says training data includes more than 500M tokens of Claude Mythos / Claude Fable traces, plus chain-of-thought data generated by Empero AI’s internal rethink tool.
The question is not simply whether it can chat. The interesting part is whether it can reason over complex context, call tools, and correct itself.
What 1M context means
The most eye-catching capability in the model card is its default YaRN rope scaling, extending context to:
|
|
That is roughly 1M tokens.
The configuration includes parameters like:
|
|
This is attractive for:
- Putting a large codebase directly into context;
- Processing 10 to 20 papers plus notes;
- Keeping tool outputs in long-running agent tasks;
- Cross-document analysis;
- Reasoning over long traceback, logs, or API responses.
But be realistic: 1M context does not mean any consumer GPU can comfortably run the full window.
The model card also notes that the full 1M window is better suited to tensor-parallel multi-GPU setups or aggressive KV-cache offload. A single high-end GPU may be more practical at 256k to 512k, depending on backend, quantization, KV cache, and VRAM.
Deploying with vLLM
If you are used to OpenAI-compatible APIs, vLLM is one of the most direct options.
Install:
|
|
Start the model:
|
|
To explicitly request a near-1M context, follow the model card:
|
|
Call the API:
|
|
If startup fails with insufficient VRAM, do not start at 1M. First test smaller --max-model-len values such as 32k, 64k, or 128k, then increase gradually.
Deploying with SGLang
SGLang is also covered in the model card.
Install:
|
|
Start:
|
|
To try long context:
|
|
Call it:
|
|
The model card also includes a Docker example:
|
|
Before deployment, make sure HF_TOKEN is configured if your environment needs access to gated resources or private cache.
Loading with Transformers
The model card’s text-only example uses AutoModelForImageTextToText and AutoTokenizer.
The structure is roughly:
|
|
Two details matter:
- The response includes
<think>...</think>; - The model card recommends giving enough
max_new_tokens, such as 16384.
For product output, you usually need post-processing to remove the <think> part and show only the final answer.
Recommended sampling settings
The model card recommends:
|
|
Do not start with greedy decoding, and do not push temperature too low.
The model card notes that greedy decoding or very low temperature (T <= 0.3) can make this kind of reasoning model fall into repetition loops. Using the recommended parameters is usually safer.
Understanding tool calling
Qwythos-9B supports Qwen3.5-style function calling.
The model card explains that you can pass tools=[...] to the chat template, and the model can output a Qwen3.5-compatible <tool_call> block.
A simplified tool definition looks like:
|
|
The model generates something like a <tool_call> block. Your application must parse it, execute the tool, and feed the result back to the model.
In other words, Qwythos-9B does not magically browse the web by itself.
You must provide the tool runtime.
Realistic hardware expectations
9B parameters may sound modest, but 1M context creates pressure on another dimension.
For deployment, look separately at:
- Model weight VRAM;
- KV cache;
- Context length;
- Batch size;
- Concurrency;
- Whether quantization is used;
- Whether KV-cache offload is used;
- Backend choice: vLLM, SGLang, or Transformers.
If you only want to test the model, begin with smaller context:
|
|
Only increase after the setup is stable.
Do not treat 1M context as something you must use every time. A more practical approach is to increase context only for codebase analysis, multi-paper summarization, long agent traces, and similar tasks.
Limits and safety boundaries
Several model-card limitations deserve attention:
- It is a reasoning model and outputs
<think>; - Low temperature or greedy decoding may cause repetition loops;
- Concrete identifiers, CVEs, drug labels, exact numbers, and similar facts still need tool or retrieval verification;
- The model is uncensored and may not refuse complex technical requests easily;
- Vision capability is inherited from the base model, but this fine-tune is text-only, so visual behavior was not the focus of training or evaluation.
If you use it in a user-facing app, add:
- Output filtering;
- Safety policy;
- Tool-call allowlists;
- Rate limiting;
- Logging and audit;
- Human review for high-risk domains;
- Retrieval or tool-based verification.
For cybersecurity, medical, pharmacology, financial, and legal scenarios, do not treat model output as final truth. It can assist reasoning, but final decisions should rely on reliable sources, tool results, or human review.
How to test it
For the first test, do not start with ultra-long context.
Try this order:
- Run a short Q&A through Transformers or vLLM;
- Use the recommended sampling settings;
- Observe the
<think>and final-answer format; - Test a longer document summary;
- Add a Python executor;
- Then try web_search or RAG;
- Increase context only after the basics are stable.
Good test prompts:
|
|
Or:
|
|
These prompts better show the value of a reasoning model with tool use.
One-sentence summary
Qwythos-9B-Claude-Mythos-5-1M is a 9B reasoning model aimed at engineering and research use cases.
Its appeal is not just model size, but this combination:
|
|
If you want to test long-context codebase analysis, multi-document research, or tool-verified agent workflows, it is worth trying. But do not get carried away by 1M context: start small, stabilize the deployment, then scale based on VRAM, KV cache, and backend capability.