How to Use Qwythos-9B: vLLM, SGLang, and Transformers Deployment Guide

Qwythos-9B-Claude-Mythos-5-1M is a 9B reasoning model released by Empero AI. Based on Qwen3.5-9B, it focuses on 1M context, native tool calling, long-text reasoning, and an Apache-2.0 license. This guide covers model features, vLLM/SGLang deployment, sampling settings, and usage caveats.

The model is published on Hugging Face.

Model page: empero-ai/Qwythos-9B-Claude-Mythos-5-1M

Its most notable points are straightforward:

  • Based on Qwen3.5-9B;
  • 9B parameter scale;
  • Apache-2.0 license;
  • Default 1,048,576 token context;
  • Supports Qwen3.5-style function calling;
  • Built for long-text reasoning, tool use, and agentic workflows;
  • The model card provides vLLM, SGLang, and Transformers examples.

If you are looking for a relatively small open-weight model with long context and tool calling, Qwythos-9B is worth a look.

Who it is for

Qwythos-9B is not quite a normal chat model.

It is more suitable for:

  • Long document analysis;
  • Reading multi-file codebases;
  • Long agent workflows;
  • Tool-assisted question answering;
  • Tasks that need a Python executor or search tool for verification;
  • Research, reasoning, math, code, and technical document processing;
  • Testing 1M context in local or private deployments.

It is less suitable if you:

  • Only want lightweight chatting;
  • Do not have GPU resources;
  • Do not want to handle <think> reasoning blocks;
  • Want an out-of-the-box consumer chat experience;
  • Do not have application-level safety controls.

The model card explicitly describes it as a reasoning model. Each response begins with a <think> reasoning block, followed by the final answer. If you connect it to a user-facing product, you need to strip or hide that block yourself.

Core model information

According to the Hugging Face model card, the basics are:

Item        | Information
Model name  | empero-ai/Qwythos-9B-Claude-Mythos-5-1M
Publisher   | Empero AI
Base model  | Qwen/Qwen3.5-9B
Size        | 9B
Format      | Safetensors
License     | Apache-2.0
Context     | 1,048,576 tokens
Features    | reasoning, function calling, long-context, agentic

It is not just prompt wrapping. It is a full-parameter fine-tune. The model card says training data includes more than 500M tokens of Claude Mythos / Claude Fable traces, plus chain-of-thought data generated by Empero AI’s internal rethink tool.

The question is not simply whether it can chat. The interesting part is whether it can reason over complex context, call tools, and correct itself.

What 1M context means

The most eye-catching capability in the model card is its default YaRN rope scaling, extending context to:

1,048,576 tokens

That is roughly 1M tokens.

The configuration includes parameters like:

"rope_parameters": {
  "rope_type": "yarn",
  "factor": 4.0,
  "original_max_position_embeddings": 262144,
  "mrope_interleaved": true,
  "mrope_section": [11, 11, 10],
  "rope_theta": 10000000
},
"max_position_embeddings": 1048576
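The 1M figure can be checked directly from the two values in this configuration: in this setup, max_position_embeddings equals the YaRN factor times the original position limit. A quick sanity check, using only the numbers shown above:

```python
# Values copied from the rope_parameters block above.
factor = 4.0
original_max_position_embeddings = 262_144

# In this configuration the extended window is factor x the original limit.
extended_context = int(factor * original_max_position_embeddings)

print(f"{extended_context:,}")
assert extended_context == 1_048_576
```

This also shows that the base window before scaling is 262,144 tokens, which is why contexts up to about 256k are a natural first target.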

This is attractive for:

  • Putting a large codebase directly into context;
  • Processing 10 to 20 papers plus notes;
  • Keeping tool outputs in long-running agent tasks;
  • Cross-document analysis;
  • Reasoning over long tracebacks, logs, or API responses.

But be realistic: 1M context does not mean any consumer GPU can comfortably run the full window.

The model card also notes that the full 1M window is better suited to tensor-parallel multi-GPU setups or aggressive KV-cache offload. A single high-end GPU may be more practical at 256k to 512k, depending on backend, quantization, KV cache, and VRAM.

Deploying with vLLM

If you are used to OpenAI-compatible APIs, vLLM is one of the most direct options.

Install:

pip install vllm

Start the model:

vllm serve "empero-ai/Qwythos-9B-Claude-Mythos-5-1M"

To explicitly request a near-1M context, follow the model card:

vllm serve empero-ai/Qwythos-9B-Claude-Mythos-5-1M --max-model-len 1010000

Call the API:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "empero-ai/Qwythos-9B-Claude-Mythos-5-1M",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'

If startup fails with insufficient VRAM, do not start at 1M. First test smaller --max-model-len values such as 32k, 64k, or 128k, then increase gradually.

Deploying with SGLang

SGLang is also covered in the model card.

Install:

pip install sglang

Start:

python3 -m sglang.launch_server \
  --model-path "empero-ai/Qwythos-9B-Claude-Mythos-5-1M" \
  --host 0.0.0.0 \
  --port 30000

To try long context:

SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server \
  --model-path empero-ai/Qwythos-9B-Claude-Mythos-5-1M \
  --context-length 1010000

Call it:

curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "empero-ai/Qwythos-9B-Claude-Mythos-5-1M",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'

The model card also includes a Docker example:

docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "empero-ai/Qwythos-9B-Claude-Mythos-5-1M" \
    --host 0.0.0.0 \
    --port 30000

Before deployment, set HF_TOKEN if your environment needs to download gated or private repositories from Hugging Face. For a public model it can usually be omitted.

Loading with Transformers

The model card’s text-only example uses AutoModelForImageTextToText and AutoTokenizer.

The structure is roughly:

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model_id = "empero-ai/Qwythos-9B-Claude-Mythos-5-1M"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Explain how tool calling helps a reasoning model verify exact facts."
    }
]

text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=16384,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.05,
)

print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Two details matter:

  1. The response includes <think>...</think>;
  2. The model card recommends giving enough max_new_tokens, such as 16384.

For product output, you usually need post-processing to remove the <think> part and show only the final answer.
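A minimal sketch of that post-processing step, assuming the reasoning is delimited by literal <think> and </think> tags in the decoded text. The function name strip_think is illustrative, not part of any library. It also handles two edge cases that are assumptions about how output may arrive: the opening tag being absent because the chat template already emitted it, and generation being cut off before the closing tag.

```python
def strip_think(text: str) -> str:
    """Return only the final answer, dropping the reasoning block."""
    # Normal case: keep everything after the last closing tag. This also
    # covers output where the opening tag was part of the prompt.
    if "</think>" in text:
        return text.rsplit("</think>", 1)[1].strip()
    # Truncated case: reasoning started but never finished, so there is
    # no final answer yet; keep only what came before the opening tag.
    if "<think>" in text:
        return text.split("<think>", 1)[0].strip()
    # No reasoning block at all.
    return text.strip()


print(strip_think("<think>France's capital is Paris.</think>\n\nParis."))
```

An empty return value is a useful signal that max_new_tokens was too small for the model to finish reasoning.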

The model card recommends:

gen_kwargs = dict(
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.05,
    max_new_tokens=16384,
)

Do not start with greedy decoding, and do not push temperature too low.

The model card notes that greedy decoding or very low temperature (T <= 0.3) can make this kind of reasoning model fall into repetition loops. Using the recommended parameters is usually safer.
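When calling an OpenAI-compatible endpoint instead of Transformers, the same settings have to go into the request body. The sketch below builds such a body; build_chat_payload is a hypothetical helper, max_tokens is the OpenAI-style counterpart of max_new_tokens, and top_k and repetition_penalty are not standard OpenAI fields. vLLM accepts them as extra sampling parameters to my understanding, but check your backend's documentation before relying on them.

```python
import json

MODEL_ID = "empero-ai/Qwythos-9B-Claude-Mythos-5-1M"


def build_chat_payload(prompt: str, temperature: float = 0.6) -> dict:
    """Build a chat-completions request body with the recommended sampling."""
    # The model card warns about repetition loops at T <= 0.3.
    if temperature <= 0.3:
        raise ValueError("temperature <= 0.3 risks repetition loops")
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": 0.95,
        "max_tokens": 16384,
        # Non-standard fields: assumed to be accepted by the backend.
        "top_k": 20,
        "repetition_penalty": 1.05,
    }


payload = build_chat_payload("What is the capital of France?")
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent as the --data argument of the curl calls shown earlier. Rejecting low temperatures in one place keeps a stray configuration value from silently triggering the loops described above.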

Understanding tool calling

Qwythos-9B supports Qwen3.5-style function calling.

The model card explains that you can pass tools=[...] to the chat template, and the model can output a Qwen3.5-compatible <tool_call> block.

A simplified tool definition looks like:

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "python_executor",
            "description": "Execute Python code and return stdout.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current facts and citations.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "max_results": {"type": "integer"}
                },
                "required": ["query"]
            }
        }
    }
]

The model generates something like a <tool_call> block. Your application must parse it, execute the tool, and feed the result back to the model.

In other words, Qwythos-9B does not magically browse the web by itself.

You must provide the tool runtime.
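A minimal sketch of the "execute and feed back" half of that loop. It assumes the <tool_call> block has already been parsed into a tool name and an arguments dict, either by the serving backend or by your own parser, since the exact block syntax should be taken from the model's chat template rather than guessed. REGISTRY, run_tool_call, and fake_python_executor are hypothetical names, and returning the result as a "tool" role message is an assumption about the chat template's conventions.

```python
import json


def fake_python_executor(code: str) -> str:
    # Hypothetical stand-in: a real executor must run code in a sandbox.
    return f"(would execute {len(code)} characters of Python)"


# Allowlist: tool name -> (callable, required argument names).
REGISTRY = {
    "python_executor": (fake_python_executor, ["code"]),
}


def run_tool_call(name: str, arguments: dict) -> dict:
    """Execute one already-parsed tool call and build the reply message."""
    if name not in REGISTRY:
        content = json.dumps({"error": f"unknown tool: {name}"})
    else:
        func, required = REGISTRY[name]
        missing = [key for key in required if key not in arguments]
        if missing:
            content = json.dumps({"error": f"missing arguments: {missing}"})
        else:
            content = json.dumps({"result": func(**arguments)})
    # The result goes back to the model as a tool-role message.
    return {"role": "tool", "content": content}


print(run_tool_call("python_executor", {"code": "print(1)"}))
```

Returning errors as tool results, rather than raising, lets the model see what went wrong and retry with a corrected call. The registry doubles as an allowlist: a tool the model names but you did not register is never executed.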

Realistic hardware expectations

9B parameters may sound modest, but 1M context creates pressure on another dimension.

For deployment, look separately at:

  • Model weight VRAM;
  • KV cache;
  • Context length;
  • Batch size;
  • Concurrency;
  • Whether quantization is used;
  • Whether KV-cache offload is used;
  • Backend choice: vLLM, SGLang, or Transformers.
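To see why context length dominates, a rough KV-cache estimate helps. The sketch below uses the standard formula for attention layers: two tensors (key and value) per layer, each of size kv_heads x head_dim per token. The architecture values passed in are placeholders chosen for easy arithmetic, not Qwythos-9B's actual configuration; read the real values from the model's config.json, and count only the layers that use standard attention.

```python
def kv_cache_gib(tokens, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size in GiB for one sequence."""
    # Factor 2: one key and one value tensor per attention layer.
    per_token_bytes = 2 * attn_layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 2**30


# Placeholder architecture values, NOT read from the model's config.
for tokens in (32_768, 131_072, 1_048_576):
    size = kv_cache_gib(tokens, attn_layers=8, kv_heads=4, head_dim=256)
    print(f"{tokens:>9,} tokens -> {size:.1f} GiB")
```

The estimate grows linearly with tokens and again with the number of concurrent sequences, so a window that fits for one request may not fit for several.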

If you only want to test the model, begin with smaller context:

32k -> 64k -> 128k -> 256k

Only increase after the setup is stable.

Do not treat 1M context as something you must use every time. A more practical approach is to increase context only for codebase analysis, multi-paper summarization, long agent traces, and similar tasks.

Limits and safety boundaries

Several model-card limitations deserve attention:

  • It is a reasoning model and outputs <think>;
  • Low temperature or greedy decoding may cause repetition loops;
  • Concrete identifiers, CVEs, drug labels, exact numbers, and similar facts still need tool or retrieval verification;
  • The model is uncensored and may not readily refuse complex technical requests;
  • Vision capability is inherited from the base model, but this fine-tune is text-only, so visual behavior was not the focus of training or evaluation.

If you use it in a user-facing app, add:

  • Output filtering;
  • Safety policy;
  • Tool-call allowlists;
  • Rate limiting;
  • Logging and audit;
  • Human review for high-risk domains;
  • Retrieval or tool-based verification.

For cybersecurity, medical, pharmacology, financial, and legal scenarios, do not treat model output as final truth. It can assist reasoning, but final decisions should rely on reliable sources, tool results, or human review.

How to test it

For the first test, do not start with ultra-long context.

Try this order:

  1. Run a short Q&A through Transformers or vLLM;
  2. Use the recommended sampling settings;
  3. Observe the <think> and final-answer format;
  4. Test a longer document summary;
  5. Add a Python executor;
  6. Then try web_search or RAG;
  7. Increase context only after the basics are stable.

Good test prompts:

Please read the code below, identify possible boundary-condition problems, and suggest minimal fixes.

Or:

First list the key facts you need to verify, then explain which can be confirmed with tools and which need human review.

These prompts better show the value of a reasoning model with tool use.

One-sentence summary

Qwythos-9B-Claude-Mythos-5-1M is a 9B reasoning model aimed at engineering and research use cases.

Its appeal is not just model size, but this combination:

Qwen3.5-9B base + 1M context + function calling + reasoning training + Apache-2.0

If you want to test long-context codebase analysis, multi-document research, or tool-verified agent workflows, it is worth trying. But do not get carried away by 1M context: start small, stabilize the deployment, then scale based on VRAM, KV cache, and backend capability.
