Hermes + Qwen3.6: A Low-Cost Local Agent Deployment

A local deployment plan for Hermes Agent + Qwen3.6 GGUF: use WSL2, CUDA, and llama.cpp to start a local model service, then connect Hermes Agent to an OpenAI-compatible endpoint.

This article documents a local Agent deployment plan: run a Qwen3.6 GGUF model with llama.cpp inside WSL2, then connect Hermes Agent to the local OpenAI-compatible API. The result is a long-running local AI assistant on your own computer, with no per-token charges from an online service.

This setup suits users who want to try local AI Agents while keeping their data private and under their own control. It can handle daily Q&A, writing, coding assistance, document organization, and simple automation tasks. The larger the model, the more VRAM it needs: the example here uses Qwen3.6-27B, which runs more stably with 24GB of VRAM. If you have less, choose a smaller model or a lower quantization.

Architecture

The overall chain is simple:

  1. Install WSL2 and Ubuntu 24.04 on Windows.
  2. Install CUDA Toolkit inside WSL2 and compile llama.cpp.
  3. Download the Qwen3.6 GGUF model.
  4. Start a local model service with llama-server.
  5. Install Hermes Agent and configure it to http://localhost:8080/v1.
  6. Optional: write a startup script so the model service starts automatically when WSL2 opens.

Hermes provides the Agent capability, while Qwen3.6 provides the local LLM capability. Together, they turn the computer into a private local AI assistant.

Install WSL2 and Ubuntu

Run in an administrator Windows PowerShell window:

wsl --install
wsl --set-default-version 2

After rebooting, install Ubuntu 24.04:

wsl --install -d Ubuntu-24.04

After installation, Ubuntu prompts you to set a username and password. Once inside Ubuntu, first check whether the NVIDIA GPU is visible in WSL2:

nvidia-smi

If the GPU cannot be detected, update the NVIDIA driver on Windows first. WSL2 inherits the Windows driver, but CUDA Toolkit still needs to be installed separately inside WSL2.
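
On a standard WSL2 setup, the Windows driver exposes its libraries to Linux under /usr/lib/wsl/lib, so a quick way to confirm the driver side is wired up before installing anything else is:

# Driver-provided libraries mounted from Windows into WSL2
ls /usr/lib/wsl/lib/ | grep -i -E 'cuda|nvidia'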

Install Python and Basic Tools

sudo apt update && sudo apt install -y python3-pip python3-venv

You also need build tools, Git, and CMake:

sudo apt install -y cmake build-essential git

Compile llama.cpp

Clone the repository:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

If CUDA is already available in WSL2, compile directly:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build -j$(nproc)

CMAKE_CUDA_ARCHITECTURES=89 is suitable for Ada GPUs, such as RTX 40 series cards. Adjust it according to your actual GPU architecture.
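
If you are not sure which value to use, newer NVIDIA drivers can report the GPU's compute capability directly; drop the dot to get the CMake value (8.9 becomes 89). The compute_cap query field requires a reasonably recent driver:

# Prints e.g. "NVIDIA GeForce RTX 4090, 8.9" on an Ada card
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader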

If compilation reports that CUDA Toolkit is missing, install CUDA Toolkit inside WSL2 first:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-8

Configure environment variables:

export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
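
The export lines take effect in the current shell immediately, so you can verify the toolkit is on the PATH before rebuilding:

nvcc --version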

Then rebuild:

cd ~/llama.cpp
rm -rf build
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build -j$(nproc)

Download the Qwen3.6 GGUF Model

The example uses Qwen3.6-27B-UD-Q4_K_XL.gguf from unsloth/Qwen3.6-27B-GGUF:

hf download unsloth/Qwen3.6-27B-GGUF \
  Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --local-dir ~/models/

The file is about 17GB. If Hugging Face is slow, use a mirror such as ModelScope. Do not force a 27B model if your VRAM is insufficient; use a smaller model or lower quantization.
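
Once the download finishes, a quick size check is enough to confirm it completed; the exact size depends on the quantization you chose:

ls -lh ~/models/*.gguf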

Start the Local Model Service

Start llama-server with your own model file name:

~/llama.cpp/build/bin/llama-server \
  --model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --flash-attn on \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --presence-penalty 1.5 \
  --port 8080

After startup, open this in a Windows browser:

http://localhost:8080

For Hermes Agent or other OpenAI-compatible clients, the API endpoint is usually:

http://localhost:8080/v1
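
Before connecting any client, it is worth confirming that the OpenAI-compatible endpoint actually answers. A minimal curl test works; llama-server serves a single model per process, so the model field here is just a placeholder:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-local", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'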

Thinking Mode Tradeoff

Qwen3.6 may enable Thinking mode by default. It is suitable for complex reasoning, complicated coding problems, and multi-step analysis, but it is slower.

To disable Thinking mode, stop the service and add --chat-template-kwargs:

~/llama.cpp/build/bin/llama-server \
  --model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --flash-attn on \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --presence-penalty 1.5 \
  --chat-template-kwargs '{"enable_thinking":false}' \
  --port 8080

After disabling Thinking, simple Q&A, writing, code completion, and code explanation become faster. For complex algorithm design, difficult debugging, and architecture analysis, Thinking mode is still recommended.

Install Hermes Agent

Keep llama-server running, then open a new WSL2 terminal and install Hermes Agent:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

The installer handles dependencies such as Python, Node.js, ripgrep, and ffmpeg. When configuring the model endpoint, choose a custom endpoint:

URL: http://localhost:8080/v1
API Key: 12345678
Model: auto-detect

For a local llama-server, the API Key can be any placeholder value. After configuration, you can connect Telegram, WeChat, QQ, Discord, and other chat tools, allowing Hermes Agent to call the local model and execute tasks from those entry points.
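
Model auto-detection presumably works off the /v1/models listing that llama-server exposes; you can reproduce what the client sees with curl. The placeholder key is ignored unless the server was started with --api-key:

curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer 12345678"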

Auto-Start the Model Service

You can write a startup script so the model service starts automatically when a WSL2 terminal opens.

Create the script:

cat > ~/start-llm.sh << 'EOF'
#!/bin/bash
echo "Starting Qwen3.6-27B llama-server..."
~/llama.cpp/build/bin/llama-server \
  --model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --n-gpu-layers 99 \
  --ctx-size 65536 \
  --flash-attn on \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --presence-penalty 1.5 \
  --port 8080 \
  --host 0.0.0.0 &
echo "llama-server started, PID: $!"
echo "API: http://localhost:8080/v1"
echo "Chat UI: http://localhost:8080"
EOF
chmod +x ~/start-llm.sh

Write it into .bashrc:

echo '# Auto-start llama-server' >> ~/.bashrc
echo 'if ! pgrep -f "llama-server" > /dev/null 2>&1; then' >> ~/.bashrc
echo '    ~/start-llm.sh' >> ~/.bashrc
echo 'fi' >> ~/.bashrc

Each time you open a WSL2 terminal, it will start llama-server if it is not already running. If it is running, it skips startup and avoids duplicate processes.
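
If your Ubuntu instance runs systemd inside WSL2 (check with systemctl is-system-running; it can be enabled via /etc/wsl.conf), a user service is an alternative to the .bashrc hook. The unit below is only a sketch with assumed paths, launching llama-server directly instead of through the script:

mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/llama-server.service << 'EOF'
[Unit]
Description=Local llama-server for Hermes Agent

[Service]
# %h expands to the home directory; adjust the model path and flags to match your setup
ExecStart=%h/llama.cpp/build/bin/llama-server --model %h/models/Qwen3.6-27B-UD-Q4_K_XL.gguf --n-gpu-layers 99 --ctx-size 65536 --flash-attn on --port 8080 --host 0.0.0.0
Restart=on-failure

[Install]
WantedBy=default.target
EOF
systemctl --user daemon-reload
systemctl --user enable --now llama-server.service

Like the .bashrc approach, this only keeps the server alive while the WSL2 VM itself is running.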

Notes

  1. 27B models require substantial VRAM; 24GB VRAM is more stable. Use a smaller model if VRAM is limited.
  2. --ctx-size 65536 significantly increases VRAM and RAM pressure. If unstable, reduce it to 32768 or lower.
  3. Both CUDA Toolkit in WSL2 and the Windows GPU driver must work properly. Either side can cause CUDA compilation or runtime failures.
  4. Hermes Agent calls the local service through an OpenAI-compatible API. The key is that http://localhost:8080/v1 responds correctly.
  5. If accessing from a phone or another device, handle Windows Firewall, LAN addresses, and security isolation (see the port-forwarding sketch below). Do not expose the local model service directly to the public internet.
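
For LAN access, note that WSL2 sits behind NAT by default, so the usual pattern is to forward a Windows port to the WSL2 address and allow it through the firewall. The commands below are a sketch: run them in an administrator PowerShell window, replace the connect address with the output of wsl hostname -I, and keep the port restricted to a trusted LAN:

netsh interface portproxy add v4tov4 listenport=8080 listenaddress=0.0.0.0 connectport=8080 connectaddress=<WSL2-IP>
New-NetFirewallRule -DisplayName "llama-server 8080" -Direction Inbound -LocalPort 8080 -Protocol TCP -Action Allow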