Deploying Holo 3.1 as a Local Agent: Connecting llama.cpp to OpenClaw

A practical Holo 3.1 local agent setup: download llama.cpp, choose a GGUF model, start a local OpenAI-compatible server, then configure API Base URL and browser automation skills in OpenClaw.

Holo 3.1 is a local computer-use Agent model family released by H Company. It is positioned as a vision-language model for operating computers. According to the official model card, Holo3.1 supports web, desktop, and mobile environments, offers sizes such as 0.8B, 4B, 9B, and 35B-A3B, and provides quantized versions suitable for local deployment.

It is suitable for users who want to run an AI Agent on their own machine: no cloud API, no token-based billing, and more control over browser automation, desktop actions, and local file workflows.

The following is a direct local setup flow: use llama.cpp to start an OpenAI-compatible service for Holo 3.1, then point OpenClaw to the local address.

Requirements

Prepare the following:

  • A Windows, macOS, or Linux computer.
  • A discrete GPU with enough VRAM, or an Apple Silicon Mac.
  • llama-server from llama.cpp.
  • The main Holo 3.1 GGUF model file and the vision mmproj file.
  • OpenClaw.

Choose the model size based on your hardware:

Hardware Recommended model
RTX 4090 / RTX 3090 24GB 35B-A3B Q4_K_M
RTX 5070 Ti / RTX 4060 Ti 16GB 9B
Apple Silicon 9B GGUF
12GB VRAM 4B
8GB VRAM 0.8B

If you only want to try browser automation and simple desktop tasks, 9B is easier to run. 35B-A3B is better suited to machines with 24GB VRAM or more, but it also consumes more context, VRAM, and loading time.

1. Download llama.cpp

You can download a prebuilt version from llama.cpp releases, or build it yourself. Windows users can download and extract it, then confirm that the directory contains:

1
llama-server.exe

Then create this folder under the llama.cpp directory:

1
models

Put the Holo 3.1 main model and mmproj file into this folder.

2. Download the Holo 3.1 Model

The official Hugging Face organization for Holo 3.1 is Hcompany. If you use llama.cpp, choose the GGUF format.

For 35B-A3B, download:

  • The main model, such as a Q4_K_M quantized GGUF.
  • The corresponding vision projection model, such as mmproj.f16.gguf.

After placing the files, the structure can look like this:

1
2
3
4
5
llama.cpp/
  llama-server.exe
  models/
    q4_k_m.gguf
    mmproj.f16.gguf

You can customize the file names, but the paths in the startup script must match.

3. Start the Local Holo 3.1 Service

The following is a Windows batch script example. Save it as start-holo31.bat and place it in the same directory as llama-server.exe.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
@echo off
chcp 65001 >nul
title Holo 3.1 VLM Launcher

set LLAMA=llama-server.exe

:MENU
cls
echo ==========================================
echo          Holo 3.1 VLM Launcher
echo ==========================================
echo.
echo 1. 8GB GPU  (0.8B)
echo 2. 12GB GPU (4B)
echo 3. 16GB GPU (9B)
echo 4. 24GB GPU (35B-A3B)
echo 5. CPU mode (4B)
echo 0. Exit
echo.
set /p CHOICE=Choose:

if "%CHOICE%"=="1" goto GPU8
if "%CHOICE%"=="2" goto GPU12
if "%CHOICE%"=="3" goto GPU16
if "%CHOICE%"=="4" goto GPU24
if "%CHOICE%"=="5" goto CPU
if "%CHOICE%"=="0" exit
goto MENU

:GPU8
"%LLAMA%" ^
-m models\holo-0.8b.gguf ^
--mmproj models\holo-0.8b-mmproj.gguf ^
-ngl 999 ^
-c 8192 ^
-fa ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--temp 0.2 ^
--top-p 0.9 ^
--host 127.0.0.1 ^
--port 1234
pause
goto MENU

:GPU12
"%LLAMA%" ^
-m models\holo-4b.gguf ^
--mmproj models\holo-4b-mmproj.gguf ^
-ngl 999 ^
-c 16384 ^
-fa ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--temp 0.2 ^
--top-p 0.9 ^
--host 127.0.0.1 ^
--port 1234
pause
goto MENU

:GPU16
"%LLAMA%" ^
-m models\holo-9b.gguf ^
--mmproj models\holo-9b-mmproj.gguf ^
-ngl 999 ^
-c 24576 ^
-fa ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--temp 0.2 ^
--top-p 0.9 ^
--host 127.0.0.1 ^
--port 1234
pause
goto MENU

:GPU24
"%LLAMA%" ^
-m models\q4_k_m.gguf ^
--mmproj models\mmproj.f16.gguf ^
-ngl 999 ^
-c 65536 ^
--flash-attn on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--temp 0.2 ^
--top-p 0.9 ^
--repeat-penalty 1.05 ^
--host 127.0.0.1 ^
--port 1234
pause
goto MENU

:CPU
"%LLAMA%" ^
-m models\holo-4b.gguf ^
--mmproj models\holo-4b-mmproj.gguf ^
-ngl 0 ^
-c 4096 ^
--threads 16 ^
--temp 0.2 ^
--host 127.0.0.1 ^
--port 1234
pause
goto MENU

Run the script and select the tier that matches your VRAM. If startup succeeds, llama-server will expose a local OpenAI-compatible API:

1
http://127.0.0.1:1234/v1

If startup fails, check these three things first:

  • Whether the model file names match the script.
  • Whether the mmproj file exists.
  • Whether your VRAM is enough for the selected model and context length.

4. Install OpenClaw

On Windows, open PowerShell as administrator and run:

1
powershell -c "irm https://openclaw.ai/install.ps1 | iex"

On macOS / Linux, run:

1
curl -fsSL https://openclaw.ai/install.sh | bash

After installation, open OpenClaw settings and configure the model provider as a local OpenAI-compatible service:

1
2
API Base URL: http://127.0.0.1:1234/v1
API Key: leave empty or enter any placeholder value

You can choose browser startup mode. After entering the OpenClaw visual interface, you should see the local model loaded at the bottom.

If there is a thinking mode switch in the interface, turn it off first. In computer-use Agent scenarios like Holo 3.1, action planning and UI execution matter more; enabling extra thinking may noticeably slow responses.

5. Install Browser Automation Skills

To help OpenClaw operate the browser better, install two common skills:

1
2
openclaw skills install agent-browser-cli
openclaw skills install use-my-browser

After installation, restart OpenClaw gateway:

1
openclaw gateway

You can also enter this in the OpenClaw chat box:

1
/new

This starts a new session and reloads capabilities.

6. Test a Simple Task

Start with a low-risk task:

1
Open the browser, search for the official Holo 3.1 model page, and summarize the model sizes and deployment methods it supports.

The key thing to observe is not whether the answer looks polished, but whether:

  • It can open the browser correctly.
  • It can recognize page content.
  • It can continuously search, click, read, and summarize.
  • It gets stuck or repeats actions frequently.
  • The local model response speed is acceptable.

If browser actions work normally, try more complex tasks such as organizing materials, comparing model pages, generating Markdown summaries, or analyzing web tables.

Usage Notes

The advantages of a local Agent are low cost, clear privacy boundaries, and no cloud token bill. But it also has practical limits:

  • Small models are suitable for lightweight browser tasks, not hard reasoning.
  • The vision model is critical for UI recognition; do not download only the main model.
  • Very large context settings can consume a lot of VRAM, so start with conservative parameters.
  • Automation can misclick. Do not start by letting it handle payments, deletion, production systems, or other high-risk tasks.
  • A local model is not automatically safe. Browser permissions, file permissions, and command execution permissions still need control.

For everyday web material organization, lightweight automation, and local experiments, Holo 3.1 + llama.cpp + OpenClaw is worth trying. Its key value is not the slogan “free unlimited tokens,” but keeping the Agent runtime, model, and data flow as local as possible.

References

记录并分享
Built with Hugo
Theme Stack designed by Jimmy