Qwen3.6-35B-A3B jailbreak local deployment: uncensored GGUF, llama.cpp, and safety boundaries

Freedidi recently introduced a popular local model: Qwen3.6-35B-A3B Uncensored HauhauCS Aggressive. The original article describes it as a jailbreak or uncensored open model, and gives GGUF quantized files, a llama.cpp launch method, and ideas for connecting it to agents.

This kind of model is worth watching, but it should be understood calmly. The point is not only that it has fewer restrictions. It brings several important local AI capabilities together:

A 35B-class model with a MoE architecture.
GGUF quantization that can run on consumer GPUs.
An OpenAI-compatible local API through llama.cpp.
Multimodal vision input through mmproj.
Integration with local agent tools such as Hermes and OpenClaw.

If you care about local models, the more important trend is not the jailbreak label. It is that local models are moving from “can chat” toward “can use tools, understand images, and serve as agent backends.”

What this model is

The model name mentioned in the original article is:

1

Qwen3.6-35B-A3B Uncensored HauhauCS Aggressive

The name contains several key pieces:

Qwen3.6: based on the Qwen model family.
35B: around 35B total parameters.
A3B: roughly 3B active parameters per inference step, following a MoE-style design.
Uncensored / Aggressive: fewer safety restrictions or a more aggressive tuning style.
GGUF: a quantized format for local inference tools such as llama.cpp.

One important note: Uncensored does not mean more reliable. It usually means the model refuses less often, but it may also generate unconstrained, unverified, or risky content more easily. It can be useful for technical experiments, but it should not be connected directly to public services, production systems, or unattended workflows.

Why a 35B model can run locally

Many people see 35B and assume it requires a server or high-end multi-GPU machine. The key point in the original article is that this model uses a MoE architecture.

MoE can be understood simply: the model has many total parameters, but each inference step activates only part of the experts. The original article says it activates roughly 3B parameters per run, so with quantization it can have much lower speed and VRAM pressure than a traditional dense 35B model.

After GGUF quantization, it becomes possible to run it on consumer GPUs. The article says the smallest quantized version is around 11GB, and 6GB/8GB GPUs can try it, though at least 8GB VRAM is recommended.

A more realistic expectation:

6GB VRAM: possible with low-bit quantization, but reduce expectations for context length and speed.
8GB VRAM: better for entry-level testing with smaller quantization.
16GB VRAM: more comfortable for longer context and more GPU offload.
24GB VRAM: better for higher-quality quantizations such as Q4_K_M and Q4_K_P.

Whether a local model is usable is not only about whether it starts. Context length, generation speed, remaining VRAM, KV cache, multimodal mode, concurrency, and task type all matter.

How to read the quantization choices

The original article roughly recommends:

Q4_K_P: better for RTX 4090 or other 24GB VRAM machines.
Q4_K_M: more stable and higher quality.
IQ4_NL: strong compression while preserving quality as much as possible.
IQ2_M: for 6GB/8GB VRAM users.

Think of this as a trade-off between quality and resource usage:

Q4 quantizations are usually more stable, but use more VRAM.
IQ2 / IQ3 quantizations save resources, but may reduce answer quality, long-text stability, and detail handling.
If you only want to test agent calls and a local API, low quantization can help you get the flow running.
If you plan to write code, analyze images, or do complex reasoning for long periods, choose higher-quality quantization when possible.

Do not treat “it starts” as “it is good enough for long-term use.” Low-VRAM startup and stable task completion are different things.

llama.cpp deployment approach

The original article recommends llama.cpp because it supports Windows, Linux, macOS, and backends such as NVIDIA CUDA, AMD, Intel, Vulkan, and CPU.

A typical launch command looks like:

1
2
3
4
5
6
7
8
9


llama-server.exe ^
  -m "model-path.gguf" ^
  --mmproj "mmproj.gguf" ^
  -ngl 999 ^
  -c 131072 ^
  -n 8192 ^
  --host 127.0.0.1 ^
  --port 8080 ^
  --jinja

Several parameters are worth understanding:

-m: path to the main GGUF model.
--mmproj: multimodal projection file required for vision input.
-ngl: offload layers to GPU as much as possible, depending on VRAM and backend.
-c: context length; higher values use more memory and VRAM.
-n: maximum generated tokens per response.
--host 127.0.0.1: listen only locally, safer than exposing publicly.
--port 8080: local API port.
--jinja: important for newer Qwen chat templates; without it, formatting issues, repetition, or Chinese output problems may occur.

The easiest trap is context length. -c 131072 looks attractive, but long context significantly increases KV cache usage. On low-VRAM machines, start smaller and increase gradually.

How multimodal support works

The article says this build supports multimodal vision, including image analysis, screenshots, OCR, complex UI analysis, and code screenshots.

In llama.cpp, multimodal support usually requires both the main model and the matching mmproj file. If --mmproj is not loaded correctly, image upload may be unavailable or the model may not understand images correctly.

Useful local multimodal scenarios include:

Analyzing UI screenshots.
OCR on image text.
Reading code screenshots or error screenshots.
Providing visual input to local agents.
Processing private images without uploading them to the cloud.

But vision understanding is not strict OCR or a guaranteed source of truth. For invoices, contracts, IDs, medical images, and other high-risk material, human review is still required.

OpenAI-compatible API

llama-server in llama.cpp can expose a local interface similar to the OpenAI API. The local base URL from the original article is:

1

http://127.0.0.1:8080/v1

This means many tools that support custom OpenAI-compatible providers can send requests to the local model. The API key can often be any placeholder value, depending on whether the client enforces validation.

This is useful because:

No cloud API key is needed.
There is no per-token billing.
Data can remain on the local machine.
It can connect to local agents, coding assistants, or chat frontends.
It can be used as a local OpenAI API replacement for experiments.

Do not expose the local API directly to the public internet. Even when the model runs locally, an open API can be abused, consume machine resources, or produce content you did not intend to generate.

Why Hermes and OpenClaw matter

The original article says the value becomes clearer when connecting this local model to Hermes or OpenClaw.

The meaning is that the model itself is only the inference core. Agent tools connect it to real tasks, such as:

Writing code.
Calling tools.
Reading files.
Analyzing images.
Searching the web.
Executing multi-step tasks.
Maintaining long-context workflows.

A local model used only for chat has limited value. If it can act as a stable agent backend, it becomes closer to a local AI workstation.

However, connecting an uncensored model to an agent requires extra caution. When the agent can operate files, run commands, visit web pages, and call tools, model output turns into real actions. The fewer restrictions the model has, the more important external permissions, human confirmation, and audit logs become.

Safety boundaries for uncensored models

The main selling point of these models is often that they refuse less. But fewer refusals also mean higher risk.

Keep in mind:

It may more easily produce illegal, dangerous, or misleading content.
It may not actively remind you of safety boundaries.
It may give overconfident advice on high-risk topics.
It may be induced by prompts to perform inappropriate tasks.
It is not suitable for direct public exposure.

A safer approach:

Test only on a local machine or controlled LAN.
Do not connect it to high-privilege tools.
Do not let it automatically delete, pay, publish, or bulk-submit.
Put file, command, network, and browser permission boundaries around agent tools.
Keep human review for high-risk outputs.

The freer the model is, the more external system constraints it needs.

Who should try it

This kind of model fits users who:

Want to study local LLM deployment.
Have at least 8GB VRAM and are willing to tune GGUF and llama.cpp.
Want to connect local models to OpenAI-compatible clients.
Care about local multimodal input, screenshot analysis, and agent backends.
Want to process some private data offline.

It is less suitable for:

Beginners who do not want to tune parameters.
Services that require stable production SLA.
Teams with strict security and compliance requirements.
Business workflows that require strict factual reliability.
People who want to expose the model directly to external users.

Conclusion

Models like Qwen3.6-35B-A3B Uncensored HauhauCS Aggressive show that local AI capabilities are moving quickly. Consumer GPUs can run larger models, GGUF quantization lowers deployment barriers, llama.cpp gives local models OpenAI-compatible APIs, and multimodal plus agent tools push them from chat toward task execution.

But it should not be understood only as a jailbreak model. The more valuable angle is that local AI is becoming composable infrastructure. The final experience depends on the model, inference engine, API server, frontend, agent tools, and permission controls together.

If you try it, start with low-risk local testing: choose an appropriate quantization, reduce context length, verify --jinja and --mmproj, then connect a client. After it is stable, consider connecting agent workflows.

References:

Freedidi article: https://www.freedidi.com/24284.html
llama.cpp GitHub: https://github.com/ggml-org/llama.cpp