MinerU 3.4 officially supports five backend names in its CLI:
- pipeline
- hybrid-engine
- vlm-engine
- hybrid-http-client
- vlm-http-client
The default backend is hybrid-engine, and Hybrid uses --effort medium by default. The confusing part is not the command syntax, but where the model actually runs, whether your local GPU is required, and which type of PDF each mode is best for.
Here is the short version: use pipeline for normal digital PDFs and batch jobs; use hybrid-engine --effort medium for the best overall local quality; try vlm-engine separately for difficult scanned files; only consider the two HTTP Client modes if the model is deployed on another GPU server.
Quick comparison of the five modes
| Backend | Compute location | Core method | Local GPU | Characteristics |
|---|---|---|---|---|
| pipeline | Local machine | Multiple specialized models such as OCR, layout analysis, and formula recognition | Optional | Best compatibility, stable, almost no hallucination |
| hybrid-engine | Local machine | Native text extraction + VLM + Pipeline | Required, about 8GB minimum | Best overall accuracy, suitable for most high-quality parsing |
| vlm-engine | Local machine | Mainly lets a vision-language model understand the whole page | Required, about 8GB minimum | Good for complex scans, tables, formulas, and unusual layouts |
| hybrid-http-client | Small local models + remote VLM | Hybrid, but the large model runs on a server | Local GPU can be avoided | Suitable when you already have a remote GPU server |
| vlm-http-client | Remote server | VLM runs entirely on the server | No local GPU required | Local machine only uploads files and receives results |
HTTP Client is not a “local mode that saves VRAM.” It is a remote deployment mode. Your local machine may avoid running the large model, but the remote server still has to perform VLM inference.
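As a rough illustration, the comparison can be written down as data and filtered by what hardware you actually have. This is only a sketch: the `Backend` class, its field names, and `usable_backends` are hypothetical helpers invented here, not part of MinerU. The values simply restate the comparison above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    local_gpu_required: bool    # True when the mode cannot run without a local GPU
    needs_remote_server: bool   # True when a separate inference server must exist


# Values restate the comparison in this article; nothing here is read from MinerU.
BACKENDS = [
    Backend("pipeline", local_gpu_required=False, needs_remote_server=False),
    Backend("hybrid-engine", local_gpu_required=True, needs_remote_server=False),
    Backend("vlm-engine", local_gpu_required=True, needs_remote_server=False),
    Backend("hybrid-http-client", local_gpu_required=False, needs_remote_server=True),
    Backend("vlm-http-client", local_gpu_required=False, needs_remote_server=True),
]


def usable_backends(has_local_gpu, has_remote_server):
    """Return the backend names that fit the available hardware."""
    names = []
    for backend in BACKENDS:
        if backend.local_gpu_required and not has_local_gpu:
            continue
        if backend.needs_remote_server and not has_remote_server:
            continue
        names.append(backend.name)
    return names


if __name__ == "__main__":
    # A single machine with a GPU and no remote server.
    print(usable_backends(has_local_gpu=True, has_remote_server=False))
```

With a local GPU and no remote server, only the three local modes remain, which is why the two HTTP Client modes do not help a single-machine user.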
pipeline: stable, light on VRAM, good for batch jobs
Command: run MinerU with the backend set to pipeline.
pipeline does not send the whole page to one large model. It combines multiple specialized modules:
- Native PDF text extraction.
- OCR.
- Layout detection.
- Table recognition.
- Formula recognition.
- Reading-order reconstruction.
Its strengths are stability and low resource requirements. It can run on CPU only, and it can also use an NVIDIA GPU for acceleration. The official description emphasizes that it is fast, stable, and hallucination-free. The benchmark table lists an overall accuracy of about 86.47, and GPU mode needs about 4GB of VRAM at minimum.
pipeline is suitable for:
- Normal digital PDFs.
- Large batch processing jobs.
- Text-heavy documents.
- Scenarios where you do not want the model to guess content.
- 8GB GPUs where stability matters more than maximum accuracy.
If you use an RTX 4060 8GB, this is usually the safest local GPU mode. It is also a good first step for checking whether your CUDA environment works.
vlm-engine: let the vision-language model read the whole page
Command: run MinerU with the backend set to vlm-engine.
vlm-engine mainly uses MinerU’s vision-language model to understand each page as an image. It identifies titles, body text, table structures, formulas, reading order, and relationships between complex layout blocks.
Its accuracy in the benchmark table is about 95.30, well above pipeline's 86.47. However, local execution requires about 8GB of VRAM at minimum, and CPU-only mode is not supported.
vlm-engine is suitable for:
- Scanned papers.
- Complex multi-column layouts.
- Tables with irregular borders.
- Formula-heavy pages.
- Handwritten or unusual layouts.
- Files where pipeline performs poorly.
The downside is higher VRAM pressure. Compared with hybrid-engine, it also lacks the combined benefit of first extracting native PDF text and then using VLM for difficult areas, so it is not always the best default mode.
hybrid-engine: Pipeline and VLM combined
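Command: run MinerU with the backend set to hybrid-engine, or leave the backend unset, since hybrid-engine is the default.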
hybrid-engine combines two approaches:
- For digital PDFs, it tries to extract native text directly.
- For scanned content, complex tables, formulas, and unusual layouts, it calls the VLM.
- It then uses parts of Pipeline for auxiliary processing.
This gives it VLM-level accuracy, the reliability of native text extraction, lower hallucination risk, and better support for multilingual digital PDFs. Officially, it is positioned as a high-accuracy, native-text-extraction, low-hallucination mode, and it is the current recommended local default.
Hybrid has two common effort levels.
Medium (--effort medium):
Its accuracy in the benchmark table is about 95.26. It is faster and suits most documents. Medium is the current default, but it automatically disables image and chart analysis.
High (--effort high):
Its accuracy in the benchmark table is about 95.39. It supports image and chart analysis, but processing is slower. In the official data, Medium is only about 0.13 points lower than High, and it can be noticeably faster in some Windows setups.
If your GPU is an RTX 4060 8GB, hybrid-engine --effort medium is the preferred high-quality local mode. Before running it, close games, turn off browser hardware acceleration, and quit other programs that occupy VRAM, because 8GB is the lower end of the requirement.
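One way to see how much VRAM other programs are already holding is to ask `nvidia-smi` before starting a run. The sketch below assumes `nvidia-smi` is on the PATH. The `max_used_mib=1024` limit is an arbitrary illustrative value, not a MinerU requirement, and the helper names are hypothetical.

```python
import shutil
import subprocess

# Standard nvidia-smi query fields; with "nounits" the values are plain MiB numbers.
SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_vram_mib(smi_output):
    """Turn lines such as '8188, 1530' into (total_mib, used_mib) tuples."""
    rows = []
    for line in smi_output.strip().splitlines():
        total, used = (int(field.strip()) for field in line.split(","))
        rows.append((total, used))
    return rows


def busy_gpus(rows, max_used_mib=1024):
    """Return indexes of GPUs whose used VRAM exceeds the illustrative limit."""
    return [index for index, (_, used) in enumerate(rows) if used > max_used_mib]


def check_local_gpus():
    """Query nvidia-smi when it is installed; return None when it is missing."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(SMI_QUERY, capture_output=True, text=True, check=True)
    return busy_gpus(parse_vram_mib(result.stdout))


if __name__ == "__main__":
    busy = check_local_gpus()
    if busy is None:
        print("nvidia-smi not found; cannot inspect VRAM usage")
    elif busy:
        print(f"GPUs with significant VRAM already in use: {busy}")
    else:
        print("VRAM looks mostly free")
```

Keeping the parsing separate from the subprocess call means the logic can be checked on machines without an NVIDIA GPU.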
vlm-http-client: the local machine does not run the model
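Example: run MinerU with the backend set to vlm-http-client and point it at the address of your inference server.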
In this mode, your computer is only a client: it uploads the file to the server and receives the parsed results.
The actual VLM runs on another GPU machine, a Linux GPU server, a LAN server, or an OpenAI API-compatible inference service. Therefore, the local machine does not need an NVIDIA GPU and can even use a lightweight MinerU installation. The official docs also describe vlm-http-client as suitable for edge devices with only CPU and network access.
The important detail: “no local GPU required” does not mean the whole system needs no GPU. The remote server still performs VLM inference.
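The dependency on a server can be made explicit in a small command builder that refuses to produce an HTTP Client command without a server address. Treat this as a sketch under assumptions: the `-p`, `-o`, `-b`, and `-u` flags follow the MinerU CLI as documented for earlier releases and should be checked against `mineru --help` for your version. `build_mineru_command` and the example URL are hypothetical.

```python
ALL_BACKENDS = {
    "pipeline",
    "hybrid-engine",
    "vlm-engine",
    "hybrid-http-client",
    "vlm-http-client",
}
HTTP_CLIENT_BACKENDS = {"hybrid-http-client", "vlm-http-client"}


def build_mineru_command(input_path, output_dir, backend, server_url=None):
    """Build an argument list for the mineru CLI (flag names are assumptions)."""
    if backend not in ALL_BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    if backend in HTTP_CLIENT_BACKENDS and not server_url:
        # The client modes are useless without a server that runs the VLM.
        raise ValueError(f"{backend} needs the address of an inference server")

    command = ["mineru", "-p", input_path, "-o", output_dir, "-b", backend]
    if backend in HTTP_CLIENT_BACKENDS:
        command += ["-u", server_url]
    return command


if __name__ == "__main__":
    # Hypothetical server address, for illustration only.
    print(build_mineru_command(
        "paper.pdf", "out", "vlm-http-client",
        server_url="http://gpu-server.example:30000",
    ))
```

The returned list can be passed to `subprocess.run` once the flags have been verified for your installation.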
hybrid-http-client: split work between local machine and server
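Command: run MinerU with the backend set to hybrid-http-client and point it at the address of your inference server.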
hybrid-http-client is not the same as vlm-http-client. It usually works like this:
- The local machine handles PDF text extraction and some small-model tasks.
- The remote server handles VLM inference.
- MinerU combines the results.
So the local machine can run on CPU only. If it has a GPU, the local auxiliary steps can be faster. The official recommendation is to install mineru[pipeline] on the client. The roughly 2GB minimum VRAM listed in the table mainly refers to optional local GPU acceleration for the small Hybrid client-side models. It does not mean the remote VLM server only needs 2GB.
Why HTTP Client and Engine have the same accuracy
The official table lists the same accuracy for hybrid-engine and hybrid-http-client.
The reason is that both modes use basically the same parsing logic and models. The main difference is where the model runs:
- hybrid-engine: the model runs on your local GPU.
- hybrid-http-client: the model runs on a remote server.
So HTTP Client is not a lower-accuracy edition. It is the remote deployment edition. It is useful for teams that already have a GPU server, not for single-machine users trying to casually save VRAM.
How to choose with an RTX 4060 8GB
If your GPU is an RTX 4060 8GB, choose in this order.
For daily stable use: pipeline
It has low VRAM pressure, is good for checking CUDA, and works well for batch processing normal PDFs.
For the best overall local quality: hybrid-engine --effort medium
This is the preferred high-quality mode on an 8GB GPU. Try to free VRAM before running it.
For image analysis or maximum accuracy: hybrid-engine --effort high
It is slower, but it enables image and chart analysis.
For difficult scanned layouts where results are still poor: vlm-engine
You can compare it with Hybrid results, but it usually does not need to be your permanent default.
If you do not have a remote server, you do not need to consider hybrid-http-client or vlm-http-client.
They require an additional OpenAI-compatible inference server, or at least an available remote GPU machine.
One-line choice guide
Normal PDFs, batch jobs, stability first: pipeline
Best overall local quality: hybrid-engine --effort medium
Image analysis or maximum accuracy: hybrid-engine --effort high
Very complex scanned layouts where you want to test VLM separately: vlm-engine
Models deployed on another GPU server: hybrid-http-client or vlm-http-client
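The same guide can be expressed as a small decision function. This is a sketch rather than anything MinerU provides: `choose_backend` and its parameter names are hypothetical, and the order in which the conditions are checked is an assumption made here for illustration.

```python
def choose_backend(remote_gpu_server=False, difficult_scans=False,
                   image_analysis=False, quality_first=False):
    """Map the situations in the choice guide to a backend recommendation."""
    if remote_gpu_server:
        # Only relevant when the model is deployed on another GPU machine.
        return "hybrid-http-client or vlm-http-client"
    if difficult_scans:
        # Worth testing separately when other modes give poor results.
        return "vlm-engine"
    if image_analysis:
        return "hybrid-engine --effort high"
    if quality_first:
        return "hybrid-engine --effort medium"
    # Normal PDFs, batch jobs, stability first.
    return "pipeline"


if __name__ == "__main__":
    print(choose_backend())                    # pipeline
    print(choose_backend(quality_first=True))  # hybrid-engine --effort medium
```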
Finally, check your PyTorch environment. If you are still on torch 2.8.0+cpu, pipeline can only run on CPU, and neither hybrid-engine nor vlm-engine can use your RTX 4060 until you install the CUDA build of PyTorch.
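A quick way to run that check is to look at the installed version string and ask PyTorch whether CUDA is usable. The sketch below relies on `torch.__version__` and `torch.cuda.is_available()`, which are standard PyTorch attributes. The `describe_torch_build` helper and its messages are hypothetical.

```python
def describe_torch_build(version, cuda_available):
    """Classify a PyTorch install from its version string and CUDA status."""
    if cuda_available:
        return "CUDA is usable: GPU backends can run on the local GPU."
    if version.endswith("+cpu"):
        return "CPU-only build: install the CUDA build of PyTorch to use the GPU."
    return "CUDA is not available: check the installed build and the NVIDIA driver."


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    print("torch", torch.__version__)
    print(describe_torch_build(torch.__version__, torch.cuda.is_available()))


if __name__ == "__main__":
    main()
```

If the first printed line ends in `+cpu`, the GPU cannot be used regardless of which backend you pick.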