MinerU Guide: Parse PDFs, Office files, and images into RAG-ready Markdown/JSON

A practical overview of opendatalab/MinerU: its capabilities, installation options, CLI usage, deployment choices, and boundaries for turning PDFs, Office documents, and images into Markdown/JSON for RAG and Agent workflows.

opendatalab/MinerU is a document parsing tool for preparing data for large language model (LLM) applications. It converts inputs such as PDF, images, DOCX, PPTX, and XLSX into Markdown, JSON, and intermediate structured outputs, making them easier to use in RAG, information extraction, knowledge base construction, and Agent workflows.

The problem it addresses is concrete: real documents often contain multi-column layouts, tables, formulas, headers and footers, scanned pages, handwriting, and image captions. Sending this content directly to an LLM can easily produce broken reading order, lost table structure, unreadable formulas, and excessive OCR noise. MinerU first parses layout, text, tables, formulas, and OCR content, then outputs results that are machine-readable and follow human reading order.

Problems it is suited for

MinerU is a good fit for scenarios such as:

  • Parsing papers, reports, contracts, and manuals into Markdown;
  • Preparing cleaner document chunks for RAG knowledge bases;
  • Extracting text, tables, and formulas from scanned PDFs or images;
  • Converting DOCX, PPTX, and XLSX into structured data that downstream workflows can consume;
  • Batch-processing documents in a local or private environment;
  • Preparing data for frameworks such as LangChain, LlamaIndex, Dify, RAGFlow, and FastGPT.

If your task is only to read a simple text-based PDF, a conventional PDF extraction tool may already be enough. MinerU is most valuable when complex layouts, tables, formulas, multiple input formats, or batch production of document data start to matter.

Core capabilities

According to the project README, MinerU supports PDF, image, DOCX, PPTX, and XLSX inputs. It can output Markdown, JSON arranged by reading order, and visualization results for checking parsing quality.

Key capabilities include:

  • Automatically removing headers, footers, footnotes, page numbers, and other distractions;
  • Outputting text in human reading order for single-column, multi-column, and complex layouts;
  • Preserving document structure such as headings, paragraphs, and lists;
  • Extracting images, image captions, tables, table titles, and footnotes;
  • Recognizing formulas and converting them to LaTeX;
  • Recognizing tables and converting them to HTML;
  • Automatically detecting scanned PDFs and garbled PDFs, then enabling OCR;
  • Supporting OCR for 109 languages;
  • Providing CLI, FastAPI, Gradio WebUI, and mineru-router.
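Because tables come out as HTML, downstream code usually needs them as rows before it can filter or index them. The following is a minimal sketch using only the Python standard library; the class and function names are illustrative, and it assumes a simple table without nested tables, rowspan, or colspan.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> in an HTML table as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_html_to_rows(html):
    """Return the table as a list of rows, each row a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Merged cells are flattened by this approach, so tables with spans need a dedicated HTML library or manual review.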

The 3.1.0 release in April 2026 introduced native parsing for PPTX and XLSX, and upgraded the main VLM model to MinerU2.5-Pro-2604-1.2B. The GitHub release page shows that 3.2.3, released on June 4, 2026, added superscript and subscript detection/output, along with a post-OCR fallback mechanism for handling private-use text.

Installation

For local testing, the official path is to install uv first, then install the full feature package:

pip install --upgrade pip
pip install uv
uv pip install -U "mineru[all]"

You can also install from source:

git clone https://github.com/opendatalab/MinerU.git
cd MinerU
uv pip install -e ".[all]"

mineru[all] includes the core features and is described as compatible with Windows, Linux, and macOS. Document parsing is sensitive to hardware and dependency details, especially GPU support, inference frameworks, Python versions, and system environments. Before production deployment, run a small sample first, then decide whether to move into batch processing.
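After either install route, a quick check that the command actually landed on PATH can save a confusing first run. This is a small sketch; the helper name find_cli is illustrative, and it only confirms that an executable named mineru is discoverable, not that its models or GPU dependencies work.

```python
import shutil


def find_cli(name="mineru"):
    """Return the full path of the command if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_cli()
    print(location or "mineru not found on PATH; check the active environment")
```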

First document parse

The basic command specifies an input path and output path:

mineru -p <input_path> -o <output_path>

If the device does not meet the GPU acceleration requirements, specify the pipeline backend to run entirely on CPU:

mineru -p <input_path> -o <output_path> -b pipeline

<input_path> can be a single file or a directory. In practice, start with a small directory containing only a few representative documents:

mineru -p ./samples -o ./output -b pipeline

This lets you inspect output quality, runtime, memory usage, and file structure before scaling up to a full document library.
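To record runtime alongside the output, the same command can be wrapped in a short script. The sketch below uses only the -p, -o, and -b flags shown above; the function names are illustrative, and run_parse assumes mineru is installed and on PATH.

```python
import subprocess
import time


def build_command(input_path, output_path, backend=None):
    """Build the mineru argument list from the -p, -o and -b flags."""
    cmd = ["mineru", "-p", str(input_path), "-o", str(output_path)]
    if backend:
        cmd += ["-b", backend]
    return cmd


def run_parse(input_path, output_path, backend="pipeline"):
    """Run one parse and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(build_command(input_path, output_path, backend))
    return result.returncode, time.perf_counter() - start


if __name__ == "__main__":
    code, seconds = run_parse("./samples", "./output")
    print(f"exit code {code}, {seconds:.1f}s")
```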

How to use the output

MinerU outputs can feed several downstream workflows.

The first is RAG. You can use Markdown as the input for chunking and vectorization, while keeping headings, paragraphs, lists, tables, and formulas as close to the original semantics as possible. Compared with directly OCRing everything into one large text block, structured Markdown is easier to chunk, cite, and trace back.
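One way to exploit that structure is to chunk on headings and keep the heading trail with each chunk, so a retrieved passage can be traced back to its section. This is a minimal sketch, assuming the Markdown uses `#`-style headings; the function name and the chunk dictionary layout are illustrative, and it does not skip `#` lines inside code fences.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown):
    """Split Markdown into chunks, one per heading, tagged with the heading path."""
    chunks = []
    path = []   # current heading trail, e.g. ["Intro", "Scope"]
    body = []   # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]            # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Long sections would still need a size-based split afterwards; the heading path can be carried over to each sub-chunk for citation.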

The second is information extraction. JSON and intermediate results are suitable for downstream scripts, such as extracting tables, formulas, image captions, or specific sections. For automatically organizing reports, papers, or contract fields, this is more stable than working with plain text only.
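A downstream script for this often amounts to filtering entries by kind. The sketch below assumes the JSON file holds a flat list of objects, each with a "type" field such as "text" or "table"; that layout and the field names in the sample are assumptions for illustration and should be checked against the files your MinerU version actually writes.

```python
import json


def load_items(path):
    """Load a JSON file that is assumed to hold a flat list of content entries."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def items_of_type(items, wanted):
    """Return the entries whose "type" field equals `wanted` (assumed schema)."""
    return [
        item for item in items
        if isinstance(item, dict) and item.get("type") == wanted
    ]


if __name__ == "__main__":
    # Hypothetical sample; real entries may carry different fields.
    sample = [
        {"type": "text", "text": "Quarterly summary"},
        {"type": "table", "table_body": "<table></table>"},
    ]
    print(len(items_of_type(sample, "table")))
```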

The third is human review. MinerU provides layout and span visualization results, which help you check whether content was missed, whether the order is reasonable, and whether tables were distorted. Before batch processing, it is best to sample and inspect these visualization outputs.

Backend choices

The MinerU documentation describes several backends:

  • pipeline: good compatibility, runs on CPU or GPU, suitable for first trials and regular batch processing;
  • vlm-engine: higher accuracy with higher hardware requirements, suitable for complex documents and high-quality parsing;
  • hybrid-engine: combines native text extraction with high-accuracy parsing, suitable when you want to reduce hallucinations and improve complex layout quality;
  • *-http-client: connects to OpenAI API-compatible services, including local or remote inference services.

If you only want to validate results, start with pipeline. After you understand your document types, quality requirements, and processing volume, consider VLM or hybrid routes. For enterprise internal documents, backend choice also depends on whether data is allowed to leave the local environment.
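The guidance above can be written down as a rule of thumb. This is only a sketch: the function name, its parameters, and the order of the checks are assumptions, not a recommendation from the project, and it leaves out the *-http-client route because that choice depends on where the inference service runs.

```python
def suggest_backend(has_gpu, complex_layouts, worried_about_hallucination=False):
    """Rule-of-thumb backend suggestion; the decision order is an assumption."""
    if not has_gpu or not complex_layouts:
        return "pipeline"        # compatible default for first trials
    if worried_about_hallucination:
        return "hybrid-engine"   # leans on native text extraction
    return "vlm-engine"          # higher accuracy, higher hardware needs
```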

Deployment options

MinerU supports CLI, local API, Gradio WebUI, Docker, and mineru-router. Different entry points suit different teams:

  • Personal testing: CLI is the most direct;
  • Non-technical users: Gradio WebUI is friendlier;
  • Integration into existing systems: FastAPI or REST API is a better fit;
  • Multiple services, multiple GPUs, high concurrency: consider mineru-router;
  • Lower environment setup cost: Docker is worth looking at on Linux or WSL2.

Docker deployment is currently more suitable for Linux and Windows with WSL2. macOS users usually start with the pip / uv installation route.

How it differs from ordinary OCR tools

Ordinary OCR tools mainly focus on recognizing text from images. That is important, but it is not enough for RAG. RAG also cares about paragraph order, heading hierarchy, table structure, formula expression, image context, and traceability.

MinerU is more like a preprocessing tool for document understanding. It is not just OCR: it also handles layout analysis, reading order, table HTML, formula LaTeX, multi-format input, and structured output. It is better suited for turning complex documents into data that downstream models can consume reliably.

This also means heavier is not always better. For simple invoices, single-page images, or plain text PDFs, lightweight OCR or PDF text extraction may be faster. MinerU is more suitable when document complexity is already clearly affecting downstream results.

Choosing between PaddleOCR, Marker, and Unstructured

These tools overlap, but their entry points are different.

PaddleOCR focuses more on OCR foundations and text recognition components, which is useful when you need to build a finer-grained OCR pipeline yourself. Marker focuses more on PDF-to-Markdown conversion, making it a good fit for quickly turning documents into readable Markdown. Unstructured is closer to document data extraction and enterprise data pipelines, useful for sending multiple document types into search or ETL workflows.

MinerU is oriented toward LLM, RAG, and Agent data preparation. It emphasizes complex layouts, tables, formulas, multi-format input, VLM + OCR dual engines, and private deployment. If your documents are mainly papers, reports, textbooks, slide decks, and spreadsheets, and the downstream target is an LLM application, it is worth evaluating on your own samples.

Batch processing advice

Before formal batch processing, run a small validation in this order:

  1. Select 10 to 20 representative documents, covering scans, complex tables, multi-column papers, PPT, and Excel.
  2. Parse them with the pipeline backend first, recording runtime, memory, output size, and failed samples.
  3. Sample-check Markdown, JSON, and visualization results, focusing on reading order, tables, formulas, and image captions.
  4. For samples with insufficient quality, try VLM or the hybrid backend.
  5. After confirming the output structure, connect it to RAG chunking, vectorization, and citation tracing.

Do not throw the whole document library in at the start. Document parsing failures are often specific: a certain kind of scan, table, font, language direction, or cross-page content. Find the boundaries first, then scale up; it saves a lot of time.
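Steps 1 and 2 can be sketched as a small harness: pick a few files per format, then time each parse and keep the failures. The names pick_samples and validate are illustrative, and `parse` is any callable you supply (for example, a wrapper around the CLI). Grouping by file extension is an assumption that covers formats only; scans or complex tables still have to be chosen by hand.

```python
import random
import time
from collections import defaultdict
from pathlib import Path


def pick_samples(paths, per_type=4, seed=0):
    """Pick up to `per_type` files per extension so every format is represented."""
    groups = defaultdict(list)
    for path in paths:
        groups[Path(path).suffix.lower()].append(path)
    rng = random.Random(seed)   # fixed seed keeps the sample reproducible
    picked = []
    for suffix in sorted(groups):
        files = sorted(groups[suffix], key=str)
        picked.extend(rng.sample(files, min(per_type, len(files))))
    return picked


def validate(paths, parse):
    """Call `parse(path)` for each file, recording runtime and any failure."""
    report = []
    for path in paths:
        start = time.perf_counter()
        try:
            parse(path)
            error = None
        except Exception as exc:   # keep going so one bad file does not stop the run
            error = repr(exc)
        report.append({
            "path": str(path),
            "seconds": time.perf_counter() - start,
            "error": error,
        })
    return report
```

The entries with a non-empty "error" field are the failed samples to inspect first.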

Privacy and compliance

If you are processing internal company documents, customer data, contracts, financial statements, or unpublished research materials, confirm the deployment mode and data flow first.

Pay special attention to:

  • Whether file content is sent to an external model service;
  • Whether local inference, remote inference, or an OpenAI API-compatible service is being used;
  • Whether intermediate files contain full text, images, tables, or sensitive business information;
  • Whether Markdown / JSON outputs enter logs, object storage, or shared directories;
  • Whether failed batch samples will be uploaded to issues, communities, or third-party debugging platforms.

MinerU supports private and offline deployment, but that does not mean every configuration is automatically offline. Before real deployment, map the full data path: input files, temporary directories, model inference, output directories, and the logging system.
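One cheap check along that data path is whether the configured inference endpoint points at the local machine or a private network. This is a minimal sketch with an illustrative function name; it only classifies `localhost` and literal IP addresses, and it deliberately returns False for DNS names because they cannot be judged without resolving them.

```python
import ipaddress
from urllib.parse import urlparse


def stays_on_host_or_lan(url):
    """True if the URL's host is localhost or a literal loopback/private IP."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False   # a DNS name: not classified without resolving it
    return address.is_loopback or address.is_private
```

A True result says nothing about logs, object storage, or shared directories; those still need to be reviewed separately.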

When not to use it

You can skip MinerU for now in these situations:

  • The document is simple, and ordinary PDF text extraction is enough;
  • You only need to read a few pages once and do not need structured output;
  • The current machine lacks resources, and parsing cost is higher than the benefit;
  • Document quality is so poor that OCR results require heavy manual correction;
  • Private documents cannot enter the current inference chain;
  • The team does not yet have a clear downstream need for RAG, extraction, or a knowledge base.

A document parsing tool should serve a downstream workflow, not exist for parsing alone. If there is no clear consumer, first align output samples with downstream requirements, then decide whether to invest in batch processing.

Summary

MinerU is suitable for converting complex documents into Markdown and JSON that large model applications can use more easily. It covers PDF, images, Office documents, tables, formulas, OCR, multilingual recognition, and local deployment, making it especially useful for RAG, knowledge bases, and Agent workflow data preparation.

A steady adoption path is to evaluate quality with an online demo or small local sample, run the workflow with the pipeline backend, and then decide whether to switch to VLM, hybrid, API, or multi-service deployment based on accuracy and throughput requirements. For complex documents, it can significantly reduce preprocessing cost; for simple documents, be careful not to make the workflow heavier than necessary.
