<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>MinerU on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/mineru/</link>
        <description>Recent content in MinerU on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sun, 07 Jun 2026 23:41:50 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/mineru/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>MinerU Guide: Parse PDFs, Office files, and images into RAG-ready Markdown/JSON</title>
        <link>https://knightli.com/en/2026/06/07/mineru-document-parsing-rag-markdown-json/</link>
        <pubDate>Sun, 07 Jun 2026 23:41:50 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/06/07/mineru-document-parsing-rag-markdown-json/</guid>
        <description>&lt;p&gt;&lt;code&gt;opendatalab/MinerU&lt;/code&gt; is a document parsing tool for preparing data for large model applications. It can convert inputs such as &lt;code&gt;PDF&lt;/code&gt;, images, &lt;code&gt;DOCX&lt;/code&gt;, &lt;code&gt;PPTX&lt;/code&gt;, and &lt;code&gt;XLSX&lt;/code&gt; into Markdown, JSON, and intermediate structured outputs, making them easier to use in RAG, information extraction, knowledge base construction, and Agent workflows.&lt;/p&gt;
&lt;p&gt;The problem it addresses is very concrete: real documents often contain multi-column layouts, tables, formulas, headers and footers, scanned pages, handwriting, and image captions. Sending this content directly to a large model can easily produce broken reading order, lost table structure, unreadable formulas, and too much OCR noise. MinerU first parses layout, text, tables, formulas, and OCR content, then outputs results that are closer to both machine-readable data and human reading order.&lt;/p&gt;
&lt;h2 id=&#34;what-problems-it-is-suited-for&#34;&gt;What problems it is suited for
&lt;/h2&gt;&lt;p&gt;MinerU is a good fit for scenarios such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Parsing papers, reports, contracts, and manuals into Markdown;&lt;/li&gt;
&lt;li&gt;Preparing cleaner document chunks for RAG knowledge bases;&lt;/li&gt;
&lt;li&gt;Extracting text, tables, and formulas from scanned PDFs or images;&lt;/li&gt;
&lt;li&gt;Converting &lt;code&gt;DOCX&lt;/code&gt;, &lt;code&gt;PPTX&lt;/code&gt;, and &lt;code&gt;XLSX&lt;/code&gt; into structured data that downstream workflows can consume;&lt;/li&gt;
&lt;li&gt;Batch-processing documents in a local or private environment;&lt;/li&gt;
&lt;li&gt;Preparing data for frameworks such as LangChain, LlamaIndex, Dify, RAGFlow, and FastGPT.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only need to read a simple text-based PDF, a conventional PDF extraction tool may already be enough. MinerU is most valuable once complex layouts, tables, formulas, multiple input formats, or large-scale batch processing start to matter.&lt;/p&gt;
&lt;h2 id=&#34;core-capabilities&#34;&gt;Core capabilities
&lt;/h2&gt;&lt;p&gt;According to the project README, MinerU supports &lt;code&gt;PDF&lt;/code&gt;, image, &lt;code&gt;DOCX&lt;/code&gt;, &lt;code&gt;PPTX&lt;/code&gt;, and &lt;code&gt;XLSX&lt;/code&gt; inputs. It can output Markdown, JSON arranged by reading order, and visualization results for checking parsing quality.&lt;/p&gt;
&lt;p&gt;Key capabilities include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automatically removing headers, footers, footnotes, page numbers, and other distractions;&lt;/li&gt;
&lt;li&gt;Outputting text in human reading order for single-column, multi-column, and complex layouts;&lt;/li&gt;
&lt;li&gt;Preserving document structure such as headings, paragraphs, and lists;&lt;/li&gt;
&lt;li&gt;Extracting images, image captions, tables, table titles, and footnotes;&lt;/li&gt;
&lt;li&gt;Recognizing formulas and converting them to LaTeX;&lt;/li&gt;
&lt;li&gt;Recognizing tables and converting them to HTML;&lt;/li&gt;
&lt;li&gt;Automatically detecting scanned PDFs and garbled PDFs, then enabling OCR;&lt;/li&gt;
&lt;li&gt;Supporting OCR for 109 languages;&lt;/li&gt;
&lt;li&gt;Providing CLI, FastAPI, Gradio WebUI, and &lt;code&gt;mineru-router&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;3.1.0&lt;/code&gt; release in April 2026 introduced native parsing for &lt;code&gt;PPTX&lt;/code&gt; and &lt;code&gt;XLSX&lt;/code&gt;, and upgraded the main VLM model to &lt;code&gt;MinerU2.5-Pro-2604-1.2B&lt;/code&gt;. The GitHub release page shows that &lt;code&gt;3.2.3&lt;/code&gt;, released on June 4, 2026, added superscript and subscript detection/output, along with a post-OCR fallback mechanism for handling private-use text.&lt;/p&gt;
&lt;h2 id=&#34;installation&#34;&gt;Installation
&lt;/h2&gt;&lt;p&gt;For local testing, the official path is to install &lt;code&gt;uv&lt;/code&gt; first, then install the full feature package:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install --upgrade pip
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install uv
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv pip install -U &lt;span class=&#34;s2&#34;&gt;&amp;#34;mineru[all]&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You can also install from source:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/opendatalab/MinerU.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; MinerU
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv pip install -e &lt;span class=&#34;s2&#34;&gt;&amp;#34;.[all]&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;mineru[all]&lt;/code&gt; includes the core features and is described as compatible with Windows, Linux, and macOS. Document parsing is sensitive to hardware and dependency details, especially GPU support, inference frameworks, Python versions, and system environments. Before production deployment, run a small sample first, then decide whether to move into batch processing.&lt;/p&gt;
&lt;h2 id=&#34;first-document-parse&#34;&gt;First document parse
&lt;/h2&gt;&lt;p&gt;The basic command specifies an input path and output path:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;mineru -p &amp;lt;input_path&amp;gt; -o &amp;lt;output_path&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the device does not meet the GPU acceleration requirements, specify the &lt;code&gt;pipeline&lt;/code&gt; backend, which can run on CPU alone:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;mineru -p &amp;lt;input_path&amp;gt; -o &amp;lt;output_path&amp;gt; -b pipeline
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;&amp;lt;input_path&amp;gt;&lt;/code&gt; can be a single file or a directory. In practice, start with a small directory containing only a few representative documents:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;mineru -p ./samples -o ./output -b pipeline
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This lets you inspect output quality, runtime, memory usage, and file structure before scaling up to a full document library.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use-the-output&#34;&gt;How to use the output
&lt;/h2&gt;&lt;p&gt;MinerU outputs can feed several downstream workflows.&lt;/p&gt;
&lt;p&gt;The first is RAG. You can use Markdown as the input for chunking and vectorization, while keeping headings, paragraphs, lists, tables, and formulas as close to the original semantics as possible. Compared with directly OCRing everything into one large text block, structured Markdown is easier to chunk, cite, and trace back.&lt;/p&gt;
&lt;p&gt;The second is information extraction. JSON and intermediate results are suitable for downstream scripts, such as extracting tables, formulas, image captions, or specific sections. For automatically organizing reports, papers, or contract fields, this is more stable than working with plain text only.&lt;/p&gt;
&lt;p&gt;The third is human review. MinerU provides layout and span visualization results, which help you check whether content was missed, whether the order is reasonable, and whether tables were distorted. Before batch processing, it is best to sample and inspect these visualization outputs.&lt;/p&gt;
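&lt;p&gt;As a rough illustration of the RAG path, the sketch below splits one Markdown file into a chunk per heading. &lt;code&gt;chunk_markdown&lt;/code&gt; and the &lt;code&gt;chunk_NNN.md&lt;/code&gt; naming are hypothetical and not part of MinerU, and the heading test is naive: a line starting with &lt;code&gt;#&lt;/code&gt; inside a fenced code block would also open a new chunk.&lt;/p&gt;

```shell
# Hypothetical helper, not part of MinerU: split a Markdown file into one file per heading.
# $1: Markdown file to split, $2: directory that receives the chunk files
chunk_markdown() {
  mkdir -p "$2"
  awk -v dir="$2" '
    # A line that starts with one or more "#" followed by a space opens a new chunk.
    /^#+ / {
      if (out != "") close(out)
      n++
      out = sprintf("%s/chunk_%03d.md", dir, n)
    }
    # Text that appears before the first heading is written to chunk_000.md.
    {
      if (out == "") out = sprintf("%s/chunk_%03d.md", dir, 0)
      print > out
    }
  ' "$1"
}
```

&lt;p&gt;A call might look like &lt;code&gt;chunk_markdown ./output/report.md ./chunks&lt;/code&gt;, where the input path is only a placeholder for wherever MinerU wrote the Markdown. Keeping each heading line at the top of its chunk makes it easier to cite and trace a retrieved passage back to its section.&lt;/p&gt;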
&lt;h2 id=&#34;backend-choices&#34;&gt;Backend choices
&lt;/h2&gt;&lt;p&gt;The MinerU documentation describes several backend options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pipeline&lt;/code&gt;: good compatibility, runs on CPU or GPU, suitable for first trials and regular batch processing;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vlm-engine&lt;/code&gt;: higher accuracy with higher hardware requirements, suitable for complex documents and high-quality parsing;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hybrid-engine&lt;/code&gt;: combines native text extraction with high-accuracy parsing, suitable when you want to reduce hallucinations and improve complex layout quality;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;*-http-client&lt;/code&gt;: connects to OpenAI API-compatible services, including local or remote inference services.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only want to validate results, start with &lt;code&gt;pipeline&lt;/code&gt;. After you understand your document types, quality requirements, and processing volume, consider VLM or hybrid routes. For enterprise internal documents, backend choice also depends on whether data is allowed to leave the local environment.&lt;/p&gt;
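&lt;p&gt;The fallback logic can be sketched as a small shell function. &lt;code&gt;pick_backend&lt;/code&gt; and &lt;code&gt;GPU_PROBE&lt;/code&gt; are hypothetical names, and the check is an assumption: it treats a successful &lt;code&gt;nvidia-smi -L&lt;/code&gt; as "a GPU is present", which does not prove the device meets MinerU's actual requirements.&lt;/p&gt;

```shell
# Hypothetical helper, not part of MinerU: print a backend name for "mineru -b".
# Assumption: a GPU counts as present when the probe command exits successfully.
# GPU_PROBE can be overridden for tests or for non-NVIDIA hardware.
# $1: backend to use when the probe succeeds (defaults to vlm-engine)
pick_backend() {
  if ${GPU_PROBE:-nvidia-smi -L} > /dev/null 2> /dev/null; then
    printf '%s\n' "${1:-vlm-engine}"
  else
    # No usable GPU detected: fall back to the most compatible backend.
    printf '%s\n' "pipeline"
  fi
}
```

&lt;p&gt;It could then be combined with the basic command, for example &lt;code&gt;mineru -p ./samples -o ./output -b &#34;$(pick_backend)&#34;&lt;/code&gt;. Treat the result as a starting point and confirm the backend names against the MinerU version you have installed.&lt;/p&gt;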
&lt;h2 id=&#34;deployment-options&#34;&gt;Deployment options
&lt;/h2&gt;&lt;p&gt;MinerU supports CLI, local API, Gradio WebUI, Docker, and &lt;code&gt;mineru-router&lt;/code&gt;. Different entry points suit different teams:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Personal testing: CLI is the most direct;&lt;/li&gt;
&lt;li&gt;Non-technical users: Gradio WebUI is friendlier;&lt;/li&gt;
&lt;li&gt;Integration into existing systems: the FastAPI-based REST API is a better fit;&lt;/li&gt;
&lt;li&gt;Multiple services, multiple GPUs, high concurrency: consider &lt;code&gt;mineru-router&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;Lower environment setup cost: Docker is worth looking at on Linux or WSL2.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Docker deployment is currently more suitable for Linux and Windows with WSL2. macOS users usually start with the pip / uv installation route.&lt;/p&gt;
&lt;h2 id=&#34;how-it-differs-from-ordinary-ocr-tools&#34;&gt;How it differs from ordinary OCR tools
&lt;/h2&gt;&lt;p&gt;Ordinary OCR tools mainly focus on recognizing text from images. That is important, but it is not enough for RAG. RAG also cares about paragraph order, heading hierarchy, table structure, formula expression, image context, and traceability.&lt;/p&gt;
&lt;p&gt;MinerU is more like a preprocessing tool for document understanding. It is not just OCR: it also handles layout analysis, reading order, table HTML, formula LaTeX, multi-format input, and structured output. It is better suited for turning complex documents into data that downstream models can consume reliably.&lt;/p&gt;
&lt;p&gt;This also means heavier is not always better. For simple invoices, single-page images, or plain text PDFs, lightweight OCR or PDF text extraction may be faster. MinerU is more suitable when document complexity is already clearly affecting downstream results.&lt;/p&gt;
&lt;h2 id=&#34;choosing-between-paddleocr-marker-and-unstructured&#34;&gt;Choosing between PaddleOCR, Marker, and Unstructured
&lt;/h2&gt;&lt;p&gt;These tools overlap, but their entry points are different.&lt;/p&gt;
&lt;p&gt;PaddleOCR focuses more on OCR foundations and text recognition components, which is useful when you need to build a finer-grained OCR pipeline yourself. Marker focuses more on PDF-to-Markdown conversion, making it a good fit for quickly turning documents into readable Markdown. Unstructured is closer to document data extraction and enterprise data pipelines, useful for sending multiple document types into search or ETL workflows.&lt;/p&gt;
&lt;p&gt;MinerU is oriented toward LLM, RAG, and Agent data preparation. It emphasizes complex layouts, tables, formulas, multi-format input, VLM + OCR dual engines, and private deployment. If your documents are mainly papers, reports, textbooks, slide decks, and spreadsheets, and the downstream target is an LLM application, MinerU is worth a dedicated evaluation.&lt;/p&gt;
&lt;h2 id=&#34;batch-processing-advice&#34;&gt;Batch processing advice
&lt;/h2&gt;&lt;p&gt;Before formal batch processing, run a small validation in this order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Select 10 to 20 representative documents, covering scans, complex tables, multi-column papers, &lt;code&gt;PPTX&lt;/code&gt;, and &lt;code&gt;XLSX&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Parse them with the &lt;code&gt;pipeline&lt;/code&gt; backend first, recording runtime, memory, output size, and failed samples.&lt;/li&gt;
&lt;li&gt;Sample-check Markdown, JSON, and visualization results, focusing on reading order, tables, formulas, and image captions.&lt;/li&gt;
&lt;li&gt;For samples with insufficient quality, try VLM or the hybrid backend.&lt;/li&gt;
&lt;li&gt;After confirming the output structure, connect it to RAG chunking, vectorization, and citation tracing.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Do not throw the whole document library in at the start. Document parsing failures are often specific: a certain kind of scan, table, font, language direction, or cross-page content. Find the boundaries first, then scale up; it saves a lot of time.&lt;/p&gt;
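&lt;p&gt;The first two validation steps can be sketched as a loop that parses each sample separately, so one failure does not stop the run. &lt;code&gt;validate_samples&lt;/code&gt;, &lt;code&gt;MINERU_BIN&lt;/code&gt;, and the &lt;code&gt;_logs&lt;/code&gt; directory are hypothetical names chosen for this example. The sketch records wall-clock seconds and exit status only; memory and output size would need separate measurement.&lt;/p&gt;

```shell
# Hypothetical helper, not part of MinerU: parse sample files one at a time
# and print a tab-separated report of runtime and exit status.
# MINERU_BIN can point at another executable for dry runs.
MINERU_BIN="${MINERU_BIN:-mineru}"

# $1: directory with sample documents, $2: output directory
validate_samples() {
  logs="$2/_logs"
  mkdir -p "$logs"
  printf 'file\tseconds\tstatus\n'
  for f in "$1"/*; do
    # Skip subdirectories and the unmatched glob of an empty directory.
    if [ ! -f "$f" ]; then continue; fi
    name="$(basename "$f")"
    start="$(date +%s)"
    # -p, -o and -b pipeline are the options used in the commands in this article.
    if "$MINERU_BIN" -p "$f" -o "$2" -b pipeline \
        > "$logs/$name.out.log" 2> "$logs/$name.err.log"; then
      status="ok"
    else
      status="failed"
    fi
    end="$(date +%s)"
    printf '%s\t%s\t%s\n' "$name" "$((end - start))" "$status"
  done
}
```

&lt;p&gt;Running &lt;code&gt;validate_samples ./samples ./output&lt;/code&gt; prints one row per file. Rows marked &lt;code&gt;failed&lt;/code&gt; point to the per-file logs, which is usually enough to tell whether a particular scan, table, or font is the problem before scaling up.&lt;/p&gt;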
&lt;h2 id=&#34;privacy-and-compliance&#34;&gt;Privacy and compliance
&lt;/h2&gt;&lt;p&gt;If you are processing internal company documents, customer data, contracts, financial statements, or unpublished research materials, confirm the deployment mode and data flow first.&lt;/p&gt;
&lt;p&gt;Pay special attention to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether file content is sent to an external model service;&lt;/li&gt;
&lt;li&gt;Whether local inference, remote inference, or an OpenAI API-compatible service is being used;&lt;/li&gt;
&lt;li&gt;Whether intermediate files contain full text, images, tables, or sensitive business information;&lt;/li&gt;
&lt;li&gt;Whether Markdown / JSON outputs enter logs, object storage, or shared directories;&lt;/li&gt;
&lt;li&gt;Whether failed batch samples will be uploaded to issues, communities, or third-party debugging platforms.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MinerU supports private and offline deployment, but that does not mean every configuration is automatically offline. Before real deployment, map the full data path: input files, temporary directories, model inference, output directories, and the logging system.&lt;/p&gt;
&lt;h2 id=&#34;when-not-to-use-it&#34;&gt;When not to use it
&lt;/h2&gt;&lt;p&gt;You can skip MinerU for now in these situations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The document is simple, and ordinary PDF text extraction is enough;&lt;/li&gt;
&lt;li&gt;You only need to read a few pages once and do not need structured output;&lt;/li&gt;
&lt;li&gt;The current machine lacks resources, and parsing cost is higher than the benefit;&lt;/li&gt;
&lt;li&gt;Document quality is so poor that OCR results require heavy manual correction;&lt;/li&gt;
&lt;li&gt;Private documents cannot enter the current inference chain;&lt;/li&gt;
&lt;li&gt;The team does not yet have a clear downstream need for RAG, extraction, or a knowledge base.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A document parsing tool should serve a downstream workflow, not exist for parsing alone. If there is no clear consumer, first align output samples with downstream requirements, then decide whether to invest in batch processing.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;MinerU is suitable for converting complex documents into Markdown and JSON that large model applications can use more easily. It covers PDF, images, Office documents, tables, formulas, OCR, multilingual recognition, and local deployment, making it especially useful for RAG, knowledge bases, and Agent workflow data preparation.&lt;/p&gt;
&lt;p&gt;A steady adoption path is to evaluate quality with an online demo or small local sample, run the workflow with the &lt;code&gt;pipeline&lt;/code&gt; backend, and then decide whether to switch to VLM, hybrid, API, or multi-service deployment based on accuracy and throughput requirements. For complex documents, it can significantly reduce preprocessing cost; for simple documents, be careful not to make the workflow heavier than necessary.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/opendatalab/MinerU&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;opendatalab/MinerU - GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendatalab.github.io/MinerU/quick_start/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MinerU Quick Start&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/opendatalab/MinerU/releases/tag/mineru-3.2.3-released&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;mineru-3.2.3-released&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
