PageIndex on KnightLi Blog

What Is PageIndex? A Reasoning-Based RAG Document Index Without Vector Databases

Wed, 20 May 2026 23:51:37 +0800

VectifyAI/PageIndex is an interesting RAG project. Instead of starting with “build another vector database,” it first organizes long documents into a tree structure similar to a table of contents, then lets an LLM perform reasoning-based retrieval along that tree.

Project link: https://github.com/VectifyAI/PageIndex

At the time of writing, the GitHub page shows about 31.8k stars and 2.7k forks, with an MIT license. The README positions it as “Vectorless, Reasoning-based RAG”: retrieval that relies on LLM reasoning over document structure rather than on a vector database.

What Problem It Tries to Solve

The common path for traditional RAG is: chunk the document, vectorize the chunks, store them in a vector database, then retrieve passages by similarity search. This approach is simple, general, and mature, but it often runs into several problems with long professional documents:

Similarity is not the same as true relevance.
Document structure is broken apart by chunking, and section relationships are lost.
Retrieval results are hard to explain, making it difficult to say why a passage was selected.
For financial reports, regulatory filings, legal documents, and technical manuals, questions often require reasoning across sections.

PageIndex takes the opposite route: first organize the document into a semantic tree, then let the model search it the way a human would: read the table of contents, jump into a section, and narrow down to the details.

The Basic PageIndex Workflow

The README describes PageIndex retrieval in two steps:

Generate a Table-of-Contents-like tree index for the document.
Perform reasoning-based retrieval through tree search.

This tree is not just a file directory. It is a document structure designed for LLM use. Nodes can contain titles, page ranges, summaries, child nodes, and other metadata. When answering a question, the model does not need to face a pile of fragmented chunks immediately. It can first decide which section to enter, then continue searching downward.
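To make the idea concrete, here is a minimal sketch of such a tree and a top-down search over it. The `Node` fields and the `keyword_score` function are assumptions for illustration, not PageIndex's actual schema or API. In the real project an LLM decides which branch to enter; here a simple word-overlap count stands in for that decision so the example runs without a model.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One entry in a table-of-contents-like tree (field names are illustrative)."""
    title: str
    start_page: int
    end_page: int
    summary: str = ""
    children: List["Node"] = field(default_factory=list)


def keyword_score(question: str, node: Node) -> int:
    """Stand-in for the LLM's judgment: words shared by the question and the node text."""
    question_words = set(re.findall(r"\w+", question.lower()))
    node_words = set(re.findall(r"\w+", f"{node.title} {node.summary}".lower()))
    return len(question_words & node_words)


def tree_search(root: Node, question: str,
                choose: Callable[[str, Node], int] = keyword_score) -> List[Node]:
    """Descend from the root, entering the best-scoring child at each level.

    Returns the path of visited nodes, ending at a leaf.
    """
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: choose(question, child))
        path.append(node)
    return path


report = Node("Annual Report", 1, 120, "Company annual report", [
    Node("Business Overview", 1, 30, "Products, markets, strategy"),
    Node("Financial Statements", 31, 100, "Revenue, income, balance sheet", [
        Node("Income Statement", 31, 50, "Revenue and net income by segment"),
        Node("Balance Sheet", 51, 70, "Assets and liabilities"),
    ]),
    Node("Risk Factors", 101, 120, "Regulatory and market risk"),
])

path = tree_search(report, "What was revenue by segment?")
print(" > ".join(n.title for n in path))
print(f"pages {path[-1].start_page}-{path[-1].end_page}")
```

The search never looks at the full text. It only reads titles and summaries on the way down, which is why the tree can index a document far larger than the context window.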

This method is better suited to documents that are well structured but very long, such as:

Financial reports and SEC filings.
Regulatory and compliance documents.
Academic textbooks and papers.
Legal documents.
Technical manuals and product documentation.
Large PDFs that exceed the model context window.

How It Differs From Traditional Vector RAG

PageIndex’s main selling points can be summarized in five areas.

First, it does not require a Vector DB. It relies on document structure and LLM reasoning to locate content, rather than only using vector similarity search.

Second, it does not use traditional chunking. Documents are organized by natural sections instead of fixed-length text fragments.

Third, explainability is stronger. The retrieval path can map back to pages, sections, and tree nodes, making it easier to trace than “this text was hit by vector similarity.”
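A small sketch of what a traceable retrieval path can look like. The tree is a nested dictionary, and the key names (`title`, `node_id`, `start_index`, `end_index`, `nodes`) are assumptions for this example; check the JSON your own run produces before relying on them.

```python
from typing import List, Optional


def find_path(node: dict, target_id: str) -> Optional[List[dict]]:
    """Depth-first search returning the chain of nodes from the root to target_id."""
    if node.get("node_id") == target_id:
        return [node]
    for child in node.get("nodes", []):
        sub_path = find_path(child, target_id)
        if sub_path is not None:
            return [node] + sub_path
    return None


def explain(path: List[dict]) -> str:
    """Render a retrieval path as a readable trail ending in a page range."""
    trail = " > ".join(n["title"] for n in path)
    leaf = path[-1]
    return f"{trail} (pages {leaf['start_index']}-{leaf['end_index']})"


# Hypothetical tree for a technical manual.
tree = {
    "title": "Manual", "node_id": "0001", "start_index": 1, "end_index": 40,
    "nodes": [
        {"title": "Installation", "node_id": "0002", "start_index": 5, "end_index": 20,
         "nodes": [
             {"title": "Wiring", "node_id": "0003", "start_index": 12, "end_index": 15},
         ]},
        {"title": "Troubleshooting", "node_id": "0004", "start_index": 21, "end_index": 40},
    ],
}

print(explain(find_path(tree, "0003")))
```

An answer can then cite the trail alongside the text, which is easier to audit than a similarity score.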

Fourth, retrieval is context-aware. The question, conversation history, and domain background can all affect the tree search path.

Fifth, it is closer to how human experts read documents. People usually do not cut an entire document into small chunks and calculate similarity; they first inspect the table of contents, locate sections, and then read details.

This does not mean vector databases have no value. A more accurate view is that PageIndex fits long-document retrieval where semantic similarity alone is not enough and document structure and reasoning need to take part.

How to Run It Locally

The README provides a local self-hosting path. First install dependencies:

`pip3 install --upgrade -r requirements.txt`

Then create a .env file in the project root and add your LLM API key. The project supports multiple models through LiteLLM:

`OPENAI_API_KEY=your_openai_key_here`

Generate a PageIndex structure for a PDF:

`python3 run_pageindex.py --pdf_path /path/to/your/document.pdf`

Markdown is also supported:

`python3 run_pageindex.py --md_path /path/to/your/document.md`

Common optional parameters include:

--model
--toc-check-pages
--max-pages-per-node
--max-tokens-per-node
--if-add-node-id
--if-add-node-summary
--if-add-doc-description
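If you want to drive the script from Python, for example in a batch job, a thin wrapper like the following can help. This is a sketch: `build_command` is a hypothetical helper, and the option values shown (`10`, `yes`) are placeholders, so check the README for the values each flag accepts.

```python
import shlex
from typing import Dict, List, Optional


def build_command(pdf_path: str, options: Optional[Dict[str, str]] = None) -> List[str]:
    """Build an argv list for run_pageindex.py; option values are passed through as-is."""
    argv = ["python3", "run_pageindex.py", "--pdf_path", pdf_path]
    for flag, value in (options or {}).items():
        argv += [f"--{flag}", str(value)]
    return argv


cmd = build_command(
    "/path/to/your/document.pdf",
    {"max-pages-per-node": "10", "if-add-node-summary": "yes"},  # placeholder values
)
print(shlex.join(cmd))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building an argv list instead of a shell string avoids quoting problems when file paths contain spaces.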

The README also notes that the local open-source version uses standard PDF parsing. For complex PDFs, the project’s cloud service provides enhanced OCR, tree building, and retrieval pipelines.

Agentic Vectorless RAG Example

The project also provides an agentic vectorless RAG example using self-hosted PageIndex and OpenAI Agents SDK. Install the optional dependency and run it:

`pip3 install openai-agents`
`python3 examples/agentic_vectorless_rag_demo.py`

The value of this example is that it pushes PageIndex from “generate a document tree” to “let an Agent use the document tree for retrieval.” If you are building an enterprise knowledge base, financial report Q&A, regulatory Q&A, or a technical documentation Agent, running this example will teach you more than reading the README alone.

Cloud Service, MCP, and API

PageIndex is not just a GitHub repo. The project page also lists several entry points:

Self-hosting: run the open-source code locally, suitable for experiments and controlled deployments.
Chat Platform: a ChatGPT-style document analysis platform.
MCP / API: useful for integrating with existing Agents or automation workflows.
Enterprise: for private or on-premises deployment.

This shows that the project is positioned as more than a demo. It aims to turn “reasoning-based document retrieval” into document intelligence infrastructure that other systems can integrate.

Suitable Scenarios

PageIndex is suitable for tasks such as:

Long PDF Q&A.
Financial reports, annual reports, prospectuses, and regulatory filing analysis.
Legal and compliance document retrieval.
Technical manual Q&A.
Multi-section textbook or paper retrieval.
Enterprise knowledge bases that need explainable retrieval paths.
Providing structured document context to Agents.

If your material is short, has little structure, or is just a normal FAQ, traditional embedding + vector DB may already be enough. PageIndex’s advantages are more likely to show up with long, strongly structured documents in professional domains, and with questions that require reasoning.

Things to Watch

First, PageIndex still depends on LLMs. Tree building, summaries, and retrieval quality are affected by model capability, prompts, and document parsing quality.

Second, the local version uses standard PDF parsing. Complex scanned documents, chart-heavy PDFs, or messy layouts may require OCR and stronger preprocessing.

Third, vectorless does not mean zero cost. Tree building itself also consumes model calls and time, especially for large-scale document collections.

Fourth, PageIndex is more like a document structure indexing and reasoning retrieval framework. It does not directly replace every RAG stack. In production, it may also be combined with vector retrieval, keyword retrieval, permission control, caching, and audit systems.

Summary

What makes PageIndex interesting is that it shifts RAG from “text similarity retrieval” toward “document structure + LLM reasoning.” For long and professional documents, this direction is worth watching.

If you are building enterprise document Q&A, financial report analysis, regulatory retrieval, or technical manual Agents, PageIndex is a RAG architecture worth using as a reference: give documents structure first, then let the model reason along that structure, instead of breaking everything into chunks and putting it all into a vector database from the beginning.

References:

GitHub: VectifyAI/PageIndex

OpenKB: Compiling Documents into a Continuously Updated LLM Knowledge Base

Sun, 17 May 2026 17:15:08 +0800

OpenKB is an open-source LLM knowledge base tool from VectifyAI.

It is not a traditional RAG system that chunks documents, vectorizes them, and then stitches context back together at query time. Instead, it first compiles raw documents into a structured wiki: document summaries, concept pages, cross-references, follow-up queries, and lint checks. In other words, it feels more like a knowledge-base CLI that keeps organizing your material over time.

Project link: https://github.com/VectifyAI/OpenKB

The Short Version

OpenKB is worth watching for three reasons:

It outputs the knowledge base as ordinary Markdown files instead of locking it inside a dedicated database.
It uses PageIndex for long PDFs, focusing on vector-database-free retrieval for long documents.
It emphasizes “knowledge compilation”: the LLM generates summaries, concept pages, and cross-links instead of retrieving from scratch on every question.

That makes OpenKB better suited to long-term knowledge accumulation: paper reading, project documentation, internal company materials, technical standards, product research, and personal knowledge bases.

It is not a universal replacement. If you need high-concurrency online Q&A, complex permissions, a web admin console, enterprise audit trails, or large-scale multi-tenancy, OpenKB currently looks more like a developer tool and knowledge-base prototype than a complete enterprise knowledge platform.

What OpenKB Is

OpenKB stands for Open Knowledge Base.

It works as a CLI: it converts, organizes, summarizes, and writes documents into a set of wiki files. The official README describes it directly: OpenKB uses LLMs to compile raw documents into a structured, interlinked wiki-style knowledge base, with PageIndex providing vectorless long-document retrieval.

Supported input formats include:

PDF
Word
Markdown
PowerPoint
HTML
Excel
Plain text
Other formats that markitdown can convert

The generated knowledge base lives under wiki/ and mainly includes:

index.md: knowledge base overview
log.md: operation timeline
AGENTS.md: knowledge base structure and maintenance instructions
sources/: converted source text
summaries/: summaries for each document
concepts/: cross-document concept pages
explorations/: saved query results
reports/: lint reports
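Because the output is plain files, you can inspect it with ordinary tooling. The sketch below checks a wiki/ directory against the layout listed above. `missing_entries` is a hypothetical helper written for this post, not part of OpenKB, and a fresh knowledge base may legitimately lack some entries until the related commands have run.

```python
import tempfile
from pathlib import Path
from typing import List

# Names taken from the layout described above.
EXPECTED_FILES = ["index.md", "log.md", "AGENTS.md"]
EXPECTED_DIRS = ["sources", "summaries", "concepts", "explorations", "reports"]


def missing_entries(wiki_root: Path) -> List[str]:
    """Return the expected files and directories that are absent under wiki_root."""
    missing = [name for name in EXPECTED_FILES if not (wiki_root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (wiki_root / name).is_dir()]
    return missing


# Demo on a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    wiki = Path(tmp) / "wiki"
    (wiki / "summaries").mkdir(parents=True)
    (wiki / "index.md").write_text("# Overview\n")
    demo_missing = missing_entries(wiki)

print(demo_missing)
```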

The biggest benefit of this design is transparency. You can open the Markdown files directly instead of only receiving answers through a black-box retrieval interface.

How It Differs from Traditional RAG

A typical traditional RAG pipeline looks like this:

Chunk the documents.
Generate embeddings.
Store them in a vector database.
Retrieve relevant chunks at query time.
Feed those chunks to the LLM to generate an answer.
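The first four steps can be sketched in a few lines. This is a toy version under stated assumptions: a bag-of-words count stands in for a real embedding model, a Python list stands in for the vector database, and the final LLM call is left out.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk(text: str, size: int) -> List[str]:
    """Split text into fixed-size word windows with no overlap."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: List[Tuple[str, Counter]], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


document = ("Revenue grew in the cloud segment. "
            "The board approved a new dividend policy. "
            "Risk factors include currency exposure.")

index = [(c, embed(c)) for c in chunk(document, size=6)]  # the "vector database"
top = retrieve("What is the dividend policy?", index, k=1)
print(top)
```

Note what the fixed-size window does here: the sentence about the dividend policy is cut after “dividend,” and “policy.” lands in the next chunk. The top hit is a sentence fragment, which is the kind of structural damage that chunking can cause.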

That workflow is mature and works well for Q&A systems. But it has one problem: the knowledge itself does not really accumulate. Every question repeats the work of finding chunks, assembling context, and generating an answer.

OpenKB is closer to “organize first, ask later”:

Documents enter raw/.
Short documents are converted to Markdown with markitdown.
Long PDFs go through PageIndex to produce tree indexes and summaries.
The LLM generates document summaries.
The LLM reads existing concept pages and creates or updates cross-document concepts.
The knowledge base index, log, and cross-links are updated.
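The routing decision between the two conversion paths can be pictured like this. It is a sketch, not OpenKB's code: it assumes the `pageindex_threshold` setting is a page count, that only PDFs are routed to PageIndex, and that a document at or above the threshold counts as long. `IncomingDoc` and `choose_converter` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class IncomingDoc:
    path: Path
    page_count: int  # hypothetical: assumes the caller already counted pages


def choose_converter(doc: IncomingDoc, pageindex_threshold: int = 20) -> str:
    """Return which conversion route a document would take in this sketch."""
    is_pdf = doc.path.suffix.lower() == ".pdf"
    if is_pdf and doc.page_count >= pageindex_threshold:
        return "pageindex"   # long PDF: build a tree index and summaries
    return "markitdown"      # everything else: convert straight to Markdown


for doc in [IncomingDoc(Path("raw/memo.docx"), 3),
            IncomingDoc(Path("raw/short.pdf"), 8),
            IncomingDoc(Path("raw/prospectus.pdf"), 240)]:
    print(doc.path.name, "->", choose_converter(doc))
```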

As a result, adding one document does more than create another searchable file. It may update a dozen wiki pages. Knowledge is written into concept pages and connected to existing material.

This is closer to how humans maintain knowledge bases: when new material arrives, you do not just archive it; you update topic pages, summarize differences, and add references.

What PageIndex Solves

Long documents have always been difficult for RAG and LLM knowledge bases.

If you simply split a long PDF into many chunks, several problems appear:

Chapter relationships are lost.
Tables, images, and footnotes are hard to handle.
Retrieved snippets are too fragmented, so answers lack global structure.
Even a large context window is not ideal for stuffing an entire document into the prompt.
Long summary chains can compress away important details.

OpenKB uses PageIndex for long PDFs. According to the project description, PageIndex builds tree indexes and summaries for long documents, letting the LLM reason over the document tree instead of reading the whole document directly.

The focus is not “the few text snippets with the highest vector similarity.” It is about helping the model use document hierarchy to find relevant content. For research reports, papers, manuals, prospectuses, and compliance documents, this direction makes a lot of sense.

OpenKB can use the open-source PageIndex locally by default. If you need OCR, complex PDF handling, or faster structure generation, you can configure PAGEINDEX_API_KEY to use PageIndex Cloud.

Install and Quick Start

Install OpenKB with pip:

`pip install openkb`

Or install the latest GitHub version:

`pip install git+https://github.com/VectifyAI/OpenKB.git`

For editable source installation:

`git clone https://github.com/VectifyAI/OpenKB.git`
`cd OpenKB`
`pip install -e .`

Create a knowledge base directory:

`mkdir my-kb && cd my-kb`
`openkb init`

Add documents:

`openkb add paper.pdf`
`openkb add ~/papers/`

Ask a question:

`openkb query "What are the main findings?"`

Start an interactive chat:

`openkb chat`

If you want OpenKB to process new files automatically, use watch mode:

`openkb watch`

After that, drop files into raw/, and OpenKB will update the wiki automatically.

LLM Configuration

OpenKB uses LiteLLM to support multiple model providers, including OpenAI, Claude, and Gemini.

You can set the model during initialization, or configure it in .openkb/config.yaml:

`model: gpt-5.4`
`language: en`
`pageindex_threshold: 20`

Model names follow LiteLLM’s provider/model format. OpenAI models can omit the provider prefix:

`model: gpt-5.4`

Models from other providers, such as Anthropic and Gemini, carry the provider prefix:

`model: anthropic/claude-sonnet-4-6`

`model: gemini/gemini-3.1-pro-preview`
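The naming convention is easy to handle in your own scripts. The helper below is a hypothetical illustration of the provider/model split described above, with OpenAI as the fallback for bare names. LiteLLM's real resolution logic is more involved, so treat this as a reading aid rather than a reimplementation.

```python
from typing import Tuple


def split_model(model: str, default_provider: str = "openai") -> Tuple[str, str]:
    """Split a 'provider/model' string; bare names fall back to default_provider."""
    provider, sep, name = model.partition("/")
    if not sep:
        return default_provider, model
    return provider, name


for configured in ["gpt-5.4", "anthropic/claude-sonnet-4-6", "gemini/gemini-3.1-pro-preview"]:
    print(configured, "->", split_model(configured))
```

`partition` splits on the first slash only, so a model name that itself contains a slash stays intact after the provider prefix.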

Put the API key in .env:

`LLM_API_KEY=your_llm_api_key`

If you enable PageIndex Cloud, add:

`PAGEINDEX_API_KEY=your_pageindex_api_key`

Common Commands

OpenKB’s commands are developer-friendly:

openkb init: initialize a knowledge base.
openkb add <file_or_dir>: add a file or directory.
openkb remove <doc>: remove a document and clean up related wiki pages, images, registry entries, and PageIndex state.
openkb query "question": ask a one-off question against the knowledge base.
openkb chat: enter a multi-turn conversation.
openkb watch: monitor raw/ and update automatically.
openkb lint: check knowledge base structure and content health.
openkb list: list indexed documents and concepts.
openkb status: show knowledge base statistics.

openkb chat is better than openkb query for continuous exploration. It supports session resume, session listing, deletion, and slash commands such as /status, /list, /add <path>, /save, and /lint.

Why a Markdown Wiki Matters

Many knowledge-base tools are painful because of migration cost.

Once material enters a proprietary database, index, or format, it becomes hard to inspect, edit, back up, or migrate directly. OpenKB writes the result as ordinary Markdown, which makes it naturally compatible with existing tools.

The most direct use is opening wiki/ in Obsidian:

Summary pages can be read directly.
Concept pages can connect through [[wikilinks]].
Graph view can show relationships between knowledge items.
Query results can be saved to explorations/.
AGENTS.md can define how the knowledge base should be maintained.
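Since the links are plain text, the graph that Obsidian draws can also be rebuilt with a short script. The sketch below parses `[[wikilinks]]`, including Obsidian's `[[page|alias]]` and `[[page#heading]]` forms, and reports links that point at pages that do not exist. The sample pages are made up, and whether OpenKB emits aliases or heading links is an assumption.

```python
import re
from typing import Dict, Set

# Captures the page name; ignores an optional "#heading" and an optional "|alias".
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_links(markdown: str) -> Set[str]:
    """Return the set of page names targeted by [[wikilinks]] in the text."""
    return {match.group(1).strip() for match in WIKILINK.finditer(markdown)}


def link_graph(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each page name to the set of pages it links to."""
    return {name: extract_links(body) for name, body in pages.items()}


def dangling_links(graph: Dict[str, Set[str]]) -> Set[str]:
    """Link targets that have no page of their own in the graph."""
    targets = set().union(*graph.values()) if graph else set()
    return targets - set(graph)


# Hypothetical concept pages, keyed by page name.
pages = {
    "attention": "See [[transformer]] and [[softmax|the softmax function]].",
    "transformer": "Builds on [[attention]] and [[positional encoding#sinusoidal]].",
    "softmax": "No links here.",
}

graph = link_graph(pages)
print(graph)
print("dangling:", dangling_links(graph))
```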

That makes OpenKB more than a Q&A tool. It can become a knowledge-organizing pipeline for individuals or teams.

Best-Fit Scenarios

OpenKB is especially useful for:

Reading papers and technical reports.
Organizing project documentation.
Building product research archives.
Creating documentation knowledge bases around open-source projects.
Organizing internal policies, meeting notes, and explanatory documents.
Maintaining a personal Obsidian knowledge base automatically.
Structuring long PDFs, PPTs, Word files, and web materials.

If you often face piles of documents and want more than “ask one question, get one answer,” OpenKB’s direction is a good fit: it gradually turns material into a browsable, reusable, and traceable knowledge base.

What to Watch Out For

First, OpenKB depends on LLM quality.

Summaries, concept pages, and cross-links are generated by models. Stronger models usually produce more stable knowledge compilation; weaker models may struggle with concept extraction, contradiction detection, and cross-document synthesis.

Second, estimate cost early.

If you import many long documents at once, LLM calls may become expensive. Start with a small dataset, check the output structure and quality, and then expand.

Third, the generated wiki still needs human review.

OpenKB can organize material, but it does not automatically guarantee factual correctness. Important knowledge bases still need humans to review summaries, concept pages, and references.

Fourth, be careful with sensitive material.

If you use cloud LLMs or PageIndex Cloud, pay attention to privacy, trade secrets, and compliance requirements. For internal materials, confirm the model provider, data retention policy, and access boundaries first.

Fifth, it is currently more of a CLI tool.

The roadmap mentions a future Web UI, database-backed storage, support for large collections, and hierarchical concept indexing. At this stage, if teammates are not comfortable with the command line, there is still some adoption friction.

Relationship with Obsidian, NotebookLM, and Enterprise RAG

OpenKB and Obsidian are best understood as an “automatic organization layer” plus a “reading and editing layer.”

Obsidian is good for humans to write, edit, browse, and link notes. OpenKB is good for turning raw documents into a wiki that can enter Obsidian.

OpenKB and NotebookLM differ more around local control and open file formats.

NotebookLM is more direct for quickly asking questions and generating summaries after dropping in materials. OpenKB is better for developers who want the organized result to remain in a local directory and continue evolving as Markdown.

OpenKB does not replace enterprise RAG; it complements it.

Enterprise RAG cares more about permissions, auditability, service deployment, access isolation, monitoring, and stable throughput. OpenKB is better for building a readable, editable, long-lived knowledge layer. If you later build online Q&A, the wiki generated by OpenKB can also become a higher-quality corpus.

A Recommended Workflow

If you want to try OpenKB, start like this:

Create a test knowledge base directory.
Add 3 to 5 documents on the same topic.
Run openkb add.
Open wiki/ and inspect the summaries and concept pages.
Ask a few specific questions with openkb query.
Run openkb lint to check knowledge-base health.
Open wiki/ in Obsidian and see whether the link graph is meaningful.
Once quality looks good, import a larger document collection.

Do not throw in hundreds of files at the beginning. First see whether it understands your material type well, especially tables, images, long PDFs, and multi-document concept merging.
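If you prefer to script the trial run, the command-line steps above can be laid out as a dry-run plan first. `trial_plan` is a hypothetical helper that only builds and prints the commands. It uses the subcommands listed earlier and does not execute anything; the manual inspection steps in wiki/ and Obsidian still need a human.

```python
import shlex
from typing import List


def trial_plan(doc_paths: List[str], questions: List[str]) -> List[List[str]]:
    """Return the openkb commands for a small trial run, in order, without executing them."""
    plan = [["openkb", "init"]]
    plan += [["openkb", "add", path] for path in doc_paths]
    plan += [["openkb", "query", question] for question in questions]
    plan.append(["openkb", "lint"])
    return plan


plan = trial_plan(["paper.pdf"], ["What are the main findings?"])
for argv in plan:
    print(shlex.join(argv))
# To execute a step inside the knowledge base directory:
#   import subprocess; subprocess.run(argv, check=True)
```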

Summary

OpenKB’s value is that it moves an LLM knowledge base one step earlier than “assemble context at query time”: organize the material into a wiki first, then ask questions, chat, lint, and keep maintaining that wiki.

This direction is not right for every Q&A system, but it is well suited to knowledge work that needs long-term accumulation. Markdown files, Obsidian compatibility, PageIndex long-document handling, multi-model support, and a CLI workflow combine into a useful tool for developers and research-oriented users.

If you have many PDFs, reports, web pages, papers, and project documents, OpenKB is worth trying. It may not immediately replace a mature enterprise knowledge base, but it can become a practical entry point for organizing material: first turn documents into readable, linked, traceable knowledge, then let the LLM work on top of that knowledge.

References: