VectifyAI/PageIndex is an interesting RAG project. Instead of starting with “build another vector database,” it first organizes long documents into a tree structure similar to a table of contents, then lets an LLM perform reasoning-based retrieval along that tree.
Project link: VectifyAI/PageIndex
At the time of writing, the GitHub page shows about 31.8k stars and 2.7k forks, with an MIT license. The README positions it as “Vectorless, Reasoning-based RAG”: retrieval driven by reasoning rather than by a vector database.
What Problem It Tries to Solve
The common path for traditional RAG is: chunk the document, vectorize the chunks, store them in a vector database, then retrieve passages by similarity search. This approach is simple, general, and mature, but it often runs into several problems with long professional documents:
- Similarity is not the same as true relevance.
- Document structure is broken apart by chunking, and section relationships are lost.
- Retrieval results are hard to explain, making it difficult to say why a passage was selected.
- For financial reports, regulatory filings, legal documents, and technical manuals, questions often require reasoning across sections.
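To make the traditional path concrete, here is a minimal sketch of chunk, vectorize, and similarity search. It is a toy under stated assumptions: the `embed` function is a bag-of-words counter standing in for a real embedding model, the in-memory list stands in for a vector database, and all names are illustrative.

```python
import math
import re
from collections import Counter


def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into fixed-length character chunks, ignoring section boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(chunks: list[str], query: str, k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]


document = (
    "1. Revenue. Revenue grew in the fiscal year. "
    "2. Risk Factors. Currency risk may reduce revenue next year."
)
chunks = chunk_fixed(document, 40)  # chunk borders fall wherever the count lands
top = retrieve(chunks, "revenue growth")
print(chunks)
print(top)
```

Printing `chunks` shows the second and third problems from the list: the fixed-length cut ignores where sections begin and end, and the ranking offers no explanation beyond a similarity score.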
PageIndex takes the opposite route: first organize the document into a semantic tree, then let the model search it the way a human reader would, by scanning the table of contents, jumping into a section, and narrowing down to the details.
The Basic PageIndex Workflow
The README describes PageIndex retrieval in two steps:
- Generate a Table-of-Contents-like tree index for the document.
- Perform reasoning-based retrieval through tree search.
This tree is not just a file directory. It is a document structure designed for LLM use. Nodes can contain titles, page ranges, summaries, child nodes, and other metadata. When answering a question, the model does not need to face a pile of fragmented chunks immediately. It can first decide which section to enter, then continue searching downward.
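As a rough illustration of such a tree, the sketch below models a node with a title, page range, summary, and children. The class name `TreeNode`, its field names, and the sample report are assumptions made for this example; they are not the schema PageIndex actually emits.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TreeNode:
    """One section of a document (hypothetical field names)."""
    title: str
    start_page: int
    end_page: int
    summary: str = ""
    children: list["TreeNode"] = field(default_factory=list)


def outline(node: TreeNode, depth: int = 0) -> list[str]:
    """Render the tree as an indented table of contents."""
    lines = [f"{'  ' * depth}{node.title} (pp. {node.start_page}-{node.end_page})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


report = TreeNode("Annual Report", 1, 120, "Full-year results.", [
    TreeNode("Financial Statements", 40, 90, "Audited statements.", [
        TreeNode("Balance Sheet", 42, 45, "Assets and liabilities."),
    ]),
    TreeNode("Risk Factors", 91, 120, "Market and regulatory risk."),
])

print("\n".join(outline(report)))
print(json.dumps(asdict(report), indent=2))  # the nested form a model could read
```

The JSON form is the point: a model can be shown titles and summaries for one level of the tree and asked where to go next, without seeing the full text.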
This method is better suited to documents that are well structured but very long, such as:
- Financial reports and SEC filings.
- Regulatory and compliance documents.
- Academic textbooks and papers.
- Legal documents.
- Technical manuals and product documentation.
- Large PDFs that exceed the model context window.
How It Differs From Traditional Vector RAG
PageIndex’s main selling points can be summarized in five areas.
First, it does not require a Vector DB. It relies on document structure and LLM reasoning to locate content, rather than only using vector similarity search.
Second, it does not use traditional chunking. Documents are organized by natural sections instead of fixed-length text fragments.
Third, explainability is stronger. The retrieval path can map back to pages, sections, and tree nodes, making it easier to trace than “this text was hit by vector similarity.”
Fourth, retrieval is context-aware. The question, conversation history, and domain background can all affect the tree search path.
Fifth, it is closer to how human experts read documents. People usually do not cut an entire document into small chunks and calculate similarity; they first inspect the table of contents, locate sections, and then read details.
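The traceable retrieval path from the third point can be sketched as a top-down walk that records every node it enters. In this sketch the model's decision is replaced by a simple word-overlap score in `pick_child`, purely so the example runs offline; a real system would ask an LLM at that step. The `Node` class and the sample tree are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """A section with a page range (hypothetical structure)."""
    title: str
    pages: tuple[int, int]
    summary: str = ""
    children: list["Node"] = field(default_factory=list)


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def pick_child(question: str, children: list[Node]) -> Node:
    """Stand-in for the LLM's choice: the child sharing the most words with the question."""
    q = words(question)
    return max(children, key=lambda c: len(q & words(f"{c.title} {c.summary}")))


def tree_search(root: Node, question: str) -> list[Node]:
    """Descend from the root to a leaf, keeping the full path for later inspection."""
    path = [root]
    while path[-1].children:
        path.append(pick_child(question, path[-1].children))
    return path


root = Node("Annual Report", (1, 120), "", [
    Node("Financial Statements", (40, 90), "Balance sheet, income statement, and cash flow.", [
        Node("Balance Sheet", (42, 45), "Assets and liabilities at year end."),
        Node("Cash Flow Statement", (46, 50), "Operating, investing, and financing cash flow."),
    ]),
    Node("Risk Factors", (91, 120), "Market, currency, and regulatory risk."),
])

path = tree_search(root, "What was the operating cash flow?")
trace = " > ".join(f"{n.title} (pp. {n.pages[0]}-{n.pages[1]})" for n in path)
print(trace)
```

The returned path is the explanation: each hop names a section and its pages, so a reviewer can check why the final node was selected.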
This does not mean vector databases have no value. A more accurate view is that PageIndex fits long-document retrieval scenarios where semantic similarity alone is not enough and document structure and reasoning also need to take part.
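One way to picture the context-aware point above is a node-selection prompt that carries the question, earlier turns, and domain background alongside the candidate sections. The function below only assembles a string; its name, parameters, and wording are assumptions for illustration and are not taken from the project's prompts.

```python
def build_selection_prompt(
    question: str,
    children: list[tuple[str, str]],
    history: tuple[str, ...] = (),
    domain: str = "",
) -> str:
    """Assemble a prompt asking a model which child section to enter next."""
    lines = []
    if domain:
        lines.append(f"Domain background: {domain}")
    for turn in history:
        lines.append(f"Earlier turn: {turn}")
    lines.append(f"Question: {question}")
    lines.append("Candidate sections:")
    for number, (title, summary) in enumerate(children, start=1):
        lines.append(f"{number}. {title}: {summary}")
    lines.append("Reply with the number of the section most likely to contain the answer.")
    return "\n".join(lines)


prompt = build_selection_prompt(
    question="And how did that change year over year?",
    children=[
        ("Financial Statements", "Audited statements."),
        ("Risk Factors", "Market and regulatory risk."),
    ],
    history=("What was the operating cash flow?",),
    domain="SEC annual filings",
)
print(prompt)
```

The follow-up question in the example is meaningless on its own; only the earlier turn tells the model that “that” refers to operating cash flow, which is why the history can change the path taken through the tree.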
How to Run It Locally
The README provides a local self-hosting path. The first step is to install the project's dependencies as described there.
Then create a .env file in the project root and put your LLM API key in it. The project supports multiple models through LiteLLM.
The next step is to generate a PageIndex structure for a PDF, using the command given in the README.
Markdown input is also supported.
The README also lists the common optional parameters for this command.
The README also notes that the local open-source version uses standard PDF parsing. For complex PDFs, the project’s cloud service provides enhanced OCR, tree building, and retrieval pipelines.
Agentic Vectorless RAG Example
The project also provides an agentic vectorless RAG example that uses self-hosted PageIndex and the OpenAI Agents SDK. Running it requires installing an optional dependency first; the README gives the exact commands.
The value of this example is that it pushes PageIndex from “generate a document tree” to “let an Agent use the document tree for retrieval.” If you are building an enterprise knowledge base, financial report Q&A, regulatory Q&A, or a technical documentation Agent, it is worth running this example rather than only reading the README.
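To give a feel for what “an Agent uses the document tree” can mean, the sketch below exposes a tree through two small tool-style functions: one lists the children of a node, the other reads a bounded number of pages from a node. It is plain Python and does not call the OpenAI Agents SDK or PageIndex; `TREE`, `PAGES`, `list_sections`, and `read_section` are all hypothetical names.

```python
# Hypothetical in-memory index: node id -> section metadata.
TREE = {
    "0": {"title": "User Manual", "pages": (1, 60), "children": ["1", "2"]},
    "1": {"title": "Installation", "pages": (3, 20), "children": []},
    "2": {"title": "Troubleshooting", "pages": (21, 60), "children": []},
}
# Placeholder page text keyed by page number.
PAGES = {p: f"text of page {p}" for p in range(1, 61)}


def list_sections(node_id: str) -> list[dict]:
    """Tool 1: show the agent the child sections of a node."""
    return [
        {"id": cid, "title": TREE[cid]["title"], "pages": TREE[cid]["pages"]}
        for cid in TREE[node_id]["children"]
    ]


def read_section(node_id: str, max_pages: int = 3) -> str:
    """Tool 2: return the text of the first max_pages pages of a node."""
    start, end = TREE[node_id]["pages"]
    last = min(end, start + max_pages - 1)
    return "\n".join(PAGES[p] for p in range(start, last + 1))


print(list_sections("0"))
print(read_section("1", max_pages=2))
```

An agent loop would call `list_sections` to decide where to go and `read_section` only once it has narrowed the search, which keeps the text placed in the context window small.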
Cloud Service, MCP, and API
PageIndex is not just a GitHub repo. The project page also lists several entry points:
- Self-hosting: run the open-source code locally, suitable for experiments and controlled deployments.
- Chat Platform: a ChatGPT-style document analysis platform.
- MCP / API: useful for integrating with existing Agents or automation workflows.
- Enterprise: for private or on-premises deployment.
This shows that the project is positioned as more than a simple demo. It aims to turn “reasoning-based document retrieval” into document intelligence infrastructure that other systems can integrate.
Suitable Scenarios
PageIndex is suitable for tasks such as:
- Long PDF Q&A.
- Financial reports, annual reports, prospectuses, and regulatory filing analysis.
- Legal and compliance document retrieval.
- Technical manual Q&A.
- Multi-section textbook or paper retrieval.
- Enterprise knowledge bases that need explainable retrieval paths.
- Providing structured document context to Agents.
If your material is short, has little structure, or is just a normal FAQ, traditional embedding + vector DB may already be enough. PageIndex’s advantages are more likely to show up with long, strongly structured documents in professional domains and with questions that require reasoning.
Things to Watch
First, PageIndex still depends on LLMs. Tree building, summaries, and retrieval quality are affected by model capability, prompts, and document parsing quality.
Second, the local version uses standard PDF parsing. Complex scanned documents, chart-heavy PDFs, or messy layouts may require OCR and stronger preprocessing.
Third, vectorless does not mean zero cost. Tree building itself also consumes model calls and time, especially for large-scale document collections.
Fourth, PageIndex is best seen as a framework for document structure indexing and reasoning-based retrieval. It does not directly replace every RAG stack. In production, it may also be combined with vector retrieval, keyword retrieval, permission control, caching, and audit systems.
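The third point, that tree building is not free, can be made tangible with a back-of-envelope estimate. Every number below is a placeholder assumption, not a measurement of PageIndex, and the formula counts only one pass over the input text; a real pipeline may make several passes and also pays for output tokens.

```python
import math


def estimate_indexing_cost(
    pages: int,
    pages_per_call: int,
    tokens_per_page: int,
    price_per_million_tokens: float,
) -> tuple[int, float]:
    """Return (model calls, input-token cost) for one pass over a document."""
    calls = math.ceil(pages / pages_per_call)
    tokens = pages * tokens_per_page
    return calls, tokens * price_per_million_tokens / 1_000_000


# Placeholder inputs: a 300-page filing, 10 pages per call, 800 tokens per page.
calls, cost = estimate_indexing_cost(
    pages=300, pages_per_call=10, tokens_per_page=800, price_per_million_tokens=2.0
)
print(calls, round(cost, 2))
```

Multiplying the result by the number of documents in a collection gives a first feel for whether indexing everything up front is affordable.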
Summary
What makes PageIndex interesting is that it shifts RAG from “text similarity retrieval” toward “document structure + LLM reasoning.” For long and professional documents, this direction is worth watching.
If you are building enterprise document Q&A, financial report analysis, regulatory retrieval, or technical manual Agents, PageIndex is a RAG architecture worth using as a reference: give documents structure first, then let the model reason along that structure, instead of breaking everything into chunks and loading it all into a vector database from the start.