Document Chunking and Retrieval Quality in RAG

Quick answer

In a retrieval-augmented generation (RAG) system, documents are split into chunks before they are embedded and stored. Each chunk becomes an independently retrievable unit — the atomic piece of content the system can surface in response to a query. This splitting process is called chunking, and it is one of the most consequential decisions in RAG system design. Poor chunking degrades retrieval quality regardless of how capable the embedding model, vector database or language model is. Good chunking makes a well-engineered system dramatically more accurate and useful.

What this means

When a user asks a question, the RAG system searches for chunks whose embeddings are most similar to the query embedding. The system can only retrieve what exists as a chunk. If the relevant information is split across two chunks by an arbitrary boundary, neither chunk may match the query well enough to be retrieved. If chunks are too small, they lack the context needed for the embedding to accurately represent their meaning. If chunks are too large, they may match the query loosely but return large volumes of irrelevant surrounding content.
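
To make the retrieval step concrete, here is a minimal sketch of similarity search over chunk embeddings. The three-dimensional vectors and chunk names are invented for illustration; a real system would get its vectors from an embedding model and would usually delegate the search to a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()),
        reverse=True,
    )
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical embeddings: one focused chunk per topic plus one diluted chunk.
chunk_vecs = {
    "leave-policy": [0.9, 0.1, 0.0],
    "expense-policy": [0.1, 0.9, 0.1],
    "mixed": [0.5, 0.5, 0.5],
}
query_vec = [1.0, 0.0, 0.0]  # a query that is purely about leave

ranked = top_k(query_vec, chunk_vecs, k=2)
# ranked == ["leave-policy", "mixed"]
```

The "mixed" chunk stands in for an oversized chunk covering several topics: it still ranks, but with a weaker score than the focused chunk.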

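The chunking decision is upstream of every other retrieval quality concern. A better embedding model or a more sophisticated re-ranker cannot fully compensate for it: a re-ranker can only reorder the chunks that were retrieved, and it cannot restore context that a bad boundary cut away. Chunking must be designed deliberately for each document type and query pattern.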

Why it matters for business

Enterprise organisations accumulate documents with very different structures: dense legal contracts, tabular financial reports, narrative policy manuals, structured FAQ pages, technical specifications with mixed prose and tables, and scanned legacy documents. A single chunking strategy applied uniformly to all of these will perform acceptably on some and poorly on others.

When retrieval quality is poor, the consequences are direct and visible: the AI assistant gives incorrect or incomplete answers, users lose confidence in the system, and the promised productivity benefits fail to materialise. In regulated sectors — compliance documentation, HR policy, medical protocols — a system that retrieves and presents outdated or partially relevant content may create genuine risk, not merely inconvenience.

The organisational cost of poor chunking is frequently misattributed. When a RAG system produces inaccurate answers, teams often suspect the language model's capability rather than the retrieval pipeline. Diagnosing retrieval quality separately from generation quality is essential for identifying the true source of failure.

How it works technically

Chunking strategies range from simple to sophisticated:

Fixed-size chunking: The document is split every N tokens or characters, with optional overlap. This is the simplest approach and the default in many frameworks. It is fast and consistent but ignores document structure — it may cut mid-sentence, mid-table or mid-list with no regard for semantic coherence.

Sentence-based chunking: Text is split at sentence boundaries, and adjacent sentences are then grouped into fixed-size windows so that each chunk reaches a minimum meaningful size. This works better than character splitting for prose documents but is less effective for structured content such as tables and lists.
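
A minimal sketch of sentence-based chunking follows. The regular-expression splitter is a deliberate simplification (it will mis-split abbreviations such as "e.g."), and the window size of two sentences is an arbitrary illustrative value.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_chunks(text, sentences_per_chunk=2):
    """Group consecutive sentences into windows of a fixed sentence count."""
    sentences = split_sentences(text)
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

sample = "A one. B two! C three? D four. E five."
chunks = sentence_chunks(sample, sentences_per_chunk=2)
# chunks == ["A one. B two!", "C three? D four.", "E five."]
```

In practice a dedicated sentence tokenizer would likely replace `split_sentences`; the windowing logic stays the same.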

Semantic chunking: An embedding model is used to detect where meaning shifts in the document — boundaries are placed where the semantic similarity between consecutive sentences drops below a threshold. More computationally expensive but better preserves thematic coherence within chunks.
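
The boundary rule described above can be sketched as follows. To keep the example self-contained, `toy_embed` is a bag-of-words stand-in for a real embedding model, and the threshold of 0.2 is an arbitrary illustrative value; both are assumptions, not recommendations.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Start a new chunk wherever similarity between consecutive sentences drops below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) < threshold:
            groups.append([sentence])      # meaning shifted: open a new chunk
        else:
            groups[-1].append(sentence)    # same topic: extend the current chunk
        previous = current
    return [" ".join(group) for group in groups]

sentences = [
    "Annual leave accrues monthly.",
    "Unused annual leave carries over.",
    "Expense claims need receipts.",
    "Submit expense claims within thirty days.",
]
topic_chunks = semantic_chunks(sentences)
# Two chunks: the leave sentences together, then the expense sentences together.
```

Passing a real model through the `embed` parameter would require a matching similarity function for dense vectors; the boundary logic itself would not change.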

Structural chunking: Document structure is respected explicitly — chunks align with headings, sections, paragraphs, list items, table rows or other structural elements. For well-structured documents (policies, manuals, specifications), this typically produces the best retrieval results because the chunks match how the document's authors organised information.
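
As a minimal sketch of structural chunking, the function below splits on headings and keeps each heading as metadata alongside its section text. It assumes headings are marked with a leading "#" (as in Markdown); real documents would need whatever structural signal their parser exposes.

```python
def chunk_by_heading(text, marker="#"):
    """Produce one chunk per headed section, carrying the heading as metadata."""
    chunks = []
    heading, body = None, []
    for line in text.splitlines():
        if line.startswith(marker):
            if body:  # close the previous section before starting a new one
                chunks.append({"heading": heading, "text": " ".join(body)})
            heading, body = line.lstrip(marker).strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:  # close the final section
        chunks.append({"heading": heading, "text": " ".join(body)})
    return chunks

document = (
    "# Leave\n"
    "Staff accrue leave monthly.\n"
    "Carry-over is capped.\n"
    "\n"
    "# Expenses\n"
    "Keep all receipts.\n"
)
sections = chunk_by_heading(document)
# sections[0] == {"heading": "Leave", "text": "Staff accrue leave monthly. Carry-over is capped."}
```

Text that appears before the first heading is kept with a heading of `None` rather than discarded.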

Hierarchical chunking: Both a small chunk (for precise matching) and its containing larger parent chunk (for complete context) are stored. Retrieval uses the small chunk for similarity matching; the parent chunk is passed to the language model for context. This addresses the tension between retrieval precision and generation context.
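
The parent/child arrangement can be sketched as below. This is an illustration under stated assumptions: word-overlap scoring stands in for embedding similarity, and the child size of eight words is arbitrary.

```python
def build_hierarchy(sections, child_words=8):
    """Store each section as a parent and split it into small child chunks that point back to it."""
    parents, children = {}, []
    for parent_id, text in sections.items():
        parents[parent_id] = text
        words = text.split()
        for i in range(0, len(words), child_words):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[i:i + child_words]),
            })
    return parents, children

def retrieve_parent(query, parents, children):
    """Match on the small child chunk, then return its full parent for the language model."""
    query_words = set(query.lower().split())
    best = max(children, key=lambda c: len(query_words & set(c["text"].lower().split())))
    return parents[best["parent_id"]]

sections = {
    "leave": "Annual leave accrues monthly and unused days carry over to the next year up to a cap",
    "expenses": "Expense claims need receipts and must be submitted within thirty days of purchase",
}
parents, children = build_hierarchy(sections)
context = retrieve_parent("when must expense claims be submitted", parents, children)
# context is the full "expenses" section, although only its first child matched the query.
```

The match happens on a short child chunk, but the 30-day deadline in the second half of the section still reaches the language model because the whole parent is returned.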

Overlap between adjacent chunks, typically 10–20% of chunk size, preserves context at boundaries. The sentence that ends one chunk and the sentence that begins the next may together form a single coherent thought; repeating a short span of text on both sides of the boundary makes it more likely that at least one chunk contains that thought whole.
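
A minimal sketch of fixed-size chunking with overlap is shown below. It works on any pre-tokenised list; the defaults of 400 tokens with 60 tokens of overlap (15%) are illustrative values, not tuned settings.

```python
def fixed_size_chunks(tokens, chunk_size=400, overlap=60):
    """Split a token list into windows of chunk_size that share `overlap` tokens with their neighbour."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

tokens = list(range(10))  # stand-in for real token ids
windows = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
# windows == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window starts `chunk_size - overlap` tokens after the previous one, so larger overlap means more chunks and more storage for the same document.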

Practical implementation considerations

Different document types benefit from different chunking approaches. A useful framework:

Document type	Recommended approach
Policy/procedure manuals	Structural chunking by section/heading
Contracts and legal documents	Structural chunking by clause; overlap at clause boundaries
Financial reports	Structural chunking preserving table integrity; separate table/prose handling
FAQ pages	One Q&A pair per chunk
Technical specifications	Structural + hierarchical; preserve context for code or formula references
Scanned or poorly structured PDFs	Fixed-size with generous overlap; invest in pre-processing first
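
One way to act on the table above is a simple routing map from document type to chunking strategy. The keys and strategy names here are hypothetical labels invented for this sketch; the fallback to fixed-size chunking is likewise an assumption.

```python
# Hypothetical labels mirroring the table; not identifiers from any library.
STRATEGY_BY_TYPE = {
    "policy_manual": "structural_by_heading",
    "contract": "structural_by_clause",
    "financial_report": "structural_tables_separate",
    "faq": "one_qa_pair_per_chunk",
    "technical_spec": "structural_plus_hierarchical",
    "scanned_pdf": "fixed_size_generous_overlap",
}

def choose_strategy(doc_type, default="fixed_size"):
    """Look up the chunking strategy for a document type, falling back to a default."""
    return STRATEGY_BY_TYPE.get(doc_type, default)

strategy = choose_strategy("contract")
# strategy == "structural_by_clause"
```

In a real pipeline the values would more likely be chunking functions than strings, with document type assigned during ingestion.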

The practical implementation process should include:

Sample and categorise your document corpus before choosing a strategy. Different document types may require different pipelines.
Set a baseline using fixed-size chunking, then measure retrieval quality on a test set of representative queries. This gives you something to improve against.
Evaluate iteratively: change one chunking parameter at a time and measure the effect on retrieval metrics (precision at k, recall at k) before committing to a configuration.
Invest in pre-processing for complex formats: PDFs with multi-column layouts, embedded tables and footnotes require layout-aware parsing (tools such as Unstructured, Azure Document Intelligence or AWS Textract) before chunking can be applied meaningfully.
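
The retrieval metrics mentioned in the steps above can be computed with a few lines of code. This sketch assumes each test query has a hand-labelled set of relevant chunk ids; the ids below are invented for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)

retrieved = ["c3", "c7", "c1", "c9"]   # ranked retriever output for one query
relevant = {"c1", "c3", "c5"}          # hand-labelled ground truth for that query

p = precision_at_k(retrieved, relevant, k=3)  # 2 of the top 3 are relevant
r = recall_at_k(retrieved, relevant, k=3)     # 2 of the 3 relevant chunks were found
```

Averaging these values over the whole test set, before and after each single-parameter change, gives the comparison the steps above call for.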

When designing document processing pipelines for RAG implementations, Edison AI's AI implementation team treats chunking strategy as a first-class design decision — not a framework default — because it is the layer most directly responsible for production retrieval quality.

Common mistakes

Using framework defaults without validation. Convenience defaults, such as those of LangChain's RecursiveCharacterTextSplitter, are starting points, not optimal configurations. Always validate against your actual documents and queries.
Applying the same chunking strategy to all document types. A strategy optimised for policy prose will perform poorly on financial tables or structured data sheets. Build document-type-aware pipelines.
Setting chunk size based on model context window rather than retrieval requirements. The context window constrains how many chunks can fit in a single prompt; it does not tell you the optimal chunk size for retrieval. These are separate decisions.
Neglecting pre-processing quality. Chunking a corrupted or poorly extracted document produces corrupted chunks. Garbage in, garbage out — pre-processing is not optional.
Not measuring retrieval quality independently. If you can only measure end-to-end answer quality, you cannot determine whether errors come from retrieval or generation. Build evaluation pipelines that measure retrieval precision and recall independently.

What leaders should do next

Before finalising the technical design of any RAG system, conduct a structured audit of the documents the system will process: How many distinct document types exist? What are their formats and structural characteristics? Are any scanned or poorly formatted? Use the answers to define a chunking strategy for each document category, not a single universal approach. Allocate evaluation effort — and time — to testing retrieval quality on representative queries before the system is presented to end users.

Frequently asked

Questions, answered.

What is document chunking in RAG?

Document chunking is the process of splitting source documents into smaller passages before embedding and storing them in a vector database. Each chunk becomes an independently retrievable unit. The size, boundaries and overlap of chunks directly determine what content the retrieval system can surface in response to a query.

What is the best chunk size for RAG?

There is no universal best chunk size: it depends on document type, query patterns and the embedding model used. As a starting point, 300–600 tokens with 10–20% overlap between adjacent chunks works well for most prose documents. Empirical evaluation against representative queries is more reliable than relying on default values.

What causes poor retrieval quality in RAG systems?

The most common causes are: chunk boundaries that cut through natural semantic units (mid-sentence, mid-table, mid-list), chunks that are too small to carry meaningful context, chunks that are too large to match narrow queries, and missing or inconsistent metadata. Poor pre-processing of source documents, especially scanned PDFs or complex layouts, compounds all of these problems.
