Text Chunking
Split long texts into manageable chunks for LLM processing, embedding generation, and semantic search applications.
Why Text Chunking?
Text chunking is essential for working with LLMs and embeddings because:
- Context Window Limits: LLMs have maximum input lengths (e.g., 8K, 128K tokens)
- Better Embeddings: Smaller, focused chunks create more precise semantic representations
- Improved Search: Granular chunks allow finding specific relevant passages
- Cost Optimization: Process only relevant chunks instead of entire documents
- RAG Systems: Retrieve the most relevant chunks to augment LLM responses
Chunking Strategies
By Max Size
Fixed-size chunks with optional sentence preservation
Best for: Consistent chunk sizes, token limits
By Sentences
Split at sentence boundaries
Best for: Maintaining grammatical structure
By Paragraphs
Split at paragraph boundaries
Best for: Preserving topic coherence
By Semantics
AI-powered semantic similarity chunking
Best for: Topically coherent chunks (advanced)
TextChunks Model
All chunking methods return a TextChunks object with a num_chunks count and a chunk_list; each chunk exposes text, num_characters, and num_words:
from SimplerLLM.tools.text_chunker import chunk_by_sentences
text = "First sentence. Second sentence. Third sentence."
result = chunk_by_sentences(text)
# Access chunk information
print(f"Total chunks: {result.num_chunks}")
# Iterate through chunks
for i, chunk in enumerate(result.chunk_list, 1):
print(f"\nChunk {i}:")
print(f" Text: {chunk.text}")
print(f" Characters: {chunk.num_characters}")
print(f" Words: {chunk.num_words}")
Chunking by Max Size
Split text into chunks of a specified maximum size:
Basic Usage
from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size
text = """
Long document text here... This could be thousands of words from a PDF,
web article, or any other source. The chunker will split it into
manageable pieces based on the max_chunk_size parameter.
"""
# Chunk without preserving sentences (faster)
chunks = chunk_by_max_chunk_size(
    text=text,
    max_chunk_size=500,
    preserve_sentence_structure=False
)
print(f"Created {chunks.num_chunks} chunks")
for i, chunk in enumerate(chunks.chunk_list, 1):
print(f"\nChunk {i}: {chunk.num_characters} chars, {chunk.num_words} words")
print(chunk.text[:100] + "...")
Preserving Sentence Structure
from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size
text = """
Artificial intelligence is transforming industries. Machine learning
enables computers to learn from data. Deep learning uses neural networks
to solve complex problems. These technologies are advancing rapidly.
"""
# Chunk while preserving sentence boundaries
chunks = chunk_by_max_chunk_size(
    text=text,
    max_chunk_size=200,
    preserve_sentence_structure=True  # Respects sentence endings
)
for chunk in chunks.chunk_list:
print(f"Chunk ({chunk.num_words} words):")
print(chunk.text)
print("-" * 50)
Chunking by Sentences
Split text at sentence boundaries for grammatically complete chunks:
from SimplerLLM.tools.text_chunker import chunk_by_sentences
text = """
SimplerLLM makes AI development easy. It provides a unified interface
for multiple LLM providers. You can build powerful applications quickly.
The library handles complexity for you.
"""
chunks = chunk_by_sentences(text)
print(f"Split into {chunks.num_chunks} sentences\n")
for i, chunk in enumerate(chunks.chunk_list, 1):
print(f"{i}. {chunk.text} ({chunk.num_words} words)")
Chunking by Paragraphs
Split text at paragraph boundaries to maintain topical coherence:
from SimplerLLM.tools.text_chunker import chunk_by_paragraphs
text = """
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that enables
computers to learn from data without explicit programming.
Types of Machine Learning
There are three main types: supervised learning, unsupervised learning,
and reinforcement learning. Each has its own use cases.
Applications
Machine learning powers recommendation systems, image recognition,
natural language processing, and many other applications.
"""
chunks = chunk_by_paragraphs(text)
print(f"Split into {chunks.num_chunks} paragraphs\n")
for i, chunk in enumerate(chunks.chunk_list, 1):
print(f"Paragraph {i}:")
print(f"{chunk.text}\n")
print(f"Stats: {chunk.num_words} words, {chunk.num_characters} chars\n")
print("-" * 60)
Semantic Chunking (Advanced)
Use AI to split text based on semantic similarity, creating topically coherent chunks:
from SimplerLLM.tools.text_chunker import chunk_by_semantics
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
# Create embeddings instance
embeddings = EmbeddingsOpenAI()
text = """
Machine learning is transforming healthcare. Doctors use AI to diagnose
diseases more accurately. Medical imaging analysis has improved dramatically.
The weather today is sunny and warm. Temperature is around 75 degrees.
It's a perfect day for outdoor activities.
Back to AI, natural language processing enables computers to understand
human language. Chatbots and virtual assistants are becoming more capable.
"""
# Chunk by semantic similarity
chunks = chunk_by_semantics(
    text=text,
    llm_embeddings_instance=embeddings,
    threshold_percentage=90  # Higher = more chunks (90 is default)
)
print(f"Created {chunks.num_chunks} semantic chunks\n")
for i, chunk in enumerate(chunks.chunk_list, 1):
print(f"Chunk {i} ({chunk.num_words} words):")
print(chunk.text)
print("-" * 60 + "\n")
# Notice: AI-related sentences grouped together, weather separated
Tuning Semantic Chunking
from SimplerLLM.tools.text_chunker import chunk_by_semantics
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
embeddings = EmbeddingsOpenAI()
text = "Your long text here..."
# More chunks (stricter similarity threshold)
fine_chunks = chunk_by_semantics(
    text,
    llm_embeddings_instance=embeddings,
    threshold_percentage=95
)
# Fewer chunks (looser similarity threshold)
coarse_chunks = chunk_by_semantics(
    text,
    llm_embeddings_instance=embeddings,
    threshold_percentage=80
)
print(f"95% threshold: {fine_chunks.num_chunks} chunks")
print(f"80% threshold: {coarse_chunks.num_chunks} chunks")
Real-World Examples
RAG System with Optimal Chunking
from SimplerLLM.tools.generic_loader import load_content
from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
from SimplerLLM.vectors.vector_db import VectorDB
from SimplerLLM.language.llm import LLM, LLMProvider
class OptimizedRAG:
    def __init__(self, chunk_size=500):
        self.chunk_size = chunk_size
        self.embeddings = EmbeddingsOpenAI()
        self.vector_db = VectorDB.create(
            provider='local',
            embeddings_instance=self.embeddings
        )
        self.llm = LLM.create(
            provider=LLMProvider.OPENAI,
            model_name="gpt-4o"
        )

    def add_document(self, source):
        """Load, chunk, and index a document"""
        # Load content
        doc = load_content(source)
        print(f"Loaded: {doc.title or source}")
        print(f"Total words: {doc.word_count}")

        # Chunk the content
        chunks = chunk_by_max_chunk_size(
            text=doc.content,
            max_chunk_size=self.chunk_size,
            preserve_sentence_structure=True
        )
        print(f"Created {chunks.num_chunks} chunks")

        # Generate embeddings and store
        for i, chunk in enumerate(chunks.chunk_list):
            embedding = self.embeddings.generate_embeddings(chunk.text)
            self.vector_db.add(
                vector=embedding,
                metadata={
                    'text': chunk.text,
                    'source': doc.title or source,
                    'chunk_index': i,
                    'word_count': chunk.num_words
                }
            )
        print(f"✓ Indexed {chunks.num_chunks} chunks\n")

    def query(self, question, top_k=3):
        """Query with chunk-level retrieval"""
        query_embedding = self.embeddings.generate_embeddings(question)
        results = self.vector_db.search(query_embedding, top_k=top_k)

        # Build context from relevant chunks
        context = "\n\n".join([
            f"[From {r['metadata']['source']}]\n{r['metadata']['text']}"
            for r in results
        ])

        prompt = f"""Answer based on these relevant excerpts:
{context}
Question: {question}
Answer:"""

        answer = self.llm.generate_response(prompt=prompt)
        return answer, results
# Usage
rag = OptimizedRAG(chunk_size=500)
# Add documents
rag.add_document("ai_research_paper.pdf")
rag.add_document("https://blog.example.com/ml-guide")
# Query
answer, sources = rag.query("What are the key machine learning algorithms?")
print(f"Answer: {answer}\n")
print("Based on chunks from:")
for s in sources:
print(f"- {s['metadata']['source']} (chunk {s['metadata']['chunk_index']})")
Comparing Chunking Strategies
from SimplerLLM.tools.text_chunker import (
    chunk_by_max_chunk_size,
    chunk_by_sentences,
    chunk_by_paragraphs,
    chunk_by_semantics
)
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
from SimplerLLM.tools.generic_loader import load_content
def compare_chunking_strategies(source):
    """Compare different chunking approaches"""
    # Load document
    doc = load_content(source)
    text = doc.content
    print(f"Document: {doc.word_count} words\n")

    # Strategy 1: Max size
    max_size_chunks = chunk_by_max_chunk_size(
        text, max_chunk_size=500, preserve_sentence_structure=True
    )
    print("Max Size (500 chars):")
    print(f" Chunks: {max_size_chunks.num_chunks}")
    print(f" Avg words/chunk: {doc.word_count // max_size_chunks.num_chunks}")

    # Strategy 2: Sentences
    sentence_chunks = chunk_by_sentences(text)
    print("\nSentences:")
    print(f" Chunks: {sentence_chunks.num_chunks}")
    avg_words = sum(c.num_words for c in sentence_chunks.chunk_list) // sentence_chunks.num_chunks
    print(f" Avg words/chunk: {avg_words}")

    # Strategy 3: Paragraphs
    paragraph_chunks = chunk_by_paragraphs(text)
    print("\nParagraphs:")
    print(f" Chunks: {paragraph_chunks.num_chunks}")
    avg_words = sum(c.num_words for c in paragraph_chunks.chunk_list) // paragraph_chunks.num_chunks
    print(f" Avg words/chunk: {avg_words}")

    # Strategy 4: Semantic (requires embeddings)
    embeddings = EmbeddingsOpenAI()
    semantic_chunks = chunk_by_semantics(text, embeddings, threshold_percentage=90)
    print("\nSemantic:")
    print(f" Chunks: {semantic_chunks.num_chunks}")
    avg_words = sum(c.num_words for c in semantic_chunks.chunk_list) // semantic_chunks.num_chunks
    print(f" Avg words/chunk: {avg_words}")
# Usage
compare_chunking_strategies("research_paper.pdf")
Choosing a Chunking Strategy
Decision Guide
Use Max Size when:
- You need consistent chunk sizes for embeddings
- Working with token limits
- Speed is important (fastest method)
- Content structure is irregular
Use Sentences when:
- Chunks should be grammatically complete
- Processing short-form content
- Fine-grained search is needed
Use Paragraphs when:
- Content has clear paragraph structure
- Topics align with paragraphs
- You want to preserve the original document structure
Use Semantic when:
- Maximum retrieval accuracy is critical
- Topics don't align with paragraphs
- The cost of embeddings API calls is acceptable
- You're building high-quality RAG systems
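If you want this decision encoded in one place, a small dispatcher works well. The choose_chunks helper and its strategy argument below are our own illustration, not part of SimplerLLM:
from SimplerLLM.tools.text_chunker import (
    chunk_by_max_chunk_size,
    chunk_by_sentences,
    chunk_by_paragraphs,
    chunk_by_semantics
)

# Hypothetical helper mapping a strategy name to the matching chunker
def choose_chunks(text, strategy="max_size", embeddings=None):
    if strategy == "max_size":    # consistent sizes, token limits, speed
        return chunk_by_max_chunk_size(
            text, max_chunk_size=500, preserve_sentence_structure=True
        )
    if strategy == "sentences":   # grammatically complete, fine-grained search
        return chunk_by_sentences(text)
    if strategy == "paragraphs":  # clear paragraph structure
        return chunk_by_paragraphs(text)
    if strategy == "semantic":    # maximum retrieval accuracy, costs API calls
        return chunk_by_semantics(
            text, llm_embeddings_instance=embeddings, threshold_percentage=90
        )
    raise ValueError(f"Unknown strategy: {strategy}")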
Best Practices
1. Balance Chunk Size
Too small and you lose context; too large and retrieval becomes less precise. Aim for 300-800 characters or 50-150 words for most use cases.
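A quick way to check whether your chunks land in that range, using the TextChunks fields shown earlier (assuming text holds your document):
from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size

chunks = chunk_by_max_chunk_size(
    text, max_chunk_size=500, preserve_sentence_structure=True
)
word_counts = [c.num_words for c in chunks.chunk_list]
print(f"Min: {min(word_counts)}, "
      f"Avg: {sum(word_counts) // len(word_counts)}, "
      f"Max: {max(word_counts)} words/chunk")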
2. Add Overlap for Context
Consider overlapping chunks by 10-20% to preserve context across chunk boundaries. The built-in chunkers don't add overlap, so this takes a little custom logic, as sketched below.
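Here's one possible version of that custom logic, built on chunk_by_sentences; the chunk_with_overlap helper and its parameters are our own sketch, not a library API:
from SimplerLLM.tools.text_chunker import chunk_by_sentences

def chunk_with_overlap(text, sentences_per_chunk=5, overlap_sentences=1):
    """Group sentences into chunks that share overlap_sentences at each boundary."""
    assert overlap_sentences < sentences_per_chunk
    sentences = [c.text for c in chunk_by_sentences(text).chunk_list]
    step = sentences_per_chunk - overlap_sentences
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), step)
    ]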
3. Store Chunk Metadata
Keep track of source document, chunk index, and position for better attribution and debugging.
4. Test with Your Content
Different content types (technical docs, narratives, lists) may need different strategies. Test to find what works best.
5. Consider Preprocessing
Clean text (remove extra whitespace, normalize formatting) before chunking for better results.
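For example, a light normalization pass like this generic sketch (not a SimplerLLM utility) often produces cleaner chunk boundaries:
import re

def clean_text(text):
    """Normalize whitespace before chunking."""
    text = text.replace("\r\n", "\n")        # normalize line endings
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()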
Performance Comparison
Speed & Cost
- Max Size: Very fast, no API calls
- Sentences: Fast, no API calls
- Paragraphs: Fast, no API calls
- Semantic: Slower, requires embeddings API calls (costs money)
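To measure this on your own documents, a rough timing harness like the one below works (assuming text holds your document; absolute numbers will vary by machine and content):
import time
from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size, chunk_by_sentences

def time_strategy(name, chunk_fn, *args, **kwargs):
    start = time.perf_counter()
    result = chunk_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{name}: {result.num_chunks} chunks in {elapsed:.3f}s")

time_strategy("Max Size", chunk_by_max_chunk_size, text, max_chunk_size=500)
time_strategy("Sentences", chunk_by_sentences, text)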
Next Steps
🎯 Recommended Workflow
For most RAG applications:
1. Load content with Content Loading tools
2. Chunk with chunk_by_max_chunk_size(preserve_sentence_structure=True)
3. Generate embeddings for each chunk
4. Store in a vector database with metadata
5. Use semantic search to retrieve relevant chunks
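Condensed into code, the workflow looks roughly like this sketch, which reuses the APIs from the examples above (the file name and question are placeholders; see the OptimizedRAG class for a fuller version):
from SimplerLLM.tools.generic_loader import load_content
from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
from SimplerLLM.vectors.vector_db import VectorDB

embeddings = EmbeddingsOpenAI()
vector_db = VectorDB.create(provider='local', embeddings_instance=embeddings)

doc = load_content("your_document.pdf")  # 1. Load
chunks = chunk_by_max_chunk_size(        # 2. Chunk
    text=doc.content, max_chunk_size=500, preserve_sentence_structure=True
)
for i, chunk in enumerate(chunks.chunk_list):
    vector = embeddings.generate_embeddings(chunk.text)  # 3. Embed
    vector_db.add(vector=vector, metadata={'text': chunk.text, 'chunk_index': i})  # 4. Store

# 5. Retrieve the most relevant chunks at query time
query_vector = embeddings.generate_embeddings("your question")
results = vector_db.search(query_vector, top_k=3)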