Text Chunking

Split long texts into manageable chunks for LLM processing, embeddings generation, and semantic search applications.

Why Text Chunking?

Text chunking is essential for working with LLMs and embeddings because:

  • Context Window Limits: LLMs have maximum input lengths (e.g., 8K, 128K tokens)
  • Better Embeddings: Smaller, focused chunks create more precise semantic representations
  • Improved Search: Granular chunks allow finding specific relevant passages
  • Cost Optimization: Process only relevant chunks instead of entire documents
  • RAG Systems: Retrieve the most relevant chunks to augment LLM responses

Chunking Strategies

  • By Max Size: fixed-size chunks with optional sentence preservation. Best for consistent chunk sizes and token limits.
  • By Sentences: split at sentence boundaries. Best for maintaining grammatical structure.
  • By Paragraphs: split at paragraph boundaries. Best for preserving topic coherence.
  • By Semantics: AI-powered semantic similarity chunking. Best for topically coherent chunks (advanced).

TextChunks Model

All chunking methods return a TextChunks object with two attributes: num_chunks (the chunk count) and chunk_list (the chunks themselves, each exposing text, num_characters, and num_words):

from SimplerLLM.tools.text_chunker import chunk_by_sentences

text = "First sentence. Second sentence. Third sentence."
result = chunk_by_sentences(text)

# Access chunk information
print(f"Total chunks: {result.num_chunks}")

# Iterate through chunks
for i, chunk in enumerate(result.chunk_list, 1):
    print(f"\nChunk {i}:")
    print(f"  Text: {chunk.text}")
    print(f"  Characters: {chunk.num_characters}")
    print(f"  Words: {chunk.num_words}")

Chunking by Max Size

Split text into chunks of a specified maximum size:

Basic Usage

from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size

text = """
Long document text here... This could be thousands of words from a PDF,
web article, or any other source. The chunker will split it into
manageable pieces based on the max_chunk_size parameter.
"""

# Chunk without preserving sentences (faster)
chunks = chunk_by_max_chunk_size(
    text=text,
    max_chunk_size=500,
    preserve_sentence_structure=False
)

print(f"Created {chunks.num_chunks} chunks")
for i, chunk in enumerate(chunks.chunk_list, 1):
    print(f"\nChunk {i}: {chunk.num_characters} chars, {chunk.num_words} words")
    print(chunk.text[:100] + "...")

Preserving Sentence Structure

from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size

text = """
Artificial intelligence is transforming industries. Machine learning
enables computers to learn from data. Deep learning uses neural networks
to solve complex problems. These technologies are advancing rapidly.
"""

# Chunk while preserving sentence boundaries
chunks = chunk_by_max_chunk_size(
    text=text,
    max_chunk_size=200,
    preserve_sentence_structure=True  # Respects sentence endings
)

for chunk in chunks.chunk_list:
    print(f"Chunk ({chunk.num_words} words):")
    print(chunk.text)
    print("-" * 50)

Chunking by Sentences

Split text at sentence boundaries for grammatically complete chunks:

from SimplerLLM.tools.text_chunker import chunk_by_sentences

text = """
SimplerLLM makes AI development easy. It provides a unified interface
for multiple LLM providers. You can build powerful applications quickly.
The library handles complexity for you.
"""

chunks = chunk_by_sentences(text)

print(f"Split into {chunks.num_chunks} sentences\n")

for i, chunk in enumerate(chunks.chunk_list, 1):
    print(f"{i}. {chunk.text} ({chunk.num_words} words)")

Chunking by Paragraphs

Split text at paragraph boundaries to maintain topical coherence:

from SimplerLLM.tools.text_chunker import chunk_by_paragraphs

text = """
Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables
computers to learn from data without explicit programming.

Types of Machine Learning

There are three main types: supervised learning, unsupervised learning,
and reinforcement learning. Each has its own use cases.

Applications

Machine learning powers recommendation systems, image recognition,
natural language processing, and many other applications.
"""

chunks = chunk_by_paragraphs(text)

print(f"Split into {chunks.num_chunks} paragraphs\n")

for i, chunk in enumerate(chunks.chunk_list, 1):
    print(f"Paragraph {i}:")
    print(f"{chunk.text}\n")
    print(f"Stats: {chunk.num_words} words, {chunk.num_characters} chars\n")
    print("-" * 60)

Semantic Chunking (Advanced)

Use AI to split text based on semantic similarity, creating topically coherent chunks:

from SimplerLLM.tools.text_chunker import chunk_by_semantics
from SimplerLLM.language.embeddings import EmbeddingsOpenAI

# Create embeddings instance
embeddings = EmbeddingsOpenAI()

text = """
Machine learning is transforming healthcare. Doctors use AI to diagnose
diseases more accurately. Medical imaging analysis has improved dramatically.

The weather today is sunny and warm. Temperature is around 75 degrees.
It's a perfect day for outdoor activities.

Back to AI, natural language processing enables computers to understand
human language. Chatbots and virtual assistants are becoming more capable.
"""

# Chunk by semantic similarity
chunks = chunk_by_semantics(
    text=text,
    llm_embeddings_instance=embeddings,
    threshold_percentage=90  # Higher = more chunks (90 is default)
)

print(f"Created {chunks.num_chunks} semantic chunks\n")

for i, chunk in enumerate(chunks.chunk_list, 1):
    print(f"Chunk {i} ({chunk.num_words} words):")
    print(chunk.text)
    print("-" * 60 + "\n")

# Notice: AI-related sentences grouped together, weather separated

Tuning Semantic Chunking

from SimplerLLM.tools.text_chunker import chunk_by_semantics
from SimplerLLM.language.embeddings import EmbeddingsOpenAI

embeddings = EmbeddingsOpenAI()
text = "Your long text here..."

# More chunks (stricter similarity threshold)
fine_chunks = chunk_by_semantics(
    text,
    llm_embeddings_instance=embeddings,
    threshold_percentage=95
)

# Fewer chunks (looser similarity threshold)
coarse_chunks = chunk_by_semantics(
    text,
    llm_embeddings_instance=embeddings,
    threshold_percentage=80
)

print(f"95% threshold: {fine_chunks.num_chunks} chunks")
print(f"80% threshold: {coarse_chunks.num_chunks} chunks")

Real-World Examples

RAG System with Optimal Chunking

from SimplerLLM.tools.generic_loader import load_content
from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
from SimplerLLM.vectors.vector_db import VectorDB
from SimplerLLM.language.llm import LLM, LLMProvider

class OptimizedRAG:
    def __init__(self, chunk_size=500):
        self.chunk_size = chunk_size
        self.embeddings = EmbeddingsOpenAI()
        self.vector_db = VectorDB.create(
            provider='local',
            embeddings_instance=self.embeddings
        )
        self.llm = LLM.create(
            provider=LLMProvider.OPENAI,
            model_name="gpt-4o"
        )

    def add_document(self, source):
        """Load, chunk, and index a document"""
        # Load content
        doc = load_content(source)
        print(f"Loaded: {doc.title or source}")
        print(f"Total words: {doc.word_count}")

        # Chunk the content
        chunks = chunk_by_max_chunk_size(
            text=doc.content,
            max_chunk_size=self.chunk_size,
            preserve_sentence_structure=True
        )
        print(f"Created {chunks.num_chunks} chunks")

        # Generate embeddings and store
        for i, chunk in enumerate(chunks.chunk_list):
            embedding = self.embeddings.generate_embeddings(chunk.text)
            self.vector_db.add(
                vector=embedding,
                metadata={
                    'text': chunk.text,
                    'source': doc.title or source,
                    'chunk_index': i,
                    'word_count': chunk.num_words
                }
            )

        print(f"✓ Indexed {chunks.num_chunks} chunks\n")

    def query(self, question, top_k=3):
        """Query with chunk-level retrieval"""
        query_embedding = self.embeddings.generate_embeddings(question)
        results = self.vector_db.search(query_embedding, top_k=top_k)

        # Build context from relevant chunks
        context = "\n\n".join([
            f"[From {r['metadata']['source']}]\n{r['metadata']['text']}"
            for r in results
        ])

        prompt = f"""Answer based on these relevant excerpts:

{context}

Question: {question}

Answer:"""

        answer = self.llm.generate_response(prompt=prompt)
        return answer, results

# Usage
rag = OptimizedRAG(chunk_size=500)

# Add documents
rag.add_document("ai_research_paper.pdf")
rag.add_document("https://blog.example.com/ml-guide")

# Query
answer, sources = rag.query("What are the key machine learning algorithms?")
print(f"Answer: {answer}\n")
print("Based on chunks from:")
for s in sources:
    print(f"- {s['metadata']['source']} (chunk {s['metadata']['chunk_index']})")

Comparing Chunking Strategies

from SimplerLLM.tools.text_chunker import (
    chunk_by_max_chunk_size,
    chunk_by_sentences,
    chunk_by_paragraphs,
    chunk_by_semantics
)
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
from SimplerLLM.tools.generic_loader import load_content

def compare_chunking_strategies(source):
    """Compare different chunking approaches"""
    # Load document
    doc = load_content(source)
    text = doc.content

    print(f"Document: {doc.word_count} words\n")

    # Strategy 1: Max size
    max_size_chunks = chunk_by_max_chunk_size(
        text, max_chunk_size=500, preserve_sentence_structure=True
    )
    print(f"Max Size (500 chars):")
    print(f"  Chunks: {max_size_chunks.num_chunks}")
    print(f"  Avg words/chunk: {doc.word_count // max_size_chunks.num_chunks}")

    # Strategy 2: Sentences
    sentence_chunks = chunk_by_sentences(text)
    print(f"\nSentences:")
    print(f"  Chunks: {sentence_chunks.num_chunks}")
    avg_words = sum(c.num_words for c in sentence_chunks.chunk_list) // sentence_chunks.num_chunks
    print(f"  Avg words/chunk: {avg_words}")

    # Strategy 3: Paragraphs
    paragraph_chunks = chunk_by_paragraphs(text)
    print(f"\nParagraphs:")
    print(f"  Chunks: {paragraph_chunks.num_chunks}")
    avg_words = sum(c.num_words for c in paragraph_chunks.chunk_list) // paragraph_chunks.num_chunks
    print(f"  Avg words/chunk: {avg_words}")

    # Strategy 4: Semantic (requires embeddings)
    embeddings = EmbeddingsOpenAI()
    semantic_chunks = chunk_by_semantics(text, embeddings, threshold_percentage=90)
    print(f"\nSemantic:")
    print(f"  Chunks: {semantic_chunks.num_chunks}")
    avg_words = sum(c.num_words for c in semantic_chunks.chunk_list) // semantic_chunks.num_chunks
    print(f"  Avg words/chunk: {avg_words}")

# Usage
compare_chunking_strategies("research_paper.pdf")

Choosing a Chunking Strategy

Decision Guide

Use Max Size when:

  • You need consistent chunk sizes for embeddings
  • Working with token limits
  • Speed is important (fastest method)
  • Content structure is irregular

Use Sentences when:

  • Chunks should be grammatically complete
  • Processing short-form content
  • Fine-grained search needed

Use Paragraphs when:

  • Content has clear paragraph structure
  • Topics align with paragraphs
  • Want to preserve original document structure

Use Semantic when:

  • Maximum retrieval accuracy is critical
  • Topics don't align with paragraphs
  • Cost of embeddings API calls is acceptable
  • Building high-quality RAG systems

One way to wire these choices into code is sketched below.
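A minimal dispatcher sketch; the chunk_with_strategy helper and its strategy names are our own convention, not part of SimplerLLM:

from SimplerLLM.tools.text_chunker import (
    chunk_by_max_chunk_size,
    chunk_by_sentences,
    chunk_by_paragraphs,
    chunk_by_semantics
)

def chunk_with_strategy(text, strategy="max_size", embeddings=None, **kwargs):
    """Route text to one of the four chunkers by strategy name.
    Illustrative helper, not a SimplerLLM API."""
    if strategy == "max_size":
        return chunk_by_max_chunk_size(
            text,
            max_chunk_size=kwargs.get("max_chunk_size", 500),
            preserve_sentence_structure=kwargs.get("preserve_sentence_structure", True)
        )
    if strategy == "sentences":
        return chunk_by_sentences(text)
    if strategy == "paragraphs":
        return chunk_by_paragraphs(text)
    if strategy == "semantic":
        if embeddings is None:
            raise ValueError("Semantic chunking requires an embeddings instance")
        return chunk_by_semantics(
            text,
            llm_embeddings_instance=embeddings,
            threshold_percentage=kwargs.get("threshold_percentage", 90)
        )
    raise ValueError(f"Unknown strategy: {strategy}")

This keeps the strategy choice configurable per document type without changing the calling code.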

Best Practices

1. Balance Chunk Size

Chunks that are too small lose context; chunks that are too large reduce retrieval precision. Aim for 300-800 characters (roughly 50-150 words) for most use cases.
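If you're unsure where in that range to land, a quick sweep makes the trade-off concrete (my_document.txt is a placeholder for your own text):

from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size

text = open("my_document.txt").read()  # placeholder: any long text

for size in (300, 500, 800):
    chunks = chunk_by_max_chunk_size(
        text, max_chunk_size=size, preserve_sentence_structure=True
    )
    avg_words = sum(c.num_words for c in chunks.chunk_list) // max(chunks.num_chunks, 1)
    print(f"{size} chars -> {chunks.num_chunks} chunks, ~{avg_words} words/chunk")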

2. Add Overlap for Context

Consider overlapping chunks by 10-20% to preserve context across chunk boundaries. SimplerLLM's chunkers don't do this for you, so implement custom logic if needed, as in the sketch below.
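One minimal approach builds on chunk_by_sentences: group sentences into windows that repeat the last sentence(s) of the previous window. The overlapping_chunks helper is our own, not a library function:

from SimplerLLM.tools.text_chunker import chunk_by_sentences

def overlapping_chunks(text, sentences_per_chunk=4, overlap=1):
    """Group sentences into chunks where each chunk repeats the last
    `overlap` sentence(s) of the previous one. Custom sketch, not a
    built-in SimplerLLM feature."""
    assert 0 <= overlap < sentences_per_chunk
    sentences = [c.text for c in chunk_by_sentences(text).chunk_list]
    step = sentences_per_chunk - overlap  # how far each window advances
    chunks = []
    for i in range(0, len(sentences), step):
        window = sentences[i:i + sentences_per_chunk]
        if chunks and len(window) <= overlap:
            break  # tail is already fully covered by the previous chunk
        chunks.append(" ".join(window))
    return chunks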

3. Store Chunk Metadata

Keep track of source document, chunk index, and position for better attribution and debugging.
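A minimal sketch of what that looks like (field names are illustrative; the OptimizedRAG example above stores the same kind of record in a vector DB):

from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size

def chunk_records(text, source):
    """Turn a document into a list of chunk dicts with attribution metadata."""
    chunks = chunk_by_max_chunk_size(
        text, max_chunk_size=500, preserve_sentence_structure=True
    )
    return [
        {
            "text": chunk.text,
            "source": source,       # which document the chunk came from
            "chunk_index": i,       # position within the document
            "num_words": chunk.num_words
        }
        for i, chunk in enumerate(chunks.chunk_list)
    ]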

4. Test with Your Content

Different content types (technical docs, narratives, lists) may need different strategies. Test to find what works best.

5. Consider Preprocessing

Clean text (remove extra whitespace, normalize formatting) before chunking for better results.
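For example, a small normalization pass with the standard library; note it keeps blank lines intact so chunk_by_paragraphs can still find paragraph boundaries:

import re

def clean_text(raw):
    """Normalize whitespace before chunking: collapse runs of spaces/tabs
    and trim excess blank lines while preserving paragraph breaks."""
    text = re.sub(r"[ \t]+", " ", raw)        # collapse horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)      # strip spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)    # at most one blank line in a row
    return text.strip()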

Performance Comparison

Speed & Cost

  • Max Size: Very fast, no API calls
  • Sentences: Fast, no API calls
  • Paragraphs: Fast, no API calls
  • Semantic: Slower, requires embeddings API calls (costs money)
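You can verify the speed difference on your own content with a rough timing loop (my_document.txt is a placeholder; semantic chunking is omitted since its runtime is dominated by API latency):

import time
from SimplerLLM.tools.text_chunker import (
    chunk_by_max_chunk_size,
    chunk_by_sentences,
    chunk_by_paragraphs
)

text = open("my_document.txt").read()  # placeholder: any long text

strategies = [
    ("max_size", lambda t: chunk_by_max_chunk_size(
        t, max_chunk_size=500, preserve_sentence_structure=True)),
    ("sentences", chunk_by_sentences),
    ("paragraphs", chunk_by_paragraphs),
]

for name, chunker in strategies:
    start = time.perf_counter()
    result = chunker(text)
    elapsed = time.perf_counter() - start
    print(f"{name}: {result.num_chunks} chunks in {elapsed:.4f}s")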

Next Steps

🎯 Recommended Workflow

For most RAG applications (a condensed script follows these steps):

  1. Load content with Content Loading tools
  2. Chunk with chunk_by_max_chunk_size(preserve_sentence_structure=True)
  3. Generate embeddings for each chunk
  4. Store in vector database with metadata
  5. Use semantic search to retrieve relevant chunks
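Condensed into one script, using the same API calls as the OptimizedRAG example above (the file name and query are placeholders):

from SimplerLLM.tools.generic_loader import load_content
from SimplerLLM.tools.text_chunker import chunk_by_max_chunk_size
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
from SimplerLLM.vectors.vector_db import VectorDB

# 1. Load content
doc = load_content("my_document.pdf")  # placeholder source

# 2. Chunk with sentence preservation
chunks = chunk_by_max_chunk_size(
    doc.content, max_chunk_size=500, preserve_sentence_structure=True
)

# 3-4. Generate embeddings and store with metadata
embeddings = EmbeddingsOpenAI()
db = VectorDB.create(provider='local', embeddings_instance=embeddings)
for i, chunk in enumerate(chunks.chunk_list):
    db.add(
        vector=embeddings.generate_embeddings(chunk.text),
        metadata={'text': chunk.text, 'chunk_index': i}
    )

# 5. Retrieve the most relevant chunks for a query
query_vector = embeddings.generate_embeddings("What is this document about?")
for hit in db.search(query_vector, top_k=3):
    print(hit['metadata']['chunk_index'], hit['metadata']['text'][:80])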