Embeddings

Generate vector embeddings from text using multiple providers through a unified interface.

What Are Embeddings?

Embeddings are vector representations of text that capture semantic meaning. Text with similar meanings will have similar vector representations, making embeddings essential for:

  • Semantic Search: Find contextually similar content
  • Clustering: Group similar documents together
  • Classification: Categorize text based on meaning
  • Recommendation Systems: Suggest related content
  • RAG (Retrieval Augmented Generation): Enhance LLM responses with relevant context

Supported Providers

SimplerLLM supports multiple embedding providers, each with different characteristics:

  • OpenAI: Popular, reliable general-purpose embeddings. Models: text-embedding-3-small, text-embedding-3-large
  • Voyage AI: Specialized embeddings for various domains, with domain-specific models available
  • Cohere: Strong multilingual support. Models: embed-english-v3.0, embed-multilingual-v3.0

Basic Usage

OpenAI Embeddings

from SimplerLLM.language.embeddings import EmbeddingsOpenAI

# Create embeddings instance
embeddings = EmbeddingsOpenAI()

# Generate embedding for a single text
text = "SimplerLLM makes working with embeddings easy"
embedding_vector = embeddings.generate_embeddings(text)

print(f"Embedding dimension: {len(embedding_vector)}")
# Output: Embedding dimension: 1536 (for text-embedding-3-small)
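
To use the larger model, pass its name when constructing the instance. The model_name parameter is an assumption here, mirroring the Voyage AI and Cohere constructors shown below; check the SimplerLLM API reference for the exact signature.

# model_name is assumed to work like the other providers' constructors
embeddings_large = EmbeddingsOpenAI(model_name="text-embedding-3-large")

vector = embeddings_large.generate_embeddings("Higher-dimensional embedding")
print(f"Embedding dimension: {len(vector)}")
# Output: Embedding dimension: 3072 (for text-embedding-3-large)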

Voyage AI Embeddings

from SimplerLLM.language.embeddings import EmbeddingsVoyageAI

# Create embeddings instance with specific model
embeddings = EmbeddingsVoyageAI(model_name="voyage-2")  # Example model

# Generate embedding
text = "Voyage AI provides domain-specific embeddings"
embedding_vector = embeddings.generate_embeddings(text)

Cohere Embeddings

from SimplerLLM.language.embeddings import EmbeddingsCohere

# Create embeddings instance
embeddings = EmbeddingsCohere(model_name="embed-english-v3.0")  # Example model

# Generate embedding
text = "Cohere supports multilingual embeddings"
embedding_vector = embeddings.generate_embeddings(text)

Batch Processing

Generate embeddings for multiple texts:

from SimplerLLM.language.embeddings import EmbeddingsOpenAI

embeddings = EmbeddingsOpenAI()

# Batch generate embeddings
texts = [
    "First document about machine learning",
    "Second document about artificial intelligence",
    "Third document about data science"
]

# Process all texts (one API call per text)
embedding_vectors = [embeddings.generate_embeddings(text) for text in texts]

print(f"Generated {len(embedding_vectors)} embeddings")
print(f"Each embedding has {len(embedding_vectors[0])} dimensions")

Semantic Similarity

Calculate similarity between texts using cosine similarity:

from SimplerLLM.language.embeddings import EmbeddingsOpenAI
import numpy as np

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Create embeddings
embeddings = EmbeddingsOpenAI()

# Compare two texts
text1 = "Machine learning is a subset of artificial intelligence"
text2 = "AI includes machine learning and deep learning"
text3 = "The weather is sunny today"

# Generate embeddings
emb1 = embeddings.generate_embeddings(text1)
emb2 = embeddings.generate_embeddings(text2)
emb3 = embeddings.generate_embeddings(text3)

# Calculate similarities
sim_1_2 = cosine_similarity(emb1, emb2)
sim_1_3 = cosine_similarity(emb1, emb3)

print(f"Similarity between text1 and text2: {sim_1_2:.4f}")  # High similarity
print(f"Similarity between text1 and text3: {sim_1_3:.4f}")  # Low similarity

Configuration

Set up API keys in your .env file:

# .env file
OPENAI_API_KEY=your_openai_api_key
VOYAGE_API_KEY=your_voyage_api_key
COHERE_API_KEY=your_cohere_api_key

Or pass API keys directly:

from SimplerLLM.language.embeddings import EmbeddingsOpenAI

# Pass API key directly
embeddings = EmbeddingsOpenAI(api_key="your-api-key-here")
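
If the keys are not picked up automatically in your environment, python-dotenv is one common way to load a .env file before creating an embeddings instance (an optional third-party package, not a SimplerLLM requirement):

from dotenv import load_dotenv  # pip install python-dotenv
from SimplerLLM.language.embeddings import EmbeddingsOpenAI

load_dotenv()  # exports the variables from .env into the process environment
embeddings = EmbeddingsOpenAI()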

Choosing the Right Model

Model Selection Guide

  • Dimension Size: Larger dimensions (1536+) capture more nuance but require more storage and compute
  • Domain Specificity: Some providers offer models optimized for specific domains (code, finance, etc.)
  • Language Support: Choose multilingual models if working with non-English text
  • Cost: Pricing varies by model and provider; check provider documentation
  • Performance: Test multiple models to find the best balance for your use case (see the sketch below)
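
A quick way to act on the last point: embed a few representative related and unrelated pairs with each candidate model and compare the similarity gap. This sketch assumes EmbeddingsOpenAI accepts a model_name parameter, as the other providers shown above do.

from SimplerLLM.language.embeddings import EmbeddingsOpenAI
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Pairs labeled by whether they should score as related
pairs = [
    ("machine learning", "deep learning", "related"),
    ("machine learning", "pasta recipe", "unrelated"),
]

for model in ("text-embedding-3-small", "text-embedding-3-large"):
    emb = EmbeddingsOpenAI(model_name=model)  # assumed parameter, see note above
    for a, b, label in pairs:
        score = cosine_similarity(emb.generate_embeddings(a), emb.generate_embeddings(b))
        print(f"{model}: {a!r} vs {b!r} -> {score:.3f} ({label})")

A model that works well for your use case should score related pairs clearly above unrelated ones.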

Real-World Example: Document Search

Build a simple semantic search system:

from SimplerLLM.language.embeddings import EmbeddingsOpenAI
import numpy as np

class SimpleDocumentSearch:
    def __init__(self):
        self.embeddings = EmbeddingsOpenAI()
        self.documents = []
        self.document_embeddings = []

    def add_documents(self, documents):
        """Add documents to the search index"""
        self.documents.extend(documents)

        # Generate embeddings for new documents
        for doc in documents:
            emb = self.embeddings.generate_embeddings(doc)
            self.document_embeddings.append(emb)

    def search(self, query, top_k=3):
        """Search for most similar documents"""
        # Generate query embedding
        query_emb = self.embeddings.generate_embeddings(query)

        # Calculate similarities
        similarities = []
        for doc_emb in self.document_embeddings:
            sim = np.dot(query_emb, doc_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
            )
            similarities.append(sim)

        # Get top-k results
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'similarity': similarities[idx]
            })

        return results

# Usage
search = SimpleDocumentSearch()

# Add documents
docs = [
    "Python is a high-level programming language",
    "Machine learning helps computers learn from data",
    "SimplerLLM simplifies working with language models",
    "Embeddings convert text into vector representations",
    "The restaurant serves delicious Italian food"
]
search.add_documents(docs)

# Search
results = search.search("How do I work with AI models?", top_k=3)

for i, result in enumerate(results, 1):
    print(f"{i}. {result['document']}")
    print(f"   Similarity: {result['similarity']:.4f}\n")

Best Practices

1. Cache Embeddings

Embeddings are expensive to generate. Cache them to avoid regenerating for the same text.
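
A minimal in-memory wrapper is sketched below; CachedEmbeddings is illustrative, not part of SimplerLLM, and for persistence across runs you would serialize the cache to disk or a database.

from SimplerLLM.language.embeddings import EmbeddingsOpenAI

class CachedEmbeddings:
    """Caches vectors by exact text so repeated inputs cost nothing."""
    def __init__(self, embeddings):
        self.embeddings = embeddings
        self._cache = {}

    def generate_embeddings(self, text):
        if text not in self._cache:
            self._cache[text] = self.embeddings.generate_embeddings(text)
        return self._cache[text]

cached = CachedEmbeddings(EmbeddingsOpenAI())
cached.generate_embeddings("repeated text")  # API call
cached.generate_embeddings("repeated text")  # served from the cache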

2. Normalize Text

Clean and normalize text before generating embeddings (lowercase, remove special characters, etc.) for consistent results.
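
For example (how aggressively to normalize depends on your model; many modern embedding models handle raw text well, so measure before stripping more than whitespace and case):

import re
from SimplerLLM.language.embeddings import EmbeddingsOpenAI

def normalize(text):
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower().strip())

embeddings = EmbeddingsOpenAI()
vector = embeddings.generate_embeddings(normalize("  SimplerLLM   makes this EASY!  "))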

3. Batch When Possible

Process multiple texts together where the provider supports it to reduce API calls, or overlap sequential calls as in the thread-pool sketch under Batch Processing above.

4. Use the Same Model

Always use the same embedding model for texts you want to compare; embeddings from different models are not comparable.

5. Store Efficiently

Use specialized vector databases for large-scale embedding storage and retrieval (see Vector Databases documentation).

Error Handling

from SimplerLLM.language.embeddings import EmbeddingsOpenAI

embeddings = EmbeddingsOpenAI()

try:
    text = "Generate embedding for this text"
    embedding = embeddings.generate_embeddings(text)
    print(f"Successfully generated {len(embedding)}-dimensional embedding")

except ValueError as e:
    # Handle invalid input
    print(f"Invalid input: {e}")

except ConnectionError as e:
    # Handle API connection issues
    print(f"Connection error: {e}")

except Exception as e:
    # Handle other errors
    print(f"Error generating embedding: {e}")

Next Steps

📚 Additional Resources

Learn more about storing and querying embeddings at scale in the Vector Databases documentation.