Embeddings
Generate vector embeddings from text using multiple providers through a unified interface.
What Are Embeddings?
Embeddings are vector representations of text that capture semantic meaning. Text with similar meanings will have similar vector representations, making embeddings essential for:
- Semantic Search: Find contextually similar content
- Clustering: Group similar documents together
- Classification: Categorize text based on meaning
- Recommendation Systems: Suggest related content
- RAG (Retrieval Augmented Generation): Enhance LLM responses with relevant context
Supported Providers
SimplerLLM supports multiple embedding providers, each with different characteristics:
OpenAI
- Popular, reliable general-purpose embeddings
- Models: text-embedding-3-small, text-embedding-3-large
Voyage AI
- Specialized embeddings for various domains
- Domain-specific models available
Cohere
- Multilingual support
- Models: embed-english-v3.0, embed-multilingual-v3.0
Basic Usage
OpenAI Embeddings
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
# Create embeddings instance
embeddings = EmbeddingsOpenAI()
# Generate embedding for a single text
text = "SimplerLLM makes working with embeddings easy"
embedding_vector = embeddings.generate_embeddings(text)
print(f"Embedding dimension: {len(embedding_vector)}")
# Output: Embedding dimension: 1536 (for text-embedding-3-small)
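You can also request a specific model. Assuming the OpenAI wrapper accepts a model_name argument like the Voyage and Cohere wrappers below (check the SimplerLLM source if it does not), a sketch:
# Assumption: model_name keyword, mirroring EmbeddingsVoyageAI / EmbeddingsCohere
embeddings_large = EmbeddingsOpenAI(model_name="text-embedding-3-large")
vector = embeddings_large.generate_embeddings("Larger models capture more nuance")
print(len(vector))  # text-embedding-3-large produces 3072-dimensional vectors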
Voyage AI Embeddings
from SimplerLLM.language.embeddings import EmbeddingsVoyageAI
# Create embeddings instance with specific model
embeddings = EmbeddingsVoyageAI(model_name="voyage-2") # Example model
# Generate embedding
text = "Voyage AI provides domain-specific embeddings"
embedding_vector = embeddings.generate_embeddings(text)
Cohere Embeddings
from SimplerLLM.language.embeddings import EmbeddingsCohere
# Create embeddings instance
embeddings = EmbeddingsCohere(model_name="embed-english-v3.0") # Example model
# Generate embedding
text = "Cohere supports multilingual embeddings"
embedding_vector = embeddings.generate_embeddings(text)
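All three classes expose the same generate_embeddings method, so downstream code never needs to know which provider produced a vector. A minimal provider-agnostic sketch:
def embed_corpus(embeddings, texts):
    """Works with any SimplerLLM embeddings instance (duck typing)."""
    return [embeddings.generate_embeddings(text) for text in texts]
# Swap EmbeddingsCohere for EmbeddingsOpenAI without touching embed_corpus
vectors = embed_corpus(embeddings, ["first text", "second text"])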
Batch Processing
Generate embeddings for multiple texts. The loop below issues one API request per text; if your provider offers a native batch endpoint that accepts a list of texts, prefer it to reduce the number of calls:
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
embeddings = EmbeddingsOpenAI()
# Batch generate embeddings
texts = [
"First document about machine learning",
"Second document about artificial intelligence",
"Third document about data science"
]
# Process all texts
embedding_vectors = [embeddings.generate_embeddings(text) for text in texts]
print(f"Generated {len(embedding_vectors)} embeddings")
print(f"Each embedding has {len(embedding_vectors[0])} dimensions")
Semantic Similarity
Calculate similarity between texts using cosine similarity:
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
import numpy as np
def cosine_similarity(vec1, vec2):
"""Calculate cosine similarity between two vectors"""
return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
# Create embeddings
embeddings = EmbeddingsOpenAI()
# Compare two texts
text1 = "Machine learning is a subset of artificial intelligence"
text2 = "AI includes machine learning and deep learning"
text3 = "The weather is sunny today"
# Generate embeddings
emb1 = embeddings.generate_embeddings(text1)
emb2 = embeddings.generate_embeddings(text2)
emb3 = embeddings.generate_embeddings(text3)
# Calculate similarities
sim_1_2 = cosine_similarity(emb1, emb2)
sim_1_3 = cosine_similarity(emb1, emb3)
print(f"Similarity between text1 and text2: {sim_1_2:.4f}") # High similarity
print(f"Similarity between text1 and text3: {sim_1_3:.4f}") # Low similarity
Configuration
Set up API keys in your .env file:
# .env file
OPENAI_API_KEY=your_openai_api_key
VOYAGE_API_KEY=your_voyage_api_key
COHERE_API_KEY=your_cohere_api_key
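If your process does not pick up the .env file automatically, you can load it at startup with python-dotenv (a separate package, shown here as one common option):
from dotenv import load_dotenv  # pip install python-dotenv
load_dotenv()  # copies .env entries into os.environ
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
embeddings = EmbeddingsOpenAI()  # now finds OPENAI_API_KEY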
Or pass API keys directly:
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
# Pass API key directly
embeddings = EmbeddingsOpenAI(api_key="your-api-key-here")
Choosing the Right Model
Model Selection Guide
- Dimension Size: Larger dimensions (1536+) capture more nuance but require more storage and compute
- Domain Specificity: Some providers offer models optimized for specific domains (code, finance, etc.)
- Language Support: Choose multilingual models if working with non-English text
- Cost: Pricing varies by model; check provider documentation
- Performance: Test multiple models to find the best balance for your use case (see the sketch below)
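Because absolute similarity scores are not comparable across models, the fairest quick test is a ranking check on your own data: does each query rank its known-relevant document first? A sketch with illustrative model choices:
from SimplerLLM.language.embeddings import EmbeddingsOpenAI, EmbeddingsVoyageAI
import numpy as np
def retrieval_accuracy(embeddings, queries, documents, relevant_idx):
    """Fraction of queries whose known-relevant document ranks first."""
    doc_vecs = [np.array(embeddings.generate_embeddings(d)) for d in documents]
    hits = 0
    for query, target in zip(queries, relevant_idx):
        q = np.array(embeddings.generate_embeddings(query))
        sims = [np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)) for d in doc_vecs]
        hits += int(np.argmax(sims) == target)
    return hits / len(queries)
# relevant_idx[i] is the index of the document that should match queries[i]
for candidate in (EmbeddingsOpenAI(), EmbeddingsVoyageAI(model_name="voyage-2")):
    acc = retrieval_accuracy(candidate,
                             ["reset my password", "refund policy"],
                             ["Steps to change your account password",
                              "Returns are accepted within 30 days"],
                             [0, 1])
    print(type(candidate).__name__, acc)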
Real-World Example: Document Search
Build a simple semantic search system:
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
import numpy as np
class SimpleDocumentSearch:
def __init__(self):
self.embeddings = EmbeddingsOpenAI()
self.documents = []
self.document_embeddings = []
def add_documents(self, documents):
"""Add documents to the search index"""
self.documents.extend(documents)
# Generate embeddings for new documents
for doc in documents:
emb = self.embeddings.generate_embeddings(doc)
self.document_embeddings.append(emb)
def search(self, query, top_k=3):
"""Search for most similar documents"""
# Generate query embedding
query_emb = self.embeddings.generate_embeddings(query)
# Calculate similarities
similarities = []
for doc_emb in self.document_embeddings:
sim = np.dot(query_emb, doc_emb) / (
np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
)
similarities.append(sim)
# Get top-k results
top_indices = np.argsort(similarities)[-top_k:][::-1]
results = []
for idx in top_indices:
results.append({
'document': self.documents[idx],
'similarity': similarities[idx]
})
return results
# Usage
search = SimpleDocumentSearch()
# Add documents
docs = [
"Python is a high-level programming language",
"Machine learning helps computers learn from data",
"SimplerLLM simplifies working with language models",
"Embeddings convert text into vector representations",
"The restaurant serves delicious Italian food"
]
search.add_documents(docs)
# Search
results = search.search("How do I work with AI models?", top_k=3)
for i, result in enumerate(results, 1):
print(f"{i}. {result['document']}")
print(f" Similarity: {result['similarity']:.4f}\n")
Best Practices
1. Cache Embeddings
Embeddings are expensive to generate. Cache them to avoid regenerating for the same text.
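A minimal in-process cache keyed by a hash of the text (for anything long-lived, persist to disk or a vector database instead):
import hashlib
_embedding_cache = {}
def cached_embedding(embeddings, text):
    """Return the stored vector if this exact text was embedded before."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embeddings.generate_embeddings(text)
    return _embedding_cache[key]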
2. Normalize Text
Clean and normalize text consistently before generating embeddings (for example, trimming whitespace; some pipelines also lowercase). What matters most is applying the same preprocessing everywhere, so identical content always maps to the same vector.
3. Batch When Possible
Process multiple texts together to reduce API calls and improve efficiency.
4. Use the Same Model
Always use the same embedding model for texts you want to compare - embeddings from different models are not comparable.
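A cheap safeguard is to record the model name next to each vector and refuse to compare mismatched pairs; a sketch:
import numpy as np
def tagged_embedding(embeddings, text, model_name):
    """Keep the model name with the vector so accidental mixes fail loudly."""
    return {"model": model_name, "vector": embeddings.generate_embeddings(text)}
def safe_similarity(a, b):
    if a["model"] != b["model"]:
        raise ValueError(f"Model mismatch: {a['model']} vs {b['model']}")
    v1, v2 = np.array(a["vector"]), np.array(b["vector"])
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))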
5. Store Efficiently
Use specialized vector databases for large-scale embedding storage and retrieval (see Vector Databases documentation).
Error Handling
from SimplerLLM.language.embeddings import EmbeddingsOpenAI
embeddings = EmbeddingsOpenAI()
try:
text = "Generate embedding for this text"
embedding = embeddings.generate_embeddings(text)
print(f"Successfully generated {len(embedding)}-dimensional embedding")
except ValueError as e:
# Handle invalid input
print(f"Invalid input: {e}")
except ConnectionError as e:
# Handle API connection issues
print(f"Connection error: {e}")
except Exception as e:
# Handle other errors
print(f"Error generating embedding: {e}")
Next Steps
Vector Databases →
Store and query embeddings at scale
LLM Router →
Intelligently route queries using embeddings
📚 Additional Resources
Learn more about embeddings and their applications:
- OpenAI Embeddings Guide
- Cohere Embeddings Documentation
- Check each provider's documentation for current model offerings