LLM Judge
Multi-provider orchestration and intelligent evaluation system for selecting, synthesizing, or comparing LLM responses.
What is LLM Judge?
LLM Judge is a multi-provider orchestration system that sends the same prompt to multiple LLM providers simultaneously, evaluates all responses with a judge LLM, and produces a final result through one of three modes.
Instead of choosing one provider and hoping for the best, LLM Judge leverages the strengths of multiple models and intelligently selects, synthesizes, or compares their outputs.
Why Use LLM Judge?
- Best Answer Selection: Get the highest-quality response by evaluating multiple providers
- Answer Synthesis: Combine strengths from different models into one optimal response
- Provider Benchmarking: Compare and evaluate provider performance systematically
- Parallel Speed: Execute multiple providers concurrently for fast results
- Router Training: Generate data to train LLMRouter for smarter provider selection
Quick Start
Get started with LLM Judge in minutes by comparing multiple providers:
from SimplerLLM.language import LLM, LLMProvider, LLMJudge
# Create providers to evaluate
providers = [
LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),
LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),
LLM.create(LLMProvider.GEMINI, model_name="gemini-1.5-flash")
]
# Create judge (use stronger model for evaluation)
judge_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")
# Initialize LLM Judge
judge = LLMJudge(
providers=providers,
judge_llm=judge_llm,
parallel=True, # Execute providers concurrently
verbose=True
)
# Generate and evaluate
result = judge.generate(
prompt="Explain quantum computing in simple terms",
mode="synthesize", # Combine best elements from all responses
criteria=["accuracy", "clarity", "simplicity"]
)
# Access results
print(f"Final Answer: {result.final_answer}")
print(f"Mode Used: {result.mode}")
print(f"Number of Responses: {len(result.all_responses)}")
# View evaluation details
for provider, evaluation in zip(result.all_responses, result.evaluations):
    print(f"\n{provider.provider_name}:")
    print(f" Score: {evaluation.overall_score}/10")
    print(f" Rank: #{evaluation.rank}")
    print(f" Strengths: {evaluation.strengths}")
Three Evaluation Modes
LLM Judge supports three distinct modes for different use cases. Choose the mode that matches your goal:
Select Best
Evaluates all responses, ranks them, and selects the winning answer as the final output.
Best for: Fast, high-quality answers
Output: Complete text of best response
Use when: You want the single best answer
Synthesize
Recommended. Creates a new, improved response by combining the best elements from all provider responses.
Best for: Maximum quality output
Output: Synthesized answer
Use when: You want optimal quality
Compare
Provides detailed comparative analysis of all responses with strengths and weaknesses.
Best for: Benchmarking, research
Output: Detailed comparison report
Use when: You need analysis
Mode Examples
Select Best Mode
Pick the winning response from multiple providers:
result = judge.generate(
prompt="Explain machine learning in 2-3 sentences",
mode="select_best",
criteria=["accuracy", "clarity", "conciseness"]
)
# Final answer is the complete text from the highest-ranked provider
print(result.final_answer) # Best response text
# See which provider won
winner = result.evaluations[0] # Rank 1 = winner
print(f"Winner: {result.all_responses[0].provider_name}")
print(f"Score: {winner.overall_score}/10")
Synthesize Mode (Recommended)
Combine strengths from all responses into an improved answer:
result = judge.generate(
prompt="What are the benefits of Python for data science?",
mode="synthesize",
criteria=["completeness", "accuracy", "clarity"]
)
# Final answer is a NEW synthesized response combining all strengths
print(result.final_answer) # Synthesized improved response
# Still see all original responses and their evaluations
for response, evaluation in zip(result.all_responses, result.evaluations):
    print(f"{response.provider_name}: Score {evaluation.overall_score}/10")
    print(f" Strengths: {evaluation.strengths}")
Compare Mode
Get detailed comparative analysis for benchmarking:
result = judge.generate(
prompt="Explain supervised vs unsupervised learning",
mode="compare",
criteria=["accuracy", "clarity", "depth"]
)
# Final answer is a comprehensive comparison summary
print(result.final_answer) # Detailed comparison text
# Access detailed evaluations for each provider
for response, evaluation in zip(result.all_responses, result.evaluations):
    print(f"\n{response.provider_name} (Rank #{evaluation.rank}):")
    print(f" Overall Score: {evaluation.overall_score}/10")
    print(f" Strengths: {evaluation.strengths}")
    print(f" Weaknesses: {evaluation.weaknesses}")
    # Per-criterion scores
    for criterion, score in evaluation.criterion_scores.items():
        print(f" {criterion}: {score}/10")
Configuration Options
Customize LLM Judge behavior with these configuration parameters:
Initialization Parameters
LLMJudge(
    providers: List[LLM],                 # Required: List of LLM instances to evaluate
    judge_llm: LLM,                       # Required: LLM instance to act as judge
    parallel: bool = True,                # Execute providers in parallel (faster)
    default_criteria: List[str] = None,   # Default: ["accuracy", "clarity", "completeness"]
    verbose: bool = False                 # Enable detailed logging
)
providers
List of LLM instances to evaluate. Minimum 1 provider required (though 2+ is recommended for comparison).
judge_llm
LLM instance used for evaluation. Typically a stronger model (e.g., GPT-4, Claude Opus) for better judgment quality.
parallel
When True, executes all providers concurrently using ThreadPoolExecutor for faster results.
default_criteria
Default evaluation criteria if not specified in generate(). Use 3-5 criteria for balanced evaluation.
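For example, criteria set at construction time apply to any generate() call that omits its own list; a minimal sketch of that fallback (the provider and model choices here are illustrative):
# Judge whose default criteria apply whenever generate() omits criteria
judge = LLMJudge(
    providers=[
        LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),
        LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),
    ],
    judge_llm=LLM.create(LLMProvider.OPENAI, model_name="gpt-4o"),
    default_criteria=["accuracy", "clarity", "conciseness"],
)
# No criteria passed here, so the default_criteria above are used
result = judge.generate(prompt="Summarize the CAP theorem in two sentences")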
Generate Method Parameters
judge.generate(
    prompt: str,                     # Required: Prompt to send to all providers
    mode: str = "synthesize",        # "select_best", "synthesize", or "compare"
    criteria: List[str] = None,      # Custom criteria (uses default_criteria if None)
    system_prompt: str = None,       # Optional system prompt for providers
    generate_summary: bool = False   # Generate RouterSummary for LLMRouter training
)
mode
Evaluation mode: "select_best" (pick winner), "synthesize" (combine strengths), or "compare" (detailed analysis).
criteria
List of criteria for evaluation (e.g., ["accuracy", "clarity", "depth"]). If None, uses default_criteria.
generate_summary
When True, generates RouterSummary with recommendations for LLMRouter training.
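Putting these parameters together, a single call can override the mode, criteria, and system prompt for one request; a short sketch using only the parameters listed above (the prompt text is illustrative):
# One-off call that overrides the judge's defaults for this prompt only
result = judge.generate(
    prompt="Summarize the main risks of prompt injection",
    mode="select_best",
    criteria=["accuracy", "depth", "clarity"],
    system_prompt="You are a concise security analyst.",
    generate_summary=False
)
print(result.final_answer)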
Working with Results
The JudgeResult object provides comprehensive access to all responses, evaluations, and metadata:
result = judge.generate(prompt, mode="synthesize")
# Core Results
print(result.final_answer) # The final answer (best, synthesized, or comparison)
print(result.mode) # Mode used: "select_best", "synthesize", or "compare"
print(result.total_execution_time) # Total time in seconds
# All Provider Responses
for response in result.all_responses:
    print(f"Provider: {response.provider_name}")
    print(f"Model: {response.model_name}")
    print(f"Response: {response.response_text}")
    print(f"Time: {response.execution_time}s")
    if response.error:
        print(f"Error: {response.error}")
# Evaluations (sorted by rank)
for evaluation in result.evaluations:
    print(f"\nRank #{evaluation.rank}")
    print(f"Overall Score: {evaluation.overall_score}/10")
    print(f"Reasoning: {evaluation.reasoning}")
    print(f"Strengths: {evaluation.strengths}")
    print(f"Weaknesses: {evaluation.weaknesses}")
    # Per-criterion scores
    for criterion, score in evaluation.criterion_scores.items():
        print(f" {criterion}: {score}/10")
# Confidence Scores (normalized 0-1)
for provider, confidence in zip(result.all_responses, result.confidence_scores):
    print(f"{provider.provider_name}: {confidence:.2%} confidence")
# Criteria Used
print(f"Evaluation criteria: {result.criteria_used}")
Batch Evaluation & Benchmarking
Evaluate multiple prompts to benchmark provider performance across different query types:
from SimplerLLM.language import LLM, LLMProvider, LLMJudge
# Setup providers and judge
providers = [
LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),
LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),
LLM.create(LLMProvider.GEMINI, model_name="gemini-1.5-flash"),
]
judge_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")
judge = LLMJudge(providers=providers, judge_llm=judge_llm, parallel=True)
# Define test prompts
prompts = [
"What is artificial intelligence?",
"Explain neural networks in simple terms",
"What are the applications of AI in healthcare?",
"Write a Python function to reverse a string",
"Explain the difference between machine learning and deep learning"
]
# Batch evaluate
print("Running batch evaluation...")
results = judge.evaluate_batch(
prompts=prompts,
mode="compare", # Use compare mode for detailed analysis
criteria=["accuracy", "clarity", "completeness"]
)
# Generate evaluation report
report = judge.generate_evaluation_report(
results=results,
export_format="json", # Export to JSON
export_path="benchmark_report.json"
)
# Access statistics
print(f"\nš Benchmark Results:")
print(f"Total Queries: {report.total_queries}")
print(f"\nProvider Win Counts:")
for provider, wins in report.provider_win_counts.items():
print(f" {provider}: {wins} wins")
print(f"\nAverage Scores:")
for provider, score in report.average_scores.items():
print(f" {provider}: {score:.2f}/10")
print(f"\nBest Provider Overall: {report.best_provider_overall}")
# Per-criterion winners
print(f"\nBest Providers by Criterion:")
for criterion, provider in report.best_provider_by_criteria.items():
print(f" {criterion}: {provider}")
Export Formats
Evaluation reports can be exported to JSON or CSV for further analysis:
# Export to JSON (includes all evaluation details)
report = judge.generate_evaluation_report(
results=results,
export_format="json",
export_path="benchmark.json"
)
# Export to CSV (for Excel, pandas analysis)
report = judge.generate_evaluation_report(
results=results,
export_format="csv",
export_path="benchmark.csv"
)
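The exported files can then be inspected with standard tooling; for instance, a small sketch that loads the CSV with pandas (the exact column layout depends on the report implementation, so check the header before aggregating):
import pandas as pd

# Load the exported benchmark for further analysis in pandas
df = pd.read_csv("benchmark.csv")
print(df.columns.tolist())  # inspect available columns first
print(df.head())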
Router Training with RouterSummary
Generate training data to configure LLMRouter for intelligent provider selection:
from SimplerLLM.language import LLM, LLMProvider, LLMJudge
# Create judge
judge = LLMJudge(
providers=[
LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),
LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),
],
judge_llm=LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")
)
# Generate with router summary
result = judge.generate(
prompt="Write a Python function to calculate factorial",
mode="select_best",
generate_summary=True # Enable router summary generation
)
# Access router summary
summary = judge._router_summary
print(f"Query Type: {summary.query_type}") # e.g., "coding_task"
print(f"Winning Provider: {summary.winning_provider}")
print(f"\nProvider Scores:")
for provider, score in summary.provider_scores.items():
print(f" {provider}: {score}/10")
print(f"\nRecommendation: {summary.recommendation}")
# e.g., "For coding_task queries, Anthropic excels with average score 9.2/10"
# Use recommendations to configure LLMRouter
from SimplerLLM.language.llm_router import LLMRouter
router_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini")
router = LLMRouter(llm_instance=router_llm)
# Add routing choices based on benchmarks
router.add_choices([
("Use OpenAI for general queries", {"provider": "openai"}),
("Use Anthropic for coding tasks", {"provider": "anthropic"}),
("Use Gemini for creative writing", {"provider": "gemini"}),
])
# Now router can intelligently select providers
choice = router.route("Write a Python class for a linked list")
print(f"Router chose: {choice.choice_metadata['provider']}")
LLM Judge vs LLM Feedback Loop
Both features improve output quality, but they work differently. Choose the right tool for your use case:
| Aspect | LLM Judge | LLM Feedback Loop |
|---|---|---|
| Approach | Multiple providers answer once (breadth) | One answer improved multiple times (depth) |
| Providers | 2+ providers (parallel) | Single, dual, or rotating |
| Iterations | 1 round (all providers + judge) | Multiple rounds (3-5 typical) |
| Output | Best answer OR synthesized OR comparison | Iteratively refined single answer |
| Execution | Parallel (faster) | Sequential (slower) |
| Best For | Fast quality, benchmarking, synthesis | Maximum quality through refinement |
| Cost | N providers + 1 judge call | ~2N calls (N iterations × 2) |
| Use Case | Provider comparison, quick wins | Critical content, iterative polish |
Use LLM Judge when:
- You want to leverage multiple models simultaneously
- You need fast results with parallel execution
- You're benchmarking provider performance
- You want to synthesize insights from different models
- You're uncertain which provider is best for a task
- You need router training data
Use LLM Feedback Loop when:
- You want maximum quality through iteration
- You have time for multiple improvement cycles
- You need to track improvement trajectory
- You want detailed critique at each step
- You already know which provider(s) to use
- Content quality is more important than speed
Pro Tip: Use Both Together
Get the best of both worlds by using LLM Judge first for synthesis, then LLM Feedback Loop for refinement:
# Step 1: Use Judge to synthesize answer from multiple providers
judge_result = judge.generate(
prompt="Explain blockchain technology",
mode="synthesize"
)
# Step 2: Use Feedback Loop to refine the synthesized answer
from SimplerLLM.language import LLMFeedbackLoop
best_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")  # LLM used for refinement (model choice is illustrative)
feedback = LLMFeedbackLoop(
    llm=best_llm,
    max_iterations=3,
    quality_threshold=9.0
)
final_result = feedback.improve(
    prompt="Explain blockchain technology",
    initial_answer=judge_result.final_answer
)
# Result: Best synthesis from multiple models + iterative polish
print(final_result.final_answer)
Best Practices
Mode Selection Guidelines
Use SELECT_BEST when:
- You want the fastest results (one provider's complete answer)
- You trust the top-ranked provider's output as-is
- You're doing rapid prototyping or testing
Use SYNTHESIZE when:
- You want maximum output quality (recommended for most cases)
- You want to combine strengths from multiple models
- Output quality is more important than raw speed
Use COMPARE when:
- You're benchmarking providers systematically
- You need detailed analysis of response differences
- You're generating router training data
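These guidelines can be folded into a small helper that picks a mode from the task at hand; a minimal sketch (the task labels and MODE_BY_TASK mapping are illustrative, not part of the library):
# Illustrative mapping from task type to evaluation mode
MODE_BY_TASK = {
    "prototype": "select_best",   # fastest: take the winner as-is
    "production": "synthesize",   # highest quality: combine strengths
    "benchmark": "compare",       # detailed analysis for evaluation runs
}

def judge_for_task(judge, prompt, task="production"):
    """Run the judge with the mode suggested by the guidelines above."""
    return judge.generate(prompt=prompt, mode=MODE_BY_TASK.get(task, "synthesize"))

result = judge_for_task(judge, "Explain vector databases", task="prototype")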
Provider Selection
# Mix provider tiers for cost efficiency
providers = [
LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"), # Fast, cheap
LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku"), # Balanced
LLM.create(LLMProvider.GEMINI, model_name="gemini-1.5-flash") # Cost-effective
]
# Use stronger judge for better evaluation quality
judge_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o") # Stronger judge
Evaluation Criteria Design
Use 3-5 specific, relevant criteria for balanced evaluation:
# General queries
criteria = ["accuracy", "clarity", "completeness"]
# Technical explanations
criteria = ["accuracy", "depth", "clarity", "examples"]
# Coding tasks
criteria = ["correctness", "efficiency", "readability", "best_practices"]
# Creative content
criteria = ["creativity", "engagement", "clarity", "originality"]
# Research/analysis
criteria = ["accuracy", "depth", "citations", "objectivity"]
Performance Optimization
- Enable parallel execution: Set parallel=True (default) for concurrent provider execution
- Limit provider count: 2-4 providers strike a good balance between quality and cost; more isn't always better
- Use tier-appropriate judges: GPT-4 or Claude Opus for critical evaluations, GPT-4o-mini for routine tasks
- Batch similar queries: Use evaluate_batch() for efficient multi-query benchmarking
- Cache results: Store evaluation reports for recurring query patterns (see the sketch below)
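A lightweight cache for recurring prompts can be as simple as keying stored results by prompt and mode; a minimal sketch assuming results only need to live for the process lifetime (the cache helper below is not part of SimplerLLM):
# Hypothetical in-memory cache keyed by (prompt, mode); not part of SimplerLLM
_judge_cache = {}

def cached_generate(judge, prompt, mode="synthesize", **kwargs):
    """Return a cached JudgeResult for repeated (prompt, mode) pairs."""
    key = (prompt, mode)
    if key not in _judge_cache:
        _judge_cache[key] = judge.generate(prompt=prompt, mode=mode, **kwargs)
    return _judge_cache[key]

first = cached_generate(judge, "What is artificial intelligence?")
second = cached_generate(judge, "What is artificial intelligence?")  # served from cache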
Real-World Example: Provider Benchmarking
Complete workflow for benchmarking providers across different task types:
from SimplerLLM.language import LLM, LLMProvider, LLMJudge
import json
# Setup providers to benchmark
providers = [
LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),
LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),
LLM.create(LLMProvider.GEMINI, model_name="gemini-1.5-flash"),
LLM.create(LLMProvider.COHERE, model_name="command-r"),
]
# Strong judge for reliable evaluation
judge_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")
# Initialize judge
judge = LLMJudge(
providers=providers,
judge_llm=judge_llm,
parallel=True,
default_criteria=["accuracy", "clarity", "completeness"],
verbose=True
)
# Define diverse test prompts across task types
test_cases = [
# General knowledge
"What is quantum computing and how does it work?",
"Explain the theory of relativity in simple terms",
# Coding tasks
"Write a Python function to find the longest common subsequence",
"Create a JavaScript class for a binary search tree",
# Creative writing
"Write a short poem about artificial intelligence",
"Create a compelling product description for smart home devices",
# Analysis
"Compare renewable energy sources: solar, wind, and hydro",
"Analyze the pros and cons of remote work",
]
# Run benchmark
print("Running comprehensive benchmark...")
results = judge.evaluate_batch(
prompts=test_cases,
mode="compare", # Detailed analysis for benchmarking
criteria=["accuracy", "clarity", "depth", "relevance"]
)
# Generate comprehensive report
report = judge.generate_evaluation_report(
results=results,
export_format="json",
export_path="provider_benchmark_2024.json"
)
# Display results
print("\n" + "="*80)
print("PROVIDER BENCHMARK RESULTS")
print("="*80)
print(f"\nTotal Queries Tested: {report.total_queries}")
print("\nš Win Counts (Rank #1):")
for provider, wins in sorted(report.provider_win_counts.items(),
key=lambda x: x[1], reverse=True):
win_rate = (wins / report.total_queries) * 100
print(f" {provider}: {wins} wins ({win_rate:.1f}%)")
print("\nā Average Scores (out of 10):")
for provider, score in sorted(report.average_scores.items(),
key=lambda x: x[1], reverse=True):
print(f" {provider}: {score:.2f}/10")
print(f"\nš Best Overall Provider: {report.best_provider_overall}")
print("\nš Best Providers by Criterion:")
for criterion, provider in report.best_provider_by_criteria.items():
print(f" {criterion}: {provider}")
# Detailed per-query analysis
print("\n" + "="*80)
print("DETAILED QUERY ANALYSIS")
print("="*80)
for i, (prompt, result) in enumerate(zip(test_cases, results), 1):
    print(f"\n{i}. {prompt[:60]}...")
    # Show top 3 providers for this query
    top_3 = sorted(
        zip(result.all_responses, result.evaluations),
        key=lambda x: x[1].rank
    )[:3]
    for response, evaluation in top_3:
        print(f" #{evaluation.rank}. {response.provider_name}: "
              f"{evaluation.overall_score}/10")
        print(f" Strengths: {evaluation.strengths[:80]}...")
# Export detailed CSV for further analysis
report_csv = judge.generate_evaluation_report(
results=results,
export_format="csv",
export_path="provider_benchmark_2024.csv"
)
print("\nā
Benchmark complete! Results saved to:")
print(" - provider_benchmark_2024.json (comprehensive)")
print(" - provider_benchmark_2024.csv (for spreadsheet analysis)")
API Reference
LLMJudge Class
class LLMJudge:
    """Multi-provider orchestration and evaluation system."""

    def __init__(
        self,
        providers: List[LLM],
        judge_llm: LLM,
        parallel: bool = True,
        default_criteria: Optional[List[str]] = None,
        verbose: bool = False
    )

    def generate(
        self,
        prompt: str,
        mode: str = "synthesize",
        criteria: Optional[List[str]] = None,
        system_prompt: Optional[str] = None,
        generate_summary: bool = False
    ) -> JudgeResult

    def evaluate_batch(
        self,
        prompts: List[str],
        mode: str = "compare",
        criteria: Optional[List[str]] = None
    ) -> List[JudgeResult]

    def generate_evaluation_report(
        self,
        results: List[JudgeResult],
        export_format: Optional[str] = None,
        export_path: Optional[str] = None
    ) -> EvaluationReport
JudgeResult
class JudgeResult:
    final_answer: str                       # Final answer (best/synthesized/comparison)
    all_responses: List[ProviderResponse]   # All provider responses
    evaluations: List[ProviderEvaluation]   # Evaluations sorted by rank
    confidence_scores: List[float]          # Normalized scores (0-1)
    mode: str                               # Mode used
    criteria_used: List[str]                # Evaluation criteria
    total_execution_time: float             # Total time in seconds
    judge_execution_time: float             # Judge evaluation time
    timestamp: datetime                     # Execution timestamp
ProviderEvaluation
class ProviderEvaluation:
    overall_score: float                # Overall score (1-10)
    rank: int                           # Ranking (1 = best)
    criterion_scores: Dict[str, float]  # Score per criterion
    reasoning: str                      # Evaluation reasoning
    strengths: str                      # Identified strengths
    weaknesses: str                     # Identified weaknesses
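For logging or downstream storage, the fields above can be flattened into a plain dict; a small sketch using only the attributes listed in this reference (the helper name is illustrative):
def judge_result_to_record(result):
    """Flatten documented JudgeResult fields into a log-friendly dict."""
    return {
        "mode": result.mode,
        "criteria": result.criteria_used,
        "final_answer": result.final_answer,
        "total_time_s": result.total_execution_time,
        "rankings": [
            {"provider": resp.provider_name, "rank": ev.rank, "score": ev.overall_score}
            for resp, ev in zip(result.all_responses, result.evaluations)
        ],
    }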
EvaluationReport
class EvaluationReport:
    total_queries: int                          # Number of queries evaluated
    provider_win_counts: Dict[str, int]         # Win counts per provider
    average_scores: Dict[str, float]            # Average scores per provider
    best_provider_overall: str                  # Overall best provider
    best_provider_by_criteria: Dict[str, str]   # Best provider per criterion
    timestamp: datetime                         # Report generation time