LLM Judge

Multi-provider orchestration and intelligent evaluation system for selecting, synthesizing, or comparing LLM responses.

Multi-Provider · Parallel Execution · Intelligent Evaluation · Batch Benchmarking

What is LLM Judge?

LLM Judge is a multi-provider orchestration system that sends the same prompt to multiple LLM providers simultaneously, evaluates all responses using a judge LLM, and produces a final result through one of three modes.

Instead of choosing one provider and hoping for the best, LLM Judge leverages the strengths of multiple models and intelligently selects, synthesizes, or compares their outputs.

✨ Why Use LLM Judge?

  • Best Answer Selection: Get the highest-scoring response by evaluating multiple providers
  • Answer Synthesis: Combine strengths from different models into one optimal response
  • Provider Benchmarking: Compare and evaluate provider performance systematically
  • Parallel Speed: Execute multiple providers concurrently for fast results
  • Router Training: Generate data to train LLMRouter for smarter provider selection

Quick Start

Get started with LLM Judge in minutes by comparing multiple providers:

from SimplerLLM.language import LLM, LLMProvider, LLMJudge

# Create providers to evaluate
providers = [
    LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),
    LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),
    LLM.create(LLMProvider.GEMINI, model_name="gemini-1.5-flash")
]

# Create judge (use stronger model for evaluation)
judge_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")

# Initialize LLM Judge
judge = LLMJudge(
    providers=providers,
    judge_llm=judge_llm,
    parallel=True,          # Execute providers concurrently
    verbose=True
)

# Generate and evaluate
result = judge.generate(
    prompt="Explain quantum computing in simple terms",
    mode="synthesize",      # Combine best elements from all responses
    criteria=["accuracy", "clarity", "simplicity"]
)

# Access results
print(f"Final Answer: {result.final_answer}")
print(f"Mode Used: {result.mode}")
print(f"Number of Responses: {len(result.all_responses)}")

# View evaluation details
for response, evaluation in zip(result.all_responses, result.evaluations):
    print(f"\n{response.provider_name}:")
    print(f"  Score: {evaluation.overall_score}/10")
    print(f"  Rank: #{evaluation.rank}")
    print(f"  Strengths: {evaluation.strengths}")

Three Evaluation Modes

LLM Judge supports three distinct modes for different use cases. Choose the mode that matches your goal:

šŸ† Select Best

Evaluates all responses, ranks them, and selects the winning answer as the final output.

Best for: Fast, high-quality answers

Output: Complete text of best response

Use when: You want the single best answer

✨ Synthesize (Recommended)

Creates a new, improved response by combining the best elements from all provider responses.

Best for: Maximum quality output

Output: Synthesized answer

Use when: You want optimal quality

📊 Compare

Provides detailed comparative analysis of all responses with strengths and weaknesses.

Best for: Benchmarking, research

Output: Detailed comparison report

Use when: You need analysis

Mode Examples

Select Best Mode

Pick the winning response from multiple providers:

result = judge.generate(
    prompt="Explain machine learning in 2-3 sentences",
    mode="select_best",
    criteria=["accuracy", "clarity", "conciseness"]
)

# Final answer is the complete text from the highest-ranked provider
print(result.final_answer)  # Best response text

# See which provider won
winner = result.evaluations[0]  # Rank 1 = winner
print(f"Winner: {result.all_responses[0].provider_name}")
print(f"Score: {winner.overall_score}/10")

Synthesize Mode (Recommended)

Combine strengths from all responses into an improved answer:

result = judge.generate(
    prompt="What are the benefits of Python for data science?",
    mode="synthesize",
    criteria=["completeness", "accuracy", "clarity"]
)

# Final answer is a NEW synthesized response combining all strengths
print(result.final_answer)  # Synthesized improved response

# Still see all original responses and their evaluations
for response, evaluation in zip(result.all_responses, result.evaluations):
    print(f"{response.provider_name}: Score {evaluation.overall_score}/10")
    print(f"  Strengths: {evaluation.strengths}")

Compare Mode

Get detailed comparative analysis for benchmarking:

result = judge.generate(
    prompt="Explain supervised vs unsupervised learning",
    mode="compare",
    criteria=["accuracy", "clarity", "depth"]
)

# Final answer is a comprehensive comparison summary
print(result.final_answer)  # Detailed comparison text

# Access detailed evaluations for each provider
for response, evaluation in zip(result.all_responses, result.evaluations):
    print(f"\n{response.provider_name} (Rank #{evaluation.rank}):")
    print(f"  Overall Score: {evaluation.overall_score}/10")
    print(f"  Strengths: {evaluation.strengths}")
    print(f"  Weaknesses: {evaluation.weaknesses}")

    # Per-criterion scores
    for criterion, score in evaluation.criterion_scores.items():
        print(f"  {criterion}: {score}/10")

Configuration Options

Customize LLM Judge behavior with these configuration parameters:

Initialization Parameters

LLMJudge(
    providers: List[LLM],              # Required: List of LLM instances to evaluate
    judge_llm: LLM,                    # Required: LLM instance to act as judge
    parallel: bool = True,              # Execute providers in parallel (faster)
    default_criteria: List[str] = None, # Default: ["accuracy", "clarity", "completeness"]
    verbose: bool = False               # Enable detailed logging
)

providers

List of LLM instances to evaluate. Minimum 1 provider required (though 2+ is recommended for comparison).

judge_llm

LLM instance used for evaluation. Typically a stronger model (e.g., GPT-4, Claude Opus) for better judgment quality.

parallel

When True, executes all providers concurrently using ThreadPoolExecutor for faster results; a minimal sketch of this fan-out pattern appears after these parameter notes.

default_criteria

Default evaluation criteria if not specified in generate(). Use 3-5 criteria for balanced evaluation.
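
To illustrate what parallel=True means in practice, here is a minimal sketch of the fan-out pattern using only the standard library. This is not SimplerLLM's actual implementation; call_provider is a hypothetical stand-in for a provider call (in real code you would call something like llm.generate_response(prompt=prompt)):

from concurrent.futures import ThreadPoolExecutor
import time

def call_provider(name: str, prompt: str) -> dict:
    # Hypothetical stand-in for a real provider call
    time.sleep(0.1)  # simulate network latency
    return {"provider": name, "response": f"{name} answer to: {prompt}"}

provider_names = ["openai", "anthropic", "gemini"]
prompt = "Explain quantum computing in simple terms"

# Fan the same prompt out to all providers concurrently, then gather results
with ThreadPoolExecutor(max_workers=len(provider_names)) as pool:
    futures = [pool.submit(call_provider, name, prompt) for name in provider_names]
    responses = [f.result() for f in futures]

for r in responses:
    print(f"{r['provider']}: {r['response']}")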

Generate Method Parameters

judge.generate(
    prompt: str,                       # Required: Prompt to send to all providers
    mode: str = "synthesize",          # "select_best", "synthesize", or "compare"
    criteria: List[str] = None,        # Custom criteria (uses default_criteria if None)
    system_prompt: str = None,         # Optional system prompt for providers
    generate_summary: bool = False     # Generate RouterSummary for LLMRouter training
)

mode

Evaluation mode: "select_best" (pick winner), "synthesize" (combine strengths), or "compare" (detailed analysis).

criteria

List of criteria for evaluation (e.g., ["accuracy", "clarity", "depth"]). If None, uses default_criteria.

generate_summary

When True, generates RouterSummary with recommendations for LLMRouter training.
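
Putting these parameters together, a call might look like the following (assuming the judge instance from the Quick Start; the prompt, criteria, and system prompt are illustrative):

result = judge.generate(
    prompt="Summarize the key trade-offs between SQL and NoSQL databases",
    mode="select_best",
    criteria=["accuracy", "balance", "conciseness"],          # overrides default_criteria
    system_prompt="You are a pragmatic database architect.",  # passed to the providers
    generate_summary=False                                    # no RouterSummary needed here
)
print(result.final_answer)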

Working with Results

The JudgeResult object provides comprehensive access to all responses, evaluations, and metadata:

result = judge.generate(prompt, mode="synthesize")

# Core Results
print(result.final_answer)      # The final answer (best, synthesized, or comparison)
print(result.mode)               # Mode used: "select_best", "synthesize", or "compare"
print(result.total_execution_time)  # Total time in seconds

# All Provider Responses
for response in result.all_responses:
    print(f"Provider: {response.provider_name}")
    print(f"Model: {response.model_name}")
    print(f"Response: {response.response_text}")
    print(f"Time: {response.execution_time}s")
    if response.error:
        print(f"Error: {response.error}")

# Evaluations (sorted by rank)
for evaluation in result.evaluations:
    print(f"\nRank #{evaluation.rank}")
    print(f"Overall Score: {evaluation.overall_score}/10")
    print(f"Reasoning: {evaluation.reasoning}")
    print(f"Strengths: {evaluation.strengths}")
    print(f"Weaknesses: {evaluation.weaknesses}")

    # Per-criterion scores
    for criterion, score in evaluation.criterion_scores.items():
        print(f"  {criterion}: {score}/10")

# Confidence Scores (normalized 0-1)
for provider, confidence in zip(result.all_responses, result.confidence_scores):
    print(f"{provider.provider_name}: {confidence:.2%} confidence")

# Criteria Used
print(f"Evaluation criteria: {result.criteria_used}")

Batch Evaluation & Benchmarking

Evaluate multiple prompts to benchmark provider performance across different query types:

from SimplerLLM.language import LLM, LLMProvider, LLMJudge

# Setup providers and judge
providers = [
    LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),
    LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),
    LLM.create(LLMProvider.GEMINI, model_name="gemini-1.5-flash"),
]

judge_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")
judge = LLMJudge(providers=providers, judge_llm=judge_llm, parallel=True)

# Define test prompts
prompts = [
    "What is artificial intelligence?",
    "Explain neural networks in simple terms",
    "What are the applications of AI in healthcare?",
    "Write a Python function to reverse a string",
    "Explain the difference between machine learning and deep learning"
]

# Batch evaluate
print("Running batch evaluation...")
results = judge.evaluate_batch(
    prompts=prompts,
    mode="compare",  # Use compare mode for detailed analysis
    criteria=["accuracy", "clarity", "completeness"]
)

# Generate evaluation report
report = judge.generate_evaluation_report(
    results=results,
    export_format="json",      # Export to JSON
    export_path="benchmark_report.json"
)

# Access statistics
print(f"\nšŸ“Š Benchmark Results:")
print(f"Total Queries: {report.total_queries}")
print(f"\nProvider Win Counts:")
for provider, wins in report.provider_win_counts.items():
    print(f"  {provider}: {wins} wins")

print(f"\nAverage Scores:")
for provider, score in report.average_scores.items():
    print(f"  {provider}: {score:.2f}/10")

print(f"\nBest Provider Overall: {report.best_provider_overall}")

# Per-criterion winners
print(f"\nBest Providers by Criterion:")
for criterion, provider in report.best_provider_by_criteria.items():
    print(f"  {criterion}: {provider}")

šŸ“ Export Formats

Evaluation reports can be exported to JSON or CSV for further analysis:

# Export to JSON (includes all evaluation details)
report = judge.generate_evaluation_report(
    results=results,
    export_format="json",
    export_path="benchmark.json"
)

# Export to CSV (for Excel, pandas analysis)
report = judge.generate_evaluation_report(
    results=results,
    export_format="csv",
    export_path="benchmark.csv"
)
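
To analyze an export afterwards, read the files back with standard tooling. A minimal sketch follows; the exact JSON/CSV schema is whatever SimplerLLM writes, so inspect it before building on it:

import json

# Inspect the JSON export
with open("benchmark.json") as f:
    data = json.load(f)
print(list(data)[:5] if isinstance(data, dict) else len(data))

# For the CSV export, pandas is a convenient option:
# import pandas as pd
# df = pd.read_csv("benchmark.csv")
# print(df.head())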

Router Training with RouterSummary

Generate training data to configure LLMRouter for intelligent provider selection:

from SimplerLLM.language import LLM, LLMProvider, LLMJudge

# Create judge
judge = LLMJudge(
    providers=[
        LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),
        LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),
    ],
    judge_llm=LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")
)

# Generate with router summary
result = judge.generate(
    prompt="Write a Python function to calculate factorial",
    mode="select_best",
    generate_summary=True  # Enable router summary generation
)

# Access router summary
summary = judge._router_summary

print(f"Query Type: {summary.query_type}")  # e.g., "coding_task"
print(f"Winning Provider: {summary.winning_provider}")
print(f"\nProvider Scores:")
for provider, score in summary.provider_scores.items():
    print(f"  {provider}: {score}/10")

print(f"\nRecommendation: {summary.recommendation}")
# e.g., "For coding_task queries, Anthropic excels with average score 9.2/10"

# Use recommendations to configure LLMRouter
from SimplerLLM.language.llm_router import LLMRouter

router_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini")
router = LLMRouter(llm_instance=router_llm)

# Add routing choices based on benchmarks
router.add_choices([
    ("Use OpenAI for general queries", {"provider": "openai"}),
    ("Use Anthropic for coding tasks", {"provider": "anthropic"}),
    ("Use Gemini for creative writing", {"provider": "gemini"}),
])

# Now router can intelligently select providers
choice = router.route("Write a Python class for a linked list")
print(f"Router chose: {choice.choice_metadata['provider']}")

LLM Judge vs LLM Feedback Loop

Both features improve output quality, but they work differently. Choose the right tool for your use case:

Aspect | LLM Judge | LLM Feedback Loop
Approach | Multiple providers answer once (breadth) | One answer improved multiple times (depth)
Providers | 2+ providers (parallel) | Single, dual, or rotating
Iterations | 1 round (all providers + judge) | Multiple rounds (3-5 typical)
Output | Best answer OR synthesized OR comparison | Iteratively refined single answer
Execution | Parallel (faster) | Sequential (slower)
Best For | Fast quality, benchmarking, synthesis | Maximum quality through refinement
Cost | N providers + 1 judge call | ~2N calls (N iterations × 2)
Use Case | Provider comparison, quick wins | Critical content, iterative polish
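
As a rough worked example of the cost row: with 3 providers, LLM Judge makes 3 provider calls plus about one judge call per request (roughly 4 calls total), while a 3-iteration feedback loop makes roughly 3 × 2 = 6 calls.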

Use LLM Judge when:

  • ✓ You want to leverage multiple models simultaneously
  • ✓ You need fast results with parallel execution
  • ✓ You're benchmarking provider performance
  • ✓ You want to synthesize insights from different models
  • ✓ You're uncertain which provider is best for a task
  • ✓ You need router training data

Use LLM Feedback Loop when:

  • ✓ You want maximum quality through iteration
  • ✓ You have time for multiple improvement cycles
  • ✓ You need to track improvement trajectory
  • ✓ You want detailed critique at each step
  • ✓ You already know which provider(s) to use
  • ✓ Content quality is more important than speed

💡 Pro Tip: Use Both Together

Get the best of both worlds by using LLM Judge first for synthesis, then LLM Feedback Loop for refinement:

# Step 1: Use Judge to synthesize answer from multiple providers
judge_result = judge.generate(
    prompt="Explain blockchain technology",
    mode="synthesize"
)

# Step 2: Use Feedback Loop to refine the synthesized answer
from SimplerLLM.language import LLMFeedbackLoop

# Pick the LLM that will do the refinement (a strong model works well here)
best_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")

feedback = LLMFeedbackLoop(
    llm=best_llm,
    max_iterations=3,
    quality_threshold=9.0
)

final_result = feedback.improve(
    prompt="Explain blockchain technology",
    initial_answer=judge_result.final_answer
)

# Result: Best synthesis from multiple models + iterative polish
print(final_result.final_answer)

Best Practices

Mode Selection Guidelines

Use SELECT_BEST when:

  • You want the fastest results (one provider's complete answer)
  • You trust the top-ranked provider's output as-is
  • You're doing rapid prototyping or testing

Use SYNTHESIZE when:

  • You want maximum output quality (recommended for most cases)
  • You want to combine strengths from multiple models
  • Output quality is more important than raw speed

Use COMPARE when:

  • You're benchmarking providers systematically
  • You need detailed analysis of response differences
  • You're generating router training data
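
If you choose modes programmatically, a tiny helper along these lines can encode the guidelines above. This is hypothetical convenience code, not part of SimplerLLM, and assumes a judge instance like the ones created earlier:

def pick_mode(use_case: str) -> str:
    # Map a rough use-case label to an LLMJudge mode, following the guidelines above
    if use_case in ("prototype", "fast_answer"):
        return "select_best"      # fastest: one provider's complete answer
    if use_case in ("benchmark", "analysis", "router_training"):
        return "compare"          # detailed comparative report
    return "synthesize"           # default: maximum output quality

result = judge.generate(
    prompt="Explain overfitting in machine learning",
    mode=pick_mode("fast_answer")
)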

Provider Selection

# Mix provider tiers for cost efficiency
providers = [
    LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),        # Fast, cheap
    LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),  # Balanced
    LLM.create(LLMProvider.GEMINI, model_name="gemini-1.5-flash")    # Cost-effective
]

# Use stronger judge for better evaluation quality
judge_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")  # Stronger judge

Evaluation Criteria Design

Use 3-5 specific, relevant criteria for balanced evaluation:

# General queries
criteria = ["accuracy", "clarity", "completeness"]

# Technical explanations
criteria = ["accuracy", "depth", "clarity", "examples"]

# Coding tasks
criteria = ["correctness", "efficiency", "readability", "best_practices"]

# Creative content
criteria = ["creativity", "engagement", "clarity", "originality"]

# Research/analysis
criteria = ["accuracy", "depth", "citations", "objectivity"]

Performance Optimization

  • Enable parallel execution: Set parallel=True (default) for concurrent provider execution
  • Limit provider count: 2-4 providers balances quality and cost; more isn't always better
  • Use tier-appropriate judges: GPT-4 or Claude Opus for critical evaluations, GPT-4o-mini for routine tasks
  • Batch similar queries: Use evaluate_batch() for efficient multi-query benchmarking
  • Cache results: Store evaluation reports for recurring query patterns
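
For the last point, a minimal caching sketch is shown below. It uses a plain in-memory dictionary and is not a SimplerLLM feature; in production you might prefer something persistent (disk, Redis, etc.):

_judge_cache = {}

def cached_generate(judge, prompt, mode="synthesize", criteria=None):
    # Reuse a previous JudgeResult for identical (prompt, mode, criteria) requests
    key = (prompt, mode, tuple(criteria) if criteria else None)
    if key not in _judge_cache:
        _judge_cache[key] = judge.generate(prompt=prompt, mode=mode, criteria=criteria)
    return _judge_cache[key]

first = cached_generate(judge, "What is retrieval-augmented generation?", mode="select_best")
second = cached_generate(judge, "What is retrieval-augmented generation?", mode="select_best")  # cache hit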

Real-World Example: Provider Benchmarking

Complete workflow for benchmarking providers across different task types:

from SimplerLLM.language import LLM, LLMProvider, LLMJudge
import json

# Setup providers to benchmark
providers = [
    LLM.create(LLMProvider.OPENAI, model_name="gpt-4o-mini"),
    LLM.create(LLMProvider.ANTHROPIC, model_name="claude-3-5-haiku-20241022"),
    LLM.create(LLMProvider.GEMINI, model_name="gemini-1.5-flash"),
    LLM.create(LLMProvider.COHERE, model_name="command-r"),
]

# Strong judge for reliable evaluation
judge_llm = LLM.create(LLMProvider.OPENAI, model_name="gpt-4o")

# Initialize judge
judge = LLMJudge(
    providers=providers,
    judge_llm=judge_llm,
    parallel=True,
    default_criteria=["accuracy", "clarity", "completeness"],
    verbose=True
)

# Define diverse test prompts across task types
test_cases = [
    # General knowledge
    "What is quantum computing and how does it work?",
    "Explain the theory of relativity in simple terms",

    # Coding tasks
    "Write a Python function to find the longest common subsequence",
    "Create a JavaScript class for a binary search tree",

    # Creative writing
    "Write a short poem about artificial intelligence",
    "Create a compelling product description for smart home devices",

    # Analysis
    "Compare renewable energy sources: solar, wind, and hydro",
    "Analyze the pros and cons of remote work",
]

# Run benchmark
print("Running comprehensive benchmark...")
results = judge.evaluate_batch(
    prompts=test_cases,
    mode="compare",  # Detailed analysis for benchmarking
    criteria=["accuracy", "clarity", "depth", "relevance"]
)

# Generate comprehensive report
report = judge.generate_evaluation_report(
    results=results,
    export_format="json",
    export_path="provider_benchmark_2024.json"
)

# Display results
print("\n" + "="*80)
print("PROVIDER BENCHMARK RESULTS")
print("="*80)

print(f"\nTotal Queries Tested: {report.total_queries}")

print("\nšŸ“Š Win Counts (Rank #1):")
for provider, wins in sorted(report.provider_win_counts.items(),
                             key=lambda x: x[1], reverse=True):
    win_rate = (wins / report.total_queries) * 100
    print(f"  {provider}: {wins} wins ({win_rate:.1f}%)")

print("\n⭐ Average Scores (out of 10):")
for provider, score in sorted(report.average_scores.items(),
                              key=lambda x: x[1], reverse=True):
    print(f"  {provider}: {score:.2f}/10")

print(f"\nšŸ† Best Overall Provider: {report.best_provider_overall}")

print("\nšŸ“ˆ Best Providers by Criterion:")
for criterion, provider in report.best_provider_by_criteria.items():
    print(f"  {criterion}: {provider}")

# Detailed per-query analysis
print("\n" + "="*80)
print("DETAILED QUERY ANALYSIS")
print("="*80)

for i, (prompt, result) in enumerate(zip(test_cases, results), 1):
    print(f"\n{i}. {prompt[:60]}...")

    # Show top 3 providers for this query
    top_3 = sorted(
        zip(result.all_responses, result.evaluations),
        key=lambda x: x[1].rank
    )[:3]

    for response, evaluation in top_3:
        print(f"   #{evaluation.rank}. {response.provider_name}: "
              f"{evaluation.overall_score}/10")
        print(f"       Strengths: {evaluation.strengths[:80]}...")

# Export detailed CSV for further analysis
report_csv = judge.generate_evaluation_report(
    results=results,
    export_format="csv",
    export_path="provider_benchmark_2024.csv"
)

print("\nāœ… Benchmark complete! Results saved to:")
print("   - provider_benchmark_2024.json (comprehensive)")
print("   - provider_benchmark_2024.csv (for spreadsheet analysis)")

API Reference

LLMJudge Class

class LLMJudge:
    """Multi-provider orchestration and evaluation system."""

    def __init__(
        self,
        providers: List[LLM],
        judge_llm: LLM,
        parallel: bool = True,
        default_criteria: Optional[List[str]] = None,
        verbose: bool = False
    )

    def generate(
        self,
        prompt: str,
        mode: str = "synthesize",
        criteria: Optional[List[str]] = None,
        system_prompt: Optional[str] = None,
        generate_summary: bool = False
    ) -> JudgeResult

    def evaluate_batch(
        self,
        prompts: List[str],
        mode: str = "compare",
        criteria: Optional[List[str]] = None
    ) -> List[JudgeResult]

    def generate_evaluation_report(
        self,
        results: List[JudgeResult],
        export_format: Optional[str] = None,
        export_path: Optional[str] = None
    ) -> EvaluationReport

JudgeResult

class JudgeResult:
    final_answer: str                          # Final answer (best/synthesized/comparison)
    all_responses: List[ProviderResponse]      # All provider responses
    evaluations: List[ProviderEvaluation]      # Evaluations sorted by rank
    confidence_scores: List[float]             # Normalized scores (0-1)
    mode: str                                   # Mode used
    criteria_used: List[str]                   # Evaluation criteria
    total_execution_time: float                # Total time in seconds
    judge_execution_time: float                # Judge evaluation time
    timestamp: datetime                         # Execution timestamp

ProviderEvaluation

class ProviderEvaluation:
    overall_score: float                       # Overall score (1-10)
    rank: int                                   # Ranking (1 = best)
    criterion_scores: Dict[str, float]         # Score per criterion
    reasoning: str                              # Evaluation reasoning
    strengths: str                              # Identified strengths
    weaknesses: str                             # Identified weaknesses

EvaluationReport

class EvaluationReport:
    total_queries: int                          # Number of queries evaluated
    provider_win_counts: Dict[str, int]        # Win counts per provider
    average_scores: Dict[str, float]           # Average scores per provider
    best_provider_overall: str                  # Overall best provider
    best_provider_by_criteria: Dict[str, str]  # Best provider per criterion
    timestamp: datetime                         # Report generation time

Next Steps