
Scoring Guide

This guide explains how to use and customize the scoring system in flujo.

Overview

The orchestrator includes several scoring mechanisms:

  • Ratio-based scoring
  • Weighted scoring
  • Model-based scoring
  • Custom scoring functions

Basic Scoring

Ratio Score

The simplest scoring method returns the fraction of checklist items that passed:

from flujo.domain.scoring import ratio_score

# Calculate a simple pass/fail ratio
score = ratio_score(checklist)
# Returns a float between 0.0 and 1.0
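
For example, a checklist where two of three items passed scores roughly 0.67. This is a minimal sketch: the import path for Checklist and ChecklistItem is an assumption (adjust to your flujo version), while the passed flag matches the custom scorer shown later in this guide.

from flujo.domain.models import Checklist, ChecklistItem  # import path may vary

# Two of three items passed, so ratio_score returns 2/3
checklist = Checklist(items=[
    ChecklistItem(description="Code compiles", passed=True),
    ChecklistItem(description="Has unit tests", passed=True),
    ChecklistItem(description="Has docstrings", passed=False),
])
print(ratio_score(checklist))  # ≈ 0.67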

Weighted Score

Assign different weights to checklist items:

from flujo.domain.scoring import weighted_score

# Define weights for different criteria
weights = {
    "correctness": 0.5,
    "readability": 0.3,
    "efficiency": 0.2
}

# Calculate weighted score
score = weighted_score(checklist, weights)
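
How the weight keys map onto checklist items is version-specific; the sketch below assumes they are matched against item descriptions, so treat that mapping (and the Checklist construction) as an assumption to verify against your flujo version:

# Reuses `weights` from above; assumes weight keys match item descriptions
checklist = Checklist(items=[
    ChecklistItem(description="correctness", passed=True),
    ChecklistItem(description="readability", passed=False),
    ChecklistItem(description="efficiency", passed=True),
])
score = weighted_score(checklist, weights)  # 0.5 + 0.2 = 0.7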

Advanced Scoring

Model-Based Scoring

Use an AI model to evaluate quality:

from flujo.infra.agents import make_agent_async
from flujo.domain.scoring import model_score

# Create a scoring agent
scorer_agent = make_agent_async(
    "openai:gpt-4",
    "You are a quality evaluator. Score the solution from 0 to 1.",
    float
)

# Use model-based scoring
score = model_score(checklist, scorer_agent)

Custom Scoring

Create your own scoring function:

def custom_scorer(checklist):
    """Calculate a custom score based on checklist items."""
    total_score = 0
    total_weight = 0

    for item in checklist.items:
        # Define custom weights based on item type
        weight = 1.0
        if "critical" in item.description.lower():
            weight = 2.0
        elif "optional" in item.description.lower():
            weight = 0.5

        # Add to total
        total_score += weight * (1.0 if item.passed else 0.0)
        total_weight += weight

    return total_score / total_weight if total_weight > 0 else 0.0

# Use in pipeline
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(make_solution_agent())
    >> Step.validate(
        make_validator_agent(),
        scorer=custom_scorer
    )
)

Scoring in Pipelines

Step-Level Scoring

Configure scoring for individual steps:

# Review step with custom scoring
review_step = Step.review(
    make_review_agent(),
    scorer=lambda c: weighted_score(c, {
        "completeness": 0.4,
        "clarity": 0.6
    })
)

# Validation step with model scoring
validate_step = Step.validate(
    make_validator_agent(),
    scorer=lambda c: model_score(c, scorer_agent)
)

Pipeline-Level Scoring

Configure scoring for the entire pipeline:

# Create a pipeline with custom scoring
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(make_solution_agent())
    >> Step.validate(make_validator_agent())
)

# Configure the runner with custom scoring
runner = Flujo(
    pipeline,
    scorer=lambda c: weighted_score(c, {
        "review_score": 0.3,
        "solution_score": 0.5,
        "validation_score": 0.2
    })
)
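
Running the configured runner is then a one-liner. The call below is a sketch: the run method name and the shape of the result are assumptions, so check them against your flujo version.

# Execute the pipeline with the pipeline-level scorer applied
# (`run` and the result shape are assumptions; verify against your version)
result = runner.run("Generate a Python function that parses ISO dates.")
print(result)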

Scoring Best Practices

1. Define Clear Criteria

# Example checklist with clear criteria
checklist = Checklist(items=[
    ChecklistItem(
        description="Code follows PEP 8 style guide",
        category="style",
        critical=True
    ),
    ChecklistItem(
        description="Includes docstrings for all functions",
        category="documentation",
        critical=True
    ),
    ChecklistItem(
        description="Has unit tests",
        category="testing",
        critical=False
    )
])

2. Use Appropriate Weights

# Example weights for code generation
code_weights = {
    "syntax": 0.3,      # Basic correctness
    "style": 0.2,       # Code style
    "documentation": 0.2,  # Documentation
    "testing": 0.2,     # Test coverage
    "performance": 0.1  # Performance considerations
}

# Example weights for content generation
content_weights = {
    "grammar": 0.3,     # Grammar and spelling
    "style": 0.3,       # Writing style
    "tone": 0.2,        # Appropriate tone
    "clarity": 0.2      # Clear communication
}

3. Implement Progressive Scoring

def progressive_scorer(checklist):
    """Score that requires critical items to pass."""
    # First, check critical items
    critical_items = [i for i in checklist.items if i.critical]
    if not all(i.passed for i in critical_items):
        return 0.0  # Fail if any critical item fails

    # Then, calculate weighted score for remaining items
    non_critical = [i for i in checklist.items if not i.critical]
    return weighted_score(Checklist(items=non_critical), {
        "style": 0.4,
        "documentation": 0.3,
        "testing": 0.3
    })
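
Like any custom scorer, this plugs into a step through the same scorer parameter shown earlier:

# Gate validation on critical items, then weight the rest
validate_step = Step.validate(
    make_validator_agent(),
    scorer=progressive_scorer
)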

4. Use Model Scoring Wisely

# Create a specialized scoring agent
code_scorer = make_agent_async(
    "openai:gpt-4",
    """You are a code quality expert. Evaluate the code based on:
    1. Correctness (40%)
    2. Readability (30%)
    3. Efficiency (30%)
    Return a score between 0 and 1.""",
    float
)

# Use in pipeline
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(make_solution_agent())
    >> Step.validate(
        make_validator_agent(),
        scorer=lambda c: model_score(c, code_scorer)
    )
)

Examples

Code Generation Scoring

from flujo import Step, Flujo
from flujo.plugins import (
    SQLSyntaxValidator,
    CodeStyleValidator
)

# Define code-specific weights
code_weights = {
    "syntax": 0.3,
    "style": 0.2,
    "documentation": 0.2,
    "testing": 0.2,
    "performance": 0.1
}

# Create a code generation pipeline
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(code_agent)
    >> Step.validate(
        make_validator_agent(),
        plugins=[
            SQLSyntaxValidator(),
            CodeStyleValidator()
        ],
        scorer=lambda c: weighted_score(c, code_weights)
    )
)

Content Generation Scoring

# Define content-specific weights
content_weights = {
    "grammar": 0.3,
    "style": 0.3,
    "tone": 0.2,
    "clarity": 0.2
}

# Create a content generation pipeline
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(writer_agent)
    >> Step.validate(
        make_validator_agent(),
        scorer=lambda c: weighted_score(c, content_weights)
    )
)

Troubleshooting

Common Issues

  1. Inconsistent Scores
     • Check weight definitions
     • Verify checklist items
     • Review the scoring function
     • Monitor model outputs

  2. Performance Issues
     • Cache model scores (see the sketch below)
     • Use simpler scoring when possible
     • Batch evaluations
     • Monitor costs

  3. Quality Issues
     • Review scoring criteria
     • Adjust weights
     • Update checklist items
     • Calibrate model scoring
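
Model-based scores are the most expensive to recompute, so one common fix is memoizing them. Below is a minimal sketch: cached_model_score is a hypothetical helper, not part of flujo, and it assumes checklist items expose the description and passed fields used throughout this guide.

from flujo.domain.scoring import model_score

# Hypothetical memoization helper: keys the cache on the checklist's
# (description, passed) pairs so identical checklists are scored once.
_score_cache: dict = {}

def cached_model_score(checklist, agent):
    key = tuple((item.description, item.passed) for item in checklist.items)
    if key not in _score_cache:
        _score_cache[key] = model_score(checklist, agent)
    return _score_cache[key]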
