
Scoring Guide

This guide explains how to use and customize the scoring system in flujo.

Overview

The orchestrator includes several scoring mechanisms:

  • Ratio-based scoring
  • Weighted scoring
  • Model-based scoring
  • Custom scoring functions

Basic Scoring

Ratio Score

The simplest scoring method returns the fraction of checklist items that passed:

from flujo.domain.scoring import ratio_score

# Calculate a simple pass/fail ratio
score = ratio_score(checklist)
# Returns a float between 0.0 and 1.0
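
For example, a checklist where two of three items passed scores roughly 0.67. This is a minimal sketch: the import path for Checklist and ChecklistItem is an assumption (adjust to your flujo version), while the passed flag matches the custom scorer shown later in this guide.

from flujo.domain.models import Checklist, ChecklistItem  # import path may vary

# Two of three items passed, so ratio_score returns 2/3
checklist = Checklist(items=[
    ChecklistItem(description="Code compiles", passed=True),
    ChecklistItem(description="Has unit tests", passed=True),
    ChecklistItem(description="Has docstrings", passed=False),
])
print(ratio_score(checklist))  # ≈ 0.67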

Weighted Score

Assign different weights to checklist items:

from flujo.domain.scoring import weighted_score

# Define weights for different criteria
weights = {
    "correctness": 0.5,
    "readability": 0.3,
    "efficiency": 0.2
}

# Calculate weighted score
score = weighted_score(checklist, weights)
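
How the weight keys map onto checklist items is version-specific; the sketch below assumes they are matched against item descriptions, so treat that mapping (and the Checklist construction) as an assumption to verify against your flujo version:

# Reuses `weights` from above; assumes weight keys match item descriptions
checklist = Checklist(items=[
    ChecklistItem(description="correctness", passed=True),
    ChecklistItem(description="readability", passed=False),
    ChecklistItem(description="efficiency", passed=True),
])
score = weighted_score(checklist, weights)  # 0.5 + 0.2 = 0.7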

Advanced Scoring

Model-Based Scoring

Use an AI model to evaluate quality:

from flujo.infra.agents import make_agent_async
from flujo.domain.scoring import model_score

# Create a scoring agent
scorer_agent = make_agent_async(
    "openai:gpt-4",
    "You are a quality evaluator. Score the solution from 0 to 1.",
    float
)

# Use model-based scoring
score = model_score(checklist, scorer_agent)

Custom Scoring

Create your own scoring function:

def custom_scorer(checklist):
    """Calculate a custom score based on checklist items."""
    total_score = 0
    total_weight = 0

    for item in checklist.items:
        # Define custom weights based on item type
        weight = 1.0
        if "critical" in item.description.lower():
            weight = 2.0
        elif "optional" in item.description.lower():
            weight = 0.5

        # Add to total
        total_score += weight * (1.0 if item.passed else 0.0)
        total_weight += weight

    return total_score / total_weight if total_weight > 0 else 0.0

# Use in pipeline
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(make_solution_agent())
    >> Step.validate(
        make_validator_agent(),
        scorer=custom_scorer
    )
)

Scoring in Pipelines

Step-Level Scoring

Configure scoring for individual steps:

# Review step with custom scoring
review_step = Step.review(
    make_review_agent(),
    scorer=lambda c: weighted_score(c, {
        "completeness": 0.4,
        "clarity": 0.6
    })
)

# Validation step with model scoring
validate_step = Step.validate(
    make_validator_agent(),
    scorer=lambda c: model_score(c, scorer_agent)
)

Pipeline-Level Scoring

Configure scoring for the entire pipeline:

# Create a pipeline with custom scoring
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(make_solution_agent())
    >> Step.validate(make_validator_agent())
)

# Configure the runner with custom scoring
runner = Flujo(
    pipeline,
    scorer=lambda c: weighted_score(c, {
        "review_score": 0.3,
        "solution_score": 0.5,
        "validation_score": 0.2
    })
)
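
Running the configured runner is then a one-liner. The call below is a sketch: the run method name and the shape of the result are assumptions, so check them against your flujo version.

# Execute the pipeline with the pipeline-level scorer applied
# (`run` and the result shape are assumptions; verify against your version)
result = runner.run("Generate a Python function that parses ISO dates.")
print(result)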

Scoring Best Practices

1. Define Clear Criteria

# Example checklist with clear criteria
checklist = Checklist(items=[
    ChecklistItem(
        description="Code follows PEP 8 style guide",
        category="style",
        critical=True
    ),
    ChecklistItem(
        description="Includes docstrings for all functions",
        category="documentation",
        critical=True
    ),
    ChecklistItem(
        description="Has unit tests",
        category="testing",
        critical=False
    )
])

2. Use Appropriate Weights

# Example weights for code generation
code_weights = {
    "syntax": 0.3,      # Basic correctness
    "style": 0.2,       # Code style
    "documentation": 0.2,  # Documentation
    "testing": 0.2,     # Test coverage
    "performance": 0.1  # Performance considerations
}

# Example weights for content generation
content_weights = {
    "grammar": 0.3,     # Grammar and spelling
    "style": 0.3,       # Writing style
    "tone": 0.2,        # Appropriate tone
    "clarity": 0.2      # Clear communication
}

3. Implement Progressive Scoring

def progressive_scorer(checklist):
    """Score that requires critical items to pass."""
    # First, check critical items
    critical_items = [i for i in checklist.items if i.critical]
    if not all(i.passed for i in critical_items):
        return 0.0  # Fail if any critical item fails

    # Then, calculate weighted score for remaining items
    non_critical = [i for i in checklist.items if not i.critical]
    return weighted_score(Checklist(items=non_critical), {
        "style": 0.4,
        "documentation": 0.3,
        "testing": 0.3
    })
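
Like any custom scorer, this plugs into a step through the same scorer parameter shown earlier:

# Gate validation on critical items, then weight the rest
validate_step = Step.validate(
    make_validator_agent(),
    scorer=progressive_scorer
)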

4. Use Model Scoring Wisely

# Create a specialized scoring agent
code_scorer = make_agent_async(
    "openai:gpt-4",
    """You are a code quality expert. Evaluate the code based on:
    1. Correctness (40%)
    2. Readability (30%)
    3. Efficiency (30%)
    Return a score between 0 and 1.""",
    float
)

# Use in pipeline
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(make_solution_agent())
    >> Step.validate(
        make_validator_agent(),
        scorer=lambda c: model_score(c, code_scorer)
    )
)

Examples

Code Generation Scoring

from flujo import Step, Flujo
from flujo.plugins import (
    SQLSyntaxValidator,
    CodeStyleValidator
)

# Define code-specific weights
code_weights = {
    "syntax": 0.3,
    "style": 0.2,
    "documentation": 0.2,
    "testing": 0.2,
    "performance": 0.1
}

# Create a code generation pipeline
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(code_agent)
    >> Step.validate(
        make_validator_agent(),
        plugins=[
            SQLSyntaxValidator(),
            CodeStyleValidator()
        ],
        scorer=lambda c: weighted_score(c, code_weights)
    )
)

Content Generation Scoring

# Define content-specific weights
content_weights = {
    "grammar": 0.3,
    "style": 0.3,
    "tone": 0.2,
    "clarity": 0.2
}

# Create a content generation pipeline
pipeline = (
    Step.review(make_review_agent())
    >> Step.solution(writer_agent)
    >> Step.validate(
        make_validator_agent(),
        scorer=lambda c: weighted_score(c, content_weights)
    )
)

Troubleshooting

Common Issues

  1. Inconsistent Scores
     • Check weight definitions
     • Verify checklist items
     • Review the scoring function
     • Monitor model outputs

  2. Performance Issues
     • Cache model scores (see the sketch below)
     • Use simpler scoring when possible
     • Batch evaluations
     • Monitor costs

  3. Quality Issues
     • Review scoring criteria
     • Adjust weights
     • Update checklist items
     • Calibrate model scoring
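
Model-based scores are the most expensive to recompute, so one common fix is memoizing them. Below is a minimal sketch: cached_model_score is a hypothetical helper, not part of flujo, and it assumes checklist items expose the description and passed fields used throughout this guide.

from flujo.domain.scoring import model_score

# Hypothetical memoization helper: keys the cache on the checklist's
# (description, passed) pairs so identical checklists are scored once.
_score_cache: dict = {}

def cached_model_score(checklist, agent):
    key = tuple((item.description, item.passed) for item in checklist.items)
    if key not in _score_cache:
        _score_cache[key] = model_score(checklist, agent)
    return _score_cache[key]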
