Evaluation Reporting Template
Agent Prompt Snippet
Provide a standardized template for presenting evaluation results, confidence intervals, and failure mode analyses to technical and non-technical stakeholders.
Purpose
The reporting template standardizes how evaluation results, confidence intervals, and failure mode analyses are presented to stakeholders for review.
This is a Recommended document — most projects benefit significantly from having one. While not strictly essential for every project, skipping it often leads to gaps in shared understanding and in the quality of what ships.
What Makes It Good vs Bad
A strong version of this document:
- Documents model architecture, training data, and evaluation metrics clearly
- Includes bias analysis and fairness considerations
- Specifies model versioning, A/B testing, and rollback procedures
- Defines monitoring for model drift, data drift, and performance degradation
- Connects model decisions to business outcomes with measurable criteria
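The data-drift monitoring point above can be made concrete with a small check. Below is a minimal sketch using the Population Stability Index (PSI); the bin count and the 0.2 alert threshold are common heuristics, not requirements of this template.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample.

    Values near 0 mean the two distributions match; > 0.2 is a common
    (heuristic) signal that the input distribution has drifted.
    """
    # Derive bin edges from the baseline so both samples share buckets.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions, flooring to avoid log(0).
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)   # scores the model was validated on
same = rng.normal(0.0, 1.0, 5000)       # fresh sample, no drift
shifted = rng.normal(0.5, 1.0, 5000)    # mean-shifted sample, drifted
print(psi(baseline, same))      # small, close to 0
print(psi(baseline, shifted))   # noticeably larger
```

A production version would pin the baseline histogram at deployment time and evaluate incoming batches against it on a schedule.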
Warning signs of a weak version:
- Only documents final model — no record of experiments or alternatives tried
- Missing bias and fairness analysis for the training data and predictions
- No monitoring strategy for production model performance
- Training pipeline undocumented — impossible to reproduce results
- No clear process for model updates, retraining triggers, or deprecation
Common Mistakes
- Not documenting the training data provenance and preprocessing steps
- Skipping fairness and bias analysis — assuming the data is representative
- Deploying models without monitoring for performance degradation over time
- Treating model training as a one-time event rather than a recurring process
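One way to avoid treating training as a one-time event is to write the retraining triggers down as executable policy. The sketch below is illustrative: the class name, trigger names, and thresholds are assumptions for this example, not part of any standard.

```python
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    """Hypothetical retraining policy; thresholds are illustrative."""
    min_accuracy: float = 0.85   # retrain if live accuracy drops below this
    max_drift_psi: float = 0.2   # retrain if input drift exceeds this
    max_age_days: int = 90       # retrain at least this often regardless

    def should_retrain(self, live_accuracy, drift_psi, model_age_days):
        # Collect every triggered reason so the decision is auditable.
        reasons = []
        if live_accuracy < self.min_accuracy:
            reasons.append("accuracy below floor")
        if drift_psi > self.max_drift_psi:
            reasons.append("input drift above threshold")
        if model_age_days > self.max_age_days:
            reasons.append("model older than policy allows")
        return bool(reasons), reasons

policy = RetrainPolicy()
print(policy.should_retrain(0.83, 0.05, 30))  # (True, ['accuracy below floor'])
print(policy.should_retrain(0.90, 0.05, 30))  # (False, [])
```

Recording the returned reasons alongside each retraining run gives future teams the experiment trail the bullets above call for.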
How to Use This Document
Document the full ML lifecycle: data collection, preprocessing, feature engineering, model selection, training, evaluation, deployment, and monitoring. Record experiment results even for failed approaches — they prevent future teams from repeating dead ends. Define clear criteria for when a model should be retrained or retired.
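Since the template centers on reporting evaluation results with confidence intervals, here is one way those intervals could be computed: a percentile bootstrap over per-example correctness. This is a minimal sketch, assuming a binary-correctness metric like accuracy; the resample count and seed are arbitrary choices.

```python
import numpy as np

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    `correct` is a 0/1 array marking whether each prediction was right.
    """
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    n = len(correct)
    # Resample the evaluation set with replacement, recomputing the
    # metric each time to approximate its sampling distribution.
    stats = np.array([
        correct[rng.integers(0, n, n)].mean() for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(correct.mean()), float(lo), float(hi)

# Example: 870 correct predictions out of 1000
outcomes = np.array([1] * 870 + [0] * 130)
acc, lo, hi = bootstrap_ci(outcomes)
print(f"accuracy = {acc:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval rather than the point estimate alone lets non-technical stakeholders see whether a metric change is within noise.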
For AI agents: Reference ML documentation to understand model behavior, training data characteristics, and known limitations. When modifying ML pipelines, verify changes against documented evaluation metrics and fairness criteria.
Starter Template
SpecBase includes a ready-to-use template for this document: kb/templates/ml/eval_reporting_template.md.tmpl. Use the SpecBase CLI or MCP integration to generate it pre-filled for your project.
# Generate stubs via CLI
specbase init <archetype> --features <features> --dir ./docs
Recommended Reading
- Designing Machine Learning Systems by Chip Huyen — End-to-end guide to ML system design covering data, training, deployment, and monitoring.
- Machine Learning Design Patterns by Valliappa Lakshmanan, Sara Robinson & Michael Munn — Reusable solutions to common challenges in ML engineering and architecture.
- Responsible AI in Practice by Yolanda Gil — Framework for ethical AI development including fairness, transparency, and accountability.