Evaluation Reporting Template
Agent Prompt Snippet
Provide a standardized template for presenting evaluation results, confidence intervals, and failure mode analyses to technical and non-technical stakeholders.
Purpose
The reporting template standardizes how evaluation results, confidence intervals, and failure mode analyses are presented to stakeholders for review.
This is a Recommended document — most projects benefit significantly from having one. While not strictly essential for every project, skipping it often leads to gaps in shared understanding and in the quality of what ships.
What Makes It Good vs Bad
A strong version of this document:
- Documents model architecture, training data, and evaluation metrics clearly
- Includes bias analysis and fairness considerations
- Specifies model versioning, A/B testing, and rollback procedures
- Defines monitoring for model drift, data drift, and performance degradation
- Connects model decisions to business outcomes with measurable criteria
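The data-drift monitoring point above can be made concrete with a small check. Below is a minimal sketch using the Population Stability Index (PSI); the bin count and the 0.2 alert threshold are common heuristics, not requirements of this template.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample.

    Values near 0 mean the two distributions match; > 0.2 is a common
    (heuristic) signal that the input distribution has drifted.
    """
    # Derive bin edges from the baseline so both samples share buckets.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions, flooring to avoid log(0).
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)   # scores the model was validated on
same = rng.normal(0.0, 1.0, 5000)       # fresh sample, no drift
shifted = rng.normal(0.5, 1.0, 5000)    # mean-shifted sample, drifted
print(psi(baseline, same))      # small, close to 0
print(psi(baseline, shifted))   # noticeably larger
```

A production version would pin the baseline histogram at deployment time and evaluate incoming batches against it on a schedule.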
Warning signs of a weak version:
- Only documents final model — no record of experiments or alternatives tried
- Missing bias and fairness analysis for the training data and predictions
- No monitoring strategy for production model performance
- Training pipeline undocumented — impossible to reproduce results
- No clear process for model updates, retraining triggers, or deprecation
Common Mistakes
- Not documenting the training data provenance and preprocessing steps
- Skipping fairness and bias analysis — assuming the data is representative
- Deploying models without monitoring for performance degradation over time
- Treating model training as a one-time event rather than a recurring process
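One way to avoid treating training as a one-time event is to write the retraining triggers down as executable policy. The sketch below is illustrative: the class name, trigger names, and thresholds are assumptions for this example, not part of any standard.

```python
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    """Hypothetical retraining policy; thresholds are illustrative."""
    min_accuracy: float = 0.85   # retrain if live accuracy drops below this
    max_drift_psi: float = 0.2   # retrain if input drift exceeds this
    max_age_days: int = 90       # retrain at least this often regardless

    def should_retrain(self, live_accuracy, drift_psi, model_age_days):
        # Collect every triggered reason so the decision is auditable.
        reasons = []
        if live_accuracy < self.min_accuracy:
            reasons.append("accuracy below floor")
        if drift_psi > self.max_drift_psi:
            reasons.append("input drift above threshold")
        if model_age_days > self.max_age_days:
            reasons.append("model older than policy allows")
        return bool(reasons), reasons

policy = RetrainPolicy()
print(policy.should_retrain(0.83, 0.05, 30))  # (True, ['accuracy below floor'])
print(policy.should_retrain(0.90, 0.05, 30))  # (False, [])
```

Recording the returned reasons alongside each retraining run gives future teams the experiment trail the bullets above call for.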
How to Use This Document
Document the full ML lifecycle: data collection, preprocessing, feature engineering, model selection, training, evaluation, deployment, and monitoring. Record experiment results even for failed approaches — they prevent future teams from repeating dead ends. Define clear criteria for when a model should be retrained or retired.
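Since the template centers on reporting evaluation results with confidence intervals, here is one way those intervals could be computed: a percentile bootstrap over per-example correctness. This is a minimal sketch, assuming a binary-correctness metric like accuracy; the resample count and seed are arbitrary choices.

```python
import numpy as np

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    `correct` is a 0/1 array marking whether each prediction was right.
    """
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    n = len(correct)
    # Resample the evaluation set with replacement, recomputing the
    # metric each time to approximate its sampling distribution.
    stats = np.array([
        correct[rng.integers(0, n, n)].mean() for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(correct.mean()), float(lo), float(hi)

# Example: 870 correct predictions out of 1000
outcomes = np.array([1] * 870 + [0] * 130)
acc, lo, hi = bootstrap_ci(outcomes)
print(f"accuracy = {acc:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval rather than the point estimate alone lets non-technical stakeholders see whether a metric change is within noise.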
For AI agents: Reference ML documentation to understand model behavior, training data characteristics, and known limitations. When modifying ML pipelines, verify changes against documented evaluation metrics and fairness criteria.
Starter Template
SpecBase includes a ready-to-use template for this document: kb/templates/ml/eval_reporting_template.md.tmpl. Use the SpecBase CLI or MCP integration to generate it pre-filled for your project.
# Generate stubs via CLI
specbase init <archetype> --features <features> --dir ./docs
Recommended Reading
- Designing Machine Learning Systems by Chip Huyen — End-to-end guide to ML system design covering data, training, deployment, and monitoring.
- Machine Learning Design Patterns by Valliappa Lakshmanan, Sara Robinson & Michael Munn — Reusable solutions to common challenges in ML engineering and architecture.
- Responsible AI in Practice by Yolanda Gil — Framework for ethical AI development including fairness, transparency, and accountability.