Update Rollback Plan

Recommended operations update_rollback_plan
Agent Prompt Snippet
Verify that an update rollback plan exists and that it describes failure detection, version restoration, and preservation of data integrity during rollback.

Purpose

An update rollback plan describes how a failed update is detected, how the previous version is restored, and how user data integrity is preserved during rollback.
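
The three parts of the plan can be illustrated with a small sketch. This is only one possible shape, not a prescribed implementation: the names `UpdateState`, `update_failed`, and `roll_back`, the 5% error-rate threshold, and the checksum comparison are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class UpdateState:
    """Hypothetical record of what is installed and what data it holds."""
    current_version: str
    previous_version: str
    user_data: bytes


def checksum(data: bytes) -> str:
    """Fingerprint user data so it can be compared before and after rollback."""
    return hashlib.sha256(data).hexdigest()


def update_failed(error_rate: float, threshold: float = 0.05) -> bool:
    """Failure detection: flag the update once its error rate exceeds the
    threshold. The 5% default is an assumed value, not a recommendation."""
    return error_rate > threshold


def roll_back(state: UpdateState, pre_update_checksum: str) -> UpdateState:
    """Version restoration plus a data integrity check.

    Restores the previous version, then confirms that the user data still
    matches the snapshot taken before the update was applied.
    """
    restored = UpdateState(
        current_version=state.previous_version,
        previous_version=state.previous_version,
        user_data=state.user_data,
    )
    if checksum(restored.user_data) != pre_update_checksum:
        # The update changed user data (for example via a migration), so
        # restoring the binary alone is not enough.
        raise RuntimeError("user data differs from the pre-update snapshot")
    return restored
```

In this sketch a checksum mismatch stops the rollback instead of silently continuing, which forces the plan to say how data changed by the update is restored or migrated back.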

This is a Recommended document: most projects benefit significantly from having one. Although it is not strictly essential in every situation, its absence often leaves gaps in team understanding or in the quality of the rollback process.

What Makes It Good vs Bad

A strong version of this document:

  • Includes runbooks with step-by-step procedures for common incidents
  • Defines SLIs, SLOs, and error budgets with measurable thresholds
  • Documents on-call responsibilities and escalation paths
  • Covers both steady-state operations and failure recovery
  • Is tested regularly through drills or game days
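
As an illustration of "measurable thresholds", the arithmetic behind an error budget can be written down directly. The sketch below assumes an availability SLO measured in minutes over a rolling window; the function names and the 30-day window are assumptions for the example, not part of any template.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after the downtime already recorded in the window.

    A negative result means the SLO has been missed for this window.
    """
    return error_budget_minutes(slo_target, window_days) - downtime_minutes
```

With a 99.9% target over 30 days, the budget is roughly 43.2 minutes. A plan could, for instance, state how much of that budget a single update is allowed to consume before rollback is triggered; that share is a team decision and is not defined here.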

Warning signs of a weak version:

  • No runbooks — relies on tribal knowledge for incident response
  • Monitoring defined but no clear thresholds or alerting rules
  • Missing capacity planning or scaling procedures
  • Disaster recovery plan that has never been tested
  • No distinction between informational alerts and actionable pages

Common Mistakes

  • Writing runbooks that assume expert knowledge of the system
  • Defining SLOs without buy-in from product and engineering teams
  • Not testing disaster recovery procedures until an actual disaster occurs
  • Alerting on everything rather than focusing on user-impacting symptoms

How to Use This Document

Write runbooks as if the person reading them is stressed, sleep-deprived, and unfamiliar with the system — because during an incident, they might be. Use numbered steps, include expected output for each command, and clearly mark decision points. Test runbooks regularly through game days or tabletop exercises.
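
One way to keep runbooks consistent with this advice is to store each step with its expected output and any decision point, then render the numbered text from that structure. The sketch below is an assumption about how that might look; the `Step` fields and the sample rollback steps are hypothetical and should be replaced with the project's real procedure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    action: str                     # what the responder does
    expected: str                   # what they should see if it worked
    decision: Optional[str] = None  # what to do if the result differs


def render(steps: List[Step]) -> str:
    """Render steps as numbered text with expected output and decision points."""
    lines = []
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. {step.action}")
        lines.append(f"   Expected: {step.expected}")
        if step.decision:
            lines.append(f"   DECISION: {step.decision}")
    return "\n".join(lines)


# Hypothetical example content; not taken from any real runbook.
ROLLBACK_RUNBOOK = [
    Step("Confirm which version is currently installed.",
         "The version matches the failed release."),
    Step("Restore the previous version from the retained installer.",
         "The application starts and reports the previous version.",
         decision="If the application does not start, escalate to on-call."),
    Step("Compare user data against the pre-update snapshot.",
         "Checksums match.",
         decision="If they differ, stop and follow the data restore procedure."),
]

print(render(ROLLBACK_RUNBOOK))
```

Keeping `expected` mandatory for every step makes it hard to write a step that omits the expected output.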

For AI agents: Reference operations documents when assisting with incident response, capacity planning, or deployment procedures. Verify that proposed infrastructure changes align with documented SLOs and operational constraints.

Starter Template

SpecBase includes a ready-to-use template for this document: kb/templates/desktop_app/update_rollback_plan.md.tmpl. Use the SpecBase CLI or MCP integration to generate it pre-filled for your project.

# Generate stubs via CLI
specbase init <archetype> --features <features> --dir ./docs

Further Reading

  • Site Reliability Engineering by Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Richard Murphy (Google) — The foundational text on SRE practices including SLOs, error budgets, and incident management.
  • Release It! Design and Deploy Production-Ready Software by Michael T. Nygard — Practical patterns for building systems that survive real-world production conditions.
  • The Phoenix Project by Gene Kim, Kevin Behr & George Spafford — Narrative introduction to DevOps principles and the flow of work through IT organizations.