
Live Ops Runbook

Required operations live_ops_runbook
Agent Prompt Snippet
Ensure the project has a live ops runbook covering deployment procedures, rollback steps, hotfix workflows, and on-call escalation paths.

Purpose

A live ops runbook is the operational playbook for a game that never stops running. It codifies every recurring procedure that keeps a live service healthy: content drops, seasonal event activations, server maintenance windows, hotfix deployments, economy tuning passes, and the emergency kill switches that shut down broken features before they damage the player experience. Where an incident response plan describes what to do when something breaks, the live ops runbook describes what to do every day, every week, and every season to keep things from breaking in the first place.

Live games are unique because the service is the product. A SaaS outage loses revenue for the duration of the downtime. A botched live game update—a duplicated currency exploit, a seasonal event that triggers a crash loop, a matchmaking change that doubles queue times—can lose players permanently. The live ops runbook exists to make these high-stakes operations routine and repeatable. It transforms “the person who knows how to push a content drop” from a single human into a documented procedure that any trained operator can execute at 3 AM.

This is a Required document for any live service game. A project that ships without one is a project that operates on luck until the luck runs out.

Who needs this document

| Persona | Why they need it | How they use it |
| --- | --- | --- |
| Live Ops Engineer (Sam) | Executes content drops, seasonal events, and maintenance windows on a recurring schedule; needs step-by-step procedures that eliminate guesswork | Follows the runbook for every scheduled operation, checks off verification steps, and logs deviations for post-operation review |
| AI Agent (Claude Code) | Needs to understand operational constraints before modifying game server code, economy parameters, or feature flag configurations | References the runbook to verify that proposed changes respect maintenance windows, rollback procedures, and kill-switch thresholds before generating code |
| Game Designer / Economy Lead (Priya) | Tunes economy parameters, activates limited-time offers, and manages A/B tests on live content; needs to know the safe operational envelope | Consults the runbook to understand how parameter changes propagate, what validation checks run before activation, and how to revert a bad tuning pass |
| DevOps / SRE | Manages the infrastructure that supports live operations: deployment pipelines, feature flag systems, monitoring dashboards, and alerting rules | Uses the runbook to configure maintenance windows, validate deployment health checks, and ensure that automated rollback triggers align with documented thresholds |

What separates a good version from a bad one

Criterion 1: Content drop procedures are step-by-step with verification gates

Strong: “Content drop procedure: (1) Enable feature flag season_3_content in staging at 09:00 UTC. (2) Run automated smoke suite—verify 0 critical failures. (3) Promote flag to 5% canary population. (4) Monitor error rate dashboard for 30 minutes—threshold: <0.1% increase in crash rate. (5) Promote to 100%. (6) Publish in-game news feed entry via CMS. (7) Verify client download metrics show asset bundle adoption >80% within 2 hours. Rollback: disable flag, push CMS retraction, file incident report.”

Weak: “Push the new content to production and monitor for issues.” (No verification gates, no canary phase, no rollback procedure, no definition of “issues.” The operator has no way to distinguish a successful drop from a slow-motion failure.)
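The verification gates in the strong example can be written down as explicit checks so that "monitor for issues" becomes a decision with only three outcomes. The following is a minimal sketch, not a reference implementation: the function and constant names are hypothetical, and it assumes that "<0.1% increase in crash rate" means an absolute increase of less than 0.1 percentage points over the pre-drop baseline.

```python
# Hypothetical gate thresholds taken from the strong example above.
# Assumption: "<0.1% increase in crash rate" is read as an absolute
# increase of less than 0.1 percentage points over the baseline.
MAX_CRASH_RATE_INCREASE = 0.001
MIN_BUNDLE_ADOPTION = 0.80


def content_drop_decision(
    smoke_critical_failures: int,
    baseline_crash_rate: float,
    canary_crash_rate: float,
) -> str:
    """Return the operator's next action after the canary observation window."""
    # Gate 1 (step 2): the smoke suite must report zero critical failures.
    if smoke_critical_failures > 0:
        return "abort"
    # Gate 2 (step 4): crash rate in the 5% canary must stay under the threshold.
    if canary_crash_rate - baseline_crash_rate >= MAX_CRASH_RATE_INCREASE:
        return "rollback"
    return "promote"


def adoption_verified(clients_with_bundle: int, active_clients: int) -> bool:
    """Step 7: asset bundle adoption must exceed 80% within 2 hours."""
    if active_clients == 0:
        return False
    return clients_with_bundle / active_clients > MIN_BUNDLE_ADOPTION
```

Encoding the gates this way forces the runbook author to state each threshold as a number, which is exactly what the weak example fails to do.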

Criterion 2: Emergency kill switches are documented with activation criteria

Strong: “Kill switch disable_iap_store: Activates when the payment error rate exceeds 5% over a 10-minute window OR when a currency duplication exploit is confirmed. Activation: set remote config key iap_enabled=false via Firebase Remote Config console. Effect: store UI hides all purchase buttons; pending transactions complete but no new purchases are accepted. Restoration requires explicit re-enable plus a 15-minute monitoring hold.”

Weak: “We can turn off the store if something goes wrong.” (No activation criteria, no specific mechanism, no description of player-facing behavior during the kill state. When an exploit hits at 2 AM, the on-call engineer will not know whether “turning off the store” means a config change, a server restart, or a client patch.)
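The activation criteria in the strong example reduce to a small predicate that an alerting rule or an on-call engineer can evaluate. This sketch assumes the caller has already aggregated payment counts over the 10-minute window; the function name and parameters are hypothetical, and the actual flip of `iap_enabled` would still happen through whatever remote config tooling the project uses.

```python
PAYMENT_ERROR_RATE_THRESHOLD = 0.05  # 5% over a 10-minute window, per the example


def should_disable_iap_store(
    failed_payments: int,
    total_payments: int,
    exploit_confirmed: bool,
) -> bool:
    """Decide whether the disable_iap_store kill switch should activate.

    Counts are assumed to cover the most recent 10-minute window.
    """
    # A confirmed currency duplication exploit activates the switch on its own.
    if exploit_confirmed:
        return True
    # With no payment traffic there is no error rate to evaluate.
    if total_payments == 0:
        return False
    return failed_payments / total_payments > PAYMENT_ERROR_RATE_THRESHOLD
```

For example, 6 failures out of 100 payments in the window would activate the switch, while 4 out of 100 would not unless an exploit has been confirmed.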

Criterion 3: Seasonal event lifecycle is fully defined with rollback at each phase

Strong: “Halloween event lifecycle: Pre-load phase (T-48h): push asset bundles via background download, flag gated. Activation (T-0): enable halloween_2025 flag, activate event matchmaking playlist, start limited-time shop rotation. Mid-event checkpoint (T+72h): review engagement metrics against baseline, adjust drop rates if completion rate <30%. Deactivation (T+14d): disable flag, revert matchmaking playlist, convert unspent event currency to standard currency at 10:1 ratio. Post-event cleanup (T+21d): archive event telemetry, remove expired asset bundles from CDN.”

Weak: “The Halloween event runs for two weeks. Turn it on and off with the event flag.” (No pre-load strategy, no mid-event adjustment criteria, no currency conversion plan, no cleanup phase. The event leaves orphaned assets and confused players holding worthless event currency.)
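Because every phase in the strong example is defined as an offset from T-0, the whole calendar can be derived from a single activation timestamp. The sketch below illustrates that idea; the phase keys and function names are hypothetical, and it assumes the 10:1 ratio means ten units of event currency convert to one unit of standard currency, rounded down.

```python
from datetime import datetime, timedelta, timezone

# Phase offsets relative to activation (T-0), as listed in the strong example.
PHASE_OFFSETS = {
    "pre_load": timedelta(hours=-48),
    "activation": timedelta(0),
    "mid_event_checkpoint": timedelta(hours=72),
    "deactivation": timedelta(days=14),
    "post_event_cleanup": timedelta(days=21),
}


def build_schedule(t0: datetime) -> dict[str, datetime]:
    """Map each lifecycle phase to its absolute timestamp."""
    return {phase: t0 + offset for phase, offset in PHASE_OFFSETS.items()}


def convert_event_currency(event_currency: int, ratio: int = 10) -> int:
    """Convert leftover event currency at deactivation.

    Assumption: 10 event currency becomes 1 standard currency, rounded down.
    """
    return event_currency // ratio


if __name__ == "__main__":
    # Hypothetical activation time; the source does not give a date.
    schedule = build_schedule(datetime(2025, 10, 24, 9, 0, tzinfo=timezone.utc))
    for phase, when in schedule.items():
        print(f"{phase:22s} {when.isoformat()}")
```

Generating the dates from one anchor avoids the common failure where the activation slips by a day but the deactivation and cleanup dates are never updated to match.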

Criterion 4: Economy tuning changes follow a safe deployment pattern

Strong: “Economy parameter changes (drop rates, prices, XP curves) deploy via A/B test framework. All changes start as a 10% holdout test with a 48-hour observation window. Primary metric: session revenue per DAU. Guardrail metric: D1 retention must not drop >2pp from control. If guardrails trip, the test auto-reverts and pages the economy team. Manual override requires economy lead sign-off in the #live-ops-approvals channel.”

Weak: “Economy changes go through the config system.” (No testing framework, no observation window, no guardrail metrics. A bad drop rate change ships to 100% of players instantly, and the team discovers the damage days later in the weekly metrics review.)
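The guardrail in the strong example is a comparison between control and variant retention, expressed in percentage points. A minimal sketch of that check follows; the function names and return strings are hypothetical, and retention values are assumed to be fractions between 0 and 1.

```python
MAX_D1_RETENTION_DROP_PP = 2.0  # guardrail from the example: no more than 2pp below control


def guardrail_tripped(control_d1: float, variant_d1: float) -> bool:
    """True when variant D1 retention is more than 2pp below control."""
    drop_pp = (control_d1 - variant_d1) * 100  # fraction -> percentage points
    return drop_pp > MAX_D1_RETENTION_DROP_PP


def economy_test_action(control_d1: float, variant_d1: float) -> str:
    """Return the action the test framework should take for this reading."""
    if guardrail_tripped(control_d1, variant_d1):
        return "auto_revert_and_page_economy_team"
    return "continue_observation"
```

With control retention at 40%, a variant at 37% trips the guardrail (3pp drop), while a variant at 39% does not.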

Common mistakes

Treating the runbook as a launch artifact that never updates. The live ops runbook must evolve with every new event type, every new feature flag, and every post-mortem that reveals a procedural gap. Teams that write the runbook at launch and never update it end up with a document that describes a game that no longer exists. Assign a quarterly review cadence and update the runbook as part of every seasonal event retrospective.

No distinction between scheduled operations and emergency procedures. A content drop and an exploit mitigation are both “live ops,” but they have completely different urgency profiles, approval chains, and rollback tolerances. A runbook that mixes them into a single flat list forces the on-call engineer to triage the document before triaging the incident. Separate scheduled operations (content drops, maintenance windows, economy tuning) from emergency procedures (kill switches, exploit response, rollback) with clear section boundaries.

Undocumented feature flag cleanup. Every seasonal event adds feature flags. Teams that document activation but not removal accumulate hundreds of stale flags within a year. The runbook should specify a flag lifecycle: creation, activation, deactivation, and deletion with a maximum TTL. Without this, the flag system becomes a minefield where no one knows which flags are safe to remove.
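A maximum TTL is easy to enforce once each flag records its creation date. The sketch below shows one way to list flags that have outlived the limit; the `Flag` record is hypothetical, and the 90-day TTL is an illustrative value only, since the right number depends on the project's event cadence.

```python
from dataclasses import dataclass
from datetime import date

MAX_FLAG_TTL_DAYS = 90  # illustrative value; not prescribed by this document


@dataclass
class Flag:
    name: str
    created: date


def flags_past_ttl(flags: list[Flag], today: date) -> list[str]:
    """Return the names of flags older than the maximum TTL."""
    return [f.name for f in flags if (today - f.created).days > MAX_FLAG_TTL_DAYS]
```

Running a check like this in a scheduled job and posting the result to the team channel turns flag deletion into a routine task rather than an archaeology project.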

Assuming the operator has full context. The person executing a 3 AM hotfix is not the person who wrote the runbook. Every procedure should include expected output for each step, explicit decision points (“If error count > N, proceed to rollback section”), and links to the relevant dashboards. A runbook step that says “verify the deployment is healthy” is not a step—it is a wish.
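One way to keep steps from degrading into wishes is to require every step to carry its expected output, its threshold, and its dashboard link as structured fields. The following is a sketch of such a record under that assumption; the field names are hypothetical and the dashboard value is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class RunbookStep:
    action: str            # what the operator does
    expected_output: str   # what a healthy result looks like
    max_error_count: int   # the "N" in the decision point
    dashboard: str         # link the operator opens to read the error count

    def next_action(self, observed_error_count: int) -> str:
        """Explicit decision point: roll back when errors exceed N."""
        if observed_error_count > self.max_error_count:
            return "proceed to rollback section"
        return "continue to next step"


step = RunbookStep(
    action="Deploy hotfix build to production",
    expected_output="All instances report the new build version",
    max_error_count=10,
    dashboard="<dashboard link goes here>",
)
```

A step that cannot fill in all four fields is a signal that the procedure is not yet specific enough for a 3 AM operator.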

How to use this document

When to create it

Create the live ops runbook during late production, after the game’s live service systems—feature flags, remote config, content delivery pipeline, event framework—are implemented but before the first public content drop. The document should be operational and tested (via dry runs) before launch. If you are building a seasonal event system, the runbook for operating that system should be written and rehearsed before the first season ships.

Who owns it

The live ops lead or live service producer owns this document. They are responsible for updating it after every seasonal event retrospective and every post-mortem that identifies a procedural gap. The DevOps team is a required reviewer for any change that affects deployment procedures or kill-switch mechanisms. The economy lead is a required reviewer for any change that affects tuning parameters or A/B test guardrails.

How AI agents should reference it

get_standard_docs(type="video_game", features=["live_service"])
→ live_ops_runbook in documents[]
→ agent reads document to understand content drop procedures, kill-switch criteria, and rollback sequences
→ agent cross-references with economy_balance_doc before modifying drop rates or pricing
→ agent verifies that proposed feature flag changes align with documented lifecycle and cleanup policy

The prompt_snippet (“Ensure the project has a live ops runbook covering deployment procedures, rollback steps, hotfix workflows, and on-call escalation paths”) tells the agent to verify that all four operational areas are addressed. If the agent is modifying live service code (event activation logic, economy parameters, or feature flag integrations), it should confirm that the change is compatible with the documented operational procedures and does not bypass kill-switch or rollback mechanisms.

How it connects to other documents

The live ops runbook sits at the center of a live service game’s operational documentation. The Deployment Runbook provides the low-level mechanics of pushing builds to production; the live ops runbook layers game-specific procedures—content drops, event activations, economy tuning—on top. The Scaling Runbook defines how to handle capacity during peak events; the live ops runbook specifies when those peaks occur and what pre-scaling actions to take. The Incident Response Plan covers unplanned outages; the live ops runbook covers the planned operations that, when done poorly, create the incidents. The Monitoring & Alerting configuration should include dashboards and thresholds referenced by name in the runbook’s verification steps. Changes to the live ops runbook should trigger a review of all four downstream documents.

Recommended reading

  • Site Reliability Engineering by Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Richard Murphy (Google): The foundational text on operational runbooks, error budgets, and structured incident management. The chapters on release engineering and managing incidents directly apply to live game operations.
  • Live Ops chapter in “Game Backend Development” by Sean Duffy: Practical coverage of content pipelines, feature flags, seasonal event systems, and economy tuning in production games.
  • Release It! Design and Deploy Production-Ready Software by Michael T. Nygard: Essential patterns for circuit breakers, bulkheads, and graceful degradation that underpin the kill-switch and rollback strategies documented in a live ops runbook.
