How the SpecBase Composition Algorithm Works
When you tell SpecBase “I’m building a backend service with GDPR compliance,” it doesn’t do a database lookup or call an API. It runs a deterministic, local algorithm over a YAML knowledge base to produce a precise list of required specification documents. Here’s exactly how that works.
The Two Building Blocks: Archetypes and Features
SpecBase’s knowledge base is organized around two concepts.
An archetype is a software project type — a canonical shape that captures what a class of projects fundamentally is. SpecBase v1.0 includes 11:
web_app · backend_service · cli_tool · desktop_app · mobile_app · embedded · iot · data_pipeline · ml_ai · video_game · cloud_integration
Each archetype has base documents — the spec docs that every project of that type needs, no matter what. A web_app, for example, always needs a Product Requirements Document, UI/UX Spec, API Contract, Security & Privacy review, and an operational Runbook. These are non-negotiable; you’d be taking on real risk by omitting them.
A feature is an optional modifier. Features represent meaningful capability additions that change what documents a project needs. The web_app archetype supports six features: auth, payments, realtime, i18n, pwa, and analytics. Selecting auth adds three documents: Authentication Design, Session Management Spec, and an RBAC spec. Selecting payments adds a Payment Integration Spec, PCI Compliance Checklist, and a Refund Policy.
Features can also declare implies relationships (selecting feature A automatically activates feature B) and conflicts_with relationships (selecting A and B together is invalid). This lets the KB encode real-world constraints — for example, a game feature that requires live services implies telemetry infrastructure.
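A minimal sketch of how implies and conflicts_with relationships might be evaluated, assuming each feature is a plain object with optional implies and conflicts_with arrays. The feature IDs (live_services, telemetry, offline_only) and the function name resolveFeatures are hypothetical and used only for illustration; they are not taken from the real KB or codebase.

```javascript
// Hypothetical feature table; IDs and relationships are illustrative only.
const features = {
  live_services: { implies: ["telemetry"] },
  telemetry: {},
  offline_only: { conflicts_with: ["live_services"] },
};

// Expand the selection with implied features until nothing new appears,
// then reject the set if any two resolved features conflict.
function resolveFeatures(table, selected) {
  const resolved = [];
  const queue = [...selected];
  while (queue.length > 0) {
    const id = queue.shift();
    if (!Object.hasOwn(table, id)) throw new Error(`unknown feature: ${id}`);
    if (resolved.includes(id)) continue; // already handled, also guards against cycles
    resolved.push(id);
    queue.push(...(table[id].implies ?? []));
  }
  for (const id of resolved) {
    for (const other of table[id].conflicts_with ?? []) {
      if (resolved.includes(other)) {
        throw new Error(`feature conflict: ${id} vs ${other}`);
      }
    }
  }
  return resolved;
}

console.log(resolveFeatures(features, ["live_services"]));
```

Selecting offline_only together with live_services would throw in this sketch, because the conflict check runs on the expanded set rather than on the original selection.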
The Composition Algorithm
The algorithm is intentionally simple. Given an archetype ID and a list of selected feature IDs, it:
- Loads the archetype’s base document list.
- Validates each selected feature exists for that archetype, returning an error if not.
- Resolves implied features: for each selected feature, add any features in its implies list. Repeat until stable (handles chains like A → B → C).
- Checks conflicts: if any two resolved features conflict, return an error.
- Collects feature documents: for each resolved feature, append its adds_documents list.
- Deduplicates: documents with duplicate IDs are merged, keeping the first occurrence (base documents win over feature documents).
- Sorts the output by priority (P0 → P1 → P2), then by category, then alphabetically.
The result is a flat, ordered list of document objects with title, priority, category, rationale, and a prompt snippet for AI agents.
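The collect, deduplicate, and sort steps can be sketched as follows. This is an illustration under stated assumptions, not the code in compose.js: the function name collectAndSort is hypothetical, the feature list is assumed to be already validated and expanded, and "alphabetically" is assumed to mean by document ID. The runbook entries and their priorities are invented to exercise deduplication.

```javascript
// Sample data: prd and auth_design follow the fields described in this article;
// the runbook entries and their priorities are hypothetical.
const archetype = {
  base_documents: [
    { id: "prd", title: "Product Requirements Document", priority: "P0", category: "product" },
    { id: "runbook", title: "Runbook", priority: "P1", category: "operations" },
  ],
  features: [
    {
      id: "auth",
      adds_documents: [
        { id: "auth_design", title: "Authentication Design", priority: "P0", category: "security" },
        { id: "runbook", title: "Runbook (duplicate)", priority: "P2", category: "operations" },
      ],
    },
  ],
};

// Plain code-unit comparison: avoids locale-dependent ordering.
const cmp = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

function collectAndSort(arch, resolvedFeatureIds) {
  // Base documents go first so they win when IDs collide.
  const all = [...arch.base_documents];
  for (const fid of resolvedFeatureIds) {
    const feature = arch.features.find((f) => f.id === fid);
    all.push(...feature.adds_documents);
  }
  // Keep only the first occurrence of each document ID.
  const seen = new Set();
  const unique = [];
  for (const doc of all) {
    if (seen.has(doc.id)) continue;
    seen.add(doc.id);
    unique.push(doc);
  }
  // Sort by priority, then category, then ID (assumed tie-breaker).
  return unique.sort(
    (a, b) => cmp(a.priority, b.priority) || cmp(a.category, b.category) || cmp(a.id, b.id)
  );
}

console.log(collectAndSort(archetype, ["auth"]).map((d) => d.id));
```

Comparing strings directly works for priorities here because "P0", "P1", and "P2" already sort in the intended order.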
A Worked Example: backend_service + gdpr + async_jobs
Let’s trace through a real query.
Step 1: Load base documents.
The backend_service archetype has these base documents (among others): Architecture Overview, API Contract, Runbook, Incident Response Plan. These go into the resolved set immediately.
Step 2: Resolve gdpr feature.
The gdpr feature for backend_service adds three documents:
- gdpr_compliance_checklist: the master GDPR checklist
- data_subject_request_flow: how users exercise their rights (erasure, portability, access)
- data_processing_agreement: template for DPA contracts with processors
No implied features, no conflicts. Three documents added.
Step 3: Resolve async_jobs feature.
The async_jobs feature adds:
- job_queue_spec: queue design, worker configuration, concurrency limits
- retry_backoff_policy: exponential backoff, max retries, jitter strategy
- dead_letter_handling: what happens to permanently failed jobs
No implied features, no conflicts with gdpr.
Step 4: Deduplicate and sort.
The combined set is: base documents + 3 GDPR docs + 3 async docs. No duplicates in this case. Sort by priority: any P0 docs come first (Architecture Overview, API Contract), then P1 (the feature-added docs), then P2.
Final output: ~12 documents, each with its rationale explaining why it matters for this combination.
Why This Algorithm Runs Identically in Go and JavaScript
The same algorithm is implemented twice: in Go (internal/kb/resolve.go) for the CLI and MCP server, and in JavaScript (site/assets/js/compose.js) for the web runner. Both must produce byte-identical JSON output for every input combination.
This isn’t just a nice property — it’s a hard constraint enforced by a parity test (test/e2e/parity_test.sh) that runs on every CI build. The test checks 31 type/feature combinations and fails the build if any output diverges between the two implementations.
Why maintain two implementations? The web runner is fully client-side: it runs in your browser with no server round-trip. Sending every runner interaction to a backend would be slower and more expensive, and it would require infrastructure. The JavaScript implementation gives us zero-latency, offline-capable results in the browser. The parity test ensures the implementations stay in sync as the KB evolves.
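Byte-identical output means that key order within each JSON object matters as well as document order. The sketch below shows one way to pin that down and to locate a divergence. The real parity test is a shell script whose contents are not shown here, so the helper names (canonicalJSON, firstDivergence) and the chosen key order are assumptions made for illustration.

```javascript
// Assumed key order; the field names come from the KB format, the order is hypothetical.
const KEY_ORDER = ["id", "title", "priority", "category", "rationale", "prompt_snippet"];

// Serialize with an explicit key order so the output does not depend on
// the order in which properties were added to each object.
function canonicalJSON(docs) {
  const ordered = docs.map((doc) => {
    const out = {};
    for (const key of KEY_ORDER) {
      if (key in doc) out[key] = doc[key];
    }
    return out;
  });
  return JSON.stringify(ordered);
}

// Return the index of the first differing character, or -1 if the strings match.
function firstDivergence(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return i;
  }
  return a.length === b.length ? -1 : n;
}

const fromOneImpl = canonicalJSON([{ id: "prd", title: "Product Requirements Document" }]);
const fromOtherImpl = canonicalJSON([{ title: "Product Requirements Document", id: "prd" }]);
console.log(firstDivergence(fromOneImpl, fromOtherImpl));
```

Here the two inputs differ only in property insertion order, so after canonical serialization they compare equal.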
The YAML Knowledge Base Format
The KB is stored as one YAML file per archetype in kb/archetypes/. Here’s an abbreviated view of the web_app.yaml structure:
id: web_app
name: Web Application
base_documents:
  - id: prd
    title: Product Requirements Document
    priority: P0
    category: product
    rationale: "Defines scope, personas, and success metrics..."
    prompt_snippet: "Ensure the project has a PRD..."
features:
  - id: auth
    name: Authentication & Authorization
    description: "User identity, login flows, and access control"
    adds_documents:
      - id: auth_design
        title: Authentication Design
        priority: P0
        category: security
        rationale: "Specifies identity providers, token lifecycle..."
Each document entry carries enough metadata to be self-descriptive: a human-readable rationale for why it exists, and a prompt_snippet that AI agents can use directly as a checklist item when reviewing a project.
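Because the format is meant to be hand-edited, a small structural check can catch mistakes early. The sketch below assumes the YAML file has already been parsed into a plain object (the parser is not shown) and that id, title, priority, category, and rationale are required on every document; that required-field list and the function name validateArchetype are assumptions, not the project's actual schema.

```javascript
// Assumed required fields and allowed priorities; not the project's actual schema.
const REQUIRED = ["id", "title", "priority", "category", "rationale"];
const PRIORITIES = new Set(["P0", "P1", "P2"]);

// Return a list of human-readable problems; an empty list means the structure looks valid.
function validateArchetype(arch) {
  const problems = [];
  const check = (doc, where) => {
    for (const field of REQUIRED) {
      if (typeof doc[field] !== "string" || doc[field] === "") {
        problems.push(`${where}: missing ${field}`);
      }
    }
    if (doc.priority !== undefined && !PRIORITIES.has(doc.priority)) {
      problems.push(`${where}: unknown priority ${doc.priority}`);
    }
  };
  (arch.base_documents ?? []).forEach((d, i) => check(d, `base_documents[${i}]`));
  for (const feature of arch.features ?? []) {
    (feature.adds_documents ?? []).forEach((d, i) =>
      check(d, `features.${feature.id}.adds_documents[${i}]`)
    );
  }
  return problems;
}

// A deliberately broken entry: invalid priority and no rationale.
const broken = {
  id: "web_app",
  base_documents: [{ id: "prd", title: "Product Requirements Document", priority: "P9", category: "product" }],
  features: [],
};
console.log(validateArchetype(broken));
```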
What This Means for AI Agents
When an AI agent calls get_standard_docs(type="backend_service", features=["gdpr", "async_jobs"]) via MCP, it gets back the exact same resolved document list described above — structured JSON that the agent can use to:
- Bootstrap a new project’s documentation structure
- Audit whether an existing project is missing required specs
- Generate a system prompt that gives the agent project awareness
The algorithm’s determinism is what makes this useful. Given the same archetype and features, the output is always identical — whether you call it from the CLI, the web runner, or an MCP client. That reproducibility is the foundation of trust: when your AI agent says “this project is missing a GDPR compliance checklist,” you can verify exactly why that document is required by reading the KB directly.
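As an illustration of the audit use case, the sketch below compares a resolved document list against the documents a project already has. The response shape (an array of objects with id, priority, and prompt_snippet), the api_contract ID, the snippet texts, and the helper names findMissing and toChecklist are all assumptions; the actual MCP response may be structured differently.

```javascript
// Assumed response shape; IDs partly from the worked example, snippet texts are hypothetical.
const resolved = [
  { id: "api_contract", priority: "P0", prompt_snippet: "Ensure the API contract is documented..." },
  { id: "gdpr_compliance_checklist", priority: "P1", prompt_snippet: "Ensure a GDPR checklist exists..." },
  { id: "job_queue_spec", priority: "P1", prompt_snippet: "Ensure the job queue design is specified..." },
];

// Documents in the resolved list that the project does not yet have.
function findMissing(resolvedDocs, existingIds) {
  const have = new Set(existingIds);
  return resolvedDocs.filter((doc) => !have.has(doc.id));
}

// Render missing documents as a plain-text checklist, one line per document.
function toChecklist(docs) {
  return docs.map((doc) => `[ ] ${doc.priority} ${doc.id}: ${doc.prompt_snippet}`).join("\n");
}

const missing = findMissing(resolved, ["api_contract", "job_queue_spec"]);
console.log(toChecklist(missing));
```

Because the resolved list arrives already sorted, the filtered result keeps the same priority order without any extra work.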
Contributing to the Knowledge Base
The KB is CC-BY-4.0 licensed and community-driven. If you know of a document type that should be part of a specific archetype, or a feature variant that belongs in the graph, contributions are welcome. The YAML format is intentionally human-readable and designed to be edited without running code.
See CONTRIBUTING.md to get started, and browse the Document Library to see all 214 documents currently in the KB.