How the SpecBase Composition Algorithm Works

When you tell SpecBase “I’m building a backend service with GDPR compliance,” it doesn’t do a database lookup or call an API. It runs a deterministic, local algorithm over a YAML knowledge base to produce a precise list of required specification documents. Here’s exactly how that works.

The Two Building Blocks: Software Types and Features

SpecBase’s knowledge base is organized around two concepts.

A software type is a canonical project shape that captures what a class of projects fundamentally is. SpecBase v1.0 includes 11:

web_app · backend_service · cli_tool · desktop_app · mobile_app · embedded · iot · data_pipeline · ml_ai · video_game · cloud_integration

Each software type has base documents — the spec docs that every project of that type needs, no matter what. A web_app, for example, always needs a Product Requirements Document, UI/UX Spec, API Contract, Security & Privacy review, and an operational Runbook. These are non-negotiable; you’d be taking on real risk by omitting them.

A feature is an optional modifier. Features represent meaningful capability additions that change what documents a project needs. The web_app software type supports six features: auth, payments, realtime, i18n, pwa, and analytics. Selecting auth adds three documents: Authentication Design, Session Management Spec, and an RBAC spec. Selecting payments adds a Payment Integration Spec, PCI Compliance Checklist, and a Refund Policy.

Features can also declare implies relationships (selecting feature A automatically activates feature B) and conflicts_with relationships (selecting A and B together is invalid). This lets the KB encode real-world constraints — for example, a game feature that requires live services implies telemetry infrastructure.

The Composition Algorithm

The algorithm is intentionally simple. Given a software type ID and a list of selected feature IDs, it:

1. Loads the software type’s base document list.
2. Validates that each selected feature exists for that software type, returning an error if not.
3. Resolves implied features: for each selected feature, it adds any features in its implies list, repeating until the set is stable (this handles chains like A → B → C).
4. Checks conflicts: if any two resolved features conflict, it returns an error.
5. Collects feature documents: for each resolved feature, it appends that feature’s adds_documents list.
6. Deduplicates: documents with duplicate IDs are merged, keeping the first occurrence (base documents win over feature documents).
7. Sorts the output by priority (P0 → P1 → P2), then by category, then alphabetically.
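The validation, implied-feature, and conflict steps can be sketched in Go roughly as follows. This is a minimal illustration, not the code in internal/kb/resolve.go: the Feature struct, the resolveFeatures name, and the choice to treat a conflict declared on either side as binding are all assumptions made for the example.

```go
package main

import "fmt"

// Feature is a hypothetical in-memory form of one KB feature entry.
type Feature struct {
	ID            string
	Implies       []string
	ConflictsWith []string
}

// resolveFeatures validates the selection, expands implied features until
// no new ones appear, and rejects any conflicting pair. Features are
// returned in discovery order, so the result depends only on the input.
func resolveFeatures(known map[string]Feature, selected []string) ([]string, error) {
	seen := map[string]bool{}
	var resolved []string

	// add records a feature once and fails on IDs the software type lacks.
	add := func(id string) error {
		if _, ok := known[id]; !ok {
			return fmt.Errorf("unknown feature %q", id)
		}
		if !seen[id] {
			seen[id] = true
			resolved = append(resolved, id)
		}
		return nil
	}

	for _, id := range selected {
		if err := add(id); err != nil {
			return nil, err
		}
	}

	// resolved grows while we iterate, so chains like A -> B -> C are
	// followed to the end. Each feature is added once, so cycles terminate.
	for i := 0; i < len(resolved); i++ {
		for _, implied := range known[resolved[i]].Implies {
			if err := add(implied); err != nil {
				return nil, err
			}
		}
	}

	// Assumption: a conflict declared by either feature rejects the pair.
	for _, id := range resolved {
		for _, other := range known[id].ConflictsWith {
			if seen[other] {
				return nil, fmt.Errorf("features %q and %q conflict", id, other)
			}
		}
	}
	return resolved, nil
}
```

With a feature map where a implies b and b implies c, selecting only a resolves to a, b, c. Adding a feature that lists c in its ConflictsWith turns the same call into an error.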

The result is a flat, ordered list of document objects with title, priority, category, rationale, and a prompt snippet for AI agents.
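As a rough sketch of the collect, deduplicate, and sort steps, the Go below merges base and feature documents into that flat list. The Document struct carries only the fields needed for ordering, composeDocs and priorityRank are hypothetical names, and breaking ties "alphabetically" by title is an assumption, since the text does not say which field is compared.

```go
package main

import "sort"

// Document holds the subset of fields this sketch needs for ordering.
type Document struct {
	ID, Title, Priority, Category string
}

// priorityRank maps P0/P1/P2 to a sort rank; unknown labels sort last.
func priorityRank(p string) int {
	switch p {
	case "P0":
		return 0
	case "P1":
		return 1
	case "P2":
		return 2
	}
	return 3
}

// composeDocs walks base documents first and feature documents after,
// keeps the first occurrence of each ID, then sorts by priority,
// category, and title (the title tie-break is an assumption).
func composeDocs(base []Document, featureDocs ...[]Document) []Document {
	seen := map[string]bool{}
	out := []Document{}
	groups := append([][]Document{base}, featureDocs...)
	for _, group := range groups {
		for _, d := range group {
			if seen[d.ID] {
				continue // an earlier copy (base first) wins
			}
			seen[d.ID] = true
			out = append(out, d)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if ra, rb := priorityRank(a.Priority), priorityRank(b.Priority); ra != rb {
			return ra < rb
		}
		if a.Category != b.Category {
			return a.Category < b.Category
		}
		return a.Title < b.Title
	})
	return out
}
```

Because base documents are visited before any feature group, a feature that redeclares a base document ID cannot override the base entry's priority or rationale.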

A Worked Example: backend_service + gdpr + async_jobs

Let’s trace through a real query.

Step 1: Load base documents.

The backend_service software type has these base documents (among others): Architecture Overview, API Contract, Runbook, Incident Response Plan. These go into the resolved set immediately.

Step 2: Resolve the gdpr feature.

The gdpr feature for backend_service adds three documents:

gdpr_compliance_checklist — the master GDPR checklist
data_subject_request_flow — how users exercise their rights (erasure, portability, access)
data_processing_agreement — template for DPA contracts with processors

No implied features, no conflicts. Three documents added.

Step 3: Resolve the async_jobs feature.

The async_jobs feature adds:

job_queue_spec — queue design, worker configuration, concurrency limits
retry_backoff_policy — exponential backoff, max retries, jitter strategy
dead_letter_handling — what happens to permanently failed jobs

No implied features, no conflicts with gdpr.

Step 4: Deduplicate and sort.

The combined set is the base documents plus 3 GDPR docs and 3 async docs, with no duplicate IDs in this case. Sorting puts any P0 docs first (Architecture Overview, API Contract), then P1 (the feature-added docs), then P2; within each priority, documents are ordered by category and then alphabetically.

Final output: ~12 documents, each with its rationale explaining why it matters for this combination.

Why This Algorithm Runs Identically in Go and JavaScript

The same algorithm is implemented twice: in Go (internal/kb/resolve.go) for the CLI and MCP server, and in JavaScript (site/assets/js/compose.js) for the web runner. Both must produce byte-identical JSON output for every input combination.

This isn’t just a nice property — it’s a hard constraint enforced by a parity test (test/e2e/parity_test.sh) that runs on every CI build. The test checks 31 type/feature combinations and fails the build if any output diverges between the two implementations.

Why maintain two implementations? The web runner is fully client-side: it runs in your browser with no server round-trip. Sending every runner interaction to a backend would be slower and more expensive, and it would require server infrastructure. The JavaScript implementation gives us results with no network latency that also work offline. The parity test ensures the implementations stay in sync as the KB evolves.
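To make "byte-identical" concrete, here is a small Go sketch of the comparison such a check boils down to. The actual parity test is the shell script named above; firstDivergence is a hypothetical helper written only to show that whitespace or key-order differences count as failures even when the JSON is semantically equal.

```go
package main

import "bytes"

// firstDivergence returns the offset of the first differing byte between
// two outputs, or -1 when they are byte-identical. If one output is a
// strict prefix of the other, it returns the length of the shorter one.
func firstDivergence(goOut, jsOut []byte) int {
	if bytes.Equal(goOut, jsOut) {
		return -1
	}
	n := len(goOut)
	if len(jsOut) < n {
		n = len(jsOut)
	}
	for i := 0; i < n; i++ {
		if goOut[i] != jsOut[i] {
			return i
		}
	}
	return n
}
```

Comparing raw bytes rather than parsed JSON is the stricter choice: it forces both implementations to agree on key order, spacing, and escaping, not just on content.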

The YAML Knowledge Base Format

The KB is stored as one YAML file per software type in kb/software types/. Here’s an abbreviated view of the web_app.yaml structure:

id: web_app
name: Web Application
base_documents:
  - id: prd
    title: Product Requirements Document
    priority: P0
    category: product
    rationale: "Defines scope, personas, and success metrics..."
    prompt_snippet: "Ensure the project has a PRD..."

features:
  - id: auth
    name: Authentication & Authorization
    description: "User identity, login flows, and access control"
    adds_documents:
      - id: auth_design
        title: Authentication Design
        priority: P0
        category: security
        rationale: "Specifies identity providers, token lifecycle..."

Each document entry carries enough metadata to be self-descriptive: a human-readable rationale for why it exists, and a prompt_snippet that AI agents can use directly as a checklist item when reviewing a project.
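One way to picture this structure in code is a set of Go types that mirror the YAML keys. The sketch below is an assumption about shape, not the project's actual types: the struct and method names are hypothetical, implies and conflicts_with are assumed to be lists of feature IDs, and the yaml struct tags follow the convention used by common Go YAML libraries (parsing itself is omitted so the example needs only the standard toolchain).

```go
package main

// Document mirrors one entry under base_documents or adds_documents.
type Document struct {
	ID            string `yaml:"id"`
	Title         string `yaml:"title"`
	Priority      string `yaml:"priority"`
	Category      string `yaml:"category"`
	Rationale     string `yaml:"rationale"`
	PromptSnippet string `yaml:"prompt_snippet"`
}

// Feature mirrors one entry under features. Implies and ConflictsWith
// are assumed to hold feature IDs.
type Feature struct {
	ID            string     `yaml:"id"`
	Name          string     `yaml:"name"`
	Description   string     `yaml:"description"`
	Implies       []string   `yaml:"implies"`
	ConflictsWith []string   `yaml:"conflicts_with"`
	AddsDocuments []Document `yaml:"adds_documents"`
}

// SoftwareType mirrors the top level of one KB file.
type SoftwareType struct {
	ID            string     `yaml:"id"`
	Name          string     `yaml:"name"`
	BaseDocuments []Document `yaml:"base_documents"`
	Features      []Feature  `yaml:"features"`
}

// FindFeature looks up a feature by ID within this software type.
func (s SoftwareType) FindFeature(id string) (Feature, bool) {
	for _, f := range s.Features {
		if f.ID == id {
			return f, true
		}
	}
	return Feature{}, false
}
```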

What This Means for AI Agents

When an AI agent calls get_standard_docs(type="backend_service", features=["gdpr", "async_jobs"]) via MCP, it gets back the exact same resolved document list described above — structured JSON that the agent can use to:

Bootstrap a new project’s documentation structure
Audit whether an existing project is missing required specs
Generate a system prompt that gives the agent project awareness

The algorithm’s determinism is what makes this useful. Given the same software type and features, the output is always identical — whether you call it from the CLI, the web runner, or an MCP client. That reproducibility is the foundation of trust: when your AI agent says “this project is missing a GDPR compliance checklist,” you can verify exactly why that document is required by reading the KB directly.
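The audit use case reduces to a set difference between the resolved document IDs and the documents a project already has. The Go sketch below assumes the agent has already extracted both ID lists; missingDocs is a hypothetical helper, not part of SpecBase or the MCP interface.

```go
package main

import "sort"

// missingDocs returns the required document IDs that do not appear in
// present, sorted so the report is stable across runs.
func missingDocs(required, present []string) []string {
	have := make(map[string]bool, len(present))
	for _, id := range present {
		have[id] = true
	}
	missing := []string{}
	for _, id := range required {
		if !have[id] {
			missing = append(missing, id)
		}
	}
	sort.Strings(missing)
	return missing
}
```

For the worked example, a project that only has an API contract would be reported as missing entries such as gdpr_compliance_checklist and job_queue_spec, each traceable to the feature that added it.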

Contributing to the Knowledge Base

The KB is CC-BY-4.0 licensed and community-driven. If you know of a document type that should be part of a specific software type, or a feature variant that belongs in the graph, contributions are welcome. The YAML format is intentionally human-readable and designed to be edited without running code.

See CONTRIBUTING.md to get started, and browse the Document Library to see all 214 documents currently in the KB.