Overview
Most AI-generated tests are implementation mirrors. An agent reads a function, infers what it does, and writes a test that confirms the code does what the code does. These tests pass on day one, break on every refactor, and verify nothing meaningful — they’re circular assertions dressed up as coverage.
The problem isn’t that AI agents can’t write tests. It’s that they don’t know what the system is supposed to do. Without access to specifications — acceptance criteria, business rules, invariants, authorization policies, state machine definitions — an agent can only test what it sees in the code. And testing what the code does is not the same as testing what the code should do.
CoreStory changes this equation. It ingests the full codebase alongside the PRD, TechSpec, and architectural documentation, then serves as a Specification Oracle — an intelligence layer that can answer “what should this system do?” before the agent ever looks at implementation. This unlocks a fundamentally different approach to test generation: specification-driven testing, where the agent extracts behavioral specifications first, then generates tests that verify those specifications against the code.
Why Specification Before Code
The principle is simple: if you know what the system should do before you look at how it does it, you produce tests that survive refactoring, catch real bugs, and document actual business intent.
| Approach | What the Agent Knows | Test Quality | Refactor Survival |
|---|---|---|---|
| Code-mirroring | Only the implementation | Tests confirm code does what code does | Breaks on rename, restructure, or any internal change |
| Specification-driven | Business rules, acceptance criteria, invariants | Tests confirm code does what it should do | Survives any refactor that preserves behavior |
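The contrast in the table can be made concrete. Below is a minimal Python sketch — the function, rule, and names are hypothetical, invented for illustration — showing an implementation-mirroring test that couples itself to internal wiring next to a specification-driven test that asserts the business rule itself.

```python
# Hypothetical module under test: an order validator with one business
# rule taken from a made-up spec: "orders under $25.00 are rejected".
MINIMUM_ORDER_CENTS = 2500

def validate_order(total_cents, items):
    """Return (ok, reason). The internal representation is free to change."""
    if not items:
        return (False, "empty_order")
    if total_cents < MINIMUM_ORDER_CENTS:
        return (False, "below_minimum")
    return (True, None)

# Implementation-mirroring test: restates the code's internals.
# It passes today, but breaks if the constant is renamed or the
# reason strings are restructured — without catching any real bug.
def test_mirrors_implementation():
    assert MINIMUM_ORDER_CENTS == 2500
    assert validate_order(100, ["widget"])[1] == "below_minimum"

# Specification-driven test: asserts the rule as the spec states it.
# Any refactor that preserves behavior keeps this test green.
def test_orders_below_minimum_are_rejected():
    ok, _ = validate_order(2499, ["widget"])
    assert not ok
    ok, _ = validate_order(2500, ["widget"])
    assert ok
```

The second test survives a rename of the constant, a switch to a decimal type, or a restructuring of the reason codes, because it only pins down what the specification pins down.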
CoreStory’s Role
CoreStory serves as the Specification Oracle across the entire test generation workflow:
- Behavioral extraction — surfaces acceptance criteria, validation rules, state transitions, authorization matrices, invariants, and implicit behaviors from the PRD, TechSpec, and codebase
- Convention discovery — describes the project’s existing test framework, directory structure, fixture patterns, and assertion styles so generated tests match perfectly
- Gap analysis — identifies which behaviors are already tested, which are partially covered, and which have no coverage at all
- Validation — reviews generated tests to confirm they’re actually verifying the intended specification, not accidentally testing implementation details
How This Relates to Other Playbooks
This playbook suite generates tests for existing, already-implemented behavior. It doesn’t implement new features or fix bugs. If you need a different workflow:
- Implementing a new feature with tests: Use the Feature Implementation playbook — its Phase 4 includes TDD as part of the implementation cycle.
- Verifying behavioral equivalence during modernization: Use the Behavioral Verification playbook — it compares legacy and modernized implementations.
- Extracting business rules before testing: Use the Business Rules Extraction playbook — its output (the BR-XXX inventory) feeds directly into the Behavioral Test Coverage sub-playbook.
When to Use This Playbook
- A codebase has significant untested business logic and you want to close coverage gaps systematically
- You’re onboarding to an unfamiliar codebase and want to build a safety net before making changes
- Preparing for a major refactor, migration, or dependency upgrade and need comprehensive regression tests
- A compliance or audit requirement demands documented test coverage of specific business rules
- You’ve completed a Business Rules Extraction and want to turn the inventory into executable tests
- The team’s test coverage is implementation-heavy (mocking everything, testing method signatures) and you want to shift toward behavioral tests
- You need end-to-end tests that verify critical user journeys against acceptance criteria
When to Skip This Playbook
- You’re implementing a new feature (use the Feature Implementation playbook)
- The codebase is trivially small (under ~5k LOC) — write the tests directly
- No CoreStory project exists for the codebase and you can’t create one
- You need to verify behavioral equivalence between two implementations (use the Behavioral Verification playbook)
- The system under test has no observable behavior (pure infrastructure, configuration-only)
Prerequisites
- CoreStory account with at least one project that has completed ingestion
- CoreStory MCP server connected to your AI coding agent (see the CoreStory MCP Server Setup Guide)
- A code repository the agent can read and write to locally
- An existing test framework configured in the project (these playbooks generate tests matching existing conventions — they don’t set up test infrastructure from scratch)
- (Recommended) A prior Business Rules Extraction conversation — if one exists, the behavioral inventory phase can consume it directly
- (Recommended) Ability to run the test suite locally to verify generated tests
The Sub-Playbooks
This playbook suite contains two workflows. Both follow the same “Specification before Code” methodology but differ in scope, tooling, and output.
Behavioral Test Coverage
The primary workflow. Generates unit-level and integration-level behavioral tests that verify business rules, validation logic, state transitions, authorization policies, invariants, and calculations.
| Aspect | Details |
|---|---|
| Output | Test files in the project’s existing framework (pytest, Jest, JUnit, xUnit, RSpec, etc.) |
| Scope | One module or domain per session, 10–30 test cases |
| CoreStory role | Specification Oracle — extracts what to test before the agent looks at code |
| Key differentiator | Tests assert behavioral specifications, not implementation details |
| Best for | Closing coverage gaps in business-critical logic, preparing for refactors, compliance requirements |
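When a Business Rules Extraction inventory is available, its entries translate naturally into table-driven behavioral tests. Here is a hedged Python sketch — the BR IDs, the rules, and the `can_cancel` function are all invented for illustration, not taken from any real inventory:

```python
def can_cancel(order_status, requester_role):
    """Toy implementation satisfying the two made-up rules below."""
    if order_status == "shipped":
        return requester_role == "admin"            # BR-101 (hypothetical)
    return order_status in ("pending", "paid")      # BR-102 (hypothetical)

# Each case cites the rule it verifies, so a failing assertion points
# straight back to the specification entry — discovery, not noise.
SPEC_CASES = [
    # (rule_id, status, role, expected)
    ("BR-101", "shipped", "customer", False),  # customers cannot cancel shipped orders
    ("BR-101", "shipped", "admin",    True),   # admins may cancel shipped orders
    ("BR-102", "pending", "customer", True),   # pre-shipment orders are cancellable
    ("BR-102", "paid",    "customer", True),
]

def test_cancellation_rules():
    for rule_id, status, role, expected in SPEC_CASES:
        got = can_cancel(status, role)
        assert got == expected, f"{rule_id}: {status}/{role} -> {got}"
```

The rule IDs in the case table are the traceability link: an auditor can map each assertion back to the inventory without reading the implementation.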
E2E Test Generation
The journey-level workflow. Generates end-to-end tests that verify critical user journeys across the full application stack — UI interactions, API calls, data persistence, and cross-service flows.
| Aspect | Details |
|---|---|
| Output | E2E test files in the project’s E2E framework (Playwright, Cypress, Selenium, etc.) |
| Scope | One user journey per session, 5–15 test scenarios |
| CoreStory role | Journey Oracle — extracts user stories, acceptance criteria, and critical paths from PRD and codebase |
| Key differentiator | Tests verify complete user journeys against acceptance criteria, including environment setup and flakiness management |
| Best for | Verifying critical user flows, pre-release regression suites, onboarding safety nets |
Which Sub-Playbook Should I Use?
| Situation | Recommended |
|---|---|
| Business logic has coverage gaps (validation, auth, state machines) | Behavioral Test Coverage |
| Critical user journeys have no automated E2E tests | E2E Test Generation |
| Preparing for a refactor of internal logic | Behavioral Test Coverage |
| Preparing for a UI or API overhaul | E2E Test Generation |
| Compliance audit requires documented rule coverage | Behavioral Test Coverage |
| Release confidence requires journey-level regression | E2E Test Generation |
| You’ve completed a Business Rules Extraction | Behavioral Test Coverage (consumes the inventory directly) |
| Both — build from the inside out | Behavioral Test Coverage first, then E2E Test Generation |
Shared Principles
Both sub-playbooks follow these principles:
- Specification before Code. Always extract what to test from CoreStory before examining source code or writing tests. This ensures tests verify intended behavior, not implementation accidents.
- Match existing conventions exactly. Generated tests should be indistinguishable from hand-written tests by the team — same framework, same directory structure, same fixture patterns, same assertion style.
- Behavioral assertions, not implementation assertions. Test what the system does, not how it does it. Assert outcomes and state changes, not method calls and internal wiring.
- Treat failing assertions as discovery. A generated test that fails on assertion (not on setup) is telling you something valuable: either the specification is wrong or the code is wrong. Both are worth knowing. Flag these for human review.
- Specific queries produce specific tests. “What should I test?” produces shallow tests. “What validation rules exist for order submission, including minimum order amounts, inventory checks, and payment method validation?” produces precise, high-value tests.
CoreStory MCP Tools Used
Both sub-playbooks use the same set of CoreStory MCP tools:
| Tool | Purpose |
|---|---|
| list_projects | Find the target project |
| create_conversation | Create a persistent conversation for the generation session |
| send_message | Query CoreStory for specifications, conventions, and validation |
| get_project_prd | Skim PRD structure for domain vocabulary and acceptance criteria |
| get_project_techspec | Skim TechSpec for data model constraints and architectural invariants |
| list_conversations | Check for prior Business Rules Extraction sessions |
| get_conversation | Resume or consume a prior session |
| rename_conversation | Mark the conversation as resolved |
For targeted questions about the PRD or TechSpec, prefer send_message instead — CoreStory has already ingested them and can answer targeted questions more efficiently than the agent can parse the raw documents.
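The tool names above imply a rough session order. The sketch below encodes that order as data so it can be reviewed or tested; only the tool names come from the table — the argument shapes and the `plan_session` helper are assumptions invented for illustration, not CoreStory’s actual schemas:

```python
# Hypothetical tool-call plan for one Behavioral Test Coverage session.
def plan_session(project_name, module):
    """Return the (tool, args) sequence a session might follow."""
    return [
        ("list_projects", {}),
        ("list_conversations", {"project": project_name}),    # prior BR extraction?
        ("create_conversation", {"project": project_name}),
        ("get_project_prd", {"project": project_name}),       # skim vocabulary
        ("get_project_techspec", {"project": project_name}),  # skim constraints
        ("send_message", {"text": f"What validation rules govern {module}?"}),
        ("send_message", {"text": f"Which {module} behaviors already have tests?"}),
        ("rename_conversation", {"label": "resolved"}),
    ]
```

Note the shape of the two send_message queries: both are specific, scoped questions, in line with the “specific queries produce specific tests” principle.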