
Overview

Most AI-generated tests are implementation mirrors. An agent reads a function, infers what it does, and writes a test that confirms the code does what the code does. These tests pass on day one, break on every refactor, and verify nothing meaningful — they’re circular assertions dressed up as coverage.

The problem isn’t that AI agents can’t write tests. It’s that they don’t know what the system is supposed to do. Without access to specifications — acceptance criteria, business rules, invariants, authorization policies, state machine definitions — an agent can only test what it sees in the code. And testing what the code does is not the same as testing what the code should do.

CoreStory changes this equation. It ingests the full codebase alongside the PRD, TechSpec, and architectural documentation, then serves as a Specification Oracle — an intelligence layer that can answer “what should this system do?” before the agent ever looks at implementation. This unlocks a fundamentally different approach to test generation: specification-driven testing, where the agent extracts behavioral specifications first, then generates tests that verify those specifications against the code.

Why Specification Before Code

The principle is simple: if you know what the system should do before you look at how it does it, you produce tests that survive refactoring, catch real bugs, and document actual business intent.
| Approach | What the Agent Knows | Test Quality | Refactor Survival |
| --- | --- | --- | --- |
| Code-mirroring | Only the implementation | Tests confirm code does what code does | Breaks on rename, restructure, or any internal change |
| Specification-driven | Business rules, acceptance criteria, invariants | Tests confirm code does what it should do | Survives any refactor that preserves behavior |
A code-mirroring test asserts how:

```python
from unittest import mock

def test_order_calls_minimum_check_validator():
    # Patch target shown schematically; a real patch needs the full
    # importable module path to OrderValidator.
    with mock.patch("OrderValidator.check_minimum") as m:
        m.return_value = False
        submit_order(create_order(total=4.99))
        m.assert_called_once_with(4.99)
```
A specification-driven test asserts what:

```python
def test_order_rejected_when_below_minimum_amount():
    result = submit_order(create_order(total=4.99))
    assert result.status == "rejected"
    assert "minimum order amount" in result.error
```
The first test breaks when someone renames the validator. The second test survives any refactor that preserves the business rule. CoreStory is what makes the second kind possible at scale — it tells the agent that a minimum order amount rule exists, what the threshold is, and what should happen when it’s violated.

CoreStory’s Role

CoreStory serves as the specification oracle across the entire test generation workflow:
  • Behavioral extraction — surfaces acceptance criteria, validation rules, state transitions, authorization matrices, invariants, and implicit behaviors from the PRD, TechSpec, and codebase
  • Convention discovery — describes the project’s existing test framework, directory structure, fixture patterns, and assertion styles so generated tests match perfectly
  • Gap analysis — identifies which behaviors are already tested, which are partially covered, and which have no coverage at all
  • Validation — reviews generated tests to confirm they’re actually verifying the intended specification, not accidentally testing implementation details
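As an illustration of behavioral extraction, an extracted state-transition rule becomes a test that asserts outcomes rather than internal wiring. The `Order` class and transition names below are hypothetical stand-ins for whatever rules CoreStory surfaces from your PRD and codebase:

```python
# Hypothetical sketch: the Order state machine, its statuses, and its
# transition rules are illustrative assumptions, not a prescribed API.

class InvalidTransition(Exception):
    pass

class Order:
    def __init__(self):
        self.status = "draft"

    def submit(self):
        if self.status != "draft":
            raise InvalidTransition(f"cannot submit from {self.status}")
        self.status = "submitted"

    def cancel(self):
        if self.status not in ("draft", "submitted"):
            raise InvalidTransition(f"cannot cancel from {self.status}")
        self.status = "cancelled"

def test_submitted_order_can_be_cancelled():
    # Extracted rule: "a submitted order may be cancelled" -- assert the
    # resulting state, not which internal methods were called.
    order = Order()
    order.submit()
    order.cancel()
    assert order.status == "cancelled"

def test_cancelled_order_cannot_be_resubmitted():
    # Extracted rule: "cancelled is a terminal state"
    order = Order()
    order.submit()
    order.cancel()
    try:
        order.submit()
        assert False, "expected InvalidTransition"
    except InvalidTransition:
        pass
```

Both tests survive any rename or restructuring of the validation internals, because they only pin down the specified state transitions.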

How This Relates to Other Playbooks

This playbook suite generates tests for existing, already-implemented behavior. It doesn’t implement new features or fix bugs. If you need a different workflow:
  • Implementing a new feature with tests: Use the Feature Implementation playbook — its Phase 4 includes TDD as part of the implementation cycle.
  • Verifying behavioral equivalence during modernization: Use the Behavioral Verification playbook — it compares legacy and modernized implementations.
  • Extracting business rules before testing: Use the Business Rules Extraction playbook — its output (the BR-XXX inventory) feeds directly into the Behavioral Test Coverage sub-playbook.

When to Use This Playbook

  • A codebase has significant untested business logic and you want to close coverage gaps systematically
  • You’re onboarding to an unfamiliar codebase and want to build a safety net before making changes
  • Preparing for a major refactor, migration, or dependency upgrade and need comprehensive regression tests
  • A compliance or audit requirement demands documented test coverage of specific business rules
  • You’ve completed a Business Rules Extraction and want to turn the inventory into executable tests
  • The team’s test coverage is implementation-heavy (mocking everything, testing method signatures) and you want to shift toward behavioral tests
  • You need end-to-end tests that verify critical user journeys against acceptance criteria

When to Skip This Playbook

  • You’re implementing a new feature (use the Feature Implementation playbook)
  • The codebase is trivially small (under ~5k LOC) — write the tests directly
  • No CoreStory project exists for the codebase and you can’t create one
  • You need to verify behavioral equivalence between two implementations (use the Behavioral Verification playbook)
  • The system under test has no observable behavior (pure infrastructure, configuration-only)

Prerequisites

  • CoreStory account with at least one project that has completed ingestion
  • CoreStory MCP server connected to your AI coding agent (see the CoreStory MCP Server Setup Guide)
  • A code repository the agent can read and write to locally
  • An existing test framework configured in the project (these playbooks generate tests matching existing conventions — they don’t set up test infrastructure from scratch)
  • (Recommended) A prior Business Rules Extraction conversation — if one exists, the behavioral inventory phase can consume it directly
  • (Recommended) Ability to run the test suite locally to verify generated tests

The Sub-Playbooks

This playbook suite contains two workflows. Both follow the same “Specification before Code” methodology but differ in scope, tooling, and output.

Behavioral Test Coverage

The primary workflow. Generates unit-level and integration-level behavioral tests that verify business rules, validation logic, state transitions, authorization policies, invariants, and calculations.
| Aspect | Details |
| --- | --- |
| Output | Test files in the project’s existing framework (pytest, Jest, JUnit, xUnit, RSpec, etc.) |
| Scope | One module or domain per session, 10–30 test cases |
| CoreStory role | Specification Oracle — extracts what to test before the agent looks at code |
| Key differentiator | Tests assert behavioral specifications, not implementation details |
| Best for | Closing coverage gaps in business-critical logic, preparing for refactors, compliance requirements |
Go to Behavioral Test Coverage →

E2E Test Generation

The journey-level workflow. Generates end-to-end tests that verify critical user journeys across the full application stack — UI interactions, API calls, data persistence, and cross-service flows.
| Aspect | Details |
| --- | --- |
| Output | E2E test files in the project’s E2E framework (Playwright, Cypress, Selenium, etc.) |
| Scope | One user journey per session, 5–15 test scenarios |
| CoreStory role | Journey Oracle — extracts user stories, acceptance criteria, and critical paths from PRD and codebase |
| Key differentiator | Tests verify complete user journeys against acceptance criteria, including environment setup and flakiness management |
| Best for | Verifying critical user flows, pre-release regression suites, onboarding safety nets |
Go to E2E Test Generation →

Which Sub-Playbook Should I Use?

| Situation | Recommended |
| --- | --- |
| Business logic has coverage gaps (validation, auth, state machines) | Behavioral Test Coverage |
| Critical user journeys have no automated E2E tests | E2E Test Generation |
| Preparing for a refactor of internal logic | Behavioral Test Coverage |
| Preparing for a UI or API overhaul | E2E Test Generation |
| Compliance audit requires documented rule coverage | Behavioral Test Coverage |
| Release confidence requires journey-level regression | E2E Test Generation |
| You’ve completed a Business Rules Extraction | Behavioral Test Coverage (consumes the inventory directly) |
| Both — build from the inside out | Behavioral Test Coverage first, then E2E Test Generation |

Shared Principles

Both sub-playbooks follow these principles:
  • Specification before Code. Always extract what to test from CoreStory before examining source code or writing tests. This ensures tests verify intended behavior, not implementation accidents.
  • Match existing conventions exactly. Generated tests should be indistinguishable from hand-written tests by the team — same framework, same directory structure, same fixture patterns, same assertion style.
  • Behavioral assertions, not implementation assertions. Test what the system does, not how it does it. Assert outcomes and state changes, not method calls and internal wiring.
  • Treat failing assertions as discovery. A generated test that fails on assertion (not on setup) is telling you something valuable: either the specification is wrong or the code is wrong. Both are worth knowing. Flag these for human review.
  • Specific queries produce specific tests. “What should I test?” produces shallow tests. “What validation rules exist for order submission, including minimum order amounts, inventory checks, and payment method validation?” produces precise, high-value tests.
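To make the query-specificity principle concrete: a precise extracted rule yields boundary tests that a vague query would miss. The sketch below assumes a hypothetical minimum order amount of $5.00 and a stand-in `submit_order` function; both are illustrative, not taken from any real project:

```python
# Hypothetical rule: "minimum order amount is $5.00" (threshold assumed
# for illustration). A precise specification like this lets the agent
# test the boundary itself, not just an arbitrary failing value.

MINIMUM_ORDER_AMOUNT = 5.00  # illustrative threshold

def submit_order(total):
    # Stand-in for the real submission path
    if total < MINIMUM_ORDER_AMOUNT:
        return {"status": "rejected", "error": "below minimum order amount"}
    return {"status": "accepted", "error": None}

def test_order_at_exact_minimum_is_accepted():
    # Boundary case: the threshold itself is valid
    assert submit_order(5.00)["status"] == "accepted"

def test_order_one_cent_below_minimum_is_rejected():
    result = submit_order(4.99)
    assert result["status"] == "rejected"
    assert "minimum order amount" in result["error"]
```

A vague “what should I test?” query tends to produce only the happy path; knowing the exact threshold is what makes the boundary pair above possible.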

CoreStory MCP Tools Used

Both sub-playbooks use the same set of CoreStory MCP tools:
| Tool | Purpose |
| --- | --- |
| list_projects | Find the target project |
| create_conversation | Create a persistent conversation for the generation session |
| send_message | Query CoreStory for specifications, conventions, and validation |
| get_project_prd | Skim PRD structure for domain vocabulary and acceptance criteria |
| get_project_techspec | Skim TechSpec for data model constraints and architectural invariants |
| list_conversations | Check for prior Business Rules Extraction sessions |
| get_conversation | Resume or consume a prior session |
| rename_conversation | Mark the conversation as resolved |
A note on the PRD and TechSpec: These documents are typically too large for an agent’s context window. Don’t try to read them end-to-end. Query CoreStory about their contents via send_message instead — CoreStory has already ingested them and can answer targeted questions more efficiently than the agent can parse the raw documents.
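The pattern is: query, don’t parse. As a sketch of the difference in query granularity — the `send_message` helper below is a hypothetical stand-in (real agents invoke the MCP tool through their own runtime, not a Python function):

```python
# Hypothetical stand-in for the send_message MCP tool; it only
# illustrates query granularity, not the actual call shape.
def send_message(conversation_id, text):
    return f"[CoreStory answer to: {text[:40]}...]"

conversation_id = "example-conversation"  # placeholder id

# Too broad -- forces a summary of the whole PRD:
#   send_message(conversation_id, "Summarize the PRD.")

# Targeted -- returns exactly the specification slice the tests need:
answer = send_message(
    conversation_id,
    "What validation rules exist for order submission, including minimum "
    "order amounts, inventory checks, and payment method validation?",
)
```

The targeted form keeps the agent’s context window free for generated tests, while CoreStory does the document navigation.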