Overview
Test suites rot from the inside. Teams write tests for the features they build, skip the ones they inherit, and never go back to fill the gaps. The result is coverage that correlates with recency, not criticality — the newest code is well-tested, the most important code is not. AI coding agents can generate tests at scale, but without knowing what the system is supposed to do, they produce tests that mirror implementation: tests that pass today, break on every refactor, and verify nothing meaningful.

This playbook teaches you how to use the CoreStory MCP server, combined with local source code, to systematically generate tests that verify behavioral specifications — acceptance criteria, business rules, invariants, state transitions, and authorization policies — rather than implementation details. The approach uses CoreStory as a Specification Oracle: the agent queries CoreStory for what the system should do, discovers how the system actually does it, and generates tests that bridge the two.

The primary deliverable is executable test code that matches the project’s existing test conventions — framework, directory structure, naming patterns, fixture approach, and assertion style. There is no intermediate documentation artifact. The behavioral inventory lives in the CoreStory conversation; the tests are the output.

How this relates to other playbooks: This playbook generates tests for existing, already-implemented behavior — it doesn’t implement new features or fix bugs. If you’re implementing a new feature and want tests as part of that process, use the Feature Implementation playbook, which includes TDD as Phase 4. If you’re verifying behavioral equivalence between legacy and modernized code, use the Behavioral Verification playbook. If you need to extract and document business rules before generating tests, use the Business Rules Extraction playbook — its output feeds directly into Phase 2 of this playbook.
This playbook’s unique contribution is systematic, specification-driven test generation for existing codebases that lack adequate coverage.
When to Use This Playbook
- A codebase has significant untested business logic and you want to close coverage gaps systematically
- You’re onboarding to an unfamiliar codebase and want to build a safety net before making changes
- Preparing for a major refactor, migration, or dependency upgrade and need comprehensive regression tests
- A compliance or audit requirement demands documented test coverage of specific business rules
- You’ve completed a Business Rules Extraction and want to turn the inventory into executable tests
- The team’s test coverage is implementation-heavy (mocking everything, testing method signatures) and you want to shift toward behavioral tests
When to Skip This Playbook
- You’re implementing a new feature (use the Feature Implementation playbook — its TDD phase generates tests as part of implementation)
- The codebase is trivially small (under ~5k LOC) — write the tests directly
- No CoreStory project exists for the codebase and you can’t create one
- You need to verify behavioral equivalence between two implementations (use the Behavioral Verification playbook)
- The system under test has no observable behavior (pure infrastructure, configuration-only)
Prerequisites
- CoreStory account with at least one project that has completed ingestion
- CoreStory MCP server connected to your AI coding agent (see the CoreStory MCP Server Setup Guide)
- A code repository the agent can read and write to locally
- An existing test framework configured in the project (the playbook generates tests matching existing conventions — it doesn’t set up test infrastructure from scratch)
- (Recommended) A prior Business Rules Extraction conversation — if one exists, Phase 2 can consume it directly instead of starting from scratch
- (Recommended) Ability to run the test suite locally to verify generated tests
How It Works
The Workflow Phases
| Phase | Name | Purpose | CoreStory Role |
|---|---|---|---|
| 1 | Setup & Scoping | Select project, create conversation, define generation scope | Setup |
| 2 | Behavioral Inventory | Extract testable specifications: acceptance criteria, invariants, business rules, state transitions | Oracle |
| 3 | Test Convention Discovery | Understand the project’s existing test patterns, framework, fixtures, and structure | Oracle + Navigator |
| 4 | Coverage Gap Analysis | Map behavioral inventory against existing tests to identify what’s missing | Navigator |
| 5 | Test Generation & Validation | Generate test code, run it, verify tests are meaningful | — (local code + CoreStory validation) |
| 6 | Completion & Capture | Review coverage, commit tests, rename conversation | Knowledge capture |
CoreStory MCP Tools Used
| Tool | Phase(s) | Purpose |
|---|---|---|
| list_projects | 1 | Find the target project |
| create_conversation | 1 | Create a persistent conversation for the generation session |
| send_message | 2, 3, 4, 5 | Query CoreStory for specifications, test conventions, and validation |
| get_project_prd | 2 | Skim PRD structure for domain vocabulary and acceptance criteria sections |
| get_project_techspec | 2 | Skim TechSpec for data model constraints and architectural invariants |
| list_conversations | 1 | Check for prior Business Rules Extraction sessions to build on |
| get_conversation | 1 | Resume or consume a prior extraction session |
| rename_conversation | 6 | Mark the conversation as resolved |
Note: don’t parse the full PRD and TechSpec documents locally — use send_message instead. CoreStory has already ingested them and can answer targeted questions about acceptance criteria, business rules, and constraints more efficiently than the agent can parse the raw documents.
HITL Gate
After Phase 4 (Coverage Gap Analysis): Before generating tests, a human should review the behavioral inventory and prioritized gap list. This is the checkpoint where domain knowledge matters most — the human validates that the extracted specifications are correct, that the prioritization makes sense, and that the scope is appropriate. Generating tests from incorrect specifications produces confidently wrong assertions.
Step-by-Step Walkthrough
Phase 1 — Setup & Scoping
Goal: Establish the test generation session and define what you’re generating tests for.
Step 1.1: Find the project. Call list_projects and note the target project’s project_id — you’ll use it for every subsequent call.
Step 1.2: Check for prior work.
- Business Rules Extraction conversations (titles containing “Business Rules”) — if one exists for the module you’re targeting, it contains a pre-built behavioral inventory. You can consume it in Phase 2 instead of extracting from scratch.
- Prior Test Generation conversations (titles containing “Test Generation”) — if you’ve run this playbook before on a different module, review it for conventions and patterns that worked well.
Use get_conversation to review any relevant prior work.
Step 1.3: Create a conversation. Call create_conversation with a descriptive title, for example:
- “Test Generation — Order Processing Module”
- “Test Generation — Full System Behavioral Coverage”
- “Test Generation — Authentication & Authorization Rules”
Step 1.4: Choose a generation scope.
| Scope | When to Use | Expected Output |
|---|---|---|
| Single module | You need tests for one specific area (e.g., payments, auth) | 10–30 test cases |
| Single domain | You need tests across a domain (e.g., all e-commerce rules) | 30–80 test cases |
| Full system | You need comprehensive behavioral coverage | 80+ test cases, done in multiple sessions |
Phase 2 — Behavioral Inventory (Oracle)
Goal: Extract a comprehensive list of testable behavioral specifications for the scoped area.
This is the phase that distinguishes specification-driven test generation from naive code-coverage-driven generation. You’re building an inventory of what the system should do, not what it happens to do. Each item in this inventory becomes one or more test cases.
If a Business Rules Extraction conversation exists for the target scope, consume it directly. Otherwise, query CoreStory to build an inventory covering:
- Acceptance criteria (from PRD / user stories)
- Validation rules (per entity/operation)
- State transitions (valid and invalid)
- Authorization rules (per role × operation)
- Invariants (always-true conditions)
- Implicit behaviors (undocumented but enforced)
- Calculations and transformations
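One lightweight way to carry the inventory through the later phases is as structured data. A sketch of what an entry might look like — the field names and example values here are illustrative, not a CoreStory schema:

```python
from dataclasses import dataclass

@dataclass
class BehaviorSpec:
    """One entry in the behavioral inventory (illustrative structure)."""
    spec_id: str    # e.g. "BR-014" if a Business Rules Extraction exists
    kind: str       # validation | state_transition | authorization | invariant | ...
    statement: str  # the testable specification, in domain language
    source: str     # PRD section, rule ID, or code location it came from
    implicit: bool = False  # True if enforced in code but undocumented

# Hypothetical entries for an order-processing scope.
inventory = [
    BehaviorSpec("BR-014", "validation",
                 "Orders below the minimum order amount are rejected",
                 "PRD, order submission section"),
    BehaviorSpec("INV-003", "invariant",
                 "Refund total never exceeds the captured payment amount",
                 "payments module (code only)", implicit=True),
]
```

Recording the source and the implicit flag per item pays off later: implicit, code-only rules are exactly the ones the HITL gate should scrutinize.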
Phase 3 — Test Convention Discovery (Oracle + Navigator)
Goal: Understand how the project’s existing tests are structured so generated tests match perfectly.
This phase ensures generated tests are indistinguishable from hand-written tests by the team. The agent must discover the testing conventions before writing any test code.
Step 3.1: Query for test framework and structure. Ask CoreStory which framework the project uses, then confirm locally: the test directory layout, file naming patterns (test_*.py, *.test.ts, *Test.java), and any test configuration files.
Step 3.2: Query for fixture and setup patterns.
- Import patterns
- Setup/teardown patterns
- Assertion style (fluent, classic, custom)
- Mock/stub conventions
- Test naming (descriptive strings vs. method names)
- Comment and docstring conventions
Step 3.3: Record the conventions the generated tests must follow:
- Test framework and runner
- Directory and file naming conventions
- Fixture and setup patterns to follow
- Mock/stub approach
- Assertion style
- 2–3 reference test files to use as templates
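The checklist above can be condensed into a small artifact the agent reuses verbatim in Phase 5. A sketch for a pytest project — every value here is an illustrative assumption, to be replaced with what discovery actually finds:

```python
# Convention summary from Phase 3, recorded as data the agent carries into
# test generation. Values are hypothetical examples for a pytest project.
conventions = {
    "framework": "pytest",
    "test_dir": "tests/",
    "file_pattern": "test_*.py",
    "naming": "test_<behavior>_<expected_outcome>",
    "fixtures": "shared fixtures live in tests/conftest.py",
    "assertion_style": "plain assert plus pytest.raises",
    "mocking": "unittest.mock via the mocker fixture",
    "reference_files": ["tests/test_orders.py", "tests/test_auth.py"],
}
```

Keeping this as an explicit artifact makes convention mismatches (the most common cause of setup failures in Phase 5) easy to diagnose.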
Phase 4 — Coverage Gap Analysis (Navigator)
Goal: Map the behavioral inventory against existing tests to identify what’s missing. Prioritize the gaps.
Step 4.1: Query for existing test coverage. For each item in the behavioral inventory, determine:
- Which behaviors are already tested
- Which behaviors are tested but weakly (e.g., only happy path, no edge cases)
- Which behaviors have no test coverage at all
| Status | Meaning | Action |
|---|---|---|
| Covered | Existing tests adequately verify this behavior | Skip — no new test needed |
| Partially covered | Tests exist but miss edge cases, error paths, or boundary conditions | Generate supplemental tests |
| Uncovered | No tests verify this behavior | Generate full test coverage |
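When tests carry their spec ID in a docstring (as Phase 5 recommends), part of this mapping can be mechanized. A rough sketch — the ID pattern and helper are illustrative, and a verbatim text match says nothing about assertion quality, so treat the result as a starting point for review, not a verdict:

```python
import re

def coverage_status(inventory_ids, test_corpus):
    """Classify each spec ID as covered or uncovered by scanning the
    combined text of the test suite for verbatim ID mentions.
    A heuristic only: it cannot distinguish covered from partially covered."""
    # In practice, test_corpus would be the concatenated contents of the
    # suite's test files (e.g. everything matching tests/**/test_*.py).
    mentioned = set(re.findall(r"\b(?:BR|INV|ST)-\d+\b", test_corpus))
    return {sid: ("covered" if sid in mentioned else "uncovered")
            for sid in inventory_ids}

# Example: two inventory IDs checked against one test docstring.
status = coverage_status(
    ["BR-014", "INV-003"],
    '"""BR-014: orders below the minimum are rejected."""',
)
```

Here `status` marks BR-014 as covered and INV-003 as uncovered; a human (or a follow-up CoreStory query) still has to downgrade weak matches to "partially covered".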
Step 4.2: Prioritize the gaps. Rank partially covered and uncovered behaviors by:
- Business criticality — Rules that affect money, security, data integrity, or regulatory compliance
- Risk of breakage — Behaviors in frequently modified code, complex logic, or cross-component interactions
- Specificity of specification — Behaviors where the Phase 2 inventory has precise, testable specifications (vague specifications produce vague tests)
- Testability — Behaviors that can be tested in isolation without excessive infrastructure
HITL Gate: Present the gap analysis to the human for review before proceeding. Key questions: Are the extracted specifications correct? Is the prioritization sensible? Is the scope appropriate for this session?
Phase 5 — Test Generation & Validation
Goal: Generate test code for each gap, verify tests pass, and confirm they’re meaningful.
Work through the prioritized gap list from Phase 4. For each behavioral specification:
Step 5.1: Generate the test. Using the behavioral specification from Phase 2, the test conventions from Phase 3, and the reference test files as templates, write the test. Each test should:
- Follow the project’s naming conventions exactly
- Use the project’s fixture and setup patterns
- Assert the behavioral specification, not implementation details
- Include a docstring or comment linking back to the specification (e.g., the acceptance criterion, business rule ID, or invariant)
- Handle setup, action, and assertion in the project’s standard structure (AAA, Given-When-Then, etc.)
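Putting the checklist together, a generated test might look like this sketch. pytest is assumed; `submit_order`, `OrderError`, and the rule ID are hypothetical stand-ins for the project’s real code and inventory:

```python
import pytest

# Hypothetical system under test. In a real run this would be an import of
# the project's actual module, e.g. `from orders.service import submit_order`.
class OrderError(Exception):
    pass

def submit_order(items, total):
    """Toy stand-in enforcing the rule under test."""
    if total < 10:
        raise OrderError("MINIMUM_ORDER_AMOUNT")
    return {"status": "submitted", "total": total}

def test_order_below_minimum_amount_is_rejected():
    """BR-014 (hypothetical ID): orders under the minimum amount are rejected.

    Asserts the behavioral rule (rejection and its reason), not how the
    check happens to be implemented.
    """
    # Arrange
    items, total = ["sku-1"], 9.99
    # Act / Assert
    with pytest.raises(OrderError, match="MINIMUM_ORDER_AMOUNT"):
        submit_order(items, total)
```

Note the docstring links back to the inventory item and the assertion targets the observable outcome; renaming internals or refactoring the check should not break this test.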
Step 5.2: Run each new test and triage failures:
| Failure Type | Meaning | Action |
|---|---|---|
| Setup failure | Test infrastructure is wrong (bad imports, missing fixtures, incorrect setup) | Fix the test mechanics — this is a convention mismatch, not a specification issue |
| Assertion failure | The code doesn’t match the specification | Investigate: is the spec wrong or is the code wrong? This is valuable discovery — flag it for human review |
| Runtime error | Test triggers an error path not anticipated in the specification | Add to the behavioral inventory as a discovered edge case |
Step 5.3: Run the full suite and confirm:
- All new tests pass
- No existing tests broke (new test files shouldn’t affect existing tests, but shared fixture changes might)
- Test execution time is reasonable (generated tests should not significantly slow the suite)
Phase 6 — Completion & Capture
Goal: Finalize generated tests, capture the session, and report coverage.
Step 6.1: Review coverage against the behavioral inventory. Map the generated tests back to the Phase 2 behavioral inventory and produce a summary of what is now covered, partially covered, and deferred.
Step 6.2: Commit the tests, then use rename_conversation to mark the session as resolved.
Tips & Best Practices
The specificity principle applies to test generation even more than to extraction. Compare:
| Query | Test Quality |
|---|---|
| “What should I test in the order module?” | Generic, shallow tests |
| “What validation rules exist for order submission, including minimum order amounts, inventory checks, and payment method validation?” | Precise, high-value tests with specific assertions |
Scope each session deliberately:
- Generate tests one domain at a time, completing the full cycle (inventory → conventions → gaps → generate → validate) before moving on
- Within a domain, generate core behavioral tests first, edge case tests second
- Target 10–30 test cases per session — enough to be meaningful, small enough for thorough human review
- For a full-system effort, plan multiple sessions with clear domain boundaries
Involve a human at these checkpoints:
- After Phase 2 (behavioral inventory) — to validate extracted specifications, especially implicit rules that exist only in code
- After Phase 4 (gap analysis) — to confirm prioritization and scope
- When generated tests fail on assertion — to determine whether the spec or the code is wrong
- For low-confidence specifications (found only in code with no supporting documentation)
Advanced Patterns
Consuming a Business Rules Inventory
If the Business Rules Extraction playbook has been run for this module, the output is a structured inventory with rule IDs (BR-XXX), domains, types, enforcement layers, source files, and invariants. Each rule maps to tests as follows:
| Rule Type | Test Pattern |
|---|---|
| Validation | Given invalid input → assert rejection with specific error |
| Authorization | Given unauthorized user/role → assert access denied |
| State Transition | Given entity in state A → perform action → assert state B + postconditions |
| Calculation | Given inputs → assert output matches formula/expected value |
| Constraint | After any operation → assert invariant still holds |
| Workflow | Given preconditions → execute full workflow → assert end state + side effects |
Parameterized Tests for Validation Rules
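Rather than one test function per constraint value, a table of cases can drive a single test. A pytest sketch — the framework is assumed, and `validate_email` with its constraint values is hypothetical:

```python
import re
import pytest

# Hypothetical validation rule under test: simplified email format check
# with a 64-character local-part limit (values illustrative).
def validate_email(value):
    local = value.split("@")[0]
    well_formed = bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value))
    return well_formed and len(local) <= 64

@pytest.mark.parametrize(
    ("value", "expected"),
    [
        ("user@example.com", True),          # happy path
        ("no-at-sign.example.com", False),   # missing @
        ("user@example", False),             # missing dot in domain
        ("a" * 65 + "@example.com", False),  # local part over the limit
    ],
)
def test_email_validation(value, expected):
    """BR-021 (hypothetical): email format and length constraints."""
    assert validate_email(value) is expected
```

Each constraint value becomes one visible row, so adding a newly discovered boundary case is a one-line change.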
When a validation rule has multiple constraint values (e.g., field length limits, allowed formats, enum values), generate parameterized tests rather than individual test functions.
Authorization Matrix Testing
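A role × operation matrix can be expanded mechanically into one test per cell. A pytest sketch — the roles, operations, and `can_perform` policy are hypothetical stand-ins for the project’s real authorization check:

```python
import pytest

# Hypothetical role x operation matrix from the behavioral inventory.
AUTHZ_MATRIX = {
    ("admin",    "refund_order"):   True,
    ("support",  "refund_order"):   True,
    ("customer", "refund_order"):   False,
    ("admin",    "delete_account"): True,
    ("support",  "delete_account"): False,
    ("customer", "delete_account"): False,
}

# Hypothetical policy under test -- replace with the project's real check.
def can_perform(role, operation):
    allowed = {"refund_order": {"admin", "support"},
               "delete_account": {"admin"}}
    return role in allowed.get(operation, set())

@pytest.mark.parametrize(("role", "operation", "allowed"),
                         [(r, o, a) for (r, o), a in AUTHZ_MATRIX.items()])
def test_authorization_matrix(role, operation, allowed):
    """AUTH rules: every role x operation cell is asserted explicitly."""
    assert can_perform(role, operation) is allowed
```

Because every cell is explicit, a policy change that silently widens access fails a named test instead of slipping through.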
When the behavioral inventory includes authorization rules across multiple roles and operations, generate the tests systematically from the matrix.
State Transition Testing
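Lifecycle rules can be driven from a transition table covering both the allowed edges and a sample of disallowed ones. A pytest sketch — `Order`, its states, and `TransitionError` are hypothetical:

```python
import pytest

# Hypothetical lifecycle from the behavioral inventory.
VALID_TRANSITIONS = {
    ("draft", "submit"): "submitted",
    ("submitted", "approve"): "approved",
    ("submitted", "reject"): "rejected",
}

class TransitionError(Exception):
    pass

class Order:
    """Minimal stand-in for the real entity under test."""
    def __init__(self, state="draft"):
        self.state = state

    def apply(self, action):
        try:
            self.state = VALID_TRANSITIONS[(self.state, action)]
        except KeyError:
            raise TransitionError(f"{action} invalid from {self.state}")

def test_submit_moves_draft_to_submitted():
    """ST-002 (hypothetical): draft --submit--> submitted."""
    order = Order("draft")
    order.apply("submit")
    assert order.state == "submitted"

def test_approve_from_draft_is_rejected():
    """Invalid transitions must fail loudly, not silently no-op."""
    with pytest.raises(TransitionError):
        Order("draft").apply("approve")
```

Pairing each valid-edge test with at least one invalid-edge test catches the common bug where an entity accepts actions from the wrong state.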
For entities with defined lifecycles, generate tests for both valid and invalid transitions.
Integration with CI/CD
Generated tests should be integrated into the project’s CI/CD pipeline like any other tests. No special configuration should be needed — the tests use the same framework, fixtures, and assertion patterns as existing tests. If the project has coverage reporting (e.g., pytest-cov, Istanbul, JaCoCo), the generated tests will automatically improve reported coverage.
For teams running this playbook regularly, consider a periodic cadence: run test generation for one domain per sprint, rotating through the system. This gradually builds comprehensive behavioral coverage without requiring a single large effort.
Troubleshooting
CoreStory returns vague behavioral specifications. Your query is too broad. Replace “What should I test?” with “What validation rules exist for [specific entity] including [specific rule types]?” Always name the module, entity, or workflow. See the specificity principle in Tips above.

Generated tests fail on setup, not on assertions. The test conventions from Phase 3 don’t match reality. Re-inspect the existing test files locally. Common causes: wrong import paths, missing fixture setup, incorrect mock targets, or framework version mismatches.

Generated tests all pass but don’t feel meaningful. The tests may be asserting implementation details rather than behavioral specifications. Review against the “behavioral vs. implementation” distinction in Tips. If the test would still pass after changing the underlying business rule, it’s not testing the rule.

CoreStory’s behavioral specification contradicts what the code does. This is valuable discovery. The specification (from the PRD or CoreStory’s understanding) says X; the code does Y. Flag it as a conflict rather than adjusting the test to match the code. One of two things is true: the code has a bug, or the specification is outdated. Both are worth knowing.

Too many gaps to address in one session. This is normal for large, undertested codebases. Focus on one domain per session, prioritized by business criticality. Use the gap matrix from Phase 4 to plan a multi-session campaign. Each session produces value independently — you don’t need to cover everything at once.

Phase 2 surfaces behaviors that are already well-tested. Skip them. The gap analysis in Phase 4 exists precisely to avoid generating redundant tests. If Phase 4 shows most behaviors are covered, the module has good existing coverage — move to a different module.

Tests take too long to run. Generated behavioral tests should be fast. If they’re slow, check whether they’re accidentally hitting real databases, APIs, or file systems instead of using the project’s standard mocks and fixtures. Ensure generated tests follow the same isolation patterns as existing tests.
Agent Implementation Guides
Claude Code
Setup
1. Connect the CoreStory MCP server. Run the setup command from the CoreStory MCP Server Setup Guide in your terminal.
2. Create the skill file. Create .claude/skills/generate-tests/SKILL.md with the contents from the Skill File section below. Commit it to version control so the whole team gets it.
Usage
The skill activates automatically when Claude Code detects test generation requests.
Tips
- Skills auto-load from directories added via --add-dir, so team-shared skills work across machines.
- Claude Code detects file changes during sessions — you can edit the skill file and it takes effect immediately.
- Keep the SKILL.md under 500 lines for reliable loading.
- Let it run. The workflow is designed for autonomous execution. Interrupting mid-phase breaks the chain of context.
- Start with a focused module. A single-module run produces tests you can review in one sitting. Full-system runs produce too much to review at once.
- The skill works with other skills. If you have a Business Rules Extraction skill, Claude Code will use its output as input to Phase 2.
Skill File
Save as .claude/skills/generate-tests/SKILL.md:
GitHub Copilot
Setup
- Configure the CoreStory MCP server. Add it to your VS Code MCP settings (.vscode/mcp.json or user settings). Verify by asking Copilot Chat: “List my CoreStory projects.”
- Add project-level custom instructions. Create or update .github/copilot-instructions.md with the content from the instructions file below.
- Optionally add a reusable prompt file. Create .github/prompts/generate-tests.prompt.md with mode: agent frontmatter for on-demand invocation.
- Commit to version control.
Usage
With custom instructions active, Copilot Chat applies the workflow automatically when you ask about test generation.
Tips
- .github/copilot-instructions.md is always active — it’s global custom instructions for the project. Keep it focused on principles.
- Prompt files (.github/prompts/) are invoked on demand and support mode: agent for agentic execution.
- Copilot Chat accesses MCP tools through the VS Code MCP configuration. Ensure CoreStory tools appear in the available tools list.
Custom Instructions File
Save as .github/copilot-instructions.md (append to existing content if the file already exists):
Cursor
Setup
- Configure the CoreStory MCP server. Add it to your Cursor MCP configuration (.cursor/mcp.json or user settings). Verify by asking Cursor Chat: “List my CoreStory projects.”
- Add the project rule. Cursor stores rules in .cursor/rules/. Create .cursor/rules/generate-tests.mdc with the content from the rule file below.
- Commit to version control.
Usage
With alwaysApply: true, the rule activates automatically when Cursor detects test generation context, or you can trigger it explicitly.
Tips
- Cursor rules use the .mdc extension with YAML frontmatter containing description, globs, and alwaysApply.
- Set alwaysApply: true for rules that should always be active, or use globs to restrict to specific files.
- Rules apply in both Composer and Chat modes.
Project Rule
Save as .cursor/rules/generate-tests.mdc:
Factory.ai
Setup
- Configure the CoreStory MCP server in your Factory.ai environment. Verify with the /mcp command that CoreStory tools are accessible.
- Add the custom droid. Factory.ai stores droids in .factory/droids/ (project-level) or ~/.factory/droids/ (personal). Create .factory/droids/generate-tests.md with the content from the droid file below.
- Commit to version control.
Usage
Invoke the droid from your Factory.ai session.
Droid File
Save as .factory/droids/generate-tests.md: