Pillar 1: testing your AI code before shipping (TDD, unit tests, e2e)
AI generates code that works in the prompt. Not necessarily in production. Here's how to test code you didn't write: TDD, unit tests, integration, e2e.
The code that “works” but doesn’t work
Last week, I asked Claude to generate a parsing function for Stripe webhooks. The code was clean. Well-structured. TypeScript types were correct. I read through it, nodded, and merged.
Next day, first real payment. Crash.
The function handled the happy path perfectly: a well-formed checkout.session.completed event with every field filled in. But the first user who triggered an invoice.payment_failed with a null customer field blew up the entire chain.
The code worked. In the prompt. Not in production.
This is the fundamental trap of vibe coding. The AI optimizes to answer your question. Not to survive the real world. It generates code that satisfies the question you asked, not the questions you didn’t ask.
And that’s exactly why tests exist.
Why Robert C. Martin insists on testing
In The Clean Coder, Robert C. Martin lays down a simple rule: a professional developer does not ship code without tests. Period.
Not because it’s pretty. Not because it’s a trendy best practice. Because it’s the only way to know the code actually works.
He defines the 3 laws of TDD (Test-Driven Development):
- You shall not write production code without first writing a failing test.
- You shall not write more of a test than is sufficient to fail.
- You shall not write more production code than is sufficient to pass the test.
The red-green-refactor cycle. Write a failing test. Write the minimum code to make it pass. Clean up. Repeat.
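Here’s what one turn of that cycle looks like with Vitest (a minimal sketch; the slugify example is invented for illustration, not taken from any codebase):

```ts
import { it, expect } from 'vitest'

// Red: this test fails first, because slugify doesn't exist yet
it('turns a title into a URL slug', () => {
  expect(slugify('Hello World')).toBe('hello-world')
})

// Green: the minimum code that makes the test pass
function slugify(title: string): string {
  return title.toLowerCase().replace(/\s+/g, '-')
}

// Refactor: clean up with the test as a safety net, then repeat
```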
When you write code yourself, you have a mental model. You know which edge cases exist because you’ve thought about the problem. You know null can show up because you understand the upstream API.
When AI writes the code, you have none of that. You have code that looks correct. But you don’t have the mental model that comes with it.
Tests fill that gap. They’re the safety net between the code you didn’t write and the production environment where it’s going to run.
The 3 types of tests
Not all tests are equal. Each one catches a different type of bug. As a solo builder using AI, you need all three.
Unit tests
A unit test isolates a function and verifies it does what it claims to do. It’s the fastest, cheapest test, and the one AI writes best.
```ts
// Vitest
import { describe, it, expect } from 'vitest'
import { parseWebhookEvent } from './stripe'
// Fixture factory for fake Stripe events (lives with the test helpers)
import { mockCheckoutEvent } from './test-helpers'

describe('parseWebhookEvent', () => {
  it('should extract customer email from checkout event', () => {
    const event = mockCheckoutEvent({ email: 'user@test.com' })
    expect(parseWebhookEvent(event).email).toBe('user@test.com')
  })

  it('should throw on missing customer field', () => {
    const event = mockCheckoutEvent({ customer: null })
    expect(() => parseWebhookEvent(event)).toThrow('Missing customer')
  })

  it('should handle empty string email', () => {
    const event = mockCheckoutEvent({ email: '' })
    expect(parseWebhookEvent(event).email).toBeNull()
  })
})
```
Vitest or Jest for JavaScript/TypeScript. pytest for Python. Go ships with testing built in. The tool doesn’t matter. What matters is testing the edge cases.
The AI handles the happy path on its own. What it misses are the null values, the empty strings, the arrays with 10,000 elements, the Unicode characters that break things.
Unit tests catch logic errors. They run in milliseconds. You can have 500 of them and execute them all in 2 seconds.
Integration tests
The unit test says: “this function works alone.” The integration test says: “these functions work together.”
This is where the real bugs live. Function A returns a string. Function B expects a number. Each function passes its unit tests. Together, they explode.
```ts
import { describe, it, expect } from 'vitest'
// The Fastify app and DB client from the project under test
import { app } from './app'
import { db } from './db'
// Test helpers: build a Stripe webhook payload and its signature headers
import { buildStripeWebhook, stripeHeaders } from './test-helpers'

describe('Checkout flow', () => {
  it('should create order from Stripe webhook', async () => {
    const webhook = buildStripeWebhook('checkout.session.completed')

    // Fastify's inject() fires the request in-process, no socket needed
    const response = await app.inject({
      method: 'POST',
      url: '/webhooks/stripe',
      payload: webhook,
      headers: stripeHeaders(webhook),
    })
    expect(response.statusCode).toBe(200)

    const order = await db.orders.findFirst({
      where: { stripeSessionId: webhook.data.object.id },
    })
    expect(order).not.toBeNull()
    expect(order?.status).toBe('completed')
  })
})
```
Integration tests are slower. They need a test database, a running server. But they catch the bugs unit tests miss: serialization issues, connection errors, race conditions.
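That setup usually lives in one shared bootstrap file. A minimal sketch, assuming the same Fastify app and a db client whose migrate and disconnect helpers are hypothetical:

```ts
import { beforeAll, afterAll } from 'vitest'
import { app } from './app'
import { db } from './db'

beforeAll(async () => {
  await db.migrate()  // hypothetical helper: apply the schema to a dedicated test database
  await app.ready()   // boot Fastify in-process; inject() needs no open port
})

afterAll(async () => {
  await app.close()
  await db.disconnect() // hypothetical helper: release the connection pool
})
```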
When AI generates multiple modules that need to talk to each other, integration tests are your only guarantee the whole thing works.
End-to-end tests
The end-to-end test simulates a real user. It opens a browser, clicks buttons, fills forms, and checks the result is correct.
```ts
// Playwright
import { test, expect } from '@playwright/test'

test('user can complete checkout', async ({ page }) => {
  await page.goto('/products')
  await page.click('[data-testid="add-to-cart"]')
  await page.click('[data-testid="checkout"]')
  await page.fill('#email', 'test@example.com')
  await page.fill('#card', '4242424242424242')
  await page.click('#pay')
  await expect(page.locator('.confirmation')).toContainText('Thank you')
})
```
Playwright or Cypress. E2e tests are slow, from 10 seconds to several minutes per test. But they catch what humans actually experience. The button that doesn’t respond. The redirect that loops. The form that loses data after a refresh.
For a solo builder, 5 to 10 e2e tests on critical paths are enough. Login, main action, expected result. The path your users walk every day.
How to make AI write the tests FOR you
Here’s the irony: you can use AI to test AI code. And it works. If you know how to ask.
Naive prompt: “Write tests for this function.”
Result: the AI writes tests that pass. Always. Because it tests the happy path it coded itself. It’s like asking a student to grade their own exam.
Effective prompt: “Write tests that BREAK this function. Find the edge cases. What happens with null? With an empty string? With an array of 100,000 elements? With Unicode characters? With a connection that times out?”
Now the AI becomes deadly. It knows bug patterns. It knows JSON.parse can throw. It knows Array.prototype.find returns undefined. You just need to tell it to look for flaws instead of confirming everything works.
My 3-step strategy:
- I write the specs in plain language. “This function should parse a Stripe webhook, extract the customer’s email, and return null if the field is missing.”
- I ask the AI to generate the implementation.
- I ask the AI to generate adversarial tests, the ones that try to break the implementation.
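Step 3 is where this pays off. For the webhook parser from earlier, adversarial prompting produces tests in this spirit (a sketch, reusing the same hypothetical mockCheckoutEvent helper):

```ts
import { describe, it, expect } from 'vitest'
import { parseWebhookEvent } from './stripe'
import { mockCheckoutEvent } from './test-helpers'

describe('parseWebhookEvent (adversarial)', () => {
  it('survives a payload with no data object at all', () => {
    expect(() => parseWebhookEvent({} as any)).toThrow()
  })

  it('throws on an unexpected event type instead of guessing', () => {
    const event = mockCheckoutEvent({ type: 'invoice.payment_failed' })
    expect(() => parseWebhookEvent(event)).toThrow()
  })

  it('preserves Unicode characters in the email field', () => {
    const event = mockCheckoutEvent({ email: 'üser@tëst.com' })
    expect(parseWebhookEvent(event).email).toBe('üser@tëst.com')
  })
})
```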
If you’re using Claude Code as a copilot, you can take this further. Describe your testing conventions in your CLAUDE.md. List the recurring edge cases in your project. The AI will remember them with every generation.
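For example, the testing section of a CLAUDE.md could look like this (contents illustrative, adapt to your project):

```md
## Testing conventions
- Every exported function ships with Vitest tests covering null, empty string, and empty array inputs.
- Stripe webhook fields are always treated as optional: test the missing-field case first.
- Tests live next to the code: `foo.ts` → `foo.test.ts`. Run with `npx vitest run`.
```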
And with Claude Code hooks, you can even run tests automatically on every commit. The AI generates the code, the hooks run the tests, you only merge if it’s green.
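One common variant runs the suite every time the AI touches a file rather than at commit time. A sketch of that in .claude/settings.json, using a PostToolUse hook; treat the exact schema as an assumption and check the hooks documentation:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx vitest run --silent" }
        ]
      }
    ]
  }
}
```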
The “no merge without green” rule
The best test in the world is useless if nobody runs it.
The rule is simple: no code reaches production unless all tests are green. Not “most of them.” Not “the important ones.” All of them.
In practice:
- Pre-commit hooks that run unit tests before every commit. If a test fails, the commit is blocked. `husky` + `lint-staged` in JS, `pre-commit` in Python.
- CI pipeline that runs integration and e2e tests on every push (see the workflow sketch after this list). GitHub Actions, GitLab CI, doesn’t matter. If it’s red, the merge is blocked.
- Zero exceptions. The day you say “I’ll skip the tests just this once” is the day the bug ships to prod.
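The CI half of that list fits in one workflow file. A minimal GitHub Actions sketch (the npm scripts are assumptions, adapt to your stack):

```yaml
# .github/workflows/ci.yml
name: CI
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test   # unit + integration (Vitest)
      - run: npx playwright install --with-deps
      - run: npx playwright test   # e2e
```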
Tests are the first pillar. But running them manually doesn’t scale. That’s where automation comes in: CI/CD, monitoring, alerts that wake you up before your users do.
That’s the topic of the next article: CI/CD and monitoring for solo builders.
Summary
| Test type | What it catches | Tool | Setup time |
|---|---|---|---|
| Unit | Logic errors, edge cases | Vitest, Jest, pytest | 10 minutes |
| Integration | Cross-module issues, serialization, DB | Supertest, pytest + fixtures | 30 minutes |
| E2E | UX bugs, broken flows, visual regressions | Playwright, Cypress | 1-2 hours |
AI code isn’t magic. It’s code. And all code needs tests.
The difference from code you write yourself is that you lack the mental model that comes with writing it. You don’t know why the AI made a particular choice. You don’t know which shortcuts it took.
Tests fill that hole. They turn “it looks like it works” into “it works, and here’s the proof.”
This is the first pillar of the manifesto for shipping clean AI code. The other two follow. But without this one, nothing holds.
Pierre Rondeau
Developer and indie builder. I build products and automations with AI. Creator of Claude Hub.
LinkedIn