An AI agent that misunderstands a requirement will write a perfect implementation of the wrong thing. The code will be clean. The tests will pass. CI will be green. Everything looks correct because the code and the tests agree with each other. They just don't agree with reality.

The mistake isn't in the code. It's in the test, because the test is where the agent wrote down what it thinks “correct” means.

The internally consistent mistake

When an agent builds a feature, it generates code and tests as a pair. The code implements the logic. The test asserts what the agent believes the outcome should be. If the agent got the requirement wrong, both are wrong in the same direction.

A late fee function that triggers after 30 days instead of 14. The test asserts expect(lateFee(31)).toBe(25). Green. The agent is confident. The business rule says 14 days, €12. Nobody told the agent that, or the ticket was ambiguous, or the spec didn't exist. The code is structurally flawless. The assumption underneath it is wrong.

The test encodes the assumption. The code just satisfies it.
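The pair is easy to sketch. Everything below is hypothetical, built from the example above: the function name, thresholds, and amounts are illustrative, not taken from any real codebase.

```javascript
// Hypothetical agent output: implementation and test, wrong in the same direction.
// The agent assumed a 30-day threshold and a 25-unit fee.
function lateFee(daysOverdue) {
  return daysOverdue > 30 ? 25 : 0;
}

// The test the agent wrote. It passes, because it encodes the same assumption.
console.assert(lateFee(31) === 25, "agent's test: fee applies after 30 days");

// The assertion the business rule actually calls for: 14 days, a 12-unit fee.
// Against this implementation it would fail. But nobody wrote it.
// console.assert(lateFee(15) === 12, "business rule: fee applies after 14 days");

console.log(lateFee(31), lateFee(15)); // 25 0
```

Both artifacts are internally consistent, which is exactly why a green test run tells you nothing about the commented-out line.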

Reading the implementation line by line won't surface this. The implementation will always be consistent with its own tests. The variable names will be good. The error handling will be there. That's the part AI rarely gets wrong. The question that matters is whether the test says the right thing.

Who should be reviewing

Traditional code review asks: is this code correct? That question made sense when humans wrote the code, because humans make structural mistakes. Forgot a null check. Off-by-one. Stale reference. Those are visible in the diff.

AI rarely makes those mistakes. What it gets wrong is intent. And intent is visible in the test assertions, not the implementation. expect(lateFee(31)).toBe(25) is a claim about the business. Evaluating that claim requires knowing the business.

That's not a programming skill. It's a domain knowledge skill. The people who can review AI-generated work aren't necessarily the strongest coders on the team. They're the ones who know why the billing threshold is 14 days and not 30. Product owners, domain experts, the senior who's been in the codebase long enough to know the history behind the rules.

This was already true before AI. Domain-blind reviewers have always rubber-stamped business logic they didn't understand. The difference is that AI has removed the other things those reviewers could usefully catch. Style, structure, formatting, patterns. All handled. What's left is the one thing that was always the hardest to review: does this match what the business actually needs?

The blind spot

Tests validate stated assumptions. They say nothing about unstated ones.

A test that checks the late fee calculation doesn't check whether the notification email fires, whether the fee shows up on the invoice, or whether the retry logic handles a payment gateway timeout. Integration behaviour, performance under load, security at the boundaries. These don't live in unit test assertions.
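A minimal sketch of that gap, with the same hypothetical names as before. The assertions that exist validate the stated rule; the comments list what has no assertion at all.

```javascript
// Hypothetical: the fee math is tested; the surrounding behaviour is not.
function lateFee(daysOverdue) {
  return daysOverdue > 14 ? 12 : 0;
}

// The assertions that exist. They validate the stated rule, and they pass.
console.assert(lateFee(15) === 12, "fee applies after 14 days");
console.assert(lateFee(14) === 0, "no fee at exactly 14 days");

// The assertions that don't exist, and that no green run will flag:
// - does the overdue notification email fire?
// - does the fee appear as a line item on the invoice?
// - what happens when the payment gateway times out?
```

Reading this test file tells you the rule the agent believed. Only someone who knows the feature can list what should be in the file and isn't.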

Reviewing the tests catches the wrong business rule. It doesn't catch the missing one. So this isn't “just read the tests and approve.” It's “start with the tests, because that's where the agent showed its hand.” Then ask what it didn't show you.

One question

Next time you open a PR from an AI agent, skip the implementation. Open the test file. Read the assertions.

Is this what the feature should actually do?

If you can answer that, the implementation barely matters. If you can't, you're not the right reviewer for this PR. And that's not a failure. It's the most useful signal code review has produced in years.