Let’s be clear upfront: we don’t just audit AI. We audit entire companies.

Engineering. Product. Process. Team structure. The actual shape of the codebase. All the usual suspects. You can read more about that on our audit services page. But lately, something’s shifted.

More and more companies show up with a pitch deck that says "AI-powered" in size 72 font. And when we look under the hood, we find classification models, pipelines, prompts, vector stores, agents, and a whole bunch of magic wires no one can really explain. So we’ve started digging deeper.

This post isn’t for AI researchers or prompt whisperers. It’s for founders, product leaders, and investors who want to understand what makes an AI system production-ready or dangerously duct-taped.

Where the AI lives

First question: What kind of AI are we dealing with? Classical machine learning? Prompting an LLM? Retrieval-augmented generation (RAG)? Agentic tool use? Each option brings a different flavour of complexity, and most of them will bite you somewhere eventually.

Then there’s the data foundation. Where does it come from? Is it cleaned? Is it labelled? Is anyone watching what goes in, or does the model just eat whatever it’s fed?

I’ve seen jobs that try to process thousands of documents through a single-threaded script, keeping a single HTTP connection open for 24 hours straight with no retries, no logging, no fallbacks. If it fails halfway, nobody knows which parts ran or what broke. If your product has pipelines or jobs, someone needs to own them. And if your system ever scales up, we want to know what infrastructure handles that load and how confident the team is that it won’t fall over.
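What we'd rather see is boring: work split into resumable units, with retries, backoff, logging, and a record of what already ran. Here's a minimal sketch, assuming a hypothetical process_document function doing the real work and a local checkpoint file standing in for a proper job queue:

```python
import json
import logging
import time
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("doc-pipeline")

CHECKPOINT = Path("processed_ids.jsonl")  # records which documents already succeeded

def already_processed() -> set[str]:
    if not CHECKPOINT.exists():
        return set()
    return {json.loads(line)["id"] for line in CHECKPOINT.read_text().splitlines()}

def mark_processed(doc_id: str) -> None:
    with CHECKPOINT.open("a") as f:
        f.write(json.dumps({"id": doc_id}) + "\n")

def process_document(doc_id: str) -> None:
    # Placeholder for the real work: one request per document,
    # not one connection held open for 24 hours.
    pass

def run(doc_ids: list[str], max_retries: int = 3) -> None:
    done = already_processed()
    for doc_id in doc_ids:
        if doc_id in done:
            continue  # resumable: skip work that already succeeded on a previous run
        for attempt in range(1, max_retries + 1):
            try:
                process_document(doc_id)
                mark_processed(doc_id)
                log.info("processed %s", doc_id)
                break
            except Exception:
                log.exception("attempt %d/%d failed for %s", attempt, max_retries, doc_id)
                time.sleep(2 ** attempt)  # back off before retrying
        else:
            log.error("giving up on %s after %d attempts", doc_id, max_retries)
```

Nothing clever, and that's the point: if the job dies at document 4,812, the logs say so and the next run picks up where it left off.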

One last tip: beware vendor lock-in. If every AI call is wired directly to OpenAI, switching to another provider later becomes a full rewrite. It’s not exciting work, but having an abstraction layer is what makes that kind of pivot even possible. And remember, you never really know where a vendor will go next in pricing, policy, or product direction.
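A minimal sketch of what that layer can look like (the class and function names here are ours, not any particular SDK's): the rest of the codebase talks to one interface, and the vendor lives behind it.

```python
from typing import Protocol

class ChatProvider(Protocol):
    """The only interface the rest of the codebase is allowed to talk to."""
    def complete(self, system: str, user: str) -> str: ...

class OpenAIChat:
    def complete(self, system: str, user: str) -> str:
        # Wrap the vendor SDK here, and only here. Nothing else imports it.
        raise NotImplementedError

class AnthropicChat:
    def complete(self, system: str, user: str) -> str:
        raise NotImplementedError

def get_provider(name: str) -> ChatProvider:
    # One config-driven switch instead of vendor calls scattered through the codebase.
    registry = {"openai": OpenAIChat, "anthropic": AnthropicChat}
    return registry[name]()
```

Switching providers then means writing one new adapter and flipping a config value, not hunting down every call site in the product.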

Who actually owns the AI part?

This is where it often gets messy. Is there a data scientist? Great. Where does their work live? In a Jupyter notebook? In production? Somewhere in between?

I often see notebooks floating in separate repos, not versioned, not reviewed, and definitely not built to run in production. What starts as experimentation becomes part of the product by accident. Code gets copied over. Scripts get triggered manually. No one really knows who owns what. And once things break, the team scrambles to debug a notebook that was never meant to run outside of someone’s laptop.

If developers write prompts, are those prompts reviewed? Versioned? Tested? If agents make decisions, is anyone watching what they actually do?

And then there’s the hiring side. If you’re building an AI product with zero AI experience in-house, we’ll ask who’s going to maintain all this, because it won’t maintain itself.


Can it evolve without breaking?

Prompts change. Embedding models get replaced. Models get retrained. We look for versioning, rollback paths, and migration plans. If someone updates a prompt and suddenly accuracy drops 20%, does anyone notice? And can they revert it?
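One low-tech pattern we like to see: prompts treated like code. The sketch below assumes prompts stored as versioned files in the repo; the specific layout doesn't matter, the point is that changes get reviewed and reverting is trivial.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Prompt:
    name: str
    version: str
    text: str

def load_prompt(name: str, version: str, root: Path = Path("prompts")) -> Prompt:
    # Prompts live in the repo as prompts/<name>/<version>.txt, so every change
    # goes through code review and rollback is an ordinary revert.
    text = (root / name / f"{version}.txt").read_text()
    return Prompt(name=name, version=version, text=text)

# Tag every model call and every logged response with prompt.version,
# so a sudden quality drop can be traced to the exact change that caused it.
```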

At one company, everything looked solid: 70% test coverage and well-documented pipelines. But none of the new AI features were tested. The AI code lived in a separate setup and didn’t fit their usual testing strategy. A small prompt change made responses sound friendlier but also less precise. The tone improved, but key details disappeared. The team only noticed when users started flagging vague or incomplete answers.

We don’t expect perfect processes. We expect signs that someone’s thinking ahead.

Will you know when it breaks?

Traditional systems break loudly. AI systems degrade quietly. Hallucinations increase. Token usage spikes. Output quality dips. Costs sneak up.

You can’t write a unit test that says "response must be exactly X", but you can still check for quality. We look for validation datasets, prompt evaluations, latency checks, and human feedback loops. We ask what’s being monitored. Are there metrics? Billing alerts? If the first time you notice a problem is when customers complain, it’s already too late.
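Here's a deliberately simple sketch of what "checking for quality" can mean in practice: a hand-built validation set and a pass rate. The questions, expected facts, and the generate callable are all placeholders, and real evaluations go well beyond substring checks, but even this would have caught the "friendlier but vaguer" regression above.

```python
import time

# A toy validation set; real ones are larger and domain-specific.
VALIDATION_SET = [
    {"question": "What is the refund window?", "must_include": ["30 days"]},
    {"question": "Which plan includes SSO?", "must_include": ["Enterprise"]},
]

def evaluate(generate, max_latency_s: float = 5.0) -> float:
    """Run `generate` over the validation set and return the pass rate."""
    passed = 0
    for case in VALIDATION_SET:
        start = time.monotonic()
        answer = generate(case["question"])
        latency = time.monotonic() - start
        has_facts = all(fact.lower() in answer.lower() for fact in case["must_include"])
        if has_facts and latency <= max_latency_s:
            passed += 1
    return passed / len(VALIDATION_SET)

# Run this on every prompt or model change, and fail the build
# if the pass rate drops below the previous baseline.
```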

And don’t forget: if your ingestion pipeline is letting garbage through, you don’t need to be an AI model to predict what comes out the other side. It’s going to be garbage. You can polish your prompts until they shine, but if you’re feeding the model 💩, the results are going to smell the same.

Are you leaking anything?

Sensitive data can appear in prompts, be inadvertently stored in embeddings, or end up recorded in logs. We ask whether anything private is being indexed, stored, or shipped off to a third party. We also ask whether your provider choices match your customers’ compliance needs. GDPR isn’t optional, and pointing at a big-name provider isn’t due diligence; it’s the due diligence you skipped.
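At minimum, we want to see scrubbing before text ever reaches a prompt, an index, or a log line. A rough sketch with two regex patterns; real deployments usually add a dedicated PII-detection step, so treat this as a floor, not a ceiling.

```python
import re

# Strip obvious identifiers before text reaches prompts, embeddings, or logs.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} redacted]", text)
    return text

print(redact("Contact jane.doe@example.com or +44 20 7946 0958"))
# -> Contact [email redacted] or [phone redacted]
```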

AI isn’t magic. But it can break in magical ways.

We don’t audit to play gotcha (and no, I’m not Sherlock Holmes). We audit to see whether the team is in control, or just putting twenty half-blind oracles in a room and hoping they agree on something.

If you’re building something AI-heavy, make sure someone’s asking the unglamorous questions. And if you want help doing that, well… you know where to find us. Because if the AI fails silently, your product fails with it, and by the time you notice, your customers already have.