I built a customer-support AI agent on top of an existing messaging app (WhatsApp linked to a CRM in this case, but it works with any similar setup), and instead of turning it on all at once, I rolled it out in stages. The goal was not to replace the customer-support team overnight. It was to create a system that could learn from real conversations, earn trust gradually, and only eventually take on more autonomous work.
The whole thing is thin middleware: roughly ten Supabase Edge Functions with no separate vector database, no separate application host, and no scheduler. A customer message comes in, gets embedded with Voyage AI, is matched against the knowledge base with pgvector cosine search, and is then passed to Claude Haiku 4.5 to generate a candidate reply. What happens next depends on the autonomy phase the system is currently in. Voyage's voyage-3-lite model is the embedding layer throughout, and its 512-dimensional output keeps retrieval fast and compact while still giving strong semantic matching for short support Q&As.
Progressive autonomy
The product shape is progressive autonomy: one database row controls which of three phases the system is in.
- In Phase 1, the AI generates a suggestion silently. The reply is stored in the database, but neither the customer nor the agent sees it. This is pure shadow mode, and its only purpose is evaluation.
- In Phase 2, the same suggestion is posted as an internal note, visible to the operator but not the customer. The agent can copy it, edit it, or ignore it, and each action becomes a learning signal.
- In Phase 3, the system is allowed to auto-send replies to customers. That switch is only flipped once the Phase 2 metrics show that the agent is accurate enough to trust.
That gradual rollout matters because it makes autonomy something the system has to earn. Going straight to auto-send would have been a credibility bomb.
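A minimal sketch of how a single phase value could gate what happens to a draft. The `Phase`, `Action`, and `route_draft` names are hypothetical, and keeping the stored draft in every phase is an assumption (the text only states it for Phase 1).

```python
from dataclasses import dataclass
from enum import IntEnum


class Phase(IntEnum):
    SHADOW = 1     # store the draft only
    SUGGEST = 2    # post the draft as an internal note
    AUTO_SEND = 3  # send the draft to the customer


@dataclass
class Action:
    store_draft: bool
    post_internal_note: bool
    send_to_customer: bool


def route_draft(phase: Phase) -> Action:
    """Decide what happens to a generated draft in the given phase."""
    # Assumption: the draft is stored in every phase so each one stays measurable.
    if phase == Phase.SHADOW:
        return Action(store_draft=True, post_internal_note=False, send_to_customer=False)
    if phase == Phase.SUGGEST:
        return Action(store_draft=True, post_internal_note=True, send_to_customer=False)
    if phase == Phase.AUTO_SEND:
        return Action(store_draft=True, post_internal_note=False, send_to_customer=True)
    raise ValueError(f"unknown phase: {phase!r}")
```

Because the phase is read from one place, moving from suggestion to auto-send is a data change rather than a deploy.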
Thin middleware by design
The architecture was intentionally kept boring. An incoming customer message is embedded with Voyage AI, routed through pgvector search over the knowledge base, and then passed to Claude Haiku 4.5 along with the retrieved examples. The model generates a reply using those nearby Q&A pairs as few-shot context, and the system either stores the draft, posts it as an internal note, or sends it to the customer, depending on the current phase. Anthropic's Haiku 4.5 pricing fits the cost profile of a high-volume support workflow.
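To make "few-shot context" concrete, here is a rough sketch of how retrieved Q&A pairs might be folded into a prompt. The `build_prompt` function, the instruction wording, and the layout are all assumptions for illustration, not the prompt the system actually uses.

```python
def build_prompt(customer_message: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot prompt from retrieved (question, answer) pairs."""
    parts = [
        "Answer the customer using the style and facts of these past exchanges.",
        "",
    ]
    for i, (question, answer) in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Customer: {question}")
        parts.append(f"Agent: {answer}")
        parts.append("")
    # The new message goes last so the model completes the agent turn.
    parts.append(f"Customer: {customer_message}")
    parts.append("Agent:")
    return "\n".join(parts)
```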
The point of keeping the stack this small was not elegance for its own sake. Support systems need to fail visibly, stay debuggable, and avoid hidden complexity. The hard problems were never the model calls themselves. They were the integration edges around them.
Voyage as the retrieval layer
Voyage AI turned out to be a core part of the system rather than a supporting detail. I use Voyage for every embedding in the workflow. Each knowledge-base row gets a Q+A embedding for retrieval and a Q-only embedding for deduplication, while every incoming customer message is embedded as a query before searching for nearest neighbours. The voyage-3-lite model was a strong fit because it is optimised for latency and cost, and it produces 512-dimensional vectors by default, which keeps vector search compact while still being semantically useful for short support exchanges.
I also found that using both of Voyage's input_type modes mattered. "document" is meant for ingest-time embeddings and "query" for retrieval-time embeddings, and separating the two gave a measurable bump in recall. In practice, that meant the system was more likely to surface the right answer instead of a near miss. Voyage's role is invisible to the end user, but it is the lever that decides whether retrieval finds the actual fix or just something vaguely related.
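A minimal sketch of keeping the two modes apart when building request bodies for Voyage's REST embeddings endpoint. The helper names and the way the question and answer are joined are assumptions; nothing is sent over the network here.

```python
VOYAGE_EMBEDDINGS_URL = "https://api.voyageai.com/v1/embeddings"
MODEL = "voyage-3-lite"


def embedding_request(texts: list[str], input_type: str) -> dict:
    """Build the JSON body for one embeddings call."""
    if input_type not in ("document", "query"):
        raise ValueError(f"unexpected input_type: {input_type!r}")
    return {"input": texts, "model": MODEL, "input_type": input_type}


def kb_row_requests(question: str, answer: str) -> tuple[dict, dict]:
    """Ingest time: one Q+A body for retrieval, one Q-only body for dedup."""
    # Assumption: question and answer are joined with a newline.
    qa = embedding_request([f"{question}\n{answer}"], "document")
    q_only = embedding_request([question], "document")
    return qa, q_only


def query_request(customer_message: str) -> dict:
    """Retrieval time: incoming messages are embedded as queries."""
    return embedding_request([customer_message], "query")
```

Each body would be POSTed to `VOYAGE_EMBEDDINGS_URL` with a bearer token in the `Authorization` header.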
The knowledge base as memory
The knowledge base is the system's memory, and it was the hardest part to get right.
I store two embeddings per row because retrieval and deduplication care about different signals. The Q+A embedding helps retrieval because answer context disambiguates similar questions, while the Q-only embedding helps deduplication because the same question can be answered in slightly different ways by different agents. When a new pair arrives, the ingest step compares it against the nearest existing row: a very high similarity means it should merge into the canonical record, a slightly lower match means it should link through parent_id, and anything else becomes a fresh insert. Soft-delete preserves the audit trail, which turned out to be more valuable than trying to keep the table perfectly clean.
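The merge, link, or insert decision can be sketched as a two-threshold check on Q-only cosine similarity. The threshold values and function names below are hypothetical; the text does not give the real cut-offs.

```python
import math
from typing import Optional

MERGE_THRESHOLD = 0.95  # hypothetical: near-identical question
LINK_THRESHOLD = 0.85   # hypothetical: same topic, different wording


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_action(new_q_vec: list[float], nearest_q_vec: Optional[list[float]]) -> str:
    """Return 'merge', 'link' (via parent_id), or 'insert' for a new pair."""
    if nearest_q_vec is None:  # empty knowledge base
        return "insert"
    similarity = cosine_similarity(new_q_vec, nearest_q_vec)
    if similarity >= MERGE_THRESHOLD:
        return "merge"
    if similarity >= LINK_THRESHOLD:
        return "link"
    return "insert"
```

In the real system the nearest row would come from a pgvector query rather than an in-memory comparison.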
I also weight sources differently. Learned data from live conversations gets the lowest weight, imported historical material sits in the middle, and curated management FAQ content gets the highest weight. That gives the system room to learn from reality without letting noisy history overpower carefully curated answers.
Noise filtering is another piece that sounds minor until you skip it. Widget boilerplate, emoji-only replies, and single-word pings never enter the KB. If they do, retrieval quality collapses quickly.
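A rough sketch of such a filter, run before anything is embedded. The boilerplate phrase list is a hypothetical placeholder, and the single-word rule assumes whitespace-separated languages.

```python
# Hypothetical examples of widget boilerplate; a real list would be collected
# from the messaging platform's own templates.
BOILERPLATE_PHRASES = ("sent from the chat widget",)


def is_noise(text: str) -> bool:
    """Return True for messages that should never enter the knowledge base."""
    stripped = text.strip()
    if not stripped:
        return True
    lowered = stripped.lower()
    if any(phrase in lowered for phrase in BOILERPLATE_PHRASES):
        return True
    # Emoji-only or punctuation-only: no letters or digits anywhere.
    if not any(ch.isalnum() for ch in stripped):
        return True
    # Single-word pings such as "ok" or "hello".
    if len(stripped.split()) < 2:
        return True
    return False
```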
I also reinforce successful resolutions. When a conversation is marked resolved, every KB row that contributed to it gets a small weight bump. The logic is simple: if an answer helped solve a real customer problem, the system should see that as evidence that the answer is trustworthy.
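Source weighting and reinforcement can be sketched together. The ordering (curated above imported above learned), the numeric weights, the bump size, the cap, and the idea of multiplying similarity by weight are all assumptions for illustration.

```python
from dataclasses import dataclass

SOURCE_WEIGHT = {"curated": 1.0, "imported": 0.8, "learned": 0.6}  # hypothetical
RESOLUTION_BUMP = 0.02  # hypothetical size of the "small weight bump"
MAX_WEIGHT = 1.5        # hypothetical cap so no row can dominate forever


@dataclass
class KbRow:
    id: int
    source: str
    weight: float


def new_row(row_id: int, source: str) -> KbRow:
    """Create a row with the starting weight for its source."""
    return KbRow(id=row_id, source=source, weight=SOURCE_WEIGHT[source])


def rank_score(similarity: float, row: KbRow) -> float:
    """Assumed ranking rule: cosine similarity scaled by the row's weight."""
    return similarity * row.weight


def reinforce(rows: list[KbRow]) -> None:
    """Bump every row that contributed to a resolved conversation."""
    for row in rows:
        row.weight = min(row.weight + RESOLUTION_BUMP, MAX_WEIGHT)
```

With these numbers a curated row at 0.8 similarity outranks a learned row at 0.9, and a learned row can close the gap by repeatedly contributing to resolved conversations.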
Human feedback as training signal
Phase 2 is where the learning loop becomes visible. When a customer-support agent replies, I compare their response to the AI suggestion using Dice-coefficient bigram similarity. If the match is very high, I treat it as an accept signal. If it lands in the middle, it is an edit signal, which is actually the most valuable correction because it says the model was close but needed refinement. If the similarity is low, I mark it as ignored.
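A minimal sketch of the comparison, using character bigrams and two cut-offs. Whether the original works on character or word bigrams is not stated, and the threshold values are hypothetical.

```python
from collections import Counter

ACCEPT_THRESHOLD = 0.8  # hypothetical
EDIT_THRESHOLD = 0.4    # hypothetical


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences do not count.
    return " ".join(text.lower().split())


def _bigrams(text: str) -> Counter:
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams: 2 * |A & B| / (|A| + |B|)."""
    norm_a, norm_b = _normalize(a), _normalize(b)
    grams_a, grams_b = _bigrams(norm_a), _bigrams(norm_b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:  # both strings are shorter than two characters
        return 1.0 if norm_a == norm_b else 0.0
    overlap = sum((grams_a & grams_b).values())
    return 2 * overlap / total


def classify_reply(suggestion: str, human_reply: str) -> str:
    """Map similarity to an 'accept', 'edit', or 'ignored' signal."""
    score = dice_similarity(suggestion, human_reply)
    if score >= ACCEPT_THRESHOLD:
        return "accept"
    if score >= EDIT_THRESHOLD:
        return "edit"
    return "ignored"
```

For example, `dice_similarity("night", "nacht")` is 0.25: the two words share one bigram ("ht") out of eight in total.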
The important nuance is that I still learn from all three outcomes. The difference is in how much credit I assign. An ignored suggestion might mean the model was wrong, the internal note arrived too late, or the human simply chose to handle the issue directly instead of using the suggestion. That ambiguity is part of the system, so I designed for it instead of trying to remove it.
Operational scaffolding
The operational layer matters more than people usually admit. Two SQL views, ai_performance_daily and ai_performance_lifetime, expose accept rate, edit rate, ignored rate, KB growth per day, and token cost in a single query. Watching kb_added_learned decline over time is one of the clearest signs that the system is getting smarter, because it means fewer conversations are producing answers the KB does not already hold.
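The views themselves are SQL, but the rate calculation behind them is simple enough to sketch in a few lines. The `outcome_rates` function and its label strings are hypothetical stand-ins for whatever the views compute.

```python
from collections import Counter

OUTCOMES = ("accept", "edit", "ignored")


def outcome_rates(outcomes: list[str]) -> dict[str, float]:
    """Share of suggestions that were accepted, edited, or ignored."""
    total = len(outcomes)
    if total == 0:
        return {name: 0.0 for name in OUTCOMES}
    counts = Counter(outcomes)
    return {name: counts[name] / total for name in OUTCOMES}
```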
Every state transition is written to audit_log with a machine-readable event_type and a JSON payload, which means debugging is a SQL query instead of a log scrape. Every script that touches a third-party API also carries a checkpoint file and a --resume flag, so long jobs can survive crashes and rate limits without starting over.
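A minimal sketch of the checkpoint-and-resume pattern. The JSON layout, the `--checkpoint` option, and the function names are assumptions; only the `--resume` flag comes from the description above.

```python
import argparse
import json
import os
from pathlib import Path


def load_checkpoint(path: Path) -> int:
    """Return the index of the next unprocessed item, or 0 if none is saved."""
    if not path.exists():
        return 0
    return json.loads(path.read_text())["next_index"]


def save_checkpoint(path: Path, next_index: int) -> None:
    # Write to a temp file and rename it, so a crash never leaves a
    # half-written checkpoint behind.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps({"next_index": next_index}))
    os.replace(tmp, path)


def run_job(items: list, process, checkpoint: Path, resume: bool) -> int:
    """Process items in order; return how many were handled in this run."""
    start = load_checkpoint(checkpoint) if resume else 0
    for index in range(start, len(items)):
        process(items[index])                   # e.g. one third-party API call
        save_checkpoint(checkpoint, index + 1)  # record progress after success
    return max(len(items) - start, 0)


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--checkpoint", type=Path, default=Path("checkpoint.json"))
    return parser.parse_args(argv)
```

Saving after each item trades a little I/O for never repeating a paid API call. An item that fails mid-call is retried on the next run.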
What it cost
The build took about two weeks of focused work. Most of that time was spent on the edges: webhook behaviour, Voyage's limits, and the practical cleanup that comes with wiring real systems together.
Operating cost is low (less than $10/month).
Claude Haiku 4.5 comes in at roughly a dollar per million input tokens and five dollars per million output tokens, which makes it a reasonable choice for support workloads where latency and cost both matter. Supabase and Voyage usage are negligible by comparison, especially once the embedding layer is compact and the retrieval path is tuned.
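The arithmetic is easy to check. The sketch below uses the per-token prices quoted above. The message volume and token counts in the example are hypothetical, chosen only to show the order of magnitude.

```python
INPUT_USD_PER_MTOK = 1.00   # price per million input tokens, as quoted above
OUTPUT_USD_PER_MTOK = 5.00  # price per million output tokens, as quoted above


def monthly_model_cost(replies: int, input_tokens_per_reply: int,
                       output_tokens_per_reply: int) -> float:
    """Estimated model spend in USD for one month of generated replies."""
    input_tokens = replies * input_tokens_per_reply
    output_tokens = replies * output_tokens_per_reply
    input_cost = input_tokens * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MTOK / 1_000_000
    return input_cost + output_cost
```

With a hypothetical 3,000 replies per month, 1,500 input tokens each (message plus retrieved examples), and 200 output tokens each, the estimate is $4.50 for input plus $3.00 for output, or $7.50 in total.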
Lessons that held up
A few lessons will probably outlive the codebase.
- A three-phase rollout is the right way to introduce an AI agent into a support workflow.
- Retrieval and deduplication are different problems, so they deserve different embeddings.
- Soft-delete beats hard-delete almost everywhere if you expect to debug the system later. I say this as someone who normally prefers hard-delete.
- The AI-human race condition, where a suggestion lands after the agent has already started replying, is not theoretical; it shapes the product.
- Most of the hard work was not the model itself, but the integration edges around it.
That is what this project became: not a chatbot, but a way to introduce autonomy carefully, measure it honestly, and let the system learn from the same workflows humans already trust.