A large amount of AI product work right now happens in text. You type a query, the model responds, and the interface renders the result. Even the more capable implementations (agentic search, document analysis, code generation) are invisible at the surface. There is no personality, no identity, no presence, and no sense that something is communicating with your user rather than just outputting at them.

Voice is the next surface. Not as a gimmick, and not as a soulless, robotic read-aloud button bolted onto a chat interface, but as the primary modality. A user speaks, an intelligent system listens and responds, and the experience is designed around that loop from the start. ElevenLabs is one of the more visible players in this field, with Deliveroo and Revolut already running its agent product in production.

The voice gap in AI products

The reason most AI products don't have voice isn't capability. It's friction. Building a voice interaction that feels natural requires more than connecting a microphone to a model. You need speech-to-text that handles accents, background noise, and interruptions. You need text-to-speech that doesn't sound synthetic. And you need turn-taking, which is the part most teams underestimate. Natural conversation isn't a clean request-response cycle. People start speaking before you finish. They trail off mid-sentence. They say "um" and mean "keep going." Getting all of that working fluidly without anyone noticing the technology is the hard part.

That is the problem ElevenLabs has spent the last few years solving. They started as a text-to-speech company and got genuinely good at it, to the point where they are now the audio layer behind a meaningful slice of AI products. The two capabilities worth understanding for any team thinking about voice are voice cloning and conversational agents.

What voice cloning actually gives you

Voice cloning sounds more dramatic than it is. You upload audio samples of a voice, and ElevenLabs analyses its vocal characteristics: tone, pitch, accent, and rhythm. It then produces a model that can synthesise new speech from any text you give it.

There are two tiers, and they serve genuinely different purposes.

Instant Voice Cloning is designed for one to a few minutes of audio and produces a result within seconds, though shorter samples can work: I spoke for thirty seconds and had a working model in under a minute. The quality is good enough for demos and internal tools. The output wasn't indistinguishable from the original, but it was close enough to be useful, which is exactly what you want for those use cases.

Professional Voice Cloning is different. It requires 30 minutes to 2+ hours of clean, high-quality audio and produces something ElevenLabs describes as virtually indistinguishable from the original. This is the tier for production use, and the result is impressive. It was the first time I had experienced AI voice that sounded genuine.

Cloning requires explicit, verified consent from the voice owner. ElevenLabs enforces this at platform level. You record an authorisation phrase before training a Professional clone, and you cannot upload someone else's voice without their permission. This is a hard gate, not a checkbox. Plan for it before designing around any specific voice.

What an ElevenLabs agent does

ElevenAgents is the platform for building voice-powered conversational AI. The core proposition: your users speak, your AI speaks back. Not text in a chat window. Actual voice.

Underneath, four things work together: fine-tuned speech recognition, a configurable language model, text-to-speech drawn from a library of more than 5,000 voices across over 70 languages, and a proprietary turn-taking model that handles interruptions, silences, and overlap. Get the turn-taking right and people stop noticing the technology and just have a conversation. Get it wrong and every interaction feels off in a way users can't quite articulate.

You configure an agent through a dashboard. Every agent has a first message, a system prompt, and a choice of language model. You attach a knowledge base (documents, PDFs, URLs) and the agent uses RAG to pull in relevant information when it needs it. You pick a voice from the library or use a cloned one. Ready-made templates for common use cases (customer support, lead qualification, appointment booking) give you a pre-built starting point if you'd rather not begin from scratch.

Where things get more capable is in actions. Agents can do things during a conversation, not just talk. ElevenLabs exposes this through tools and integrations that hook into your frontend, your backend, and your existing stack: CRMs, calendars, payment systems, telephony, and ticketing. The agent can fetch a record, update an order, schedule a call, or hand off to a human without the user dropping out of the conversation.

By default, an agent just has a free-form conversation, which is good for Q&A and support. Workflows let you go further. You define a graph of conversation stages, where each node is a focused step with its own prompt, voice, and tools. Between nodes you set conditions in plain English: "user is on the Free plan", "user has expressed frustration." The model evaluates these in real time and routes accordingly, with no code required. The result is structured behaviour without hand-written conditional logic: a support flow that escalates when sentiment drops, or an onboarding flow that branches by plan type.
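To make the routing idea concrete, here is a minimal sketch of a workflow graph in Python. It is not the ElevenLabs API: `Node`, `next_node`, and `keyword_judge` are hypothetical names invented for illustration, and the keyword matcher is a stand-in for the language model that would judge each plain-English condition in a real agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """One conversation stage (hypothetical type, for illustration only)."""
    name: str
    prompt: str
    # Each edge pairs a plain-English condition with the next node's name.
    edges: List[Tuple[str, str]] = field(default_factory=list)


# A judge decides whether a condition holds for the transcript so far.
ConditionJudge = Callable[[str, str], bool]


def keyword_judge(condition: str, transcript: str) -> bool:
    """Toy stand-in for the model: matches hard-coded cue phrases."""
    cues = {
        "user is on the Free plan": ["free plan"],
        "user has expressed frustration": ["frustrated", "annoyed"],
    }
    text = transcript.lower()
    return any(cue in text for cue in cues.get(condition, []))


def next_node(graph: Dict[str, Node], current: str, transcript: str,
              judge: ConditionJudge = keyword_judge) -> str:
    """Follow the first edge whose condition holds; otherwise stay put."""
    for condition, target in graph[current].edges:
        if judge(condition, transcript):
            return target
    return current


graph = {
    "ask_plan": Node("ask_plan", "Ask which plan the user is on.", [
        ("user has expressed frustration", "escalate"),
        ("user is on the Free plan", "free_tips"),
    ]),
    "free_tips": Node("free_tips", "Share setup tips for the Free plan."),
    "escalate": Node("escalate", "Hand off to a human agent."),
}
```

Calling `next_node(graph, "ask_plan", "I'm on the free plan")` moves to `free_tips`, while an utterance that matches no condition leaves the conversation on the current node. Edge order matters in this sketch: frustration is checked first, so escalation wins when both conditions hold.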

To see how this plays out, I built a voice onboarding agent for a fictional B2B SaaS platform called NovaDash. The knowledge base covers plans, billing, integrations, and troubleshooting. The workflow has five stages: a welcome, collecting the user's name and role, asking which plan they're on, then branching into three different sets of tailored setup tips depending on whether they're on Free, Starter, or Pro/Enterprise, before wrapping up with support links. Building it from a blank template to a working demo took about 45 minutes.

The structure was straightforward. Getting one node to hold its ground until it had both a name and a role, rather than moving on too early, took some iteration. Voice also surfaces ambiguities that text doesn't: spoken email addresses, card numbers, and product names need explicit handling instructions, or you get inconsistent output, some of which can be unsafe. Deployment covers the obvious surfaces: a website widget, a React component, mobile SDKs for iOS and Android, and Twilio integration for telephony.
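Spoken email addresses are a good example of that ambiguity. The sketch below shows one way to picture the problem: a small post-processing step that turns a transcribed utterance into an address and rejects anything that doesn't look like one, so the agent can ask again. This is an assumption for illustration, not something the platform provides. In the agent itself the equivalent guidance would live in the prompt, and `normalise_spoken_email` is a hypothetical helper.

```python
import re
from typing import Optional

# Deliberately strict pattern: better to re-ask than to store a bad address.
EMAIL_PATTERN = re.compile(r"[a-z0-9._+-]+@[a-z0-9-]+(\.[a-z0-9-]+)+")


def normalise_spoken_email(utterance: str) -> Optional[str]:
    """Turn 'jane dot doe at example dot com' into 'jane.doe@example.com'.

    Returns None when the result does not look like an email address,
    which signals that the agent should ask the user to repeat it.
    """
    text = utterance.lower().strip()
    text = re.sub(r"\s+at\s+", "@", text)          # spoken "at" becomes "@"
    text = re.sub(r"\s+dot\s+", ".", text)         # spoken "dot" becomes "."
    text = re.sub(r"\s+underscore\s+", "_", text)  # spoken "underscore" becomes "_"
    text = re.sub(r"\s+", "", text)                # drop any leftover spaces
    return text if EMAIL_PATTERN.fullmatch(text) else None
```

A transcript such as "Jane dot Doe at example dot com" normalises cleanly, while a phrase with no recognisable address returns `None` rather than a guess.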

The catches

Latency matters more in voice than in text. A two-second delay in a conversation is noticeable in a way a two-second chat response isn't. Most use cases are fine; anything requiring genuinely real-time response needs benchmarking before you commit.
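A benchmark for this does not need to be elaborate. The sketch below times repeated calls to whatever function sends a prompt to your agent and reports median, 95th-percentile, and worst-case latency. `benchmark_latency` and `fake_agent` are hypothetical names, and the fake agent simply sleeps. For a real measurement you would swap in your own client call and, ideally, time the gap until the first audio arrives rather than the full response.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def benchmark_latency(respond: Callable[[str], object],
                      prompts: List[str],
                      runs: int = 5) -> Dict[str, float]:
    """Time `respond` over every prompt, `runs` times, in seconds."""
    samples: List[float] = []
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()
            respond(prompt)
            samples.append(time.perf_counter() - start)
    samples.sort()
    # Nearest-rank 95th percentile on the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "max_s": samples[-1],
    }


def fake_agent(prompt: str) -> str:
    """Placeholder for a real agent call; it only simulates a short delay."""
    time.sleep(0.002)
    return "ok"
```

The tail matters more than the average in conversation, which is why the sketch reports the 95th percentile and the maximum alongside the median.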

Probabilistic behaviour is the nature of language models, not a defect. Ask your agent the same question twice and you may get different answers. You cannot enforce behaviour the way you can in code; you write instructions that make certain responses more likely. The more precise the instructions, the more consistent the agent. Getting comfortable with that shift is part of building with this technology.
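One way to see how consistent a set of instructions is: ask the same question several times and measure how often the most common answer comes back. The sketch below assumes a hypothetical `ask` function that wraps your agent and returns its reply as text. Exact string matching is a crude comparison, so treat the number as a rough signal rather than a score.

```python
from collections import Counter
from typing import Callable, Tuple


def agreement_rate(ask: Callable[[str], str],
                   question: str,
                   runs: int = 10) -> Tuple[str, float]:
    """Ask the same question `runs` times.

    Returns the most common answer (case and surrounding whitespace
    ignored) and the fraction of runs that produced it.
    """
    answers = [ask(question).strip().lower() for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / runs
```

Rerunning this after each prompt revision gives a simple before-and-after comparison of how much tighter the instructions have made the agent.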

Knowledge base quality determines answer quality. RAG retrieves relevant chunks; it doesn't reason across complex questions. Scope the agent to what your documentation actually covers, and don't expect the model to fill the gaps.
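The sketch below illustrates the retrieval step in the simplest possible form. It is a toy: it scores chunks by word overlap, whereas production RAG systems typically use embeddings, and nothing here reflects how ElevenLabs implements its knowledge base. The `retrieve` function and the sample chunks are invented for illustration. What it does show is that when the documentation doesn't cover a question, retrieval returns nothing useful for the model to work from.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lower-case word set; a toy replacement for real embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: List[str],
             top_k: int = 2, min_overlap: int = 2) -> List[str]:
    """Return up to `top_k` chunks sharing at least `min_overlap` words."""
    question_words = tokens(question)
    scored = [(len(question_words & tokens(chunk)), chunk) for chunk in chunks]
    relevant = [pair for pair in scored if pair[0] >= min_overlap]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in relevant[:top_k]]


# Invented sample documentation for the fictional NovaDash product.
chunks = [
    "The Starter plan includes five dashboards and email support.",
    "Invoices are issued on the first day of each billing cycle.",
]
```

A question about the Starter plan retrieves the first chunk. A question about single sign-on, which neither chunk covers, retrieves nothing, and that is the case where an agent should say it doesn't know rather than improvise.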

Token costs multiply faster than expected. A voice agent runs speech recognition and speech synthesis on top of the language model, so running one at volume costs more than a text interface. Model the unit economics before you design for scale.
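A back-of-the-envelope model is enough to start. In the sketch below every rate is a placeholder to be replaced with figures from your own pricing plan. None of the numbers are ElevenLabs prices, and splitting the cost into a voice rate and a language model rate per minute is an assumption made for illustration.

```python
def monthly_voice_cost(conversations: int,
                       avg_minutes: float,
                       voice_rate_per_min: float,
                       llm_rate_per_min: float) -> float:
    """Estimated monthly spend: total minutes times the combined rate."""
    total_minutes = conversations * avg_minutes
    return total_minutes * (voice_rate_per_min + llm_rate_per_min)


def cost_per_conversation(avg_minutes: float,
                          voice_rate_per_min: float,
                          llm_rate_per_min: float) -> float:
    """Unit cost of a single conversation of average length."""
    return avg_minutes * (voice_rate_per_min + llm_rate_per_min)
```

With placeholder rates of 0.25 each, 1,000 four-minute conversations come to 2,000 currency units a month, or 2 per conversation. The useful exercise is comparing that unit cost against what a conversation is worth to you.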

The consent requirement for voice cloning is a hard gate, not a technicality. Cloning someone's voice requires documented permission. The platform enforces it. Plan for this before you design around a specific voice.

Where this leaves you

Voice is one of the few remaining ways to create a genuinely distinct product experience. Most AI products are converging on the same interface: text in, text out. A well-designed voice interaction, with a brand-specific voice and an agent that handles real queries well, is hard to copy.

You don't need an audio engineering team. You need a use case, a system prompt, good documentation, and a willingness to iterate. A first, simple demo takes about fifteen minutes to set up (the multi-stage NovaDash workflow took about 45), and that is the right first step. The bar for production quality in voice is higher than in text, but the path is clear.

The question is not whether voice will matter in AI products. It will. The question is whether your product will have a voice that belongs to it when it does.