Part 1 of a series. In this post we’ll map out what it actually takes to put a real-time AI voice agent behind a phone number using FreeSWITCH — the moving parts, the integration options, the latency budget, and the honest cost numbers. Aside from a few illustrative sketches, the real code starts in Part 2.
Everyone is talking about voice AI agents right now, and it seems like everyone in the business has one to sell you. New companies are popping up almost daily, and some of them are shipping genuinely good products. What almost nobody is doing is showing you how to build one from scratch.
A web demo is easy. A real-time voice agent is a different problem entirely.
To do it properly you need SIP signaling, live bidirectional audio streaming, speech-to-text, an LLM, text-to-speech, turn-taking, barge-in handling, observability, and a latency budget tight enough that the caller doesn’t feel like they’re talking to a fax machine from 1998. Any one of those pieces is tractable. Getting all of them working together, under a second of round-trip delay, over a real phone line, is where the tutorials stop and the hand-waving begins.
This series is about closing that gap. Over the next several posts, I’m going to build a self-hosted voice AI agent on top of FreeSWITCH from the ground up — real SIP, real audio, real providers, real production concerns. This first post is the map: what the system looks like, what the moving parts are, where the latency goes, and what it actually costs to run. The build starts in Part 2.
1. The Problem, Stated Precisely
A phone call is not a text conversation with some audio bolted on.
It’s a continuous, bidirectional, real-time audio stream. Audio moves in both directions simultaneously, usually in small packets every 20 milliseconds. With codecs like G.711 or Opus, that’s roughly 50 packets per second flowing from the caller to the system, and another 50 flowing back.
That stream does not pause politely while your application thinks.
The caller speaks. The system has to listen. The caller stops. The system has to decide whether they’re actually done, or just taking a breath. Then the AI has to generate a response, synthesize it into audio, and start playing it back quickly enough that the caller still feels like they’re in a conversation.
That’s the real problem.
An AI voice agent has to fit inside the rhythm of a phone call without breaking the illusion of a natural conversation. It has to hear the caller while they’re still speaking, which means streaming speech-to-text, not uploading a recording after the fact. It has to decide when the caller has finished a thought, which means endpoint detection or voice activity detection, usually some combination of both. It has to generate a response fast enough that the silence doesn’t feel awkward. And it has to speak back in a way that can be interrupted, which is usually called barge-in.
This is where voice AI becomes very different from a chatbot.
In a chatbot, a user types a message, hits enter, and waits. If the response takes two or three seconds, it’s slightly annoying, but it doesn’t destroy the experience. The interaction is already asynchronous.
A phone call is not. Humans are extremely sensitive to timing in speech. In natural conversation, the gap between speakers is often only a few hundred milliseconds. By 800ms, the silence starts to feel noticeable. By 1500ms, the system feels broken. Much beyond that, callers start talking over the agent, repeating themselves, or hanging up.
So the latency budget matters.
For a voice AI agent to feel natural, the round trip from:
caller stops speaking → AI response audio starts playing
needs to land under roughly 800ms whenever possible. Around 1500ms is the upper edge of what I’d consider usable. Above that, you’re no longer building a conversational voice agent. You’re building an IVR that occasionally uses an LLM.
That distinction changes how you design the entire system. You can’t treat the call as a recording. You can’t wait until the caller is completely done, send a big audio blob to transcription, pass the final text to an LLM, wait for a full paragraph back, synthesize the whole thing, and play it.
That architecture works for voicemail transcription.
It does not work for a real-time phone agent.
A real voice AI system streams audio in, streams intelligence through the pipeline, and streams audio back out — all under a second, all without breaking the rhythm of the call.
Let’s look at the components.
2. The Components
At a high level, a real-time voice AI agent has five moving pieces:
- The telephony engine (FreeSWITCH)
- The audio bridge
- Speech-to-text (STT)
- The language model (LLM)
- Text-to-speech (TTS)
You can make the architecture more complicated than that, and eventually you probably will, but this is the core loop. Audio comes in from the phone call, gets turned into text, gets passed to a model, gets turned back into audio, and gets played back to the caller.
The challenge is doing all of that fast enough to preserve the rhythm of a live conversation.
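To make the shape of that loop concrete, here is a sketch reduced to stubs. Every function in it is a placeholder for one of the components described below, not a real API:

```python
# The core loop, reduced to stubs so the shape is visible.
# Every function here is a stand-in, not a real provider API.
def stt(frame):           # speech-to-text: audio frame -> maybe a transcript
    return "book me a table for two" if frame == b"<last>" else None

def llm(text):            # language model: transcript -> reply text
    return f"Sure, I can help with: {text}"

def tts(text):            # text-to-speech: reply text -> audio bytes
    return f"<pcm for '{text}'>".encode()

def play(audio):          # media injection back into the call
    print("caller hears:", audio)

for frame in [b"<frame>", b"<frame>", b"<last>"]:   # 20 ms frames from FreeSWITCH
    if (transcript := stt(frame)):
        play(tts(llm(transcript)))
```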
The Telephony Engine: FreeSWITCH
In this series, we will use FreeSWITCH as the telephony engine. It handles the SIP signaling, RTP media, codec negotiation, dialplan logic, and whatever PSTN or SIP trunk interconnect you are using to get calls in and out of the system.
In this architecture, FreeSWITCH owns the phone call.
That means it answers the call, negotiates media, anchors the RTP, and gives you programmable hooks into the call through the dialplan, the Event Socket Library (ESL), and the module system. It is very good at being the thing in the middle of a real-time voice session.
FreeSWITCH does not care about AI.
It does not know what an LLM is. It does not care whether your speech-to-text provider is Deepgram, Google, Whisper, or something running locally on a GPU in your basement. It cares about getting audio in and out of the call in real time.
Everything AI-related happens outside of FreeSWITCH, in a process you write that talks to FreeSWITCH.
That separation is important. FreeSWITCH is the call control and media engine. Your application is the intelligence layer.
The Audio Bridge: The Hard Part
The audio bridge is the piece that connects the phone call to your AI pipeline.
It takes media from the FreeSWITCH call leg and streams it somewhere you control, usually a WebSocket server running in your own application. It also has to take audio generated by your system and get it back into the call so the caller can hear the agent respond.
This is where the architecture stops being theoretical.
You need to decide how audio leaves FreeSWITCH, what format it is in, how it gets transported, how you handle timing, and how generated audio gets injected back into the live call. You also need to think about what happens when the caller interrupts, when your WebSocket server disconnects, or when your AI pipeline falls behind.
This is the part most tutorials hand-wave.
They show a nice diagram with “audio stream” in the middle, then skip directly to the AI response. But that middle piece is the system. If you cannot reliably move live audio out of the call and back into the call, you do not have a voice agent. You have a diagram.
The next section is where we get specific.
Speech-to-Text: Giving the System Ears
Speech-to-text is what gives the agent ears.
It receives the caller’s audio as a stream and emits transcripts while the caller is still speaking. In a real-time voice agent, you do not want to wait for the call to end, save a recording, upload it somewhere, and then transcribe it. That works for voicemail. It does not work for conversation.
For this kind of system, you want streaming STT.
A good streaming STT engine gives you interim transcripts as words are being recognized, and final transcripts when it believes the caller has completed a phrase or thought. Current strong options include Deepgram, AssemblyAI, Google Cloud Speech-to-Text, and local Whisper-based options like faster-whisper or whisper.cpp if you need to self-host. Deepgram is often a practical default for low-latency commercial STT, while local Whisper variants are attractive when data control or cost is more important than managed-service simplicity.
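As a taste of what Part 3 builds properly, here is a hedged sketch of streaming STT against Deepgram’s live WebSocket API. The query parameters and event fields follow Deepgram’s published schema at the time of writing, and the header argument name varies across websockets releases (older versions use extra_headers), so treat the details as assumptions to verify:

```python
import asyncio, json
import websockets  # pip install websockets

# Deepgram live API sketch; field names and the header argument are
# assumptions to check against current docs before relying on this.
DG_URL = ("wss://api.deepgram.com/v1/listen"
          "?encoding=linear16&sample_rate=8000&interim_results=true")

async def transcribe(pcm_frames, api_key):
    headers = {"Authorization": f"Token {api_key}"}
    async with websockets.connect(DG_URL, additional_headers=headers) as dg:
        async def feed():
            for frame in pcm_frames:      # 20 ms PCM frames from the call
                await dg.send(frame)
        asyncio.create_task(feed())
        async for msg in dg:
            event = json.loads(msg)
            if event.get("type") != "Results":
                continue                  # skip metadata events
            text = event["channel"]["alternatives"][0]["transcript"]
            kind = "final" if event.get("is_final") else "interim"
            print(kind, text)
```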
The non-obvious part is that transcription accuracy is only half the problem. The system also has to decide when the caller is done speaking. If it decides too early, the agent interrupts the caller. If it waits too long, the call feels slow. This is where voice activity detection, endpointing, interim transcripts, final transcripts, and conversational timing all start to overlap.
Bad STT does not just mishear words.
Bad STT breaks the rhythm of the call.
The Language Model: The Brain
The language model is the part most people think of first, but in a real-time voice system, it is only one piece of the loop.
The LLM takes the transcript, conversation history, system prompt, and any application context you provide, then produces the agent’s response. That response might be a simple answer, a routing decision, a tool call, a database lookup, or a handoff to a human.
You can use hosted models like OpenAI, Anthropic, Google, or others. You can also run self-hosted models like Llama or Mistral if you need more control over data, cost, or deployment. The exact model matters, but the architecture should not depend on one specific provider.
Here’s where most implementations leave latency on the table: you usually do not want to wait for the LLM to finish the entire response.
You want streaming output.
If the model starts generating, “Sure, I can help you with that,” you should be able to start sending that first sentence toward TTS while the model is still generating the rest of the response. Blocking until the full response is complete can add hundreds of milliseconds, sometimes more than a second, of pure waste to your latency budget.
In a chatbot, that may not matter.
On a phone call, it matters a lot.
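Here is the kind of glue that makes streaming output pay off: a small, self-contained chunker that splits a stream of LLM tokens at sentence boundaries, so each sentence can head toward TTS while the model keeps generating. The regex is deliberately naive; a real system also handles abbreviations, numbers, and the like:

```python
import re

# Split a streamed token iterator into sentences so each sentence can be
# sent to TTS while the LLM is still generating the rest of the response.
SENTENCE_END = re.compile(r"[.!?]\s")

def sentences(token_stream):
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()       # flush whatever remains at end of stream

# Usage: feed tokens as they arrive from the model.
tokens = iter(["Sure, I can ", "help you with ", "that. What time ", "works for you?"])
for s in sentences(tokens):
    print("-> TTS:", s)         # in the real system, this goes to the TTS stream
```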
Text-to-Speech: Giving the System a Voice
Text-to-speech is what gives the agent a voice.
It takes the LLM’s response and turns it into audio that can be played back to the caller. For a real-time voice agent, you want TTS that can produce audio quickly, sound natural enough for a phone call, and ideally stream audio as it is being generated.
Current strong options include ElevenLabs, Cartesia Sonic, OpenAI TTS, and local engines like Piper when you want something free, fast, and self-hosted. Cartesia’s Sonic models and ElevenLabs’ Turbo line are both built specifically for low-latency streaming.
One detail that’s easy to miss: there are really two kinds of streaming you care about.
The first is streaming output: the TTS provider starts returning audio before the full response has been synthesized.
The second is streaming input: you can feed text into the TTS engine as the LLM produces it, instead of waiting for the full LLM response.
You want both.
That is how you avoid wasting time between the LLM and the caller. The LLM starts producing text, the TTS engine starts producing audio, and FreeSWITCH starts playing that audio back into the call.
That is the loop.
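A minimal asyncio sketch of that overlap, with llm_sentences and tts_chunks as hypothetical stand-ins for real provider SDK calls:

```python
import asyncio

# Both generators below fake their latency with sleeps; in the real system
# they wrap streaming LLM and streaming TTS provider calls.
async def llm_sentences():
    for s in ["Sure, I can help you with that.", "What time works for you?"]:
        await asyncio.sleep(0.2)          # pretend the model is still generating
        yield s

async def tts_chunks(sentence):
    for i in range(3):
        await asyncio.sleep(0.05)         # pretend synthesis is streaming back
        yield f"<pcm chunk {i} for: {sentence[:16]}>".encode()

async def speak():
    async for sentence in llm_sentences():        # streaming input to TTS
        async for chunk in tts_chunks(sentence):  # streaming output from TTS
            print("play", chunk)                  # goes back into the call

asyncio.run(speak())
```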
The pipeline, not the monolith
The important thing to notice is that every one of these components can be swapped independently.
FreeSWITCH can stay the call engine while you change STT providers. Deepgram can be replaced with Whisper. ElevenLabs can be replaced with Cartesia. A hosted LLM can be replaced with a self-hosted model. Your WebSocket audio bridge can evolve without rewriting the entire system.
That is the value of treating this as a pipeline instead of a monolith.
You are not building “a Deepgram bot” or “an OpenAI bot” or “an ElevenLabs bot.” You are building a real-time voice system where each component has a clear job: FreeSWITCH handles the call, the audio bridge moves media, STT turns speech into text, the LLM decides what to say, and TTS turns the response back into audio.
Once you see the system that way, the architecture gets a lot easier to reason about.
Now we can look at the part that makes or breaks the whole thing: how to actually get audio out of FreeSWITCH and back into the call.
3. The Three Integration Paths
FreeSWITCH gives you several ways to get call audio out to an external process — but they are not equally good for real-time AI.
For a voice AI agent, we don’t just need access to a recording after the call. We need live audio while the caller is speaking, and we need a way to inject generated audio back into the same call quickly enough that the interaction still feels conversational.
There are three realistic paths for doing that with FreeSWITCH, plus a fourth option that’s worth knowing about. For the rest of this series, we’ll use a hybrid approach: mod_audio_stream for caller audio in, and uuid_broadcast for agent audio out. The reason for the hybrid is worth understanding before we go any further.
A note on the module landscape
If you’ve read other tutorials about FreeSWITCH and WebSockets, you’ve almost certainly seen references to mod_audio_fork, originally from the drachtio/drachtio-freeswitch-modules repo.
That repo is gone. It returns a 404. The forks that remain are unmaintained, and mod_audio_fork itself depended on a custom-compiled FreeSWITCH with libwebsockets support — which the standard SignalWire binary packages do not include. Most existing tutorials pointing at mod_audio_fork are broken.
The successor is mod_audio_stream, maintained by amigniter and explicitly designed to work against stock binary FreeSWITCH. It ships as a prebuilt Debian 12 package. The community edition is free for up to 10 concurrent streaming channels — plenty for development and most small production deployments.
That is what we use, but with one important caveat that shapes the rest of the architecture.
Path A: mod_audio_stream — works perfectly for half the problem
mod_audio_stream is designed for exactly the kind of problem we are trying to solve: take live call audio from FreeSWITCH, stream it to a WebSocket endpoint you control, and optionally receive audio back over that same WebSocket to play into the call.
That maps very cleanly to a real-time AI voice agent.
FreeSWITCH owns the call. mod_audio_stream bridges the media. Your application receives the caller’s audio, sends it to STT, passes transcripts to the LLM, sends the LLM output to TTS, and streams the generated audio back into the call.
mod_audio_stream streams audio over WebSocket using linear PCM. It supports configurable sample rates and bidirectional media, which means it can send call audio out and receive generated audio back. That is the key capability we need for a conversational AI agent.
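On your side of the bridge, the receiving end is just a WebSocket server. A minimal sketch using the websockets package (the single-argument handler assumes a recent release; older versions also pass a path argument):

```python
import asyncio
import websockets  # pip install websockets

# mod_audio_stream sends call audio as binary WebSocket frames (linear PCM)
# and JSON metadata as text frames; this just counts bytes to prove flow.
async def handle(ws):
    received = 0
    async for message in ws:
        if isinstance(message, bytes):
            received += len(message)      # raw PCM: forward to STT in Part 3
        else:
            print("metadata:", message)   # JSON events from the module
    print(f"stream closed after {received} bytes of audio")

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8080):
        await asyncio.Future()            # serve forever

asyncio.run(main())
```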
So here’s the caveat I owe you up front: the outbound direction (caller → your application) is rock solid. The inbound direction (your application → caller) currently does not work reliably on stock FreeSWITCH installs.
The v1.0.3 binary advertises bidirectional playback. It accepts JSON-wrapped audio over the WebSocket, decodes the base64 payloads, and writes them to temp files on disk — you can watch this happen in the FreeSWITCH log. But the playback engine that’s supposed to inject those decoded chunks into the live call doesn’t fire for most users. No audio reaches the caller. There are several open issues describing the same symptoms (#56, #87, #104, #119) and no public resolution.
I spent a long time trying to make the bidirectional path work before giving up. The full debugging story is in Part 4. For Parts 2 and 3 it doesn’t matter — those parts only use the outbound direction (streaming caller audio out for transcription), and that works flawlessly.
For Part 4, where we need to play synthesized speech back to the caller, we use a second mechanism: FreeSWITCH’s built-in uuid_broadcast command via the Event Socket. That brings us to Path C.
Path B: mod_unimrcp
The second option is mod_unimrcp.
This is the more traditional speech integration path in FreeSWITCH. It allows FreeSWITCH to speak MRCPv2 to MRCP-compatible speech servers for ASR and TTS.
If you have been around contact centers, IVRs, Nuance, Loquendo, or older enterprise speech platforms, this model will look familiar. MRCP was built for that world.
And that is the problem.
mod_unimrcp is useful, mature, and still has its place — especially in environments that already have MRCP infrastructure. For a modern real-time AI voice agent, though, it is not the cleanest fit.
You can make it work. You can put adapter servers in the middle. You can bridge MRCP to modern streaming STT and TTS providers. You can wrap things and translate protocols.
But now you have added another layer.
Another server. Another protocol. Another source of latency. Another place for timing and endpointing behavior to get weird.
For a classic IVR, that may be fine. For a sub-second conversational AI loop, it is awkward.
My recommendation: skip this path unless you already have MRCP infrastructure and a specific reason to reuse it.
Path C: ESL with uuid_broadcast for playback
The third option uses FreeSWITCH’s Event Socket Library (ESL) to control playback into a live call.
The idea is straightforward. Your application has a separate connection to FreeSWITCH on the management port. When you have audio you want the caller to hear, you write it to a .wav file on disk and tell FreeSWITCH to play it: uuid_broadcast <call-uuid> /tmp/file.wav aleg. FreeSWITCH’s standard playback machinery — the same code path that’s been playing IVR prompts for two decades — handles the rest. Codec negotiation is automatic. Every FreeSWITCH install handles this command identically.
For full call automation built only on ESL — recording chunks with uuid_record, transcribing the file, generating a response, and playing it back with uuid_broadcast — the latency story is bad. You have to wait for enough audio to be recorded, wait for the file to close, send it to STT, wait for the transcript, send it to the LLM, wait for the response, synthesize audio, and only then play it back. That’s chunked processing, and it feels like chunked processing. Fine for voicemail transcription or simple “press 1, say something” IVR flows. Not fine for conversational AI.
But used surgically — only for playback, while live caller audio still streams out via mod_audio_stream — the latency cost shrinks dramatically. The caller’s voice path stays real-time. We only pay the file-write-and-broadcast overhead on agent responses, where we already have to wait for the LLM and TTS to produce something to play.
That cost turns out to be small: about 50–100 ms added on top of TTS time-to-first-byte. The total speech-to-first-audio latency lands in the 1.3–1.5 second range, which is conversational and usable. Part 4 walks through the timing measurements in detail.
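Here is the playback step in miniature, speaking the raw Event Socket protocol directly. The default ClueCon password and the hardcoded paths are placeholders for your own install, and a production version would parse full ESL responses instead of single recv calls:

```python
import socket, wave

# Minimal ESL client: authenticate on the event socket (port 8021 by
# default) and issue an api command. Password and host are placeholders.
def esl_api(command, host="127.0.0.1", port=8021, password="ClueCon"):
    s = socket.create_connection((host, port))
    s.recv(1024)                                  # "Content-Type: auth/request"
    s.sendall(f"auth {password}\n\n".encode())
    s.recv(1024)                                  # "+OK accepted"
    s.sendall(f"api {command}\n\n".encode())
    reply = s.recv(4096).decode()
    s.close()
    return reply

# Wrap raw 16-bit mono PCM in a wav header, then ask FreeSWITCH to play it
# into the caller's leg of the call identified by call_uuid.
def play_to_caller(call_uuid, pcm, rate=8000, path="/tmp/response.wav"):
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)                         # 16-bit samples
        f.setframerate(rate)
        f.writeframes(pcm)
    return esl_api(f"uuid_broadcast {call_uuid} {path} aleg")
```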
The combination we actually use
So the architecture for the rest of the series is:
Caller → application: mod_audio_stream streams the caller’s audio out over WebSocket as it happens. This is what Parts 2 and 3 are built around.
Application → caller: ESL uuid_broadcast plays the synthesized agent response back into the call. We add this in Part 4.
This isn’t the prettiest architecture. The prettiest architecture would be mod_audio_stream doing both directions on a single WebSocket, and that’s what most tutorials describe. But “prettiest” and “works on a stock FreeSWITCH install” turned out to be different things, and I’d rather show you something that works than something that looks clean in a diagram and fails when you try to run it.
If mod_audio_stream’s bidirectional path stabilizes in a future release, the swap is easy — the rest of the architecture stays the same.
Path D: Don’t integrate it yourself — use Jambonz
There is also a fourth option: do not build this layer yourself.
Use Jambonz.
Jambonz is an open-source voice platform built on top of FreeSWITCH by people who have already done a lot of this integration work. It gives you webhook-style call control, audio forking, STT and TTS integration, provider abstraction, and a much higher-level way to build voice applications.
If your goal is to ship a voice agent as quickly as possible and you do not care about owning every layer of the stack, Jambonz is probably the right answer. That is not a criticism — it is the same reason people use Twilio, SignalWire, or other cloud communications platforms. Sometimes the right business decision is to use the platform, ship the product, and not spend weeks building media plumbing.
But that is not what this series is about. This series is about understanding the pieces underneath the platform — how the audio actually gets from a SIP call into your AI process, how the media loop works, where latency gets introduced, how barge-in becomes possible, and how to customize the system beyond what a platform gives you.
If that’s what you’re after, building the pieces directly is worth the effort. That’s why we’re using mod_audio_stream plus ESL.
For the rest of this series, that combination is the bridge between FreeSWITCH and the AI agent. Now let’s talk about the constraint everything in this pipeline has to respect: latency.
4. The Latency Budget
If the target is caller stops speaking → AI response audio starts playing in roughly 800ms, every stage in the pipeline has to earn its place.
There is not one magic bottleneck. There are several small delays that add up very quickly.
A realistic latency budget looks something like this:
| Stage | Budget | Notes |
|---|---|---|
| Endpoint detection | 150–300ms | Time from the caller being “done” to the STT/VAD deciding they are done. Heavily dependent on provider behavior and tuning. |
| STT final transcript | 50–150ms | Time for the final transcript event to arrive once speech has ended. Streaming STT keeps this low. |
| LLM first token | 200–500ms | First-token latency, not full completion time. This is the time before the model starts producing a response. |
| TTS first audio chunk | 150–400ms | Time before synthesized audio starts coming back. Streaming TTS matters here. |
| Audio back to FreeSWITCH and playback | 50–100ms | Network, buffering, decoding, and media injection back into the call. |
| Total, best case | ~600ms | Feels natural. Under budget. |
| Total, typical | ~1000ms | Noticeable, but still usable if the rest of the experience is good. |
| Total, bad day | ~1800ms | The conversation starts to feel broken. |
These ranges shift based on providers, region, model choice, codec, and tuning. They’re approximate, but they’re representative of what a well-tuned pipeline actually looks like in practice.
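Summing the table is a useful sanity check. Taking the low end of each stage gives the best case, and taking midpoints as “typical” lands close to the table’s totals; the bad-day row is what happens when stages overrun these ranges:

```python
# Budget rows from the table above: (best, worst) per stage, in milliseconds.
stages = {
    "endpoint detection":   (150, 300),
    "STT final transcript":  (50, 150),
    "LLM first token":      (200, 500),
    "TTS first audio":      (150, 400),
    "playback injection":    (50, 100),
}
best = sum(lo for lo, hi in stages.values())
typical = sum((lo + hi) // 2 for lo, hi in stages.values())
print(f"best ~{best}ms, typical ~{typical}ms")   # ~600ms and ~1025ms
```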
Every stage adds latency.
You do not get to say, “the LLM is fast, so we’re fine.” The LLM is only one part of the loop. You still have to detect the end of speech, finalize the transcript, generate the first token, synthesize the first audio chunk, and get that audio back into the call.
A few hundred milliseconds here and a few hundred milliseconds there turns into a slow voice agent very quickly.
Stream everything
The biggest wins come from streaming everything.
Streaming STT means you are receiving interim transcripts while the caller is still speaking. That gives your application a head start. You may not act on every interim word, but you can begin tracking intent, preparing context, and getting ready for the final transcript.
Streaming LLM output means you do not wait for the model to finish writing the entire response before doing anything useful. As soon as the model starts producing text, you can start moving that text toward TTS.
And streaming TTS closes the loop. You do not wait for the entire response to be synthesized before playing audio. The first audio chunk should hit the call as fast as possible — then you keep feeding chunks back as the rest gets generated.
That is how the system stays conversational.
A note on speech-to-speech models
Offerings like OpenAI’s Realtime API collapse STT, LLM, and TTS into a single speech-to-speech model, which can dramatically reduce the per-stage latency we’ve broken out here. That is a real and growing option, but it trades away the flexibility of swapping components independently — you commit to one provider’s voice, one provider’s model, and one provider’s pricing.
For this series, we are sticking with the pipeline architecture because understanding the pieces is the point. Speech-to-speech models are worth a follow-up post of their own.
The dark horse: endpoint detection
The dark horse in all of this is endpoint detection.
It sounds like a small implementation detail, but it is one of the biggest factors in how fast the system feels. The caller does not perceive “LLM latency” or “TTS latency” as separate things. They perceive silence.
They finish speaking.
Nothing happens.
That gap is the experience.
If endpoint detection fires too early, the agent interrupts the caller. If it fires too late, the agent feels slow. Tuning the threshold is one of the most underrated parts of building a voice AI system.
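To make the knob tangible, here is a toy energy-based endpointer. A real system leans on the STT provider’s endpointing events and a proper VAD model, but the tradeoff it exposes, energy threshold versus hold time, is exactly the one described above:

```python
from array import array

# Toy endpointing sketch: declare end-of-utterance after HOLD_MS of audio
# below an energy floor. Both constants are the tuning knobs in question.
THRESHOLD = 500        # RMS energy floor; tune per codec and line conditions
HOLD_MS = 300          # silence required before firing
FRAME_MS = 20          # duration of each PCM frame

def rms(frame_bytes):
    samples = array("h", frame_bytes)             # 16-bit linear PCM
    return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5

class Endpointer:
    def __init__(self):
        self.silent_ms = 0
        self.in_speech = False

    def feed(self, frame_bytes):
        """Returns True once, when the caller is judged to be done."""
        if rms(frame_bytes) >= THRESHOLD:
            self.in_speech, self.silent_ms = True, 0
        elif self.in_speech:
            self.silent_ms += FRAME_MS
            if self.silent_ms >= HOLD_MS:
                self.in_speech = False
                return True
        return False
```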
Geography and the network underneath
Geography matters too. If your FreeSWITCH server is in Frankfurt, your STT endpoint is in US-East, your LLM lives somewhere else, and your TTS provider is in a fourth region, you can burn a meaningful chunk of your latency budget before any model has done any work.
Co-locate what you can. Put FreeSWITCH close to your carrier edge. Put your application server close to FreeSWITCH. Choose STT, LLM, and TTS regions intentionally. Don’t send audio halfway around the world unless you have a good reason.
The numbers above also assume the call is already on a healthy network with reasonable codec choices. Packet loss, jitter, or a poorly-tuned RTP path can add another 50–200ms of effective delay or worse — and unlike application-layer latency, you cannot stream your way out of it.
Reasonable, not perfect
The goal is not to make every stage perfect. The goal is to keep every stage reasonable.
The numbers in the table above are not aspirational — they are what a well-tuned pipeline actually looks like in production. If your numbers come in 2x worse, your callers will notice. If they come in 3x worse, they will hang up.
The architecture has to be designed around latency from the start. You cannot bolt that on later.
But latency is not the only constraint a real voice agent has to live within. The other one is cost.
5. The Cost Reality
Most voice AI tutorials skip the part everyone eventually has to care about: what does a call actually cost?
The honest answer depends on your STT provider, LLM choice, TTS provider, call length, and whether you self-host any of it. But we can build a realistic back-of-the-napkin estimate that’s accurate enough to make architecture decisions on.
For the examples below, assume a 5-minute phone call with roughly 10 conversational turns, using streaming STT, a small fast LLM, and streaming TTS.
Pricing note: GPT-4o mini is used here as a stable reference point for “small fast LLM” pricing. By the time you’re reading this, newer options like GPT-5 mini, Claude Haiku, or whatever ships next may offer better cost/performance. The category — cheap, fast, hosted small model — is what matters; the specific model will keep changing.
Stack A: Commercial, low-latency, higher-quality voice
This is the “I want it to sound good and work well without self-hosting models” stack.
| Component | Cost for a 5-minute call |
|---|---|
| Deepgram streaming STT | ~ $0.04 |
| GPT-4o mini LLM usage | ~ $0.002–$0.005 |
| ElevenLabs Flash/Turbo TTS (~1,500 characters) | ~ $0.08 |
| FreeSWITCH host cost | Negligible |
| Total | ~ $0.12–$0.15 |
Deepgram’s streaming STT is currently around $0.0077/minute. GPT-4o mini sits at $0.15 per million input tokens and $0.60 per million output tokens. ElevenLabs Flash/Turbo TTS starts at $0.05 per 1,000 characters at the lowest API tier, with rates rising to $0.06 at higher tiers and significantly more on subscription overage.
That means the LLM is almost a rounding error in this kind of call.
The voice is where the money goes.
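The arithmetic behind that table, with assumed token counts for a ten-turn call (the rates are the ones quoted above; verify them before relying on this):

```python
# Stack A arithmetic at the rates quoted above. The token counts are
# assumptions for a ~10-turn call, not measurements.
minutes, chars = 5, 1500
stt = 0.0077 * minutes                          # Deepgram streaming STT, $/min
llm = (3000 * 0.15 + 1500 * 0.60) / 1_000_000   # GPT-4o mini, input/output tokens
tts = chars / 1000 * 0.05                       # ElevenLabs Flash/Turbo, $/1k chars
print(f"STT ${stt:.4f} + LLM ${llm:.4f} + TTS ${tts:.4f}"
      f" = ${stt + llm + tts:.3f} per call")    # ~ $0.115
```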
Stack B: Commercial, cheaper TTS
Now keep the same basic architecture, but swap the premium TTS provider for a cheaper text-to-speech option.
| Component | Cost for a 5-minute call |
|---|---|
| Deepgram streaming STT | ~ $0.04 |
| GPT-4o mini LLM usage | ~ $0.002–$0.005 |
| Lower-cost TTS (~1,500 characters) | ~ $0.02–$0.05 |
| FreeSWITCH host cost | Negligible |
| Total | ~ $0.06–$0.10 |
This is probably the practical sweet spot for a lot of early builds. You still get managed STT, a fast hosted LLM, and streaming TTS — without spending most of your call budget on the most expensive voice option.
The tradeoff is voice quality and the depth of the voice library. A budget TTS provider might have ten decent voices in three languages. ElevenLabs has hundreds across thirty-plus languages, with much finer control over tone and emotion. For a scheduling bot, that does not matter. For a customer-facing concierge, it might be the whole point.
A scheduling bot, password reset bot, appointment reminder, or internal IT assistant probably does not need the most expressive voice model on the market. A sales agent, healthcare intake agent, or customer-facing support bot might.
The rest of this series uses Deepgram Aura-2 (~$0.030 per 1,000 characters) for exactly this reason — keeping STT and TTS with one vendor means one API key, one billing relationship, and one SDK. Aura-2 lands solidly in the budget-TTS range while still sounding good, and the operational simplicity of single-vendor billing is genuinely worth something when you’re getting started.
Stack C: Self-hosted
The third option is to self-host more of the stack.
| Component | Example |
|---|---|
| STT | faster-whisper, whisper.cpp, or another local model |
| LLM | Llama, Mistral, Qwen, or another local model |
| TTS | Piper or another local engine |
| Runtime | GPU VPS, rented GPU box, or your own hardware |
This changes the economics completely.
Instead of paying per call, you are paying for capacity. If the GPU server costs you $0.20/hour, you pay that whether it sits idle or handles back-to-back calls. At 100% utilization, that becomes extremely cheap on a per-minute basis. At low utilization, it can be more expensive than just using commercial APIs.
| Utilization | Effective per-minute cost |
|---|---|
| GPU box at $0.20/hour, 100% utilized | ~ $0.003/min |
| 300 call-minutes/day (box runs 24/7) | ~ $0.016/min |
| 1,000 call-minutes/day (box runs 24/7) | ~ $0.005/min |
At 1,000 call-minutes/day, you’re paying roughly half of what even the cheapest commercial stack costs. Below 300, you’re paying more for less convenience.
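The utilization table is just fixed cost divided by usage, which is worth seeing once:

```python
# A $0.20/hour box costs $4.80/day no matter how many call-minutes it serves.
daily_cost = 0.20 * 24
for call_minutes in (1440, 1000, 300):     # per day: fully utilized, busy, light
    print(f"{call_minutes:>5} call-min/day -> ${daily_cost / call_minutes:.4f}/min")
```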
Self-hosting is not automatically cheaper. It’s cheaper when you have enough volume to keep the hardware busy, or when you need data sovereignty, offline operation, predictable spend, or full control over the models. Below that point, managed APIs are usually the easier and often cheaper answer.
Don’t forget the phone call itself
None of these numbers include the actual PSTN or SIP trunk cost.
You still need a phone number. You still need inbound or outbound minutes. Depending on your carrier, country, call direction, and volume, that might add something like:
| Cost item | Rough range |
|---|---|
| SIP/PSTN per-minute usage | ~ $0.005–$0.015/min |
| 5-minute call | ~ $0.025–$0.075 |
| Toll-free inbound | ~ $0.015–$0.025/min, often higher |
A 5-minute call that costs $0.08 in AI processing and $0.04 in carrier usage is really a $0.12 call. Carrier cost is rarely the biggest line item — but it is never zero, and at scale it matters.
What to remember
The LLM is rarely the expensive part. With a small fast model like GPT-4o mini, the model cost for a short phone call is fractions of a cent. The LLM matters for latency and quality, but it is not where your per-call cost lives.
TTS is usually the biggest cost driver. A premium voice can cost several times more than a budget one. That may be worth it for a customer-facing agent, but you should know that before you design your margins around it.
Commercial APIs are not as expensive as people assume. A realistic managed stack lands at roughly $0.06–$0.15 for a 5-minute call before carrier costs, or $0.09–$0.22 with PSTN included. That is very different from the per-minute pricing of bundled voice agent platforms — Deepgram’s Voice Agent API, for example, is currently $0.08/minute ($4.80/hour) on Pay-as-you-Go, dropping to roughly $4.50/hour on the Growth tier with annual prepayment.
Self-hosting only wins at volume. At a few dozen calls a day, managed APIs are cheaper and easier. At hundreds or thousands of call-minutes per day, self-hosting STT, LLM, or TTS starts to be worth analyzing seriously.
Failed calls still cost money. If your STT provider has a 30-second timeout and the connection drops at 28 seconds, you’ve paid for the STT, paid for any LLM tokens you generated, possibly paid for partial TTS — and the caller has nothing to show for it. At small volumes this is rounding error. At thousands of calls per day, retry storms can meaningfully change your unit economics.
The cost model is not complicated. But you do have to model the whole call:
STT + LLM + TTS + hosting + carrier minutes + failed calls + retries + logging
Once you have that number, you can design the rest of the system around it — the model choice, the TTS provider, the self-hosting decision, the volume targets you need to hit before any of this turns into a real business.
Without it, you’re guessing.
These numbers shift — verify current pricing against each provider’s pricing page before you build a business model around them. Voice AI pricing in particular has been moving downward, which is good for your margins but bad for any tutorial that does not flag this caveat.
6. What This Series Will Build
This first post was the map. The next four are where we start building.
In Part 2, we’ll install mod_audio_stream, wire it into a FreeSWITCH dialplan extension via a small Lua helper, and build a minimal WebSocket server that receives the streamed call audio. No AI yet. The goal is to prove the plumbing works.
In Part 3, we’ll add streaming speech-to-text with Deepgram. The WebSocket server will receive live call audio, forward it to Deepgram, and log transcripts as the caller speaks.
In Part 4, we’ll close the loop. We’ll take the transcript, feed it to an LLM, synthesize the response with streaming text-to-speech, and send the generated audio back into the call. We’ll deal with conversation memory, phrase ordering, utterance boundaries, and the architectural detour we have to take because mod_audio_stream’s bidirectional playback path doesn’t work reliably on stock FreeSWITCH installs.
In Part 5, we’ll harden the system for production. Barge-in and turn-taking, call logging, per-call cost tracking, graceful failure handling, observability, and the security story — because a FreeSWITCH box that calls out to cloud AI APIs has an attack surface a basic hardening baseline doesn’t cover.
By the end of the series, you’ll have a self-hosted FreeSWITCH-based voice AI agent that you can point a real phone number at. It will use real SIP, real audio, real streaming STT, a real LLM, and real TTS.
It will not be a toy browser demo.
It will be a real voice AI pipeline that you control end to end — from the SIP signaling to the model choice to the audio flowing out of the speaker.
This is not a weekend hack.
It is a real production pattern. The same basic architecture sits underneath many commercial voice AI platforms: telephony engine, media bridge, streaming STT, LLM, streaming TTS, and enough observability to figure out what happened when a call goes sideways.
The reason to build it yourself is not only cost, although the cost difference can be significant.
The bigger reason is control.
You decide which STT provider to use. You decide which model gets the transcript. You decide where the audio goes, what gets logged, how long data is retained, how failures are handled, and what happens when a provider has an outage. You’re not stuck with whatever abstraction a managed platform exposes.
If you haven’t installed FreeSWITCH yet, start there: [FreeSWITCH installation guide]. If you’re exposing it to real networks, the [hardening guide] is required reading before Part 2 — a voice AI agent is still a SIP server with everything that implies, including being a target for toll fraud the moment it touches the public internet.
In Part 2, we’ll install mod_audio_stream, connect FreeSWITCH to a WebSocket server, and prove that we can stream live call audio out of the system.
That’s where the build starts.
Part 2 lands soon!