How AI Call Summaries Actually Work (And Why Most Vendors Get Them Wrong)

A technical look at what happens between the moment a call ends and the moment a one-paragraph summary lands in your inbox. And why "AI summary" means something different on most vendors' feature lists than it does here.

Vocatech Team · April 12, 2026 · 8 min read

Every vendor in business phones now puts the phrase AI call summaries on a feature list. Most of the time, that phrase means the same thing. A recording is shipped to OpenAI. OpenAI ships back a paragraph. The vendor charges you extra for it.

That is not what it means here. This article walks through the pipeline we built, the choices we made, and why the word summary is doing very different work when you see it in different product sheets.

The Ten-Second Version

A call ends. Inside a minute, a three to five sentence paragraph shows up in Reports. Who called. What they wanted. What was agreed. That paragraph rides out to the customer record in Salesforce, HubSpot, Zoho, QuickBooks, or whatever CRM the account is wired to.

The steps in between are where the product gets built.

Step One. Recording the Call

When a call runs on Vocatech, our sip_server (written in C++) captures the audio straight off the call's media stream, inside the switch. Not from a phone, not from a browser. From the switch itself. That matters for a reason most people never think about.

The recording is dual-channel. The caller is on one channel. The agent is on the other. They never mix. When the call is done, the two channels are merged with a C++ utility called pad_silence that aligns the timing, then passed to ffmpeg with -b:a 16k to produce a compact MP3.
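
The encode step is a plain ffmpeg invocation. A minimal sketch, wrapped in Python for consistency with the rest of this article; the paths are illustrative, and "merged_call.wav" stands in for pad_silence's time-aligned output:

```python
import subprocess

# Minimal sketch of the encode step. Paths are illustrative; the input is
# the time-aligned two-channel file that pad_silence produces.
subprocess.run(
    [
        "ffmpeg",
        "-i", "merged_call.wav",
        "-ac", "2",       # keep caller and agent on separate channels
        "-b:a", "16k",    # compact bitrate; plenty for 8 kHz telephony speech
        "merged_call.mp3",
    ],
    check=True,  # raise if ffmpeg exits nonzero
)
```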

Why dual-channel matters. When transcription runs, the model knows which words came from which speaker before it does any diarization work. On telephony audio, that cuts the speaker-labeling error rate to almost nothing. Most hosted VoIP platforms record mixed-mono because it is easier. Their summaries suffer for it.

The merged MP3 uploads to Google Cloud Storage, and that object becomes the system of record. Every downstream step reads from GCS.
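
A minimal sketch of the upload with the google-cloud-storage client; the bucket and object names are illustrative, not our real layout:

```python
from google.cloud import storage

# Minimal sketch; bucket and object names are illustrative.
client = storage.Client()
blob = client.bucket("vocatech-recordings").blob("calls/2026/04/call_12345.mp3")
blob.upload_from_filename("merged_call.mp3")
# Once upload_from_filename returns, the recording is durable. Everything
# downstream (transcription, summary, CRM push) reads from this object.
```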

Step Two. Transcription on Our Own GPU

This is the part nobody else does.

The audio is handed to faster-whisper 1.2.1 running on an NVIDIA T4 in Google Cloud's us-west4-a zone. Faster-whisper is an optimized reimplementation of OpenAI's open-source Whisper model, built on the CTranslate2 inference engine for roughly four times the throughput of stock Whisper on the same hardware. The T4 is the right shape of GPU for this workload. Plenty of memory, modest cost, good availability.

The important word in that heading is our. We own the GPU time. The audio never leaves the Vocatech environment during transcription. It is never uploaded to OpenAI, Deepgram, AssemblyAI, or any of the hosted speech-to-text services the rest of the industry is built on top of.

A few settings matter for telephony audio specifically (a sketch putting them together follows this list):

  • beam_size=5. Whisper uses beam search to pick the most likely transcription. Five is the sweet spot for call audio. Higher gets marginal accuracy at a large speed cost. Lower misses words on noisy calls.
  • VAD filtering, optional. Voice activity detection skips over silence before the model sees it. That stops Whisper from hallucinating words in dead air (a known behavior on the original model). We turn it on by default. We turn it off for calls where every pause is data, like legal depositions.
  • 8 kHz telephony tuning. Whisper was trained on 16 kHz podcast-style audio. A phone call is 8 kHz narrowband. The words are there, but the frequencies above 4 kHz are gone. We upsample to 16 kHz before inference, which the model handles gracefully as long as the upsampling is clean (cubic, not nearest-neighbor).
  • Language detection. Whisper auto-detects. Spanish calls transcribe in Spanish. A bilingual call flips mid-sentence without breaking. For the homecare agencies on the platform, this is not a bonus feature, it is the feature.
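
Putting those settings together: a minimal sketch with the faster-whisper Python API. The model size and file paths are illustrative, and the sketch transcribes each speaker channel separately to show why dual-channel recording makes speaker labeling trivial:

```python
from faster_whisper import WhisperModel

# One model instance lives on the T4 and is reused across calls.
# Model size is illustrative.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_channel(path: str):
    # Assumes the 8 kHz telephony audio was already cleanly upsampled to 16 kHz.
    segments, _ = model.transcribe(
        path,
        beam_size=5,      # the sweet spot for call audio
        vad_filter=True,  # skip dead air; turn off when every pause is data
        language=None,    # auto-detect; bilingual calls flip mid-sentence
    )
    return list(segments)  # segments is a generator; materialize it

# Because the channels never mixed, speaker labeling is a simple interleave
# of the two transcripts by start time. No diarization model involved.
caller = [("Caller", s) for s in transcribe_channel("caller_16k.wav")]
agent = [("Agent", s) for s in transcribe_channel("agent_16k.wav")]
for speaker, seg in sorted(caller + agent, key=lambda t: t[1].start):
    print(f"[{seg.start:6.1f}s] {speaker}: {seg.text.strip()}")
```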

A typical five-minute call transcribes in ten to fifteen seconds on the T4. A twenty-minute sales call takes under a minute. The pipeline is designed so transcription never blocks recording storage. The recording is durable the instant it hits GCS. Transcription catches up behind it.
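
That ordering is an ordinary job-queue pattern. A minimal sketch under assumptions; the helper names are hypothetical stand-ins for our actual job system:

```python
import queue

transcription_jobs: queue.Queue = queue.Queue()

def on_call_finished(local_mp3: str, object_name: str) -> None:
    upload_to_gcs(local_mp3, object_name)  # hypothetical helper; the durability point
    transcription_jobs.put(object_name)    # storage never waits on the GPU

def gpu_worker() -> None:
    # Runs on the T4 host and drains the queue at its own pace; if it falls
    # behind or restarts, the recordings are already safe in GCS.
    while True:
        object_name = transcription_jobs.get()
        transcribe_and_summarize(object_name)  # hypothetical downstream step
        transcription_jobs.task_done()
```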

Step Three. Summary Generation

Here is where we do use a third-party model. The transcript (not the audio, the text) goes to GPT-4o-mini through OpenAI's API with a narrow system prompt. The prompt asks for three to five sentences, in a specific format: who called, what they needed, what was agreed as the next step.
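
A minimal sketch of that call with the OpenAI Python SDK. The prompt wording is illustrative, not our production prompt:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Summarize this phone call transcript in three to five sentences. "
    "State who called, what they needed, and what was agreed as the next step. "
    "No adjectives, no tone analysis, no pleasantries."
)

def summarize(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},  # text only, never audio
        ],
        temperature=0.2,  # keep the output terse and consistent
    )
    return response.choices[0].message.content
```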

Why GPT and not our own model here. Summarization is a different problem from transcription. Training a custom summarization model that performs as well as GPT-4o-mini on free-form English conversation is a research project, not a product feature. We do not pretend otherwise.

But notice what is getting sent and what is not. OpenAI sees the text of the call. They never see the audio. They never see the phone number metadata, the CRM mapping, or any customer identifier we do not have to send. That is a very different privacy posture from a platform that ships the raw MP3 to a third party with everything attached.

The call to GPT is wrapped in a circuit breaker. If the API returns errors for a threshold number of requests in a row, the breaker opens and we stop calling for a cooldown window. During that window, transcripts still land in Reports, just without a summary on them. When the breaker closes again, summaries resume and the backlog drains. OpenAI has had outages. Our customers stop noticing about three minutes in.
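
A minimal sketch of the breaker; the threshold and cooldown values here are illustrative:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; probe again after `cooldown`."""

    def __init__(self, threshold: int = 5, cooldown: float = 180.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: let the next request probe the API
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

Every summarization request wraps in allow() / record_success() / record_failure(). A False from allow() means the transcript lands without a summary and joins the retry backlog.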

Step Four. Storage and Surfacing

The finished record has a recording URL, a transcript, and a summary. It writes to the Recordings database with the call metadata (numbers, duration, user, timestamps). It shows up in Reports, our admin-portal call log, with the summary preview inline and the transcript one click away.
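
As a shape, the record is flat. An illustrative sketch, not the actual Recordings schema:

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    # Illustrative field names; the real Recordings table differs.
    recording_url: str    # GCS object holding the merged MP3
    transcript: str
    summary: str
    from_number: str
    to_number: str
    duration_seconds: int
    user: str             # the agent on the call
    started_at: str       # ISO 8601 timestamp
```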

If the customer has an integration configured, the summary pushes out to the mapped CRM record. Salesforce activity timeline. HubSpot contact note. Zoho call log. QuickBooks customer notes. HHAeXchange visit note for the homecare agencies. Whatever the customer's system of record is, that is where the summary ends up, with no manual step from the agent.
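
Mechanically, the push is a lookup from the account's configured CRM to its connector. A sketch with hypothetical connector stubs, since every real connector is custom-built (more on that below):

```python
# Hypothetical connector stubs; each real one is custom-built per customer.
def push_salesforce_activity(record): ...
def push_hubspot_contact_note(record): ...
def push_zoho_call_log(record): ...
def push_hhaexchange_visit_note(record): ...

CRM_CONNECTORS = {
    "salesforce": push_salesforce_activity,
    "hubspot": push_hubspot_contact_note,
    "zoho": push_zoho_call_log,
    "hhaexchange": push_hhaexchange_visit_note,
}

def push_summary(crm_name: str, record) -> None:
    connector = CRM_CONNECTORS.get(crm_name)  # the account's mapped CRM
    if connector is not None:
        connector(record)  # summary lands in the system of record, no agent step
```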

Integrations are custom-built per customer at our shop for a few hundred dollars. We do not sell a marketplace of half-working connectors. We sell a connector that works for your specific setup, configured by the people who built the platform.

Why Running Our Own GPU Matters

Four reasons, in order of how often they come up.

Privacy. Customer audio does not cross our trust boundary during transcription. For medical offices, law firms, and anyone with a compliance officer, that sentence is the whole pitch. Hosted speech APIs require you to sign a BAA or a DPA that covers your customer audio being processed by a vendor you have never heard of. We do not need one.

Cost control. Hosted transcription is priced per minute. At our call volume, per-minute pricing turns AI summaries into a product tier we would have to charge extra for. Running our own T4 is a fixed monthly cost that we spread across every customer on the platform. That is how we include summaries in the flat $29.95.
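
The arithmetic, with illustrative numbers rather than anyone's rate card: hosted transcription is priced on the order of half a cent per audio minute, so a million call minutes a month costs several thousand dollars and climbs linearly from there. A dedicated T4 instance costs a few hundred dollars a month whether it transcribes one call or every call on the platform.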

Latency. Network round trips to a hosted API add seconds of overhead on every call. Our pipeline stays inside one GCP region. Sub-minute turnaround is the baseline, not the headline feature.

No vendor lock-in. Faster-whisper is open source. If a better open model ships tomorrow, we swap it in on the same GPU over a weekend. We do not wait for a third party to decide our feature roadmap.

What the Summary Actually Looks Like

A real example, from a property management customer, with the names changed:

Marcus Fletcher called about a leak in unit 4B at the 858-area-code property. The tenant reports water coming from the ceiling in the bathroom. Agent scheduled a plumber to arrive tomorrow between 9 and 11 AM and agreed to follow up with the tenant once the repair is confirmed.

That is the shape. Three sentences. Who. What. Next step. A receptionist reading twelve of these at the start of a shift knows exactly what they are walking into. A sales manager spot-checking a rep's week reads forty and knows where the deals are stuck.

No adjectives. No tone analysis. No AI-generated pleasantries about how the customer was satisfied with the service. Those are the tells of a summary that is not doing any real work. Ours does the work, then stops talking.

The Short Answer to the Question in the Title

Most vendors ship audio to OpenAI and call it a feature. We record the call ourselves, transcribe it on a GPU we own, summarize the text through a narrow prompt, and deliver the result to the CRM before the agent finishes their after-call notes.

The distance between those two sentences is the distance between AI summary as a marketing phrase and AI summary as a thing your operations team actually uses.

About Vocatech

Vocatech is a business phone service built on Cisco BroadWorks with our own platform layered on top. Callpop for desktop caller context. Reports with AI transcription and summaries. Textdock for SMS and WhatsApp from your business number inside Cisco Webex. A custom integrations workshop that connects any CRM or tool you already use.

Flat $29.95 per seat. Month to month. Founded 2008. Over a thousand businesses. 97% retention.

Start a trial at vocatech.com/contact or see the platform at vocatech.com/platform.

