Transcription accuracy

What Whisper does well and where it needs help.

Vocatech runs OpenAI Whisper Large-v3, fine-tuned for real-world call audio, on our own NVIDIA T4 GPUs. It is the same model family that powers many commercial transcription services.
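
Our production pipeline has more moving parts, but the core call is simple. Here is a minimal sketch using the open-source openai-whisper package; the file name is a placeholder:

    import whisper

    # Load the same model family we run in production. large-v3 needs
    # roughly 10 GB of VRAM, which fits comfortably on a T4.
    model = whisper.load_model("large-v3")

    # Transcribe a recorded call. "call.wav" stands in for a real recording.
    result = model.transcribe("call.wav")
    print(result["text"])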

On a clean call Whisper delivers accuracy north of 98 percent. On bad audio it drops. Here is what the engine does well and where you should set expectations.

Languages supported

Whisper transcribes nearly 100 languages. English and Spanish are the strongest for US business calls. French, German, Mandarin, Portuguese, Hebrew, Arabic, Russian, and many others are well supported.

The engine auto-detects the language. You do not need to tell it. If a call switches between two languages mid-conversation, Whisper generally handles the transition, though accuracy can dip at the switch point.
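
If you are curious what the detection step looks like, the open-source package exposes it directly. A sketch, again with a placeholder file name:

    import whisper

    model = whisper.load_model("large-v3")

    # Whisper detects the language from the first 30 seconds of audio.
    audio = whisper.load_audio("call.wav")
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")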

Audio quality

Audio quality is the single biggest variable in transcription accuracy. A desk phone on a wired network produces the cleanest audio and the best transcription. Webex on a laptop with a good microphone is a close second.

  • Desk phones on Ethernet: excellent
  • Webex on a laptop with a decent mic: very good
  • Cell phones with strong signal: good
  • Cell phones with weak signal: variable
  • Speakerphone in a busy room: weaker

Multi-speaker alignment

Two-channel calls produce the best multi-speaker results. The recording captures each side of the conversation on its own channel, so each speaker's words are transcribed separately and attribution is reliable.
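
Splitting the recording is all it takes to reproduce this. A sketch using the soundfile package; the channel labels and file names are assumptions for illustration:

    import soundfile as sf
    import whisper

    model = whisper.load_model("large-v3")

    # A two-channel call recording: one side of the conversation per channel.
    audio, sample_rate = sf.read("call_stereo.wav")  # shape: (samples, 2)

    # Transcribe each channel separately; attribution comes for free.
    for label, channel in zip(("agent", "caller"), audio.T):
        path = f"{label}.wav"
        sf.write(path, channel, sample_rate)
        result = model.transcribe(path)
        print(f"{label}: {result['text']}")

Interleaving the two transcripts back into one timeline uses the per-segment timestamps Whisper returns alongside the text.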

Single-channel audio, like a conference call with three or four participants, is harder. Whisper can still transcribe the words, but attributing them to speakers depends on acoustic cues alone, and long back-and-forth between two similar voices sometimes gets misattributed.

Where accuracy suffers

Three situations consistently weaken transcription.

Low bandwidth. When a call drops packets or compresses audio aggressively, words get clipped before the model can recognize them. The fix is on the network side, not the transcription side.

Heavy accents with technical vocabulary. Whisper handles most accents well, but when a strong accent combines with industry-specific jargon the model has not seen often, word error rate climbs.
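
Word error rate is the standard yardstick here: substitutions, deletions, and insertions divided by the number of words in the reference transcript. A quick sketch with the jiwer package, using invented example strings:

    import jiwer

    reference  = "schedule a follow-up demo of the API gateway next Tuesday"
    hypothesis = "schedule a follow up demo of the IP gateway next Tuesday"

    # WER = (substitutions + deletions + insertions) / reference word count.
    # Here "API" -> "IP" is a substitution, and "follow-up" -> "follow up"
    # costs a substitution plus an insertion.
    print(jiwer.wer(reference, hypothesis))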

Noisy environments. A TV in the background, children in the room, a busy restaurant. Background speech that crosses speech-detection thresholds gets mixed in with the primary conversation. AI noise removal in Webex helps, but cannot eliminate everything.

When transcription is still good enough

Even on weaker audio, transcription is almost always good enough for a summary. The model produces imperfect text, but GPT-4o-mini is forgiving. A three-sentence summary of a rough transcription is usually accurate to the meaning of the call, even if individual words are wrong.
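
Our summarization prompt is more elaborate, but the shape of the call is simple. A sketch using the OpenAI Python SDK; the prompt wording is illustrative, not our production prompt:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarize(transcript: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Summarize this phone call in three sentences. "
                            "Focus on decisions and next steps."},
                {"role": "user", "content": transcript},
            ],
        )
        return response.choices[0].message.content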

The workflow we see most often is this: customers skim the summary first, and only dig into the raw transcription when they need exact phrasing. The summary layer covers most of the imperfections.

Still stuck?

A real human at Vocatech answers the phone. Usually within minutes during business hours.