How call recording works | Vocatech Help

Call recording on Vocatech runs on our own infrastructure, not a third-party reseller service. It is a dedicated C++ SIP server that we built, maintain, and improve. This is why recordings show up in the call journey within about 10 seconds of the call ending, and why retention is measured in years instead of days.

Here is what happens from the moment recording starts on a call to the moment the MP3 appears in the portal.

The SIPREC protocol

Vocatech uses SIPREC, the industry-standard SIP recording protocol. When recording is enabled on an extension, BroadWorks forks the call media to our recording server in parallel with the normal call path. The caller and callee are not affected. The recording happens alongside the call, not in series.

With SIPREC, two separate media streams arrive at the recording server, one for each side of the conversation. The phone system does not mix them, so they come in clean and independent.
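As an illustration of what "two separate streams" looks like on the wire, here is a minimal Python sketch that reads an SDP offer with two audio media lines, the shape a SIPREC recording session commonly takes. The sample SDP, the addresses, the ports, and the function name parse_audio_streams are all hypothetical; this is not Vocatech's parser, and a production server also processes the SIPREC metadata that travels with the SDP.

```python
# Hypothetical SDP offer with one audio stream per call participant.
# Addresses, ports, and labels are made up for illustration.
SAMPLE_SDP = """\
v=0
o=src 2890844526 2890844526 IN IP4 198.51.100.1
s=-
c=IN IP4 198.51.100.1
t=0 0
m=audio 12240 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=label:1
a=sendonly
m=audio 12242 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=label:2
a=sendonly
"""

DIRECTIONS = ("a=sendonly", "a=recvonly", "a=sendrecv", "a=inactive")


def parse_audio_streams(sdp: str) -> list[dict]:
    """Return one dict per audio media line, with its port, label, and direction."""
    streams = []
    current = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            media, port, _proto, *_formats = line[2:].split()
            current = {"media": media, "port": int(port),
                       "label": None, "direction": None}
            streams.append(current)
        elif current is not None and line.startswith("a=label:"):
            current["label"] = line[len("a=label:"):]
        elif current is not None and line in DIRECTIONS:
            current["direction"] = line[2:]
    return [s for s in streams if s["media"] == "audio"]


streams = parse_audio_streams(SAMPLE_SDP)
for s in streams:
    print(s["label"], s["port"], s["direction"])
```

Each entry corresponds to one side of the conversation, which is what lets the recorder keep the two sides apart from the start.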

Capturing the channels

Our C++ recording server captures each SIPREC stream into its own audio file. If a caller and an agent are on a call, there are two files. Each file contains only that speaker's audio.
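The one-file-per-speaker idea can be sketched in a few lines of Python. This is a simplified illustration, not the C++ server's code: it assumes the audio has already been decoded to 16-bit PCM at 8 kHz, and the names write_mono_wav and capture_streams, the WAV container, and the placeholder silence are assumptions made for the example.

```python
import os
import tempfile
import wave

SAMPLE_RATE = 8000  # assumed narrowband telephony rate


def write_mono_wav(path, pcm_bytes, sample_rate=SAMPLE_RATE):
    """Write 16-bit little-endian PCM bytes to a single-channel WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # one speaker per file
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)


def capture_streams(streams, out_dir):
    """Write each named stream to its own file and return the file paths."""
    paths = {}
    for name, pcm in streams.items():
        path = os.path.join(out_dir, f"{name}.wav")
        write_mono_wav(path, pcm)
        paths[name] = path
    return paths


out_dir = tempfile.mkdtemp()
# Placeholder audio: one second of silence for the caller, half a second for the agent.
streams = {"caller": b"\x00\x00" * 8000, "agent": b"\x00\x00" * 4000}
paths = capture_streams(streams, out_dir)
print(paths)
```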

Capturing separately is important for two reasons. It keeps each speaker's signal-to-noise ratio higher than in a pre-mixed recording, because one side's background noise is never added to the other side's speech. It also lets the transcription engine attribute every sentence to the correct speaker with high reliability.

Merging with packet timing

After the call ends, the recording server merges the two channels into a single file. The merge is not a simple overlay. It uses the packet timing data from the original SIPREC streams to align the audio to the millisecond.

This is why Vocatech recordings do not have the drift you hear on some platforms, where one speaker sounds slightly behind the other. Packet-timing alignment keeps the conversation natural.
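To make the difference between a simple overlay and a timing-based merge concrete, here is a small Python sketch. It is not the recording server's algorithm, which this article does not describe in detail. It assumes each packet carries a timestamp already converted to a sample offset on a clock shared by both streams; the Packet class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    timestamp: int      # sample offset on a shared clock (assumed already normalised)
    samples: list       # decoded 16-bit PCM samples carried by this packet


def place_on_timeline(packets, length):
    """Put each packet's samples at the position its timestamp dictates.

    Gaps left by lost or late packets stay silent instead of shifting
    later audio earlier, which is what would cause drift.
    """
    timeline = [0] * length
    for packet in packets:
        for i, sample in enumerate(packet.samples):
            pos = packet.timestamp + i
            if 0 <= pos < length:
                timeline[pos] = sample
    return timeline


def merge_aligned(caller, agent):
    """Mix two streams sample by sample after aligning both on one timeline."""
    length = max((p.timestamp + len(p.samples) for p in caller + agent), default=0)
    a = place_on_timeline(caller, length)
    b = place_on_timeline(agent, length)
    # Sum the two sides and clip to the 16-bit range.
    return [max(-32768, min(32767, x + y)) for x, y in zip(a, b)]


# The caller stream is missing the packet that would cover samples 2-3.
caller = [Packet(0, [100, 100]), Packet(4, [100, 100])]
agent = [Packet(2, [50, 50, 50])]
merged = merge_aligned(caller, agent)
print(merged)  # [100, 100, 50, 50, 150, 100]
```

A naive overlay would concatenate the caller's four samples back to back and mix them from position zero, so everything after the gap would land two samples early relative to the agent. Placing audio by timestamp keeps both sides where they actually occurred.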

MP3 output

The merged output is an MP3 file. MP3 is a standard format that plays in any modern browser or audio player. It is the file you hear when you press play in the call journey.

The merge step also produces a two-channel version for the transcription engine, Whisper, which uses it to keep speaker attribution clean. You do not see this file in the portal.
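A two-channel file keeps each speaker on their own channel inside a single file. The Python sketch below shows the general technique of interleaving two mono sample lists into stereo frames. The WAV container, the 8 kHz rate, and the function names are assumptions for illustration; the article does not state which format the internal two-channel file uses.

```python
import io
import struct
import wave


def interleave_stereo(left, right):
    """Return [L0, R0, L1, R1, ...], padding the shorter side with silence."""
    n = max(len(left), len(right))
    left = list(left) + [0] * (n - len(left))
    right = list(right) + [0] * (n - len(right))
    interleaved = []
    for l, r in zip(left, right):
        interleaved.extend((l, r))
    return interleaved


def write_stereo_wav(fileobj, left, right, sample_rate=8000):
    """Write two mono sample lists as one 16-bit stereo WAV stream."""
    samples = interleave_stereo(left, right)
    with wave.open(fileobj, "wb") as wav:
        wav.setnchannels(2)      # channel 0 = left speaker, channel 1 = right speaker
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


buffer = io.BytesIO()
write_stereo_wav(buffer, [1, 2, 3], [10, 20])
print(interleave_stereo([1, 2, 3], [10, 20]))  # [1, 10, 2, 20, 3, 0]
```

Because each channel holds exactly one speaker, a downstream tool can transcribe the channels separately and label the text by channel rather than guessing who is talking.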

Upload to cloud storage

Once merged, the MP3 is uploaded to Google Cloud Storage. Storage is encrypted at rest. Access to the file is controlled through the portal and the API. Raw storage URLs are not exposed.

The file is replicated across Google's storage infrastructure. A single hardware failure does not lose recordings. Retention is at least one year. Longer retention is available on request.

Ten-second turnaround

End-to-end, from the moment a caller hangs up to the moment the MP3 is playable in the portal, the typical turnaround is about 10 seconds. That includes the merge, the upload, and the portal refresh.

Transcription and AI summary follow on a separate pipeline and usually complete within another minute or two. See AI call summaries for how transcription works after the recording lands.