Voice agents need to feel natural. That means responding fast — the less time between a user finishing their sentence and hearing the agent reply, the better. In this guide, we'll build a real-time voice agent using murmr's WebSocket API.
Why WebSocket for Voice?
For voice agents, HTTP-based streaming (SSE) adds unnecessary overhead. Each request requires a new connection, and you can't send new text until the previous response completes.
WebSocket provides:
- Persistent connection: No connection overhead per message
- Bidirectional communication: Send text while audio is still playing
- Text buffering: Stream LLM tokens directly — the server assembles them into natural phrases
- Multiple generations: One connection supports many text→audio cycles
| Method | Best For |
|--------|----------|
| HTTP Batch | Generating complete audio files |
| SSE Streaming | Progressive playback, simple integration |
| WebSocket | Voice agents, real-time LLM integration |
Architecture Overview
A typical voice agent architecture looks like this:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Browser │ │ Your App │ │ murmr │
│ (mic/spk) │────▶│ Server │────▶│ WebSocket │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
│ 1. Audio input │ │
│ ─────────────▶ │ │
│ │ 2. Transcribe │
│ │ + LLM response │
│ │ │
│ │ 3. Send text │
│ │ ─────────────────▶│
│ │ │
│ │ 4. Stream audio │
│ 5. Play audio │◀───────────────── │
│ ◀────────────── │ │
│ │ │
The key insight: your backend maintains a persistent WebSocket connection to murmr, and streams audio back to the browser as it arrives. Voice configuration happens once at connection time, then you just send text.
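The relay loop in steps 1–5 can be sketched as a small pipeline. All of the I/O is injected so the flow itself stays testable; `transcribe`, `streamLLM`, `sendText`, and `flush` are stand-in names for your own STT call, LLM stream, and murmr WebSocket sends, not murmr APIs:

```javascript
// Hypothetical relay pipeline for one user turn. The dependency names
// (transcribe, streamLLM, sendText, flush) are placeholders for your own
// STT, LLM, and murmr-send functions.
async function relayTurn(audioBuffer, { transcribe, streamLLM, sendText, flush }) {
  const userText = await transcribe(audioBuffer);   // step 2: speech-to-text
  for await (const token of streamLLM(userText)) {  // step 2: LLM stream
    sendText(token);                                // step 3: forward tokens to murmr
  }
  flush();                                          // force any buffered text to audio
  return userText;
}
```

Keeping the turn logic pure like this means the audio forwarding (steps 4–5) can live in a single `message` handler on the murmr socket, independent of the per-turn flow.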
Getting Started
First, get your API key from the murmr dashboard. The WebSocket endpoint is:
wss://api.murmr.dev/v1/realtime
Plan requirements
WebSocket access requires the Realtime plan ($49/mo) or Scale plan ($99/mo). SSE streaming is available on all plans including Free.
Implementation
murmr's WebSocket protocol has three phases: connect, configure, and send text.
import WebSocket from 'ws';
const ws = new WebSocket('wss://api.murmr.dev/v1/realtime');
ws.on('open', () => {
// Step 1: Send config with API key and voice
ws.send(JSON.stringify({
type: 'config',
api_key: process.env.MURMR_API_KEY,
voice_description: 'A warm, friendly assistant with clear enunciation',
language: 'English'
}));
});
ws.on('message', (data) => {
const msg = JSON.parse(data.toString());
switch (msg.type) {
case 'config_ack':
console.log('Connected, session:', msg.session_id);
// Step 2: Send text to synthesize
ws.send(JSON.stringify({
type: 'text',
text: 'Hello! How can I help you today?'
}));
// Flush to generate immediately
ws.send(JSON.stringify({ type: 'flush' }));
break;
    case 'audio': {
      // msg.chunk is base64-encoded PCM (24kHz, 16-bit, mono)
      const pcmBuffer = Buffer.from(msg.chunk, 'base64');
      // Forward to client or play locally
      break;
    }
case 'done':
console.log(`Generated in ${msg.duration_ms}ms`);
break;
case 'error':
console.error('Error:', msg.message);
break;
}
});
Key Protocol Details
- Auth happens in `config`, not HTTP headers. Send your `api_key` in the first message.
- Voice is set once per connection via `voice_description` (for VoiceDesign) or `voice_clone_prompt` (for saved voices).
- Text messages are buffered server-side and generated at natural sentence/clause boundaries.
- `flush` forces generation of any remaining buffered text.
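Since only three message shapes cross the wire, it can help to build them in one place. These helper names are our own, not part of any murmr SDK:

```javascript
// Helpers (our own names, not a murmr SDK) that build the three message
// types the protocol uses. Centralizing them avoids typos in the `type`
// field across a larger codebase.
const configMsg = (apiKey, voiceDescription, language = 'English') =>
  JSON.stringify({
    type: 'config',
    api_key: apiKey,
    voice_description: voiceDescription,
    language
  });

const textMsg = (text) => JSON.stringify({ type: 'text', text });

const flushMsg = () => JSON.stringify({ type: 'flush' });
```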
Handling Audio Chunks
Audio arrives as base64-encoded PCM (24kHz, 16-bit, mono). Buffer and play it smoothly in the browser:
// Create audio context
const audioContext = new AudioContext({ sampleRate: 24000 });
const audioQueue = [];
let isPlaying = false;
// Buffer incoming audio
function queueAudio(base64Audio) {
const pcmData = atob(base64Audio);
const samples = new Int16Array(pcmData.length / 2);
for (let i = 0; i < samples.length; i++) {
samples[i] = pcmData.charCodeAt(i * 2) |
(pcmData.charCodeAt(i * 2 + 1) << 8);
}
// Convert to float32 for Web Audio
const float32 = new Float32Array(samples.length);
for (let i = 0; i < samples.length; i++) {
float32[i] = samples[i] / 32768;
}
audioQueue.push(float32);
if (!isPlaying) playNext();
}
// Play audio chunks sequentially
function playNext() {
if (audioQueue.length === 0) {
isPlaying = false;
return;
}
isPlaying = true;
const samples = audioQueue.shift();
const buffer = audioContext.createBuffer(1, samples.length, 24000);
buffer.copyToChannel(samples, 0);
const source = audioContext.createBufferSource();
source.buffer = buffer;
source.connect(audioContext.destination);
source.onended = () => playNext();
source.start();
}
Binary mode
For even lower overhead, enable binary mode after `config_ack` by sending `{"type": "binary_mode"}`. Audio then arrives as raw binary WebSocket frames — no base64 decoding needed, saving ~50-100ms per chunk.
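With binary mode on, the same `message` handler receives a mix of raw binary audio frames and JSON control frames (`done`, `error`). The `ws` library passes an `isBinary` flag to message handlers, which makes the dispatch straightforward; a sketch:

```javascript
// With binary mode enabled, audio arrives as raw binary frames while
// control messages remain JSON text frames. The `ws` library's message
// handler receives (data, isBinary), so dispatch on that flag.
function dispatchFrame(data, isBinary, { onAudio, onControl }) {
  if (isBinary) {
    onAudio(data); // raw PCM Buffer — no base64 decode needed
  } else {
    onControl(JSON.parse(data.toString()));
  }
}

// Wiring it up: ws.on('message', (data, isBinary) =>
//   dispatchFrame(data, isBinary, { onAudio: forwardToBrowser, onControl: handleMsg }));
```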
LLM Integration: Streaming Tokens
The real power of WebSocket TTS is streaming LLM tokens directly. murmr's text buffering handles the assembly — you don't need to wait for complete sentences:
// Stream LLM tokens directly to murmr
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: userTranscription }],
stream: true
});
for await (const chunk of stream) {
const token = chunk.choices[0]?.delta?.content;
if (token) {
// Send each token as it arrives
ws.send(JSON.stringify({ type: 'text', text: token }));
}
}
// Done streaming — flush any remaining text
ws.send(JSON.stringify({ type: 'flush' }));
The server buffers tokens and generates audio at natural boundaries (sentence ends, clause breaks, or every 200 characters). This means audio starts playing while the LLM is still generating.
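For intuition, the boundary rule described above can be mirrored in a toy client-side version: accumulate tokens, emit at sentence-ending punctuation or once roughly 200 characters pile up. This only approximates what murmr's server does; the real logic is more nuanced:

```javascript
// Toy illustration of the buffering rule described above. Accumulates
// tokens and emits a chunk at sentence-ending punctuation or once
// maxChars pile up. An approximation, not murmr's actual algorithm.
function makeTokenBuffer(emit, maxChars = 200) {
  let buf = '';
  return {
    push(token) {
      buf += token;
      if (/[.!?]\s*$/.test(buf) || buf.length >= maxChars) {
        emit(buf);
        buf = '';
      }
    },
    flush() {
      // Mirrors the `flush` message: emit whatever remains
      if (buf) { emit(buf); buf = ''; }
    }
  };
}
```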
Production Tips
Connection handling
WebSocket connections can drop due to network issues. Always implement reconnection logic with exponential backoff.
1. Audio buffering strategy
Start playback after receiving 2-3 chunks (~100-150ms of audio) to handle network jitter. Too much buffering adds latency; too little causes choppy playback.
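To size that buffer, compute each chunk's duration from the PCM parameters: 24,000 samples/s at 2 bytes per sample is 48,000 bytes per second of audio.

```javascript
// Duration in milliseconds of a PCM chunk at 24kHz, 16-bit (2 bytes per
// sample), mono: bytes → samples → seconds → ms.
function pcmChunkMs(byteLength, sampleRate = 24000, bytesPerSample = 2) {
  return (byteLength / bytesPerSample / sampleRate) * 1000;
}
// e.g. a 2400-byte chunk is 50 ms of audio, so buffering two or three
// such chunks before playback gives the ~100-150ms cushion described above.
```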
2. Voice consistency
Save your VoiceDesign voice via the Voice Management API and use the saved voice_clone_prompt in your config message. This avoids re-generating the voice each connection.
3. Connection lifecycle
Each WebSocket connection supports multiple text→audio cycles. Send text, receive audio + done, then send more text — the voice configuration persists for the entire session. Create one connection per user session, not per utterance.
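One way to enforce "one connection per user session" is a small session pool that creates connections lazily and reuses them. `createConnection` here stands in for your own connect-and-configure routine:

```javascript
// One murmr connection per user session, created lazily and reused across
// turns. createConnection is a stand-in for your connect + config routine.
function makeSessionPool(createConnection) {
  const pool = new Map();
  return {
    get(sessionId) {
      if (!pool.has(sessionId)) {
        pool.set(sessionId, createConnection(sessionId));
      }
      return pool.get(sessionId);
    },
    close(sessionId) {
      // Caller is responsible for actually closing the returned socket
      const conn = pool.get(sessionId);
      pool.delete(sessionId);
      return conn;
    }
  };
}
```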
4. Reconnection with exponential backoff
let retryCount = 0;
function connectWithRetry() {
const ws = new WebSocket('wss://api.murmr.dev/v1/realtime');
ws.on('open', () => {
retryCount = 0;
// Re-send the config message here — voice settings don't survive reconnects
});
ws.on('close', (code) => {
if (code !== 1000) {
const delay = Math.min(1000 * Math.pow(2, retryCount), 30000);
retryCount++;
setTimeout(connectWithRetry, delay);
}
});
return ws;
}
Full Example
Here's a complete voice agent with OpenAI for transcription and LLM:
import WebSocket from 'ws';
import OpenAI from 'openai';
const openai = new OpenAI();
let murmrWs = null;
async function initializeMurmr() {
murmrWs = new WebSocket('wss://api.murmr.dev/v1/realtime');
return new Promise((resolve, reject) => {
murmrWs.on('open', () => {
// Configure voice for the session
murmrWs.send(JSON.stringify({
type: 'config',
api_key: process.env.MURMR_API_KEY,
voice_description: 'A helpful, friendly assistant',
language: 'English'
}));
});
murmrWs.on('message', (data) => {
const msg = JSON.parse(data.toString());
if (msg.type === 'config_ack') resolve();
if (msg.type === 'error') reject(new Error(msg.message));
});
});
}
async function processUserAudio(audioBuffer) {
// 1. Transcribe user speech
// Note: the OpenAI SDK expects a File-like object here; wrap a raw
// Buffer with toFile() from the 'openai' package if needed.
const transcription = await openai.audio.transcriptions.create({
model: 'whisper-1',
file: audioBuffer,
response_format: 'text'
});
console.log('User said:', transcription);
// 2. Stream LLM response tokens directly to murmr
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: 'You are a helpful voice assistant. Keep responses concise.' },
{ role: 'user', content: transcription }
],
stream: true,
max_tokens: 150
});
for await (const chunk of stream) {
const token = chunk.choices[0]?.delta?.content;
if (token) {
murmrWs.send(JSON.stringify({ type: 'text', text: token }));
}
}
// 3. Flush remaining text
murmrWs.send(JSON.stringify({ type: 'flush' }));
}
// Forward audio to the browser. `clientWs` is assumed to be the browser's
// WebSocket connection, established elsewhere in your server. Register this
// handler after initializeMurmr() resolves, since murmrWs is null until then.
await initializeMurmr();
murmrWs.on('message', (data) => {
const msg = JSON.parse(data.toString());
if (msg.type === 'audio') {
clientWs.send(msg.chunk); // Forward base64 PCM to browser
}
});
console.log('Voice agent ready');
Need help? Check out the WebSocket protocol reference for the complete message spec and close codes.
Ready to build? Start with the free tier using SSE streaming, then upgrade to Realtime when you need WebSocket integration.