
Building Real-time Voice Agents with WebSocket Streaming

Learn how to build low-latency voice agents using murmr's WebSocket streaming API. Includes architecture patterns, code examples, and production tips.

murmr Team
February 1, 2026 · 7 min read · Updated February 17, 2026
#websocket #streaming #voice-agent #real-time #tutorial

Voice agents need to feel natural. That means responding fast — the less time between a user finishing their sentence and hearing the agent reply, the better. In this guide, we'll build a real-time voice agent using murmr's WebSocket API.

Why WebSocket for Voice?

For voice agents, HTTP-based streaming (SSE) adds unnecessary overhead. Each request requires a new connection, and you can't send new text until the previous response completes.

WebSocket provides:

  • Persistent connection: No connection overhead per message
  • Bidirectional communication: Send text while audio is still playing
  • Text buffering: Stream LLM tokens directly — the server assembles them into natural phrases
  • Multiple generations: One connection supports many text→audio cycles

| Method | Best For |
|--------|----------|
| HTTP Batch | Generating complete audio files |
| SSE Streaming | Progressive playback, simple integration |
| WebSocket | Voice agents, real-time LLM integration |

Architecture Overview

A typical voice agent architecture looks like this:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Browser   │     │  Your App   │     │    murmr    │
│  (mic/spk)  │────▶│   Server    │────▶│  WebSocket  │
└─────────────┘     └─────────────┘     └─────────────┘
      │                   │                    │
      │  1. Audio input   │                    │
      │  ─────────────▶   │                    │
      │                   │  2. Transcribe     │
      │                   │  + LLM response    │
      │                   │                    │
      │                   │  3. Send text      │
      │                   │  ─────────────────▶│
      │                   │                    │
      │                   │  4. Stream audio   │
      │  5. Play audio    │◀─────────────────  │
      │ ◀──────────────   │                    │
      │                   │                    │

The key insight: your backend maintains a persistent WebSocket connection to murmr, and streams audio back to the browser as it arrives. Voice configuration happens once at connection time, then you just send text.

Getting Started

First, get your API key from the murmr dashboard. The WebSocket endpoint is:

wss://api.murmr.dev/v1/realtime

Plan requirements

WebSocket access requires the Realtime plan ($49/mo) or Scale plan ($99/mo). SSE streaming is available on all plans including Free.

Implementation

murmr's WebSocket protocol has three phases: connect, configure, and send text.

Connect and configure
import WebSocket from 'ws';

const ws = new WebSocket('wss://api.murmr.dev/v1/realtime');

ws.on('open', () => {
  // Step 1: Send config with API key and voice
  ws.send(JSON.stringify({
    type: 'config',
    api_key: process.env.MURMR_API_KEY,
    voice_description: 'A warm, friendly assistant with clear enunciation',
    language: 'English'
  }));
});

ws.on('message', (data) => {
  const msg = JSON.parse(data.toString());

  switch (msg.type) {
    case 'config_ack':
      console.log('Connected, session:', msg.session_id);
      // Step 2: Send text to synthesize
      ws.send(JSON.stringify({
        type: 'text',
        text: 'Hello! How can I help you today?'
      }));
      // Flush to generate immediately
      ws.send(JSON.stringify({ type: 'flush' }));
      break;

    case 'audio': {
      // msg.chunk is base64-encoded PCM (24kHz, 16-bit, mono)
      const pcmBuffer = Buffer.from(msg.chunk, 'base64');
      // Forward to client or play locally
      break;
    }

    case 'done':
      console.log(`Generated in ${msg.duration_ms}ms`);
      break;

    case 'error':
      console.error('Error:', msg.message);
      break;
  }
});

Key Protocol Details

  • Auth happens in config, not HTTP headers. Send your api_key in the first message.
  • Voice is set once per connection via voice_description (for VoiceDesign) or voice_clone_prompt (for saved voices).
  • Text messages are buffered server-side and generated at natural sentence/clause boundaries.
  • flush forces generation of any remaining buffered text.

Handling Audio Chunks

Audio arrives as base64-encoded PCM (24kHz, 16-bit, mono). Buffer and play it smoothly in the browser:

Audio playback with buffering
// Create audio context
const audioContext = new AudioContext({ sampleRate: 24000 });
const audioQueue = [];
let isPlaying = false;

// Buffer incoming audio
function queueAudio(base64Audio) {
  const pcmData = atob(base64Audio);
  const samples = new Int16Array(pcmData.length / 2);

  for (let i = 0; i < samples.length; i++) {
    samples[i] = pcmData.charCodeAt(i * 2) |
                 (pcmData.charCodeAt(i * 2 + 1) << 8);
  }

  // Convert to float32 for Web Audio
  const float32 = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    float32[i] = samples[i] / 32768;
  }

  audioQueue.push(float32);
  if (!isPlaying) playNext();
}

// Play audio chunks sequentially
function playNext() {
  if (audioQueue.length === 0) {
    isPlaying = false;
    return;
  }

  isPlaying = true;
  const samples = audioQueue.shift();

  const buffer = audioContext.createBuffer(1, samples.length, 24000);
  buffer.copyToChannel(samples, 0);

  const source = audioContext.createBufferSource();
  source.buffer = buffer;
  source.connect(audioContext.destination);
  source.onended = () => playNext();
  source.start();
}
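
One browser gotcha with the playback code above: autoplay policies start a new AudioContext in the suspended state until a user gesture, so the queue can fill while nothing plays. A small guard (a sketch; `ctx` is your AudioContext):

```javascript
// Browsers suspend a fresh AudioContext until a user gesture; resume
// it before playback or the queued chunks stay silent.
function ensureAudioRunning(ctx) {
  if (ctx.state === 'suspended') {
    return ctx.resume();  // returns a Promise
  }
  return Promise.resolve();
}

// Call it from a user-initiated handler, e.g.:
// startButton.addEventListener('click', () => ensureAudioRunning(audioContext));
```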

Binary mode

For even lower overhead, enable binary mode after config_ack by sending {"type": "binary_mode"}. Audio then arrives as raw binary WebSocket frames — no base64 decoding needed, saving ~50-100ms per chunk.
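
With binary mode on, your message handler has to tell raw audio frames apart from JSON control messages. A sketch assuming Node's `ws` client, which passes an `isBinary` flag to the message handler:

```javascript
// In binary mode, audio arrives as raw binary frames; control
// messages ('done', 'error', ...) are still JSON text frames.
function handleFrame(data, isBinary) {
  if (isBinary) {
    // Raw PCM bytes -- no base64 decoding step
    return { kind: 'audio', pcm: data };
  }
  const msg = JSON.parse(data.toString());
  return { kind: msg.type, msg };
}

// Wire-up with the `ws` package:
// ws.on('message', (data, isBinary) => handleFrame(data, isBinary));
```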

LLM Integration: Streaming Tokens

The real power of WebSocket TTS is streaming LLM tokens directly. murmr's text buffering handles the assembly — you don't need to wait for complete sentences:

// Stream LLM tokens directly to murmr
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: userTranscription }],
  stream: true
});

for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content;
  if (token) {
    // Send each token as it arrives
    ws.send(JSON.stringify({ type: 'text', text: token }));
  }
}

// Done streaming — flush any remaining text
ws.send(JSON.stringify({ type: 'flush' }));

The server buffers tokens and generates audio at natural boundaries (sentence ends, clause breaks, or every 200 characters). This means audio starts playing while the LLM is still generating.
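
To make that concrete, here is a rough illustration of the boundary rule (sentence end, clause break, or 200 buffered characters). It only mimics the server-side behavior; your client never needs to run it, it just streams tokens:

```javascript
// Rough illustration of the server's boundary rule: generate when
// buffered text ends a sentence or clause, or exceeds 200 characters.
// (Illustrative only; the real logic runs server-side.)
function hitsBoundary(buffered) {
  return /[.!?]["')\]]?\s*$/.test(buffered) ||  // sentence end
         /[,;:]\s*$/.test(buffered) ||          // clause break
         buffered.length >= 200;                // length cap
}
```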

Production Tips

Connection handling

WebSocket connections can drop due to network issues. Always implement reconnection logic with exponential backoff.

1. Audio buffering strategy

Start playback after receiving 2-3 chunks (~100-150ms of audio) to handle network jitter. Too much buffering adds latency; too little causes choppy playback.
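
That threshold can be sketched as a tiny jitter buffer (illustrative; the chunk count comes from the tip above):

```javascript
// Hold the first few chunks before starting playback, then drain
// continuously until the queue empties again.
class JitterBuffer {
  constructor(minChunks = 3) {
    this.minChunks = minChunks;
    this.queue = [];
    this.started = false;
  }
  push(chunk) {
    this.queue.push(chunk);
    if (!this.started && this.queue.length >= this.minChunks) {
      this.started = true;  // ~100-150ms buffered; safe to begin
    }
    return this.started;    // caller starts playback when this flips true
  }
  next() {
    return this.started ? this.queue.shift() : null;
  }
}
```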

2. Voice consistency

Save your VoiceDesign voice via the Voice Management API and use the saved voice_clone_prompt in your config message. This avoids re-generating the voice on each connection.
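
For example, a config message using a saved voice might look like this (a sketch; the voice_clone_prompt value is a placeholder for whatever the Voice Management API returns):

```javascript
// Placeholder for a prompt retrieved from the Voice Management API
const savedVoice = { clone_prompt: '<saved-voice-prompt>' };

const config = {
  type: 'config',
  api_key: process.env.MURMR_API_KEY,
  voice_clone_prompt: savedVoice.clone_prompt,  // instead of voice_description
  language: 'English'
};

// ws.send(JSON.stringify(config));
```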

3. Connection lifecycle

Each WebSocket connection supports multiple text→audio cycles. Send text, receive audio + done, then send more text — the voice configuration persists for the entire session. Create one connection per user session, not per utterance.
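
A small helper makes that lifecycle explicit (a sketch; assumes `ws` is an open connection that has already received config_ack):

```javascript
// Queue one utterance on an already-configured connection:
// text message(s) followed by a flush.
function speak(ws, text) {
  ws.send(JSON.stringify({ type: 'text', text }));
  ws.send(JSON.stringify({ type: 'flush' }));
}

// Later in the same session, no reconfiguration needed:
// speak(ws, 'Anything else I can help with?');
```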

4. Reconnection with exponential backoff

let retryCount = 0;

function connectWithRetry() {
  const ws = new WebSocket('wss://api.murmr.dev/v1/realtime');

  ws.on('open', () => {
    retryCount = 0;
    // Send config...
  });

  ws.on('close', (code) => {
    if (code !== 1000) {
      const delay = Math.min(1000 * Math.pow(2, retryCount), 30000);
      retryCount++;
      setTimeout(connectWithRetry, delay);
    }
  });

  return ws;
}

Full Example

Here's a complete voice agent with OpenAI for transcription and LLM:

Complete voice agent
import WebSocket from 'ws';
import OpenAI from 'openai';

const openai = new OpenAI();
let murmrWs = null;

async function initializeMurmr() {
  murmrWs = new WebSocket('wss://api.murmr.dev/v1/realtime');

  return new Promise((resolve, reject) => {
    murmrWs.on('open', () => {
      // Configure voice for the session
      murmrWs.send(JSON.stringify({
        type: 'config',
        api_key: process.env.MURMR_API_KEY,
        voice_description: 'A helpful, friendly assistant',
        language: 'English'
      }));
    });

    murmrWs.on('message', (data) => {
      const msg = JSON.parse(data.toString());
      if (msg.type === 'config_ack') resolve();
      if (msg.type === 'error') reject(new Error(msg.message));
    });
  });
}

async function processUserAudio(audioFile) {
  // 1. Transcribe user speech (the SDK expects a File or
  //    fs.ReadStream here, not a raw Buffer)
  const transcription = await openai.audio.transcriptions.create({
    model: 'whisper-1',
    file: audioFile,
    response_format: 'text'
  });

  console.log('User said:', transcription);

  // 2. Stream LLM response tokens directly to murmr
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a helpful voice assistant. Keep responses concise.' },
      { role: 'user', content: transcription }
    ],
    stream: true,
    max_tokens: 150
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) {
      murmrWs.send(JSON.stringify({ type: 'text', text: token }));
    }
  }

  // 3. Flush remaining text
  murmrWs.send(JSON.stringify({ type: 'flush' }));
}

await initializeMurmr();

// Forward generated audio to the browser. `clientWs` is your
// browser-facing WebSocket connection (not shown here).
murmrWs.on('message', (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.type === 'audio') {
    clientWs.send(msg.chunk);  // Forward base64 PCM to browser
  }
});

console.log('Voice agent ready');

Need help? Check out the WebSocket protocol reference for the complete message spec and close codes.

Ready to build? Start with the free tier using SSE streaming, then upgrade to Realtime when you need WebSocket integration.


murmr Team

Engineering

Building the next generation of multilingual text-to-speech.
