wss://api.murmr.dev/v1/realtime
Voice Agents
Build conversational voice agents by streaming LLM tokens directly into murmr's WebSocket API. Text goes in and audio comes out, with the first audio chunk typically generated about 550ms after text arrives.
Architecture
A voice agent pipes LLM output tokens into murmr as they arrive. murmr buffers text to natural boundaries, generates speech, and streams audio back to the client for playback.
```
 User speaks
     │
     ▼
┌─────────┐     ┌──────────┐     ┌───────────────┐     ┌──────────┐
│  STT /  │────▶│   LLM    │────▶│   murmr WS    │────▶│  Audio   │
│  Input  │     │ (stream) │     │ /v1/realtime  │     │ Playback │
└─────────┘     └──────────┘     └───────────────┘     └──────────┘
                           tokens              audio chunks
                           as they             as they're
                           arrive              generated
```
Plan requirement
WebSocket access requires the Realtime ($49/mo) or Scale ($99/mo) plan. See Pricing.
Integration Example
This example connects an OpenAI chat completion stream to murmr's WebSocket. As the LLM generates tokens, they're forwarded to murmr for speech synthesis.
```typescript
import OpenAI from "openai";
import WebSocket from "ws";

const openai = new OpenAI();

// Placeholder: hand decoded PCM to your audio output (not part of murmr).
function playAudio(pcm: Buffer): void {
  // Application-specific playback goes here.
}

// 1. Connect to murmr WebSocket
const ws = new WebSocket("wss://api.murmr.dev/v1/realtime");

ws.on("open", () => {
  // 2. Send config (auth + voice setup)
  ws.send(JSON.stringify({
    type: "config",
    api_key: process.env.MURMR_API_KEY,
    voice_description: "A calm, professional female voice",
    language: "English",
  }));
});

ws.on("message", (data) => {
  const msg = JSON.parse(data.toString());

  if (msg.type === "config_ack") {
    // 3. Start LLM stream after config acknowledged
    streamLLMResponse(ws);
  }

  if (msg.type === "audio") {
    // 5. Play audio chunk (base64 PCM, 24kHz mono 16-bit)
    const pcm = Buffer.from(msg.chunk, "base64");
    playAudio(pcm);
  }

  if (msg.type === "done") {
    console.log(`Audio complete: ${msg.duration_ms}ms`);
  }
});

// Log connection-level failures; an unhandled "error" event would throw.
ws.on("error", (err) => {
  console.error("murmr WebSocket error:", err);
});

async function streamLLMResponse(ws: WebSocket) {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "Explain quantum computing briefly" }],
    stream: true,
  });

  // 4. Forward each token to murmr
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) {
      ws.send(JSON.stringify({ type: "text", text: token }));
    }
  }

  // Signal end of text
  ws.send(JSON.stringify({ type: "flush" }));
}
```
Text Buffering
murmr doesn't generate audio for every token. It buffers incoming text and triggers generation at natural speech boundaries for optimal quality.
| Rule | Condition | Behavior |
|---|---|---|
| Force flush | Buffer ≥ 200 chars | Generates immediately at best available boundary |
| Sentence flush | Buffer ≥ 50 chars + sentence end (.!?) | Generates at sentence boundary |
| Clause flush | Buffer ≥ 50 chars + clause end (,;:) | Generates at clause boundary |
| Explicit flush | Client sends {"type":"flush"} | Generates all buffered text immediately |
This means you can send tokens one at a time — murmr accumulates them and generates speech when it has a meaningful phrase. You don't need to batch tokens yourself.
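The rules in the table can be modelled client-side to reason about when audio will start. The sketch below is an illustration only, not murmr's server code: the function names are hypothetical, it assumes the boundary is checked at the end of the buffer (where it appears when tokens arrive one at a time), and it simplifies the force flush by emitting the whole buffer rather than cutting at the best available boundary.

```typescript
// Illustrative model of the documented flush rules (not murmr's server code).
const BOUNDARY_MIN_CHARS = 50; // sentence and clause rules apply from this size
const FORCE_FLUSH_CHARS = 200; // force flush applies from this size

type FlushRule = "force" | "sentence" | "clause" | "none";

// Assumption: the boundary is checked at the end of the buffer.
function flushRuleFor(buffer: string): FlushRule {
  if (buffer.length >= FORCE_FLUSH_CHARS) return "force";
  if (buffer.length >= BOUNDARY_MIN_CHARS) {
    if (/[.!?]\s*$/.test(buffer)) return "sentence";
    if (/[,;:]\s*$/.test(buffer)) return "clause";
  }
  return "none";
}

// Feed tokens one by one and collect the segments that would be generated.
function simulateBuffering(tokens: string[]): { segments: string[]; leftover: string } {
  const segments: string[] = [];
  let buffer = "";
  for (const token of tokens) {
    buffer += token;
    if (flushRuleFor(buffer) !== "none") {
      segments.push(buffer);
      buffer = "";
    }
  }
  // Anything left here would need an explicit {"type":"flush"} to be spoken.
  return { segments, leftover: buffer };
}
```

In this model, `simulateBuffering(["Thank", " you"])` produces no segments and leaves "Thank you" in `leftover`, which is the case an explicit flush exists to handle.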
When to flush
Send {"type":"flush"} when the LLM finishes its response. This ensures the final words are spoken even if they don't end with punctuation. Without a flush, trailing text like “Thank you” (no period, under 50 chars) stays buffered.
Binary Mode
By default, audio chunks are base64-encoded inside JSON messages. Binary mode sends raw PCM frames as WebSocket binary messages instead, saving ~50ms per chunk from encoding overhead.
```typescript
// Opt into binary mode after config_ack
ws.send(JSON.stringify({ type: "binary_mode" }));

// Server responds with:
// { type: "binary_mode_ack", sample_rate: 24000, format: "pcm_s16le" }

// Audio now arrives as raw binary frames instead of JSON
ws.on("message", (data, isBinary) => {
  if (isBinary) {
    // Raw PCM: 24kHz, mono, 16-bit signed little-endian
    playAudio(data as Buffer);
  } else {
    // JSON messages (done, error, pong) still arrive as text
    const msg = JSON.parse(data.toString());
    // ...
  }
});
```
When to use binary mode
Use binary mode when latency matters most — voice agents, interactive demos, real-time conversations. The JSON mode is fine for applications where you need metadata with each chunk (chunk index, latency metrics).
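Many playback APIs, including the Web Audio API, expect floating-point samples rather than raw 16-bit PCM. The following is a minimal sketch of that conversion for the documented `pcm_s16le` format at 24kHz mono. The helper names are hypothetical, and it assumes each frame is sample-aligned, so a trailing odd byte is ignored.

```typescript
// Convert 16-bit signed little-endian PCM bytes to floats in [-1, 1).
// Accepts any Uint8Array, which includes Node's Buffer.
function pcm16leToFloat32(bytes: Uint8Array): Float32Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const sampleCount = Math.floor(bytes.byteLength / 2); // 2 bytes per sample
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    out[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
  }
  return out;
}

// Playback duration of a mono 16-bit chunk at the given sample rate.
function chunkDurationMs(byteLength: number, sampleRate = 24000): number {
  return (byteLength / 2 / sampleRate) * 1000;
}
```

At 24kHz mono 16-bit, one second of audio is 48,000 bytes, so `chunkDurationMs` is a quick way to track how much audio has been queued for playback.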
Parallel Auth Flow
murmr uses a parallel authentication flow to minimize startup latency. When you send the config message, the server:
- Immediately sends config_ack, so you can start sending text right away
- Validates your API key in the background (~200ms)
- If text arrives before auth completes, it's queued and processed as soon as auth succeeds
- If auth fails, the connection closes with code 4002
This saves ~200ms compared to waiting for auth before sending text. Your LLM can start generating immediately after config_ack.
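Because config_ack is sent before the key is validated, receiving it does not prove the key is valid; an auth failure only shows up as a close with code 4002. A client can branch on the close code as sketched below. Only 4002 comes from this page. Code 1000 is the standard WebSocket normal-closure code, and treating every other code as retryable is an assumption made for illustration; the function and type names are hypothetical.

```typescript
const AUTH_FAILED_CLOSE_CODE = 4002; // documented above
const NORMAL_CLOSE_CODE = 1000; // standard WebSocket normal closure

type CloseAction = "fix_credentials" | "normal" | "retry";

// Decide what to do when the connection closes.
function actionForClose(code: number): CloseAction {
  if (code === AUTH_FAILED_CLOSE_CODE) return "fix_credentials"; // retrying will not help
  if (code === NORMAL_CLOSE_CODE) return "normal";
  return "retry"; // assumption: other closes are transient
}

// Wiring with the `ws` package would look like:
//   ws.on("close", (code) => { const action = actionForClose(code); /* ... */ });
```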
Handling Interruptions
In a conversational agent, the user may interrupt while audio is still playing. Handle this client-side:
```typescript
// When user starts speaking (interrupt detected):

// 1. Stop audio playback
audioPlayer.stop();

// 2. Close the current WebSocket connection
ws.close();

// 3. Open a new connection for the next response
const newWs = new WebSocket("wss://api.murmr.dev/v1/realtime");
// ... configure and start new LLM stream
```
Each WebSocket connection handles one conversation turn. When interrupted, close and reconnect. The server cancels any in-progress generation on disconnect.
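The stop, close, and reconnect steps can be wrapped in a small helper so that each turn owns exactly one connection. This is a sketch under stated assumptions: `TurnManager`, `TurnSocket`, and `Stoppable` are hypothetical names, and the connection factory is injected so the class does not depend on a particular WebSocket library. The `isCurrent` check is an extra safeguard for dropping audio messages that arrive from a connection that was already interrupted.

```typescript
// Minimal shapes the manager needs; a `ws` WebSocket satisfies TurnSocket.
interface TurnSocket {
  close(): void;
}

interface Stoppable {
  stop(): void;
}

class TurnManager<S extends TurnSocket> {
  private current: S | null = null;
  private readonly connect: () => S;
  private readonly player: Stoppable;

  constructor(connect: () => S, player: Stoppable) {
    this.connect = connect;
    this.player = player;
  }

  // Begin a new turn, ending the previous one if it is still active.
  startTurn(): S {
    this.interrupt();
    this.current = this.connect();
    return this.current;
  }

  // Stop playback and close the active connection, if any.
  interrupt(): void {
    this.player.stop();
    if (this.current) {
      this.current.close();
      this.current = null;
    }
  }

  // True only for the connection that belongs to the active turn.
  isCurrent(socket: S): boolean {
    return this.current === socket;
  }
}
```

A message handler can call `isCurrent(socket)` before playing a chunk, so audio from an interrupted turn is never queued behind the new response.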
Latency Expectations
| Metric | Typical Value | Notes |
|---|---|---|
| Server TTFC | ~550ms | Time from text received to first audio chunk generated |
| Client TTFC | ~600–700ms | Includes network round-trip |
| Binary mode savings | ~50ms | Skips base64 encoding overhead |
| Auth overhead | ~0ms | Parallel auth — no blocking wait |
| Subsequent chunks | ~80ms apart | Continuous generation after first chunk |
Total voice agent latency = LLM TTFT + murmr server TTFC + network. With a fast LLM (~300ms TTFT) and murmr (~600ms client TTFC, network included), expect ~900ms from LLM request to first audio, just under the 1-second threshold for natural conversation. Any STT time before the LLM request is extra.
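The budget can be worked through with the typical values from the table. In this sketch the field names are hypothetical, and the network figures (50ms and 150ms) are assumptions taken from the gap between the server TTFC (~550ms) and the client TTFC range (~600–700ms).

```typescript
interface LatencyBudget {
  llmTtftMs: number; // LLM time to first token
  murmrServerTtfcMs: number; // murmr server time to first audio chunk
  networkMs: number; // network round-trip between client and murmr
}

// Sum of the three terms in the formula above.
function firstAudioLatencyMs(budget: LatencyBudget): number {
  return budget.llmTtftMs + budget.murmrServerTtfcMs + budget.networkMs;
}

// Best case from the table: 300 + 550 + 50 = 900
const bestCase = firstAudioLatencyMs({ llmTtftMs: 300, murmrServerTtfcMs: 550, networkMs: 50 });

// Slower network: 300 + 550 + 150 = 1000
const slowNetwork = firstAudioLatencyMs({ llmTtftMs: 300, murmrServerTtfcMs: 550, networkMs: 150 });
```

The second case shows how little headroom there is: a slower network or LLM pushes the total to the 1-second mark.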
Best Practices
Use saved voices for consistency
Pass voice (saved voice ID) or voice_clone_prompt in the config message instead of voice_description. Saved voices produce consistent audio across turns, whereas a voice generated from voice_description (VoiceDesign) may vary slightly each time.
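A config message using a saved voice might be built as follows. This is a sketch, not the full config schema: the `voice`, `api_key`, `language`, and `type` fields come from this page, while the helper name, the example voice ID, and the assumption that `language` is still accepted alongside `voice` are illustrative.

```typescript
interface SavedVoiceConfig {
  type: "config";
  api_key: string;
  voice: string; // saved voice ID
  language: string;
}

// Build a config message that references a saved voice instead of a description.
function buildSavedVoiceConfig(
  apiKey: string,
  voiceId: string,
  language = "English",
): SavedVoiceConfig {
  return { type: "config", api_key: apiKey, voice: voiceId, language };
}

// Hypothetical voice ID; use the ID returned when you saved the voice.
const configMessage = JSON.stringify(buildSavedVoiceConfig("YOUR_API_KEY", "voice_abc123"));
```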
Send tokens immediately
Don't wait for complete sentences from the LLM. Send each token as it arrives. murmr's text buffer handles accumulation and boundary detection.
Always send a final flush
After the LLM stream ends, send {"type":"flush"} to ensure any remaining buffered text is generated. Without this, the last partial sentence may never be spoken.
Monitor the done message
The done message includes first_chunk_latency_ms and duration_ms. Log these to monitor performance in production.
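One way to act on this is to validate each done message and keep a running set of latency samples. The sketch below assumes only the two fields named above; `isDoneMessage` and `LatencyStats` are hypothetical helpers, and the percentile uses the nearest-rank method.

```typescript
interface DoneMessage {
  type: "done";
  first_chunk_latency_ms: number;
  duration_ms: number;
}

// Type guard for parsed JSON messages.
function isDoneMessage(msg: unknown): msg is DoneMessage {
  if (typeof msg !== "object" || msg === null) return false;
  const m = msg as Record<string, unknown>;
  return (
    m["type"] === "done" &&
    typeof m["first_chunk_latency_ms"] === "number" &&
    typeof m["duration_ms"] === "number"
  );
}

class LatencyStats {
  private samples: number[] = [];

  record(ms: number): void {
    this.samples.push(ms);
  }

  // Nearest-rank percentile; returns null when no samples were recorded.
  percentile(p: number): number | null {
    if (this.samples.length === 0) return null;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.min(sorted.length, Math.max(1, Math.ceil((p / 100) * sorted.length)));
    return sorted[rank - 1];
  }
}
```

In a message handler, call `stats.record(msg.first_chunk_latency_ms)` whenever `isDoneMessage(msg)` is true, then report `percentile(50)` and `percentile(95)` to your metrics system.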
See Also
- WebSocket Protocol — Complete message type reference
- JavaScript Client — Browser-side WebSocket with Web Audio API
- Voice Management — Save and manage consistent voices
- Rate Limits — WebSocket connection and generation limits