
How to Add Voice to Your Next.js App

Add text-to-speech to your Next.js app in under 50 lines. Stream audio from an API route using VoiceDesign or saved voices.

murmr Team
February 17, 2026 · 7 min read
#nextjs #tutorial #streaming #typescript #react

Next.js makes it simple to add voice to your application. In this tutorial, we'll build a text-to-speech feature with streaming audio playback — from API route to React component.

What We're Building

A Next.js app that:

  1. Takes text input from the user
  2. Sends it to murmr's VoiceDesign API via a server-side API route
  3. Streams audio back and plays it progressively in the browser

No audio files to manage, no pre-recorded clips. Just type text, describe a voice, and hear it speak.

Prerequisites

  • A Next.js 14+ project (App Router)
  • A murmr API key from the dashboard
  • Node.js 18+

Step 1: Environment Setup

Add your API key to .env.local:

text
MURMR_API_KEY=your_api_key_here

Never expose your API key

The API key goes in .env.local (no NEXT_PUBLIC_ prefix). It should only be accessible server-side, in API routes and Server Components.

Step 2: Create the API Route

The API route proxies requests to murmr and streams audio back to the client. This keeps your API key server-side and lets you add your own auth, rate limiting, or logging.

app/api/speak/route.ts
import { NextRequest } from 'next/server';

export async function POST(req: NextRequest) {
  const { text, voiceDescription, language } = await req.json();

  const response = await fetch(
    'https://api.murmr.dev/v1/voices/design/stream',
    {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.MURMR_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        text,
        voice_description: voiceDescription,
        language: language || 'en',
      }),
    }
  );

  if (!response.ok) {
    const error = await response.text();
    return new Response(error, { status: response.status });
  }

  // Forward the SSE stream directly to the client
  return new Response(response.body, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
    },
  });
}

The streaming variant forwards the SSE stream from murmr directly to the browser — no buffering on the server. The batch variant is simpler: it waits for the complete audio file and returns it as a WAV.
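A batch route handler can be sketched in the same proxy style. Note that the upstream endpoint URL (`/v1/speech`) and the `voice_id` field name below are assumptions for illustration, not confirmed murmr API details; check the API reference for the exact names.

```typescript
// Sketch of a batch variant (e.g. app/api/speak-batch/route.ts).
// Upstream URL and voice_id field are assumptions, not confirmed API details.
export async function POST(req: Request): Promise<Response> {
  const { text, voiceId } = await req.json();

  const upstream = await fetch('https://api.murmr.dev/v1/speech', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.MURMR_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ text, voice_id: voiceId }),
  });

  if (!upstream.ok) {
    return new Response(await upstream.text(), { status: upstream.status });
  }

  // No SSE headers needed: the batch endpoint returns a complete WAV file
  return new Response(upstream.body, {
    headers: { 'Content-Type': 'audio/wav' },
  });
}
```

Because the whole file is buffered upstream, the handler just forwards bytes with an `audio/wav` content type.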

Step 3: Build the React Component

Now the client-side component that captures input and plays audio.

Streaming Playback

For the VoiceDesign streaming endpoint, audio arrives as Server-Sent Events with base64-encoded PCM chunks. We decode and queue them for playback:

components/voice-player.tsx
'use client';

import { useState, useRef, useCallback } from 'react';

export function VoicePlayer() {
  const [text, setText] = useState('');
  const [description, setDescription] = useState(
    'A warm, friendly narrator with clear enunciation'
  );
  const [isPlaying, setIsPlaying] = useState(false);
  const audioContextRef = useRef<AudioContext | null>(null);
  const nextStartTimeRef = useRef(0);

  const speak = useCallback(async () => {
    setIsPlaying(true);

    // Initialize AudioContext on user gesture
    if (!audioContextRef.current) {
      audioContextRef.current = new AudioContext({ sampleRate: 24000 });
    }
    const ctx = audioContextRef.current;
    nextStartTimeRef.current = ctx.currentTime;

    const response = await fetch('/api/speak', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        text,
        voiceDescription: description,
        language: 'en',
      }),
    });

    if (!response.ok || !response.body) {
      // Don't leave the button stuck on "Speaking..." if the request fails
      setIsPlaying(false);
      return;
    }

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() || '';

      for (const line of lines) {
        if (!line.startsWith('data: ')) continue;
        const data = JSON.parse(line.slice(6));

        if (data.audio) {
          // Decode base64 PCM and schedule playback
          const pcmBytes = atob(data.audio);
          const samples = new Float32Array(pcmBytes.length / 2);
          for (let i = 0; i < samples.length; i++) {
            const int16 =
              pcmBytes.charCodeAt(i * 2) |
              (pcmBytes.charCodeAt(i * 2 + 1) << 8);
            samples[i] = (int16 > 32767 ? int16 - 65536 : int16) / 32768;
          }

          const audioBuffer = ctx.createBuffer(1, samples.length, 24000);
          audioBuffer.copyToChannel(samples, 0);

          const source = ctx.createBufferSource();
          source.buffer = audioBuffer;
          source.connect(ctx.destination);

          const startTime = Math.max(
            ctx.currentTime,
            nextStartTimeRef.current
          );
          source.start(startTime);
          nextStartTimeRef.current =
            startTime + audioBuffer.duration;
        }
      }
    }

    // Wait for all audio to finish
    const remaining = nextStartTimeRef.current - ctx.currentTime;
    if (remaining > 0) {
      await new Promise((r) => setTimeout(r, remaining * 1000));
    }
    setIsPlaying(false);
  }, [text, description]);

  return (
    <div className="space-y-4">
      <textarea
        value={description}
        onChange={(e) => setDescription(e.target.value)}
        placeholder="Describe the voice..."
        className="w-full p-3 rounded bg-zinc-800 text-zinc-100"
        rows={2}
      />
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        placeholder="Enter text to speak..."
        className="w-full p-3 rounded bg-zinc-800 text-zinc-100"
        rows={4}
      />
      <button
        onClick={speak}
        disabled={isPlaying || !text}
        className="px-6 py-2 bg-amber-500 text-zinc-900 rounded
                   font-medium disabled:opacity-50"
      >
        {isPlaying ? 'Speaking...' : 'Speak'}
      </button>
    </div>
  );
}

Why schedule with startTime?

Using source.start(startTime) instead of source.start() ensures gapless playback. Each chunk is scheduled to begin exactly when the previous one ends, eliminating clicks and pauses between chunks.
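The scheduling math can be isolated as a pure helper. This is a sketch, assuming chunks arrive faster than real time so `currentTime` stays behind the queue:

```typescript
// Pure sketch of the gapless scheduling used above: each chunk starts at
// max(now, end of previous chunk), so chunks never overlap or leave gaps.
function scheduleChunks(currentTime: number, durations: number[]): number[] {
  let nextStart = currentTime;
  return durations.map((duration) => {
    const start = Math.max(currentTime, nextStart);
    nextStart = start + duration;
    return start;
  });
}

// Three chunks of 0.5s, 0.25s, 0.5s queued at t=0 start back-to-back:
// scheduleChunks(0, [0.5, 0.25, 0.5]) → [0, 0.5, 0.75]
```

If a chunk arrives late (after `nextStart` has passed), `Math.max` snaps it to the current time instead of scheduling it in the past.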

Batch Playback

If you're using saved voices, there's no SSE parsing at all. Fetch the complete file and play it:

typescript
async function speakWithSavedVoice(text: string, voiceId: string) {
  const response = await fetch('/api/speak', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, voiceId }),
  });

  const blob = await response.blob();
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  audio.play();

  audio.onended = () => URL.revokeObjectURL(url);
}

Step 4: Add Input Validation

Validate request bodies in your API route to prevent abuse:

typescript
import { z } from 'zod';

const speakSchema = z.object({
  text: z.string().min(1).max(5000),
  voiceDescription: z.string().min(1).max(500),
  language: z.string().length(2).default('en'),
});

export async function POST(req: NextRequest) {
  const result = speakSchema.safeParse(await req.json());
  if (!result.success) {
    // parse() would throw and surface as a 500; safeParse returns a clean 400
    return Response.json(result.error.flatten(), { status: 400 });
  }
  const { text, voiceDescription, language } = result.data;
  // ... rest of handler
}

Step 5: Deploy to Production

murmr's API handles all the GPU compute — your Next.js app just proxies requests. Deploy normally:

bash
vercel deploy

Make sure MURMR_API_KEY is set in your Vercel project's environment variables.

Production Checklist

  • API key security: Verify your key is only in server-side env vars (no NEXT_PUBLIC_ prefix)
  • Rate limiting: Add rate limiting to your API route to prevent abuse. murmr enforces plan limits, but you should also protect your own endpoint
  • Error handling: Show user-friendly errors when the API is unavailable or quota is exceeded
  • Audio format: The streaming endpoint returns 24kHz 16-bit mono PCM. The batch endpoint returns WAV by default (configurable via response_format)
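The rate-limiting item above can be sketched with a naive in-memory sliding window. This is per server instance only; on serverless platforms each instance keeps its own map, so use a shared store such as Redis or Upstash there instead:

```typescript
// Naive sliding-window rate limiter keyed by IP. In-memory state is
// per-instance, so this is a sketch, not a production serverless solution.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 10;
const hits = new Map<string, number[]>();

export function isRateLimited(ip: string, now = Date.now()): boolean {
  // Keep only timestamps still inside the window
  const recent = (hits.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    hits.set(ip, recent);
    return true;
  }
  recent.push(now);
  hits.set(ip, recent);
  return false;
}
```

In the route handler, check it before calling upstream: `if (isRateLimited(ip)) return new Response('Too many requests', { status: 429 });`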

Choosing Between Streaming and Batch

| | Streaming (VoiceDesign) | Batch (Saved Voice) |
|---|---|---|
| Latency | First audio in ~300ms | Full audio after generation completes |
| Use case | Interactive, real-time | Pre-generated, downloadable |
| Voice | Describe on-the-fly | Consistent saved voice |
| Complexity | SSE parsing + audio scheduling | Simple fetch + play |
| Plans | All plans (Free tier: 5/day) | All plans |

For most apps, start with streaming VoiceDesign for prototyping, then save voices you like and switch to the batch endpoint for production consistency.

Next Steps

  • Save voices: Use the Voice Management API to save VoiceDesign voices for reuse
  • Multiple languages: murmr supports 10 languages — just change the language parameter
  • Real-time agents: For voice agents that need sub-200ms latency, check out the WebSocket API
  • Audio formats: Request mp3, opus, aac, or flac via the response_format parameter on batch endpoints
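As a sketch, requesting a different format is just one more field in the request body. The route from Step 2 would need to forward `response_format` upstream for this to take effect:

```typescript
// Build a batch request body asking for a specific output format.
// Assumes your /api/speak route forwards response_format upstream.
type AudioFormat = 'wav' | 'mp3' | 'opus' | 'aac' | 'flac';

function buildSpeakBody(
  text: string,
  voiceId: string,
  format: AudioFormat = 'wav'
): string {
  return JSON.stringify({ text, voiceId, response_format: format });
}
```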