Japanese TTS for VTubers: Complete Guide

How to use murmr's Japanese TTS for VTuber content, anime voice styles, and Japanese-language applications. Includes voice descriptions and integration tips.

murmr Team
January 28, 2026 · 6 min read
#japanese #vtuber #anime #streaming #languages

Japanese TTS has historically been challenging: most English-first services produce robotic or unnatural-sounding Japanese. If you're building VTuber tools, games, or Japanese-language applications, quality matters. Here's how to get native-quality Japanese voices with murmr.

Why Japanese TTS Matters

Japan has one of the most demanding audiences for synthetic voices. The VTuber industry, visual novels, and anime games have set high expectations for voice quality. Poor TTS immediately breaks immersion.

Common issues with Japanese TTS:

  • Pitch accent errors: Japanese has pitch accent, not stress accent like English
  • Particle pronunciation: は and へ are read "wa" and "e" (not "ha" and "he") when used as particles, and を is read "o"
  • Honorific handling: Keigo (formal speech) sounds unnatural
  • Emotional range: Anime-style voices require expressive delivery

Most TTS APIs treat Japanese as a secondary language. The model is trained primarily on English, then fine-tuned on limited Japanese data. The result: technically correct but obviously synthetic speech.

murmr's Japanese Quality

murmr uses Qwen3-TTS, which was trained on massive multilingual data including extensive Japanese content. In cross-lingual benchmarks, Qwen3-TTS outperforms other models in Japanese naturalness and speaker similarity.

| Feature | murmr | Typical TTS |
|---------|-------|-------------|
| Pitch accent | Native-like | Often incorrect |
| Emotional range | Anime-suitable | Limited |
| Keigo (formal) | Natural | Robotic |
| Voice variety | VoiceDesign (unlimited) | Preset only |

Japanese Voice Styles

Japanese voice acting has distinct archetypes. Here's how to describe them for VoiceDesign:

Genki (元気) - Energetic

The cheerful, high-energy voice common in anime protagonists:

text
"元気で明るい若い女性の声。高めのトーンで、テンポよく話す。
アニメのヒロインのような、親しみやすくエネルギッシュな印象。"

// Or in English:
"An energetic young Japanese woman with a bright, cheerful voice.
High-pitched and fast-paced, like an anime protagonist.
Friendly and enthusiastic."

Calm (落ち着いた) - Composed

The mature, composed voice often used for narration or older characters:

text
"落ち着いた女性の声。ゆっくりと丁寧に話す。上品で知的な印象。
ナレーションや教育コンテンツに適している。"

// Or in English:
"A calm, composed Japanese woman. Speaks slowly and carefully.
Elegant and intellectual. Suitable for narration or educational content."

Kawaii (かわいい) - Cute

The cute, slightly childish voice popular in VTuber content:

text
"可愛らしい女の子の声。少し高めで、甘えるような話し方。
VTuberやアイドルのような、キュートで愛らしい印象。"

// Or in English:
"A cute young Japanese girl's voice. Slightly high-pitched with
a sweet, endearing way of speaking. Like a VTuber or idol."

Cool (クール) - Cool/Aloof

The detached, slightly cold voice often used for mysterious characters:

text
"クールで落ち着いた女性の声。感情を抑えた話し方で、
少しミステリアスな印象。低めのトーンで話す。"

// Or in English:
"A cool, reserved Japanese woman. Speaks with controlled emotion,
slightly mysterious. Lower pitch, deliberate pacing."

Language tip

Japanese descriptions often work better for Japanese voices, but English descriptions are also effective. Test both and save the voice you prefer.
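
If you want to test both languages side by side, it can help to keep the descriptions in one place. The sketch below stores the four archetype descriptions from this section in a dictionary and returns the Japanese or English variant on request. The names `VOICE_PRESETS` and `get_description` are illustrative only and are not part of murmr's API; how you submit the description to VoiceDesign is left out.

```python
# Archetype descriptions copied from the sections above.
# VOICE_PRESETS and get_description are hypothetical names, not a murmr API.
VOICE_PRESETS = {
    "genki": {
        "ja": "元気で明るい若い女性の声。高めのトーンで、テンポよく話す。"
              "アニメのヒロインのような、親しみやすくエネルギッシュな印象。",
        "en": "An energetic young Japanese woman with a bright, cheerful voice. "
              "High-pitched and fast-paced, like an anime protagonist. "
              "Friendly and enthusiastic.",
    },
    "calm": {
        "ja": "落ち着いた女性の声。ゆっくりと丁寧に話す。上品で知的な印象。"
              "ナレーションや教育コンテンツに適している。",
        "en": "A calm, composed Japanese woman. Speaks slowly and carefully. "
              "Elegant and intellectual. Suitable for narration or educational content.",
    },
    "kawaii": {
        "ja": "可愛らしい女の子の声。少し高めで、甘えるような話し方。"
              "VTuberやアイドルのような、キュートで愛らしい印象。",
        "en": "A cute young Japanese girl's voice. Slightly high-pitched with "
              "a sweet, endearing way of speaking. Like a VTuber or idol.",
    },
    "cool": {
        "ja": "クールで落ち着いた女性の声。感情を抑えた話し方で、"
              "少しミステリアスな印象。低めのトーンで話す。",
        "en": "A cool, reserved Japanese woman. Speaks with controlled emotion, "
              "slightly mysterious. Lower pitch, deliberate pacing.",
    },
}


def get_description(archetype: str, lang: str = "ja") -> str:
    """Return the description for an archetype, falling back to English."""
    try:
        preset = VOICE_PRESETS[archetype.lower()]
    except KeyError:
        raise ValueError(f"Unknown archetype: {archetype!r}") from None
    return preset.get(lang, preset["en"])
```

Calling `get_description("kawaii")` and `get_description("kawaii", "en")` gives you the two variants to compare before you save a voice.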

VTuber Integration

For VTuber applications, you typically need:

  1. Real-time or near-real-time text-to-speech
  2. Consistent voice across the stream
  3. Integration with OBS or streaming software
  4. Optional: lip sync data

OBS Setup

The most common setup uses a local script that watches for text input (from chat, TTS triggers, etc.) and plays audio through a virtual audio device:

VTuber TTS Script
import requests
import pyaudio
import io
import wave

# Your saved Japanese voice
VOICE_ID = "voice_abc123"
API_KEY = "your-api-key"

def speak(text: str) -> None:
    """Generate TTS audio and play it on the default output device."""
    response = requests.post(
        "https://api.murmr.dev/v1/audio/speech",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "input": text,
            "voice": VOICE_ID,
            "response_format": "wav"
        },
        timeout=30,  # don't let a stalled request hang the stream
    )

    if response.status_code != 200:
        # Skip playback, but make the failure visible
        print(f"TTS request failed: HTTP {response.status_code}")
        return

    # Play through default audio device
    # (Configure OBS to capture this device)
    audio = io.BytesIO(response.content)
    with wave.open(audio, 'rb') as wf:
        p = pyaudio.PyAudio()
        try:
            stream = p.open(
                format=p.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True
            )
            # Feed the WAV data to the device in 1024-frame blocks
            data = wf.readframes(1024)
            while data:
                stream.write(data)
                data = wf.readframes(1024)
            stream.stop_stream()
            stream.close()
        finally:
            p.terminate()

# Example: Watch chat for TTS triggers
def on_chat_message(username: str, message: str) -> None:
    # "!tts " and "!say " are both five characters long
    if message.startswith(("!tts ", "!say ")):
        text = message[5:].strip()
        if text:
            speak(text)
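
Because `speak` blocks until playback finishes, several chat triggers arriving at once would either stall the chat handler or overlap if run on separate threads. One possible approach, sketched below, is a single worker thread that plays requests in arrival order. `SpeechQueue` and `max_pending` are hypothetical names introduced for this example; the speak function is passed in, so the sketch makes no assumptions about the API itself.

```python
import queue
import threading
from typing import Callable


class SpeechQueue:
    """Run a blocking speak function on one worker thread, in arrival order."""

    def __init__(self, speak_fn: Callable[[str], None], max_pending: int = 20):
        self._speak_fn = speak_fn
        # Bounded queue so a chat flood cannot build an endless backlog
        self._queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, text: str) -> bool:
        """Queue text for playback; return False if the backlog is full."""
        try:
            self._queue.put_nowait(text)
            return True
        except queue.Full:
            return False

    def close(self) -> None:
        """Play everything already queued, then stop the worker."""
        self._queue.put(None)  # sentinel: tells the worker to exit
        self._worker.join()

    def _run(self) -> None:
        while True:
            text = self._queue.get()
            if text is None:
                break
            try:
                self._speak_fn(text)
            except Exception as exc:
                # Keep the worker alive if one request fails
                print(f"TTS failed: {exc}")
```

In the chat handler you would create `tts = SpeechQueue(speak)` once and call `tts.submit(text)` instead of `speak(text)`.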

For lower latency, use the SSE streaming endpoint:

Streaming for lower latency
import requests
import base64
import json
import pyaudio

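def speak_streaming(text: str) -> None:
    """Stream audio for faster playback start.

    Reuses API_KEY and VOICE_ID from the VTuber TTS script above.
    """
    response = requests.post(
        "https://api.murmr.dev/v1/audio/speech/stream",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json"
        },
        json={"input": text, "voice": VOICE_ID},
        stream=True,
        timeout=30,
    )
    response.raise_for_status()  # fail early instead of parsing an error body

    p = pyaudio.PyAudio()
    # Chunks are played as raw 16-bit mono PCM at 24 kHz
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=24000,
        output=True
    )

    try:
        for line in response.iter_lines():
            line = line.decode("utf-8")
            if not line.startswith("data: "):
                continue
            try:
                data = json.loads(line[6:])
            except json.JSONDecodeError:
                continue  # ignore payloads that are not JSON
            if isinstance(data, dict) and "audio_chunk" in data:
                audio_data = base64.b64decode(data["audio_chunk"])
                stream.write(audio_data)
    finally:
        # Release the audio device and the connection even on errors
        stream.stop_stream()
        stream.close()
        p.terminate()
        response.close()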

Latency expectations

SSE streaming: ~450ms to first audio chunk. WebSocket (Realtime plan): ~250ms to first audio chunk. For most VTuber use cases, SSE is sufficient.
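
These figures depend on your network and location, so it may be worth measuring on your own setup. The helper below is a rough sketch: it times how long an iterable of audio chunks takes to produce its first non-empty chunk. `measure_first_chunk_ms` is a hypothetical name, and `start_stream` is any zero-argument function you supply that opens the stream and returns its chunks.

```python
import time
from typing import Callable, Iterable, Optional


def measure_first_chunk_ms(
    start_stream: Callable[[], Iterable[bytes]]
) -> Optional[float]:
    """Return milliseconds from stream start to the first non-empty chunk.

    Returns None if the stream ends without producing any audio.
    The caller is responsible for closing the underlying connection.
    """
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:
            return (time.perf_counter() - started) * 1000.0
    return None
```

Run it several times and look at the spread rather than a single number, since the first request after a pause is often slower.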

Best Practices

1. Save your VTuber voice

Create your character's voice once with VoiceDesign, save it, and use the voice ID for all subsequent requests. This ensures consistency across streams.
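
On the client side, one simple way to keep the ID stable between streams is to store it in a small local file rather than hard-coding it in each script. This is only a sketch: the JSON file layout and the helper names `save_voice_id` and `load_voice_id` are assumptions made for illustration, not something murmr prescribes.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def _load(path: Path) -> Dict[str, str]:
    """Read the character -> voice ID mapping, or an empty one if missing."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_voice_id(path: Path, character: str, voice_id: str) -> None:
    """Remember the saved voice ID for a character."""
    voices = _load(path)
    voices[character] = voice_id
    path.write_text(
        json.dumps(voices, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_voice_id(path: Path, character: str) -> Optional[str]:
    """Return the stored voice ID, or None if the character is unknown."""
    return _load(path).get(character)
```

For example, `save_voice_id(Path("voices.json"), "main", "voice_abc123")` stores the ID once, and later scripts can call `load_voice_id(Path("voices.json"), "main")`.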

2. Use appropriate text length

For real-time TTS, keep inputs short (1-2 sentences). Longer text adds latency. If you need to read longer passages, split them by sentence.
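
Japanese does not separate sentences with spaces, so splitting has to rely on punctuation. The following is a minimal sketch that splits on 。 and on full-width or ASCII exclamation and question marks, keeping the punctuation attached to each sentence. `split_sentences` is a hypothetical helper; it deliberately ignores the ASCII period so that numbers such as 3.5 stay intact.

```python
import re
from typing import List

# A run of non-terminator characters followed by any terminators (。！？!?).
# Newlines also end a sentence.
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the ending punctuation."""
    parts = (match.strip() for match in _SENTENCE.findall(text))
    return [part for part in parts if part]
```

Each returned sentence can then be sent as its own request, so playback of the first one starts while the rest are still waiting.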

3. Handle special characters

Japanese text often includes emoji, kaomoji (顔文字), and special characters. Filter or replace these before sending to TTS:

python
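import re

# Emoji ranges to strip; extend this list if other symbols slip through
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u27BF"          # miscellaneous symbols & dingbats
    "]+"
)

# Anything wrapped in full-width or ASCII parentheses, e.g. （＾▽＾）.
# Note: this also drops ordinary parenthetical text such as (笑).
KAOMOJI_PATTERN = re.compile(r"[（(][^）)]*[）)]")

def clean_text(text: str) -> str:
    # Remove emoji
    text = EMOJI_PATTERN.sub('', text)

    # Remove kaomoji (common patterns)
    text = KAOMOJI_PATTERN.sub('', text)

    return text.strip()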

4. Consider politeness levels

Japanese has distinct politeness levels (casual, polite, formal/keigo). Match your voice description to the expected speech pattern:

  • Casual: VTuber chatting with audience
  • Polite: Customer service, announcements
  • Formal (keigo): Business, official content

Example Voice Descriptions

Ready-to-use descriptions for common VTuber archetypes:

Idol VTuber

text
"A young Japanese idol with a bright, energetic voice.
Speaks with enthusiasm and warmth. Slightly high-pitched
with an endearing quality. Perfect for streaming and
interacting with fans."

Gaming VTuber

text
"A casual, friendly Japanese gamer voice. Young woman,
speaks naturally and expressively. Can show excitement
and frustration. Conversational and relatable."

ASMR VTuber

text
"A soft, gentle Japanese woman's voice. Speaks very
quietly and slowly. Soothing and calming. Whisper-like
quality suitable for ASMR content."

Male VTuber

text
"A friendly young Japanese man with a warm, approachable
voice. Mid-range pitch, speaks clearly with natural energy.
Suitable for gaming streams and casual conversation."

Start building: Try these voice descriptions in the Voice Playground. Save your favorite and start streaming!
