Intermediate

SSML — Speech Synthesis Markup Language

SSML gives you fine-grained control over how text is spoken — adding pauses, emphasis, pronunciation guidance, prosody adjustments, and multi-language support to create polished, natural-sounding speech output.

What is SSML?

SSML is an XML-based markup language defined by the W3C for controlling speech synthesis. It lets you annotate text with instructions for the TTS engine, going beyond what plain text can express.

Essential SSML Elements

Element | Purpose | Example
<speak> | Root element (required) | <speak>Hello world</speak>
<break> | Insert a pause | Step one.<break time="500ms"/>Step two.
<emphasis> | Stress a word | This is <emphasis level="strong">very</emphasis> important.
<prosody> | Control rate, pitch, volume | <prosody rate="slow" pitch="+2st">Listen carefully.</prosody>
<say-as> | Specify interpretation | <say-as interpret-as="date">2026-03-15</say-as>
<phoneme> | Custom pronunciation | <phoneme alphabet="ipa" ph="təˈmeɪ.toʊ">tomato</phoneme>
<sub> | Substitution alias | <sub alias="World Wide Web Consortium">W3C</sub>
<lang> | Switch language | The word <lang xml:lang="fr-FR">bonjour</lang> means hello.

Practical SSML Examples

Controlling Pauses and Pacing

<speak>
  Welcome to AI School.
  <break time="1s"/>
  Today we will cover three topics.
  <break time="500ms"/>
  <prosody rate="90%">
    First, speech synthesis fundamentals.
  </prosody>
  <break time="300ms"/>
  Second, neural voice technology.
  <break time="300ms"/>
  And third, practical applications.
</speak>
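
When the spoken text comes from application data, the same pacing pattern can be generated rather than written by hand. The following is a minimal Python sketch, not part of any TTS SDK: the function name list_with_pauses and its parameters are hypothetical, and it uses only the standard library to escape the text and place a <break> between items.

```python
from xml.sax.saxutils import escape


def list_with_pauses(intro, items, pause_ms=300):
    """Return an SSML string that reads `intro`, then each item,
    with a <break> of `pause_ms` milliseconds between them.

    Hypothetical helper for illustration only.
    """
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() converts &, < and > to XML entities so that
    # ordinary text cannot break the surrounding markup.
    spoken = [escape(intro)] + [escape(item) for item in items]
    return "<speak>" + pause.join(spoken) + "</speak>"


if __name__ == "__main__":
    print(list_with_pauses(
        "Today we will cover three topics.",
        ["Speech synthesis fundamentals.",
         "Neural voice technology.",
         "Practical applications."],
    ))
```

Keeping the pause length in one parameter makes it easy to tune pacing for a given voice without touching the text.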

Numbers, Dates, and Abbreviations

<speak>
  The meeting is on
  <say-as interpret-as="date" format="mdy">3/15/2026</say-as>
  at <say-as interpret-as="time">2:30pm</say-as>.

  Call us at
  <say-as interpret-as="telephone">+1-800-555-0123</say-as>.

  The total is
  <say-as interpret-as="currency">$1,234.56</say-as>.

  Version <say-as interpret-as="characters">3.2.1</say-as>
  is now available.
</speak>
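
If dates, phone numbers, or amounts are filled in at runtime, a small helper can wrap each value in <say-as>. This is a sketch under assumptions: say_as is a hypothetical function name, and the interpret-as and format values each engine accepts vary by platform, so the values shown are only the ones used in the sample above.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text, interpret_as, fmt=None):
    """Wrap `text` in a <say-as> element.

    Hypothetical helper; which interpret-as and format values
    are honored depends on the TTS platform.
    """
    # quoteattr() returns the value already quoted and escaped
    # for use as an XML attribute.
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


if __name__ == "__main__":
    date = say_as("3/15/2026", "date", "mdy")
    phone = say_as("+1-800-555-0123", "telephone")
    print(f"<speak>The meeting is on {date}. Call us at {phone}.</speak>")
```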

Multi-Language Content

<speak>
  In English, we say hello.
  <lang xml:lang="fr-FR">En français, on dit bonjour.</lang>
  <lang xml:lang="es-ES">En español, decimos hola.</lang>
  <lang xml:lang="de-DE">Auf Deutsch sagen wir hallo.</lang>
</speak>
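
The same structure can be produced from a list of (language, text) pairs. The sketch below is illustrative only: multilingual is a hypothetical name, the default language en-US is an assumption, and whether a given voice can actually speak each language still depends on the platform.

```python
from xml.sax.saxutils import escape, quoteattr


def multilingual(segments, default_lang="en-US"):
    """Build SSML from (lang, text) pairs.

    Segments in `default_lang` are emitted as plain text; all
    others are wrapped in <lang xml:lang="...">.
    Hypothetical helper for illustration only.
    """
    parts = []
    for lang, text in segments:
        if lang == default_lang:
            parts.append(escape(text))
        else:
            parts.append(
                f"<lang xml:lang={quoteattr(lang)}>{escape(text)}</lang>"
            )
    body = " ".join(parts)
    return f"<speak xml:lang={quoteattr(default_lang)}>{body}</speak>"


if __name__ == "__main__":
    print(multilingual([
        ("en-US", "In English, we say hello."),
        ("fr-FR", "En français, on dit bonjour."),
        ("es-ES", "En español, decimos hola."),
    ]))
```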

Platform-Specific SSML Extensions

Azure MSTTS

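Microsoft extends SSML with <mstts:express-as> for emotional styles (cheerful, sad, angry), <mstts:silence> for controlling silence at the start, end, and sentence boundaries of the text, and <mstts:backgroundaudio> for background music. Available styles vary by voice.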

Google Extensions

Google Cloud TTS supports a subset of the W3C SSML elements, plus <par> and <seq> containers for playing media in parallel or in sequence, with <media> wrapping each layer of speech or audio inside them.

Amazon Polly

Polly adds <amazon:effect name="whispered"> for whispered speech, <amazon:breath> for breathing sounds, and <amazon:domain name="news"> for a newscaster speaking style. Support for each tag depends on the voice and engine.
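
Vendor extensions live in their own XML namespace, which has to be declared on the root element. As an illustration, the Python sketch below builds an Azure-style document around <mstts:express-as>. The function name azure_express_as is hypothetical, the voice en-US-JennyNeural is only an example, and the style you pass must be one that the chosen voice supports, so treat this as a starting point and check the platform documentation.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace URIs used in Azure SSML documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def azure_express_as(text, style, voice="en-US-JennyNeural", lang="en-US"):
    """Return Azure-style SSML that speaks `text` in `style`.

    Hypothetical helper; the default voice is an assumption and
    not every voice supports every style.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"
        "</mstts:express-as></voice></speak>"
    )


if __name__ == "__main__":
    print(azure_express_as("Your order has shipped!", "cheerful"))
```

Markup built this way is not portable: another platform will typically reject or ignore the mstts elements.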

SSML Best Practices

  • Start Simple: Use plain text first, then add SSML only where the default pronunciation or pacing is not satisfactory.
  • Test Across Voices: SSML may render differently across different voices and platforms. Always test with your target voice.
  • Use <say-as> for Ambiguity: Numbers, dates, addresses, and abbreviations are common sources of mispronunciation. <say-as> tells the engine explicitly how to read them.
  • Strategic Pauses: Use <break> elements to create natural-sounding pauses between sections, list items, and important points.
  • Validate Your SSML: SSML is XML, so it must be well-formed. Escape reserved characters such as & and < in the spoken text, and use an XML validator to catch syntax errors before sending the document to the API.
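
The validation step can be automated with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree. The function name check_ssml is hypothetical, and it checks only well-formedness and the root element, not whether a particular platform supports the tags used.

```python
import xml.etree.ElementTree as ET


def check_ssml(ssml):
    """Return a list of problems found in `ssml`.

    An empty list means the document is well-formed XML with a
    <speak> root. Hypothetical helper; it does not check
    platform-specific tag support.
    """
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    # ElementTree writes namespaced tags as "{uri}name";
    # keep only the local name for the comparison.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name != "speak":
        return [f"root element is <{local_name}>, expected <speak>"]
    return []


if __name__ == "__main__":
    samples = [
        '<speak>Step one.<break time="500ms"/>Step two.</speak>',
        "<speak>Research & development</speak>",  # unescaped ampersand
        "<p>No root speak element</p>",
    ]
    for sample in samples:
        print(check_ssml(sample) or "OK")
```

Running a check like this before each API call turns a vague synthesis failure into a specific parse error message.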