SSML

When you return a response to the Google Assistant, you can use a subset of the Speech Synthesis Markup Language (SSML). SSML can make your agent's responses sound more lifelike. The following example shows SSML markup and how the Google Assistant reads it back.

Markup

<speak>
  Here are <say-as interpret-as="characters">SSML</say-as> samples.
  I can pause <break time="3s"/>.
  I can play a sound
  <audio src="https://www.example.com/MY_MP3_FILE.mp3">didn't get your MP3 audio file</audio>.
  I can speak in cardinals. Your number is <say-as interpret-as="cardinal">10</say-as>.
  Or I can speak in ordinals. You are <say-as interpret-as="ordinal">10</say-as> in line.
  Or I can even speak in digits. The digits for ten are <say-as interpret-as="characters">10</say-as>.
  I can also substitute phrases, like the <sub alias="World Wide Web Consortium">W3C</sub>.
  Finally, I can speak a paragraph with two sentences.
  <p><s>This is sentence one.</s><s>This is sentence two.</s></p>
</speak>

Spoken output

Here are S S M L samples. I can pause [3 second pause]. I can play a sound [audio file plays].
I can speak in cardinals. Your number is ten.
Or I can speak in ordinals. You are tenth in line.
Or I can even speak in digits. The digits for ten are one oh.
I can also substitute phrases, like the World Wide Web Consortium.
Finally, I can speak a paragraph with two sentences. This is sentence one. This is sentence two.

For information on how to use the client library for SSML output, see the SSML section in the Actions SDK or API.AI dialogs and fulfillment guides.

Note: SSML is supported in the Google Home Web Simulator, but not in the API.AI simulator.

Support for SSML elements

The following table describes the SSML elements that you can use:

Element Description
<speak>

The root element of the SSML response. The W3C SSML specification requires an xml:lang attribute on this element to specify the language of the root document; the examples on this page omit it.

The following example shows how to use the <speak> element:

<speak>
  my SSML content
</speak>

For more information, see speak Element.
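When the spoken text comes from user data, it may contain characters such as & or < that would break the markup. The following is a minimal Python sketch, not part of any client library, that wraps plain text in a <speak> root and escapes it first. The helper name speak is hypothetical.

```python
from xml.sax.saxutils import escape


def speak(text: str) -> str:
    """Wrap plain text in a <speak> root element.

    XML special characters (&, <, >) are escaped so the result
    stays well-formed. Hypothetical helper for illustration only.
    """
    return "<speak>" + escape(text) + "</speak>"


print(speak("Tom & Jerry say 1 < 2"))
```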

<break>

An empty element that controls pausing or other boundaries between words. Using <break> between any pair of words is optional.

The following example shows how to use the <break> element to pause between two steps:

<speak>
  Step 1, take a deep breath. <break time="2s" />
  Step 2, exhale.
</speak>

For more information, see break Element.
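If the steps are generated at runtime, the pauses can be inserted programmatically. The sketch below assumes whole-second pauses and uses hypothetical helper names (pause, speak_steps); it only builds the markup string shown in the example above.

```python
from xml.sax.saxutils import escape


def pause(seconds: int) -> str:
    """Return a <break> element for a pause of whole seconds."""
    if seconds < 0:
        raise ValueError("pause length must not be negative")
    return f'<break time="{seconds}s"/>'


def speak_steps(steps, seconds):
    """Join escaped steps with a <break> between each pair."""
    separator = f" {pause(seconds)} "
    body = separator.join(escape(step) for step in steps)
    return "<speak>" + body + "</speak>"


print(speak_steps(["Step 1, take a deep breath.", "Step 2, exhale."], 2))
```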

<say-as>

Lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.

The <say-as> element has one required attribute, interpret-as, which determines how the value is spoken. The optional attributes format and detail may be used depending on the particular interpret-as value. The interpret-as attribute supports the following values:

  • cardinal
    The following example is spoken as "Twelve thousand three hundred forty five" (for US English) or "Twelve thousand three hundred and forty five" (for UK English):
    <speak>
      <say-as interpret-as="cardinal">12345</say-as>
    </speak>
    
  • ordinal
    The following example is spoken as "First":
    <speak>
      <say-as interpret-as="ordinal">1</say-as>
    </speak>
    
  • characters
    The following example is spoken as "see ay en":
    <speak>
      <say-as interpret-as="characters">can</say-as>
    </speak>
    
  • date
    The following example is spoken as "September ten nineteen hundred sixty":
    <speak>
      <say-as interpret-as="date" format="ymd">1960-09-10</say-as>
    </speak>
    

    The format attribute is a sequence of date field character codes. The supported field character codes in format are {y, m, d} for year, month, and day (of the month) respectively. If a field code appears once for year, month, or day, the expected number of digits is 4, 2, and 2 respectively. If a field code is repeated, the expected number of digits is the number of times the code is repeated. Fields in the date text may be separated by punctuation and/or spaces.

    The detail attribute controls the spoken form of the date. For detail='1', only the day field and one of the month or year fields are required, although both may be supplied. This is the default when fewer than all three fields are given. The spoken form is "The {ordinal day} of {month}, {year}".

    The following example is spoken as "The tenth of September, nineteen sixty":
    <speak>
      <say-as interpret-as="date" format="yyyymmdd" detail="1">
        1960-09-10
      </say-as>
    </speak>
    
    The following example is spoken as "The tenth of September":
    <speak>
      <say-as interpret-as="date" format="dm">10-9</say-as>
    </speak>
    

    For detail='2', the day, month, and year fields are all required; this is the default when all three fields are supplied. The spoken form is "{month} {ordinal day}, {year}".

    The following example is spoken as "September tenth, nineteen sixty":
    <speak>
      <say-as interpret-as="date" format="dmy" detail="2">
        10-9-1960
      </say-as>
    </speak>
    
  • time
    The following example is spoken as "Two thirty P.M.":
    <speak>
      <say-as interpret-as="time" format="hms12">2:30pm</say-as>
    </speak>
    

    The format attribute is a sequence of time field character codes. The supported field character codes in format are {h, m, s, Z, 12, 24} for hour, minute (of the hour), second (of the minute), time zone, 12-hour time, and 24-hour time respectively. If a field code appears once for hour, minute, or second, the expected number of digits is 1, 2, and 2 respectively. If a field code is repeated, the expected number of digits is the number of times the code is repeated. Fields in the time text may be separated by punctuation and/or spaces. If hour, minute, or second is not specified in the format, or there are no matching digits, the field is treated as a zero value. The default format is "hms12".

    The detail attribute controls whether the spoken form of the time is 12-hour time or 24-hour time. The spoken form is 24-hour time if detail='1' or if detail is omitted and the format of the time is 24-hour time. The spoken form is 12-hour time if detail='2' or if detail is omitted and the format of the time is 12-hour time.

  • telephone
    See the interpret-as='telephone' description in the W3C SSML 1.0 say-as attribute values WG note.
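The digit-count rule for the date format attribute, described under the date value above, can be sketched in Python as follows. This only illustrates the rule as stated on this page; the function name expected_digits is hypothetical, and the sketch does not reproduce how the Google Assistant actually parses dates.

```python
from itertools import groupby
from typing import List, Tuple

# Digits expected when a field code appears exactly once.
DEFAULT_DIGITS = {"y": 4, "m": 2, "d": 2}


def expected_digits(fmt: str) -> List[Tuple[str, int]]:
    """Return (field code, expected digit count) pairs for a date format."""
    fields = []
    for code, run in groupby(fmt):
        if code not in DEFAULT_DIGITS:
            raise ValueError(f"unsupported date field code: {code!r}")
        repeats = len(list(run))
        # A single code uses the default width; a repeated code
        # expects as many digits as there are repetitions.
        width = DEFAULT_DIGITS[code] if repeats == 1 else repeats
        fields.append((code, width))
    return fields


print(expected_digits("ymd"))
print(expected_digits("yyyymmdd"))
```

Under this rule, "ymd" and "yyyymmdd" describe the same digit widths, while "yy" would expect a two-digit year.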

For more information, see say-as Element.
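The way the detail attribute selects 12-hour or 24-hour speech for the time value can be sketched as a small decision function. The name spoken_clock is hypothetical. Treating a format that contains neither 12 nor 24 as 12-hour time is an assumption based on the default format "hms12"; the text above does not cover that case.

```python
from typing import Optional


def spoken_clock(fmt: str = "hms12", detail: Optional[str] = None) -> int:
    """Return 12 or 24, the clock style used for the spoken time."""
    if detail == "1":
        return 24
    if detail == "2":
        return 12
    if detail is not None:
        raise ValueError(f"unsupported detail value: {detail!r}")
    # detail omitted: follow the format. Assumption: a format without
    # an explicit 24 code is treated as 12-hour time.
    return 24 if "24" in fmt else 12


print(spoken_clock("hms12"))
print(spoken_clock("hms24"))
print(spoken_clock("hms12", detail="1"))
```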

<audio>

Supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output.

The following are the currently supported settings for audio:

  • Format: MP3 (MPEG v2)
    • 24K samples per second
    • 24K to 96K bits per second, fixed rate
  • Format: Opus in Ogg
    • 24K samples per second (super-wideband)
    • 24K to 96K bits per second, fixed rate
  • Format (deprecated): WAV (RIFF)
    • PCM 16-bit signed, little endian
    • 24K samples per second
  • For all formats:
    • Single channel is preferred, but stereo is acceptable.
    • 120 seconds maximum duration.
    • 5 megabyte file size limit.
    • Source URL must use HTTPS protocol.
    • Our UserAgent when fetching the audio is "Google-Speech-Actions".

The following example outputs the sound stored at the src URL:

<speak>
  <audio src="https://.../meow.mp3">
    a cat meowing
  </audio>
</speak>

The contents of the <audio> element are optional and are used if the audio file cannot be played or if the output device does not support audio.

Because the src URL must be an HTTPS URL, one hosting option is Google Cloud Storage, which can serve your audio files from an HTTPS URL.

For more information, see audio Element.
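Some of the limits listed above can be checked before a file is referenced in a response. The sketch below is illustrative only: the function name audio_problems is hypothetical, the caller is assumed to already know the clip's duration and size, and "5 megabyte" is interpreted conservatively as 5,000,000 bytes. It does not inspect the audio format or sample rate.

```python
from urllib.parse import urlparse

MAX_SECONDS = 120
MAX_BYTES = 5 * 1000 * 1000  # assumption: "5 megabyte" read as 5,000,000 bytes


def audio_problems(src: str, duration_seconds: float, size_bytes: int) -> list:
    """Return a list of limit violations; an empty list means none found."""
    problems = []
    if urlparse(src).scheme != "https":
        problems.append("source URL must use HTTPS")
    if duration_seconds > MAX_SECONDS:
        problems.append("duration exceeds 120 seconds")
    if size_bytes > MAX_BYTES:
        problems.append("file size exceeds 5 megabytes")
    return problems


print(audio_problems("https://www.example.com/MY_MP3_FILE.mp3", 30, 1_000_000))
```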

<p>, <s>

Paragraph and sentence elements.

<p><s>This is sentence one.</s><s>This is sentence two.</s></p>

For more information, see p and s Elements.

<sub>

Indicates that the text in the alias attribute value replaces the contained text for pronunciation.

<sub alias="World Wide Web Consortium">W3C</sub>

For more information, see sub Element.
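Because only the elements described in this section are supported, it can be useful to scan a response for other element names before sending it. The following Python sketch uses the standard library XML parser; the function name unsupported_elements is hypothetical, and the check covers element names only, not attributes or how the Google Assistant renders them.

```python
import xml.etree.ElementTree as ET

# Element names described in this section.
SUPPORTED = {"speak", "break", "say-as", "audio", "p", "s", "sub"}


def unsupported_elements(ssml: str) -> set:
    """Return the element names in the markup that are not in SUPPORTED.

    Raises xml.etree.ElementTree.ParseError if the markup is not
    well-formed XML.
    """
    root = ET.fromstring(ssml)
    return {element.tag for element in root.iter() if element.tag not in SUPPORTED}


sample = '<speak><sub alias="World Wide Web Consortium">W3C</sub></speak>'
print(unsupported_elements(sample))
```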