The Actions on Google platform supports a number of SSML Beta features in addition to the Actions on Google standard SSML elements.
Summary of supported Beta SSML features:
- <phoneme>: Customize the pronunciation of specific words.
- <say-as interpret-as="duration">: Specify durations.
- <voice>: Switch between voices in the same request.
- <lang>: Use multiple languages in the same request.
- Timepoints: Use the <mark> tag to return the timepoint of a specified point in your transcript.
You can use the <phoneme> tag to produce custom pronunciations of words inline. Actions on Google accepts the IPA and X-SAMPA phonetic alphabets. See the phonemes page for a list of supported languages and phonemes.
Each application of the <phoneme> tag directs the pronunciation of a single word:
<phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>
<phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>
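If you generate SSML programmatically, the quoting of the ph attribute deserves some care, because X-SAMPA transcriptions can contain a double quote. The following Python sketch shows one possible way to build the tag. The helper name `phoneme` and its checks are assumptions made for illustration and are not part of the platform.

```python
from xml.sax.saxutils import escape, quoteattr

# The two alphabet values that appear in the examples in this section.
SUPPORTED_ALPHABETS = {"ipa", "x-sampa"}


def phoneme(word: str, ph: str, alphabet: str = "ipa") -> str:
    """Return a <phoneme> element for one word (hypothetical helper)."""
    if alphabet not in SUPPORTED_ALPHABETS:
        raise ValueError(f"unsupported alphabet: {alphabet!r}")
    if len(word.split()) != 1:
        raise ValueError("each <phoneme> tag should wrap a single word")
    # quoteattr switches to single quotes when the value contains a double
    # quote, which X-SAMPA uses as a stress marker.
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(word)}</phoneme>"
    )


print(phoneme("manitoba", "ˌmænɪˈtoʊbə"))
print(phoneme("mahogany", 'm@"hA:g@%ni:', alphabet="x-sampa"))
```

The two calls reproduce the two example tags shown above, including the single-quoted ph attribute in the X-SAMPA case.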
There are up to three levels of stress that can be placed in a transcription:
- Primary stress: Denoted with ˈ in IPA and " in X-SAMPA.
- Secondary stress: Denoted with ˌ in IPA and % in X-SAMPA.
- Unstressed: Not denoted with a symbol (in either notation).
Some languages might have fewer than three levels or not denote stress placement at all. See the phonemes page to see the stress levels available for your language. Stress markers are placed at the start of each stressed syllable. For example, in US English:
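To see which stress levels a transcription uses, you can scan it for the markers. The Python sketch below is a minimal illustration that reuses the manitoba and mahogany transcriptions from earlier in this section. The function name `stress_pattern` is hypothetical, and the X-SAMPA symbols (" and %) are taken from standard X-SAMPA usage.

```python
from typing import List

# Stress markers per alphabet. The X-SAMPA entries are the standard
# X-SAMPA counterparts of the IPA marks (an assumption of this sketch).
STRESS_MARKERS = {
    "ipa": {"ˈ": "primary", "ˌ": "secondary"},
    "x-sampa": {'"': "primary", "%": "secondary"},
}


def stress_pattern(ph: str, alphabet: str = "ipa") -> List[str]:
    """Return the stress level of each marked syllable, in order."""
    markers = STRESS_MARKERS[alphabet]
    return [markers[ch] for ch in ph if ch in markers]


print(stress_pattern("ˌmænɪˈtoʊbə"))              # ['secondary', 'primary']
print(stress_pattern('m@"hA:g@%ni:', "x-sampa"))  # ['primary', 'secondary']
```

Unstressed syllables carry no marker, so they do not appear in the returned list.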
Broad vs narrow transcriptions
As a general rule, keep your transcriptions more broad and phonemic in nature.
For example, in US English, transcribe intervocalic t as t (instead of using a tap).
There are some instances where using the phonemic representation makes your TTS results sound unnatural (for example, if the sequence of phonemes is anatomically difficult to pronounce).
One example of this is voicing assimilation for s in English. In this case the assimilation should be reflected in the transcription: for instance, the final s of "dogs" is pronounced z and should be transcribed as z.
Every syllable must contain one (and only one) vowel. This means that you should avoid syllabic consonants and instead transcribe them with a reduced vowel. For example:
You can optionally specify syllable boundaries by using a period (.). Each syllable must contain one (and only one) vowel. For example:
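The one-vowel-per-syllable rule can be checked mechanically before you send a request. The following Python sketch is only a rough illustration: the vowel inventory is a small hypothetical subset of IPA (the phonemes page is the authoritative list for your language), the syllabified transcriptions are made up for the example, and a run of adjacent vowel symbols such as a diphthong is counted as one vowel.

```python
import re
from typing import List

# Hypothetical, deliberately small IPA vowel inventory for illustration.
VOWELS = "aeiouæɑɒɔəɛɪʊʌ"
VOWEL_RUN = re.compile(f"[{VOWELS}]+")


def syllables_with_bad_vowel_count(ph: str) -> List[str]:
    """Return the syllables that do not contain exactly one vowel.

    Syllables are separated by "."; adjacent vowel symbols (for example
    the diphthong in "toʊ") count as a single vowel.
    """
    bad = []
    for syllable in ph.split("."):
        if len(VOWEL_RUN.findall(syllable)) != 1:
            bad.append(syllable)
    return bad


print(syllables_with_bad_vowel_count("ˌmæn.ɪ.ˈtoʊ.bə"))  # []
print(syllables_with_bad_vowel_count("ˈkɪt.n"))          # ['n']
```

The second call flags the syllable that consists only of a consonant, which is the case the reduced-vowel guidance is meant to avoid.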
The Actions on Google platform supports <say-as interpret-as="duration"> to correctly read durations. For example, the following markup would be verbalized as "five hours and thirty minutes":
<say-as interpret-as="duration" format="h:m">5:30</say-as>
The format string supports the following values:
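If durations come from application data, you can render them into the h:m form used in the example above. The Python sketch below is a minimal illustration. The helper name `duration_say_as` is hypothetical. It only covers the "h:m" format, drops any leftover seconds, and does not zero-pad minutes because the source does not say whether padding is expected.

```python
from datetime import timedelta


def duration_say_as(d: timedelta) -> str:
    """Render a non-negative duration as an h:m <say-as> element."""
    if d < timedelta(0):
        raise ValueError("duration must not be negative")
    total_minutes = int(d.total_seconds()) // 60  # leftover seconds are dropped
    hours, minutes = divmod(total_minutes, 60)
    return (
        '<say-as interpret-as="duration" format="h:m">'
        f"{hours}:{minutes}</say-as>"
    )


print(duration_say_as(timedelta(hours=5, minutes=30)))
# <say-as interpret-as="duration" format="h:m">5:30</say-as>
```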
The <voice> tag allows you to use more than one voice in a single SSML request. In the following example, the default voice is an English male voice. All words are synthesized in this voice except for "qu'est-ce qui t'amène ici", which is verbalized in French using a female voice instead of the default language (English) and gender (male).
<speak>And then she asked, <voice language="fr-FR" gender="female">qu'est-ce qui t'amène ici</voice><break time="250ms"/> in her sweet and gentle voice.</speak>
Alternatively, you can use a <voice> tag to specify an individual voice (the voice name on the supported voices and languages page) rather than specifying a language and gender:
<speak>The dog is friendly <voice name="fr-CA-Wavenet-B">mais le chat est mignon</voice><break time="250ms"/> said a pet shop owner</speak>
When you use the <voice> tag, Actions on Google expects to receive either a name (the name of the voice you want to use) or a combination of the following attributes. All three attributes are optional, but you must provide at least one if you don't provide a name.
- gender: One of the supported gender values, for example "male" or "female".
- variant: Used as a tiebreaker in cases where more than one voice matches your configuration.
- language: Your desired language. Only one language can be specified in a given <voice> tag. Specify your language in BCP-47 format. You can find the BCP-47 code for your language in the language code column on the supported voices and languages page.
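The rule about name versus the other attributes can be enforced before a request is sent. The Python sketch below is one possible way to do that. The helper name `voice` is hypothetical, and treating name as mutually exclusive with the other attributes is an assumption of this sketch and not something the platform is documented to require.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def voice(text: str, name: Optional[str] = None,
          language: Optional[str] = None, gender: Optional[str] = None,
          variant: Optional[str] = None) -> str:
    """Wrap text in a <voice> tag selected by name or by attributes."""
    selectors = {"language": language, "gender": gender, "variant": variant}
    chosen = {k: v for k, v in selectors.items() if v is not None}
    if name is not None:
        if chosen:
            # Assumption: name and the other attributes are not combined.
            raise ValueError("provide either name or other attributes, not both")
        chosen = {"name": name}
    elif not chosen:
        raise ValueError("provide name or at least one of language, gender, variant")
    attrs = " ".join(f"{k}={quoteattr(str(v))}" for k, v in chosen.items())
    return f"<voice {attrs}>{escape(text)}</voice>"


print(voice("qu'est-ce qui t'amène ici", language="fr-FR", gender="female"))
print(voice("mais le chat est mignon", name="fr-CA-Wavenet-B"))
```

Calling `voice("bonjour")` with no selector raises a ValueError, which mirrors the requirement that at least one attribute is present when no name is given.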
You can also control the relative priority of each of the gender, variant, and language attributes using two additional tags:
- required: If an attribute is designated as required and not configured properly, the request fails.
- ordering: Any attributes listed after an ordering tag are considered preferred attributes rather than required. The SSML considers preferred attributes on a best effort basis in the order they are listed after the ordering tag. If any preferred attributes are configured incorrectly, Actions on Google might still return a valid voice but with the incorrect configuration dropped.
Examples of configurations using the required and ordering tags:
<speak>And there it was <voice language="en-GB" gender="male" required="gender" ordering="gender language">a flying bird </voice>roaring in the skies for the first time.</speak>
<speak>Today is supposed to be <voice language="en-GB" gender="female" ordering="language gender">Sunday Funday.</voice></speak>
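A small builder can catch inconsistent required and ordering lists early, for example a list that names an attribute you never set. The Python sketch below is an illustration only. The helper name `prioritized_voice` is hypothetical, and writing several names into required as a space-separated list is an assumption modeled on how ordering is written in the examples above.

```python
from typing import Dict, Sequence
from xml.sax.saxutils import escape, quoteattr

SELECTORS = ("gender", "variant", "language")


def prioritized_voice(text: str, attrs: Dict[str, str],
                      required: Sequence[str] = (),
                      ordering: Sequence[str] = ()) -> str:
    """Build a <voice> tag with required and ordering lists."""
    for key in attrs:
        if key not in SELECTORS:
            raise ValueError(f"unknown attribute: {key!r}")
    for label, names in (("required", required), ("ordering", ordering)):
        for n in names:
            if n not in attrs:
                raise ValueError(f"{label} names {n!r}, which has no value")
    parts = [f"{k}={quoteattr(v)}" for k, v in attrs.items()]
    if required:
        parts.append(f"required={quoteattr(' '.join(required))}")
    if ordering:
        parts.append(f"ordering={quoteattr(' '.join(ordering))}")
    return f"<voice {' '.join(parts)}>{escape(text)}</voice>"


print(prioritized_voice(
    "a flying bird ",
    {"language": "en-GB", "gender": "male"},
    required=["gender"],
    ordering=["gender", "language"],
))
```

The call reproduces the <voice> element from the first example above.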
You can use <lang> to include text in multiple languages within the same SSML request. All languages are synthesized in the same voice unless you use the <voice> tag to explicitly change the voice. The xml:lang string must contain the target language in BCP-47 format (this value is listed as "language code" in the supported voices table). In the following example, "chat" is verbalized in French instead of the default language (English):
<speak>The French word for cat is <lang xml:lang="fr-FR">chat</lang></speak>
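When the foreign-language span comes from application data, it helps to escape the text and confirm that the result is well-formed XML. The Python sketch below shows one way to do this. The helper names `lang` and `speak` are hypothetical, and the regular expression is only a loose shape check for language tags, not a full BCP-47 validator.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Loose shape check: a 2-3 letter language subtag plus optional subtags.
TAG_SHAPE = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$")


def lang(text: str, code: str) -> str:
    """Wrap text in <lang xml:lang="..."> (hypothetical helper)."""
    if not TAG_SHAPE.match(code):
        raise ValueError(f"not shaped like a BCP-47 tag: {code!r}")
    return f"<lang xml:lang={quoteattr(code)}>{escape(text)}</lang>"


def speak(*parts: str) -> str:
    """Join already-escaped fragments into one <speak> request."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(escape("The French word for cat is "), lang("chat", "fr-FR"))
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

The well-formedness check only validates the XML structure. It does not tell you whether the platform supports the language combination.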
The Actions on Google platform supports the <lang> tag on a best effort basis. Not all language combinations produce the same quality results if specified in the same SSML request. In some cases, a language combination might produce an effect that is detectable but subtle, or that is perceived as negative. Known issues:
- Japanese with Kanji characters is not supported by the <lang> tag. The input is transliterated and read as Chinese characters.
- Semitic languages such as Arabic and Hebrew, as well as Persian, are not supported by the <lang> tag and result in silence. If you want to use any of these languages, we recommend using the <voice> tag to switch to a voice that speaks your desired language (if available).
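The recommendation in the last known issue can be encoded as a simple fallback. The Python sketch below is an illustration under stated assumptions. The helper name `foreign_span` is hypothetical, the set of affected languages is limited to the three named above (identified by their ISO 639-1 subtags ar, he, and fa), and the sketch does not check whether a matching voice is actually available.

```python
from xml.sax.saxutils import escape, quoteattr

# Primary subtags for Arabic, Hebrew, and Persian, the languages that the
# known issues above say produce silence inside <lang>.
SILENT_WITH_LANG = {"ar", "he", "fa"}


def foreign_span(text: str, code: str) -> str:
    """Use <voice> for languages that <lang> cannot speak, else <lang>."""
    primary = code.split("-")[0].lower()
    if primary in SILENT_WITH_LANG:
        return f"<voice language={quoteattr(code)}>{escape(text)}</voice>"
    return f"<lang xml:lang={quoteattr(code)}>{escape(text)}</lang>"


print(foreign_span("chat", "fr-FR"))
print(foreign_span("שלום", "he-IL"))
```

The first call keeps the <lang> tag, while the second switches to a <voice> tag with the language attribute.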