AMAI TTS provides realistic, real-time, multilingual, multispeaker speech synthesis with customizable emotions.
Request examples:
{"format": "ogg",
"data": [
{"type": "text",
"lang": "en",
"speaker": "Elias",
"data": [{"text": "A stellarator is a machine that uses magnetic fields to confine plasma in the shape of a donut, called a torus. These magnetic fields allow scientists to control the plasma particles and create the right conditions for fusion reactions. Stellarators use extremely strong electromagnets to generate twisting magnetic fields that wrap the long way around the donut shape.",
"emotion": [9]} ] } ] }
POST /synth HTTP/1.1
{
"format": "wav",
"data": [
{
"type": "text",
"lang": "ru",
"speaker": "Michael",
"data": [
{
"pauseBefore": 4000,
"text": "Токамак (тороидальная камера с магнитными катушками) — тороидальная установка для магнитного удержания плазмы с целью достижения условий, необходимых для протекания управляемого термоядерного синтеза. ",
"emotion": "флирт"
} ] } ] }
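The request bodies above can also be assembled programmatically. The following is a minimal sketch, not an official client: the base URL, the Content-Type header, and the helper names build_request and synthesize are assumptions made for illustration. Only the /synth route and the JSON field names come from the examples above.

```python
import json
import urllib.request

# Hypothetical base URL: replace with the address of your AMAI TTS deployment.
BASE_URL = "http://localhost:8080"


def build_request(text, lang, speaker, audio_format="wav",
                  emotion="default", pause_before=None):
    """Build a /synth request body containing a single text segment."""
    segment = {"text": text, "emotion": emotion}
    if pause_before is not None:
        segment["pauseBefore"] = pause_before
    return {
        "format": audio_format,
        "data": [
            {"type": "text", "lang": lang, "speaker": speaker, "data": [segment]}
        ],
    }


def synthesize(body, base_url=BASE_URL):
    """POST the body to /synth and return the raw response bytes."""
    request = urllib.request.Request(
        base_url + "/synth",
        # ensure_ascii=False keeps Cyrillic text readable in the payload.
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Build (but do not send) a request equivalent in shape to the first example.
body = build_request("Hello from the stellarator.", "en", "Elias", audio_format="ogg")
print(json.dumps(body, indent=2))
```

Calling synthesize(body) would send the request; it is left out of the sketch so that the block runs without a server.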
The hosted version provides speakers for both Russian and English. Select the language by setting the lang field to "en" or "ru". Example: "lang": "en"
Currently, multiple speakers are only available for English.
The speaker field selects the speaker. Example: "speaker": "Elias"
List of supported English speakers:
Elias
Drakula
katrin
Vuk-A
Vuk-B
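A client may want to catch an unsupported language or speaker before sending a request. The sketch below is one way to do that; the function name check_voice is hypothetical, and the assumption that speaker names are case-sensitive is not confirmed by this document. Russian speaker names are not listed here, so they are not validated.

```python
# Values copied from this document.
SUPPORTED_LANGS = {"en", "ru"}
ENGLISH_SPEAKERS = {"Elias", "Drakula", "katrin", "Vuk-A", "Vuk-B"}


def check_voice(lang, speaker):
    """Return the lang/speaker fields, raising ValueError on unknown values."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported lang: {lang!r}")
    # Assumption: names are matched exactly as written in the list above.
    if lang == "en" and speaker not in ENGLISH_SPEAKERS:
        raise ValueError(f"unknown English speaker: {speaker!r}")
    return {"lang": lang, "speaker": speaker}


print(check_voice("en", "Elias"))
```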
The format field sets the format of the streamed audio.
pcm - default format
PCMA, PCMU - G.711 codecs, only available when using the ssml route.
wav, mp3, ogg
For wav, mp3, and ogg, TTS waits until all parts of the request are synthesized and then sends them back merged and converted (this is planned to be expanded with real-time options in future releases).
The pauseBefore and pauseAfter fields set pauses before and after the synthesized text.
Stress is marked by inserting "+" before the stressed vowel in the text field.
For example: "text": "Я люблю м+ороженое".
Notice: there is a small group of words for which explicitly marked stress is ignored.
Important notice: stress markup is ignored when using an English speaker.
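Stress markup can be added with a small helper. This is a sketch only: the function name mark_stress and the vowel set are assumptions for illustration, and the helper only inserts the "+" character as described above.

```python
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")


def mark_stress(word, vowel_number):
    """Insert '+' before the vowel_number-th vowel of word (1-based)."""
    seen = 0
    for index, char in enumerate(word):
        if char in RUSSIAN_VOWELS:
            seen += 1
            if seen == vowel_number:
                return word[:index] + "+" + word[index:]
    raise ValueError(f"{word!r} has fewer than {vowel_number} vowels")


# Reproduces the markup used in the example above.
print("Я люблю " + mark_stress("мороженое", 1))
```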
The emotion field sets the emotion of the voiced text. It can be given as any of the following synonyms:
0. "флирт", "любовь", "flirting", "love", "flirt", "lubov"
1. "грусть", "печаль", "sadness", "sorrow", "melancholy", "grust", "pechal"
2. "любопытство", "интерес", "curiosity", "interest", "lyupopytstvo", "interes"
3. "отвращение", "disgust", "aversion", "revulsion", "otvrashchenie", "презрение", "contempt", "scorn", "prezrenie"
4. "радость", "счастье", "joy", "happiness", "radost", "schastye"
5. "разочарование", "disappointment", "disillusionment", "frustration", "razocharovanie"
6. "страх", "fear", "strah", "испуг", "fright", "consternation", "funk", "ispug"
7. "удивление", "astonishment", "surprise", "wonder", "udivlenie"
8. "злость", "anger", "wrath", "rage"
9. "default", "умолч", "по умолч"
Default sample rate is 22050 Hz.
Sample rate control is only available in the self-hosted version via the ssml route.
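Raw pcm output has no header, so most players need it wrapped in a container before playback. The sketch below wraps PCM bytes in a WAV header using Python's standard wave module. Only the 22050 Hz default comes from this document; the 16-bit sample width and mono channel count are assumptions and should be checked against the actual output of your deployment. The function name pcm_to_wav is hypothetical.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate=22050, sample_width=2, channels=1):
    """Wrap raw PCM bytes in a WAV container and return the WAV bytes.

    sample_width=2 (16-bit) and channels=1 (mono) are assumptions.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


# One second of 16-bit mono silence stands in for a real pcm response.
silence = b"\x00\x00" * 22050
wav_bytes = pcm_to_wav(silence)
print(len(wav_bytes))
```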
The speed of the voice can be adjusted by adding the speed field. The default value is 1.
When using real-time synthesis, there is a time window, in ms, for synthesizing. The larger the time window, the better the synthesis quality. It can be adjusted by adding the first-chunk-latency-threshold field; the default value is 400.