Emotional Text-to-Speech

FREEMIUM

By Amai AI Interaction Corp | Updated 25 days ago | Media

README

AMAI on-premise text-to-speech engine

AMAI TTS provides realistic, real-time, multilingual, multi-speaker speech synthesis with customizable emotions.

Usage

HTTP request

Request examples:

{"format": "ogg",
 "data": [
   {"type": "text",
    "lang": "en",
    "speaker": "Elias",
    "data": [
      {"text": "A stellarator is a machine that uses magnetic fields to confine plasma in the shape of a donut, called a torus. These magnetic fields allow scientists to control the plasma particles and create the right conditions for fusion reactions. Stellarators use extremely strong electromagnets to generate twisting magnetic fields that wrap the long way around the donut shape.",
       "emotion": [9]}
    ]}
 ]}

POST /synth HTTP/1.1

{
  "format": "wav",
  "data": [
    {
      "type": "text",
      "lang": "ru",
      "speaker": "Michael",
      "data": [
        {
          "pauseBefore": 4000,
          "text": "Токамак (тороидальная камера с магнитными катушками) — тороидальная установка для магнитного удержания плазмы с целью достижения условий, необходимых для протекания управляемого термоядерного синтеза. ",
          "emotion": "флирт"
        }
      ]
    }
  ]
}
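
As a rough illustration of the request shape shown above, the following Python sketch builds a body for the /synth route and prepares a POST using only the standard library. The base URL, the helper names, and the Content-Type header are assumptions made for illustration; they are not taken from the AMAI documentation.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the address of your own deployment.
BASE_URL = "http://localhost:8080"


def build_synth_body(text, lang="en", speaker="Elias",
                     emotion="default", audio_format="ogg"):
    """Build a request body mirroring the examples in this README."""
    return {
        "format": audio_format,
        "data": [
            {
                "type": "text",
                "lang": lang,
                "speaker": speaker,
                "data": [{"text": text, "emotion": emotion}],
            }
        ],
    }


def build_synth_request(body):
    """Prepare (but do not send) a POST to the /synth route."""
    payload = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/synth",
        data=payload,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


if __name__ == "__main__":
    body = build_synth_body("Hello from the stellarator.", emotion="joy")
    request = build_synth_request(body)
    # Sending needs a reachable server, so it is left commented out:
    # with urllib.request.urlopen(request) as response:
    #     audio = response.read()
    print(request.full_url)
```

Keeping body construction separate from sending makes it possible to inspect or log the JSON before any network call is made.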

Languages

The hosted version provides speakers for both Russian and English. Select the language by setting the lang field to "en" or "ru". Example: "lang": "en"

Speaker

Multiple speakers are currently available only for English.

The speaker field selects the speaker. Example: "speaker": "Elias"

List of supported English speakers:

Elias
Drakula
katrin
Vuk-A
Vuk-B
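
A client may want to catch an unknown speaker name before sending a request. The sketch below is a minimal client-side check against the list above; the function name is hypothetical, and whether the service treats speaker names as case-sensitive is not documented, so the comparison here is exact by assumption.

```python
# Speaker names copied verbatim from the list above.
ENGLISH_SPEAKERS = ("Elias", "Drakula", "katrin", "Vuk-A", "Vuk-B")


def is_known_speaker(lang, speaker):
    """Return True if the speaker is listed for English.

    No speaker list is documented for other languages, so this
    check accepts any name when lang is not "en".
    """
    if lang != "en":
        return True
    # Exact, case-sensitive match (an assumption, see the note above).
    return speaker in ENGLISH_SPEAKERS
```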

Format

The format of the streamed audio.

Real-time formats

pcm - default format
PCMA, PCMU - G.711 codecs, available only when using the ssml route.

Non-real-time formats

wav, mp3, ogg

For these formats, the TTS engine waits until all parts of the request are synthesized, then sends them back merged and converted. Real-time options for these formats are planned for future releases.
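
The split between real-time and non-real-time formats matters for client code: the former can be consumed as the audio arrives, the latter only after the whole response is ready. A minimal sketch of that decision follows; the function name is hypothetical and the format spellings are copied from the lists above.

```python
# Format names as spelled in this README.
REAL_TIME_FORMATS = {"pcm", "PCMA", "PCMU"}
NON_REAL_TIME_FORMATS = {"wav", "mp3", "ogg"}


def is_real_time(audio_format):
    """Return True for streamed formats, False for merged-and-converted ones."""
    if audio_format in REAL_TIME_FORMATS:
        return True
    if audio_format in NON_REAL_TIME_FORMATS:
        return False
    raise ValueError(f"format not listed in the README: {audio_format!r}")
```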

pauseBefore and pauseAfter

The pauseBefore and pauseAfter fields insert pauses before and after the synthesized fragment.

Stress

Mark stress by inserting "+" before the stressed vowel in the text field.

For example: "text": "Я люблю м+ороженое".

Note: for a small group of words, explicitly marked stress is ignored.

Important: stress markup is ignored when using an English speaker.
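
The markup rule can be applied programmatically. The sketch below inserts "+" before the n-th vowel of a Russian word; the function name and the vowel set are illustrative assumptions, not part of the AMAI API.

```python
# Russian vowel letters, lower and upper case.
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")


def mark_stress(word, vowel_number):
    """Insert '+' before the vowel_number-th vowel (1-based) of word."""
    seen = 0
    for index, char in enumerate(word):
        if char in RUSSIAN_VOWELS:
            seen += 1
            if seen == vowel_number:
                return word[:index] + "+" + word[index:]
    raise ValueError(f"{word!r} has fewer than {vowel_number} vowels")
```

For instance, mark_stress("мороженое", 1) produces the marked word used in the example above.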

Emotion

The emotion field sets the emotion of the voiced text. Each emotion can be given as any of its synonyms listed below.

0. "флирт", "любовь", "flirting", "love", "flirt", "lubov"
1. "грусть", "печаль", "sadness", "sorrow", "melancholy", "grust", "pechal"
2. "любопытство", "интерес", "curiosity", "interest", "lyupopytstvo", "interes"
3. "отвращение", "disgust", "aversion", "revulsion", "otvrashchenie", "презрение", "contempt", "scorn", "prezrenie"
4. "радость", "счастье", "joy", "happiness", "radost", "schastye"
5. "разочарование", "disappointment", "disillusionment", "frustration", "razocharovanie"
6. "страх", "fear", "strah", "испуг", "fright", "consternation", "funk", "ispug"
7. "удивление", "astonishment", "surprise", "wonder", "udivlenie"
8. "злость", "anger", "wrath", "rage"
9. "default", "умолч", "по умолч"
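
Since several spellings map to the same emotion, a client-side lookup table can help normalize user input. The sketch below copies the synonyms verbatim from the list above and resolves a synonym to its list number; the names are hypothetical, and treating the list number as the emotion's numeric index is an assumption based on the numbering shown here.

```python
# Synonyms copied verbatim from the README list, keyed by list number.
EMOTION_SYNONYMS = {
    0: ("флирт", "любовь", "flirting", "love", "flirt", "lubov"),
    1: ("грусть", "печаль", "sadness", "sorrow", "melancholy", "grust", "pechal"),
    2: ("любопытство", "интерес", "curiosity", "interest", "lyupopytstvo", "interes"),
    3: ("отвращение", "disgust", "aversion", "revulsion", "otvrashchenie",
        "презрение", "contempt", "scorn", "prezrenie"),
    4: ("радость", "счастье", "joy", "happiness", "radost", "schastye"),
    5: ("разочарование", "disappointment", "disillusionment", "frustration",
        "razocharovanie"),
    6: ("страх", "fear", "strah", "испуг", "fright", "consternation", "funk", "ispug"),
    7: ("удивление", "astonishment", "surprise", "wonder", "udivlenie"),
    8: ("злость", "anger", "wrath", "rage"),
    9: ("default", "умолч", "по умолч"),
}

# Reverse lookup: synonym -> list number.
EMOTION_INDEX = {
    name: number
    for number, names in EMOTION_SYNONYMS.items()
    for name in names
}


def emotion_number(name):
    """Return the list number for a synonym, or raise KeyError if unknown."""
    return EMOTION_INDEX[name]
```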

Sample rate

The default sample rate is 22050 Hz.

Sample rate control is available only in the self-hosted version, via the ssml route.

Time Dilation

The time dilation of the voice can be adjusted by adding a speed field. The default value is 1.

Synthesis threshold

When using real-time synthesis, there is a time window, in milliseconds, for synthesis. The larger the window, the better the synthesis quality. Adjust it by adding the first-chunk-latency-threshold field; the default value is 400.
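
The speed and first-chunk-latency-threshold fields are optional, so a small helper can add them only when a non-default value is wanted. This README does not say at which level of the request these fields belong, so the sketch below (with a hypothetical function name) simply adds them to whichever JSON object the caller passes in.

```python
def with_tuning(target, speed=None, first_chunk_latency_threshold=None):
    """Return a copy of target with the optional tuning fields added.

    target is any request object (a dict); where the service expects
    these fields is not documented here, so the caller decides.
    """
    updated = dict(target)  # shallow copy; the input is left unchanged
    if speed is not None:
        updated["speed"] = speed
    if first_chunk_latency_threshold is not None:
        updated["first-chunk-latency-threshold"] = first_chunk_latency_threshold
    return updated
```

Omitting both arguments returns an unchanged copy, which leaves the documented defaults (1 and 400) in effect.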
