The Audio API provides a text-to-speech endpoint, `speech`, based on our TTS (text-to-speech) model. It comes with six built-in voices and can be used to:
Narrate a written blog post
Produce spoken audio in multiple languages
Give real-time audio output using streaming
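With streaming, audio bytes arrive in chunks as they are generated, so you can begin saving or playing audio before synthesis finishes. A minimal sketch of writing a chunked response to disk (the stand-in chunks here substitute for a real streamed HTTP response body):

```python
import os
import tempfile

def save_audio_stream(chunks, path):
    """Write an iterable of audio byte chunks to a file as they arrive."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
    return written

# Stand-in chunks; with a real streamed HTTP response you would iterate
# the response body instead (e.g. response.iter_content(chunk_size=4096)
# when using the requests library).
fake_chunks = [b"ID3", b"\x00" * 4096, b"\x00" * 4096]
out_path = os.path.join(tempfile.gettempdir(), "speech.mp3")
total_bytes = save_audio_stream(fake_chunks, out_path)
```

Because each chunk is flushed to the file as it is received, the same pattern works for piping audio to a player in real time.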
The speech endpoint takes in three key inputs: the model name, the text that should be turned into audio, and the voice to be used for the audio generation. A simple request would look like the following:
```json
{
  "model": "tts-1",
  "input": "Today is a wonderful day to build something people love!",
  "voice": "alloy"
}
```
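Assuming an endpoint of `https://api.openai.com/v1/audio/speech` with bearer-token authentication (check the API reference for the exact path and headers), a minimal Python sketch of sending this request might look like:

```python
import json
import urllib.request

# Assumed endpoint and auth scheme; verify against the API reference.
API_URL = "https://api.openai.com/v1/audio/speech"

def build_speech_request(text, voice="alloy", model="tts-1", api_key="YOUR_API_KEY"):
    """Build an HTTP request from the three key inputs: model, input text, and voice."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_speech_request("Today is a wonderful day to build something people love!")
# To actually generate audio, send the request and save the binary response:
# with urllib.request.urlopen(req) as resp, open("speech.mp3", "wb") as f:
#     f.write(resp.read())
```

The response body is raw audio bytes (MP3 by default), so it should be written to a file in binary mode rather than decoded as text.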
Experiment with different voices (alloy, echo, fable, onyx, nova, and shimmer) to find one that matches your desired tone and audience. The current voices are optimized for English.
The default response format is `mp3`, but other formats like `opus`, `aac`, or `flac` are available.
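When requesting a non-default format, it helps to save the response with a matching file extension. A small illustrative helper (the format list simply mirrors the formats named above):

```python
# Response formats named above, mapped to conventional file extensions.
FORMAT_EXTENSIONS = {
    "mp3": ".mp3",
    "opus": ".opus",
    "aac": ".aac",
    "flac": ".flac",
}

def output_filename(stem, response_format="mp3"):
    """Pick an output filename for the audio response; mp3 is the default."""
    try:
        return stem + FORMAT_EXTENSIONS[response_format]
    except KeyError:
        raise ValueError(f"unsupported response format: {response_format}")

name = output_filename("speech", "flac")  # "speech.flac"
```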
The TTS model generally follows the Whisper model in terms of language support. Whisper supports the following languages and performs well despite the current voices being optimized for English:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
You can generate spoken audio in these languages by providing the input text in the language of your choice.
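Since the language is inferred from the input text itself, a non-English request only changes the `input` field. For example, a Spanish request body (same payload shape as the earlier example; the translation is illustrative):

```json
{
  "model": "tts-1",
  "input": "¡Hoy es un día maravilloso para construir algo que la gente ame!",
  "voice": "alloy"
}
```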