The Language Confidence Pronunciation Assessment API allows you to generate a detailed pronunciation report for a recording of English speech. The API is powered by state-of-the-art deep learning models and allows you to build English learning or testing applications that are fully automated and scalable.
To use the Pronunciation API, you will need to authorize your requests using your unique secret API key.
Your API keys are managed via the RapidAPI service, and a default key should be created automatically when you first set up your account.
For more details on how to manage your API keys, you can view the official RapidAPI docs here.
To authorize your requests, you then simply need to include your API key in the x-rapidapi-key header of each request.
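As a minimal sketch in Python, the header could be built as follows. Only the x-rapidapi-key header name comes from the text above; the helper name build_headers and the environment variable RAPIDAPI_KEY are assumptions chosen for illustration. Reading the key from the environment keeps it out of your source code.

```python
import os


def build_headers(api_key=None):
    """Return HTTP headers carrying the API key.

    RAPIDAPI_KEY is an assumed environment variable name, used here
    only so the key does not have to be hard-coded in source files.
    """
    key = api_key or os.environ.get("RAPIDAPI_KEY")
    if not key:
        raise ValueError("No API key given and RAPIDAPI_KEY is not set")
    # The header name is the one documented above.
    return {"x-rapidapi-key": key}
```

The returned dictionary can be passed as the headers argument of whichever HTTP client you use.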
Your API key is susceptible to being stolen and misused if you do not take precautions to store it securely. Here are some guidelines:
For your API key to work, you will need to subscribe to a pricing plan in the Pricing tab here. We offer a free trial plan with limited requests for you to get up and running.
Once you are set up and integrated with the API, you can subscribe to one of our paid plans. If none of the plans fits your needs, you can contact us at email@example.com.
Our API currently accepts the following audio specifications:
Audio Formats supported
Sample rate: 16 kHz or above
Bit depth: 16-bit or above
Number of audio channels: Mono or Stereo
Audio length: 50 seconds maximum, but we recommend keeping the audio under 20 seconds for optimal accuracy.
Our API will accept any of the formats documented above for ease of use, but for optimal performance and latency we strongly recommend that you send us your audio in the following format:
Sample rate: 16 kHz
Number of audio channels: Mono
Audio length: For best accuracy and latency, we recommend that you assess sentences individually and keep each recording under 20 seconds.
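The limits above can be checked before sending a recording. The sketch below uses Python's standard wave module and therefore assumes the recording is an uncompressed WAV file (WAV being an accepted format is itself an assumption here). The function name check_wav and the shape of the returned dictionary are illustrative, not part of the API.

```python
import wave

MIN_SAMPLE_RATE = 16000   # Hz; also the recommended rate
MIN_BITS = 16             # bits per sample
MAX_SECONDS = 50          # hard limit on audio length
RECOMMENDED_SECONDS = 20  # recommended upper bound


def check_wav(source):
    """Inspect a WAV file (path or binary file object) against the limits."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
        seconds = wav.getnframes() / rate

    problems = []     # violations of the accepted specifications
    suggestions = []  # deviations from the recommended format

    if rate < MIN_SAMPLE_RATE:
        problems.append("sample rate is below 16 kHz")
    elif rate != MIN_SAMPLE_RATE:
        suggestions.append("resample to 16 kHz")

    if bits < MIN_BITS:
        problems.append("bit depth is below 16 bits")

    if channels not in (1, 2):
        problems.append("audio must be mono or stereo")
    elif channels == 2:
        suggestions.append("downmix to mono")

    if seconds > MAX_SECONDS:
        problems.append("audio is longer than 50 seconds")
    elif seconds >= RECOMMENDED_SECONDS:
        suggestions.append("keep the recording under 20 seconds")

    return {
        "sample_rate": rate,
        "channels": channels,
        "bits": bits,
        "seconds": seconds,
        "problems": problems,
        "suggestions": suggestions,
    }
```

A recording with an empty problems list meets the accepted specifications; an empty suggestions list means it also matches the recommended format.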
To send us your audio for assessment, you will need to encode your recording as a base64 string. Base64 represents the binary data of your audio as text so that it can be sent in an HTTP request.
Most programming languages have a built-in base64 encoding/decoding library. For testing purposes you can use this online converter: https://base64.guru/converter/encode/audio.
Most Bash environments also provide a base64 command-line utility.
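For illustration, base64 encoding in Python needs only the standard library. The helper names below are hypothetical; how the resulting string is placed in the request body is not covered here.

```python
import base64


def encode_audio_bytes(audio_bytes):
    """Encode raw audio bytes as an ASCII base64 string."""
    return base64.b64encode(audio_bytes).decode("ascii")


def encode_audio_file(path):
    """Read an audio file in binary mode and return its base64 string."""
    with open(path, "rb") as f:
        return encode_audio_bytes(f.read())
```

Opening the file in binary mode ("rb") matters: reading audio as text would corrupt the data before it is encoded.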
If you are building a web-based application, you will need a solution to record the user's audio through their browser.
Here are some libraries we recommend to get you started:
The Pronunciation API outputs a JSON object containing the AI’s pronunciation report for the given recording. We will go over each item in the report:
score: Overall pronunciation score for the audio content.
accent_predictions: Contains a predicted accent score for American (US), British (UK), and Australian (AU) accents. The percentage score indicates which accent the speaker sounds most like. This is useful if the user is targeting a specific accent in their pronunciation.
score_predictions: Contains predictions for common official English tests, currently pte_general. This represents an estimate of what your pronunciation would score on those standardized tests.
words: Contains the list of words in the expected content. Each word object contains:
word.label: The label for the given word. e.g: apple.
word.score: Pronunciation score at the word level.
word.syllables: Contains the list of syllables in the word. Each syllable object contains:
syllable.label: Contains the CMU label for the syllable.
syllable.label_ipa: Contains the IPA label for the syllable.
syllable.score: Pronunciation score at the syllable level.
syllable.phones: Contains the list of phonemes in the syllable.
word.phones: Contains the list of phonemes in the word. Each phone object contains:
phone.label: The CMU label for the phoneme. You can see the full list of CMU phonemes here.
phone.label_ipa: The IPA label for the phoneme. You can see more details on the IPA phone set here.
phone.score: Numeric pronunciation score out of 100 for the phoneme. You can interpret this as a percentage of correctness for the phoneme. A score above 90% represents a very accurate phoneme pronunciation.
phone.confidence: Represents, as a percentage, how confident the AI is in its prediction. Because of the nature of language, the model may be more or less confident in different situations, as certain phones are harder to differentiate in some contexts.
phone.error: A binary pass/fail flag for the phoneme. If true, the phone should be considered erroneous; if false, the phoneme should be considered properly pronounced. We recommend using this binary flag when scoring your phonemes, since a pass/fail result often makes more sense to end users expecting feedback on phone-level pronunciation.
phone.sounds_like: Contains a list of the top 3 phonemes the AI model estimates it heard. Each entry contains the phone label and the model's confidence that the phoneme was pronounced. If the phoneme scores as a pass, the expected phoneme will usually be at the top of the list. If the user made a phone substitution error, however, the substituted phone is likely to be at the top of the list. This is a good opportunity to give useful feedback to your end users, e.g.: You were expected to pronounce
AA but you sounded more like
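The report fields described above could be turned into end-user feedback along these lines. This is a sketch under stated assumptions: the JSON key names (words, phones, label, error, sounds_like) are taken from the field names in this section, but the exact nesting, the keys inside each sounds_like entry (assumed here to be label and confidence), and all values in example_report are hypothetical.

```python
def phoneme_feedback(report):
    """Build one feedback message per phoneme flagged as an error."""
    messages = []
    for word in report.get("words", []):
        for phone in word.get("phones", []):
            if not phone.get("error"):
                continue  # phoneme passed, nothing to report
            expected = phone["label"]
            heard = phone.get("sounds_like") or []
            # Pick the candidate with the highest confidence rather than
            # relying on the list order.
            top = max(heard, key=lambda s: s.get("confidence", 0), default=None)
            if top is not None and top["label"] != expected:
                messages.append(
                    f"In '{word['label']}': expected {expected} "
                    f"but it sounded more like {top['label']}."
                )
            else:
                messages.append(
                    f"In '{word['label']}': {expected} was not pronounced clearly."
                )
    return messages


# Hypothetical report fragment, for illustration only.
example_report = {
    "words": [
        {
            "label": "hot",
            "phones": [
                {"label": "HH", "error": False, "sounds_like": []},
                {
                    "label": "AA",
                    "error": True,
                    "sounds_like": [
                        {"label": "AH", "confidence": 62},
                        {"label": "AA", "confidence": 30},
                    ],
                },
                {"label": "T", "error": False, "sounds_like": []},
            ],
        }
    ]
}

for line in phoneme_feedback(example_report):
    print(line)
```

Using phone.error as the trigger, rather than a threshold on phone.score, follows the pass/fail recommendation given for that field.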