Pronunciation Assessment



Getting started with the Pronunciation Assessment API


The Language Confidence Pronunciation Assessment API generates a detailed pronunciation report for a recording of English speech. The API is powered by state-of-the-art deep learning models and lets you build English learning or testing applications that are fully automated and scalable.

Calling the API

API keys

To use the Pronunciation API you need to authorize your requests with your unique, secret API key.

Your API keys are managed via the RapidAPI service, and a default key should be created automatically when you first set up your account.
For more details on how to manage your API keys, see the official RapidAPI docs.

To authorize a request, include your API key in the x-rapidapi-key header of the request.
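
As a minimal sketch of the authorization step above, the following Python builds a request that carries the key in the x-rapidapi-key header. The endpoint URL and the placeholder key are hypothetical, the request body is left opaque, and nothing is actually sent.

```python
import urllib.request

# Hypothetical placeholder: copy the real endpoint URL from the API's RapidAPI page.
EXAMPLE_URL = "https://example.invalid/pronunciation"


def build_request(url: str, api_key: str, body: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request authorized with a RapidAPI key."""
    request = urllib.request.Request(url, data=body, method="POST")
    # The key travels in the x-rapidapi-key header, as described above.
    request.add_header("x-rapidapi-key", api_key)
    return request


request = build_request(EXAMPLE_URL, "YOUR_API_KEY", b"{}")
# urllib stores header names capitalized, so the lookup uses "X-rapidapi-key".
print(request.get_header("X-rapidapi-key"))
```

In a real application the key would come from secure storage rather than a literal string.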

Storing your API key

Your API key can be stolen and misused if you do not take precautions to store it securely. Here are some guidelines:

  • Do not embed your API key directly in your code or commit it to source control. Store it in an environment variable or another secure store.
  • Rotate your API key if you detect any security breach.
  • Do not store your API key in your application front-end. Preferably, send your requests from your back-end so that your API key is not accessible to users. If you have to send requests from your front-end, make sure that you obfuscate your API key properly.
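
One way to follow these guidelines on a back-end is to read the key from an environment variable at startup. This is a sketch only: the variable name PRONUNCIATION_API_KEY and both helper functions are hypothetical, not part of the API.

```python
import os

# Hypothetical variable name; use whatever name your deployment defines.
API_KEY_ENV_VAR = "PRONUNCIATION_API_KEY"


def load_api_key(env_var: str = API_KEY_ENV_VAR) -> str:
    """Return the API key from the environment, failing fast if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API.")
    return api_key


def auth_headers(api_key: str) -> dict[str, str]:
    """Build the authorization header without hard-coding the key in source."""
    return {"x-rapidapi-key": api_key}
```

Failing fast when the variable is missing makes a misconfigured deployment obvious at startup instead of at the first rejected request.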

Gaining access

For your API key to work, you need to subscribe to a pricing plan in the Pricing tab. We offer a free trial plan with a limited number of requests so that you can get up and running.

Once you are set up and integrated with the API, you can subscribe to one of our paid plans. If none of the plans fits your needs, you can contact us at

Audio requirements

Our API currently accepts the following audio specifications:

Supported audio formats

  • WAV
  • MP3
  • M4A
  • OGG

Sample rate: 16 kHz or higher
Bit depth: 16-bit or higher
Number of audio channels: mono or stereo
Audio length: 50 seconds maximum, but we recommend keeping the audio under 20 seconds for optimal accuracy.

Recommended format

Our API accepts any of the formats listed above for ease of use, but for optimal performance and latency we strongly recommend that you send us your audio in the following format:

Format: WAV
Sample rate: 16 kHz
Bit depth: 16-bit
Number of audio channels: mono
Audio length: for best accuracy and latency, we recommend that you assess sentences individually and keep each recording under 20 seconds.
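
Before uploading, a recording can be checked locally against the recommended format. The sketch below uses Python's standard wave module; the function names and the wording of the messages are illustrative, and the check only applies to WAV files.

```python
import io
import wave

RECOMMENDED_SAMPLE_RATE = 16_000  # 16 kHz
RECOMMENDED_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit
RECOMMENDED_MAX_SECONDS = 20.0


def check_wav(source) -> list[str]:
    """Return the ways a WAV (path or binary file object) deviates from the recommendation."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    problems = []
    if rate != RECOMMENDED_SAMPLE_RATE:
        problems.append(f"sample rate is {rate} Hz, recommended is 16000 Hz")
    if width != RECOMMENDED_SAMPLE_WIDTH:
        problems.append(f"sample width is {8 * width}-bit, recommended is 16-bit")
    if channels != 1:
        problems.append(f"{channels} channels, recommended is mono")
    if seconds >= RECOMMENDED_MAX_SECONDS:
        problems.append(f"length is {seconds:.1f} s, recommended is under 20 s")
    return problems


def make_silent_wav(seconds: float, rate: int = 16_000, channels: int = 1) -> io.BytesIO:
    """Create an in-memory 16-bit silent WAV, used here only to exercise check_wav."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(seconds * rate))
    buffer.seek(0)
    return buffer


print(check_wav(make_silent_wav(1.0)))  # a compliant file yields an empty list
```

A non-empty result does not mean the API will reject the file, only that it differs from the recommended settings.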

Base64 encoding

To send us your audio for assessment, you need to encode your recording as a base64 string. Base64 represents the binary data of your audio as text so that it can be sent in an HTTP request.

Most programming languages have a built-in base64 encoding/decoding library. For testing purposes you can use this online converter:
Most Unix-like systems also provide a base64 command-line utility.
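
In Python, for example, the encoding step can be done with the standard base64 module. The helper names below are illustrative; how the resulting string is placed in the request body depends on the API's request schema, which is not shown here.

```python
import base64
from pathlib import Path


def encode_audio(audio_bytes: bytes) -> str:
    """Encode raw audio bytes as an ASCII base64 string."""
    return base64.b64encode(audio_bytes).decode("ascii")


def encode_audio_file(path: str) -> str:
    """Read an audio file from disk and return its contents as base64 text."""
    return encode_audio(Path(path).read_bytes())


print(encode_audio(b"RIFF"))  # prints UklGRg==
```

Decoding to ASCII yields a plain str, which is what JSON serializers expect.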

Recording audio in the browser

If you are building a web-based application, you will need a way to record the user's audio through their browser.
Here are some libraries we recommend to get you started:

Using the API output

The Pronunciation API outputs a JSON object containing the AI's pronunciation report for the given recording. Each item in the report is described below.

score: Overall pronunciation score for the audio content.

accent_predictions: Contains a predicted accent score for American (US), British (UK), and Australian (AU) accents. The percentage scores indicate which accent the speaker sounds most like. This is useful if the user is targeting a specific accent with their pronunciation.

score_predictions: Contains predictions for common official English tests, currently ielts, toefl, cefr, pte_general. This represents an estimate of what your pronunciation would score on those standardized tests.

words: Contains the list of words in the expected content. Each word object contains:

word.label: The label for the given word. e.g: apple.

word.score: Pronunciation score at the word level.

word.syllables: Contains the list of syllables in the word, each syllable object contains:

syllable.label: Contains the CMU label for the syllable.

syllable.label_ipa: Contains the IPA label for the syllable.

syllable.score: Pronunciation score at the syllable level.

syllable.phones: Contains the list of phonemes in the syllable.

word.phones: Contains the list of phonemes in the word, each phone object contains:

phone.label: The CMU label for the phoneme. You can see the full list of CMU phonemes here

phone.label_ipa: The IPA label for the phoneme. You can see more details on the IPA phone set here

phone.score: Numeric pronunciation score out of 100 for the phoneme. You can interpret this as a percentage of correctness for the phoneme. A score above 90% represents a very accurate phoneme pronunciation.

phone.confidence: Represents, as a percentage, how confident the AI is in its prediction. Because of the nature of language, the model may be more or less confident in different situations, as certain phones are harder to differentiate in some contexts.

phone.error: Represents a binary pass/fail for the phoneme. If true, the phone should be considered erroneous; if false, the phoneme should be considered properly pronounced. We recommend using this binary option when scoring phonemes, because a pass/fail often makes more sense to end users expecting feedback on phone-level pronunciation.

phone.sounds_like: Contains a list of the top 3 phonemes the AI model estimates it heard. Each entry contains the phone label and the model's confidence that the phoneme was pronounced. If the phoneme scores as a pass, the expected phoneme will usually be at the top of the list. If the user made a phone substitution error, the substituted phone is likely to be at the top of the list instead. This is a good opportunity to give useful feedback to your end users, for example: You were expected to pronounce AA but you sounded more like AH.
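
The fields above can be combined into learner feedback. The following is a rough sketch rather than a definitive implementation: the sample report is hypothetical and heavily trimmed, and the exact nesting and the label and confidence keys inside each sounds_like entry are assumptions based on the descriptions above.

```python
import json

# Hypothetical, trimmed report built only from the fields described above.
# The key names inside sounds_like entries are assumptions.
SAMPLE_REPORT = json.loads("""
{
  "score": 72,
  "words": [
    {
      "label": "hot",
      "score": 65,
      "phones": [
        {"label": "HH", "score": 95, "error": false,
         "sounds_like": [{"label": "HH", "confidence": 93}]},
        {"label": "AA", "score": 40, "error": true,
         "sounds_like": [{"label": "AH", "confidence": 71},
                         {"label": "AA", "confidence": 20}]},
        {"label": "T", "score": 91, "error": false,
         "sounds_like": [{"label": "T", "confidence": 90}]}
      ]
    }
  ]
}
""")


def phone_feedback(report: dict) -> list[str]:
    """Turn phones flagged with error=true into short feedback messages."""
    messages = []
    for word in report.get("words", []):
        for phone in word.get("phones", []):
            if not phone.get("error"):
                continue  # the binary pass/fail flag decides what is reported
            expected = phone["label"]
            heard = phone.get("sounds_like") or []
            top = heard[0]["label"] if heard else None
            if top and top != expected:
                messages.append(
                    f'In "{word["label"]}": expected {expected} but it sounded more like {top}.'
                )
            else:
                messages.append(f'In "{word["label"]}": {expected} was not pronounced clearly.')
    return messages


for message in phone_feedback(SAMPLE_REPORT):
    print(message)  # prints: In "hot": expected AA but it sounded more like AH.
```

Using phone.error rather than a score threshold follows the recommendation above to present phone-level results as pass/fail.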