The Yarn `/vocabulate` endpoint returns a JSON response with three top-level fields:

`page`
- object containing the response for the vocabulated page

`averages`
- object containing average data across all pages vocabulated by Yarn

`leaderboards`
- object containing two leaderboards: one all-time (`total`) and one covering a given number of days back (`daysBack`)

####page
`page.overAllScore`
- number in the 1-100 range indicating the difficulty of the given page.
It is derived from the average difficulty score of the words on the page (based on the Yarn 100k word difficulty index),
excluding the most common words and taking into account the overall scores of other already indexed pages.
Sample overall score ranges interpretation:
`page.article`
- array containing each word from the main content of the processed page in natural reading order,
useful for presenting page content with a difficulty level applied to each word.

`page.article[].word`
- string - the word as it appears in the main page content

`page.article[].difficultyScore`
- number in the 1-100 range based on the Yarn 100k word difficulty index

`page.article[].category`
- enum - one of ‘easy’, ‘medium’ or ‘hard’

`page.wordsCategorized`
- object with 3 fields: `easy`, `medium` and `hard`, each of which is an array
containing the page words that belong to the given difficulty category, sorted by difficulty in descending order.
Each array item is an object with the following fields:

`page.wordsCategorized{category}[].word`
- string - the word as it appears in the main page content

`page.wordsCategorized{category}[].difficultyScore`
- number in the 1-100 range based on the Yarn 100k word difficulty index

`page.wordsCategorized{category}[].occurrencesCount`
- number of times the given word appears in the main page content

`page.difficultyCategories`
- object with 3 fields: `easy`, `medium` and `hard`, each containing a `percentage` number
that indicates what percentage of the page content words belong to the given difficulty category.
For example, `page.difficultyCategories.easy.percentage = 12` means that 12% of the words on the given page are easy.
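
To make the relationship between `page.article`, `page.wordsCategorized` and `page.difficultyCategories` more concrete, here is a minimal Python sketch that rebuilds the latter two from an article array. It is not Yarn's implementation. The sample words and scores are invented, the function names are hypothetical, and the sketch assumes that occurrences are counted per exact word string and that percentages are taken over every word in the article and rounded to whole numbers; the description above does not confirm any of these details.

```python
from collections import Counter

# Invented sample shaped like page.article; the scores are illustrative only.
article = [
    {"word": "the", "difficultyScore": 1, "category": "easy"},
    {"word": "ubiquitous", "difficultyScore": 62, "category": "hard"},
    {"word": "cat", "difficultyScore": 12, "category": "easy"},
    {"word": "the", "difficultyScore": 1, "category": "easy"},
    {"word": "gadget", "difficultyScore": 35, "category": "medium"},
]


def words_categorized(article):
    """Group distinct words by category, hardest first, with occurrence counts."""
    counts = Counter(item["word"] for item in article)
    seen = set()
    grouped = {"easy": [], "medium": [], "hard": []}
    for item in article:
        if item["word"] in seen:
            continue  # keep one entry per distinct word
        seen.add(item["word"])
        grouped[item["category"]].append({
            "word": item["word"],
            "difficultyScore": item["difficultyScore"],
            "occurrencesCount": counts[item["word"]],
        })
    for entries in grouped.values():
        entries.sort(key=lambda e: e["difficultyScore"], reverse=True)
    return grouped


def difficulty_categories(article):
    """Share of article words per category (assumption: over all words, rounded)."""
    totals = Counter(item["category"] for item in article)
    n = len(article)
    return {
        cat: {"percentage": round(100 * totals[cat] / n) if n else 0}
        for cat in ("easy", "medium", "hard")
    }
```

With the sample above, three of the five words are easy, so `difficulty_categories(article)["easy"]["percentage"]` is 60, and `words_categorized(article)["easy"]` lists "cat" before "the" because it has the higher difficulty score.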
`page.difficultyDistribution`
- array sorted by difficulty level in ascending order, showing what percentage of words belong to each difficulty level

`page.difficultyDistribution[].difficultyScore`
- number in the 1-100 range

`page.difficultyDistribution[].wordsPercentage`
- number - the percentage of words in the page content that have the given `page.difficultyDistribution[].difficultyScore`
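
As a rough illustration of the shape of `page.difficultyDistribution`, the Python sketch below builds such an array from a list of per-word difficulty scores. The function name and sample scores are hypothetical. The sketch also assumes that only scores actually present on the page get an entry and that percentages are rounded to two decimals, neither of which is stated in the description above.

```python
from collections import Counter


def difficulty_distribution(scores):
    """Return one entry per distinct score, sorted by score ascending."""
    if not scores:
        return []
    counts = Counter(scores)
    total = len(scores)
    return [
        {
            "difficultyScore": score,
            "wordsPercentage": round(100 * counts[score] / total, 2),
        }
        for score in sorted(counts)
    ]


# Invented per-word scores, e.g. collected from page.article[].difficultyScore.
sample_scores = [1, 1, 12, 35, 62, 1, 12, 35]
distribution = difficulty_distribution(sample_scores)
```

The same function could be applied to scores pooled from many pages to produce something comparable to `averages.difficultyDistribution`, although how Yarn aggregates across pages is not described here.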
####averages
`averages.difficultyCategories`
- same as `page.difficultyCategories`, but calculated across all pages already vocabulated by Yarn

`averages.difficultyDistribution`
- same as `page.difficultyDistribution`, but calculated across all pages already vocabulated by Yarn

####leaderboards
`leaderboards.daysBack`
- object representing the leaderboard for the last ‘x’ days (‘x’ can be sent as a request parameter: `leaderboard[daysBack]`)

`leaderboards.total`
- object representing the leaderboard of the top ‘y’ most difficult and easiest pages out of all pages vocabulated by Yarn (‘y’ can be sent as a request parameter: `leaderboard[count]`)

`leaderboard{daysBack/total}.difficult[]`
- array containing the top ‘y’ most difficult to read pages

`leaderboard{daysBack/total}.easy[]`
- array containing the top ‘y’ easiest to read pages

`leaderboard{daysBack/total}.current`
- object representing the current page (the page for the provided URL) in the context of the given `leaderboard.difficult` or `easy` array (marked with the `isCurrent` flag set to `true`)

The frequency with which a word occurs across a broad corpus of texts has been found to be a decent proxy
for word difficulty: the less often a word is used, the more difficult it’s likely to be.
We have built a 100k word difficulty index based on how often words occur across the 850 million words
of the Corpus of Contemporary American English, the Corpus of Historical American English,
the British National Corpus, and the Corpus of American Soap Operas (in other words, many and diverse texts).
###…One number
Using a logarithmic mapping to smooth out the raw data a little, we give each word in the Yarn
index a difficulty score on a scale of 1 to 100. Then we use a weighted average of the difficulty scores
of the words in the main content of each web page we index to get to the score you see.
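
The two steps described above can be sketched in Python. Neither the exact logarithmic mapping nor the weights are given in the text, so both are assumptions here: the sketch interpolates linearly in log space between a hypothetical most frequent word (score 1) and a word seen only once (score 100), and it weights each word by a caller-supplied weight such as its number of occurrences on the page. The function names are made up for illustration.

```python
import math


def difficulty_score(frequency, max_frequency):
    """Map a corpus frequency to 1-100; rarer words score higher (assumed mapping)."""
    if frequency < 1 or max_frequency <= 1:
        raise ValueError("frequency must be >= 1 and max_frequency > 1")
    frequency = min(frequency, max_frequency)
    # 0.0 for the most frequent word, 1.0 for a word seen exactly once.
    rarity = 1 - math.log(frequency) / math.log(max_frequency)
    return round(1 + 99 * rarity)


def page_score(scored_words):
    """Weighted average over (score, weight) pairs; the weights are an assumption."""
    total_weight = sum(weight for _, weight in scored_words)
    if total_weight == 0:
        raise ValueError("at least one word with a non-zero weight is required")
    return sum(score * weight for score, weight in scored_words) / total_weight
```

For example, `page_score([(10, 3), (50, 1)])` averages a word with score 10 that occurs three times and a word with score 50 that occurs once. The logarithm compresses the very large gap between common and rare word counts, which is one plausible reading of "smooth out the raw data a little".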
Yarn is not meant to be taken too seriously. After all, the difficulty of a text is clearly about a lot more
than the difficulty of its individual constituent words. We hope you find it thought-provoking and fun nonetheless.