README

API high-level features

What if you have a batch of texts, or just one long text, in any language, and none of the summarisation tools work for you?
Texts are fuzzy.

Perhaps you have been in that situation with texts from social media or news sources whose cleanliness you cannot always trust: they may contain URLs, hashtags, people's names, addresses and so on.

Perhaps you also wanted to sift through the texts with a filter: only nouns and verbs, to capture who did what.
Or only adjectives and nouns, to capture in what colours the texts describe arbitrary objects.
Or just the links from all the texts.

Now you can do all of that with one call to our SemanticCloud API.

Let's pick the following tweet, which contains hashtags and a URL and mixes two languages, Russian and English:
Я голосую за сильного президента, за сильную независимую Россию и за тех, кто привык спрашивать только с себя, а не винить в своей лени остальных! I vote for a strong president, for a strong Russia! #выборыпрезидента #RussiaElections2018 #ЯГолосую #ЗаПутина #Putin http://pic.twitter.com/zkY8axHqZA

And let's ask the system to output nouns, verbs, adverbs, adjectives, names and hyperlinks.

The top two words by count are strong and сильный (a translation pair):

{ "word": "strong", "stem": "strong", "partOfSpeech": "Unknown", "count": 2, "lemma": false, "keyword": false },
{ "word": "сильный", "stem": "сильный", "partOfSpeech": "Adjective", "count": 2, "lemma": true, "keyword": false }

But we also parsed the words out of hashtags:

{ "word": "яголосую", "stem": "яголос", "partOfSpeech": "Unknown", "count": 1, "lemma": false, "keyword": false },
{ "word": "выборыпрезидента", "stem": "выборыпрезидент", "partOfSpeech": "Unknown", "count": 1, "lemma": false, "keyword": false }

and a URL:

{ "word": "http://pic.twitter.com/zky8axhqza", "stem": "http://pic.twitter.com/zky8axhqza", "partOfSpeech": "Hyperlink", "count": 1, "lemma": false, "keyword": false }
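
Pulling only the links out of a response comes down to filtering on partOfSpeech. Here is a minimal Python sketch, assuming the response has already been decoded into a list of dictionaries shaped like the entries above. The helper name words_with_pos is our own illustration and not part of the API.

```python
from typing import Dict, List

# Entries copied from the examples above, already decoded from JSON.
entries: List[Dict] = [
    {"word": "strong", "stem": "strong", "partOfSpeech": "Unknown",
     "count": 2, "lemma": False, "keyword": False},
    {"word": "сильный", "stem": "сильный", "partOfSpeech": "Adjective",
     "count": 2, "lemma": True, "keyword": False},
    {"word": "яголосую", "stem": "яголос", "partOfSpeech": "Unknown",
     "count": 1, "lemma": False, "keyword": False},
    {"word": "http://pic.twitter.com/zky8axhqza",
     "stem": "http://pic.twitter.com/zky8axhqza", "partOfSpeech": "Hyperlink",
     "count": 1, "lemma": False, "keyword": False},
]


def words_with_pos(items: List[Dict], pos: str) -> List[str]:
    """Return the words whose partOfSpeech equals the given tag (hypothetical helper)."""
    return [item["word"] for item in items if item["partOfSpeech"] == pos]


links = words_with_pos(entries, "Hyperlink")
print(links)
```

The same helper works for any other tag, for example words_with_pos(entries, "Adjective").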

In addition, we can ask the API to return only the top N words (by frequency) along with lemmas (where applicable). More importantly, we can ask it to count a secret word that we are monitoring. Whether or not the secret word is present in the text, it is returned in the response:

{ "word": "петербург", "stem": "петербург", "partOfSpeech": "Noun", "count": 0, "lemma": true, "keyword": true }
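
Because a monitored word comes back even when it is absent, the client has to look at count to tell the two cases apart. Below is a small Python sketch of that check, assuming entries decoded from JSON and shaped like the API output. The function name keyword_counts is hypothetical.

```python
from typing import Dict, List

# Illustrative entries in the shape the API returns, already decoded from JSON.
entries: List[Dict] = [
    {"word": "метро", "stem": "метро", "partOfSpeech": "Noun",
     "count": 3, "lemma": True, "keyword": False},
    {"word": "петербург", "stem": "петербург", "partOfSpeech": "Noun",
     "count": 0, "lemma": True, "keyword": True},
]


def keyword_counts(items: List[Dict]) -> Dict[str, int]:
    """Map every monitored keyword to the number of times it occurred."""
    return {item["word"]: item["count"] for item in items if item["keyword"]}


counts = keyword_counts(entries)
# Keywords that were requested but never occurred in the text.
missing = [word for word, count in counts.items() if count == 0]
print(counts)   # {'петербург': 0}
print(missing)  # ['петербург']
```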

API end-points and parameters

End-point: /semcloud/v2/wordscloud/text

Payload:
{
  "text": "Прогулку по Васильевскому острову лучше всего начать со Стрелки — его восточной оконечности. Попасть на Стрелку можно двумя способами. Первый, с Петроградской стороны — пешком от станции метро «Спортивная» по проспекту Добролюбова и Биржевому мосту. Второй способ — через Дворцовый мост, пешком от метро «Адмиралтейская» или на троллейбусе или автобусе от любой из станций метро на Невском проспекте («Площадь Восстания», «Маяковская», «Невский проспект», «Гостиный двор»).",
  "pos": [ "Noun", "Verb" ],
  "keywords": [ "Петербург" ],
  "catsInCloud": 10,
  "filterSimilarAdjective": true
}

Parameters:
text contains the target text from which to extract the top semantic words.
pos is an array of POS tags to filter on. Complete list of supported tags: Noun, Adjective, Verb, Adverb, Name, Hyperlink, Unknown.
keywords is an array of words that the system should try its best to find. Each keyword is returned in the response with keyword set to true, and its count shows whether it was found.
catsInCloud limits the number of top words returned.
filterSimilarAdjective, when true, filters similar nouns and adjectives, keeping the one (noun or adjective) that has the highest frequency in the Russian language.
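
The parameters above can be assembled and sent roughly as follows. This is a sketch under stated assumptions, not a reference client: the HTTP method (POST with a JSON body), the base URL and any authentication headers are not specified in this README, so base_url and headers are left to the caller, and the function names are our own.

```python
import json
import urllib.request
from typing import Dict, List, Optional

# Tags listed as supported for the pos parameter.
SUPPORTED_POS = {"Noun", "Adjective", "Verb", "Adverb", "Name", "Hyperlink", "Unknown"}
ENDPOINT_PATH = "/semcloud/v2/wordscloud/text"


def build_payload(text: str, pos: List[str], keywords: List[str],
                  cats_in_cloud: int, filter_similar_adjective: bool) -> Dict:
    """Build the request body, rejecting POS tags that are not listed as supported."""
    unsupported = [tag for tag in pos if tag not in SUPPORTED_POS]
    if unsupported:
        raise ValueError(f"Unsupported POS tags: {unsupported}")
    return {
        "text": text,
        "pos": pos,
        "keywords": keywords,
        "catsInCloud": cats_in_cloud,
        "filterSimilarAdjective": filter_similar_adjective,
    }


def fetch_word_cloud(base_url: str, payload: Dict,
                     headers: Optional[Dict[str, str]] = None) -> List[Dict]:
    """Send the payload. ASSUMPTION: the endpoint accepts a JSON POST."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + ENDPOINT_PATH,
        data=body,
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(
    text="Прогулку по Васильевскому острову лучше всего начать со Стрелки.",
    pos=["Noun", "Verb"],
    keywords=["Петербург"],
    cats_in_cloud=10,
    filter_similar_adjective=True,
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Validating pos on the client side is a design choice of this sketch. It catches typos such as "Nouns" before a request is made.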

Response:
[
{ "word": "метро", "stem": "метро", "partOfSpeech": "Noun", "count": 3, "lemma": true, "keyword": false },
{ "word": "проспект", "stem": "проспект", "partOfSpeech": "Noun", "count": 3, "lemma": true, "keyword": false },
{ "word": "стрелок", "stem": "стрелок", "partOfSpeech": "Noun", "count": 2, "lemma": true, "keyword": false },
{ "word": "станция", "stem": "станция", "partOfSpeech": "Noun", "count": 2, "lemma": true, "keyword": false },
{ "word": "способ", "stem": "способ", "partOfSpeech": "Noun", "count": 2, "lemma": true, "keyword": false },
{ "word": "мост", "stem": "мост", "partOfSpeech": "Noun", "count": 2, "lemma": true, "keyword": false },
{ "word": "добролюбова", "stem": "добролюбов", "partOfSpeech": "Unknown", "count": 1, "lemma": false, "keyword": false },
{ "word": "сторона", "stem": "сторона", "partOfSpeech": "Noun", "count": 1, "lemma": true, "keyword": false },
{ "word": "оконечность", "stem": "оконечность", "partOfSpeech": "Noun", "count": 1, "lemma": true, "keyword": false },
{ "word": "попасть", "stem": "попасть", "partOfSpeech": "Verb", "count": 1, "lemma": true, "keyword": false },
{ "word": "петербург", "stem": "петербург", "partOfSpeech": "Noun", "count": 0, "lemma": true, "keyword": true }
]

The recognized part of speech tags are: NOUN, ADJECTIVE, VERB, ADVERB, CONJUNCTION, PARTICLE, INTERJECTION, PREPOSITION, PARENTHESIS, PREDICATE, PRONOUN_ADJECTIVE, NUMERAL, PRONOUN, COMPARISON, ADVERB_PREPOSITION, UNKNOWN.

Along with the POS tag, the following output values are provided:
word the surface form as found in the text
stem its base form
partOfSpeech detected part of speech
count number of times the word occurs in the text
lemma boolean flag. True if the word and its base form are the same.
keyword boolean flag. True if the word is taken from the request's keywords array. Check count to see whether it occurred in the text.
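
For clients that prefer typed access to these fields, one option is to load each entry into a small record. The following Python sketch assumes the response body is the JSON array shown earlier. The WordEntry class and parse_entries function are hypothetical names for illustration only.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WordEntry:
    """One element of the response array, with Python-style field names."""
    word: str
    stem: str
    part_of_speech: str
    count: int
    lemma: bool
    keyword: bool


def parse_entries(raw_json: str) -> List[WordEntry]:
    """Decode a response body into WordEntry records."""
    return [
        WordEntry(
            word=item["word"],
            stem=item["stem"],
            part_of_speech=item["partOfSpeech"],
            count=item["count"],
            lemma=item["lemma"],
            keyword=item["keyword"],
        )
        for item in json.loads(raw_json)
    ]


# A shortened response body, using entries from the example response.
raw = """[
  {"word": "мост", "stem": "мост", "partOfSpeech": "Noun", "count": 2, "lemma": true, "keyword": false},
  {"word": "метро", "stem": "метро", "partOfSpeech": "Noun", "count": 3, "lemma": true, "keyword": false},
  {"word": "петербург", "stem": "петербург", "partOfSpeech": "Noun", "count": 0, "lemma": true, "keyword": true}
]"""

parsed = parse_entries(raw)
# Rank the entries by frequency, most frequent first.
ranked = sorted(parsed, key=lambda entry: entry.count, reverse=True)
print([entry.word for entry in ranked])  # ['метро', 'мост', 'петербург']
```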

API creator: Insider (Rapid account)