What if you have one long text, or a whole batch of them, in any language, and none of the summarisation tools work for you?
Perhaps your texts come from social media or news sources, and you cannot trust how clean they are: they may contain URLs, hashtags, people's names, addresses and so on.
Perhaps you want to sift through the texts with a filter, for example keeping only nouns and verbs to capture who did what.
Or keeping only adjectives and nouns to capture in what colours the texts describe arbitrary objects.
Or you just want to pull the links out of all the texts.
Now you can do all of that with one call to our SemanticCloud API.
Let’s pick the following tweet, which contains hashtags, a URL and a mix of two languages, Russian and English:
Я голосую за сильного президента, за сильную независимую Россию и за тех, кто привык спрашивать только с себя, а не винить в своей лени остальных! I vote for a strong president, for a strong Russia! #выборыпрезидента #RussiaElections2018 #ЯГолосую #ЗаПутина #Putin http://pic.twitter.com/zkY8axHqZA
And let’s ask the system to output nouns, verbs, adverbs, adjectives, names and hyperlinks.
The two top words by count are strong and сильный (a translation pair):
{ "word": "strong", "stem": "strong", "partOfSpeech": "Unknown", "count": 2, "lemma": false, "keyword": false },
{ "word": "сильный", "stem": "сильный", "partOfSpeech": "Adjective", "count": 2, "lemma": true, "keyword": false }
But we also parsed the words out of hashtags:
{ "word": "яголосую", "stem": "яголос", "partOfSpeech": "Unknown", "count": 1, "lemma": false, "keyword": false },
{ "word": "выборыпрезидента", "stem": "выборыпрезидент", "partOfSpeech": "Unknown", "count": 1, "lemma": false, "keyword": false }
and a URL:
{ "word": "http://pic.twitter.com/zky8axhqza", "stem": "http://pic.twitter.com/zky8axhqza", "partOfSpeech": "Hyperlink", "count": 1, "lemma": false, "keyword": false }
In addition, we can ask the API to return only the top N words (by frequency), along with lemmas where applicable. More importantly, we can ask it to count a secret word that we are monitoring. Whether or not the secret word is present in the texts, it will be returned in the response:
{ "word": "петербург", "stem": "петербург", "partOfSpeech": "Noun", "count": 0, "lemma": true, "keyword": true }
Endpoint: /semcloud/v2/wordscloud/text
Payload:
{ "text": "Прогулку по Васильевскому острову лучше всего начать со Стрелки — его восточной оконечности. Попасть на Стрелку можно двумя способами. Первый, с Петроградской стороны — пешком от станции метро «Спортивная» по проспекту Добролюбова и Биржевому мосту. Второй способ — через Дворцовый мост, пешком от метро «Адмиралтейская» или на троллейбусе или автобусе от любой из станций метро на Невском проспекте («Площадь Восстания», «Маяковская», «Невский проспект», «Гостиный двор»).", "pos": [ "Noun", "Verb" ], "keywords": [ "Петербург" ], "catsInCloud": 10, "filterSimilarAdjective": true }
Parameters:
text
contains the target text to extract top semantic words from.
pos
is an array of POS tags to filter on. Complete list of supported tags: Noun, Adjective, Verb, Adverb, Name, Hyperlink, Unknown.
keywords
is an array of words that the system should try its best to find. Each keyword is marked in the response, whether or not it was found in the text.
catsInCloud
limits the number of top words (by frequency) returned.
filterSimilarAdjective
when true, reduces groups of similar nouns and adjectives to the one (noun or adjective) that has the highest frequency in the Russian language.
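The request above can also be assembled in code. The following is a minimal Python sketch, assuming the endpoint accepts a JSON body via POST. The base URL `https://api.example.com` and the helper names `build_payload` and `fetch_word_cloud` are hypothetical, and any authentication the service may require is not shown.

```python
import json
import urllib.request

# Hypothetical host: replace with the real SemanticCloud base URL.
BASE_URL = "https://api.example.com"
ENDPOINT = "/semcloud/v2/wordscloud/text"

# POS tags listed as supported by the `pos` parameter.
SUPPORTED_POS = {"Noun", "Adjective", "Verb", "Adverb", "Name", "Hyperlink", "Unknown"}


def build_payload(text, pos, keywords=(), cats_in_cloud=10, filter_similar_adjective=True):
    """Build the request body using the parameter names documented above."""
    unsupported = set(pos) - SUPPORTED_POS
    if unsupported:
        raise ValueError(f"Unsupported POS tags: {sorted(unsupported)}")
    return {
        "text": text,
        "pos": list(pos),
        "keywords": list(keywords),
        "catsInCloud": cats_in_cloud,
        "filterSimilarAdjective": filter_similar_adjective,
    }


def fetch_word_cloud(payload, base_url=BASE_URL):
    """POST the payload as JSON; the response is assumed to be a JSON array."""
    request = urllib.request.Request(
        base_url + ENDPOINT,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Build (but do not send) a request for nouns and verbs, monitoring one keyword.
payload = build_payload(
    "Попасть на Стрелку можно двумя способами.",
    pos=["Noun", "Verb"],
    keywords=["Петербург"],
)
```

Validating the POS tags on the client side is a design choice of this sketch, not something the API is known to require.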
Response:
[ { "word": "метро", "stem": "метро", "partOfSpeech": "Noun", "count": 3, "lemma": true, "keyword": false }, { "word": "проспект", "stem": "проспект", "partOfSpeech": "Noun", "count": 3, "lemma": true, "keyword": false }, { "word": "стрелок", "stem": "стрелок", "partOfSpeech": "Noun", "count": 2, "lemma": true, "keyword": false }, { "word": "станция", "stem": "станция", "partOfSpeech": "Noun", "count": 2, "lemma": true, "keyword": false }, { "word": "способ", "stem": "способ", "partOfSpeech": "Noun", "count": 2, "lemma": true, "keyword": false }, { "word": "мост", "stem": "мост", "partOfSpeech": "Noun", "count": 2, "lemma": true, "keyword": false }, { "word": "добролюбова", "stem": "добролюбов", "partOfSpeech": "Unknown", "count": 1, "lemma": false, "keyword": false }, { "word": "сторона", "stem": "сторона", "partOfSpeech": "Noun", "count": 1, "lemma": true, "keyword": false }, { "word": "оконечность", "stem": "оконечность", "partOfSpeech": "Noun", "count": 1, "lemma": true, "keyword": false }, { "word": "попасть", "stem": "попасть", "partOfSpeech": "Verb", "count": 1, "lemma": true, "keyword": false }, { "word": "петербург", "stem": "петербург", "partOfSpeech": "Noun", "count": 0, "lemma": true, "keyword": true } ]
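Once the response is decoded, the entries can be regrouped for the kind of filtering described earlier (for example nouns versus verbs). Here is a small Python sketch that works on a few entries copied from the response above; the helper name `group_by_pos` is made up for illustration.

```python
from collections import defaultdict

# A few entries copied from the response above (JSON booleans become Python booleans).
entries = [
    {"word": "метро", "stem": "метро", "partOfSpeech": "Noun", "count": 3, "lemma": True, "keyword": False},
    {"word": "добролюбова", "stem": "добролюбов", "partOfSpeech": "Unknown", "count": 1, "lemma": False, "keyword": False},
    {"word": "попасть", "stem": "попасть", "partOfSpeech": "Verb", "count": 1, "lemma": True, "keyword": False},
    {"word": "петербург", "stem": "петербург", "partOfSpeech": "Noun", "count": 0, "lemma": True, "keyword": True},
]


def group_by_pos(entries):
    """Group the words that actually occurred (count > 0) by their part of speech."""
    groups = defaultdict(list)
    for entry in entries:
        if entry["count"] > 0:
            groups[entry["partOfSpeech"]].append(entry["word"])
    return dict(groups)


grouped = group_by_pos(entries)
for pos, words in grouped.items():
    print(pos, words)
```

Entries with a count of 0 are skipped here because, in the sample response, they only appear for monitored keywords that were not found.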
The recognized part of speech tags are: NOUN, ADJECTIVE, VERB, ADVERB, CONJUNCTION, PARTICLE, INTERJECTION, PREPOSITION, PARENTHESIS, PREDICATE, PRONOUN_ADJECTIVE, NUMERAL, PRONOUN, COMPARISON, ADVERB_PREPOSITION, UNKNOWN.
Along with the POS tag, the following output values are provided:
word
the surface form as found in the text
stem
its base form
partOfSpeech
detected part of speech
count
number of times the word occurs in the text
lemma
boolean flag. True if the word and its base form are the same
keyword
boolean flag. True if the word is taken from the request’s keywords array. Check count to see whether it occurred in the text.
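The keyword and count fields can be combined to check on monitored words. The Python sketch below assumes the response has already been decoded into a list of dictionaries; the helper names `keyword_counts` and `missing_keywords` are hypothetical. Note that in the sample response the keyword comes back lowercased (петербург for the requested Петербург).

```python
# Two entries copied from the sample response: a regular word and a monitored keyword.
entries = [
    {"word": "метро", "stem": "метро", "partOfSpeech": "Noun", "count": 3, "lemma": True, "keyword": False},
    {"word": "петербург", "stem": "петербург", "partOfSpeech": "Noun", "count": 0, "lemma": True, "keyword": True},
]


def keyword_counts(entries):
    """Map each monitored keyword to its count; 0 means it did not occur in the text."""
    return {entry["word"]: entry["count"] for entry in entries if entry["keyword"]}


def missing_keywords(entries):
    """Return the monitored keywords that were not found in the text."""
    return [word for word, count in keyword_counts(entries).items() if count == 0]


counts = keyword_counts(entries)
missing = missing_keywords(entries)
```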