UrlDetector

Freemium
By integraatio | Updated 3 months ago | Text Analysis

UrlDetector Overview

Readme

A powerful, easy-to-integrate parsing engine that detects and extracts URLs from a body of text. It can parse plain text, HTML, JSON, XML and other formats, and it recognises multi-scheme IPv4, IPv6 and textual domains.

Depending on the options specified with each request, the parser can detect a wide variety of URL formats, such as:

HTML 5 Scheme - //www.linkedin.com
Usernames - user:pass@linkedin.com
Email - fred@linkedin.com
IPv4 Address - 192.168.1.1/hello.html
IPv4 Octets - 0x00.0x00.0x00.0x00
IPv4 Decimal - http://123123123123/
IPv6 Address - ftp://[::]/hello
IPv4-mapped IPv6 Address - http://[fe30:4:3:0:192.3.2.1]/

Note: The parser errs on the side of comprehensiveness and deliberately over-detects URLs, on the assumption that over-detection is generally better than under-detection. It does NOT filter by Top Level Domain, so it will return strings that look like URLs but are not (e.g. http://notrealurl.jpg). If your use case requires results limited to a subset of TLDs, we recommend that your application filters the response results locally.
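The local filtering suggested in the note above could look roughly like the sketch below. It assumes the detected URLs have already been pulled out of the response as plain strings; the allow-list, the function name and the sample data are hypothetical and are not part of the API.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; choose the TLDs that matter for your use case.
ALLOWED_TLDS = {"com", "org", "net"}


def has_allowed_tld(url, allowed=ALLOWED_TLDS):
    """Return True if the URL's host ends in one of the allowed TLDs."""
    # urlsplit only finds a host after "//", so add it for scheme-less input.
    candidate = url if "//" in url else "//" + url
    host = urlsplit(candidate).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    # Require at least one dot so bare words are not treated as a TLD match.
    return "." in host and tld in allowed


# Sample strings standing in for URLs taken from a detection response.
detected = [
    "http://notrealurl.jpg",
    "http://www.linkedin.com/in/someone",
    "linkedin.com/jobs",
]
kept = [u for u in detected if has_allowed_tld(u)]
```

This simple check only looks at the last label of the host, so multi-part suffixes such as co.uk would need a fuller suffix list.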

Note also that the parser does not aim for strict compliance with RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt). Instead, it detects URLs based on browser behaviour, optimising for URLs that can be visited through the address bar of Chrome, Firefox, Internet Explorer and Safari.

The parser returns every recognised part of each URL. For example, the URL http://user@linkedin.com:39000/hello?boo=ff#frag is broken down as follows:

Scheme - "http"
Username - "user"
Password - null
Host - "linkedin.com"
Port - 39000
Path - "/hello"
Query - "?boo=ff"
Fragment - "#frag"
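For comparison, the same URL can be split with Python's standard library. This is not the UrlDetector engine, only a familiar reference point: the field names line up closely, but urlsplit returns the query and fragment without their leading "?" and "#", whereas the breakdown above includes them.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://user@linkedin.com:39000/hello?boo=ff#frag")

# Collect the components under names that mirror the breakdown above.
breakdown = {
    "scheme": parts.scheme,      # "http"
    "username": parts.username,  # "user"
    "password": parts.password,  # None, no password in the URL
    "host": parts.hostname,      # "linkedin.com"
    "port": parts.port,          # 39000, as an int
    "path": parts.path,          # "/hello"
    "query": parts.query,        # "boo=ff", without the "?"
    "fragment": parts.fragment,  # "frag", without the "#"
}
```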

Key Points:

  • Every request to parse text requires the input text and an array of one or more URL detection options, which specify the rules and filters applied during URL extraction.
  • URL detection options are applied in a bitwise manner. For example, an HTML page source could be uploaded and scanned with the ‘Default’, ‘HTML’ and ‘Javascript’ options enabled, which may be a good combination for extracting URLs from the page content as well as from any script included in the page.
  • The best combination of detection options depends on your use case, so some experimentation may be beneficial.
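The bitwise combination of options described above can be pictured with a small flag type. This is only an illustration of the idea: the option names are taken from the text, but the numeric values and the helper function are made up, and the real values should be read from the list-options endpoint.

```python
from enum import IntFlag


class DetectOption(IntFlag):
    # Names come from the text above; the numeric values are hypothetical.
    DEFAULT = 1
    HTML = 2
    JAVASCRIPT = 4


# Combining options is a bitwise OR of the individual flags.
combined = DetectOption.DEFAULT | DetectOption.HTML | DetectOption.JAVASCRIPT


def option_names(flags):
    """List the names of the individual options set in a combined value."""
    return [opt.name for opt in DetectOption if opt & flags]
```

Here option_names(combined) gives back the three individual names, which is convenient if the request expects an array of options rather than a single combined number.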

Endpoint notes

The detection options passed with a request must match the values returned by the following endpoint:

/urls/list-options

Returns a list of all available URL detection options that can be passed with a URL detection request. At least one value must be passed, and multiple values are combined bitwise.

/urls/detect

POST a body of text, along with the required URL detection options, and receive back a list of the matching URLs detected in the input.
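A detection request might be assembled along the lines below. This readme does not give the host, the authentication scheme or the payload field names, so the base URL, the "text" and "options" keys and the Authorization header are all placeholders to be replaced with the values from the API's own endpoint documentation. The sketch builds the request without sending it.

```python
import json
from urllib.request import Request

# Placeholder base URL: the real host is not given in this readme.
BASE_URL = "https://urldetector.example.com"


def build_detect_request(text, options, api_key):
    """Build (but do not send) a POST request for /urls/detect."""
    # "text" and "options" are assumed field names, not confirmed by the readme.
    payload = json.dumps({"text": text, "options": options}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,  # assumed auth scheme
    }
    return Request(
        BASE_URL + "/urls/detect",
        data=payload,
        headers=headers,
        method="POST",
    )


req = build_detect_request(
    "Visit //www.linkedin.com today",
    ["Default", "HTML"],
    "YOUR_API_KEY",
)
```

Once the placeholders are replaced with real values, the request can be sent with urllib.request.urlopen(req) and the response body decoded as needed.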
