Better-than-regex, comprehensive and flexible detection of URLs in text, HTML, JSON and other formats. Capable of parsing many schemes, domain formats and input text formats. Need higher usage, custom integration or additional features? Contact us for custom pricing.
A powerful, easy-to-integrate parsing engine that detects and extracts URLs from a body of text. It can parse text, HTML, JSON, XML and similar input, and recognises multiple schemes as well as IPv4, IPv6 and textual domains.
Depending on the options specified with each request, the parser can detect many URL formats, such as:
HTML 5 Scheme - //www.linkedin.com
Usernames - user:pass@linkedin.com
Email - fred@linkedin.com
IPv4 Address - 192.168.1.1/hello.html
IPv4 Octets - 0x00.0x00.0x00.0x00
IPv4 Decimal - http://123123123123/
IPv6 Address - ftp://[::]/hello
IPv4-mapped IPv6 Address - http://[fe30:4:3:0:192.3.2.1]/
Note: The parser errs on the side of caution and over-detects URLs for comprehensiveness, on the assumption that it is generally better to over-detect than to under-detect. It does NOT filter by Top Level Domain, so it will return strings that look like URLs but are not (e.g. http://notrealurl.jpg). If your use case requires that valid results be limited to some subset of TLDs, we recommend that your application filter the Response results locally.
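The local filtering suggested above could look something like the following Python sketch. The Response format is not described here, so the shape of each result (a dict with a "host" key), the allow-list and both function names are assumptions made purely for illustration.

```python
from typing import Iterable

# Hypothetical allow-list; replace with the TLDs your application accepts.
ALLOWED_TLDS = {"com", "org", "net"}


def has_allowed_tld(host: str, allowed: Iterable[str] = ALLOWED_TLDS) -> bool:
    """Return True when the last dot-separated label of host is in allowed."""
    labels = host.rstrip(".").lower().split(".")
    # A host with a single label (e.g. "localhost") has no TLD to check.
    return len(labels) > 1 and labels[-1] in set(allowed)


def filter_by_tld(results, allowed: Iterable[str] = ALLOWED_TLDS):
    """Keep only results whose assumed "host" field ends in an allowed TLD."""
    return [r for r in results if has_allowed_tld(r["host"], allowed)]


# Assumed result shape: one dict per detected URL, with a "host" key.
detected = [
    {"host": "linkedin.com"},
    {"host": "notrealurl.jpg"},
]
print(filter_by_tld(detected))  # [{'host': 'linkedin.com'}]
```

A check this simple also drops IP address hosts such as 192.168.1.1, so an application that wants to keep those would need to handle them separately.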
Note also that, instead of complying with RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), the parser detects URLs based on browser behaviour, optimising for URLs that can be visited through the address bar of Chrome, Firefox, Internet Explorer, and Safari.
The parser returns all recognised parts of each URL. For example, the URL http://user@linkedin.com:39000/hello?boo=ff#frag is broken down as:
Scheme - "http"
Username - "user"
Password - null
Host - "linkedin.com"
Port - 39000
Path - "/hello"
Query - "?boo=ff"
Fragment - "#frag"
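For comparison, the same breakdown can be approximated locally with Python's standard library. This is only a sketch using urllib.parse.urlsplit, which follows the RFC-style rules rather than the browser-oriented detection described above, so it is not the service's parser. The helper name split_parts and the dictionary keys are assumptions chosen to mirror the field names in the list.

```python
from urllib.parse import urlsplit


def split_parts(url: str) -> dict:
    """Split a URL into the parts listed above, using the standard library."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,  # None when absent
        "password": parts.password,  # None when absent
        "host": parts.hostname,
        "port": parts.port,          # int, or None when absent
        "path": parts.path,
        # urlsplit strips the "?" and "#" delimiters; re-add them to match
        # the form shown in the list above.
        "query": "?" + parts.query if parts.query else None,
        "fragment": "#" + parts.fragment if parts.fragment else None,
    }


example = split_parts("http://user@linkedin.com:39000/hello?boo=ff#frag")
for name, value in example.items():
    print(f"{name} - {value!r}")
```

Inputs without a scheme, such as user:pass@linkedin.com, are not split this way by urlsplit, which is one reason a browser-oriented detector is useful for free text.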