Universal Web Scraper

API Creator: Matthew Dvertola (softservesoftware)

README

Welcome to Universal Web Scraper

Our goal is to provide a lightweight, flexible web crawler that takes any URL and returns a predefined JSON object populated with data from that page. Under the hood, we open the page with Puppeteer, extract its HTML source, remove extraneous data, and give GPT the remaining body text along with prompts so that it returns a formatted JSON object describing the page.
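To make the "remove extraneous data" step concrete, here is a minimal Python sketch of one way such cleanup could work. It is an illustration only, not the service's actual implementation (which runs on Puppeteer); the class name, function name, and the set of skipped tags are assumptions.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping tags whose content is not readable page text."""

    # Assumption: these tags count as "extraneous" for the purpose of this sketch.
    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_extraneous_html(html: str) -> str:
    """Return the readable text of an HTML document as a single string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Reducing the page to plain text before prompting keeps the input smaller, which matters because the model has a limited token budget.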

Wikipedia Example

Below, we give the system the Wikipedia URL for Paris, along with details about the information we would like to extract from the page.

Example Request:

{
    "target_url" : "https://en.wikipedia.org/wiki/Paris",
    "keys_list" : [
        {
            "key_to_extract": "country",
            "description_of_key": "country of the city provided"
        },
        {
            "key_to_extract": "population",
            "description_of_key": "total population of provided city"
        },
        {
            "key_to_extract": "temperature_summary",
            "description_of_key": "provide a summary of the weather over the course of a year"
        },
        {
            "key_to_extract": "city_president", <- example of a data point not in webpage
            "description_of_key": "president of the city"
        }
    ],
    "openai_key" : "sk-...", (optional)
    "openai_org" : "org-...", (optional)
    "model_name" : "gpt-4-0613", (optional; must support json object response format)
    "model_temperature" : 0.7, (optional)
    "model_max_tokens" : 2048, (optional)
    "model_top_p" : 1, (optional)
    "model_frequency_penalty" : 0, (optional)
    "model_presence_penalty" : 0, (optional)
    "system_prompt" : "You are an intelligent html (web page) parser. Please find the provided values in the webpage for the keys you were given",  (optional)
    "user_prompt" : "" (optional)
}
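Note that the inline annotations above, such as "(optional)", are explanatory and must not be sent in a real request. As a rough sketch, a client could assemble a valid body from a plain mapping of keys to descriptions. The helper name `build_request_body` is hypothetical; only the field names come from the example above.

```python
import json


def build_request_body(target_url, keys, **optional_fields):
    """Build a request body from a {key: description} mapping.

    optional_fields may carry any of the optional settings shown above
    (for example model_temperature); entries set to None are left out.
    """
    body = {
        "target_url": target_url,
        "keys_list": [
            {"key_to_extract": key, "description_of_key": description}
            for key, description in keys.items()
        ],
    }
    # Only include optional fields that were explicitly given a value.
    body.update(
        {name: value for name, value in optional_fields.items() if value is not None}
    )
    return body


if __name__ == "__main__":
    example = build_request_body(
        "https://en.wikipedia.org/wiki/Paris",
        {
            "country": "country of the city provided",
            "population": "total population of provided city",
        },
        model_temperature=0.7,
    )
    print(json.dumps(example, indent=4))
```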

Example Response:

{
    "country": "France",
    "population": 2102650,
    "temperature_summary": "In Paris, the weather varies throughout the year, from chilly and damp winters to warm and sunny summers, interspersed with mild springs and autumns that bring a mix of rain and sunshine.",
    "city_president": ""
}

This gives us a standardized output for the provided webpage.

Note that when a data point is not available, such as the non-existent president of Paris, we instruct GPT to return an empty string for that field to avoid hallucinations.
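Because missing values come back as empty strings, a client may want to separate the fields that were found from those that were not. A small sketch, where the function name `split_missing_fields` is an assumption made for illustration:

```python
def split_missing_fields(processed_data):
    """Split a response into found values and the keys that came back empty."""
    found = {key: value for key, value in processed_data.items() if value != ""}
    missing = [key for key, value in processed_data.items() if value == ""]
    return found, missing


if __name__ == "__main__":
    response = {
        "country": "France",
        "population": 2102650,
        "city_president": "",
    }
    found, missing = split_missing_fields(response)
    print("found:", found)
    print("missing:", missing)
```

Comparing against the empty string, rather than testing truthiness, keeps legitimate values such as `0` from being treated as missing.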

Usage

Because GPT generation and analysis is currently asynchronous, we have created a simple polling system with two endpoints: one to initiate the request and one to poll its status.

Initiate a request:

POST - /parser

Body:

{
    "target_url" : "https://en.wikipedia.org/wiki/Paris",
    "keys_list" : [
        {
            "key_to_extract": "country",
            "description_of_key": "country of the city provided"
        },
        {
            "key_to_extract": "population",
            "description_of_key": "total population of provided city"
        },
        {
            "key_to_extract": "temperature_summary",
            "description_of_key": "provide a summary of the weather over the course of a year"
        },
        {
            "key_to_extract": "city_president", <- example of a data point not in webpage
            "description_of_key": "president of the city of paris"
        }
    ]
}

Response:

{
    id: "####-####-####-####",
    timestamp: 123456,
    ...,
    status: "Collecting HTML"
}
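As an illustration, the initiating call could be made from Python's standard library roughly as follows. The base URL is a hypothetical placeholder, and any authentication headers your API gateway requires are not covered by this README, so they are passed in through an assumed `extra_headers` argument.

```python
import json
import urllib.request

# Hypothetical placeholder; replace with the real host of the API.
BASE_URL = "https://universal-web-scraper.example.com"


def build_parser_request(body, base_url=BASE_URL, extra_headers=None):
    """Prepare (but do not send) the POST /parser request."""
    headers = {"Content-Type": "application/json"}
    headers.update(extra_headers or {})  # e.g. gateway authentication headers
    return urllib.request.Request(
        url=f"{base_url}/parser",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def submit(request):
    """Send the request and decode the JSON response (id, timestamp, status, ...)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The `id` and `timestamp` values in the decoded response are what the polling endpoint expects next.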

Polling requests

GET - /parser/:id/:timestamp

Response:

{
    id: "####-####-####-####",
    timestamp: 123456,
    ...,
    status: "Collecting HTML" | "Standardizing JSON" | "Complete",
    processed_data: {
        "country": "France",
        "population": 2102650,
        "temperature_summary": "In Paris, the weather varies throughout the year, from chilly and damp winters to warm and sunny summers, interspersed with mild springs and autumns that bring a mix of rain and sunshine.",
        "city_president": ""
    }
}
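A polling loop over this endpoint might look like the sketch below. The interval, the attempt limit, and the function names are assumptions, as is the idea that `processed_data` should only be read once the status is "Complete". The HTTP call is injected as `fetch_status`, so the loop stays independent of any particular client library.

```python
import time


def status_path(job_id, timestamp):
    """Path of the polling endpoint for a given request."""
    return f"/parser/{job_id}/{timestamp}"


def poll_until_complete(
    fetch_status,
    job_id,
    timestamp,
    interval_seconds=2.0,  # assumed value; tune to your needs
    max_attempts=30,       # assumed value; tune to your needs
    sleep=time.sleep,
):
    """Call fetch_status(job_id, timestamp) until the status is "Complete".

    fetch_status must return the decoded JSON response as a dict.
    Returns processed_data, or raises TimeoutError after max_attempts.
    """
    for attempt in range(max_attempts):
        job = fetch_status(job_id, timestamp)
        if job.get("status") == "Complete":
            return job.get("processed_data", {})
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"Request {job_id} did not complete after {max_attempts} attempts")
```

In practice, `fetch_status` would issue a GET to `status_path(job_id, timestamp)` on the API host and return the parsed JSON.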