Universal Web Scraper

Standardize any webpage into JSON

Extracting and analyzing information from web pages is central to data-driven decision-making. The Universal Web Scraper simplifies this by converting webpage content into a standardized JSON format. This tutorial walks through how to use it.

Introduction to Universal Web Scraper

The Universal Web Scraper is designed to be lightweight and flexible: it takes any URL and returns a JSON object, with keys you define in advance, filled with data from the webpage. It uses Puppeteer to open the webpage and extract its HTML, then uses GPT to format that information into structured data.

Getting Started

Before you begin, make sure you have the scraper endpoint and, optionally, your OpenAI credentials to enhance data processing with AI models.

Example Setup

Here’s how to structure a request to extract data from a Wikipedia page about Paris, targeting specific pieces of information like the country, population, and weather:

Prepare the Request

  • Build a JSON payload containing the target URL and the list of keys to extract:
    {
        "target_url": "https://en.wikipedia.org/wiki/Paris",
        "keys_list": [
            { "key_to_extract": "country", "description_of_key": "country of the city provided" },
            { "key_to_extract": "population", "description_of_key": "total population of provided city" },
            { "key_to_extract": "temperature_summary", "description_of_key": "summary of yearly weather" },
            { "key_to_extract": "city_president", "description_of_key": "president of the city of paris" }
        ]
    }
    

Initiate the Request

  • Send a POST request to baseUrl/parser, where baseUrl is your scraper endpoint, with the JSON payload you’ve prepared as the request body.
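
As a rough sketch of the two steps above, the payload and the POST request could be assembled in Python as follows. The helper names are hypothetical, the base URL is a placeholder for your scraper endpoint, and any authentication headers your endpoint requires are not described in this tutorial, so they are left as an optional argument.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def build_keys_list(fields: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn {key: description} pairs into the keys_list format shown above."""
    return [
        {"key_to_extract": key, "description_of_key": description}
        for key, description in fields.items()
    ]


def build_parser_request(
    base_url: str,
    target_url: str,
    fields: Dict[str, str],
    extra_headers: Optional[Dict[str, str]] = None,  # assumption: auth headers, if any
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request to baseUrl/parser."""
    payload = {"target_url": target_url, "keys_list": build_keys_list(fields)}
    headers = {"Content-Type": "application/json"}
    headers.update(extra_headers or {})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/parser",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder endpoint; replace with your own scraper base URL.
request = build_parser_request(
    "https://example.invalid",
    "https://en.wikipedia.org/wiki/Paris",
    {
        "country": "country of the city provided",
        "population": "total population of provided city",
    },
)
```

Passing the prepared request to urllib.request.urlopen and decoding the body with json.load would then send it and return the initial response as a dictionary.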

Example Response

  • You will receive an initial response like this:
    {
        "id": "####-####-####-####",
        "timestamp": 123456,
        "status": "Collecting HTML"
    }
    

Polling for Results

To check the status of your request and eventually retrieve the standardized JSON data, poll the following endpoint, substituting the id and timestamp from the initial response:

  • GET baseUrl/parser/:id/:timestamp
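
A minimal polling loop might look like the sketch below. It assumes the job is finished when the status field reads "Complete", as in the Final Response example in this tutorial. The function names, the two-second interval, and the attempt limit are illustrative choices, not documented values.

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def http_get_json(url: str) -> Dict:
    """Fetch a URL with GET and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def poll_until_complete(
    base_url: str,
    job: Dict,  # the initial response, containing "id" and "timestamp"
    fetch: Callable[[str], Dict] = http_get_json,
    interval_s: float = 2.0,   # assumption: not a documented interval
    max_attempts: int = 30,    # assumption: not a documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll baseUrl/parser/:id/:timestamp until the status is "Complete"."""
    url = f"{base_url.rstrip('/')}/parser/{job['id']}/{job['timestamp']}"
    for attempt in range(max_attempts):
        result = fetch(url)
        if result.get("status") == "Complete":
            return result
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"Job {job['id']} not complete after {max_attempts} attempts")
```

The fetch and sleep parameters are injectable so the loop can be exercised without network access or real delays.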

Final Response

When the process is complete, you’ll receive a response containing the extracted data in JSON format:

{
    "id": "####-####-####-####",
    "timestamp": 123456,
    "status": "Complete",
    "processed_data": {
        "country": "France",
        "population": 2102650,
        "temperature_summary": "In Paris, the weather varies from chilly winters to sunny summers, with mild springs and autumns.",
        "city_president": ""
    }
}

Note that if a requested value (like “city_president” in our example) is not available on the webpage, the scraper returns an empty string for that key instead of a hallucinated value.
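
Since empty strings mark values the scraper could not find, a caller may want to separate them from real results before using the data. The helper below is a hypothetical sketch of that step, applied to the processed_data object from the example response.

```python
from typing import Dict, List, Tuple


def split_missing(processed_data: Dict) -> Tuple[Dict, List[str]]:
    """Separate extracted values from keys that came back as empty strings."""
    found = {key: value for key, value in processed_data.items() if value != ""}
    missing = [key for key, value in processed_data.items() if value == ""]
    return found, missing


found, missing = split_missing(
    {
        "country": "France",
        "population": 2102650,
        "temperature_summary": "In Paris, the weather varies from chilly winters "
        "to sunny summers, with mild springs and autumns.",
        "city_president": "",
    }
)
# missing is ["city_president"]; found holds the other three keys.
```

Comparing against the empty string rather than testing truthiness keeps legitimate values such as 0 in the found set.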

Conclusion

The Universal Web Scraper simplifies converting webpage data into structured JSON, making it a useful tool for data analysts, developers, and content managers. Whether for market research, competitive analysis, or content curation, it reduces the process to a single request and a polling loop, and it returns empty values rather than guesses when data is missing.