In the era of data-driven decision-making, extracting and analyzing information from web pages is crucial. The Universal Web Scraper aims to make this process seamless by converting webpage content into a standardized JSON format. Let’s dive into how to use this powerful tool.
The Universal Web Scraper is designed to be lightweight and flexible, capable of taking any URL and returning a predefined JSON object with the data from the webpage. It uses Puppeteer to open webpages, extract HTML, and then employs GPT to format this information into structured data.
Before you begin, make sure you have the scraper endpoint and, optionally, your OpenAI credentials, which enhance data processing with AI models.
Here’s how to structure a request to extract data from a Wikipedia page about Paris, targeting specific pieces of information like the country, population, and weather:
{
  "target_url": "https://en.wikipedia.org/wiki/Paris",
  "keys_list": [
    { "key_to_extract": "country", "description_of_key": "country of the city provided" },
    { "key_to_extract": "population", "description_of_key": "total population of provided city" },
    { "key_to_extract": "temperature_summary", "description_of_key": "summary of yearly weather" },
    { "key_to_extract": "city_president", "description_of_key": "president of the city of paris" }
  ]
}
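A payload in this shape can also be assembled programmatically. The following is a minimal Python sketch: the helper name `build_payload` is hypothetical, and only the field names (`target_url`, `keys_list`, `key_to_extract`, `description_of_key`) come from the example request.

```python
import json


def build_payload(target_url, keys):
    """Build a request body in the shape of the example request.

    `keys` maps each key to extract to its plain-language description.
    (Hypothetical helper; only the field names come from the example.)
    """
    return {
        "target_url": target_url,
        "keys_list": [
            {"key_to_extract": name, "description_of_key": description}
            for name, description in keys.items()
        ],
    }


payload = build_payload(
    "https://en.wikipedia.org/wiki/Paris",
    {
        "country": "country of the city provided",
        "population": "total population of provided city",
    },
)
print(json.dumps(payload, indent=2))
```

Keeping the descriptions in a plain dictionary makes it easy to add or remove keys without hand-editing nested JSON.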
Send your request to:
baseUrl/parser
with the JSON payload you’ve prepared. The scraper responds with a job ID, a timestamp, and the current status:
{
  "id": "####-####-####-####",
  "timestamp": 123456,
  "status": "Collecting HTML"
}
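As an illustration, the request could be prepared with Python's standard library roughly as follows. This is a sketch under stated assumptions: the base URL `https://scraper.example.com` is a placeholder, `make_parser_request` is a hypothetical helper, and the use of HTTP POST with a `Content-Type: application/json` header is assumed from the fact that the request carries a JSON payload.

```python
import json
import urllib.request

# Assumption: replace with the base URL of your scraper deployment.
BASE_URL = "https://scraper.example.com"


def make_parser_request(base_url, payload):
    """Prepare (but do not send) a request to baseUrl/parser.

    Assumption: the endpoint accepts an HTTP POST with a JSON body.
    """
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/parser",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = make_parser_request(
    BASE_URL,
    {
        "target_url": "https://en.wikipedia.org/wiki/Paris",
        "keys_list": [
            {"key_to_extract": "country",
             "description_of_key": "country of the city provided"},
        ],
    },
)
print(request.get_method(), request.full_url)

# To send it against a live deployment:
# with urllib.request.urlopen(request) as response:
#     job = json.loads(response.read())
#     print(job["id"], job["timestamp"], job["status"])
```

The network call is left commented out so the snippet runs without a live endpoint; the `id` and `timestamp` fields it would read are the ones shown in the response above.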
To check the status of your request and retrieve the standardized JSON data once it is ready, poll the following endpoint, substituting the id and timestamp you received:
baseUrl/parser/:id/:timestamp
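A polling loop for this endpoint might look like the sketch below. The function names, the polling interval, and the attempt limit are illustrative choices, not part of the scraper's documented behavior; the `"Complete"` status value is taken from the example response in this article. The HTTP call is passed in as a callable so the loop can be tried without a live endpoint.

```python
import time


def status_url(base_url, job_id, timestamp):
    """Fill in the baseUrl/parser/:id/:timestamp pattern."""
    return f"{base_url.rstrip('/')}/parser/{job_id}/{timestamp}"


def poll_until_complete(fetch_status, url, interval_s=2.0, max_attempts=30):
    """Call fetch_status(url) until the job reports status "Complete".

    `fetch_status` is any callable returning the decoded JSON response
    as a dict, which keeps the loop independent of the HTTP client used.
    """
    for attempt in range(max_attempts):
        result = fetch_status(url)
        if result.get("status") == "Complete":
            return result
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(f"job not complete after {max_attempts} attempts")


# Demonstration with canned responses standing in for the real HTTP call.
_responses = iter([
    {"status": "Collecting HTML"},
    {"status": "Complete", "processed_data": {"country": "France"}},
])
url = status_url("https://scraper.example.com", "abc", 123456)
result = poll_until_complete(lambda _url: next(_responses), url, interval_s=0)
print(result["processed_data"])
```

Capping the number of attempts avoids polling forever if a job never finishes.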
When the process is complete, you’ll receive a response containing the extracted data in JSON format:
{
  "id": "####-####-####-####",
  "timestamp": 123456,
  "status": "Complete",
  "processed_data": {
    "country": "France",
    "population": 2102650,
    "temperature_summary": "In Paris, the weather varies from chilly winters to sunny summers, with mild springs and autumns.",
    "city_president": ""
  }
}
Note that if certain data (like “city_president” in our example) is not available on the webpage, the scraper returns an empty string for that key rather than inventing a value, which guards against hallucinated data.
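Because missing values come back as empty strings, downstream code may want to separate them from real results. The following is a small Python sketch; `split_missing` is a hypothetical helper, and the sample dictionary mirrors part of the `processed_data` object from the example response.

```python
def split_missing(processed_data):
    """Separate extracted values from keys the scraper left empty."""
    found = {k: v for k, v in processed_data.items() if v != ""}
    missing = [k for k, v in processed_data.items() if v == ""]
    return found, missing


processed_data = {
    "country": "France",
    "population": 2102650,
    "city_president": "",
}
found, missing = split_missing(processed_data)
print(found)
print(missing)
```

Comparing against the empty string explicitly, rather than testing truthiness, keeps legitimate values such as `0` from being treated as missing.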
With the Universal Web Scraper, converting webpage data into structured JSON is simplified, making it a valuable tool for data analysts, developers, and content managers. Whether it’s for market research, competitive analysis, or just curating content, this tool stands out for its ability to streamline the process and deliver reliable data.