Our goal is to create a lightweight, flexible web crawler that takes any URL and returns a predefined JSON object populated with data from the page. Under the hood, we open the page with Puppeteer, extract its HTML source, remove extraneous data, and give GPT the remaining body text along with prompts that tell it to return a formatted JSON object describing the page.
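To make the "remove extraneous data" step concrete, here is a minimal sketch of one way to reduce raw HTML to visible text before it is sent to GPT. It is an illustration in Python using only the standard library, not the project's actual implementation (which runs on Puppeteer). The names `SKIPPED_TAGS`, `VisibleTextExtractor`, and `strip_html` are hypothetical, and the choice of which tags count as noise is an assumption.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as noise in this sketch (an assumption,
# not a list taken from the project).
SKIPPED_TAGS = {"script", "style", "noscript", "svg", "head"}


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse every run of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML document as a single line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Sending only the visible text keeps the prompt small, which matters because a full Wikipedia page can easily exceed a model's context window once markup, scripts, and styles are included.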
Below, we give the system the Wikipedia URL for Paris along with the keys we would like to extract from the page. The request is shown first, followed by the JSON object the system returns.
{
"target_url" : "https://en.wikipedia.org/wiki/Paris",
"keys_list" : [
{
"key_to_extract": "country",
"description_of_key": "country of the city provided"
},
{
"key_to_extract": "population",
"description_of_key": "total population of provided city"
},
{
"key_to_extract": "temperature_summary",
"description_of_key": "provide a summary of the weather over the course of a year"
},
{
"key_to_extract": "city_president", <- example of a data point not in webpage
"description_of_key": "president of the city"
}
],
"openai_key" : "sk-...", (optional)
"openai_org" : "org-...", (optional)
"model_name" : "gpt-4-0613", (optional; must support JSON object response format)
"model_temperature" : 0.7, (optional)
"model_max_tokens" : 2048, (optional)
"model_top_p" : 1, (optional)
"model_frequency_penalty" : 0, (optional)
"model_presence_penalty" : 0, (optional)
"system_prompt" : "You are an intelligent html (web page) parser. Please find the provided values in the webpage for the keys you were given", (optional)
"user_prompt" : "" (optional)
}
{
"country": "France",
"population": 2102650,
"temperature_summary": "In Paris, the weather varies throughout the year, from chilly and damp winters to warm and sunny summers, interspersed with mild springs and autumns that bring a mix of rain and sunshine.",
"city_president": ""
}
The result is a standardized JSON object for the provided webpage, with one field per requested key.
Note that when a data point is not available, such as the non-existent president of Paris, we instruct GPT to return the field as an empty string rather than guess, which reduces the risk of hallucinated values.
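The empty-string convention can be sketched from both sides: the instruction given to the model, and a defensive check on what comes back. The following Python is only an illustration under assumptions. The function names `build_user_prompt` and `normalize_output` are hypothetical, and the prompt wording is invented for this example rather than taken from the project.

```python
import json


def build_user_prompt(keys_list):
    """Turn keys_list into instructions, including the empty-string rule."""
    lines = ["Return a JSON object with exactly these keys:"]
    for item in keys_list:
        lines.append(f'- "{item["key_to_extract"]}": {item["description_of_key"]}')
    lines.append('If a value is not present in the page, use an empty string ("").')
    return "\n".join(lines)


def normalize_output(keys_list, raw_response):
    """Keep only the requested keys and fill anything missing with ""."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    result = {}
    for item in keys_list:
        key = item["key_to_extract"]
        value = parsed.get(key, "")
        # Treat an explicit null the same way as a missing key.
        result[key] = "" if value is None else value
    return result
```

Normalizing after the fact means callers always receive the same set of keys, even if the model omits a field, adds an extra one, or returns something that is not valid JSON.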
Because GPT generation and analysis run asynchronously, we have created a simple polling system with two endpoints: one to initiate the request and one to poll its status.
POST - /parser
Body:
{
"target_url" : "https://en.wikipedia.org/wiki/Paris",
"keys_list" : [
{
"key_to_extract": "country",
"description_of_key": "country of the city provided"
},
{
"key_to_extract": "population",
"description_of_key": "total population of provided city"
},
{
"key_to_extract": "temperature_summary",
"description_of_key": "provide a summary of the weather over the course of a year"
},
{
"key_to_extract": "city_president", <- example of a data point not in webpage
"description_of_key": "president of the city of paris"
}
]
}
Response:
{
id: "####-####-####-####",
timestamp: 123456,
...,
status: "Collecting HTML"
}
GET - /parser/:id/:timestamp
Response:
{
id: "####-####-####-####",
timestamp: 123456,
...,
status: "Collecting HTML" | "Standardizing JSON" | "Complete",
processed_data: {
"country": "France",
"population": 2102650,
"temperature_summary": "In Paris, the weather varies throughout the year, from chilly and damp winters to warm and sunny summers, interspersed with mild springs and autumns that bring a mix of rain and sunshine.",
"city_president": ""
}
}
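A client for the two endpoints above might look like the following minimal Python sketch. It assumes the service is reachable at some `base_url` (the host and port are not specified here), that the `id` and `timestamp` come from the POST /parser response, and that "Complete" is the only terminal status. Any error status the service may report is not documented above, so it is not handled. The names `http_get_json` and `poll_until_complete` are hypothetical.

```python
import json
import time
import urllib.request


def http_get_json(url):
    """Fetch a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def poll_until_complete(base_url, job_id, timestamp, fetch=http_get_json,
                        interval_seconds=2.0, max_attempts=30, sleep=time.sleep):
    """Poll GET /parser/:id/:timestamp until status is "Complete".

    `fetch` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    url = f"{base_url}/parser/{job_id}/{timestamp}"
    for attempt in range(max_attempts):
        job = fetch(url)
        if job.get("status") == "Complete":
            return job.get("processed_data", {})
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"Job {job_id} did not complete after {max_attempts} attempts")
```

A fixed polling interval keeps the sketch short. A real client would probably want a backoff strategy and an overall deadline, since extraction time depends on page size and model latency.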