Creating an Automated Text Extraction Workflow

Intro

Developing a workflow that automatically extracts relevant text from URLs can be laborious.

Whether you’re collecting articles or blogs for a dataset, scanning press releases to spot mentions of competitors, or composing a company news feed, the process of gathering URLs, extracting the text, and storing the results requires quite a bit of hand-holding.

What Tools Are Out There?

To actually gather the URLs you want to extract text from, you can use the powerful News API (they have a decent free tier for non-commercial use), or the well-priced and speedy Newscatcher API.

If you’re crawling the same sources, your IP might eventually be blocked, so you’ll need a service that offers proxy rotation. From our experience, scraperapi.com and zenscrape.com are both solid options for extracting HTMLs while handling proxies and retries.
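Just to give you an idea, routing a request through one of these services usually amounts to a one-liner. Here’s a rough sketch based on scraperapi.com’s request pattern (the key and target URL are placeholders):

import requests

## Fetch a page's HTML through ScraperAPI, which rotates proxies and retries for you
payload = {
  "api_key": "YOUR_SCRAPERAPI_KEY",
  "url": "https://www.example.com/some-article"
}
html = requests.get("http://api.scraperapi.com", params=payload).text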

You’ll also need something that reliably extracts clean, boilerplate-free text from the HTML. The most robust free option I’ve come across is newspaper3k — if you come from the Python world, it’s the requests of text extraction. For the most part, it’s accurate and fast.
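To give you a sense of how little code it takes, here’s a minimal newspaper3k sketch (the URL is just a placeholder):

from newspaper import Article

## Download and parse a single article, then pull out the clean text
article = Article("https://www.example.com/some-article")
article.download()
article.parse()

print(article.title)
print(article.text)  ## boilerplate-free body text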

The 600 lb gorilla, Diffbot, comes with a swath of awesome APIs but starts at $300, which is ridiculous if you’re just extracting text. Scrapinghub’s News API, Extractor API, and plenty more are affordable alternatives; plus, Extractor API includes a no-code online tool for extracting hundreds of articles at once, if you want to do things via the UI.

Finally, you need to store the results somewhere — a local or hosted database you can easily query, like Digital Ocean’s Managed Database or AWS’s DynamoDB. You can also use Extractor API to store all your extracted text, and easily query your jobs via a RESTful API — we’ll dive into that soon.
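If you go the local route, even SQLite is enough to stash extracted text until you need it. A quick sketch (the table layout is just an example):

import sqlite3

## A minimal table for extracted articles
conn = sqlite3.connect("articles.db")
conn.execute(
  "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT, text TEXT)"
)
conn.execute(
  "INSERT OR REPLACE INTO articles (url, title, text) VALUES (?, ?, ?)",
  ("https://www.example.com/some-article", "Example title", "Extracted text..."),
)
conn.commit()
conn.close()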

The Technology We’ll Be Using

In this guide, we’ll be using the News API to gather relevant URLs and Extractor API to extract relevant data and store our work for later querying.

To use News API, head over to their pricing page and sign up for the free Developer plan. For Extractor API, the pricing page includes a free plan, which we’ll be using today.

When you’re set, let’s dive in!

Gathering the URLs

We’ll be using News API’s free tier to gather news URLs mentioning “artificial intelligence” in some of my favorite tech news sources: Ars Technica, Wired, Bloomberg, Fast Company, MIT Technology Review, Gizmodo, and others.

Here’s how to set it up (you can also see News API’s Python client setup guide here):

import requests
from newsapi import NewsApiClient

newsapi = NewsApiClient(api_key="YOUR_API_KEY")

## We can leave this as a list to easily add or remove sources later
domains = [
    "wired.com", "bloomberg.com", "arstechnica.com", "technologyreview.com", 
    "gizmodo.com", "engadget.com", "fastcompany.com", "theverge.com", 
    "techcrunch.com"
]

## Join the list into a string to feed it to News API
domains = ",".join(domains)

## We'll be looking for mentions of 'AI' in article titles - note the single quotes inside the double quotes, for an exact match
all_articles = newsapi.get_everything(
  qintitle="'AI'",
  domains=domains,
  language="en",
  page_size=100
)

## In my case, I got 58 articles for this query
target_urls = [article["url"] for article in all_articles["articles"]]

Extracting and Storing the Text

Now that we have our 58 target URLs, it’s time we started extracting text and storing the results for later use. Extractor API helps us do this easily with the help of the Jobs endpoint.

If you take a look at the documentation, you’ll see the Create Jobs endpoint, which allows you to feed the API a list of URLs under a job_name of your choice.

A few notes about the Jobs:

  • Extraction happens server-side, so you can check the progress via the API. When all the URLs in the job have finished processing, you’ll get a 100% completion status (I’ll show you how to query for this).

  • You can fetch the paginated list of URLs (along with extracted text, titles, and so on) at any time for downstream processing using a simple GET request. The extracted data is securely stored on the Extractor API server, so you’ll have quick access to your crawls.

  • You can access your jobs (and see their status) on the Jobs page on the Extractor API website, and download results in either .json or .csv format.

Once you’re logged in, you can head over to your Dashboard, where you can retrieve your API key.

Here’s how we go about creating a job with the Extractor API (we’re continuing the script from above):

api_key = "YOUR_API_KEY"

endpoint = "https://extractorapi.com/api/v1/jobs"
headers = {
  "Authorization": f"Bearer {api_key}"
}

data = {
  "job_name": "ai_articles",
  "url_list": target_urls
}

## Send the payload as JSON so url_list goes through as a proper list
r = requests.post(endpoint, headers=headers, json=data)

## We'll be using the job_id to check the status of the job
job_id = r.json()["id"]

We can check the status programmatically:

endpoint = f"https://extractorapi.com/api/v1/jobs/{job_id}"

## We already have the authorization header set, so we can go ahead and send the GET request
r = requests.get(endpoint, headers=headers)

print(r.json()["job_status"])

## Status output
{
  "urls_processed": 58,
  "percent_complete": 100,
  "errors": 1
}

Or we can head over to the Jobs page (under the Online Text Extractor dropdown in the menu), where we can see that all but one of the URLs were successfully processed.
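If you’d rather have the script wait for the job to finish on its own, a small polling loop over the same status endpoint does the trick (the ten-second interval is arbitrary):

import time

## Keep checking the job status until every URL has been processed
while True:
  r = requests.get(f"https://extractorapi.com/api/v1/jobs/{job_id}", headers=headers)
  if r.json()["job_status"]["percent_complete"] == 100:
    break
  time.sleep(10)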

Querying Your Job

Once your job is done, you can inspect the results using the Job URLs endpoint. Just add /urls (no ending slash needed) and you’ll get a paginated list of results.

endpoint = f"https://extractorapi.com/api/v1/jobs/{job_id}/urls"

## We already have the authorization header set, so we can go ahead and send the GET request
r = requests.get(endpoint, headers=headers)

## Let's see our options
print(r.json().keys())

## Available keys
dict_keys(['count', 'next', 'previous', 'results'])

count is the total number of results (in our case, 57 successfully extracted URLs and one error). By default, each “page” shows 10 results. To access the next or previous pages, simply use the URL found in the next or previous key.

For example, to collate all extracted text into a Python list, you could do this:

import math

## This will give the number of pages we'll have to traverse - in our case, six
num_pages = math.ceil(r.json()["count"]/10)

articles = []

for page in range(1, num_pages + 1):
  ## The first page is already in `r` from the request above
  if page > 1:
    r = requests.get(endpoint + f"?page={page}", headers=headers)
  for article in r.json()["results"]:
    if article["status"] == "COMPLETE":
      articles.append(article["text"])
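Equivalently, you can skip the page math and keep following the next URL the API returns until there are no pages left (assuming next is null on the last page, as is typical for paginated APIs):

## Alternative: follow the paginated "next" links until they run out
articles = []
next_url = endpoint

while next_url:
  r = requests.get(next_url, headers=headers)
  payload = r.json()
  for article in payload["results"]:
    if article["status"] == "COMPLETE":
      articles.append(article["text"])
  next_url = payload["next"]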

Filtering Your Results

You might not want all the articles in your job — you might need a specific subset, say, all those that contain the word “acquisition” in the full text. In that case, you can apply a filter parameter:

## We're using the same endpoint from above, so we just need to add the filter to our parameters
params = {
  "text__icontains": "acquisition"
}

r = requests.get(endpoint, headers=headers, params=params)

print([article["title"] for article in r.json()["results"]])

## In our case, it's a single article
["IBM and Red Hat expand their telco, edge and AI enterprise offerings"]

For the **Job URLs** endpoint you can search the title and text for any keywords. For the **List Jobs** endpoint you can search the job name.

To construct the parameter, simply attach __contains (for a case-sensitive match) or __icontains (for a case-insensitive match). A few examples: text__icontains=facebook (for the **Job URLs** endpoint), title__contains=AI (for the **Job URLs** endpoint), and job_name__contains=my_articles (for the **List Jobs** endpoint).
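For instance, searching your jobs by name with the List Jobs endpoint looks like this (assuming its response is paginated the same way as the Job URLs endpoint, with a results key holding each job’s details, including job_name; the job name below is just the one we created above):

## Find jobs whose name contains a given string
params = {
  "job_name__contains": "ai_articles"
}

r = requests.get("https://extractorapi.com/api/v1/jobs", headers=headers, params=params)

print([job["job_name"] for job in r.json()["results"]])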

Tools for Automation

So let’s wrap up for this part. As we saw above, using just two APIs you can start to piece together a mechanism for retrieving relevant URLs, extracting and saving the URL text data, then querying your jobs.

Next, we’ll be looking at interesting methods to automate the URL retrieval and text extraction process. Till next time!