Web scraping is the automated extraction of data from a website. It is a way to programmatically access the content of websites that don’t provide APIs. Keep in mind that scraping a website might be against its terms of service, so be sure to check those first.
In this article, we’d like to give you a quick introduction to web scraping and to show you how to do it using Ruby.
How Does Web Scraping Work?
Web scraping is very simple on the surface. You start by loading the website in your browser and then, using developer tools, you analyze the structure of the content. Websites usually have a predictable structure, with identifiers and class names that allow you to pinpoint different elements in the page. Take for example this piece of HTML, which renders a pricing table:
<div class="card-deck mb-3 text-center">
  <div class="card mb-4 shadow-sm">
    <div class="card-header">
      <h4 class="my-0 font-weight-normal">Free</h4>
    </div>
    <div class="card-body">
      <h1 class="card-title pricing-card-title">$0 <small class="text-muted">/ mo</small></h1>
      <ul class="list-unstyled mt-3 mb-4">
        <li>10 users included</li>
        <li>2 GB of storage</li>
        <li>Email support</li>
        <li>Help center access</li>
      </ul>
      <button type="button" class="btn btn-lg btn-block btn-outline-primary">Sign up for free</button>
    </div>
  </div>
  <div class="card mb-4 shadow-sm">
    <div class="card-header">
      <h4 class="my-0 font-weight-normal">Pro</h4>
    </div>
    <div class="card-body">
      <h1 class="card-title pricing-card-title">$15 <small class="text-muted">/ mo</small></h1>
      <ul class="list-unstyled mt-3 mb-4">
        <li>20 users included</li>
        <li>10 GB of storage</li>
        <li>Priority email support</li>
        <li>Help center access</li>
      </ul>
      <button type="button" class="btn btn-lg btn-block btn-primary">Get started</button>
    </div>
  </div>
</div>
You can loop over all the elements that have the .card class to get both the pricing cards, then in each of them, you can get the plan name using the .card-header class. Using the li elements you can even get the characteristics of a plan. Let’s do this in Ruby:
require 'nokogiri'

# The whole document
html = '<div class="card-deck mb-...'

# create a Nokogiri document
doc = Nokogiri::HTML(html)

# Get all the plan names
doc.css('.card').map { |card| card.css('.card-header h4').text }
# => ["Free", "Pro"]

# Get structured data
doc.css('.card').map do |card|
  title = card.css('.card-header h4').text
  price = card.css('.card-body .pricing-card-title').text
  features = card.css('.card-body ul li').map do |feature|
    feature.text
  end

  { title: title, price: price, features: features }
end
# => [{:title=>"Free",
#      :price=>"$0 / mo",
#      :features=>
#       ["10 users included",
#        "2 GB of storage",
#        "Email support",
#        "Help center access"]},
#     {:title=>"Pro",
#      :price=>"$15 / mo",
#      :features=>
#       ["20 users included",
#        "10 GB of storage",
#        "Priority email support",
#        "Help center access"]}]
We’re using Nokogiri, a great library for parsing XML/HTML documents in Ruby (more on this later). First, we load the document into Nokogiri with Nokogiri::HTML. Then we extract the plan names by looping over all the elements with the CSS class of .card, like we discussed above. In the last block, we take it a bit further and extract the title, price, and all the features. Using different CSS selectors you can get pretty much anything you want from a document.
Different Methods and Dynamic Content
There are different methods of performing web scraping. It all depends on your needs and how the data is formatted. It is not unusual to have to resort to a combination of ways of rendering and then mining the data. For example, a simple web page like mine is very easy to extract data from, because there’s a simple structure and nothing dynamic. Compare that with Facebook’s website, where there’s tons of dynamic content, and you’ve got yourself a very complex data extraction problem.
<html lang="en">
  <head>
    <title>Single Page Application</title>
  </head>
  <body>
    <div id="app"></div>
    <script src="https://example.org/some-framework.js"></script>
    <script>
      MyFramework.start(document.getElementById('app'));
    </script>
  </body>
</html>
The thing about dynamic websites (also called “Single Page Applications”) is that the HTML that is first sent to the browser is usually almost empty. It’s JavaScript behind the scenes that loads a framework, makes some network requests, and finally renders the content. For these situations, you need something like a browser to render the website, and only then can you start running your data extraction. Look for example at the snippet above: there’s no content! This is because the framework itself will do all the heavy lifting after the fact.
Headless Chrome and Proxies
One solution to the problem above is to use an actual web browser to fetch the content. If you were to fetch a website like that using cURL, you’d just get that (almost) empty document. You could open your browser and load the website, copy the rendered HTML from developer tools and parse that, but you’ll need to automate that process at some point. Here’s where tools like Headless Chrome come in. Headless Chrome allows you to control an instance of Chrome programmatically, without it needing a window open with the full UI (hence “headless”). Since it’s a full browser, you get all the features, specifically the rendering of the final page you need.
When scraping websites, there’s also the problem of where the connection is coming from. If you set up a server to extract some data from a website, you will most likely start hitting rate limits. This means you will either have to mine the data from the target website very slowly, or not be able to at all. One solution to this is to use proxies. Using a proxy allows you to spread your requests across multiple source locations, so you avoid hitting rate limits. You should of course keep the website’s terms of service in mind at all times.
One such service is ScrapingBee, which does both of these things: headless Chrome and proxying. Their services are available on RapidAPI, so getting started is as easy as it gets.
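Even with a proxy in front, it helps if the scraper backs off when it is told to slow down. Below is a minimal sketch of retrying with exponential backoff. The RateLimited error class and the with_backoff helper are made up for this example; in a real script you would raise RateLimited yourself when the response signals a rate limit (commonly HTTP status 429).

```ruby
# Hypothetical error raised by the caller when a rate limit is detected.
class RateLimited < StandardError; end

# Runs the block, retrying with exponentially growing pauses whenever it
# raises RateLimited. `sleeper` is injectable so the logic can be tried
# out without actually waiting.
def with_backoff(max_attempts: 4, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue RateLimited
    raise if attempt >= max_attempts # give up and surface the error

    sleeper.call(base_delay * (2**(attempt - 1))) # 1x, 2x, 4x, ...
    retry
  end
end

# Usage: the block fails twice, then succeeds on the third attempt.
result = with_backoff(sleeper: ->(seconds) { puts "waiting #{seconds}s" }) do |attempt|
  raise RateLimited if attempt < 3

  "fetched on attempt #{attempt}"
end
puts result
```

Doubling the pause after each failure keeps the scraper polite without hard-coding one fixed delay for every site.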
Let’s Scrape Something!
We’ll be building a simple scraper that takes advantage of ScrapingBee to do the website rendering and connection proxying. For this, you will need:
- Ruby,
- The nokogiri gem,
- The excon gem,
- A RapidAPI account
To install Ruby, check the official installation guide. We’ll also need Nokogiri. This gem, as you saw before, allows you to parse and extract data from HTML documents with ease. Installing it is a bit tricky, and it really depends on your platform. Thus, we’d also like to refer you to their installation guides, which should point you in the right direction as to how to get it installed. As long as you can run gem install nokogiri in your terminal without errors, you should be good to go. Also make sure you install Excon by running gem install excon.
To get your RapidAPI account, just head over to rapidapi.com and sign up. ScrapingBee’s API has a free usage tier, but it requires that you set up a credit card to cover any overages. You won’t need to pay for anything we do in this tutorial, as the free plan should cover the number of calls we will make.
Set Up ScrapingBee
Go to ScrapingBee’s page on RapidAPI, and click on the Subscribe button for their free plan. Enter your credit card details if needed.
After that, go to the Endpoints tab, and grab your Host and API Key. These are shown as X-RapidAPI-Host and X-RapidAPI-Key, respectively. You’ll need both for the next sections.
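If you’d rather not paste the key into your script, one option is to read both values from environment variables and build the headers from them. This is only a sketch: the rapidapi_headers helper and the variable names RAPIDAPI_KEY and RAPIDAPI_HOST are our own choices for this example, not something RapidAPI defines.

```ruby
# Hypothetical helper: builds the two RapidAPI headers from a hash-like
# environment. Defaults to ENV; any object with #fetch works too.
def rapidapi_headers(env = ENV)
  key = env.fetch('RAPIDAPI_KEY') do
    raise KeyError, 'set RAPIDAPI_KEY to your X-RapidAPI-Key value first'
  end
  host = env.fetch('RAPIDAPI_HOST', 'scrapingbee.p.rapidapi.com')

  { 'X-RapidAPI-Host' => host, 'X-RapidAPI-Key' => key }
end

# Usage with an explicit hash instead of the real environment:
headers = rapidapi_headers('RAPIDAPI_KEY' => 'YOUR API KEY')
puts headers['X-RapidAPI-Host']
```

The resulting hash has the same shape as the headers: option passed to Excon.get later in this article.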
Extracting API Data from RapidAPI’s Website
As an example of what you can do with ScrapingBee and Ruby, we’ll extract a list of APIs from RapidAPI. Let’s first look at how we can extract some basic data. If you head to rapidapi.com and then click on a category on the left, you’ll be taken to a list of APIs. This is the list we’ll be targeting.
If you’re using Chrome, you can right-click on an element and select “Inspect”. This opens developer tools, and lets you easily see the CSS selectors you can use to extract what you need. In our case, using the class ApiItemstyled__ResponsiveWrapper-qvgmn9-8 gets us all the APIs on display in the category view. Using the ApiItemstyled__ApiItemWrapperName-qvgmn9-2 class we can extract the name of the API. The description can be grabbed using the ApiItemstyled__Description-qvgmn9-4 class, and using the ApiItemstyled__Footer-qvgmn9-5 class we can grab the stats at the bottom, like popularity. Let’s write some code:
require 'nokogiri'
require 'excon'

API_KEY = 'YOUR API KEY'
HOST = 'scrapingbee.p.rapidapi.com'

target = 'https://rapidapi.com/category/Data'

response = Excon.get(
  "https://scrapingbee.p.rapidapi.com/?render_js=True&url=#{target}",
  headers: {
    'X-RapidAPI-Host' => HOST,
    'X-RapidAPI-Key' => API_KEY
  }
)

doc = Nokogiri::HTML(response.body)

doc.css('.ApiItemstyled__ResponsiveWrapper-qvgmn9-8').map do |api|
  title = api.css('.ApiItemstyled__ApiItemWrapperName-qvgmn9-2').text
  description = api.css('.ApiItemstyled__Description-qvgmn9-4').text
  popularity, latency, success = api.css('.ApiItemstyled__Footer-qvgmn9-5 div div div').map(&:text)

  {
    title: title,
    description: description,
    popularity: popularity,
    latency: latency,
    success: success
  }
end
Make sure you use the developer tools in your browser so you can follow along with how we’re using CSS selectors here. First, we make the request to ScrapingBee via RapidAPI. We set render_js to True so that our instance of headless Chrome will run the website’s JavaScript. Next, we parse the response using Nokogiri. Finally, we grab all the API elements in the page, loop over them, and extract the title, description, and stats from them. Pay special attention to how we get the stats. The (very simplified) HTML looks like this:
<div class="ApiItemstyled__Footer-qvgmn9-5 kTwcIW">
  <div>
    <div class="FlexLayouts__FlexRowCenterCenter-sc-1j19v2f-8 jlDTbN">
      <img src="/static-assets/default/popularity.svg">
      <div class="ApiItemstyled__FooterItem-qvgmn9-6 dzVIkT">9.9</div>
    </div>
  </div>
  <div>
    <div class="FlexLayouts__FlexRowCenterCenter-sc-1j19v2f-8 jlDTbN">
      <img src="/static-assets/default/latency.svg">
      <div class="ApiItemstyled__FooterItem-qvgmn9-6 dzVIkT">170ms</div>
    </div>
  </div>
  <div>
    <div class="FlexLayouts__FlexRowCenterCenter-sc-1j19v2f-8 jlDTbN">
      <img src="/static-assets/default/success-new.svg">
      <div class="ApiItemstyled__FooterItem-qvgmn9-6 dzVIkT">83%</div>
    </div>
  </div>
</div>
So, starting from the element with class name ApiItemstyled__Footer-qvgmn9-5, the selector div div div matches every div nested three levels deep, which are exactly the ones holding the values. Then we map that array of results and get the text from each element. Now, after running our script, we get an array with something like this:
[{:title=>"City Geo-Location Lookup",
  :description=>
   "This API gives you Latitude, Longitude, Time-Zone of any city",
  :popularity=>"9.8",
  :latency=>"1492ms",
  :success=>"94%"},
 {:title=>"Get Video and Audio URL",
  :description=>
   "Get direct links to download video or audio files from almost any hosting website, like Youtube, Twitch, SoundCloud, etc with download api.",
  :popularity=>"9.8",
  :latency=>"6977ms",
  :success=>"99%"},
 {:title=>"Currency Exchange",
  :description=>
   "Live currency and foreign exchange rates by specifying source and destination quotes and optionally amount to calculate. Support vast amount of quotes around the world.",
  :popularity=>"9.8",
  :latency=>"1347ms",
  :success=>"96%"},
 # and more...
]
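Everything the scraper returns is a string, including the stats. If you want to sort or filter by them, a small post-processing step helps. Here is a sketch that converts one entry of the array above into numbers; the normalize_stats method and the latency_ms and success_rate keys are names we made up for this example, and it assumes the values always look like "9.8", "1492ms", and "94%".

```ruby
# Hypothetical helper: turns the scraped stat strings into numbers.
# Integer() and Float() raise ArgumentError on unexpected input, which
# makes a change in the page's format easy to notice.
def normalize_stats(api)
  {
    title: api[:title],
    description: api[:description],
    popularity: Float(api[:popularity]),                          # "9.8"    -> 9.8
    latency_ms: Integer(api[:latency].delete_suffix('ms')),       # "1492ms" -> 1492
    success_rate: Integer(api[:success].delete_suffix('%')) / 100.0 # "94%"  -> 0.94
  }
end

entry = {
  title: 'City Geo-Location Lookup',
  description: 'This API gives you Latitude, Longitude, Time-Zone of any city',
  popularity: '9.8',
  latency: '1492ms',
  success: '94%'
}

stats = normalize_stats(entry)
puts stats[:latency_ms]
```

With numeric values in place, something like sorting the whole list by latency_ms becomes a one-liner with sort_by.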
Conclusions and Tips
We hope this gave you some ideas on how to best use web scraping. Here are some tips to make the process easier:
- Test CSS selectors in the browser: by using something like document.querySelectorAll() you can test your CSS selectors and verify that they are actually selecting what you need and not more.
- Some websites might require some interaction to load everything. For example, you’d need to scroll to trigger an infinite scrolling loader. In this case, it’s possible to send JavaScript code that can be executed in the headless Chrome instance. Just make sure it’s Base64 encoded.
- If you need ScrapingBee to wait some seconds before sending you the content of the website (for example, the website needs more time than normal to render), you can send an extra parameter named wait, in which you can specify up to 10,000 milliseconds of wait time.
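The last two tips both come down to adding query parameters to the request URL, so here is a sketch of a small URL builder. The scrapingbee_url method and its keyword arguments are made up for this example. The wait parameter is the one described above; the js_snippet parameter name is an assumption on our part, so check ScrapingBee’s documentation for the exact name before relying on it.

```ruby
require 'uri'

SCRAPINGBEE_ENDPOINT = 'https://scrapingbee.p.rapidapi.com/'
MAX_WAIT_MS = 10_000 # upper bound for the wait parameter mentioned above

# Hypothetical helper: builds the request URL with properly encoded
# query parameters.
def scrapingbee_url(target, wait_ms: nil, js: nil)
  if wait_ms && !(0..MAX_WAIT_MS).cover?(wait_ms)
    raise ArgumentError, "wait_ms must be between 0 and #{MAX_WAIT_MS}"
  end

  params = { 'render_js' => 'True', 'url' => target }
  params['wait'] = wait_ms.to_s if wait_ms
  # pack('m0') is Base64 without line breaks.
  # 'js_snippet' is an assumed parameter name, not confirmed by this article.
  params['js_snippet'] = [js].pack('m0') if js

  "#{SCRAPINGBEE_ENDPOINT}?#{URI.encode_www_form(params)}"
end

# Usage: wait three seconds and scroll to the bottom of the page.
puts scrapingbee_url(
  'https://rapidapi.com/category/Data',
  wait_ms: 3000,
  js: 'window.scrollTo(0, document.body.scrollHeight);'
)
```

URI.encode_www_form also percent-encodes the target URL, which keeps any query string in the target from being mixed up with ScrapingBee’s own parameters.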