Python Web Scraping using Beautiful Soup
Background
Let's assume that we have two competitors selling similar pairs of shoes in the same area. Traditionally, if competitor A wanted to know competitor B's pricing, A would ask someone close to B.
These days, it is quite different. If we want to purchase a bouquet of roses, we just check the seller's platform for the price. That, in essence, is web scraping: the art of extracting data from a website. And we can automate examples like these in Python with the Beautiful Soup module.
Dos and don’ts of web scraping
Web scraping is legal in one context and illegal in another. For example, it is legal when the extracted data consists of directories and telephone listings for personal use. However, if the extracted data is put to commercial use without the owner's consent, that would be illegal. Thus, we should be careful when extracting data from a website and always be mindful of the law.
Getting started
There are three standard methods we can use to scrape data from a web page: regular expressions, Beautiful Soup, and CSS selectors. If you know of any other approach to scrape data from a web page, kindly share it in the comments section.
Before we dive straight into scraping data from a stock exchange site, let’s understand a number of basic terms in web scraping.
- Web Crawling: Web crawling simply refers to downloading the HTML pages of a website via user agents known as crawlers. Examples include Googlebot, Baiduspider, and Bingbot.
- Robots.txt: Robots.txt is a file which contains a set of suggestions/instructions intended for crawlers. This set of instructions specifies whether a crawler is allowed to access a particular web page on a website or not.
- Sitemap Files: Sitemap files are provided by websites to make crawling easier for crawlers. They help crawlers locate updated content on a website: instead of crawling every web page, crawlers can check for updated content via the sitemap files. For further details, the sitemap standard is defined at http://www.sitemaps.org/protocol.html
- Beautiful Soup: Beautiful Soup is a popular Python module that parses (or examines) a web page and provides a convenient interface for navigating its content. I prefer Beautiful Soup to regular expressions and CSS selectors when scraping data from a web page. It is also one of the Python libraries recommended by the #1 Stack Overflow answerer, Martijn Pieters. But if you want, you can also build a web scraper in Node.js. A quick taste of the Beautiful Soup interface follows this list.
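To get an early feel for that interface, here is a minimal sketch that parses a small, made-up HTML string; the tags, class names, and values are purely illustrative:

from bs4 import BeautifulSoup

# a tiny, made-up HTML snippet used only for illustration
html = '<h1 class="name">Example Index</h1><div class="price">100.25</div>'
soup = BeautifulSoup(html, 'html.parser')

print soup.find('h1', attrs={'class': 'name'}).text    # Example Index
print soup.find('div', attrs={'class': 'price'}).text  # 100.25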
Apart from Beautiful Soup, which we will use to scrape data from a web page, there are Python modules that help us learn the technical aspects of our web target. We can use the builtwith module to find out more of our target's technical details. You can install the builtwith module by doing the following:
pip install builtwith
The builtwith module returns the technologies a website was built upon. Web intermediaries (e.g., WAFs or proxies) may hide some of these technical details for security reasons. For instance, let's try to examine Bloomberg's website:
import builtwith
builtwith.parse('http://www.bloomberg.com')
Below is a screenshot of the output:
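Since the original screenshot is not reproduced here, the sketch below shows the kind of dictionary builtwith.parse typically returns; the categories and technologies listed are illustrative, not Bloomberg's actual profile:

# illustrative output only -- the real result for bloomberg.com will differ
{'javascript-frameworks': ['jQuery'],
 'web-servers': ['nginx'],
 'analytics': ['Google Analytics']}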
Before we scrape the name and price of the index on Bloomberg, we need to check our target's robots.txt file before taking any further steps. As a reminder of its purpose, I explained earlier that robots.txt is a file composed of suggestions for crawlers (or web robots).
For this project, our target is Bloomberg. Let’s check out Bloomberg’s restrictions for web crawlers.
Just type the following in the browser's address bar:
http://www.bloomberg.com/robots.txt
This simply sends a request to the web server to retrieve the robots.txt file. Below is the robots.txt file retrieved from the web server. Now let's check Bloomberg's rules for web robots.
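The retrieved file is not reproduced here, but a robots.txt file generally follows the shape below; the rules shown are illustrative, not Bloomberg's actual ones:

# illustrative robots.txt -- not Bloomberg's actual rules
User-agent: *
Disallow: /private/
Disallow: /search
Sitemap: http://www.example.com/sitemap.xml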
Crawling our target
With the help of the robots.txt file, we know where we can allow our crawler to download HTML pages and where it should not tread. As good web citizens, it is advisable to obey the bot rules. Nothing technically prevents our crawler from venturing into restricted areas, but if it does, Bloomberg may ban our IP address for an hour or a longer period. We can even check these rules programmatically, as sketched below.
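As a minimal sketch, Python 2's standard-library robotparser module can answer that question for us; the quote URL below is only an example path on the site:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.bloomberg.com/robots.txt')
rp.read()

# check whether a generic crawler may fetch the quote page
print rp.can_fetch('*', 'http://www.bloomberg.com/quote/SPGSCITR:IND')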
For this project, it is not necessary to download/crawl a specific web page. We can use the Firebug extension to inspect the page we want to scrape our data from.
Now let's use Firebug to find the HTML related to the index's name and price of the day. Similarly, we can use the browser's native inspector, too. I prefer to use both.
Just hover your cursor over the index name and click to reveal the related HTML tags. We can see the name of the index, which should look something like the markup below:
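The screenshot of the markup is not reproduced here, but based on the class names queried later in this tutorial, the relevant tags should resemble the following simplified, illustrative snippet:

<!-- simplified, illustrative markup; the live page contains many more attributes -->
<h1 class="name">S&P GSCI Total Return Index</h1>
<div class="price">2,456.08</div>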
Let's examine the sitemap file of our target
Sitemap files simply provide links to the updated content of a website, which allows crawlers to efficiently crawl the pages they are interested in. Below are a number of Bloomberg's sitemap files:
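The listing itself is not shown here, but a sitemap file follows the standard XML format defined at sitemaps.org; the entry below is an illustrative example rather than a real Bloomberg URL:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/quote/SPGSCITR:IND</loc>
    <lastmod>2017-01-01</lastmod>
  </url>
</urlset>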
Let’s scrape data from our target:
Now it is time to scrape particular data from our target site: www.bloomberg.com. There are diverse ways to scrape data from a web page: CSS selectors, regular expressions, and the popular Beautiful Soup module. Among these three approaches, we are going to use Beautiful Soup. Note that the name we use to install the package with pip (beautifulsoup4) is different from the name we import (bs4). As for text editors, you can use Sublime, Atom, or Notepad++; others are available, too.
Now let's assume we don't have Beautiful Soup installed. Let's install it via pip (remember, the package name on PyPI is beautifulsoup4):
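pip install beautifulsoup4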
Next, we import urllib2, BeautifulSoup, and datetime:
# import libraries
import urllib2                  # urllib2 is used to fetch URLs via urlopen()
from bs4 import BeautifulSoup   # the import name is bs4, not beautifulsoup4
from datetime import datetime   # for working with dates and times
Now, let's define a variable for the URL:
quote_page = 'http://www.bloomberg.com/quote/SPGSCITR:IND'
Here, we record the start time with the datetime module:
t1 = datetime.now()
Now let's use urllib2 to fetch the HTML page at the URL stored in the quote_page variable and assign the result to the variable page:
page = urllib2.urlopen(quote_page)
Afterward, let's parse the HTML page with Beautiful Soup:
soup = BeautifulSoup(page, 'html.parser')
Since we saw from the earlier inspection which HTML tags hold the name and price of the index, it is not difficult to query the specific class name:
name_store = soup.find('h1', attrs={'class': 'name'})
Now let's get the name of the index by accessing its text via dot notation, strip the surrounding whitespace, and store the result in the variable data_name:
data_name = name_store.text.strip()
Just as with the index name, let's do the same for the index price:
price_store = soup.find('div', attrs={'class': 'price'})
price = price_store.text
We print the data_name of our index as:
print data_name
Also, print the price:
print price
Finally, we calculate the total running time of the program as follows:
t2 = datetime.now()
total = t2 - t1
print 'scraping completed in ', total
Below is the full source code:
import urllib2
from bs4 import BeautifulSoup
from datetime import datetime
quote_page = 'http://www.bloomberg.com/quote/SPGSCITR:IND'
t1 = datetime.now()
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_store = soup.find('h1', attrs={'class': 'name'})
data_name = name_store.text.strip()
price_store = soup.find('div', attrs={'class': 'price'})
price = price_store.text
print data_name
print price
t2 = datetime.now()
total = t2 - t1
print 'scraping completed in ', total
This should be the output:
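The original screenshot is omitted here, so the lines below are only an illustrative sketch of the shape of the output; the actual name, price, and timing will differ:

S&P GSCI Total Return Index
2,456.08
scraping completed in  0:00:01.542193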
Wrapping up:
And with that, we just learned how to scrape data with Beautiful Soup, which, in my opinion, is quite easy compared with regular expressions and CSS selectors. And just so you are aware, this is just one of the ways of scraping data with Python.
And just to reiterate this important point: web scraping is legal in one context and illegal in another. Before you scrape data from a web page, it is strongly advisable to check the bot rules of a website by appending robots.txt to the end of the URL, like this: www.example.com/robots.txt. Your IP address may be restricted until further notice if you fail to do so. I hope you'll use the skill you just learned appropriately, cheers!
Author’s Bio
Michael is a budding Cybersecurity Engineer and a technical writer based in Ghana, Africa. He works with AmericanEyes Security as a part-time WordPress security consultant. He is interested in Ruby on Rails and PHP security.