Getting Started with Scraping in Python
Web scraping is a technique for extracting data from websites and saving it to a local file on your computer. That data can then be displayed on your own website or application, used for data analysis, or applied in other ways.
Why scrape instead of using APIs?
First, not every site has an API. Second, even if the site has an API, it may not provide the information you want (or you may not be able to afford it).
How do you web scrape?
Broadly speaking, it’s a three-step process:
- Getting the content (in most cases, the page's HTML).
- Parsing the response.
- Optimizing/improving the process and preserving the data.
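To make these steps concrete, here's a compressed sketch of the whole pipeline using the tools we'll cover below. The URL and the h3 tag are placeholders, not the real target; we'll build the real version step by step.

import json

import requests
from bs4 import BeautifulSoup

# Step 1: get the content (placeholder URL)
response = requests.get("https://example.com/items")

# Step 2: parse the response and pull out what we need
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text() for tag in soup.find_all("h3")]

# Step 3: preserve the data
with open("titles.json", "w") as f:
    json.dump(titles, f)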
Let's work through a concrete problem statement:
We're going to fetch the title, author, and upvotes for all of the proposals for Pycon India 2016.
The proposals list is currently ordered by the date proposals were added. We’d like to see the proposals sorted by the number of upvotes.
As you can see, there’s no way to do this on the website. To get what we want, we're going to scrape the proposals, store them in a Python list, and then sort that list by the number of upvotes for each proposal.
Getting the content
Most of the time, you'll only be interested in getting the HTML content. In Python, if you have the link, you can use the standard library (urllib2 in Python 2, urllib.request in Python 3) to fetch the page's HTML.
However, this turns out to require a fair amount of boilerplate, which is typically fragile.
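To see why, here's roughly what a standard-library version looks like in Python 3. The User-Agent header is an optional extra (some servers reject the default one); treat this as a sketch, not a canonical recipe.

from urllib.request import Request, urlopen
from urllib.error import URLError

request = Request("https://in.pycon.org/cfp/2016/proposals/",
                  headers={"User-Agent": "Mozilla/5.0"})
try:
    html_doc = urlopen(request).read().decode("utf-8")
except URLError as error:
    print("Failed to fetch the page:", error)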
I'd suggest that you use the wonderful Requests library instead, which bills itself as "HTTP for humans." We're going to use Requests to make a GET or POST request to the server and store the response we receive. Here's the code snippet for it:
import requests
response = requests.get("https://in.pycon.org/cfp/2016/proposals/")
if response.status_code == 200:
    print("Fetched the page successfully")
    html_doc = response.text
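As an aside, Requests can also raise an exception on a failed response for you, instead of an explicit status check. A minimal sketch:

response = requests.get("https://in.pycon.org/cfp/2016/proposals/", timeout=10)
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
html_doc = response.text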
Parsing the response
Now that we have the response, we need a way to extract the information we need. The response.text attribute is a string. We could pull the required content out of the HTML with regular expressions and basic Python, but that quickly becomes complex and error-prone.
Instead, we'll use third-party libraries. Some popular libraries are BeautifulSoup4, lxml, and HTMLParser. These libraries use different parsing techniques under the hood, so they vary in performance and ease of use. We'll be using Beautiful Soup here, because it's the most popular and user-friendly of the three.
Creating the soup object:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
Let's see what's going on in these three lines. The first line imports BeautifulSoup. Then we create a soup object, passing in the HTML we received and specifying which parser to use (html.parser is the parser that ships with Python's standard library).
You can also use lxml here. A parser offers programmers an interface for easily accessing and modifying various parts of the HTML, rather than treating it as a raw string. Popular blog posts compare the performance of the various parsers, and the Beautiful Soup documentation is another useful reference. In the last line, we print out the HTML; soup.prettify() formats it in a human-readable form before printing.
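For example, swapping in lxml (installed separately with pip install lxml) is a one-line change, and the rest of the code stays the same:

soup = BeautifulSoup(html_doc, 'lxml')  # typically faster and more lenient than html.parser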
Finding the content required
Now that we have the soup object, we can call various methods on it and easily extract the data we want from the HTML.
In general, there are three ways of approaching this:
- Specifying elements
- Matching attributes of elements (class, id, etc.)
- Using XPath
We can use CSS selectors for the first two approaches; the third requires XPath. Note that Beautiful Soup doesn't support XPath, so we'll stick with CSS selector syntax here. Libraries like Scrapy and Selenium do support XPath.
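Here's roughly how the first two approaches look with Beautiful Soup's select() method, and, for comparison, how the same class query reads as XPath using lxml (the #main id is a made-up example):

# 1. Specifying elements
headings = soup.select('h3')

# 2. Matching attributes of elements
proposals = soup.select('div.user-proposals')  # by class
main_pane = soup.select('#main')               # by id (made-up example)

# 3. XPath, via a library like lxml instead of Beautiful Soup
from lxml import html
tree = html.fromstring(html_doc)
proposals = tree.xpath('//div[@class="user-proposals"]')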
Let's quickly breeze through some commonly used actions we can perform on our soup object. Beautiful Soup has excellent documentation, so please go through the Quick Start section once before moving ahead.
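As a quick reference, these are the actions you'll reach for most often (the tag names here are generic examples, not specific to our page):

soup.title                  # the first <title> tag
soup.find('div')            # the first matching tag, or None
soup.find_all('a')          # a list of all matching tags
soup.find('a').get_text()   # the text inside a tag
soup.find('a')['href']      # the value of an attribute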
Observing the page structure
Now comes the important part. You need to understand the HTML page structure in order to get the desired content. We're interested in getting the following values for all of the proposals on the page from Pycon India 2016's site:
- The number of votes.
- The number of comments.
- The title.
- The link to the description.
- The author of the proposal.
Here's how we proceed. Looking at the HTML structure, we see that every proposal is contained within a div with the class user-proposals. First, we want to get all of the divs with the class user-proposals.
proposal_divs = soup.find_all("div", class_="user-proposals")
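Before moving on, it's worth a quick sanity check that we matched the right elements; for example, peek at the first match:

print(proposal_divs[0].prettify()[:500])  # first 500 characters of the first matched div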
Now, we can locate where the required data is stored inside each user-proposals div. A quick way to do this is to go to the element in your browser and click Inspect element. For example, if I'm looking for the element containing the author, I simply right-click on any author name and select 'Inspect element.'
Careful observation leads us to the following useful conclusions:
- The number of votes is placed in the only h4 tag of the div with the class panel-body.
- The number of comments is placed in the only span tag, just after the i tag with the class fa fa-comments-o. Note: it could also have been found as the only span tag of the div with the class space-on-top; I'm just trying to cover different possible approaches.
- The title of the proposal is placed in the a tag inside the h3 tag with the class proposal--title.
- The link to the proposal's description is placed inside the href attribute of that same a tag.

Now, take a look at the code to extract all of this.
proposal_divs = soup.find_all("div", class_="user-proposals")
print('No. of proposals:', len(proposal_divs))
results = []
for proposal_div in proposal_divs:
    # Initialize an empty dictionary to store all the data
    proposal_dict = {}
    # Using CSS selectors to get the number of votes
    proposal_dict['votes'] = int(proposal_div.select('.panel-body > h4')[0].get_text())
    # Using chained find methods to get the number of comments
    proposal_dict['comments'] = proposal_div.find('div', class_='space-on-top').find('span').get_text().strip()
    # We can also pass other attributes to the find method inside a dictionary
    title_tag = proposal_div.find('h3', {'class': 'proposal--title'}).find('a')
    proposal_dict['name'] = title_tag.get_text().strip()
    # The link to the description is in the href attribute of the same a tag
    proposal_dict['link'] = title_tag['href']
    results.append(proposal_dict)
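Finally, the sort we set out to do in the problem statement: order the proposals by upvotes, descending, and print them.

sorted_results = sorted(results, key=lambda p: p['votes'], reverse=True)
for proposal in sorted_results:
    print(proposal['votes'], proposal['name'], proposal['link'])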
That's pretty much it! I hope this post gave you a basic understanding of scraping. I'd recommend exploring the Scrapy project next: it's open source, and scraping more complicated sites is much more manageable with it.
This post was originally published by the author here. This version has been edited for clarity and may differ from the original post.