Web scraping using Python and BeautifulSoup
Intro
In the era of data science, it is common to collect data from websites for analytics purposes.
Python is one of the most commonly used programming languages for data science projects, and using Python with BeautifulSoup makes web scraping easier. Knowing how to scrape web pages will save you time and money.
Prerequisite
- Basics of Python programming (python3.x).
- Basics of HTML tags.
Installing required modules
First things first: assuming python3.x is already installed on your system, you need to install the requests HTTP library and the beautifulsoup4 module.
Install requests and beautifulsoup4:
$ pip install requests
$ pip install beautifulsoup4
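To confirm that both packages installed correctly, a quick sanity check (just a sketch) is to import them and print their versions:
import requests
import bs4

print(requests.__version__)  # e.g. 2.x
print(bs4.__version__)       # e.g. 4.x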
Collecting web page data
Now we are ready to go. In this tutorial, our goal is to get the list of presidents of the United States from this Wikipedia page.
Go to the link and right-click on the table containing all the information about the United States presidents, then click Inspect to inspect the page (I am using Chrome; other browsers have a similar option to inspect the page).
The table content is within the table tag with class wikitable (see the image below). We will need this information to extract the data of interest.
Import the installed modules
import requests
from bs4 import BeautifulSoup
To get the data from the web page, we will use the get() method of the requests library:
url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)
It is always good to check the HTTP response status code:
print(page.status_code) # This should print 200
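If you would rather have the script fail loudly on a bad response instead of continuing with empty data, requests also provides raise_for_status() (a minimal sketch):
# Raise an HTTPError if the server returned a 4xx or 5xx status code
page.raise_for_status()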
Now that we have collected the data from the web page, let's see what we got:
print(page.content)
The above code will display the HTTP response body.
The above data can be viewed in a prettier format by using BeautifulSoup's prettify() method. For this we will create a bs4 object and call its prettify() method:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
This will print the data in a format like the one we saw when we inspected the web page:
<table class="wikitable" style="text-align:center;">
<tbody>
<tr>
<th colspan="9">
<span style="margin:0; font-size:90%; white-space:nowrap;">
<span class="legend-text" style="border:1px solid #AAAAAA; padding:1px .6em; background-color:#DDDDDD; color:black; font-size:95%; line-height:1.25; text-align:center;">
</span>
<a href="/wiki/Independent_politician" title="Independent politician">
Unaffiliated
</a>
(2)
</span>
<span style="margin:0; font-size:90%; white-space:nowrap;">
...
...
By now we know that our table is in the table tag with class wikitable. So, first we will extract the data in the table tag using the find method of the bs4 object. This method returns a bs4 Tag object (or None if nothing matches):
tb = soup.find('table', class_='wikitable')
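Since find() returns None when nothing matches, it is worth checking the result before using it; otherwise the next steps will fail with an AttributeError. A minimal guard (just a sketch) could be:
# Stop early if the expected table was not found on the page
if tb is None:
    raise ValueError("Could not find a table with class 'wikitable' on the page")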
This tag has many nested tags, but we only need the text of the a tag inside each b tag (the b tags are children of the table tag). For that we need to find all the b tags under the table tag and then find the a tag under each of them. We will use the find_all method and iterate over each b tag to get its a tag:
for link in tb.find_all('b'):
    name = link.find('a')
    print(name)
This will extract the data under all the a tags:
<a href="/wiki/George_Washington" title="George Washington">George Washington</a>
<a href="/wiki/John_Adams" title="John Adams">John Adams</a>
<a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>
<a href="/wiki/James_Madison" title="James Madison">James Madison</a>
<a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>
...
...
<a href="/wiki/Barack_Obama" title="Barack Obama">Barack Obama</a>
<a href="/wiki/Donald_Trump" title="Donald Trump">Donald Trump</a>
The text of each link (which matches its title) can be extracted from all the a tags using the get_text() method. So, modifying the above code snippet:
for link in tb.find_all('b'):
    name = link.find('a')
    print(name.get_text())
and here is the desired result
George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
...
...
Barack Obama
Donald Trump
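As a side note, if you want the value of the title attribute itself rather than the link text, the Tag.get() method reads attributes directly (a small variation of the loop above):
for link in tb.find_all('b'):
    name = link.find('a')
    if name is not None:
        print(name.get('title'))  # reads the title="..." attribute of the a tag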
Putting it all together
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)                            # fetch the page
soup = BeautifulSoup(page.content, 'html.parser')   # parse the HTML
tb = soup.find('table', class_='wikitable')         # locate the presidents table
for link in tb.find_all('b'):                       # each president's name sits in a b tag
    name = link.find('a')
    print(name.get_text())                          # print the link text (the president's name)
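If you would rather keep the names for further analysis than just print them, a small variation of the loop (just a sketch) collects them into a list:
presidents = []
for link in tb.find_all('b'):
    name = link.find('a')
    if name is not None:
        presidents.append(name.get_text())

print(presidents)  # ['George Washington', 'John Adams', ...]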
We have successfully scraped a web page in less than 10 lines of Python code! Bingo!
Leave feedback in the comment box. Let me know if you have any questions or run into any difficulty with this tutorial.
I am getting the desired output, but it is followed by an AttributeError: 'NoneType' object has no attribute 'get_text'.
Any ideas?
It means that at some point in the code link.find('a') returns None (meaning there was no a tag in that link object), so you can't call .get_text() on something that doesn't exist.
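One way around it (just a sketch of the same loop with a guard) is to skip any b tag that does not contain an a tag:
for link in tb.find_all('b'):
    name = link.find('a')
    if name is not None:  # skip b tags with no a tag inside
        print(name.get_text())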
I'd be more than happy to help you out with this and any other web scraping questions. Also, by the way, it would be far easier to use pandas in the example given above, as it specifically parses table tags (using BeautifulSoup under the hood).
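For example (a sketch, assuming pandas and a parser such as lxml are installed), read_html can pull the same table into a DataFrame in a couple of lines:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
# read_html returns a list of DataFrames, one per matching table tag
tables = pd.read_html(url, attrs={'class': 'wikitable'})
print(tables[0].head())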