How to scrape data from a website using Python
Introduction
This tutorial will walk you through how to scrape data from a table on Wikipedia.
The page we will be scraping data from is List of countries and dependencies by population. You can visit the link to get a feel of how the page looks. The data we want lives in the large table of countries and their populations on that page.
Packages used
- csv - A module in Python's standard library for reading and writing data to a file object in CSV format.
- BeautifulSoup - A library for pulling data out of HTML and XML files.
- Requests - A library for making HTTP requests in Python.
Setup
It is assumed that you already have Python 3 installed on your machine; if not, follow here to install Python 3 on macOS.
Run the commands below to install the beautifulsoup4 and requests libraries:
pip install requests
pip install beautifulsoup4
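If you want to confirm that both installs worked, one quick check (assuming `python` points at your Python 3 interpreter) is to import the packages and print their versions:

python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"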
Scrape the data
Navigate to a directory of your choice on your machine and run the command below to create a new file named main.py:
touch main.py
In main.py, add the following code:
import csv
import requests
from bs4 import BeautifulSoup


def scrape_data(url):

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    table = soup.find_all('table')[1]

    rows = table.select('tbody > tr')

    header = [th.text.rstrip() for th in rows[0].find_all('th')]

    with open('output.csv', 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for row in rows[1:]:
            data = [td.text.rstrip() for td in row.find_all('td')]
            writer.writerow(data)


if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    scrape_data(url)
Let us break this down and see how it works line by line.
- Lines 1 - 3: Import all the packages needed to run the application.
- Line 6: We define the function `scrape_data`, which takes a `url` parameter.
- Line 8: We make a GET request to the `url` using the `get` method of the `requests` library. When making HTTP requests with the requests library, it is important to set a timeout in case the server does not respond in a timely manner; this prevents your program from hanging indefinitely while waiting for a response from the server. A sketch of fuller error handling is shown after this list.
- Line 9: We create a BeautifulSoup tree structure from the content of the server's response. This object is easy to navigate and search through.
- Line 11: We search through the BeautifulSoup object `soup` to find the second table in the document, which contains the data we want, using its `find_all` method. The `find_all` method searches the tree for all HTML tags that match the filter/search term. A more resilient way to locate the table is sketched after this list.
- Line 13: This line selects all the `tr` elements whose parent is a `tbody` element inside the `table`. `tr` elements represent the table rows.
- Line 15: The first row usually contains the header cells. We search through the first row in the rows list to get the text values of all the `th` elements in that row. We also remove trailing whitespace from the text using the `rstrip` string method; see the note after this list for an alternative.
- Lines 17 - 22: This opens a file and creates a new file object. The `w` mode opens the file for writing, and `newline=''` prevents the csv module from writing blank lines between rows on Windows. First we write the header row, then we loop through the remaining rows, extract the text of each `td` cell, and write that data to the file object.
- Lines 25 - 27: We check that the module is being run as the main program and call the `scrape_data` function with a specified `url` to scrape the data.
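As promised above, here is a minimal sketch of how you might guard the request on line 8 against failures. The exception classes come from the requests library; the `fetch` helper name and the bare `print` reporting are illustrative choices, not part of the original script:

import requests
from requests.exceptions import RequestException, Timeout

def fetch(url):
    # Hypothetical helper: returns the response, or None on any failure.
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx status codes
        return response
    except Timeout:
        print(f"Request to {url} timed out")
    except RequestException as exc:
        print(f"Request to {url} failed: {exc}")
    return None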
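Similarly, indexing into `find_all('table')[1]` breaks if editors add or remove a table above ours. A more resilient sketch matches on the table's CSS class instead; the `wikitable` class is an assumption about how Wikipedia currently marks up the population table, so verify it in your browser first:

from bs4 import BeautifulSoup

# A toy document standing in for the real page, just to show the lookup.
html = "<table class='wikitable'><tr><th>Country</th></tr></table>"
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first tag whose name and attributes match the filters.
table = soup.find('table', class_='wikitable')
print(table is not None)  # True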
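One last note on line 15: `rstrip` only trims whitespace on the right. BeautifulSoup tags also expose a `get_text` method whose `strip` argument trims both ends, which can be a tidier choice:

from bs4 import BeautifulSoup

cell = BeautifulSoup("<th>  Population \n</th>", 'html.parser').th
print(repr(cell.text.rstrip()))         # '  Population'
print(repr(cell.get_text(strip=True)))  # 'Population'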
On the terminal, run the command below to scrape the data:
python main.py
An output file named output.csv containing the data should be produced in the same folder.
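To sanity-check the output, you could read the first few rows back with the same csv module. This is just a quick verification sketch, not part of the tutorial script:

import csv

with open('output.csv', newline='') as csv_file:
    reader = csv.reader(csv_file)
    for i, row in enumerate(reader):
        print(row)
        if i == 4:  # header plus the first four data rows
            break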
Conclusion
Before you begin scraping data from any website, make sure to study its HTML markup to determine where the data you want lives.
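For example, a few lines of Python can list every table on a page along with its class attribute, which helps you pick the right index or class before writing any parsing code (a sketch reusing the same requests and BeautifulSoup calls as above):

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')

# Print each table's position and class so you can choose a stable selector.
for i, table in enumerate(soup.find_all('table')):
    print(i, table.get('class'))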
If you have any questions or comments, please add them to the comments section below.