
Python Web Scraping (requests_html, not Beautiful Soup)

Published Apr 19, 2020

Why scraping
In my quest to become a data scientist (being a driver), I have learned that data is the fuel that keeps the automobile (data science) on the highway. Just like crude oil, you have to mine it. At times you are not handed the data, but does that prevent our automobiles from running? If the answer is no, you score an A. You have to look for ways of getting what you need (can we call that basic reasoning?), and help came in handy. Thank you, web scraping. (It was a pleasure getting to work with you.)


Web Scraping!!??
Wondering what that is… yeah, worry not, I also had the same face. Actually, as the word scrape means, it's to remove something from a surface (thank you for partially agreeing, but this ain't a language lesson). When you add the word web, the default meaning changes: the automated process of copying specific data from the web, typically into a central local database, spreadsheet or CSV, for later retrieval or analysis.


Prerequisites:-
Just like in my campus days, lecturers echoed this word: to clearly understand (and pass; actually, they meant not get a supplementary) the concepts taught in CSC 325, you needed the skills from units CSC 215, CSC 115 and CSC 311. So let me borrow that from them: to navigate your way through this story you need a basic understanding of:-
1. HTML and CSS (psst, these are not programming languages)

2. Python (functions mainly)

3. The pandas library (for data crunching)


Tools
Just like a plumber needs a… eeee… okay, I meant a florist needs a… (jembe, that's a hoe)… what I mean is the items that will help you execute your tasks. We also need:-
1. An IDE (obviously, make sure you set up your Python environment)
2. requests (python3 -m pip install requests): this will help us make requests to an online site and retrieve data if the link exists
3. requests-html (python3 -m pip install requests-html): this will help us in our scraping actions. On checking for alternatives to this, Beautiful Soup came into play, but I did not find it fancier or more capable than requests-html. If I can score more goals with requests-html, why then use beautifulsoup4? I don't want to lose more hair figuring it out.
4. Site: Box Office Mojo: we are going to scrape data about movies. From this site we can clearly see that the 2019 Lion King ranked second to Avengers: Endgame.
[Screenshot: Box Office Mojo 2019 worldwide box office rankings]

I am not a very big movie enthusiast, so I can't give any further details on the two; I have not set an eye on either of them.

[Screenshot: the scraped output, showing you it's not all talk]

Getting data from a URL (using requests.get(URL_TO_WHERE_YOU_SCRAPE_DATA))
Fire up your IDE and create a file with your preferred name; I call mine scrapper.py.
I will create a function that takes in a URL and a file name (trying to achieve DRY). We intend to save the data to a file to validate that we are actually obtaining it from the site.
We will then call the function to execute our data fetch request.

import requests

def url_to_txt(url, file_name="files/world.txt"):
    # fetch the page
    r = requests.get(url)
    if r.status_code == 200:
        html_text = r.text
        # save a copy so we can validate what we fetched
        with open(file_name, 'w') as f:
            f.write(html_text)
        return html_text
    return ""

url = "https://www.boxofficemojo.com/year/world"
html_text = url_to_txt(url)
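
One small gotcha: open() will fail if the files/ directory does not already exist. A minimal sketch to create it from the script (assuming the ./files/ layout used above):

import os

# create the output directory if it is missing
os.makedirs("files", exist_ok=True)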

Extracting the data from the URL into a tabular format (using requests_html)
We have validated that our data is being saved. We now need to get the data objects (the actual text from the web page); in our case we are traversing table data.
First, we parse the HTML text of the data:

from requests_html import HTML

def convert_to_list_of_list(data_object):
    # parse the raw html text into a queryable HTML object
    html_data = HTML(html=data_object)

Step 2: we find the (CSS) class attribute of our element (for us it is the table class which contains the data we intend to scrape).

# get the table which contains the data
table_class = ".imdb-scroll-table"
r_table = html_data.find(table_class)

[Screenshot: the class attribute containing the data]

Next is getting the table elements containing our data. We get all the table rows and iterate over them, collecting each row's data into a list, as below.

parsed_table = r_table[0]
rows = parsed_table.find("tr")
# the header row
header_row = rows[0]
header_row_col = header_row.find("th")
# get the header names
header_name = [x.text for x in header_row_col]
# collect the data rows here
table_data = []
# don't iterate through the header row
for row in rows[1:]:
    cols = row.find("td")
    row_data = []
    # get each column's text by index
    for i, col in enumerate(cols):
        # print(i, col.text, "\n\n")
        row_data.append(col.text)
    table_data.append(row_data)
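
Putting the pieces together, this is how the whole convert_to_list_of_list function might look (a sketch; returning the headers alongside the rows is my own choice, not something dictated by requests_html):

from requests_html import HTML

def convert_to_list_of_list(data_object):
    # parse the raw html text into a queryable HTML object
    html_data = HTML(html=data_object)
    # find the table which contains the data
    r_table = html_data.find(".imdb-scroll-table")
    if not r_table:
        return [], []
    parsed_table = r_table[0]
    rows = parsed_table.find("tr")
    # header names come from the th cells of the first row
    header_name = [x.text for x in rows[0].find("th")]
    # one list of cell texts per data row
    table_data = [[col.text for col in row.find("td")] for row in rows[1:]]
    return header_name, table_data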


Putting the data into a CSV
Now we have our data. Printing your data, you should have something like this, a table-like format of the data:
[Screenshot: the header and first row of the data]
Now let's save our data into a CSV. Define a function that accepts the data, the table headers and the file name. We use pandas to create a DataFrame of our formatted data and then save it.


import pandas as pd

def put_data_to_csv(data, headers, file_name=None):
    # build a DataFrame from the rows and headers, then write it out
    df = pd.DataFrame(data, columns=headers)
    df.to_csv(f'files/{file_name}', index=False)
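
Tying the steps together, a usage sketch (the output name movies.csv is my own example, and it assumes convert_to_list_of_list returns the headers and rows as in the sketch above):

url = "https://www.boxofficemojo.com/year/world"
html_text = url_to_txt(url)
header_name, table_data = convert_to_list_of_list(html_text)
put_data_to_csv(table_data, header_name, file_name="movies.csv")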

That is the end of our journey. We have our data saved in a CSV. You could create models to save your data to a database, but our main intent was to get the data. What you do with it, I leave to your own discretion or creativity.
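
If you would rather push the scraped rows into a database than a CSV, a minimal sketch using pandas with SQLite (the file movies.db and table name movies are my own choices):

import sqlite3
import pandas as pd

# write the scraped rows into a local SQLite table
df = pd.DataFrame(table_data, columns=header_name)
conn = sqlite3.connect("files/movies.db")
df.to_sql("movies", conn, if_exists="replace", index=False)
conn.close()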
Conclusion
You can get the full updated code on my GitHub page: https://github.com/abedkiloo/data_scrapping
