
Simple way to get data from web page using python

Published Dec 15, 2018

Can you guess a simple way you can get data from a web page? It’s through a technique called web scraping.
In case you are not familiar with web scraping, here is an explanation:
“Web scraping is a computer software technique of extracting information from websites”
“Web scraping focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.”
Some websites make your life easier by offering an API, an interface you can use to download data directly. Sites like Rotten Tomatoes and Twitter provide APIs to access their data. But if a web page doesn’t provide an API, you can use Python to scrape data from it.

I will be using two Python modules for scraping data.

  • urllib
  • BeautifulSoup

So, are you ready to scrape a webpage? All you have to do to get started is follow the steps given below:

Understanding HTML Basics
Scraping is all about HTML tags, so you need to understand HTML in order to scrape data.
Here is an example of a minimal web page defined in HTML tags. The root tag is <html>, and inside it you have the <head> tag. The head includes the title of the page and may also hold other meta information such as keywords. The <body> tag includes the actual content of the page. <h1>, <h2>, <h3>, <h4>, <h5> and <h6> are the different heading levels.
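As a sketch, a minimal page with the structure described above might look like this (the title and heading text are placeholders, not taken from any real site):

```html
<html>
  <head>
    <!-- title and meta information live in the head -->
    <title>My Page</title>
    <meta name="keywords" content="example, minimal"/>
  </head>
  <body>
    <!-- the visible content of the page lives in the body -->
    <h1>Top-level heading</h1>
    <h2>Second-level heading</h2>
    <p>Some paragraph text.</p>
  </body>
</html>
```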
I encourage you to inspect a web page and view its source code to understand more about html.

Scraping A Web Page Using Beautiful Soup

I will be scraping data from bigdataexaminer.com. I am importing urllib2 (urllib.request in Python 3), BeautifulSoup (bs4), pandas and NumPy.

What beautiful = urllib2.urlopen(url).read() does is go to bigdataexaminer.com and download the whole HTML text, which I then store in a variable called beautiful.

Now I have to parse and clean the HTML code. BeautifulSoup is a really useful Python module for parsing HTML and XML files. Parsing the text gives a BeautifulSoup object, which represents the document as a nested data structure.
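The fetch-and-parse step can be sketched as follows. Note that urllib2 is Python 2 only; in Python 3 the same call lives in urllib.request (shown in a comment). To keep the sketch runnable without network access, an inline sample string stands in for the downloaded page, since the live site’s markup may have changed:

```python
from bs4 import BeautifulSoup

# In Python 3, urllib2 became urllib.request. The original fetch would be:
# from urllib.request import urlopen
# beautiful = urlopen("http://bigdataexaminer.com").read()

# Inline sample standing in for the downloaded HTML text:
beautiful = "<html><head><title></title></head><body><h1>Big Data Examiner</h1></body></html>"

# Parse the raw HTML into a nested BeautifulSoup tree
soup = BeautifulSoup(beautiful, "html.parser")
print(soup.h1.string)  # -> Big Data Examiner
```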

Prettify

You can use the prettify() function to show the different nesting levels of the HTML code.
The simplest way to navigate the parse tree is to name the tag you want. If you want the <h1> tag, just say soup.h1.prettify():
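A small sketch of prettify(), again on an inline sample document rather than the live site:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Example</title></head><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() renders the tree one tag per line, indented by nesting level
print(soup.prettify())

# Naming a tag navigates straight to it; prettify() then shows just that subtree
print(soup.h1.prettify())
```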

Contents

soup.tag.contents will return contents of a tag as a list.
In[18] : soup.head.contents
The following expression will return the title tag present inside the head tag.
In[45] : x = soup.head.title
Out [45]: <title></title>
.string will return the string present inside the title tag of Big Data Examiner. Because the title tag on bigdataexaminer.com is empty, the value returned is None.
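These steps can be reproduced on an inline sample page with an empty <title>, mirroring what the article found on the live site:

```python
from bs4 import BeautifulSoup

# Sample head with an empty <title>, standing in for bigdataexaminer.com
html = "<html><head><title></title><meta charset='utf-8'/></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.head.contents)  # children of <head> as a list

x = soup.head.title        # the <title> tag inside <head>
print(x)                   # <title></title>
print(x.string)            # None, because the tag has no text inside it
```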

Descendants
Descendants lets you iterate over all of a tag’s children, recursively.
You can also look at the strings using the .strings generator.
In[56]: soup.get_text()
get_text() extracts all the text from bigdataexaminer.com.
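A sketch of .descendants, .strings, and get_text() on a small inline document:

```python
from bs4 import BeautifulSoup

html = "<html><body><div><p>Big <b>data</b></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .descendants walks every child recursively, both tags and text nodes
for child in soup.body.descendants:
    print(child)

# .strings yields only the text nodes of the document
print(list(soup.strings))  # ['Big ', 'data']

# get_text() concatenates all text nodes into one string
print(soup.get_text())     # 'Big data'
```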

find_all

You can use find_all() to find all the ‘a’ tags on the page.
To get only the first four ‘a’ tags, you can use the limit argument.
To find particular text on a web page, you can use the text argument along with find_all. Here I am searching for the term ‘data’ on Big Data Examiner.

Get me the attributes of the second ‘a’ tag on Big Data Examiner.
You can also use a list comprehension to get the attributes of the first four ‘a’ tags on bigdataexaminer.com.
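The find_all() steps above can be sketched on inline sample links; the link text and URLs here are made up for illustration, not taken from the real site:

```python
from bs4 import BeautifulSoup

html = """<body>
<a href="/one">data science</a>
<a href="/two">big data</a>
<a href="/three">statistics</a>
<a href="/four">python</a>
<a href="/five">data mining</a>
</body>"""
soup = BeautifulSoup(html, "html.parser")

all_links  = soup.find_all("a")           # every <a> tag on the page
first_four = soup.find_all("a", limit=4)  # only the first four

# text= accepts a function; here it matches text nodes containing 'data'
data_text = soup.find_all(text=lambda t: "data" in t)

# Attribute of the second <a> tag
second_href = soup.find_all("a")[1].get("href")

# List comprehension over the first four <a> tags
hrefs = [a.get("href") for a in soup.find_all("a", limit=4)]

print(len(all_links), len(first_four))  # 5 4
print(second_href)                      # /two
print(hrefs)                            # ['/one', '/two', '/three', '/four']
```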

Conclusion

A data scientist should know how to scrape data from websites, and I hope you have found this article useful as an introduction to web scraping with Python. Apart from Beautiful Soup, there is another useful Python library called Pattern for web scraping. I also found a good tutorial on web scraping using Python.

Instead of taking the difficult path of building an in-house web scraping setup from scratch, you can always safely trust PromptCloud’s web scraping service to take end-to-end ownership of your project.

Web scraping is not all about “coding” per se; you also need to be adept at internet protocols, data warehousing, service requests, data cleansing, converting unstructured data to structured data, and even some machine learning nowadays.

Discover and read more posts from Pham Van