Web scraping for Machine Learning

Published Mar 10, 2020

Web scraping is a powerful technique for collecting large amounts of data from the internet. When it comes to machine learning, companies with quality data thrive in today's world.

Let's take a scenario. You set out to build the world's best restaurant review classification system. You collect all the reviews from several restaurants and use a fancy deep learning algorithm to do the classification. It turns out your classifier is not doing well out in public. What went wrong?

Well, machine learning is all about capturing the patterns in the data and generalizing them well enough that the model also performs well on unseen data. Given the situation you are in, you have two options. Try a GPU, incorporate the latest ML techniques, build an ensemble of many models, revisit feature engineering... or

Get more data. As trivial as it might sound, fetching more data generally lets an ML algorithm capture more of the patterns within the data and perform better on unseen data.

I am going to talk about a not-so-famous Python package that works really well and makes it quick to get started with scraping.

Goose3

Let's see the steps to install it and scrape an article:

$ pip3 install goose3

>>> from goose3 import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
'Occupy London loses eviction fight'
>>> article.cleaned_text[:150]
"(CNN) - Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi"
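Since the goal is more training data, you will usually want to run the same extraction over many URLs and save the results. Here is a minimal sketch of how that might look. The helper names (collect_articles, write_csv, scrape_with_goose) and the output file name articles.csv are my own assumptions for illustration, not part of goose3; the only goose3 pieces used are Goose(), extract(url=...), title and cleaned_text, plus the context-manager form, which I am assuming your installed goose3 version supports.

```python
import csv


def collect_articles(urls, extract):
    """Return one dict per URL whose article text could be extracted.

    `extract` is any callable that takes a URL and returns an object with
    `title` and `cleaned_text` attributes (for example a wrapper around
    Goose().extract). Passing it in keeps this helper easy to test.
    """
    rows = []
    for url in urls:
        try:
            article = extract(url)
        except Exception as exc:  # network errors, unparsable pages, etc.
            print(f"skipping {url}: {exc}")
            continue
        text = (article.cleaned_text or "").strip()
        if not text:
            # No article body was found on this page, so there is nothing to keep.
            continue
        rows.append({"url": url, "title": article.title, "text": text})
    return rows


def write_csv(rows, fileobj):
    """Write the collected rows as CSV with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=["url", "title", "text"])
    writer.writeheader()
    writer.writerows(rows)


def scrape_with_goose(urls, path="articles.csv"):
    """Hypothetical convenience wrapper: scrape `urls` with goose3 into a CSV."""
    # Imported here so the helpers above work even without goose3 installed.
    from goose3 import Goose

    # Assumption: this goose3 version supports use as a context manager.
    with Goose() as g:
        rows = collect_articles(urls, lambda u: g.extract(url=u))
    with open(path, "w", newline="", encoding="utf-8") as f:
        write_csv(rows, f)
    return len(rows)
```

Calling scrape_with_goose([url]) with the CNN URL from above should leave you with a one-row CSV that you can later load for labeling or training. Pages that fail to download or yield no text are skipped rather than stopping the whole run, which matters once the URL list gets long.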

It's that simple, and it is easy to work with. I hope this article gave some insight into how to scrape news articles in Python and why scraping is an important skill for machine learning!

Discover and read more posts from Manishanker Talusani