
The one where we build a web scraper and a slackbot - Part 1

Published Mar 30, 2020 · Last updated Sep 25, 2020

As software engineers, part of what we do revolves around making seemingly easy things a little bit easier. Who would imagine that doing these three things could become a chore?

  • Visit Brainyquote
  • Find and copy a random quote about excellence from the site.
  • Post the quote to a Slack channel.

It seems simple enough to do, but doing it every day for a year becomes boring and tedious.

Python is a scripting language built for things like this! With Python, we can automate the whole process and not have to do the same thing every day.

Well, that's exactly what we will be doing 🎉. In this two-part series, we will build a slackbot that periodically sends a random quote about excellence to a specified Slack channel. Our MVP features include:

  • Scraping tool: responsible for getting a whole lot of quotes and saving them to a JSON file for future use.
  • A Slack bot: responsible for periodically (maybe every morning?) sending one random quote to a Slack channel. This part of the project requires us to write some simple code for posting the message to a Slack channel at intervals.

Prerequisites

  • A Python environment and some basic knowledge of Python. That's it.

Part 1: The scraping tool

First off, we need to get some groundwork done: a basic project setup, a virtual environment, and a few packages.

- cd newly_created_folder
- mkdir scrapping-tool
- cd scrapping-tool
- touch __init__.py main.py scroll.py selenium_driver.py
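If the commands ran as expected, your scrapping-tool folder should now look roughly like this:

scrapping-tool/
├── __init__.py
├── main.py
├── scroll.py
└── selenium_driver.py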

At this point, we're good to go, but I strongly recommend you create a virtual environment for this project. If you have virtualenv installed on your PC, all you have to do is run the following commands:

- virtualenv --python=python3 venv
- source venv/bin/activate

If you don't, or you have questions about what a virtualenv is, you may want to read this.

Next, install the following third-party packages:

  • BeautifulSoup to help us scrape any website for data
  • selenium to automate the browser interactions, and lxml, the parser BeautifulSoup will use to process the page

Run the following command in your terminal:

pip3 install beautifulsoup4 selenium lxml

Finally, download ChromeDriver by following the basic instructions here. This lets us run a headless version of Chrome when using selenium for automation. If you're on a Mac, you can simply run:

brew cask install chromedriver
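To confirm the driver is available on your PATH, you can run the following (the exact version number you see will differ):

chromedriver --version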

Setup sometimes may endure for a night but code comes in the morning... UNKNOWN

Let's write some code!

In the scrapping-tool folder you created, locate the selenium_driver.py file and paste in the following code:

from selenium import webdriver

# Configure Chrome to run quietly in the background
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')

# Point Selenium at the chromedriver binary we installed earlier
driver = webdriver.Chrome("/usr/local/bin/chromedriver", chrome_options=options)

This piece of code imports webdriver from selenium and adds some configuration options, such as incognito and headless mode. Finally, we make use of the chromedriver we installed earlier by pointing to the path it was downloaded to, and we save the resulting driver in a variable for later use.

By adding the __init__.py file to our folder, we told Python to treat that folder as a package. This means the functions, variables, etc. defined in its modules can be imported from anywhere in our app 😎.

Part of the hassle that comes with browser automation shows up when human interaction is needed. The website we are trying to scrape has a couple of behaviours you will notice as soon as you open it:

  • On your first visit to the website, you have to click a button to accept the privacy policy.
  • After that, we see the page with all the quotes we would like to get, but this page implements an infinite scroll.

We wouldn't be doing much automation if we had to click that button for the browser, or scroll for it every time it reaches the bottom of the page. These problems bring us to our next step: scroll.py.

The key to scraping a website properly lies in your ability to hit Inspect and find the class or id that lets you access the element you need.
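As a rough sketch of that idea (using the b-qt and bq-aut class names this article relies on later, and a made-up snippet of HTML rather than the real page source), finding elements by class with BeautifulSoup looks like this:

from bs4 import BeautifulSoup

# Placeholder HTML for illustration; in the real scraper this would be driver.page_source
html = '<a class="b-qt qt_1">A placeholder quote</a><a class="bq-aut aut_1">A Placeholder Author</a>'

soup = BeautifulSoup(html, "lxml")
print(soup.find("a", class_="b-qt").get_text())   # -> A placeholder quote
print(soup.find("a", class_="bq-aut").get_text()) # -> A Placeholder Author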

In the file scroll.py, copy and paste the code below.

import time


def scroll(driver, timeout):
    scroll_pause_time = timeout

    # wait for the terms modal to pop up and then click it
    driver.implicitly_wait(timeout)
    privacy_button = driver.find_elements_by_css_selector(".qc-cmp-buttons > button:nth-child(2)")
    privacy_button[0].click()
    time.sleep(2)

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for the page to load
        time.sleep(scroll_pause_time)

        # Calculate the new scroll height and compare it with the last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # If the heights are the same we are at the bottom, so exit the function
            break
        last_height = new_height

A few things to note

  • We create a scroll function which takes two parameters: driver (our browser driver) and timeout (the wait time).
  • We make use of methods available on the driver object like find_elements_by_css_selector, which helps us locate elements; in our case, the privacy button and where to start our infinite scrolling.
  • We also make use of the execute_script method, which runs JavaScript against the browser's window object so we can scroll the website, determine the page height, etc. (see the small sketch after this list).
  • Notice the while loop? It scrolls, then calculates the new scroll height and compares it with the last scroll height. If both heights are the same, we break out of the loop, meaning we have reached the end of the page.
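To make execute_script a little more concrete, here is a minimal, throwaway sketch (assuming the driver from selenium_driver.py is already configured; example.com is just a stand-in URL):

from selenium_driver import driver

driver.get("https://example.com")
# execute_script runs a string of JavaScript in the page and returns the result to Python
height = driver.execute_script("return document.body.scrollHeight")
print(height)  # the current page height in pixels
driver.quit()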

Bringing it all together, we build the scraper itself.

In main.py, still within the scrapping-tool folder, add the following code:

import re
import json
from bs4 import BeautifulSoup
from selenium_driver import driver # here we import the driver we configured earlier
from scroll import scroll # the scroll method

def get_quotes(url):
    try:
        # implicitly_wait tells the driver how long to wait before throwing an exception
        driver.implicitly_wait(30)
        # driver.get(url) opens the page
        driver.get(url)
        # Start the scrolling by passing the driver and a timeout
        scroll(driver, 5)
        # Once scroll returns, BeautifulSoup parses the page_source
        soup = BeautifulSoup(driver.page_source, "lxml")
        # Then we close the driver, as soup is now storing the page source
        driver.close()

        # Compile regexes matching the classes of the quote and author links
        regex_quotes = re.compile('^b-qt')
        regex_authors = re.compile('^bq-aut')
        quotes_list = soup.find_all('a', attrs={'class': regex_quotes})
        authors_list = soup.find_all('a', attrs={'class': regex_authors})

        # Empty list to store the quotes
        quotes = []
        zipped_quotes = list(zip(quotes_list, authors_list))
        for i, x in enumerate(zipped_quotes):
            quote = x[0]
            author = x[1]
            quotes.append({
                "id": f"id-{i}",
                "quote": quote.get_text(),
                "author": author.get_text(),
                "author-link": author.get('href')
            })

        # Save the quotes to a JSON file
        with open("quotes.json", 'w') as json_file:
            json.dump(quotes, json_file)
    except Exception as e:
        print(e, '>>>>>>>>>>>>>>>Exception>>>>>>>>>>>>>>')


get_quotes('https://www.brainyquote.com/topics/excellence-quotes')

What do we have here?

  • We import the BeautifulSoup library and some built-in Python packages, re (regular expressions) and json.
  • We also import the modules we created earlier: scroll and driver.
  • We create a get_quotes function that takes a URL as a parameter.
  • With implicitly_wait, we tell our browser to wait a little before throwing an error (sometimes network issues slow things down).
  • We call the scroll function to do its thing.
  • Once that is done, we pass driver.page_source to BeautifulSoup; printing driver.page_source at this point would show a bunch of HTML tags.
  • We call close to stop browser interactions; we have all we need now.

The goal is to scrape a quote, its author, and a link to all of that author's quotes. At this point we have all of that data, albeit in a format we cannot work with yet (HTML tags). Notice from the code that we extract the authors and the quotes separately, so how do we link each quote to its author? We also need to build a Python dictionary for each quote, give it a unique id, and keep the author's link.

Python's zip function to the rescue: it takes two lists and produces a series of tuples containing elements from each list. We also use enumerate, which lets us unpack an index alongside each tuple returned by zip. With that, we loop over the zipped pairs, create a dictionary containing the data we want, and append it to the quotes list. We call BeautifulSoup's get_text() method on the author and quote tags to get the actual text out of the HTML, and get('href') to read any attribute of a tag we specify (in our case href, which is how we get the link to the author's quotes). Finally, we save the contents of our quotes list by creating a quotes.json file and dumping our data into it with json.dump.
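Here is a tiny, self-contained sketch of the zip + enumerate pattern, with made-up placeholder strings standing in for the tags we scrape:

quotes_list = ["Quote A", "Quote B"]     # stand-ins for the quote tags
authors_list = ["Author A", "Author B"]  # stand-ins for the author tags

# zip pairs each quote with its author; enumerate adds an index we can use as an id
for i, (quote, author) in enumerate(zip(quotes_list, authors_list)):
    print({"id": f"id-{i}", "quote": quote, "author": author})
# {'id': 'id-0', 'quote': 'Quote A', 'author': 'Author A'}
# {'id': 'id-1', 'quote': 'Quote B', 'author': 'Author B'}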

To run the scraper:

python scrapping-tool/main.py
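Once the script finishes, a quick way to sanity-check the output (assuming the run produced a quotes.json in your current directory) is:

import json

with open("quotes.json") as json_file:
    quotes = json.load(json_file)

print(len(quotes))  # how many quotes were scraped
print(quotes[0])    # the first entry: its id, quote, author and author-link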

To see all of this in action, you can comment out the line options.add_argument('--headless') in selenium_driver.py.

Yo! That's it for now. Feel free to leave feedback or opinions in the comments. In part two of this article, we will go through creating a Slack bot that posts these scraped quotes to a Slack channel. That also means configuring a Flask project so we can run a server and implement a scheduler!

To view the full code for this article, click here.
