How I learned scraping data from websites
About me
I am an enthusiastic web developer with 8+ years of experience in web development. I have been working with PHP since I started my career. I never get enough of it; it teaches me something new every day, which is how I came to learn scraping with Laravel Dusk and Selenium.
Why I wanted to learn scraping data from websites
I came across this method of scraping when one of my clients asked me to fetch data from a Magento admin panel without API or database access. All he provided was access to the Magento admin panel. Before learning about Selenium and Dusk, I had only parsed and scraped websites that don't have a complex user interface like the Magento dashboard.
How I approached learning scraping data from websites
To learn scraping, you'll need to learn one of the following two methods. The categories are my own, not an official classification.
- Other than Laravel (mostly core PHP applications)
- Using Laravel Dusk
1. Other than Laravel
I took this route first because, when I started, I didn't know that Laravel already provides a built-in library that automates tasks like testing and scraping. To get started, you'll need the following tools.
Selenium
Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.
Source: SeleniumHQ
Chromedriver
WebDriver is an open source tool for automated testing of webapps across many browsers. It provides capabilities for navigating to web pages, user input, JavaScript execution, and more. ChromeDriver is a standalone server which implements WebDriver's wire protocol for Chromium.
Source: Chrome driver
Facebook's PHP library for WebDriver
Well, there's not much to say about this one. The library speaks for itself; go grab it.
Source: PHP Webdriver
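As a rough sketch of how these three pieces fit together, the snippet below assumes ChromeDriver is already running on its default port 9515 and that the library was installed with Composer (today it is published as php-webdriver/webdriver, still under the Facebook\WebDriver namespace). The function name fetchPageTitle and the example URL are placeholders of mine, not part of the library.

```php
<?php
// Assumes `composer require php-webdriver/webdriver` has been run and a
// ChromeDriver process is listening on http://localhost:9515.
require __DIR__ . '/vendor/autoload.php';

use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

/**
 * Hypothetical helper: open a page in Chrome and return its <title>.
 */
function fetchPageTitle(string $url, string $serverUrl = 'http://localhost:9515'): string
{
    // Start a new Chrome session through the ChromeDriver server.
    $driver = RemoteWebDriver::create($serverUrl, DesiredCapabilities::chrome());

    try {
        $driver->get($url);          // navigate to the page
        return $driver->getTitle();  // read the document title
    } finally {
        $driver->quit();             // always close the browser session
    }
}

// Placeholder URL; replace it with the site you want to scrape.
echo fetchPageTitle('https://example.com'), PHP_EOL;
```

Closing the session in a finally block matters in practice: a script that dies halfway otherwise leaves stray Chrome processes behind on the server.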
2. Using Laravel Dusk
Well, you don't need to install anything manually. Laravel Dusk provides an expressive, easy-to-use browser automation and testing API. By default, Dusk does not require you to install JDK or Selenium on your machine. Instead, Dusk uses a standalone ChromeDriver installation. However, you are free to utilize any other Selenium compatible driver you wish.
Source: Laravel Dusk
Challenges I faced
You face a new challenge every time you start scraping a website. You don't know the nature of the website, what kind of elements it has, or how deep your data resides; you have to go through every page and its elements to find the right data. One challenge I faced initially was finding the right CSS selector to scrape data. For example, I didn't know how to select an element when automating a click event on a menu with hierarchical data. Later I learned that this can easily be done using Google Chrome's CSS selector. This article helped me get through it.
Key takeaways
What I enjoyed most in my first scraping assignment was downloading files from the server. The task was to export the Magento orders CSV. For the client that was a one-liner, but for me it broke down into the following tasks:
- Open the browser and go to the Magento admin panel.
- Enter the username and password in the login box and hit the submit button.
- Wait for the dashboard to load.
- Traverse to the orders page using the multilevel menu.
- Filter the data by date and submit.
- Wait for the data to be filtered and hit the export CSV button.
That's it!
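The steps above can be sketched as a Laravel Dusk test along the following lines. This is only an illustration under assumptions: the admin URL, every CSS selector, the button labels, the date format, and the MAGENTO_ADMIN_* environment variables are hypothetical and have to be replaced with what your own Magento admin panel actually uses.

```php
<?php

namespace Tests\Browser;

use Laravel\Dusk\Browser;
use Tests\DuskTestCase;

class ExportOrdersCsvTest extends DuskTestCase
{
    public function testExportOrdersCsv(): void
    {
        $this->browse(function (Browser $browser) {
            $browser
                // 1. Open the browser and go to the admin panel (placeholder URL).
                ->visit('https://shop.example.com/admin')
                // 2. Fill in the login box and submit (selectors and label are assumptions).
                ->type('#username', env('MAGENTO_ADMIN_USER'))
                ->type('#login', env('MAGENTO_ADMIN_PASSWORD'))
                ->press('Sign in')
                // 3. Wait up to 30 seconds for the dashboard to load.
                ->waitFor('.dashboard', 30)
                // 4. Hover the top-level menu, then click the nested orders entry.
                ->mouseover('#menu-sales')
                ->click('#menu-sales-orders a')
                ->waitFor('#orders-grid', 30)
                // 5. Filter the grid by date and submit the filter.
                ->type('#filter-date-from', '01/01/2024')
                ->press('Search')
                // 6. Wait for the loading overlay to disappear, then export.
                ->waitUntilMissing('.loading-mask', 30)
                ->press('Export');
        });
    }
}
```

Where the exported file ends up depends on Chrome's download settings, which this sketch does not configure.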
Tips and advice
Analyse everything before you start a scraping task. I struggled to install Selenium and get it working on a server where no GUI was available; it was a dark-as-hell Linux server. In my case, I jumped into the task using Selenium, and after 3-4 days, when things had started to go smoothly, I found out about Laravel Dusk. Laravel Dusk did in 5-10 minutes what had taken me around 3 days with Selenium. I am not criticising Selenium; it is good if you're using Java, Python, or some other technology. I am a PHP guy, and in my case I had the option to choose Laravel, which has a built-in automated testing tool named Dusk. So the point is: do your research if you're not familiar with PHP or Laravel. You may find the equivalent of Laravel Dusk in your own technology.
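For the no-GUI server problem in particular, one common approach is to start Chrome in headless mode. The sketch below shows how that might look with the PHP WebDriver library; the function names headlessChromeArguments and createHeadlessDriver are my own, and the ChromeDriver address is an assumption (its default port).

```php
<?php
// Load Composer's autoloader (vendor/autoload.php) before calling
// createHeadlessDriver(); the classes below come from php-webdriver.
use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

/**
 * Hypothetical helper: Chrome flags for running without a display.
 */
function headlessChromeArguments(int $width = 1920, int $height = 1080): array
{
    return [
        '--headless',                        // run Chrome without opening a window
        '--disable-gpu',                     // skip GPU initialisation on servers
        "--window-size={$width},{$height}",  // fixed viewport so layouts are predictable
    ];
}

/**
 * Hypothetical helper: start a headless Chrome session via ChromeDriver.
 */
function createHeadlessDriver(string $serverUrl = 'http://localhost:9515'): RemoteWebDriver
{
    $options = new ChromeOptions();
    $options->addArguments(headlessChromeArguments());

    // Attach the Chrome options to the capabilities sent to ChromeDriver.
    $capabilities = DesiredCapabilities::chrome();
    $capabilities->setCapability(ChromeOptions::CAPABILITY, $options);

    return RemoteWebDriver::create($serverUrl, $capabilities);
}
```

Setting a fixed window size is a deliberate choice here: without a real screen the viewport can otherwise be small enough that menus collapse and the selectors you found on your desktop no longer match.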
Final thoughts and next steps
With Laravel Dusk, the possibilities are endless. You can use it for test automation or for automating web application administration tasks.