
How to Handle Content Scraping in Laravel

Published Jan 10, 2020

In this article, we will discuss how to handle content scraping in Laravel using Guzzle HTTP and the Symfony DomCrawler component.

Prerequisites

  • Guzzle HTTP Package: Guzzle is a PHP HTTP client that makes it easy to send HTTP requests.
  • Symfony Dom Crawler: The DomCrawler component eases DOM navigation for HTML and XML documents.

I’m not covering the Laravel installation steps in this article; if you are not familiar with Laravel, please check our Laravel article here.

Install Guzzle HTTP Package

Use the following composer command to install the package.

composer require guzzlehttp/guzzle

After installing the package, you can use it as follows.

use GuzzleHttp\Exception\GuzzleException;
use GuzzleHttp\Client;

$client = new Client();
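For example, a minimal request with this client could look like the following (the URL is only a placeholder for illustration):

// Fetch a page and read the response body as a string
$response = $client->get('https://example.com');
$html = $response->getBody()->getContents();

echo $response->getStatusCode(); // e.g. 200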

We will cover Guzzle HTTP in more detail in a future post.

Install Symfony Dom Crawler Package

Use the following composer command to install the package.

composer require symfony/dom-crawler
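Note that filtering nodes with CSS selectors, as we do below, also requires the symfony/css-selector package (composer require symfony/css-selector).

As a quick sketch of how the crawler works, you can pass an HTML string to it and filter nodes with CSS selectors (the markup below is just a made-up example):

use Symfony\Component\DomCrawler\Crawler;

$html = '<div class="card--post"><h2>First Post</h2></div>';

$crawler = new Crawler($html);

// filter() takes a CSS selector; text() returns the node's text content
echo $crawler->filter('div.card--post h2')->text(); // First Post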

Create Controller Class

Use the following Artisan command to create a controller.

php artisan make:controller ContentCrawler

In the “ContentCrawler” controller, we will implement our content scraping logic.

Include the required packages after the namespace declaration.

use GuzzleHttp\Exception\GuzzleException;
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

Class Constructor and Guzzle Instance

Create a class constructor and instantiate the Guzzle client there. You can get more information on Guzzle HTTP here.

...
    private $client;

    public function __construct()
    {
        $this->client = new Client([
                'timeout' => 10,   // give up if a request takes longer than 10 seconds
                'verify' => false  // skip SSL certificate verification
            ]);
    }
...

Crawl Content

Check the following code snippet. In this controller class, we fetch the content using Guzzle HTTP and filter it using the DomCrawler.

<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use Symfony\Component\DomCrawler\Crawler;
use GuzzleHttp\Exception\GuzzleException;
use GuzzleHttp\Client;
use Exception;

class ContentCrawler extends Controller
{
    private $client;

    /**
     * Class __construct
     */
    public function __construct()
    {
        $this->client = new Client([
                'timeout' => 10,
                'verify' => false
            ]);
    }

    /**
     * Content Crawler
     */
    public function getCrawlerContent()
    {
        try {
            $response = $this->client->get('<URL>'); // URL, where you want to fetch the content

            // get content and pass to the crawler
            $content = $response->getBody()->getContents();

            $crawler = new Crawler( $content );
            
            $_this = $this;
            $data = $crawler->filter('div.card--post')
                            ->each(function (Crawler $node, $i) use($_this) {
                                return $_this->getNodeContent($node);
                            }
                        );
            dump($data);
            
        } catch ( Exception $e ) {
            echo $e->getMessage();
        }
    }

    /**
     * Check if the node has any matching content
     */
    private function hasContent($node)
    {
        return $node->count() > 0;
    }

    /**
     * Get node values.
     * The filter() calls take the CSS selectors (identifiers) we want to extract from the content.
     */
    private function getNodeContent($node)
    {
        $array = [
            'title' => $this->hasContent($node->filter('.post__content h2')) ? $node->filter('.post__content h2')->text() : '',
            'content' => $this->hasContent($node->filter('.post__content p')) ? $node->filter('.post__content p')->text() : '',
            'author' => $this->hasContent($node->filter('.author__content h4 a')) ? $node->filter('.author__content h4 a')->text() : '',
            'featured_image' => $this->hasContent($node->filter('.post__image a img')) ? $node->filter('.post__image a img')->eq(0)->attr('src') : ''
        ];

        return $array;
    }
}
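To try the controller out, you need a route pointing at the getCrawlerContent method. Here is a minimal sketch, assuming a web route at /crawl (the path is just an example), placed in routes/web.php:

// Visiting /crawl runs the crawler and dumps the scraped data
Route::get('/crawl', 'ContentCrawler@getCrawlerContent');

// On newer Laravel versions you can use the class-based syntax instead:
// Route::get('/crawl', [\App\Http\Controllers\ContentCrawler::class, 'getCrawlerContent']);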

I hope you liked this article on how to handle content scraping in Laravel. We will discuss more on this topic in future articles.

In our next post, we will discuss how to handle paginated data in content scraping. It’s easy to handle, and I hope this article helps you with your content scraping requirements. If you have any questions, please feel free to add them in the comments section 🙂

This post was originally published on my blog. Please check it out and help me improve my writing.

You may like:

Laravel Authorization with Gates — Part 1
Laravel Authorization Policies — Part 2
Brief Understanding on Laravel Observers
Laravel Scout with TNTSearch Driver
