
Scaling PhantomJS: Taking Thousands of Full Page Screenshots Every Day

Published Oct 13, 2017 · Last updated Apr 10, 2018

This article will show you how to use PhantomJS at scale to make multiple website screenshots as a RESTful service. I implemented this service in my own way, and there are many different ways to do this, but keep in mind that I am talking about a real-life example that serves 1,000+ customers a day. The use case I’m talking about is Custodee, a SaaS app based on web crawling, analytics, and screenshots.

Challenges of creating full page screenshots

After building and running Custodee (you can read about how I built and launched it on Product Hunt here), I went through the trouble of creating full-page website screenshots. While doing this, I was confronted with three problems:
  • Modern tools like Selenium with Chrome still have problems taking full-page screenshots without faking them: Selenium takes multiple screenshots while scrolling down and stitches them together into one image afterwards. This won’t work for most websites because of fixed HTML components.
  • Older technologies like PhantomJS, as well as newer ones, are all quite heavy on CPU and RAM. Optimizing the service to run at scale on small servers was important to me to keep server costs as low as possible.
  • PhantomJS is buggy and will crash randomly from time to time. This needs to be handled.

PhantomJS in my back-end architecture

Because I fell in love with Node.js, I experimented with a lot of different npm packages, but none of them did the job. Therefore, I decided to just use a PhantomJS wrapper and build the functionality on top of it.

The code in this article is a simplified version of my Custodee back-end, which runs on multiple servers and crawls thousands of websites per day; see the diagram of Custodee’s architecture below.

Notes

  • The website is on the front-end server with Node.js and AngularJS.
  • There can be many more back-end servers, depending on the traffic (hence the +n).
  • This is why there is an AWS ELB (load balancer): it routes the traffic to the back-end servers and shares the load between them.
  • The Node.js application on the front-end server has two purposes:
    • Importing and saving all the images as a REST API (the images are sent from the back-end servers)
    • Pushing premium users’ images to their Dropbox

(If you’d like to read more about this, you can refer to my previous post.)

For this post, I stripped down my current back-end implementation to just do full-page screenshots, so my examples won’t become too complex. You can get the whole project from my [GitHub](https://github.com/TonySchu/) and run it on your local machine. I included a simple front-end to use the service, but you can just do POST requests as well.

To run it on your local machine, just download it from the GitHub repository and follow the instructions.

After running the node application, you can test it on localhost:8089.

Overview

Because the process of creating a new PhantomJS instance uses a lot of CPU resources, I could not just start multiple browsers for each crawling process. After testing different approaches, I knew these aspects had to be addressed:
  • Reuse an existing PhantomJS instance for as long as possible, but close it before it crashes.
  • Because everything works asynchronously and PhantomJS can’t handle multiple operations at once, all functions have to be designed as chained async steps (create a browser tab, open a URL, extract the HTML, render the screenshot, etc.).
  • Limit the back-end to a maximum of four parallel instances; otherwise, the server will crash. (I tested this on small EC2 instances on Amazon Web Services. To run more than four instances, I would have to use a bigger server or scale the number of servers.)

The API

This is a simple API to post website links and a username to the application. On your local machine, the endpoint will be http://localhost:8089/api/phantom/:user. The last part of the URL (:user) will be used to create a folder on your machine to store the screenshots.

    // API to post an array of websites to PhantomJS
    app.post('/api/phantom/:user', function (req, res) {
        // user (also used as the folder name for the images)
        var user = req.params.user;
        // array of websites from the post body
        var websites = req.body.websites;
        if (typeof user != "undefined" && typeof websites != "undefined") {
            // object passed around between our functions
            var crawlStatus = {index: 0, max: websites.length, user: user};
            // returns true if the crawler could be started
            var runPhantomJs = Crawler.startCrawler(websites, crawlStatus);
            // response for the http request
            var crawlAnswer;

            if (runPhantomJs == true) {
                crawlAnswer = "Start to make Screenshots of: " + websites;
            } else {
                crawlAnswer = "PhantomJS is too busy. :( Please try later";
            }
            res.send(crawlAnswer);
        } else {
            res.send("You need to define a user and websites to crawl.");
        }
    });
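
To try the endpoint locally, you could post a JSON body like the one below. This is a hypothetical test call using Node 18+'s global fetch; the user "tony" and the URLs are just placeholders, and it assumes JSON body parsing is enabled in the Express app.

    // Hypothetical test call against the local API (placeholder user and URLs)
    fetch('http://localhost:8089/api/phantom/tony', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({
            websites: ['https://example.com', 'https://news.ycombinator.com']
        })
    })
        .then(function (res) { return res.text(); })
        .then(function (answer) { console.log(answer); });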


The Crawler

Here, we’ll start with the CrawlObject, which will be used to pass around data between the different processes. It contains the current iteration, the website URLs, the PhantomJS instance, and the configuration. Because PhantomJS is quite hungry for resources, I limited the application to run only four processes in parallel. You can change this in the global variable “maxInstances,” depending on the power of your machine or server. This should work well on a small EC2 instance on AWS.

// Requires
var phantom = require('phantom');
var fs = require('fs');
// global array of active PhantomJS instances
var phantomChildren = [];
var maxInstances = 4; //change this to run more PhantomJS instances in parallel
var maxIterations = 20; // max number of websites to run through a PhantomJS instance before creating a fresh one

// Object for crawling websites
function CrawlObject(index, websites, phantomInstance, crawlStatus) {
    return {
        index: index, // current index of the websites array
        websites: websites,
        processId: phantomInstance.process.pid, // process id of the child process
        crawlStatus: crawlStatus,
        phantomInstance: phantomInstance,
        resourceTimeout: 7000, // timeout to wait for phantom
        viewportSize: {width: 1024, height: 768}, // viewport of the PhantomJS browser
        format: {format: 'png', quality: '5'}, // format for the image
        timeOut: 5000 //Max time to wait for a website to load
    }
}

Create a PhantomJS instance

This function creates a new, fresh PhantomJS instance. Its process ID will be stored in a global array, so we can always check how many instances are currently running or use this ID to kill a buggy instance on our server. The newly created CrawlObject is now passed into the next function createWebsiteScreenshots. The first thing to check is the current index of the CrawlObject. This is important because the more websites you run through a PhantomJS instance, the more likely it is to randomly crash.

While optimizing Custodee’s back-end, I figured out that 20 iterations is a good limit to work with. After 20 iterations, I shut down the active instance and continue the crawling process with a fresh one, just to be safe. If you are asking yourself why I don’t simply use a fresh instance for every single screenshot: creating an instance is the main cause of the high CPU usage. Reusing an instance to render multiple websites is very important for keeping the server’s resource usage low and making the screenshot process faster.

// create browser instance
function initPhantom(websites, crawlStatus) {
    // only allow 4 instances at once
    if (checkPhantomStatus() == true) {
        phantom.create(['--ignore-ssl-errors=no', '--load-images=true'], {logLevel: 'error'})
            .then(function (instance) {
                console.log("===================> PhantomJs instance: ", instance.process.pid);
                // store the process id in an array
                phantomChildren.push(instance.process.pid);
                var crawlObject = new CrawlObject(0, websites, instance, crawlStatus);
                createWebsiteScreenshots(crawlObject);
                return true;
            }).catch(function (e) {
                console.log('Error in initPhantom', e);
            });
    }
    return checkPhantomStatus();
}
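
The checkPhantomStatus() helper used above is not shown in this snippet. A minimal sketch, assuming it simply compares the number of tracked process IDs against maxInstances, could look like this:

    // Sketch: allow a new instance only while fewer than maxInstances are tracked
    function checkPhantomStatus() {
        return phantomChildren.length < maxInstances;
    }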

Rendering website screenshots

Instead of just running a normal loop over this function, we always wait for one iteration to finish before calling the next one. This is necessary for the same reason we reuse a PhantomJS instance multiple times: running everything in parallel is just too heavy on the machine and will not work at scale. The way PhantomJS works, you can’t do operations in parallel, like taking screenshots, working with the HTML content, or even clicking and navigating on a website. (Yes, you can do inputs, clicks, and file uploads, but you need to wait until each step is completed before beginning the next one.)

After setting up the necessary properties and configurations, we can call page.open, which will create a browser tab with the specific URL we want to crawl. Now, we can do operations like getting the HTML content or creating full-page screenshots. The PNG file will be stored under /public/images/username, and can be directly called from the server, like on the application’s index.html.

// create a tab and make a screenshot
function createWebsiteScreenshots(crawl) {
    var website = crawl.websites[crawl.index];
    var user_folder = 'public/images/' + crawl.crawlStatus.user;
    var checkIterations = crawl.index >= maxIterations;
    var page;
    
    // if a PhantomJS instance is running for too long, it tends to crash sometimes
    // so start a fresh one
    if (checkIterations) {
        crawl.phantomInstance.exit();
        return restartPhantom(crawl);
    }

    crawl.phantomInstance.createPage()
        //open page in a tab
        .then(function (tab) {
            page = tab;
            page.property('viewportSize', crawl.viewportSize);
            page.setting("resourceTimeout", crawl.resourceTimeout);
            return page.open(website);
        })
        // get HTML content if you want to work with it
        .then(function () {
            // use a delay to make sure page is rendered properly
            return delay(crawl.timeOut).then(function () {
                return page.property('content');
            })
        })
        //render website to png file
        .then(function (content) {
            console.log("render %s / %s", crawl.index + 1, crawl.websites.length, "processId:", crawl.processId);
            var image = user_folder + "/" + new Date().toString() + "." + crawl.format.format;
            return page.render(image, crawl.format);
        })
        // close tab and continue with loop
        .then(function () {
            page.close();
            continuePhantomLoop(crawl);
        })
        .catch(function (e) {
            restartPhantom(crawl, e);
        });
}

Any kind of error will be caught and handled by starting a fresh PhantomJS instance. This ensures that the process will not fail on a specific website crawl. Keep in mind that I also wrote some helper functions like continuePhantomLoop(). These are not provided by the phantom npm package, so feel free to check the whole code on GitHub.
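
For reference, here is a rough sketch of what delay(), continuePhantomLoop(), and restartPhantom() could look like. The actual implementations are in the GitHub repo, so treat the details below as assumptions rather than the exact code:

    // Promise-based wait, used to give pages time to finish rendering
    function delay(ms) {
        return new Promise(function (resolve) {
            setTimeout(resolve, ms);
        });
    }

    // Move on to the next website, or shut the instance down when the list is done
    function continuePhantomLoop(crawl) {
        crawl.index++;
        if (crawl.index >= crawl.websites.length) {
            crawl.phantomInstance.exit();
            phantomChildren = phantomChildren.filter(function (pid) {
                return pid !== crawl.processId;
            });
            return;
        }
        createWebsiteScreenshots(crawl);
    }

    // Kill the current instance and continue the remaining websites with a fresh one
    function restartPhantom(crawl, error) {
        if (error) console.log('Restarting PhantomJS because of:', error);
        try {
            crawl.phantomInstance.exit();
        } catch (e) {
            // the instance may already have been exited at the iteration limit
        }
        phantomChildren = phantomChildren.filter(function (pid) {
            return pid !== crawl.processId;
        });
        initPhantom(crawl.websites.slice(crawl.index), crawl.crawlStatus);
    }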

How to scale PhantomJS

As already mentioned, the application is limited by its configuration to run only a fixed number of instances in parallel. The way I scale this service to crawl thousands of websites per day is to run it on multiple small servers in parallel. The API calls go through a load balancer, which directs and balances the requests across the servers, depending on the current load and number of users. I’ve also automated the servers to boot and shut down depending on the number of requests. In production, the images and data are then pushed to another server, which can be called from the front-end. Images are also pushed to premium users’ Dropbox accounts.

I tried to keep this article short to explain my implementation of this service in a simple and understandable way. However, if you have any questions, feel free to contact me on http://tonys.io or Twitter @TonySchumaker. 😃

I am also happy to answer any questions in the comments below. 👇

Comments
Risto Novik
7 years ago

Hello Tony, if it’s not a secret, what EC2 instances are you running? Some of the websites are really huge with many pictures and could easily take 1.5 GB of memory.

Many sites have implemented lazy loading of content; how are you handling this?

Tony Schumacher
7 years ago

Hi Risto,

I am using small EC2 instances. This is the absolute minimum; micro instances will sometimes work, but in most cases they will crash due to missing RAM. In 95% of cases, 2 GB of RAM is enough to handle my use cases.

You can work around lazy loading by automatically scrolling down the page and waiting for the content to load. You can also inject custom JavaScript into the site for this.
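
A hedged sketch of that scrolling idea, built on the phantom wrapper’s page.evaluate() and the delay() helper from the article (not the exact code Custodee runs):

    // Sketch: scroll the PhantomJS page step by step so lazy-loaded content gets requested
    function scrollToBottom(page, step, pause) {
        return page.evaluate(function () {
            return document.body.scrollHeight;
        }).then(function (height) {
            var positions = [];
            for (var y = step; y <= height; y += step) {
                positions.push(y);
            }
            // scroll down one step at a time and wait after each jump
            return positions.reduce(function (chain, y) {
                return chain.then(function () {
                    return page.evaluate(function (scrollY) {
                        window.scrollTo(0, scrollY);
                    }, y);
                }).then(function () {
                    return delay(pause);
                });
            }, Promise.resolve());
        });
    }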

Andrej Gajdos
7 years ago

Thanks for this article. Did you try Puppeteer?
