Crawling an entire Domain / Website

This post is about crawling an entire domain in Node.js. You can find the first posts of the series here: Web Scraping / Web Crawling Pages with Node.js.

For testing purposes I have created a simple set of HTML pages that should resemble a generic website. It has a handful of pages, and we want our crawler to go through them and find all of them, wherever they're linked. That means when our crawler hits a page, it should keep track of the links it finds and then only proceed to pages it has not crawled yet.

In this example we're just going to use plain arrays. If you have a lot of data, or if you want to shard the process across multiple workers, you probably want to go for a message queue with Redis or ActiveMQ or the like.
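To make that swap easier later, the two arrays can be hidden behind a tiny queue interface with the same push/pop shape a Redis-backed queue (RPUSH/LPOP) would have. This is just an illustrative in-memory sketch; the function names are not from any particular library:

```javascript
// A minimal in-memory queue of links still to crawl. Swapping in Redis
// or another broker later only means replacing these two functions.
var pending = [];

function enqueue(link) {
  // avoid queueing the same link twice
  if (pending.indexOf(link) === -1) {
    pending.push(link);
  }
}

function dequeue() {
  // returns undefined when the queue is empty, i.e. the crawl is done
  return pending.shift();
}
```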

To get the example below running on your machine, just install the following packages first: npm install lodash async cheerio request.

// core modules
var url = require('url');

// third party modules
var _ = require('lodash');
var async = require('async');
var cheerio = require('cheerio');
var request = require('request');

var base = 'projects.jonathanmh.com';
var firstLink = 'http://' + base + '/crawling-site';

var crawled = [];
var inboundLinks = [];

var makeRequest = function(crawlUrl, callback){
  var startTime = new Date().getTime();
  request(crawlUrl, function (error, response, body) {
    // on a failed request there is no body to parse; return an empty
    // page object so the crawl can continue with the remaining links
    if (error || !body) {
      return callback(null, { url: crawlUrl, title: null, requestTime: null, links: [] });
    }

    var pageObject = {};
    pageObject.links = [];

    var endTime = new Date().getTime();
    var requestTime = endTime - startTime;
    pageObject.requestTime = requestTime;

    var $ = cheerio.load(body);
    pageObject.title = $('title').text();
    pageObject.url = crawlUrl;
    $('a').each(function(i, elem){
      /*
       insert some further checks if a link is:
       * valid
       * relative or absolute
       * check out the url module of node: https://nodejs.org/dist/latest-v5.x/docs/api/url.html
      */
      pageObject.links.push({linkText: $(elem).text(), linkUrl: elem.attribs.href});
    });
    callback(null, pageObject);
  });
};

var myLoop = function(link){
  makeRequest(link, function(error, pageObject){
    console.log(pageObject);
    crawled.push(pageObject.url);
    async.eachSeries(pageObject.links, function(item, cb){
      var parsedUrl = url.parse(item.linkUrl);
      // test if the url actually points to the same domain
      if(parsedUrl.hostname == base){
        /*
         insert some further link error checking here
        */
        inboundLinks.push(item.linkUrl);
      }
      cb();
    }
    ,function(){
      var nextLink = _.difference(_.uniq(inboundLinks), crawled);
      if(nextLink.length > 0){
        myLoop(nextLink[0]);
      }
      else {
        console.log('done!');
      }
    });
  });
}

myLoop(firstLink);

This little crawler will go through all links on the page. For real-world usage, a couple of things are still missing, like:

  • check the content type if it really is “text/html”
  • check if the destination points to something with a 200 status code
  • lots of error handling (you would not believe what people run as a website)
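The first two checks can be sketched as a small predicate over the response object that request hands to its callback (this uses only `statusCode` and the `content-type` header, both of which request exposes on the response):

```javascript
// Decide whether a response is worth parsing as HTML:
// the status must be 200 and the Content-Type must be text/html.
function shouldParse(response) {
  if (!response || response.statusCode !== 200) {
    return false;
  }
  var contentType = response.headers['content-type'] || '';
  // headers like "text/html; charset=utf-8" should still pass
  return contentType.indexOf('text/html') !== -1;
}
```

Inside `makeRequest`, a guard like `if (!shouldParse(response)) { ... }` before `cheerio.load(body)` would keep the crawler from trying to parse images, PDFs, or error pages.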

The data we're gathering on each page is really up to you: anything cheerio can find in the DOM. On top of that, I'm saving the time from sending the request until the response is received.

So for each page:

  • URL
  • title from <title>
  • requestTime in milliseconds
  • links as {linkText: (from within the <a> tag), linkUrl: (from href="")}

For now though, this example will do and should show the following output:

{ links:
   [ { linkText: 'about it',
       linkUrl: 'http://projects.jonathanmh.com/crawling-site/about.html' },
     { linkText: 'all the posts',
       linkUrl: 'http://projects.jonathanmh.com/crawling-site/posts.html' } ],
  requestTime: 278,
  title: 'Web Crawling Example Site | Home',
  url: 'http://projects.jonathanmh.com/crawling-site' }
{ links:
   [ { linkText: 'HOME',
       linkUrl: 'http://projects.jonathanmh.com/crawling-site/about.html' },
     { linkText: 'all the posts',
       linkUrl: 'http://projects.jonathanmh.com/crawling-site/posts.html' } ],
  requestTime: 119,
  title: 'About',
  url: 'http://projects.jonathanmh.com/crawling-site/about.html' }
{ links:
   [ { linkText: 'Post #1',
       linkUrl: 'http://projects.jonathanmh.com/crawling-site/post1.html' },
     { linkText: 'Post #2',
       linkUrl: 'http://projects.jonathanmh.com/crawling-site/post2.html' } ],
  requestTime: 119,
  title: 'Posts List',
  url: 'http://projects.jonathanmh.com/crawling-site/posts.html' }
{ links: [],
  requestTime: 119,
  title: 'Post #1 about Web Crawling',
  url: 'http://projects.jonathanmh.com/crawling-site/post1.html' }
{ links:
   [ { linkText: 'the first post',
       linkUrl: 'http://projects.jonathanmh.com/crawling-site/post1.html' } ],
  requestTime: 121,
  title: 'Post #2 about Web Crawling',
  url: 'http://projects.jonathanmh.com/crawling-site/post2.html' }
done!

Finally, when the difference between the inboundLinks and crawled arrays is an empty array, the crawler stops and notifies us that it's done!
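That stopping condition is easy to verify in isolation. Here is a plain-JS equivalent of `_.difference(_.uniq(inboundLinks), crawled)`, written without lodash just to show what it computes:

```javascript
// The links we have seen somewhere on the site but not yet visited.
// When this comes back empty, the crawl is finished.
function nextLinks(inboundLinks, crawled) {
  var unique = inboundLinks.filter(function(link, i) {
    return inboundLinks.indexOf(link) === i; // keep first occurrence only
  });
  return unique.filter(function(link) {
    return crawled.indexOf(link) === -1; // drop everything already crawled
  });
}
```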

I hope you've enjoyed our little excursion into web crawling land and that you'll build something amazing with it! Feel free to leave a comment, ask any questions, or write me on Twitter!
