Web Scraping / Web Crawling Pages with Node.js

This post series is going to discuss and illustrate how to write a web crawler in Node.js. I'm going to keep the posts on the core topic database agnostic and split the database-specific parts into separate posts for the different databases you could imagine using.

  1. Web Scraping / Web Crawling Pages with Node.js
  2. Building your page object and extracting content
  3. Crawl an entire Domain
  4. Saving the crawled page

Node.js is great for a lot of things. Most importantly, it's JavaScript! For a personal project I played around with web crawling, which is not much different from what Google does with most of the pages in its search index.

Disclaimer: If you need to do this on a massive scale, it's probably a good idea to closely investigate the performance of different languages and frameworks, because I'm quite sure this is not the fastest way to do it.

Initially I had the fun idea of developing this post independently of the database you want to use, but then I decided to write multiple posts and split them up by database, no matter if it's MongoDB, RethinkDB or MySQL. This way I force myself to keep a consistent API across the models.

Crawling a single page with Node.js

Crawling a page and saving its contents to a database or a file is a very simple thing in Node. To make it a little easier for us, there are two modules in particular that we can rely on:

  • request (this one's great for fetching pages)
  • cheerio (this one's great for getting stuff out of elements on the page, just like with jQuery)

To get a page into a variable, you just need to pass the URL as a parameter to the request module. If you want to crawl an entire domain, though, you'll also want to collect all the links on the page separately.
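As a minimal sketch (assuming you already have the request module installed), getting the raw HTML of a page into a variable looks roughly like this:

var request = require('request');

// Fetch the page; the raw HTML ends up in the body argument
request('http://wikipedia.org', function (error, response, body) {
  if (!error && response.statusCode === 200) {
    var html = body; // the full page as a string, ready to be parsed
    console.log('Fetched ' + html.length + ' characters');
  }
});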

Let's have a look at how we can crawl a simple HTML page and store the contents of different elements into some variables.

  1. create a directory
  2. run npm install request cheerio
  3. create a file with the content below
  4. run node get-page.js
get-page.js
var request = require('request');
var cheerio = require('cheerio');

var url = "http://wikipedia.org";

// Fetch the page; request hands us the response body as a string
request(url, function (error, response, body) {
  if (!error) {
    // Load the HTML into cheerio so we can query it like we would with jQuery
    var $ = cheerio.load(body);

    var title = $('title').text();
    // The full text of the page body -- we'll extract more structured content in part 2
    var content = $('body').text();
    // The article count shown next to the English language link on wikipedia.org
    var freeArticles = $('.central-featured-lang.lang1 a small').text();

    console.log('URL: ' + url);
    console.log('Title: ' + title);
    console.log('EN articles: ' + freeArticles);
  }
  else {
    console.log("We've encountered an error: " + error);
  }
});

You should see the following output:

% node get-page.js
URL: http://wikipedia.org
Title: Wikipedia
EN articles: 5 002 000+ articles

That means you've successfully extracted part of the page, more specifically the <title> element's content. With cheerio you can not only select an element, but also get its text content through the .text() function.

The request module simply fetches the page for us, and we load the response body as a pseudo website with var $ = cheerio.load(body).
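Since part 3 of this series will crawl an entire domain, here's a minimal sketch of how you could collect the links on a page with the same two modules. The example just gathers href values into an array and prints a count; that's an assumption for illustration, and we'll build the real thing later in the series.

var request = require('request');
var cheerio = require('cheerio');

var url = "http://wikipedia.org";

// Fetch the page and collect every href found in an <a> element
request(url, function (error, response, body) {
  if (error) {
    return console.log("We've encountered an error: " + error);
  }

  var $ = cheerio.load(body);
  var links = [];

  $('a').each(function () {
    var href = $(this).attr('href');
    // Some anchors have no href attribute, so skip those
    if (href) {
      links.push(href);
    }
  });

  console.log('Found ' + links.length + ' links on ' + url);
});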

Summary

We've now had a look at how to load a page and parse it into cheerio to get the contents of different elements on the page.

In the next part we'll have a look at what information we should extract from a page, if we wanted to build some kind of analytics or niche search engine tool.

Tagged with: #cheerio #node.js #request #web crawling

Thank you for reading! If you have any comments, additions or questions, please tweet or toot them at me!