Web Scraping / Web Crawling Pages with Node.js

This post series is going to discuss and illustrate how to write a web crawler in Node.js. The core posts are database agnostic, and the database part is split up into separate posts for the respective databases you might want to use.

  1. Web Scraping / Web Crawling Pages with Node.js
  2. Building your page object and extracting content
  3. Crawl an entire Domain
  4. Saving the crawled page
    1. RethinkDB
    2. MongoDB
    3. MariaDB/MySQL
  5. Creating a work queue, crawl next page

Node.js is great for a lot of things. Most importantly, it’s JavaScript! For a personal project I played around with web crawling, which is not much different from what Google does with most of the pages in its search index.

Disclaimer: If you need to do this on a massive scale, it’s probably a good idea to closely investigate the performance of different languages and frameworks, because I’m quite sure this is not the fastest way to do it.

Initially I had the fun idea of developing this post independently of the database you want to use, but then I decided to write multiple posts instead and split them up by database, no matter whether it’s MongoDB, RethinkDB or MySQL. This way I force myself to keep a consistent API across the models.

Crawling a single page with Node.js

Crawling a page and saving its contents to a database or a file is very simple in Node. To make it a little easier for us, there are two modules in particular that we can rely on:

  • request (this one’s great for fetching pages)
  • cheerio (this one’s great for extracting content from elements on the page, with a jQuery-like API)

To get a page into a variable, you just pass the URL as a parameter to the request module. But if you want to crawl an entire domain, you’ll probably also want to collect all the links on the page separately.

Let’s have a look at how we can crawl a simple HTML page and store the contents of different elements into some variables.

  1. create a directory
  2. run npm install request cheerio
  3. create a file with the content below
  4. run node get-page.js

get-page.js

var request = require('request');
var cheerio = require('cheerio');

var url = "http://wikipedia.org";

request(url, function (error, response, body) {
  if (!error) {
    // Parse the response body into a jQuery-like object
    var $ = cheerio.load(body);

    var title = $('title').text();
    var content = $('body').text(); // full page text, not printed below
    // Wikipedia’s English article counter, selected via its CSS classes
    var freeArticles = $('.central-featured-lang.lang1 a small').text();

    console.log('URL: ' + url);
    console.log('Title: ' + title);
    console.log('EN articles: ' + freeArticles);
  }
  else {
    console.log('We’ve encountered an error: ' + error);
  }
});

You should see the following output:

% node get-page.js                                                                   
URL: http://wikipedia.org
Title: Wikipedia
EN articles: 5 002 000+ articles

That means you’ve successfully extracted part of the page, more specifically the content of the <title> element. With cheerio you can not only select an element, but also read its text content through the .text() function.

The request module simply fetches the page for us, and we load the response into a pseudo DOM with var $ = cheerio.load(body).

Summary

We’ve now had a look at how to load a page and parse it into cheerio to get the contents of different elements on the page.

In the next part we’ll have a look at what information we should extract from a page if we wanted to build some kind of analytics tool or niche search engine.

3 thoughts on “Web Scraping / Web Crawling Pages with Node.js”

  1. Hi, nice post.

    One thing though, on your disclaimer you said `this is not the fastest way to do it.` Why is that? Are there any better alternatives?

    As far as I know, regarding scraping, you either use curl or a headless browser. Curl, in general, is always faster, all the more so when coupled with an event-driven language like JavaScript. Thank you.

    1. Hi! I’m happy you’re asking. V8 and Node.js are pretty fast, but not THE fastest in the world. So instead of building a virtual DOM, you could write a custom parser that processes the data faster. That is my assumption anyway: you process the same amount of data, but you might have less overhead, because your parser does not have to know what the whole document looks like.
