Web Crawling with Node.js #2: Building the Page Object

Welcome to part 2 of the series on crawling the web with Node.js. In this article we're going to have a look at what valuable content we can grab from a page. Links are especially important when writing a crawler, because our crawler wouldn't know where to go next without them.

The data I'm going to extract from a page is not necessarily what you'll want; it all depends on what you want to do with the project. Maybe you only want the content of specific tags or the status codes. I'll just put up some examples, and you can see from there what's possible and what would make sense for your purpose.

Extracting all the links from a page is valuable for two things:

  • knowing where a page/document links to
  • finding out which page to crawl next

When crawling an entire domain of pages, you probably want to see which other documents your current document links to. The essential lines for that are listed below; if you have request and cheerio installed in the same folder, you can give it a shot:

var request = require('request');
var cheerio = require('cheerio');

var url = "http://jonathanmh.com";

request(url, function (error, response, body) {
  if (error) { return console.error(error); }
  var links = [];
  var $ = cheerio.load(body);
  $('a').each(function(i, elem){
    // log the link text and the href attribute of each anchor
    console.log($(elem).text(), elem.attribs.href);
    // keep the text and URL instead of the raw cheerio element
    links.push({ text: $(elem).text(), href: elem.attribs.href });
  });
});

This will log the link text and URLs like the following excerpt:

Bootstrap 4 Grid only and SASS with Gulp http://jonathanmh.com/bootstrap-4-grid-only-and-sass-with-gulp/
Bootstrap http://getbootstrap.com
Continue reading Bootstrap 4 Grid only and SASS with Gulp http://jonathanmh.com/bootstrap-4-grid-only-and-sass-with-gulp/#more-1913
December 26, 2015December 26, 2015 http://jonathanmh.com/bootstrap-4-grid-only-and-sass-with-gulp/
code http://jonathanmh.com/category/blog/code/
git http://jonathanmh.com/tag/git/
Bootstrap http://jonathanmh.com/tag/bootstrap/
Gulp http://jonathanmh.com/tag/gulp/
SASS http://jonathanmh.com/tag/sass/

Note: some links might not have any text, since images are not converted to text by cheerio ;) We could surely figure out a smarter way, e.g. use the image name, title or alt attribute for that, maybe even as a separate field in our data.
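
Just as an illustration (this fallback is an assumption about how you might handle it, not something the snippet above does), you could reuse the $ from the earlier code and fall back to an image's alt or title attribute when the anchor text is empty:

// hypothetical fallback: use the img alt or title attribute when
// the anchor itself has no text
$('a').each(function(i, elem){
  var text = $(elem).text().trim() ||
    $(elem).find('img').attr('alt') ||
    $(elem).find('img').attr('title') ||
    '';
  console.log(text, elem.attribs.href);
});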

Now that you have the links, you can either save them as separate objects/rows in a table or together with the source object. I would recommend creating an object that resembles the following, since it lets you keep track of (at least the first) source of the link. For now, we'll keep track of whether we have already crawled that page with a boolean; if you expect pages to change at some point, you might want to switch to a timestamp instead.

To generate an ID, you can either rely on your database to create unique indexes for you, like MySQL/MariaDB with auto-incremented IDs or MongoDB's/RethinkDB's _id field, or you can pick a module like hat.

{
  "fromUrl": "http://jonathanmh.com/",
  "id": "0027bf05-e518-404e-a2fa-edf6235cc677",
  "url": "http://jonathanmh.com/tag/gulp",
  "visited": false,
  "external": false
}
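
As a minimal sketch, building such objects with the hat module could look like the following; the makeLinkObject helper is just an illustration, and note that hat produces a random hex string rather than the dashed UUID shown above:

var hat = require('hat');

// hypothetical helper that wraps a captured href into the format above
function makeLinkObject(fromUrl, href, isExternal) {
  return {
    fromUrl: fromUrl,
    id: hat(),
    url: href,
    visited: false,
    external: isExternal
  };
}

console.log(makeLinkObject('http://jonathanmh.com/', 'http://jonathanmh.com/tag/gulp', false));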

We should test whether a link is outgoing/external or pointing to a page on the same domain.

Firstly, to crawl all pages on a domain, we need to test each link in the array we just captured with cheerio to see whether it is relative or targets a page on the same domain.

UPDATE: I've received some great comments on reddit to improve the next paragraph! Check out the URI.js module on npm!

Secondly, you probably want to divide the links into outgoing links and links to the same domain. I wrote a quick regular expression to figure that out:

var base = 'jonathanmh.com';
// optional www. prefix and optional trailing slash
var test = new RegExp('^https?:\\/\\/(www\\.)?' + base + '\\/?', 'i');
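
For a quick sanity check (the example URLs are taken from the link list above), the expression behaves like this:

console.log(test.test('http://jonathanmh.com/tag/gulp/')); // true, same domain
console.log(test.test('http://getbootstrap.com')); // false, external link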

To interactively test the regex, I used regex101.com, which is an awesome site that explains which strings your regex would match while you type and try it out.

(Screenshot: testing the regular expression interactively on regex101.com)

In case a link is relative, we need to resolve it against the URL we are currently crawling (technically without the query parameters if there are any, but we'll leave those alone for now).
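
A rough sketch of that, using Node's built-in url module (the example URLs are just for illustration), could look like this:

var urlModule = require('url');

// resolve a relative link against the page we are currently crawling
var absolute = urlModule.resolve('http://jonathanmh.com/tag/vim', '/tag/gulp/');
console.log(absolute); // -> 'http://jonathanmh.com/tag/gulp/'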

To test whether a link belongs to the same domain as the current one, we should use a regular expression. For writing regular expressions in general, I can recommend regex101.com, which interactively shows you whether your RegEx matches a string and which portion of it, and also explains what the different characters in your RegEx mean, right in the sidebar.

For example, if you want to crawl both http and https links, you might want to write https?, which will result in the explanation:

^ asserts position at start of the string
http matches the characters http literally (case sensitive)
s? matches the character s literally (case sensitive); Quantifier: ? between zero and one time, as many times as possible, giving back as needed [greedy]

The base variable we get from the URL that is currently being crawled, and we extract it like this:

var url = 'http://jonathanmh.com/tag/vim';
var base = url.split('/');
base = base[2]; // -> 'jonathanmh.com', the host part of the URL
base = base.split('.');
base = base.join('\\.'); // double backslash so the string contains a literal \.
console.log(base); // -> 'jonathanmh\.com'

console.log(url.match('/?.' + base + '.?/'));
/* output:
[ '//jonathanmh.com/',
  index: 5,
  input: 'http://jonathanmh.com/tag/vim' ]
*/
var myRegExp = new RegExp('^https?:\\/\\/(www\\.)?' + base + '\\/?', 'i');
console.log(myRegExp.test(url));
/* output:
true
*/

The split and join serve the purpose of escaping the . in the base domain.tld; otherwise it would be interpreted as the RegExp wildcard character rather than a literal dot. On second look, there is also an npm module for this, escape-string-regexp, but I just needed to escape this one character for now.
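
If you prefer the module route, a minimal sketch (assuming escape-string-regexp is installed and you use an older, require()-able version) could look like this:

var escapeStringRegexp = require('escape-string-regexp');

var base = escapeStringRegexp('jonathanmh.com');
console.log(base); // -> 'jonathanmh\.com'

var myRegExp = new RegExp('^https?:\\/\\/(www\\.)?' + base + '\\/?', 'i');
console.log(myRegExp.test('http://jonathanmh.com/tag/vim')); // -> true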

The contents of the page

To save the actual page, I would probably build my pageObject like the following:

var pageObject = {
  title: '', // page title
  url: '', // the URL of the currently crawled page
  body: '', // a whole lot of text
  meta: {
    description: '' // page meta description
  },
  links: [] // all links from the page, including link titles
};

So this is what we need to save to our database. How to do that is something we'll look at in the next couple of posts for different databases; after that, we will just use those functions to save or load data.

To get the text content of a page, you can select the body element; to save the title with a page, you can select the title element, like the following:

var bodyText = $('body').text();
var title = $('title').text();
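
Putting the pieces together, filling in the pageObject sketched above could look roughly like this inside the request callback (the meta description selector is an assumption about the page's markup, and links is the array we collected earlier):

var pageObject = {
  title: $('title').text(),
  url: url, // the URL we passed to request()
  body: $('body').text(),
  meta: {
    // assumes the page uses a standard meta description tag
    description: $('meta[name="description"]').attr('content')
  },
  links: links // the link objects collected earlier
};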

Thank you for reading! If you have any comments, additions or questions, please tweet or toot them at me!