This post is about crawling an entire domain with Node.js. You can find the earlier posts of the series here: Web Scraping / Web Crawling Pages with Node.js.
For testing purposes I have created a simple set of HTML pages that should resemble a generic website. It has several pages, and we want our crawler to go through them and make sure it finds every one that is linked somewhere. That means when our crawler hits a page, it should keep track of the links it finds and then only proceed to pages it has not crawled yet.
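To make the idea concrete, here is a minimal sketch of that bookkeeping. It is not the code from the full post: the `site` object, the `example.test` URLs and the `getLinks` callback are all made up for illustration, and the lookup is synchronous so the example runs without a network. A real crawler would fetch each page over HTTP and work asynchronously.

```javascript
// Hypothetical in-memory "website": each key is a page URL,
// each value lists the hrefs found on that page.
const site = {
  "http://example.test/": ["/about.html", "/blog.html"],
  "http://example.test/about.html": ["/", "/contact.html"],
  "http://example.test/blog.html": ["/about.html", "http://other.test/"],
  "http://example.test/contact.html": ["/"],
};

// Visit every page reachable from startUrl that lives on the same host.
// getLinks(url) is an assumed callback returning the hrefs found on a page.
function crawlDomain(startUrl, getLinks) {
  const start = new URL(startUrl);
  const visited = new Set();   // pages we have already crawled
  const queue = [start.href];  // pages we still have to look at

  while (queue.length > 0) {
    const url = queue.shift();
    if (visited.has(url)) continue; // crawled already, skip it
    visited.add(url);

    for (const href of getLinks(url)) {
      const next = new URL(href, url); // resolve relative links
      next.hash = "";                  // ignore #fragments
      const sameHost = next.hostname === start.hostname;
      if (sameHost && !visited.has(next.href)) {
        queue.push(next.href);
      }
    }
  }
  return [...visited];
}

const pages = crawlDomain("http://example.test/", (url) => site[url] || []);
console.log(pages);
```

The `visited` set is what keeps the crawler from looping forever on pages that link back to each other, and the hostname check keeps it from wandering off to other domains.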
Continue reading “Crawling an entire Domain / Website”
Welcome to part 2 of the series on crawling the web with Node.js. In this article we’re going to have a look at what valuable content we can grab from a page. Links are obviously an important part of writing a crawler, because without them our crawler wouldn’t know where to go next.
The data I’m going to extract from a page is not necessarily what you’ll want; it really depends on what you want to do with the project. Maybe you only want the content of specific tags, or the status codes. I’ll just put up some examples so you can see what’s possible and what would make sense for your purpose.
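As a rough idea of what such a page object could look like, here is a small sketch. The `buildPage` function and its field names are my own illustration, not the object built in the full post, and the regular expressions are a deliberately naive stand-in for a real HTML parser. It also assumes every href can be resolved by `new URL`, which throws on malformed input.

```javascript
// Build a plain object describing one crawled page.
// url, statusCode and html are assumed to come from your HTTP request.
function buildPage(url, statusCode, html) {
  // Naive title extraction: first <title>...</title> in the document.
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);

  // Naive link extraction: quoted href attributes on <a> tags.
  const links = [];
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']*)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    // Resolve relative links against the page's own URL.
    links.push(new URL(match[1], url).href);
  }

  return {
    url,
    statusCode,
    title: titleMatch ? titleMatch[1].trim() : null,
    links,
  };
}

const sampleHtml =
  "<html><head><title> Home </title></head><body>" +
  '<a href="/about.html">About</a> ' +
  "<a class=\"x\" href='http://other.test/'>Other</a>" +
  "</body></html>";

const page = buildPage("http://example.test/", 200, sampleHtml);
console.log(page);
```

From here you can add whatever else matters to you, such as headings or meta tags, by adding more fields to the returned object.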
Continue reading “Web Crawling with Node.js #2: Building the Page Object”
This post series is going to discuss and illustrate how to write a web crawler in Node.js. Some of the posts will be database agnostic, while the database part will be split up into separate posts for the different databases you could imagine using.
Continue reading “Web Scraping / Web Crawling Pages with Node.js”