Web scraping is essentially parsing the HTML output of a website and extracting the parts you want to use for something. At its core, that’s a big part of how Google works as a search engine: it visits every web page it can find and stores a copy locally. For this tutorial, you should have […]
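To make the idea concrete, here is a minimal sketch of "parsing HTML and taking the parts you want": pulling the page title out of an HTML string. The HTML sample and the regex approach are illustrative assumptions only; a real scraper would use a proper parser library instead of a regex.

```javascript
// Minimal illustration: scraping is pulling data out of raw HTML text.
// A regex is enough to show the idea; real code would use an HTML parser.
function extractTitle(html) {
  // Grab whatever sits between <title> and </title>, if present
  const match = html.match(/<title>([^<]*)<\/title>/i);
  return match ? match[1] : null;
}

const html =
  '<html><head><title>Example Page</title></head>' +
  '<body><a href="/about">About</a></body></html>';

console.log(extractTitle(html)); // → Example Page
```

Everything else in scraping is a variation on this: fetch the HTML, locate the fragment you care about, and extract it.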
This post is going to be about crawling an entire domain in Node.js. You can find the first post of the series here: Web Scraping / Web Crawling Pages with Node.js. For testing purposes I have created a simple set of HTML pages that should resemble a generic website. It has some pages and we […]
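Crawling a whole domain boils down to a queue of pages to visit plus a set of pages already seen, so each page is fetched exactly once. The sketch below runs the loop over a hypothetical in-memory site map standing in for the test pages; a real crawler would fetch each URL over HTTP and extract its links instead.

```javascript
// Breadth-first crawl of one domain: visit every reachable page once.
// `site` is a made-up in-memory stand-in for a set of test pages, mapping
// each URL to the links found on that page.
const site = {
  '/':  ['/a', '/b'],
  '/a': ['/b', '/c'],
  '/b': ['/'],
  '/c': [],
};

function crawl(start, getLinks) {
  const visited = new Set();
  const queue = [start];
  while (queue.length > 0) {
    const url = queue.shift();
    if (visited.has(url)) continue; // already crawled this page
    visited.add(url);
    // Enqueue every link we haven't seen yet
    for (const next of getLinks(url) || []) {
      if (!visited.has(next)) queue.push(next);
    }
  }
  return [...visited];
}

console.log(crawl('/', (url) => site[url])); // each page appears exactly once
```

Swapping the `getLinks` callback for an HTTP fetch plus link extraction (and filtering links to stay on the domain) turns this toy loop into a real crawler.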
Welcome to part 2 of the series on crawling the web with Node.js. In this article we’re going to look at what valuable content we can grab from a page. Links are the most important thing for a crawler to extract, because without them it wouldn’t know where to go next. The data I’m going […]
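Since links are what keep the crawler moving, here is a small sketch of collecting every `href` from a page’s HTML. The regex approach and the sample markup are assumptions for illustration; a production crawler would use a DOM parser and would also resolve relative URLs against the page’s address.

```javascript
// Collect all link targets (href attributes of <a> tags) from raw HTML.
// A regex keeps the example self-contained; a real crawler would use a parser.
function extractLinks(html) {
  const links = [];
  const re = /<a\s[^>]*href=["']([^"']+)["']/gi;
  let match;
  while ((match = re.exec(html)) !== null) {
    links.push(match[1]); // the captured href value
  }
  return links;
}

const page =
  '<a href="/about">About</a> ' +
  '<a href="https://example.com/blog">Blog</a>';

console.log(extractLinks(page)); // → ['/about', 'https://example.com/blog']
```

The returned list feeds straight into the crawl queue; everything else on the page (titles, headings, text) is the "valuable content" side of the extraction.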
This post series is going to discuss and illustrate how to write a web crawler in Node.js. I’m going to keep the core posts database agnostic, and split the database-specific parts into separate posts for each of the different databases you could imagine using.