Web scraping essentially means parsing the HTML output of a website and extracting the parts you want to use for something. In principle, that’s a big part of how Google works as a search engine: it goes to every web page it can find and stores a copy locally.
For this tutorial, you should have Go installed and ready to use, meaning your $GOPATH is set and the required compiler is installed.
Continue reading “Web Scraping with Golang and goQuery”
This post is going to be about crawling an entire domain in Node.js. You can find the first posts of the series here: Web Scraping / Web Crawling Pages with Node.js.
For testing purposes I have created a simple set of HTML pages that should resemble a generic website. It has several pages, and we want our crawler to go through them and make sure it finds all of them, wherever they’re linked. That means when our crawler hits a page, it should keep track of the links it finds and then only proceed to pages it has not crawled yet.
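To make the "only proceed to pages it has not crawled yet" idea concrete, here is a minimal sketch. It is not the code from the post: the `site` object, `getLinks`, and `crawl` are hypothetical names, and the pages live in memory instead of being fetched over HTTP, so the visited-set logic can be seen on its own.

```javascript
// Hypothetical in-memory "site": each path maps to the links found on that page.
// A real crawler would fetch the page and extract the links from its HTML.
const site = {
  "/": ["/about", "/blog"],
  "/about": ["/", "/contact"],
  "/blog": ["/blog/post-1", "/about"],
  "/blog/post-1": ["/blog", "/"],
  "/contact": [],
};

// Stand-in for "request the page and collect its links" (assumption).
function getLinks(path) {
  return site[path] || [];
}

// Breadth-first crawl that visits every reachable page exactly once.
function crawl(start) {
  const visited = new Set([start]); // pages already seen
  const queue = [start];            // pages still to be crawled
  const order = [];                 // pages in the order they were crawled

  while (queue.length > 0) {
    const page = queue.shift();
    order.push(page);

    for (const link of getLinks(page)) {
      // Only queue links we have not seen before.
      if (!visited.has(link)) {
        visited.add(link);
        queue.push(link);
      }
    }
  }
  return order;
}

console.log(crawl("/"));
```

Marking a link as visited when it is queued, not when it is crawled, keeps the same page from being queued twice when several pages link to it.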
Continue reading “Crawling an entire Domain / Website”
Welcome to part 2 of the series on crawling the web with Node.js. In this article we’re going to have a look at what valuable content we can grab from a page. Links are obviously an important part of writing a crawler, because our crawler wouldn’t know where to go next without them.
The data I’m going to extract from a page is not necessarily what you’ll want; it really all depends on what you want to do with the project. Maybe you only want the content of specific tags or status codes. I’ll just put up some examples, and you can see from there what’s possible and what would make sense for your purpose.
Continue reading “Web Crawling with Node.js #2: Building the Page Object”