Web Scraping with Golang and goQuery

Web scraping is essentially parsing the HTML output of a website and extracting the parts you want to use for something. At its core, that's a big part of how Google works as a search engine: it goes to every web page it can find and stores a copy locally.

For this tutorial, you should have Go installed and ready to go, meaning your $GOPATH is set and the compiler is installed.

Parsing a page with goQuery

goQuery is pretty much like jQuery, just for Go. It gives you easy access to the HTML structure of a page and lets you pick the elements you want to access by attribute or content.

If you compare the functions, they are very close to jQuery: .Text() gets the text content of an element, while .Attr() and .AttrOr() get attribute values.
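
As a quick sketch of the correspondence (here sel stands for any *goquery.Selection, and the attribute names are just examples):

// sel is a *goquery.Selection, e.g. the result of doc.Find("a")
text := sel.Text()                   // text content, like jQuery's .text()
href, exists := sel.Attr("href")     // attribute value plus a bool signalling presence
class := sel.AttrOr("class", "none") // attribute value, or the fallback if it's missing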

In order to get started with goQuery, just run the following in your terminal:

go get github.com/PuerkitoBio/goquery

Now let's create our test project. I did that as follows:

# confirm my $GOPATH is set
echo $GOPATH
/home/jonathan/projects/go
# switch to the `src` path
cd $GOPATH/src
# create directory
mkdir tutorial-web-scraping
# switch to directory
cd tutorial-web-scraping

Now we can create the example files for the programs listed below. Normally you can't have multiple main() functions inside one package, but since we'll run each file individually with go run, we'll make an exception, because we're beginners, right?

List all Posts on Blog Page

The following program will list all articles on my blog's front page, each composed of its title and a link to the post.

Since we're using .Each(), we also get a numeric index, which starts at 0 and counts up through every element matching the selector #main article .entry-title on the page.

// file: list_posts.go
package main

import (
    // import standard libraries
    "fmt"
    "log"

    // import third party libraries
    "github.com/PuerkitoBio/goquery"
)

func postScrape() {
    doc, err := goquery.NewDocument("http://jonathanmh.com")
    if err != nil {
        log.Fatal(err)
    }

    // use CSS selector found with the browser inspector
    // for each, use index and item
    doc.Find("#main article .entry-title").Each(func(index int, item *goquery.Selection) {
        title := item.Text()
        linkTag := item.Find("a")
        link, _ := linkTag.Attr("href")
        fmt.Printf("Post #%d: %s - %s\n", index, title, link)
    })
}

func main() {
    postScrape()
}

If you come from a language where functions can't have multiple return values, take a second look at this line: link, _ := linkTag.Attr("href"). If we defined a name instead of _ and called it something like present, we could test whether the attribute is actually set.
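
Here's a minimal sketch of that pattern inside the .Each() callback above (the name present is just illustrative):

// the second return value reports whether the href attribute exists at all
link, present := linkTag.Attr("href")
if present {
    fmt.Printf("Post #%d: %s - %s\n", index, title, link)
} else {
    fmt.Printf("Post #%d: %s - (no link found)\n", index, title)
}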

The output of the above program should be something like the following:

$ go run list_posts.go
Post #0: How to use SSH keys for Authentication (for beginners) - http://jonathanmh.com/how-to-use-ssh-keys-for-authentication-for-beginners/
Post #1: Using Sourcegraph Checkup with local file system storage - http://jonathanmh.com/using-sourcegraph-checkup-local-file-system-storage/
Post #2: Copenhagen Pride 2016 Photos - http://jonathanmh.com/copenhagen-pride-2016-photos/
Post #3: Searching the Google Books API with PHP [Quickstart] - http://jonathanmh.com/searching-google-books-api-php-quickstart/
Post #4: How to get a high score on Pagespeed Insights (and make your site fast) - http://jonathanmh.com/get-high-score-pagespeed-insights-make-site-fast/
Post #5: NGINX / Apache: Block Requests to PHP file (xmlrpc.php) - http://jonathanmh.com/nginx-apache-block-requests-php-file-xmlrpc-php/
Post #6: Distortion Copenhagen 2016 – Nørrebro / Wednesday - http://jonathanmh.com/distortion-copenhagen-2016-norrebro-wednesday/
Post #7: I need feminism because: Metal T-shirts - http://jonathanmh.com/need-feminism-metal-t-shirts/
Post #8: On Being Powerless - http://jonathanmh.com/on-being-powerless/
Post #9: How to get a Job in Tech - http://jonathanmh.com/get-job-tech/

Scraping all links on a page doesn't look much different, to be honest: we just use a more general selector, body a, and log each of the links. I'm getting the text content of the respective <a> tag with linkText := linkTag.Text().

// file name: get_all_links.go
package main

import (
    // import standard libraries
    "fmt"
    "log"

    // import third party libraries
    "github.com/PuerkitoBio/goquery"
)

func linkScrape() {
    doc, err := goquery.NewDocument("http://jonathanmh.com")
    if err != nil {
        log.Fatal(err)
    }

    // use CSS selector found with the browser inspector
    // for each, use index and item
    doc.Find("body a").Each(func(index int, item *goquery.Selection) {
        linkTag := item
        link, _ := linkTag.Attr("href")
        linkText := linkTag.Text()
        fmt.Printf("Link #%d: '%s' - '%s'\n", index, linkText, link)
    })
}

func main() {
    linkScrape()
}

The output of the above code should be something like:

$ go run get_all_links.go
Link #0: 'Skip to content' - '#content'
Link #1: 'JonathanMH' - 'http://jonathanmh.com/'
Link #2: 'twitter' - 'https://twitter.com/JonathanMH_com'
Link #3: 'rss feed' - 'http://jonathanmh.com/feed/'
... (many more)
Link #172: 'Proudly powered by WordPress' - 'https://wordpress.org/'

Now we know how to get all links from a page, including their link text! That should be pretty useful to SEO or analytics people, because it shows the context in which another website is linked, and therefore which keywords it should be associated with.

Get Title and Meta Data with Golang scraping

Lastly, we should cover something we typically don't select with jQuery: the page title and the meta description.

// file name: metadata.go
package main

import (
    // import standard libraries
    "fmt"
    "log"

    // import third party libraries
    "github.com/PuerkitoBio/goquery"
)

func metaScrape() {
    doc, err := goquery.NewDocument("http://jonathanmh.com")
    if err != nil {
        log.Fatal(err)
    }

    var metaDescription string
    var pageTitle string

    // the <title> tag is typically unique, so we can grab its contents directly
    pageTitle = doc.Find("title").Contents().Text()

    doc.Find("meta").Each(func(index int, item *goquery.Selection) {
        if( item.AttrOr("name","") == "description") {
            metaDescription = item.AttrOr("content", "")
        }
    })
    fmt.Printf("Page Title: '%s'\n", pageTitle)
    fmt.Printf("Meta Description: '%s'\n", metaDescription)
}

func main() {
    metaScrape()
}

This should yield:

Page Title: 'JonathanMH - Just a guy, usually in Denmark, blogging about things he couldn't get to work right away and then made a blog post in case others get stuck too.'
Meta Description: 'JonathanMH Coder, Blogger, Videographer, Webguy, Just a guy, usually in Denmark, blogging about things he couldn't get to work right away and then made a blog post in case others get stuck too.'

What's a little bit different in the above example is that we're using AttrOr(attributeName, fallbackValue) to make sure we get data at all. This is a kind of shorthand that saves us an explicit check for whether the attribute is present.
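
Spelled out with .Attr() and its second return value, the shorthand is roughly equivalent to this sketch:

// longhand version of: metaDescription = item.AttrOr("content", "")
content, exists := item.Attr("content")
if !exists {
    content = "" // fall back to the default value
}
metaDescription = content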

For the title we can simply select the Contents of the *Selection, because it's typically the only tag of its kind on a page: pageTitle = doc.Find("title").Contents().Text().

Summary

Go is still pretty new to me, but it's getting more and more familiar. Some of the things the compiler worries about make me rethink how I think about code in general, which is a great thing. In terms of libraries, goQuery is awesome, and I want to thank the author for providing such a powerful parsing library that is so incredibly easy to use.

Do you do web scraping / crawling? What do you use it for? Did you like the post or do you have some suggestions? Let me know in the comments!

Tagged with: #go #golang #goquery #web crawling #web scraping

Thank you for reading! If you have any comments, additions or questions, please tweet or toot them at me!