Golang Goroutines and Channels with Custom Types

In the previous post, we looked at how to wait for goroutines to finish before moving on. In this post, we'll look at using custom types with them. If you want to brush up on the basics first, have a look at Goroutines, Channels and Awaiting Asynchronous Operations in Go.

When using goroutines and channels, chances are you want to pass around data that is specific to your program, and not necessarily just a collection of numbers or strings.
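Channels are typed, and that type can just as well be a struct you define yourself. Here's a minimal sketch; the Job type is made up purely for illustration:

package main

import "fmt"

// Job is a hypothetical custom type; a channel can carry
// structs just as easily as numbers or strings.
type Job struct {
    ID   int
    Name string
}

func main() {
    c := make(chan Job)

    // send a Job value from another goroutine
    go func() {
        c <- Job{ID: 1, Name: "crawl homepage"}
    }()

    // receive the struct from the channel
    job := <-c
    fmt.Printf("got job %d: %s\n", job.ID, job.Name)
}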

I often use web crawling as an example, mostly because it's part of a project I'm working on. I crawl websites to extract information, run error checks, or verify that some output is as expected.

Using GoQuery to Crawl Multiple Pages Concurrently

The little program below will do the following:

  1. visit a page
  2. get the content of the <title> tag
  3. get the content of the meta description
  4. print all pages' information at once

All these actions will be performed for each element in the urls slice. The code below is the simple version, crawling one page at a time:

package main

import (
    "fmt"
    "log"

    "github.com/PuerkitoBio/goquery"
)

type PageMeta struct {
    Title           string `json:"title"`
    MetaDescription string `json:"metaDescription"`
    RequestURL      string `json:"requestURL"`
}

func main() {

    var results []PageMeta

    urls := []string{
        "https://jonathanmh.com",
        "https://gegenwind.dk",
        "http://photographerexcuses.com",
    }

    for _, element := range urls {
        result := getMeta(element)
        results = append(results, result)
    }

    fmt.Println(results)
}

func getMeta(URLString string) PageMeta {
    fmt.Printf("STARTING %v\n", URLString)

    // fetch and parse the page; bail out if the request fails
    doc, err := goquery.NewDocument(URLString)
    if err != nil {
        log.Fatal(err)
    }

    var myMeta = PageMeta{}

    // the URL the document was actually fetched from (after redirects)
    myMeta.RequestURL = doc.Url.String()

    myMeta.Title = doc.Find("title").Contents().Text()

    // walk all <meta> tags and keep the description's content
    doc.Find("meta").Each(func(index int, item *goquery.Selection) {
        if item.AttrOr("name", "") == "description" {
            myMeta.MetaDescription = item.AttrOr("content", "")
        }
    })

    fmt.Printf("RETURNING %v\n", URLString)
    return myMeta
}

The output of the above code is:

STARTING https://jonathanmh.com
RETURNING https://jonathanmh.com
STARTING https://gegenwind.dk
RETURNING https://gegenwind.dk
STARTING http://photographerexcuses.com
RETURNING http://photographerexcuses.com
[{Hi, I'm JonathanMH Hi, I'm JonathanMH https://jonathanmh.com} {GegenWind We're GegenWind, a young photography (home) studio in Copenhagen. Say hello to us! https://gegenwind.dk} {Photographer Excuses  https://photographerexcuses.com/}]

Adding Concurrent Crawling to the goquery Example

Network requests are slow compared to local work, and most of the time is spent waiting. Even a modest computer can keep many network connections open at the same time, and the same goes for parsing HTML and extracting the content of the <title> tag.

Let's make sure we start loading all of the pages before waiting for any of them to finish:

package main

import (
    "fmt"
    "sync"

    "github.com/PuerkitoBio/goquery"
)

type PageMeta struct {
    Title           string `json:"title"`
    MetaDescription string `json:"metaDescription"`
    RequestURL      string `json:"requestURL"`
}

func main() {

    var results []PageMeta

    urls := []string{
        "https://jonathanmh.com",
        "https://gegenwind.dk",
        "http://photographerexcuses.com",
    }

    c := make(chan PageMeta)
    var wg sync.WaitGroup

    for _, element := range urls {
        wg.Add(1)
        go getMeta(element, c, &wg)
    }

    for range urls {
        results = append(results, <-c)
    }

    wg.Wait()

    fmt.Println(results)
}

func getMeta(URLString string, c chan PageMeta, wg *sync.WaitGroup) {
    fmt.Printf("STARTING %v\n", URLString)
    defer wg.Done()

    var myMeta = PageMeta{}

    // fetch and parse the page; on failure, send the empty result
    // anyway so the receiving loop in main doesn't block forever
    doc, err := goquery.NewDocument(URLString)
    if err != nil {
        fmt.Printf("ERROR %v: %v\n", URLString, err)
        c <- myMeta
        return
    }

    // the URL the document was actually fetched from (after redirects)
    myMeta.RequestURL = doc.Url.String()

    myMeta.Title = doc.Find("title").Contents().Text()

    // walk all <meta> tags and keep the description's content
    doc.Find("meta").Each(func(index int, item *goquery.Selection) {
        if item.AttrOr("name", "") == "description" {
            myMeta.MetaDescription = item.AttrOr("content", "")
        }
    })

    fmt.Printf("RETURNING %v\n", URLString)
    c <- myMeta
}

Note that we changed a couple of things to make the code run in our new, preferred order (all goroutines at once):

  1. we define a channel c := make(chan PageMeta) that transports values of type PageMeta
  2. we create a WaitGroup: var wg sync.WaitGroup
  3. each time the first loop runs, we add to the WaitGroup: wg.Add(1)
  4. we pass the channel and the WaitGroup to the function that does the network request: go getMeta(element, c, &wg)
  5. we remove the return value from getMeta and instead send the result to the channel: c <- myMeta
  6. we collect one result per URL from the channel: results = append(results, <-c)
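As an aside, a common variant of this pattern is to close the channel once all the goroutines are done, which lets main range over the channel instead of counting receives. This is just a sketch of the alternative, reusing getMeta and PageMeta from above:

    c := make(chan PageMeta)
    var wg sync.WaitGroup

    for _, element := range urls {
        wg.Add(1)
        go getMeta(element, c, &wg)
    }

    // close c once every getMeta has called wg.Done()
    go func() {
        wg.Wait()
        close(c)
    }()

    // range keeps receiving until the channel is closed
    for result := range c {
        results = append(results, result)
    }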

The output should now look like the following; note that the order can vary between runs, since the goroutines finish in whatever order the responses come back:

STARTING http://photographerexcuses.com
STARTING https://jonathanmh.com
STARTING https://gegenwind.dk
RETURNING http://photographerexcuses.com
RETURNING https://gegenwind.dk
RETURNING https://jonathanmh.com
[{Photographer Excuses  https://photographerexcuses.com/} {GegenWind We're GegenWind, a young photography (home) studio in Copenhagen. Say hello to us! https://gegenwind.dk} {Hi, I'm JonathanMH Hi, I'm JonathanMH https://jonathanmh.com}]

Summary

If you're doing things in series that don't depend on each other and could be done in parallel, you probably should parallelize them to speed up your program.

If you're looking for a more batteries-included alternative to goquery for web scraping, give colly a shot; it supports concurrent crawling through a simple collector option, as sketched below.
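Here's a minimal sketch of what that can look like, based on colly's Async option; treat it as a starting point rather than a drop-in replacement for the program above:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    // Async(true) makes Visit non-blocking, so requests run concurrently
    c := colly.NewCollector(colly.Async(true))

    // print the <title> of every visited page
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Printf("%v: %v\n", e.Request.URL, e.Text)
    })

    for _, url := range []string{
        "https://jonathanmh.com",
        "https://gegenwind.dk",
        "http://photographerexcuses.com",
    } {
        c.Visit(url)
    }

    // wait for all outstanding requests to finish
    c.Wait()
}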

Tagged with: #go #golang

Thank you for reading! If you have any comments, additions or questions, please tweet or toot them at me!