For a while now I’ve been playing around with a stack-based language written in Go. The specifics of that project aren’t too important here, but I did notice something interesting today when I tried gccgo with the project for the first time. First I wanted to make sure it worked, so I ran go-5 test, and happily all tests passed. Next I thought I’d see how the benchmarks looked, and here I was most surprised: every single benchmark I had written was between 1.5x and 2x slower with gccgo than with the standard gc compiler. The versions I used were:

gcc version 5.3.1 20160101 (Debian 5.3.1-5)
go version go1.4.2 gccgo (Debian 5.3.1-5) 5.3.1 20160101 linux/amd64
go version go1.5.2 linux/amd64
go version go1.4.3 linux/amd64
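
As an aside, go-5 is the gccgo-provided go command on this Debian box. If you have both toolchains installed side by side, the standard go tool’s -compiler build flag should also select gccgo (I used go-5 directly, so I haven’t verified this path for the numbers below):

$ go test -compiler gccgo -bench .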

Go GC 1.4 & 1.5 ended up having very similar performance to one another in my tests. The most interesting tests were two that iterated over a linked list. Below is a recreation of the benchmark that was run. There are really two things that can be considered tested here. Interface method invocation & direct pointer method invocation.

package main

import (
	"fmt"
	"testing"
)

type Lister interface {
	Next() Lister
	Name() string
}

type List struct {
	link Lister
	name string
}

func (l List) Name() string {
	return l.name
}

func (l List) Next() Lister {
	return l.link
}

func BenchmarkLinkedList(b *testing.B) {
	var l Lister
	for i := 1; i < 1000; i++ {
		l = List{link: l, name: fmt.Sprintf("test%d", i)}
	}

	first := "test0"
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		for ll := l; ll != nil; ll = ll.Next() {
			if ll.Name() == first {
				break
			}
		}
	}
}

type List2 struct {
	link *List2
	name string
}

// Test without the interface function calls
func BenchmarkPureLinkedList(b *testing.B) {
	var l *List2
	for i := 1; i < 1000; i++ {
		l = &List2{link: l, name: fmt.Sprintf("test%d", i)}
	}

	first := "test0"
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		for ll := l; ll != nil; ll = ll.link {
			if ll.name == first {
				break
			}
		}
	}
}

$ go version
go version go1.4.3 linux/amd64
$ go test -bench .
testing: warning: no tests to run
PASS
BenchmarkLinkedList       100000             14888 ns/op
BenchmarkPureLinkedList  1000000              2166 ns/op
$ go version
go version go1.5.2 linux/amd64
$ go test -bench .
testing: warning: no tests to run
PASS
BenchmarkLinkedList-2             100000             12744 ns/op
BenchmarkPureLinkedList-2        1000000              2179 ns/op
$ go version
go version go1.4.2 gccgo (Debian 5.3.1-5) 5.3.1 20160101 linux/amd64
$ go test -bench .
# _/home/harley/tmp/tests_go/test_linked_list
ar: `u' modifier ignored since `D' is the default (see `U')
# testmain
ar: `u' modifier ignored since `D' is the default (see `U')
testing: warning: no tests to run
PASS
BenchmarkLinkedList        50000             25005 ns/op
BenchmarkPureLinkedList   200000             10096 ns/op

As you can see, gc 1.4 and 1.5 aren’t too far off one another (with 1.5 a bit faster in the BenchmarkLinkedList test). Compare that to gccgo, though: it’s roughly twice as slow in the first benchmark and nearly 5x slower in the second, relative to gc 1.5.

Here’s the output of pprof for gc 1.5.2:

(pprof) top10
21.39s of 21.46s total (99.67%)
Dropped 3 nodes (cum <= 0.11s)
      flat  flat%   sum%        cum   cum%
     6.52s 30.38% 30.38%     16.40s 76.42%  main.LinkedList
     5.01s 23.35% 53.73%      5.01s 23.35%  main.(*List).Name
     5.01s 23.35% 77.07%      5.05s 23.53%  main.PureLinkedList
     4.85s 22.60% 99.67%      4.85s 22.60%  main.(*List).Next
         0     0% 99.67%     21.45s   100%  main.main
         0     0% 99.67%     21.45s   100%  runtime.goexit
         0     0% 99.67%     21.45s   100%  runtime.main

And the output of pprof for gccgo 5.3.1:

(pprof) top10
25.83s of 25.83s total (  100%)
      flat  flat%   sum%        cum   cum%
    25.83s   100%   100%     25.83s   100%  runtime_sigprof
         0     0%   100%     25.83s   100%
         0     0%   100%         8s 30.97%  __go_strcmp
         0     0%   100%     18.16s 70.31%  main.LinkedList
         0     0%   100%      4.97s 19.24%  main.Name.N9_main.List
         0     0%   100%      4.63s 17.92%  main.Next.N9_main.List
         0     0%   100%      7.67s 29.69%  main.PureLinkedList
         0     0%   100%     25.83s   100%  main.main
         0     0%   100%     25.83s   100%  runtime_main
         0     0%   100%     25.83s   100%  sig_tramp_info

To me, the interesting thing is that, with both compilers, List.Name and List.Next each take almost as much time (over the 30-second profile) as the entire PureLinkedList function, which is the version that uses direct struct and pointer access. That lines up with the benchmark numbers: every step through the interface-based list pays for two dynamic method calls that the compiler can’t inline, while the pure version is just a pointer dereference and a string comparison.

Those profiles were collected by running the code below with go run main.go and then running go tool pprof http://localhost:6060/debug/pprof/profile while main.go was running. I then inspected the collected profile with top10.
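
Concretely, the workflow was something like this (reconstructed rather than a verbatim transcript), with the pprof command run from a second terminal and using the default 30-second CPU profile:

$ go run main.go
$ go tool pprof http://localhost:6060/debug/pprof/profile
(pprof) top10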

This code is very similar to the benchmark code above, but uses net/http/pprof:

package main

import (
	"fmt"
	"log"
	"net/http"
	// Blank import for its side effect: it registers the /debug/pprof/
	// handlers on the default HTTP mux.
	_ "net/http/pprof"
	"time"
)

type Lister interface {
	Next() Lister
	Name() string
}

type List struct {
	link Lister
	name string
}

func (l List) Name() string {
	return l.name
}

func (l List) Next() Lister {
	return l.link
}

func LinkedList(l Lister) {

	first := "test0"

	for i := 0; i < 10000; i++ {
		for ll := l; ll != nil; ll = ll.Next() {
			if ll.Name() == first {
				break
			}
		}
	}
}

type List2 struct {
	link *List2
	name string
}

// Test without the interface function calls
func PureLinkedList(l *List2) {

	first := "test0"

	for i := 0; i < 10000; i++ {
		for ll := l; ll != nil; ll = ll.link {
			if ll.name == first {
				break
			}
		}
	}
}

func main() {
	var l Lister
	for i := 1; i < 10000; i++ {
		l = List{link: l, name: fmt.Sprintf("test%d", i)}
	}

	var ll *List2
	for i := 1; i < 10000; i++ {
		ll = &List2{link: ll, name: fmt.Sprintf("test%d", i)}
	}

	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	for {
		LinkedList(l)
		time.Sleep(time.Millisecond * 300)
		PureLinkedList(ll)
		time.Sleep(time.Millisecond * 300)
	}
}

I suppose the takeaway here is that direct struct access is going to be faster than interface method calls. That isn’t news, or even hard to believe, but it’s a reminder that tight loops that need speed will do better with direct struct access than with interface method calls. And, hopefully obviously, profile your application before optimizing anything.
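
One variation I’d want to benchmark before reading too much into the numbers (I haven’t measured it here) is pointer receivers. With value receivers, every call through the Lister interface copies the List struct into the method; with pointer receivers the interface holds a *List and only the pointer is passed. Here’s a minimal sketch that reuses the Lister interface from the benchmark file above, with a separate PList type and benchmark name so it can live in the same package:

package main

import (
	"fmt"
	"testing"
)

// PList is the same shape as List, but its methods use pointer receivers, so
// the Lister interface is satisfied by *PList rather than by a PList value.
type PList struct {
	link Lister
	name string
}

func (l *PList) Name() string { return l.name }
func (l *PList) Next() Lister { return l.link }

func BenchmarkPtrReceiverLinkedList(b *testing.B) {
	// Build the list with pointers; each node still links to the previous one
	// through the Lister interface.
	var l Lister
	for i := 1; i < 1000; i++ {
		l = &PList{link: l, name: fmt.Sprintf("test%d", i)}
	}

	first := "test0"
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		for ll := l; ll != nil; ll = ll.Next() {
			if ll.Name() == first {
				break
			}
		}
	}
}

If that variant closes some of the gap between BenchmarkLinkedList and BenchmarkPureLinkedList, the cost is partly receiver copying rather than dynamic dispatch alone; if it doesn’t, the dispatch itself is the bottleneck. Either way, it’s the kind of thing to measure rather than guess at.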