Benchmarking Results in Go
November 15, 2018
Benchmarking is a great way to understand how your code is performing, and the Go benchmark tools can show both execution times and memory allocation data. This can be very handy for performance tuning because you have empirical evidence that a change you made is making the difference you expect.
How does it work?
I’m not going to go into a huge amount of detail here, but I do want to set up some very high-level basics. For more details, check out Dave Cheney’s blog.
Let’s take a look at the following example. Benchmarks are added in the same way as tests. Like tests, benchmarks use a function naming convention (Benchmark...), and the file containing the functions should be named with a _test.go suffix. In this case, I’ve named the file benchmark_test.go.
package benchmarkperf

import (
	"testing"
)

// BenchmarkCountParallel runs countFunc concurrently across multiple goroutines.
func BenchmarkCountParallel(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			countFunc()
		}
	})
}

// BenchmarkCountSerial runs countFunc one call at a time in the main goroutine.
func BenchmarkCountSerial(b *testing.B) {
	for i := 0; i < b.N; i++ {
		countFunc()
	}
}

// countFunc burns CPU by counting to one million.
func countFunc() {
	for i := 0; i < 1e6; i++ {
	}
}
The benchmarks can be run with the following command (-run=^$ ensures no regular tests run, and -bench . runs every benchmark in the file): go test -cpu 1,2,4,8 -benchmem -run=^$ -bench . benchmark_test.go
The BenchmarkCountSerial function runs in serial - that is, one call at a time in an execution loop. It doesn’t start the next count until the previous iteration has finished, and everything runs in the main goroutine. The BenchmarkCountParallel function runs in parallel: the main goroutine fires off multiple goroutines to execute the code inside b.RunParallel, waits for all of them to complete, and then continues. The -cpu flag sets the number of threads available to the Go runtime (GOMAXPROCS) for each run.
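Conceptually, b.RunParallel fans the work out across a pool of goroutines that pull iterations from a shared counter, then waits for all of them to finish. Here’s a rough, hand-rolled sketch of that pattern; runParallelSketch is an illustration only, not the testing package implementation:

package benchmarkperf

import (
	"sync"
	"sync/atomic"
)

// runParallelSketch illustrates the fan-out/wait pattern behind
// b.RunParallel: start a fixed number of goroutines, let each one pull
// iterations from a shared counter, and block until all of them finish.
// The real testing package differs in detail (it batches work, measures
// time, and so on); this is only a conceptual sketch.
func runParallelSketch(goroutines int, totalOps int64, fn func()) {
	remaining := totalOps
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each goroutine claims one iteration at a time until the
			// shared counter is exhausted.
			for atomic.AddInt64(&remaining, -1) >= 0 {
				fn()
			}
		}()
	}
	wg.Wait()
}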
Interpreting Results
Running the above command gives us the following results:
goos: linux
goarch: amd64
BenchmarkCountParallel 5000 368018 ns/op 0 B/op 0 allocs/op
BenchmarkCountParallel-2 10000 163742 ns/op 0 B/op 0 allocs/op
BenchmarkCountParallel-4 20000 95246 ns/op 0 B/op 0 allocs/op
BenchmarkCountParallel-8 20000 98381 ns/op 0 B/op 0 allocs/op
BenchmarkCountSerial 5000 337964 ns/op 0 B/op 0 allocs/op
BenchmarkCountSerial-2 3000 386644 ns/op 0 B/op 0 allocs/op
BenchmarkCountSerial-4 5000 317093 ns/op 0 B/op 0 allocs/op
BenchmarkCountSerial-8 5000 355617 ns/op 0 B/op 0 allocs/op
PASS
ok command-line-arguments 16.685s
We can see here the name of the benchmark followed by the number of allocated procs (via the -cpu flag) when that is greater than 1. If it’s set to 1, there is no suffix; otherwise, you will see -<num procs>. Next is the number of iterations run. This is a fairly light function, so each call completes quickly and many iterations can be run; the iteration count adjusts dynamically based on the benchmark duration specified at execution time. Then we see the ns/op column - the average time each function call takes to complete. After that are the two memory columns (present because of the -benchmem flag), which show that in all runs, zero bytes of memory were allocated, at an average of 0 allocations per operation.
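The default target is one second per benchmark; if we want more iterations (and a longer sample), the -benchtime flag raises that target, for example:
go test -cpu 1,2,4,8 -benchmem -benchtime 5s -run=^$ -bench . benchmark_test.go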
Looking at that result set, everything makes sense except for the timing. Is the function actually getting faster when we add more goroutines? The function isn’t parallelized at all, so with all other things held the same, per-call performance should at best match running with fewer threads. Context switching and CPU exhaustion can lead to significantly slower per-thread performance as well.
To amplify the point, let’s tweak the example used above a bit:
package benchmarkperf

import (
	"strconv"
	"testing"
	"time"
)

// BenchmarkSleepSerial runs sleepFunc one call at a time.
func BenchmarkSleepSerial(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sleepFunc()
	}
}

// BenchmarkSleepParallel runs sub-benchmarks at parallelism 1, 2, 4, and 8.
func BenchmarkSleepParallel(b *testing.B) {
	for i := 1; i <= 8; i *= 2 {
		b.Run(strconv.Itoa(i), func(b *testing.B) {
			b.SetParallelism(i)
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					sleepFunc()
				}
			})
		})
	}
}

// sleepFunc blocks for one millisecond without doing any CPU work.
func sleepFunc() {
	time.Sleep(1 * time.Millisecond)
}
The test naming structure follows the pattern BenchmarkSleepParallel/<parallelism>-<cpu>. Note that if the CPU count is set to 1, the suffix won’t be appended. Running this code, a single call to sleepFunc (assuming the sleep implementation is working properly) will never take less than 1ms.
Now let’s take a look at the benchmark results by running go test -cpu 1,2,4,8,16 -benchmem -run=^$ -bench . benchmark_test.go:
goos: linux
goarch: amd64
BenchmarkSleepSerial 1000 1449941 ns/op 0 B/op 0 allocs/op
BenchmarkSleepSerial-2 1000 1550830 ns/op 0 B/op 0 allocs/op
BenchmarkSleepSerial-4 1000 1468569 ns/op 0 B/op 0 allocs/op
BenchmarkSleepSerial-8 1000 1507217 ns/op 0 B/op 0 allocs/op
BenchmarkSleepSerial-16 1000 1509757 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/1 1000 1498519 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/1-2 2000 743778 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/1-4 5000 377843 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/1-8 10000 198106 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/1-16 20000 92807 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/2 2000 735995 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/2-2 5000 378456 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/2-4 10000 186719 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/2-8 20000 92417 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/2-16 30000 45607 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/4 5000 369069 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/4-2 10000 190875 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/4-4 20000 92051 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/4-8 30000 46979 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/4-16 100000 22973 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/8 10000 186493 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/8-2 20000 99457 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/8-4 30000 47851 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/8-8 50000 25342 ns/op 0 B/op 0 allocs/op
BenchmarkSleepParallel/8-16 100000 12024 ns/op 0 B/op 0 allocs/op
PASS
ok command-line-arguments 48.824s
So, on average, each function call appears to be taking well under the minimum time a call could possibly take (1ms). Ruling out dead-code elimination during compilation (which is not happening inside the test), let’s take a look at what’s going on when the benchmark runs and reports its results.
When a benchmark is run in parallel, it first determines the number of goroutines to run. This is determined by multiplying GOMAXPROCS (set by the -cpu flag) by the value of b.parallelism, which can be set by calling b.SetParallelism(int). So if we’re running on a 4-core machine with defaults, we get 4 for GOMAXPROCS multiplied by 1 for parallelism, creating 4 goroutines. This means that, assuming we schedule against 4 CPU cores, we can saturate the CPU without contention (the running goroutines don’t exceed the available hardware cores).
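As a rough sketch (not the actual testing package source), the worker count works out to something like this:

package benchmarkperf

import "runtime"

// goroutineCount sketches how RunParallel sizes its worker pool:
// GOMAXPROCS (controlled by the -cpu flag) multiplied by the
// parallelism value set via b.SetParallelism. The real calculation
// lives inside the testing package; this is only an illustration.
func goroutineCount(parallelism int) int {
	return parallelism * runtime.GOMAXPROCS(0)
}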
As we start to increase parallelism (which in turn increases our multiplier for the goroutine count), we see performance get even better. In fact, even with 128 goroutines running on 4 cores, doubling the number of goroutines roughly halves the ns/op metric reported by the benchmark. However, as pointed out above, we’re still sleeping for 1ms, so the function isn’t actually getting any faster.
So what’s actually going on!?
Let’s take a look at the benchmark results generation code:
func (r BenchmarkResult) String() string {
	mbs := r.mbPerSec()
	mb := ""
	if mbs != 0 {
		mb = fmt.Sprintf("\t%7.2f MB/s", mbs)
	}
	nsop := r.NsPerOp()
	ns := fmt.Sprintf("%10d ns/op", nsop)
	if r.N > 0 && nsop < 100 {
		// The format specifiers here make sure that
		// the ones digits line up for all three possible formats.
		if nsop < 10 {
			ns = fmt.Sprintf("%13.2f ns/op", float64(r.T.Nanoseconds())/float64(r.N))
		} else {
			ns = fmt.Sprintf("%12.1f ns/op", float64(r.T.Nanoseconds())/float64(r.N))
		}
	}
	return fmt.Sprintf("%8d\t%s%s", r.N, ns, mb)
}
What we can see here is that the duration of the test run (r.T.Nanoseconds()) is divided by the number of iterations run in the test (r.N). For a parallel benchmark, that calculation points in the wrong direction: r.T is wall-clock time, while the iterations counted in r.N were completed concurrently across many goroutines. What should be reported instead is the number of iterations completed over a fixed unit of time, giving us a result with the unit ops/s based on data we already have.
In the existing scenario, dividing wall-clock duration by concurrently completed operations gives us the inverse of a throughput, not the time each operation actually takes, so labeling it ns/op produces a number that’s relatively useless on its own. Seeing it decrease does show that increasing concurrency increases performance, but the value can only be used as a relative comparison, because, as shown above, a sleep of 1ms is never going to complete in less than that.
By instead providing an operations per second count (anchoring throughput to a fixed time duration), we can have a number that’s both useful and accurate.
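A minimal sketch of that calculation, using the iteration count and elapsed time the benchmark framework already records (the function name here is just for illustration):

package benchmarkperf

import "time"

// opsPerSec sketches the proposed throughput metric: the total number of
// completed iterations divided by the wall-clock duration of the run.
func opsPerSec(n int, elapsed time.Duration) float64 {
	return float64(n) / elapsed.Seconds()
}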
I’m personally also concerned with the timing of the operation. Because many of the functions I’m working on are called inline with client requests, I want to know how much elapsed time they require, because this directly impacts the client’s perception of performance. That can already be measured by running benchmarks in serial, but as discussed above, it’s a good idea to run benchmarks in parallel, especially if the code is already going to run concurrently.
So how do we figure that out?
We need to figure out thread time, which we can approximate fairly accurately by multiplying the elapsed wall-clock time by the number of running goroutines. For even more accuracy, we could track the duration inside each running goroutine (based on the start and completion time of each, in case some run noticeably longer than others). The simpler approximation shouldn’t skew the numbers too heavily, assuming the duration of the benchmark is reasonably long.
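A minimal sketch of that simpler approximation (the function name and signature are illustrative, not part of the go test changes themselves):

package benchmarkperf

import "time"

// nsPerOpThreadTime approximates per-operation latency for a parallel
// benchmark: treat the total processing ("thread") time as the elapsed
// wall-clock time multiplied by the number of running goroutines, then
// divide by the number of completed iterations. Tracking start and end
// times inside each goroutine would be more precise.
func nsPerOpThreadTime(elapsed time.Duration, goroutines, n int) int64 {
	threadTime := elapsed.Nanoseconds() * int64(goroutines)
	return threadTime / int64(n)
}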
So, I modified the go test tool to do just that, and here are the results I got:
goos: darwin
goarch: amd64
BenchmarkSleepSerial 1000 1337675 ns/op 747.57 ops/s 0 B/op 0 allocs/op
BenchmarkSleepSerial-2 1000 1387810 ns/op 720.56 ops/s 0 B/op 0 allocs/op
BenchmarkSleepSerial-4 1000 1385465 ns/op 721.78 ops/s 0 B/op 0 allocs/op
BenchmarkSleepSerial-8 1000 1388933 ns/op 719.98 ops/s 0 B/op 0 allocs/op
BenchmarkSleepSerial-16 1000 1401729 ns/op 713.40 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/1 1000 1370312 ns/op 729.76 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/1-2 2000 1410639 ns/op 1417.79 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/1-4 3000 1406812 ns/op 2843.24 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/1-8 10000 1407085 ns/op 5685.26 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/1-16 20000 1378449 ns/op 11606.96 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/2 2000 1395714 ns/op 1432.94 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/2-2 3000 1410110 ns/op 2836.61 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/2-4 10000 1269820 ns/op 6299.93 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/2-8 20000 1289376 ns/op 12408.16 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/2-16 30000 1288648 ns/op 24820.27 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/4 3000 1359702 ns/op 2941.75 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/4-2 10000 1403628 ns/op 5699.39 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/4-4 20000 1404574 ns/op 11390.93 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/4-8 30000 1404647 ns/op 22765.47 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/4-16 50000 1410323 ns/op 45219.82 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/8 10000 1312485 ns/op 6095.18 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/8-2 20000 1295381 ns/op 12351.18 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/8-4 30000 1302094 ns/op 24562.79 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/8-8 50000 1326515 ns/op 47992.22 ops/s 0 B/op 0 allocs/op
BenchmarkSleepParallel/8-16 100000 1303139 ns/op 97150.85 ops/s 0 B/op 0 allocs/op
PASS
ok command-line-arguments 64.395s
Now, rather than being derived from the total elapsed wall-clock time, the ns/op figure is based on the processing time spent running the benchmark across all goroutines. The output also adds the operations per second metric, which shows the throughput roughly doubling when we double the goroutine count.
Note that even as parallelization increases well beyond the available cores, we still see little performance overhead. This is because time.Sleep doesn’t block the CPU, which means that while some goroutines sleep, others can run, allowing us to hit higher CPU utilization and sustain a huge amount of function call throughput. In reality, you’ll probably be benchmarking things that do actual work, so you won’t be able to oversubscribe the CPU this hard, but asynchronous I/O calls, for example, can show similar behavior to a lesser extreme (assuming I/O isn’t the bottleneck).
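For example, a hypothetical parallel benchmark around an HTTP call (the URL below is a placeholder and isn’t part of this post) spends most of its time waiting on the network, so oversubscribing the CPU still increases throughput:

package benchmarkperf

import (
	"net/http"
	"testing"
)

// BenchmarkHTTPGetParallel is a hypothetical example of an I/O-bound
// parallel benchmark: while one goroutine waits on the network, others
// can run. The endpoint below is a placeholder.
func BenchmarkHTTPGetParallel(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			resp, err := http.Get("http://localhost:8080/healthz")
			if err != nil {
				b.Error(err)
				continue
			}
			resp.Body.Close()
		}
	})
}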
Mike Newswanger
Sr Software Engineer at Elastic; previously SRE at Stack Overflow — go, performance, kubernetes, containers, databases