Recently a colleague asked: when iterating over a slice, is it worth hoisting the len call out of the loop condition, or does the Go compiler already take care of that? In other words, will

func g0(a []int) int {
    l := len(a)
    for i := 0; i < l; i++ {
    }
    return 1
}

run faster than

func g1(a []int) int {
    for i := 0; i < len(a); i++ {
    }
    return 1
}

(currently the Go compiler does not eliminate this empty loop)?
To settle the question, let's write a benchmark:
import "testing"

var a = make([]int, 1<<25) // a large slice; exact size assumed

func BenchmarkG0(b *testing.B) {
    for i := 0; i < b.N; i++ {
        g0(a)
    }
}

func BenchmarkG1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        g1(a)
    }
}
Then build the test binary and run it:

go test -c .
./len.test -test.bench=. -test.count=2
We get the following output:
goos: darwin
goarch: amd64
BenchmarkG0-4        100          11784627 ns/op
BenchmarkG0-4        100          11841061 ns/op
BenchmarkG1-4        100          18623122 ns/op
BenchmarkG1-4        100          17790754 ns/op
PASS
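As a side note, building a separate test binary is not required; assuming the standard Go toolchain, the same benchmarks can be run in one step, and -benchmem should report 0 B/op for both, confirming that neither function allocates:

go test -bench=. -count=2 -benchmem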
Sure enough, g0 is much faster than g1. But that is a bit counter-intuitive, and we should not draw a conclusion so easily. So let's look at the compiled output and see whether g0 really is optimized better than g1.
Let's disassemble the test binary:

go tool objdump len.test > main.s
Here are the relevant parts of main.s:
TEXT _/test/go/len.g0(SB) /test/go/len/main.go
  main.go:4   0x10ef150  488b442410          MOVQ 0x10(SP), AX
  main.go:4   0x10ef155  31c9                XORL CX, CX
  main.go:6   0x10ef157  eb03                JMP 0x10ef15c
  main.go:6   0x10ef159  48ffc1              INCQ CX
  main.go:6   0x10ef15c  4839c1              CMPQ AX, CX
  main.go:6   0x10ef15f  7cf8                JL 0x10ef159
  main.go:8   0x10ef161  48c744242001000000  MOVQ $0x1, 0x20(SP)
  main.go:8   0x10ef16a  c3                  RET
  :-1         0x10ef16b  cc                  INT $0x3
  :-1         0x10ef16c  cc                  INT $0x3
  :-1         0x10ef16d  cc                  INT $0x3
  :-1         0x10ef16e  cc                  INT $0x3
  :-1         0x10ef16f  cc                  INT $0x3

TEXT _/test/go/len.g1(SB) /test/go/len/main.go
  main.go:12  0x10ef170  488b442410          MOVQ 0x10(SP), AX
  main.go:12  0x10ef175  31c9                XORL CX, CX
  main.go:13  0x10ef177  eb03                JMP 0x10ef17c
  main.go:13  0x10ef179  48ffc1              INCQ CX
  main.go:13  0x10ef17c  4839c1              CMPQ AX, CX
  main.go:13  0x10ef17f  7cf8                JL 0x10ef179
  main.go:15  0x10ef181  48c744242001000000  MOVQ $0x1, 0x20(SP)
  main.go:15  0x10ef18a  c3                  RET
  :-1         0x10ef18b  cc                  INT $0x3
  :-1         0x10ef18c  cc                  INT $0x3
  :-1         0x10ef18d  cc                  INT $0x3
  :-1         0x10ef18e  cc                  INT $0x3
  :-1         0x10ef18f  cc                  INT $0x3
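As an aside, the same code can be inspected without building and disassembling a binary; a minimal sketch, assuming the standard toolchain, is to ask the compiler itself to print the assembly it generates (the listing is essentially the same, minus final addresses):

go tool compile -S main.go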
From the objdump output we can see that the code the compiler generates for the two functions is exactly the same, so why do they run at different speeds? We have to think about what, besides the code itself, can affect how the code executes. That is:
- Operating Environment
- Runtime
Let's verify these two factors separately.
First, the operating environment: we repeat the previous verification on linux.

GOOS=linux GOARCH=amd64 go test -c .
## copy len.test to a linux machine
./len.test -test.bench=. -test.count=2
We get the following output:
goos: linux
goarch: amd64
BenchmarkG0-32       100          10824437 ns/op
BenchmarkG0-32       100          10743979 ns/op
BenchmarkG1-32       100          10740347 ns/op
BenchmarkG1-32       100          10898047 ns/op
PASS
On linux, g0 and g1 perform the same. So what is different between linux and darwin? Quite a lot, and there is no way to compare the two systems point by point. But most of those differences are ultimately reflected in the runtime.
So let's compare at the runtime level. What in the runtime can affect how a program runs? At least the following (not an exhaustive list):
- Goroutine stack growth
- Goroutine scheduling
- IO / syscall / cgo
- GC
From the objdump output above, the generated code clearly involves none of the first three factors. So let's disable the GC and compare again:
import (
    "runtime/debug"
    "testing"
)

func init() {
    // Disable the garbage collector for the whole benchmark run.
    debug.SetGCPercent(-1)
}

var a = make([]int, 1<<25) // same slice as before; exact size assumed

func BenchmarkG0(b *testing.B) {
    for i := 0; i < b.N; i++ {
        g0(a)
    }
}

func BenchmarkG1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        g1(a)
    }
}
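Disabling the collector in code is not the only way; assuming the standard Go runtime, setting the GOGC environment variable achieves the same thing without touching the source. Either approach turns collection off entirely:

GOGC=off ./len.test -test.bench=. -test.count=2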
Then build and run it as before:

go test -c .
./len.test -test.bench=. -test.count=2
We get:
goos: darwin
goarch: amd64
BenchmarkG0-4        100          11521770 ns/op
BenchmarkG0-4        100          11310217 ns/op
BenchmarkG1-4        100          11562763 ns/op
BenchmarkG1-4        100          11590019 ns/op
PASS
You can see that with the GC disabled, g0 and g1 perform the same. Is it because g1 allocates memory dynamically while it runs? Judging from the objdump output above, obviously not. So the difference can only show up while the GC is running. But why does the GC happen to kick in while g1 is running? This is more subtle: runtime performance depends on many properties of the system, and the same code also behaves slightly differently on different operating systems. Perhaps there is some kind of pseudo-random factor inside the runtime? No conclusion yet.
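One way to dig further, as a sketch assuming the standard runtime, is to have the collector log every cycle while the benchmarks run and check whether collections really coincide with the g1 runs; GODEBUG=gctrace=1 prints one summary line per GC:

GODEBUG=gctrace=1 ./len.test -test.bench=. -test.count=2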
One more note: this GC effect has nothing to do with the position (order) of the benchmark functions.