Keith Randall (GitHub) is a principal software engineer at Google and works on the Go compiler. Last year he gave a talk on high-frequency trading with Go. Previously, he was a research scientist at Compaq's Systems Research Center (SRC) and a student in MIT's Supercomputing Technologies Group.
Today, he's talking about generating better machine code with Static Single Assignment (SSA). SSA is a technique used by most modern compilers to optimize generated machine code.
Go 1.5
The Go compiler was originally based on the Plan 9 C compiler, which is old. It was modified to compile Go instead of C. Later it was auto-translated from C to Go.
In the era of Go 1.5, Keith began looking through Go-generated assembly code with the aim of making things faster. He noticed a number of instances where he thought the generated assembly was more verbose than it needed to be.
Consider the following assembly code generated from Go 1.5:
MOVQ AX, BX
SHLQ $0x3, BX
MOVQ BX, 0x10(SP)
CALL runtime.memmove(SB)
Why is that first MOVQ there? Why not just:
SHLQ $0x3, AX
MOVQ AX, 0x10(SP)
CALL runtime.memmove(SB)
Another example: why do an expensive multiply operation:
IMULQ $0x10, R8, R8
instead of a shift operation, which is cheap:
SHLQ $0x4, R8
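The equivalence behind this substitution is easy to check in Go itself. The sketch below uses two made-up helper names, `mulBy16` and `shiftBy4`, to show that multiplying by 0x10 and shifting left by 4 give the same result for 64-bit integers, including when the result wraps around. That is why a compiler is free to pick the cheaper instruction.

```go
package main

import "fmt"

// mulBy16 multiplies by a power of two (0x10), as in the IMULQ example.
func mulBy16(x int64) int64 { return x * 16 }

// shiftBy4 computes the same value with a left shift, as in the SHLQ example.
func shiftBy4(x int64) int64 { return x << 4 }

func main() {
	// 1<<62 is included because multiplying it by 16 overflows int64;
	// both forms wrap around in the same way.
	for _, x := range []int64{0, 1, -3, 12345, 1 << 62} {
		fmt.Println(x, mulBy16(x) == shiftBy4(x))
	}
}
```

A current Go compiler may well emit a shift for both functions, so this only demonstrates that the two forms are interchangeable, not what any particular compiler version generates.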
Yet another example: writing a value to memory only to load the same value straight back into another register:
MOVQ R8, 0x20(CX)
MOVQ 0x20(CX), R9
Why not just:
MOVQ R8, 0x20(CX)
MOVQ R8, R9
After finding all these examples of inefficiencies, Keith felt bold enough to proclaim, "I think it would be fairly easy to make the generated programs 20% smaller and 10% faster". He admits those numbers were largely made up.
This was in February 2015. Keith wanted to move the Go compiler from a syntax-tree-based intermediate representation (IR) to a more modern SSA-based IR. With an SSA IR, he believed they could implement a lot of optimizations that were difficult to do in the existing compiler.
In February, the SSA proposal was mailed to golang-dev. Work subsequently began, and in Go 1.7 and Go 1.8, the work shipped for compiling to AMD64 and ARM respectively. Here are the performance improvements:
Go 1.7: AMD64
Go 1.8: ARM
The better performance shows up not only on the synthetic Go benchmarks (above), but also in the real world. Some benchmarks from the community:
- Big data workload: 15% improvement
- Convex hull: 14-24% improvement (from 1.5)
- Hash functions: 39% improvement
- Audio processing (ARM): 48% improvement
Does the compiler itself get slower or faster with SSA?
So obviously, we'd expect a speedup in programs that are compiled via the SSA IR. But generating the SSA IR is also more computationally expensive. The one program where both these things affect speed is the compiler itself. Compiler speed is very important. So with the SSA IR, does the compiler get faster or slower?
He asks the audience, "How many people think it gets faster? How many people think it gets slower?" A few more people think it gets faster.
Turns out, the ARM compiler is 10% faster. The compiler has more work to do to output SSA IR, but the compiler is now compiled with the new compiler and so is itself more optimized. For ARM, the speedup from the compiler binary being generated from SSA IR is larger than the slowdown from the additional computation that needs to be done to output the SSA IR.
The AMD64 compiler, on the other hand, is 10% slower. The extra work required by the SSA passes isn't fully offset by the speedup we get from compiling the compiler using SSA.
Why SSA?
A compiler translates a plaintext source file into an object file that contains assembly instructions:
Internally, the compiler has multiple components that translate the source into successive intermediate representations before finally outputting assembly:
All phases of the Go 1.5 compiler dealt in syntax trees as their internal representation, with the exception of the very last step, which emits assembly:
For this code snippet,
func f(a []int) {
	for i := 0; i < 10; i++ {
		a[i] = 0
	}
}
here's what the syntax tree looks like:
Here are the phases of the Go 1.5 compiler, all of which deal in syntax trees:
- Type checking
- Closure analysis
- Inlining
- Escape analysis
- Adding temporaries where needed
- Introducing runtime calls
- Code generation
In the Go 1.7 compiler, SSA replaces the old code generation phase of the compiler with successive SSA passes:
So, what does "SSA" actually mean? SSA stands for "Static Single Assignment", and it means each variable in the program has only one assignment in the text of the program. Dynamically, you can have multiple assignments (e.g., an increment variable in a loop), but statically, there is only one assignment. Here's a simple conversion from original source to SSA form:
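As an illustration (not the example shown in the talk), the sketch below writes the same computation twice in ordinary Go. The names `reassigned` and `singleAssign` are made up for this example. The first version reuses one variable. The second is renamed by hand in SSA style, so each name is assigned exactly once in the program text.

```go
package main

import "fmt"

// reassigned overwrites x several times, as ordinary Go code often does.
func reassigned(a, b int) int {
	x := a + b
	x = x * 2
	x = x - a
	return x
}

// singleAssign is the same computation written by hand in SSA style:
// every name (x1, x2, x3) is assigned exactly once in the program text.
func singleAssign(a, b int) int {
	x1 := a + b
	x2 := x1 * 2
	x3 := x2 - a
	return x3
}

func main() {
	fmt.Println(reassigned(3, 4), singleAssign(3, 4)) // prints "11 11"
}
```

Because each name has exactly one definition, any use of `x2` can only refer to `x1 * 2`. The optimizer never has to ask which assignment a given use sees.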
Sometimes, it's not as simple as the example above. Consider the case of an assignment within a conditional block. It's not clear how to translate this to SSA form. To solve this problem, we introduce a special notation, φ:
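One way to picture φ is as a value that selects between definitions depending on which path control flow took. The sketch below models that in plain Go. The `phi` helper and the function names are hypothetical and exist only for illustration. A real SSA form attaches φ to a block in the control flow graph rather than calling a function.

```go
package main

import "fmt"

// conditional assigns x on two different paths, so a plain renaming
// cannot say which definition the final use refers to.
func conditional(cond bool, a int) int {
	x := a
	if cond {
		x = a + 1
	}
	return x * 2
}

// phi is a hypothetical stand-in for the φ notation: it yields the value
// belonging to the control-flow path that was actually taken.
func phi(tookThen bool, thenVal, elseVal int) int {
	if tookThen {
		return thenVal
	}
	return elseVal
}

// conditionalSSA is a hand-written SSA-style version of conditional.
func conditionalSSA(cond bool, a int) int {
	x1 := a
	x2 := a + 1 // in real SSA this lives in the "then" block; computed eagerly here
	x3 := phi(cond, x2, x1)
	return x3 * 2
}

func main() {
	fmt.Println(conditional(true, 5), conditionalSSA(true, 5))
	fmt.Println(conditional(false, 5), conditionalSSA(false, 5))
}
```

Each of `x1`, `x2`, and `x3` still has a single assignment. The merge of the two paths is made explicit by `x3`.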
Here's the SSA representation embedded in a control flow graph:
Here's just the control flow graph.
The control flow graph represents the flow of logic in your code much better than a syntax tree (which just represents syntactic containment). The SSA control flow graph enables a bunch of optimization algorithms, including:
- Common subexpression elimination
- Dead code elimination
- Dead store elimination: get rid of store operations that are immediately overwritten
- Nil check elimination: can often statically prove some nil checks are unnecessary
- Bounds check elimination
- Register allocation
- Loop rotation
- Instruction scheduling
- and more!
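To make one item from the list above concrete, here is a minimal sketch of dead code elimination over a toy SSA-like representation. The `Value` type and `deadCodeElim` function are invented for this example and are far simpler than anything in the Go compiler. The point is that when every value has a single definition, "is this value needed?" reduces to reachability through arguments.

```go
package main

import "fmt"

// Value is a hypothetical, heavily simplified SSA value: an operation plus
// the values it uses as arguments. It is not the Go compiler's real type.
type Value struct {
	ID   int
	Op   string
	Args []*Value
}

// deadCodeElim keeps only the values reachable from the roots (for example,
// the value being returned) and drops everything else, preserving order.
func deadCodeElim(values []*Value, roots ...*Value) []*Value {
	live := make(map[*Value]bool)
	var mark func(v *Value)
	mark = func(v *Value) {
		if live[v] {
			return
		}
		live[v] = true
		for _, a := range v.Args {
			mark(a)
		}
	}
	for _, r := range roots {
		mark(r)
	}
	var kept []*Value
	for _, v := range values {
		if live[v] {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	a := &Value{ID: 1, Op: "Arg"}
	b := &Value{ID: 2, Op: "Arg"}
	sum := &Value{ID: 3, Op: "Add64", Args: []*Value{a, b}}
	unused := &Value{ID: 4, Op: "Mul64", Args: []*Value{a, b}} // result is never used
	ret := &Value{ID: 5, Op: "Return", Args: []*Value{sum}}

	// The Mul64 value is not reachable from ret, so it is dropped.
	for _, v := range deadCodeElim([]*Value{a, b, sum, unused, ret}, ret) {
		fmt.Println(v.ID, v.Op)
	}
}
```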
Consider the case of common subexpression elimination. If you're dealing with a syntax tree, it's not clear whether we can eliminate a subexpression in this example:
With SSA, however, it is clear. In fact, many optimizations can be reduced to simple (and not-so-simple) rewrite rules on the SSA form. Rules like:
(Mul64 x (Const64 [2])) -> (Add64 x x)
Here's a rewrite rule that lowers machine-independent operations to machine-dependent operations:
(Add64 x y) -> (ADDQ x y)
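The two rules above can be sketched as hand-written pattern matches over a toy value type. Everything here (`Value`, `rewriteMulBy2`, `lowerAdd64`) is hypothetical and written only to illustrate the idea. The real compiler generates its matching code from rule files rather than having it written by hand like this.

```go
package main

import "fmt"

// Value is a hypothetical miniature SSA value, not the compiler's real type.
type Value struct {
	Op     string
	AuxInt int64 // constant payload, used by Const64
	Args   []*Value
}

// rewriteMulBy2 applies (Mul64 x (Const64 [2])) -> (Add64 x x) in place
// and reports whether the rule matched.
func rewriteMulBy2(v *Value) bool {
	if v.Op != "Mul64" || len(v.Args) != 2 {
		return false
	}
	x, c := v.Args[0], v.Args[1]
	if c.Op != "Const64" || c.AuxInt != 2 {
		return false
	}
	v.Op = "Add64"
	v.Args = []*Value{x, x}
	return true
}

// lowerAdd64 applies the lowering rule (Add64 x y) -> (ADDQ x y) in place
// and reports whether the rule matched.
func lowerAdd64(v *Value) bool {
	if v.Op != "Add64" || len(v.Args) != 2 {
		return false
	}
	v.Op = "ADDQ"
	return true
}

func main() {
	x := &Value{Op: "Arg"}
	two := &Value{Op: "Const64", AuxInt: 2}
	v := &Value{Op: "Mul64", Args: []*Value{x, two}}

	fmt.Println(rewriteMulBy2(v), v.Op) // prints "true Add64"
	fmt.Println(lowerAdd64(v), v.Op)    // prints "true ADDQ"
}
```

Applying the machine-independent rule first and the lowering rule second mirrors the order described in the text: optimize on generic operations, then translate to architecture-specific opcodes.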
Rules can also be more complicated:
(ORQ
    s1:(SHLQconst [j1] x1:(MOVBload [i1] {s} p mem))
    or:(ORQ
        s0:(SHLQconst [j0] x0:(MOVBload [i0] {s} p mem))
        y))
  && i1 == i0+1
  && j1 == j0+8
  && j0 % 16 == 0
  && x0.Uses == 1
  && x1.Uses == 1
  && s0.Uses == 1
  && s1.Uses == 1
  && or.Uses == 1
  && mergePoint(b,x0,x1) != nil
  && clobber(x0)
  && clobber(x1)
  && clobber(s0)
  && clobber(s1)
  && clobber(or)
  -> @mergePoint(b,x0,x1) (ORQ (SHLQconst [j0] (MOVWload [i0] {s} p mem)) y)
This rule takes two 8-bit loads and replaces them with one 16-bit load if it can. The bulk of it describes the conditions under which such a translation can occur.
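At the source level, the kind of code this pattern comes from looks roughly like the sketch below: two adjacent byte loads joined with a shift and an OR. The function name `load16` is made up for this example. The program only shows that the two-load form equals a single 16-bit little-endian read. It does not demonstrate what any particular compiler version emits.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// load16 assembles a little-endian 16-bit value from two single-byte loads
// joined by a shift and an OR.
func load16(b []byte) uint16 {
	return uint16(b[0]) | uint16(b[1])<<8
}

func main() {
	buf := []byte{0x34, 0x12}
	// Both expressions read the same two bytes as one 16-bit value.
	fmt.Printf("%#x %#x\n", load16(buf), binary.LittleEndian.Uint16(buf)) // prints "0x1234 0x1234"
}
```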
Rewrite rules make it easy to carry optimizations over to new ports. Rules for most optimizations (e.g., common subexpression elimination, nil check elimination, etc.) are the same across architectures. The only rules that really need to change are the opcode lowering rules. It took a year to write the first SSA backend for AMD64. Subsequent backends for ARM, ARM64, MIPS, MIPS64, PPC64, S390X, and x86 took only 3 months.
The future
There's still potentially lots to do to improve the SSA implementation in the Go compiler:
- Alias analysis
- Store-load forwarding
- Better dead store removal
- Devirtualization
- Better register allocation
- Better code layout
- Better instruction scheduling
- Lifting loop-invariant code out of loops
They would like help creating better benchmarks against which to test. They are committed to releasing optimizations that observably benefit real-world use cases.