Background
Go has a built-in map type, and one of its important hashing algorithms is a CityHash variant.
At the same time, to defend against hash collision (hash flooding) attacks and to speed up hashing, Keith Randall added a hardware-accelerated aeshash algorithm for x86-64 back in Go 1.0. Searching the internet, I was surprised to find that this algorithm seems to be implemented only in Go — which seemed absolutely impossible.
That is how I found an optimization opportunity in the ARM64 Go runtime: ARM64 also provides AES hardware-acceleration instructions, but Go was not using them.
A smile crept onto my face, and full of joy I got ready to add the code. Little did I know what monsters lurked beneath this seemingly calm sea...
Start
To beat a snake, strike seven inches below the head — go straight for the vitals, which here means reading the implementation first. The initialization code is in runtime/alg.go:
```go
if (GOARCH == "386" || GOARCH == "amd64") &&
	GOOS != "nacl" &&
	support_aes && // AESENC
	support_ssse3 && // PSHUFB
	support_sse41 { // PINSR{D,Q}
	useAeshash = true
	algarray[alg_MEM32].hash = aeshash32
	algarray[alg_MEM64].hash = aeshash64
	algarray[alg_STRING].hash = aeshashstr
	// Initialize with random data so hash collisions will be hard to engineer.
	getRandomData(aeskeysched[:])
	return
}
```
As you can see, swapping the hash functions in algarray for the aeshash variants is all it takes to enable the acceleration. The code clearly anticipates other platforms following later — it felt like a gracious invitation to me to finish the job.
Let's take a look at the simplest aeshash64 implementation.
```
// func aeshash64(p unsafe.Pointer, h uintptr) uintptr
TEXT runtime·aeshash64(SB),NOSPLIT,$0-24
	MOVQ	p+0(FP), AX	// ptr to data
	MOVQ	h+8(FP), X0	// seed
	PINSRQ	$1, (AX), X0	// data
	AESENC	runtime·aeskeysched+0(SB), X0
	AESENC	runtime·aeskeysched+16(SB), X0
	AESENC	runtime·aeskeysched+32(SB), X0
	MOVQ	X0, ret+16(FP)
	RET
```
The comments make it clear: AX holds the data pointer, and the map's seed and the data are loaded into X0 for the computation. One note: every hashmap is assigned a random seed at initialization time, so an attacker cannot recover a single system-wide seed and mount a hash collision attack. The next few steps encrypt the data with runtime·aeskeysched, a random key generated at program startup, and return the result.
The more complex aeshash variants differ only in loading data of various lengths before the same computation.
At this point I could only sigh: this is too simple. It took about two weekends to write the general code — but I also ran into the following issues:
- Platform Differences
- Smhasher
- Collision
- Go compiler bug
Platform Differences
The first problem: ARM64 has no single equivalent of x86's AESENC; the work is split across two instructions, AESE and AESMC.
First, a bit of AES background. A standard AES encryption round consists of 4 steps:
- SubBytes
- ShiftRows
- MixColumns
- AddRoundKey
The x86 AESENC instruction performs:
- SubBytes
- ShiftRows
- MixColumns
- AddRoundKey (data XOR key, last)
But... the ARM64 AESE instruction performs:
- AddRoundKey (data XOR key, first)
- SubBytes
- ShiftRows
(MixColumns is left to the separate AESMC instruction.)
So if you simply imitate the x86 idiom — the x86 code freely writes AESENC X0, X0 — the ARM64 equivalent, AESE V0.B16, V0.B16, XORs the data with itself as its first step and wipes it to zero... Helpless, I had to find another way: encrypt the data with the system's random seed. The idea in code:
```
// Load the system seed into V1
// Then load the seed and data into V2
AESE	V1.B16, V2.B16
```
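To make the ordering difference concrete, here is a pure-Go sketch of the two round flavors — my own illustration transcribed from FIPS-197, not the runtime code. The S-box is derived from the GF(2^8) inverse plus the affine transform instead of a hardcoded table, and MixColumns is omitted for brevity (it sits before the final XOR on x86 and does not change the ordering argument):

```go
package main

import "fmt"

// state is the 16-byte AES state, column-major as in FIPS-197.
type state [16]byte

func xor16(a, b state) state {
	for i := range a {
		a[i] ^= b[i]
	}
	return a
}

// gmul multiplies in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1.
func gmul(a, b byte) byte {
	var p byte
	for ; b > 0; b >>= 1 {
		if b&1 == 1 {
			p ^= a
		}
		hi := a & 0x80
		a <<= 1
		if hi != 0 {
			a ^= 0x1b
		}
	}
	return p
}

// sbox finds the multiplicative inverse by brute force, then applies the
// affine transform b ^ rotl(b,1) ^ rotl(b,2) ^ rotl(b,3) ^ rotl(b,4) ^ 0x63.
func sbox(x byte) byte {
	var inv byte
	for c := 1; c < 256; c++ {
		if gmul(x, byte(c)) == 1 {
			inv = byte(c)
			break
		}
	}
	r := inv
	for i := 0; i < 4; i++ {
		inv = inv<<1 | inv>>7 // rotate left by one
		r ^= inv
	}
	return r ^ 0x63
}

func subBytes(s state) state {
	for i, b := range s {
		s[i] = sbox(b)
	}
	return s
}

func shiftRows(s state) state {
	var r state
	for c := 0; c < 4; c++ {
		for row := 0; row < 4; row++ {
			r[4*c+row] = s[4*((c+row)%4)+row] // row n rotates left by n
		}
	}
	return r
}

// aesenc models the x86 ordering: round steps first, key XOR last.
func aesenc(s, key state) state {
	return xor16(shiftRows(subBytes(s)), key)
}

// aese models the ARM64 ordering: key XOR first; AESMC would supply
// MixColumns afterwards.
func aese(s, key state) state {
	return shiftRows(subBytes(xor16(s, key)))
}

func main() {
	d := state{1, 2, 3, 4, 5, 6, 7, 8}
	// With the data register used as its own key, AESE zeroes the state
	// before substitution, so every input collapses to the same value.
	fmt.Println(aese(d, d) == aese(state{}, state{})) // true
}
```

This is exactly the trap: on ARM64 the "data XOR key" happens before any mixing, so `AESE V0.B16, V0.B16` destroys the input, while the x86 `AESENC X0, X0` pattern survives because its XOR comes last.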
Smhasher & Collision
With that problem solved, time to test. Go uses SMHasher; a hash function must pass the following tests:
- Sanity: the entire key must be processed
- AppendedZeros: keys padded with zeros, at different lengths
- SmallKeys: all combinations of small keys (< 3 bytes)
- Cyclic: repeated patterns, for example 11211 → 11121
- Sparse: keys with only a few bits set, for example 0b00001 and 0b00100
- Permutation: blocks combined in every possible order
- Avalanche: flip each input bit
- Windowed: for example, with a 32-bit hash value, keys that differ only within a 20-bit window must still hash differently
- Seed: changes to the seed must affect the result
Every time SMHasher reported an error, I went looking for the problem in my code; it was usually a low-level mistake like a misplaced register... Go also tests bucket distribution in maps: if one bucket receives too many entries, the test fails. Still, this part went relatively smoothly.
Go compiler bug
With that fixed, I found another pit. To save instructions, I wanted to move data from a general-purpose register directly into an ARM64 vector lane. But the Go assembler could not encode different lane indexes correctly. For example, the following two instructions assembled to identical machine code...
```
VMOV	R1, V2.D[0]
VMOV	R1, V2.D[1]
```
I had to report a bug first:
cmd/asm: wrong implement VMOV/VLD on arm64
Then I worked around it with raw instruction bytes. If you want to know how, skip straight to the Tips section.
Monster
Finally, all the runtime SMHasher and hash tests passed, and I started trying to build Go with src/all.bash.
Then the monster at the bottom of the sea dragged me under...
Build Log
```
$ ./all.bash
Building Go cmd/dist using /usr/lib/go-1.6.
Building Go toolchain1 using /usr/lib/go-1.6.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
# runtime
duplicate type: hashfunc.struct { "".full "".lfstack; "".empty "".lfstack; "".pad0 [64]uint8; "".wbufSpans struct { "".lock "".mutex; "".free "".mSpanList; "".busy "".mSpanList }; _ uint32; "".bytesMarked uint64; "".markrootNext uint32; "".markrootJobs uint32; "".nproc uint32; "".tstart int64; "".nwait uint32; "".ndone uint32; "".alldone "".note; "".helperDrainBlock bool; "".nFlushCacheRoots int; "".nDataRoots int; "".nBSSRoots int; "".nSpanRoots int; "".nStackRoots int; "".markrootDone bool; "".startSema uint32; "".markDoneSema uint32; "".bgMarkReady "".note; "".bgMarkDone uint32; "".mode "".gcMode; "".userForced bool; "".totaltime int64; "".initialHeapLive uint64; "".assistQueue struct { "".lock "".mutex; "".head "".guintptr; "".tail "".guintptr }; "".sweepWaiters struct { "".lock "".mutex; "".head "".guintptr }; "".cycles uint32; "".stwprocs int32; "".maxprocs int32; "".tSweepTerm int64; "".tMark int64; "".tMarkTerm int64; "".tEnd int64; "".pauseNS int64; "".pauseStart int64; "".heap0 uint64; "".heap1 uint64; "".heap2 uint64; "".heapGoal uint64 }
build failed
```
Wait... the compiler build fails??? But all the tests passed. I ran all.bash again — and the failure showed up in a different place???
Setting GDB breakpoints in the asm_arm64.s aeshash code I wrote and tracing the execution, even long-string (>129 bytes) hashing showed no problem. A compiler bug, then?
So I started to follow what the compiler does and found that only the symbol table uses the hash-map machinery. After studying compiler fundamentals and the Go implementation, I realized the symbol table only calls aeshashstr (string hashing). I converted all the aeshash calls in SMHasher to aeshashstr — and it still miraculously passed the tests! A manual check showed that even aeshash32 and aeshash64 behaved identically to the x86 implementation, results included — and the build still failed!
So I asked about this weird problem everywhere — by e-mail, in forum posts, in chat groups. No solution.
A month of spare time went by like this. I read essentially all the relevant compiler code, and found that two clearly different symbols were somehow being treated as the same. Exasperated, I finally asked around for anyone willing to help debug, and my attitude annoyed quite a few people. This bug was doing my head in.
Out of the Pit
It was not until recently that I realized the SMHasher tests do not cover every case. Sure enough, a careful examination of the aeshash129plus section turned up:
```
SUBS	$1, R1, R1
BHS	aesloop
```
SUBS subtracts and updates the condition flags, and BHS branches while the carry flag is set (no unsigned borrow) — so the loop exits one iteration too late. But SMHasher only fully exercises keys up to 128 bytes, so the tests passed while the build failed: the bug is only triggered for keys longer than 256 bytes.
The fix:
```
SUB	$1, R1, R1
CBNZ	R1, aesloop
```
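To see the off-by-one concretely, here is a Go model of the two loop-control schemes — my own sketch of the flag semantics described above, not the assembly itself, with hypothetical names loopSubsBhs and loopSubCbnz. Both expect an intended iteration count n >= 1 in the counter register:

```go
package main

import "fmt"

// loopSubsBhs models SUBS $1, R1, R1 followed by BHS aesloop.
// SUBS sets the carry flag when no unsigned borrow occurs (R1 >= 1
// before the subtraction), and BHS keeps looping while carry is set —
// so the body runs once more after the counter has already hit zero.
func loopSubsBhs(n uint64) (iters int) {
	for {
		iters++
		noBorrow := n >= 1
		n-- // SUBS $1, R1, R1 (wraps past zero on the final pass)
		if !noBorrow {
			return // BHS falls through only after a borrow
		}
	}
}

// loopSubCbnz models SUB $1, R1, R1 followed by CBNZ R1, aesloop:
// loop again only while the decremented counter is nonzero.
func loopSubCbnz(n uint64) (iters int) {
	for {
		iters++
		n--
		if n == 0 {
			return
		}
	}
}

func main() {
	for _, n := range []uint64{1, 2, 3} {
		fmt.Printf("n=%d: SUBS/BHS runs %d times, SUB/CBNZ runs %d times\n",
			n, loopSubsBhs(n), loopSubCbnz(n))
	}
}
```

The SUBS/BHS version always runs one extra iteration, trampling past the end of the data — invisible until the key is long enough for the loop count to matter.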
With that, I could finally submit the CL:
runtime: implement aeshash for arm64 platform
Note that using the PRFM (prefetch) instruction can add roughly another 30-40MB/s (on Hash1024). Alignment and caching may be the focus of the next round of optimization.
```
name       old speed      new speed      delta
Hash5      97.0MB/s ± 0%  97.0MB/s ± 0%  -0.03%  (p=0.008 n=5+5)
Hash16      329MB/s ± 0%   329MB/s ± 0%  ~       (p=0.302 n=4+5)
Hash64      858MB/s ±20%   890MB/s ±11%  ~       (p=0.841 n=5+5)
Hash1024   3.50GB/s ±16%  3.57GB/s ± 7%  ~       (p=0.690 n=5+5)
Hash65536  4.54GB/s ± 1%  4.57GB/s ± 0%  ~       (p=0.310 n=5+5)
```
Tips
How do you generate native ARM64 instruction bytes with the GNU assembler?
```
$ cat vld.s
ld1 {v2.b}[14], [x0]
$ as -o vld.o -al vld.lis vld.s
AARCH64 GAS  vld.s			page 1
   1 0000 0218404D	ld1 {v2.b}[14], [x0]
```
The third column is the generated instruction bytes; copy them into the Go assembly as a WORD directive:
WORD $0x4D401802
There is actually a tool for this, asm2plan9s, but at the time there was no way to build it for ARM64.
Thanks
Finally, many thanks to
- Wei Xiao
- Fangming
- Keith Randall
for their meticulous help.