How Go uses SIMD instructions

Last Update:2017-02-09 Source: Internet

Author: User


Java SIMD Lucene Elasticsearch

Let's start by looking at how Java uses the CPU's SIMD instructions. A Russian developer tried using SIMD instructions in Lucene to speed up decoding of Lucene's postings lists (that is, the list of document IDs that corresponds to a given term):

http://blog.griddynamics.com/2015/02/proposing-simd-codec-for-lucene.h ...
https://www.youtube.com/watch?v=2HQdbpgHfnQ&index=15&list=PLq-...

The most important conclusion is that Java cannot be made to emit SIMD instructions from its JIT (the machine code generated at run time). If you write the SIMD code in C or assembly and invoke it from Java, the overhead of JNI cancels out the benefit of SIMD. So in the end, the native code has to be reached in a much lower-level way:

http://stackoverflow.com/questions/24746776/what-does-a-jvm-have-to-do ...

It is worth mentioning that Elasticsearch has greatly strengthened its aggregations since 2.0 and has begun to support pipeline aggregations, so you can express something like select sum(money)/sum(users_count) from payment. Naturally, SIMD optimization could also be applied in the aggregation phase.

https://www.elastic.co/guide/en/elasticsearch/reference/master/search-...

Go CGO

Cgo is slow, and the reason is easy to see in the runtime source:

https://github.com/golang/go/blob/master/src/runtime/cgocall.go

Specifically, these are the lines:

    /*
     * Announce we are entering a system call
     * so that the scheduler knows to create another
     * M to run goroutines while we are in the
     * foreign code.
     *
     * The call to asmcgocall is guaranteed not to
     * split the stack and does not allocate memory,
     * so it is safe to call while "in a system call", outside
     * the $GOMAXPROCS accounting.
     */
    entersyscall(0)
    errno := asmcgocall(fn, arg)
    exitsyscall(0)

Every call into a C function is treated as if the function might block. entersyscall saves the stack information of the current thread. So Go's strategy resembles Java's with JNI: by making calls into foreign code very slow, it pushes users to write as much of their code as possible in Go.

Go Plan9 Assembly

Go has two compilers: gc (the Go compiler) and gccgo (which uses GCC as its back end). The gc compiler compiles Go code to Plan 9 assembly. Plan 9 assembly is not platform-independent: each platform has its own version, and its syntax also differs from that platform's native assembly syntax.
First, we can check whether the gc compiler generates SIMD instructions:

https://github.com/golang/go/blob/master/src/cmd/compile/internal/amd6 ...

As you can see, the list contains no SIMD instructions such as ADDPD. In other words, the gc compiler does not currently turn ordinary additions into vector additions. With Intel's compiler, if you lay your data out as a struct of arrays instead of an array of structs, the compiler can vectorize the code automatically. Clearly, the gc compiler has not made this an optimization goal.

https://software.intel.com/sites/default/files/8c/a9/CompilerAutovecto ...
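
The difference between the two layouts can be shown with a minimal sketch in Go. The type and function names (Particle, Particles, SumXAoS, SumXSoA, ToSoA) are hypothetical and do not come from the Intel guide. The gc compiler discussed here does not vectorize either loop; the sketch only shows the memory layout that an auto-vectorizing compiler would prefer.

```go
package main

import "fmt"

// Particle is one element of the array-of-structs (AoS) layout: the fields
// of a single record sit next to each other in memory.
type Particle struct {
	X, Y, Z float64
}

// Particles is the struct-of-arrays (SoA) layout: all X values are
// contiguous, then all Y values, then all Z values.
type Particles struct {
	X, Y, Z []float64
}

// SumXAoS adds up the X field, skipping over Y and Z on every step.
func SumXAoS(ps []Particle) float64 {
	var sum float64
	for i := range ps {
		sum += ps[i].X
	}
	return sum
}

// SumXSoA adds up one contiguous slice, the access pattern that lets a
// vectorizing compiler load several values with a single instruction.
func SumXSoA(ps Particles) float64 {
	var sum float64
	for _, x := range ps.X {
		sum += x
	}
	return sum
}

// ToSoA converts an AoS slice into the SoA layout.
func ToSoA(ps []Particle) Particles {
	out := Particles{
		X: make([]float64, len(ps)),
		Y: make([]float64, len(ps)),
		Z: make([]float64, len(ps)),
	}
	for i, p := range ps {
		out.X[i], out.Y[i], out.Z[i] = p.X, p.Y, p.Z
	}
	return out
}

func main() {
	aos := []Particle{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}
	soa := ToSoA(aos)
	fmt.Println(SumXAoS(aos), SumXSoA(soa)) // both sums are 12
}
```

Both functions compute the same value; only the stride between consecutive X values differs.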

Although the gc compiler does not generate SIMD instructions, its Plan 9 assembler does support the SIMD instructions of AMD64:

https://github.com/golang/go/blob/master/src/cmd/internal/obj/x86/asm6 ...

Among them is AADDPD (that is, ADDPD). Go also supports mixing Go code and Plan 9 assembly in the same project, so the gonum project wrote some routines in Plan 9 assembly to improve performance:

https://github.com/gonum/internal/blob/master/asm/ddot_amd64.s

I put together a simple benchmark:

    package main

    import "fmt"
    import "simd/asm"
    import "testing"

    func BenchmarkFunction(b *testing.B) {
        x := make([]float64, 10000)
        for i := 0; i < len(x); i++ {
            x[i] = float64(i)
        }
        y := make([]float64, 10000)
        for i := 0; i < len(y); i++ {
            y[i] = float64(i)
        }
        for i := 0; i < b.N; i++ {
            _ = asm.DdotUnitary(x, y)
        }
    }

    func main() {
        br := testing.Benchmark(BenchmarkFunction)
        fmt.Println(br)
    }

With the SIMD version of the dot product, the benchmark takes 4616 ns/op. With a non-SIMD version, it takes 12340 ns/op. Go does not currently support inlining Plan 9 assembly. That is, a function written in assembly pays the cost of a function call on every invocation, so it cannot be used the way SIMD intrinsics are. But it is still much better than Java ...
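
The article does not show the non-SIMD version that was measured. As a rough idea of what such a baseline could look like, here is a plain-Go dot product. The name DotScalar is hypothetical and this is not gonum's code; it is only a sketch of the scalar loop that the assembly routine replaces.

```go
package main

import "fmt"

// DotScalar returns the dot product of x and y using one multiply-add per
// loop iteration. It is an illustrative baseline, not the function that was
// benchmarked in the article.
func DotScalar(x, y []float64) float64 {
	if len(x) != len(y) {
		panic("DotScalar: slices must have equal length")
	}
	var sum float64
	for i, v := range x {
		sum += v * y[i]
	}
	return sum
}

func main() {
	// Same inputs as the benchmark: x[i] = y[i] = i for 10000 elements.
	n := 10000
	x := make([]float64, n)
	y := make([]float64, n)
	for i := range x {
		x[i] = float64(i)
		y[i] = float64(i)
	}
	fmt.Println(DotScalar(x, y))
}
```

Swapping a function like this in for asm.DdotUnitary in the benchmark loop is one way to compare the two versions on the same inputs.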

Gccgo

Go's other compiler, gccgo, offers an alternative to cgo: extern.

https://golang.org/doc/install/gccgo

With extern you can link any C code into Go code. As for the scheduler and the garbage collector, you are on your own. Even the details of type conversion are subject to change. You can think of it as cgo with the safety removed.

By this route, you can also link SIMD instructions into Go code:

http://stackoverflow.com/questions/2951028/is-it-possible-to-include-i ...

With gccgo it may also be possible to have these SIMD calls inlined at link time:

https://groups.google.com/forum/#!topic/golang-nuts/kGgkcOFCBtc
https://groups.google.com/forum/#!topic/golang-nuts/TqMTWdYGKOk

To quote one paragraph:

Answering specifically about gccgo.  Gccgo is of course just a frontend to GCC.  GCC can not inline functions written in pure assembly.  However, GCC provides CPU-specific builtin functions usable in C/C++ for many things that people want to do (e.g., vector instructions) and it also provides a sophisticated asm expression as a C/C++ extension.  This means that you can write your assembly code in extended C/C++ instead, and a function written that way can be inlined.

Summary

Go has three ways to call native code:

cgo
Plan 9 assembly
gccgo extern

That is more options than Java has with JNI. In the near future, Go may be able to overtake Java in the two areas of Spark and Lucene.
The Go 1.5 compiler is already written in Go. Perhaps one day the Go compiler will be able to generate vectorized code automatically, as the Intel compiler does.
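
To make concrete what automatic vectorization would have to do, the sketch below writes the same element-wise addition twice: once as an ordinary loop, and once processing four elements per iteration with a remainder loop. The function names are hypothetical, and four lanes is an arbitrary choice for illustration. The second version is still scalar Go code; it only shows the shape of the transformation that a vectorizing compiler would map onto instructions such as ADDPD.

```go
package main

import "fmt"

// AddScalar computes dst[i] = a[i] + b[i] one element at a time.
// a and b are assumed to be at least as long as dst.
func AddScalar(dst, a, b []float64) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

// AddUnrolled4 does the same work in groups of four independent additions.
// A vectorizing compiler would replace each group with vector instructions.
func AddUnrolled4(dst, a, b []float64) {
	n := len(dst)
	i := 0
	for ; i+4 <= n; i += 4 {
		dst[i] = a[i] + b[i]
		dst[i+1] = a[i+1] + b[i+1]
		dst[i+2] = a[i+2] + b[i+2]
		dst[i+3] = a[i+3] + b[i+3]
	}
	// Remainder loop for lengths that are not a multiple of four.
	for ; i < n; i++ {
		dst[i] = a[i] + b[i]
	}
}

func main() {
	a := []float64{1, 2, 3, 4, 5, 6}
	b := []float64{10, 20, 30, 40, 50, 60}
	s := make([]float64, len(a))
	u := make([]float64, len(a))
	AddScalar(s, a, b)
	AddUnrolled4(u, a, b)
	fmt.Println(s)
	fmt.Println(u)
}
```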

This article is an English version of an article originally published in Chinese on aliyun.com.
