Metal Parallel Computing and Command-Line Compilation of Metal Programs



CUDA used to work very well on the Mac, but for Apple's sake you had to give it up and switch to OpenCL. Just as OpenCL became familiar, WWDC 2018 announced Metal 2 and suggested abandoning OpenCL in favor of Metal Performance Shaders.
Apple is a "revolutionary" innovation company: much of its innovation completely discards the accumulation that came before. It constantly brings new capabilities, which makes it both loved and hated.

Here is an example of using Metal plus a shader to accelerate a large-scale data calculation on a Mac. The main program is written in Swift: it randomly generates a large integer array and then hands it to the GPU cores to sum in parallel.
For the Metal background, see the official documentation at https://developer.apple.com/metal/; this article focuses only on the compute part. The calculation itself is done by a shader subroutine (a kernel function). The shading language derives from C++14, so the data structures used for CPU-GPU communication must essentially be types that C can accept; the shading language can be understood as a subset of C++. The official advice is that a kernel may contain plenty of arithmetic, but should minimize work that forces the GPU cores to branch (such as logic statements), because that reduces speed. A single GPU core is usually not fast at computation; GPU acceleration comes from the GPU's many compute units working in parallel.
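As a purely illustrative sketch of the "C-compatible types" point, the hypothetical Swift declarations below (not part of the program further down) show the kind of fixed-size C interop types that can cross the CPU-GPU boundary, with their Metal-side counterparts noted in comments:

    // Swift side: fixed-width C interop types that a Metal kernel can also describe.
    typealias DataType = CInt          // Metal side: typedef int DataType;  (32-bit)
    var dataCount: CUnsignedInt = 0    // Metal side: const device uint& dataLength
    // Both are 4 bytes, so passing them with setBytes and MemoryLayout sizes is exact.
    print(MemoryLayout<DataType>.size, MemoryLayout<CUnsignedInt>.size)   // prints "4 4"
    // A native Swift type such as String, or a struct without a fixed C layout,
    // cannot be handed to a kernel this way.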
The parameter list of a kernel function usually contains at least three parts: the input, the output, and the ID of the invocation doing the work. The first two are easy to understand; the third exists because the kernel may run on any GPU core, so each invocation must use its unique ID to pick out the unique piece of work it is responsible for, which is what produces the concurrency. The kernel function must therefore be written for concurrency, with the data cut into blocks that can be computed independently.
Metal expresses this concurrency with two quantities: threadgroupsPerGrid, the number of thread groups, which is roughly tied to the number of GPU cores, and threadsPerThreadgroup, the batch size of each group, which should be a multiple of the pipeline's thread execution width. A CPU-side sketch of this mapping follows.
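To make the mapping concrete, here is a small standalone Swift sketch of the same arithmetic, using the count and elementsPerSum values from the program below and an illustrative thread execution width of 32 (the real value is read from the pipeline state at run time); it does not touch the GPU:

    let count = 10_000_000
    let elementsPerSum = 10_000
    // One partial sum per thread: ceil(count / elementsPerSum) results.
    let resultsCount = (count + elementsPerSum - 1) / elementsPerSum            // 1_000

    // Hypothetical thread execution width; the real program reads pipeline.threadExecutionWidth.
    let threadExecutionWidth = 32
    let groups = (resultsCount + threadExecutionWidth - 1) / threadExecutionWidth   // 32 groups
    // groups * threadExecutionWidth = 1_024 threads are launched, slightly more than
    // resultsCount; the extra threads simply find an empty range and do nothing.

    // What thread k does, expressed on the CPU:
    func rangeForThread(_ k: Int) -> Range<Int> {
        let start = min(k * elementsPerSum, count)
        let end = min(start + elementsPerSum, count)
        return start..<end
    }
    // Thread 0 sums data[0..<10_000], thread 1 sums data[10_000..<20_000], and so on.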
See the comments in the code for further details. The main program is named testCompute.swift.

import Metal
import Foundation

// Define the data set length; the program sums count random integers.
// Swift integer literals can use underscores, a distinctive alternative to scientific notation.
let count = 10_000_000
// Each GPU thread sums elementsPerSum elements into one partial result.
let elementsPerSum = 10_000

// The element type must be a C-compatible type, because the GPU runs a
// shading language derived from C++14.
typealias DataType = CInt

// The device, i.e. the GPU.
let device = MTLCreateSystemDefaultDevice()!

// Load default.metallib (the compiled shader) from the current directory and
// take the "parsum" kernel function from it.
let parsum = device.makeDefaultLibrary()!.makeFunction(name: "parsum")!
// If the shader library does not have the default name, a specific file can be loaded instead:
// let parsum = try! device.makeLibrary(filepath: "default.metallib").makeFunction(name: "parsum")!

// Set up the GPU compute pipeline.
let pipeline = try! device.makeComputePipelineState(function: parsum)

// Generate the random data set.
var data = (0..<count).map { _ in DataType(arc4random_uniform(100)) }
// The total element count passed to the kernel function; also C-compatible.
var dataCount = CUnsignedInt(count)
// The number of elements each thread sums, passed to the kernel; same as above.
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// The number of partial results returned.
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum

// Two buffers for communicating with the GPU: one is the kernel's input,
// the other receives the kernel's results.
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.size * count, options: [])    // our data, copied
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.size * resultsCount, options: [])        // individual results, zero-initialized
// The result comes back as a C pointer; convert it to a type Swift can read.
let results = UnsafeBufferPointer<DataType>(start: resultsBuffer!.contents().assumingMemoryBound(to: DataType.self), count: resultsCount)

// Build the GPU command queue.
let queue = device.makeCommandQueue()
// A GPU command buffer; typically several operations are placed in it and committed at once.
let cmds = queue!.makeCommandBuffer()!
// The command encoder packages one kernel call together with its parameters.
let encoder = cmds.makeComputeCommandEncoder()!

// Set the kernel function and its parameters; as noted above, they must be C-compatible types.
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)

// The number of thread groups in the grid.
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
// The number of threads per group; a multiple of the pipeline's thread execution width.
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
// Dispatch the work.
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
// All settings for this call are complete.
encoder.endEncoding()

var start, end: UInt64
var result: DataType = 0

start = mach_absolute_time()
// Actually commit the work, then wait for the GPU computation to finish.
cmds.commit()
cmds.waitUntilCompleted()
// The GPU produced a small number of partial sums; finish the reduction on the CPU.
for elem in results {
    result += elem
}
end = mach_absolute_time()

// Show the GPU result and the time spent.
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

result = 0

// Do the whole calculation once on the CPU, then show the result and the time spent.
start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()
print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
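One caveat on the timing code above: mach_absolute_time() returns ticks in Mach time units. On the Intel Mac used here the timebase is 1:1 with nanoseconds, so dividing by NSEC_PER_SEC works, but a portable version would scale by the timebase first. A minimal sketch of such a conversion, using a hypothetical helper named elapsedSeconds:

    import Darwin   // mach_absolute_time, mach_timebase_info

    // Convert a mach_absolute_time() interval to seconds, independent of the tick unit.
    func elapsedSeconds(from start: UInt64, to end: UInt64) -> Double {
        var timebase = mach_timebase_info_data_t()
        mach_timebase_info(&timebase)                 // numer/denom ratio from ticks to nanoseconds
        let nanos = (end - start) * UInt64(timebase.numer) / UInt64(timebase.denom)
        return Double(nanos) / 1_000_000_000.0
    }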

The shader program is named shader.metal.

// The data types must be the same as the ones defined in Swift.
#include <metal_stdlib>

typedef unsigned int uint;
typedef int DataType;

kernel void parsum(const device DataType* data      [[ buffer(0) ]],
                   const device uint& dataLength    [[ buffer(1) ]],
                   device DataType* sums            [[ buffer(2) ]],
                   const device uint& elementsPerSum [[ buffer(3) ]],
                   const uint tgPos  [[ threadgroup_position_in_grid ]],
                   const uint tPerTg [[ threads_per_threadgroup ]],
                   const uint tPos   [[ thread_position_in_threadgroup ]]) {
    // Compute the unique result index from the group index, the group size,
    // and the position within the group.
    uint resultIndex = tgPos * tPerTg + tPos;
    // The index where this thread's summation should begin...
    uint dataIndex = resultIndex * elementsPerSum;
    // ...and the index where it should end.
    uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength;
    // Sum this batch of data.
    for (; dataIndex < endIndex; dataIndex++)
        sums[resultIndex] += data[dataIndex];
}

Finally, here is the compile script used at the command line:

#!/bin/bash
xcrun metal -o shader.air shader.metal
xcrun metal-ar rcs shader.metal-ar shader.air
xcrun metallib -o default.metallib shader.metal-ar
swiftc testCompute.swift
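For orientation: xcrun metal compiles the shader source into an intermediate .air file, metal-ar packs that .air file into an archive, and metallib links the archive into default.metallib, the library that makeDefaultLibrary() picks up at run time; swiftc then builds the testCompute executable. Run the executable from the same directory as default.metallib so the compiled shader can be found.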

The results on my notebook are as follows:

Metal result: 495056208, time: 0.017362745
CPU result: 495056208, time: 1.210801891

As a rough and admittedly one-sided comparison (the CPU sum here is a plain single-threaded loop), the GPU computation is about 70 times faster than the CPU (1.2108 s / 0.0174 s ≈ 70).
Test environment:
MacBook Pro (13-inch, four Thunderbolt 3 Ports)
CPU: 3.1 GHz Intel Core i5
Graphics: Intel Iris Plus Graphics 650 1536 MB
Memory: 8 GB 2133 MHz LPDDR3
Xcode: 9.4.1

Resources:
https://stackoverflow.com/questions/38164634/compute-sum-of-array-values-in-parallel-with-metal-swift
