An In-Depth Look at Google's TPU: Understanding Its Internal Workings, and Why It Outperforms the GPU


Search, Street View, Photos, Translate, and many other Google services use Google's TPU (Tensor Processing Unit) to accelerate the neural network computations behind them.

Google's first TPU on a PCB, and TPUs deployed in a data center

Google launched the TPU last year and recently published a detailed study of the chip's performance and architecture. The short conclusion: compared with contemporary CPUs and GPUs, the TPU delivers 15-30x higher performance and 30-80x better efficiency (performance per watt).

This means Google's services can run state-of-the-art neural networks at scale while keeping costs at an acceptable level. The sections below dig into the technology inside the Google TPU and discuss how it achieves such performance.

The road to the TPU

As early as 2006, Google considered building a dedicated integrated circuit (ASIC) for neural networks. The need became pressing in 2013, when Google realized that fast-growing compute demand might require doubling the number of its data centers.

ASIC development normally takes several years. For the TPU, however, it took only 15 months from design through verification and construction to deployment in data centers.

The TPU ASIC is manufactured on a 28nm process, runs at 700 MHz, and consumes 40 W. To deploy TPUs into existing servers as quickly as possible, Google chose to package the chip as an external accelerator card that fits into a SATA hard-disk slot. The TPU connects to its host over a PCIe Gen3 x16 bus, giving an effective bandwidth of 12.5 GB/s.

Predicting with neural networks

To explain the design of the TPU, we first need to introduce how a neural network computes.

This is an example from TensorFlow Playground. A trained neural network can be used to classify data with labels, estimate missing data, or infer future data. For inference, each neuron in a neural network performs the following computation:

Multiply the input data (x) by the weights (W) to represent signal strength

Sum the products into a single value representing the neuron's state

Apply an activation function (f), such as ReLU or sigmoid, to modulate the neuron's output

A neural network multiplies input data by the weight matrix and applies an activation function

For example, a single-layer neural network with three inputs and two fully connected neurons requires six multiplications between the inputs and the weights, plus two sums of the resulting products. This sequence of multiplications and additions can be written as a matrix multiplication, whose output is then passed through the activation function, as sketched in the code below.
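To make the arithmetic concrete, here is a minimal sketch of that single-layer example in Python with NumPy. The input and weight values are made up purely for illustration.

```python
import numpy as np

# Three inputs and a 3x2 weight matrix: one column per neuron.
# The numbers are placeholders chosen only for illustration.
x = np.array([0.5, -1.0, 2.0])           # input data (x)
W = np.array([[0.1, 0.4],
              [0.2, 0.5],
              [0.3, 0.6]])               # weights (W)

# Six multiplications (3 inputs x 2 neurons), then two sums,
# expressed as a single matrix multiplication.
z = x @ W                                 # shape (2,): one value per neuron

# Apply an activation function f, here ReLU.
y = np.maximum(z, 0.0)

print(z, y)
```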

In more complex neural network architectures, matrix multiplication is usually the most computationally intensive part.

How many multiplications are needed in real production workloads? In July 2016, the Google team surveyed six representative neural network applications in production, summarized in the following table:

As the table shows, the number of weights in each network ranges from 5 million to 100 million. Every prediction requires many steps of multiplying input data by a weight matrix and applying an activation function.

In short, the amount of computation is enormous. As a first optimization, Google applied a technique called quantization, replacing the 32-bit or 16-bit floating-point operations that would run on a CPU or GPU with integer operations. This reduces both the memory footprint and the compute resources required.

Quantization in neural networks

Generally speaking, neural network inference does not need 32-bit or 16-bit floating-point precision; with suitable methods, 8-bit integers can be used for prediction while maintaining adequate accuracy.

Quantization is an optimization technique that uses 8-bit integers to approximate arbitrary values lying between a preset minimum and maximum.
Quantization in TensorFlow

Quantization is a powerful tool for reducing the cost of neural networks, and the reduction in memory footprint matters especially for mobile and embedded deployments. For example, after quantization, the Inception image-recognition model shrinks from 91 MB to 23 MB, roughly a three-quarters reduction.
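As a rough illustration of the idea (a simplified sketch, not TensorFlow's actual quantization implementation), the following Python snippet maps floating-point weights linearly onto 8-bit integers between their minimum and maximum values:

```python
import numpy as np

def quantize_uint8(values, v_min, v_max):
    """Linearly map floats in [v_min, v_max] onto the 256 levels of uint8."""
    scale = (v_max - v_min) / 255.0
    q = np.round((values - v_min) / scale)
    return np.clip(q, 0, 255).astype(np.uint8), scale

def dequantize(q, v_min, scale):
    """Recover an approximation of the original floating-point values."""
    return q.astype(np.float32) * scale + v_min

weights = np.array([-0.71, 0.03, 0.42, 1.30], dtype=np.float32)
q, scale = quantize_uint8(weights, weights.min(), weights.max())
print(q)                                     # 8-bit representation
print(dequantize(q, weights.min(), scale))   # close to the original weights
```

Each weight now takes one byte instead of four, at the cost of a small, bounded approximation error.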

Using integer rather than floating-point computation greatly reduces the TPU's hardware footprint and power consumption. A single TPU contains 65,536 8-bit integer multipliers, whereas the mainstream GPUs used in cloud environments typically contain a few thousand 32-bit floating-point multipliers. As long as 8-bit precision meets the accuracy requirements, this can bring more than a 25x gain in throughput.

RISC, CISC, and the TPU instruction set

Programmability was another important design goal. The TPU was not designed to run just one particular neural network, but to accelerate many different types of models.

Most contemporary CPUs adopt a reduced instruction set (RISC). Google instead chose a complex instruction set (CISC) style as the basis for the TPU instruction set, focusing on instructions that each perform a more complex task.

Let's look at the TPU structure.

The TPU includes the following computational resources:

Matrix Multiply Unit (MXU): 65,536 8-bit multiply-and-add units that run the matrix computations

Unified Buffer (UB): 24 MB of SRAM that serves as registers

Activation Unit (AU): hardwired activation functions

To orchestrate computation across the MXU, UB, and AU, Google defined a dozen or so high-level instructions designed specifically for neural network inference. Here are five examples.

In short, the TPU's design encapsulates the essence of neural network computation and can be programmed for a wide variety of neural network models. To make it programmable, Google also built a compiler and software stack that translate API calls from a TensorFlow graph into TPU instructions.
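To give a feel for what coarse-grained, CISC-style instructions mean in practice, here is a hypothetical Python sketch of the kind of sequence a host program might issue to such an accelerator. The class and method names are invented for illustration and are not the TPU's actual instruction set or API; a NumPy mock stands in for the hardware.

```python
import numpy as np

class MockAccelerator:
    """A toy, NumPy-backed stand-in for a CISC-style inference accelerator.

    Each method models one coarse-grained instruction; the names are
    invented for illustration and do not correspond to the real TPU ISA.
    """
    def __init__(self):
        self._weights = None
        self._buffer = None

    def write_input_buffer(self, x):   # move activations into on-chip memory
        self._buffer = np.asarray(x, dtype=np.float32)

    def load_weights(self, w):         # stage one layer's weight matrix
        self._weights = np.asarray(w, dtype=np.float32)

    def matrix_multiply(self):         # one instruction performs a whole matmul
        self._buffer = self._buffer @ self._weights
        return self._buffer

    def activate(self, fn="relu"):     # hardwired activation unit
        if fn == "relu":
            self._buffer = np.maximum(self._buffer, 0.0)
        return self._buffer

    def read_output_buffer(self):      # copy results back to the host
        return self._buffer

acc = MockAccelerator()
acc.write_input_buffer([1.0, 2.0, 0.5])
acc.load_weights(np.array([[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]))
acc.matrix_multiply()
print(acc.activate("relu"))
```

The point is the granularity: each call does the work of many thousands of scalar instructions on a conventional processor.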

From TensorFlow to TPU: the software stack

Parallel computation in the matrix multiplication unit

A typical RISC processor provides instructions for simple computations such as a multiplication or an addition. These are called scalar processors, because each instruction processes a single operation, that is, a scalar operation.

Even at gigahertz clock frequencies, a CPU still takes a long time to work through a large matrix computation as a series of scalar operations. An improvement is the vector operation, which performs the same operation on many data elements at once.

A GPU's streaming multiprocessor (SM) is effectively a vector processor that can handle hundreds to thousands of operations in a single clock cycle.

For the TPU, Google designed the MXU as a matrix processor that handles hundreds of thousands of operations, that is, entire matrix operations, in a single clock cycle.
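One rough way to see the progression from scalar to vector to matrix processing is to compare, in Python with NumPy, how much work a single "step" covers at each level. This is a conceptual sketch of granularity, not how any of these chips are actually programmed.

```python
import numpy as np

N = 64                                    # kept small so the scalar loop stays fast
a = np.random.rand(N, N).astype(np.float32)
b = np.random.rand(N, N).astype(np.float32)

# Scalar processing: one multiply-add per step (N**3 = 262,144 steps here).
c_scalar = np.zeros((N, N), dtype=np.float32)
for i in range(N):
    for j in range(N):
        for k in range(N):
            c_scalar[i, j] += a[i, k] * b[k, j]

# Vector processing: each step handles a whole row/column pair (N**2 = 4,096 steps).
c_vector = np.zeros((N, N), dtype=np.float32)
for i in range(N):
    for j in range(N):
        c_vector[i, j] = np.dot(a[i, :], b[:, j])

# Matrix processing: the entire multiplication is expressed as one operation.
c_matrix = a @ b

print(np.allclose(c_scalar, c_matrix, atol=1e-3),
      np.allclose(c_vector, c_matrix, atol=1e-3))
```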

The core of the TPU: the systolic array

The MXU has a different architecture from traditional CPUs and GPUs, called a systolic array. It is called "systolic" because data flows through the chip in waves, much as the heart pumps blood.

As the figure shows, a CPU or GPU must access multiple registers for every operation, whereas the TPU's systolic array chains many arithmetic logic units (ALUs) together, reusing the result of a single register read.

The systolic array in the MXU is optimized for matrix multiplication and is not suited to general-purpose computation.

In the systolic array, an input vector is multiplied by the weight matrix.

In the systolic array, an input matrix is multiplied by the weight matrix.

The MXU's systolic array contains 256 x 256 = 65,536 ALUs, which means the TPU can perform 65,536 multiply-and-add operations on 8-bit integers every cycle.
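The following is a greatly simplified, non-cycle-accurate Python sketch of the systolic idea: each multiply-accumulate cell passes its partial sum directly to the next cell rather than writing intermediate results back to a register file. The real MXU is a 256x256 pipelined grid; this sketch only shows the chained-accumulation dataflow for one small input vector.

```python
import numpy as np

def mac_cell(weight, x_in, psum_in):
    """One cell of the array: multiply its stationary weight by the incoming
    activation and add the partial sum received from the neighbouring cell."""
    return psum_in + weight * x_in

def systolic_column(weights_col, x):
    """A chain of MAC cells: the partial sum pulses from cell to cell instead
    of being written back to and re-read from a register file."""
    psum = 0.0
    for w, xi in zip(weights_col, x):
        psum = mac_cell(w, xi, psum)
    return psum

def systolic_matmul(x, W):
    """Simplified weight-stationary matrix unit: one chained column of cells
    per output value."""
    return np.array([systolic_column(W[:, j], x) for j in range(W.shape[1])])

x = np.array([1.0, 2.0, 3.0])
W = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])
print(systolic_matmul(x, W))   # matches x @ W
print(x @ W)
```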

The TPU runs at 700 MHz, so it can perform 65,536 x 700,000,000 = 46 x 10^12 multiply-and-add operations per second, or 92 trillion (92 x 10^12) operations per second in the matrix unit when each multiply and add is counted separately.
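A quick back-of-the-envelope check of those peak-throughput figures:

```python
alus = 256 * 256           # 65,536 multiply-add units in the MXU
clock_hz = 700_000_000     # 700 MHz

macs_per_second = alus * clock_hz      # ~4.6e13 multiply-adds per second
ops_per_second = 2 * macs_per_second   # ~9.2e13 when multiply and add count separately

print(f"{macs_per_second:.2e} MACs/s, {ops_per_second:.2e} ops/s")
```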

The MXU inside the TPU

Comparing the number of arithmetic operations per cycle on the CPU, GPU, and TPU:

CPU: a few
CPU (vector extension): dozens
GPU: tens of thousands
TPU: hundreds of thousands

This CISC-based matrix operation design achieves an outstanding performance-per-watt ratio: the TPU's performance per watt is 83x that of a contemporary CPU and 29x that of a contemporary GPU.

Minimalist & Deterministic Design

Minimalism is discussed on page 8 of the TPU paper Google released earlier. Compared with a CPU or GPU, the single-purpose TPU is a single-threaded chip that never has to worry about caching, branch prediction, or multiprocessing.

The simplicity of the TPU's design is visible in its die floor plan:

Yellow marks the compute units; blue is data; green is I/O; red is control logic.

Compared with CPUs and GPUs, the TPU's control unit is smaller and easier to design, occupying only 2% of the die and leaving more room for on-chip memory and compute units. The TPU die is also only about half the size of the other chips; a smaller die means lower cost and higher yield.

Determinism is another advantage of the single-purpose design. CPUs and GPUs must optimize performance across many kinds of tasks, so they accumulate ever more complex mechanisms, and a side effect is that their behavior becomes hard to predict.

With the TPU, by contrast, it is easy to predict how long running a neural network and producing a prediction will take, so the chip can run near peak throughput while keeping latency under tight control.

Take the MLP0 workload mentioned above: the TPU delivers 15-30x the throughput of the CPU and GPU while keeping latency within 7 milliseconds.

MLP0 predictions per second on various processors

Below is a performance comparison of the TPU, CPU, and GPU on six kinds of neural networks. On CNN1 the TPU's advantage is most striking, reaching 71x the CPU's performance.

Summary

As described above, the secret of the TPU's performance is its single-minded focus on neural network inference, which makes design choices such as quantization, a CISC instruction set, a matrix processor, and a minimalist layout possible.

Neural networks are driving a shift in computing patterns, and Google expects the TPU to remain a fast, smart, and affordable chip in the coming years.

Originally published on the Google Cloud blog.

Authors:
Kaz Sato, Staff Developer Advocate, Google Cloud
Cliff Young, Software Engineer, Google Brain
David Patterson, Distinguished Engineer, Google Brain
