[Repost] CPU, GPU, TPU

Source: Internet
Author: User

Google I/O is Google's annual developer conference, focused on building applications with Google and open web technologies. It has been held every year since 2008, and this year's event was the ninth.

At this year's conference, Google announced eight products: the intelligent assistant Google Assistant; Google Home, a wireless speaker and voice-command device that competes with Amazon Echo; the messaging app Allo; the video-calling app Duo; the VR platform Daydream; Android Wear 2.0, with support for standalone apps; Android Instant Apps, which can be used without installation; and Google Play on Chrome OS.

These eight products are concentrated mainly in the software field.

At the end of the Google I/O 2016 keynote, Google CEO Sundar Pichai reviewed the company's recent achievements in AI and machine learning and announced a processor called the Tensor Processing Unit, or TPU. The conference presented only some performance figures, and subsequent blog posts described a few application scenarios; the processor's architecture and internal workings were not explained in detail. So we may need to start from some familiar processor architectures and try to guess what this dedicated machine-learning chip might look like.

First, let's look at the most familiar one: the Central Processing Unit (CPU). It is a very-large-scale integrated circuit and a general-purpose chip; that is, it can be used for many kinds of tasks. The processors in our everyday computers are basically CPUs, and they handle watching movies, listening to music, and running code without any problem.

Let's take a look at the CPU structure.

The CPU mainly consists of the arithmetic logic unit (ALU) and the control unit (CU), together with a number of registers, cache memory, and the buses that carry data, control, and status signals between them. From this description we can see that the CPU comprises arithmetic-logic components, register components, and control components.

As the names suggest, the arithmetic-logic components perform arithmetic operations, shifts, and similar operations, as well as address calculations and conversions; the registers store the data and instructions produced during execution; and the control components decode instructions and issue the control signals needed to carry out each instruction's operation.

The following figure shows the general flow of executing an instruction on the CPU:

The CPU fetches the instruction indicated by the program counter, sends it over the instruction bus to the decoder, and delivers the decoded instruction to the timing generator and operation controller; the arithmetic unit then performs the computation, and the result is stored in the data cache register via the data bus.

From the CPU's structure and execution process, we can see that the CPU follows the von Neumann architecture, whose core idea is the stored program, executed sequentially.
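
To make the stored-program idea concrete, here is a minimal sketch in Python of a toy machine with a hypothetical three-instruction set (the instruction names and register file are invented for illustration). It fetches the instruction at the program counter, decodes it, executes it, and moves on: exactly the sequential cycle described above.

```python
# A toy stored-program machine: a minimal sketch (hypothetical instruction
# set) of the fetch-decode-execute cycle. The program counter fetches an
# instruction, the dispatch below plays the role of the decoder, and
# results land back in registers -- one instruction at a time, in order.

def run(program):
    pc = 0                      # program counter
    regs = {"r0": 0, "r1": 0}   # register file
    while pc < len(program):
        op, *args = program[pc]         # fetch + decode
        if op == "load":                # load an immediate into a register
            regs[args[0]] = args[1]
        elif op == "add":               # add two registers into the first
            regs[args[0]] += regs[args[1]]
        elif op == "halt":
            break
        pc += 1                         # strictly sequential execution
    return regs

# 1 + 2, computed one instruction at a time
print(run([("load", "r0", 1), ("load", "r1", 2), ("add", "r0", "r1"), ("halt",)]))
# {'r0': 3, 'r1': 2}
```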

From the above description, we can see that the CPU is like a methodical butler: it does everything we ask, step by step. But as Moore's Law progressed and people demanded ever larger workloads and ever faster processing, executing one task at a time began to fall short. So people asked: could we put many processors on the same chip and let them work together, for much greater efficiency? Thus the GPU was born.

 

The birth of the GPU

GPU stands for Graphics Processing Unit. As its name suggests, the GPU was originally used to run graphics workloads on personal computers, workstations, game consoles, and mobile devices (such as tablets and smartphones). In image processing, every pixel in the image needs to be processed, which makes it a big-data problem; image processing was therefore the field most in need of computational acceleration, and the GPU came into being.

Comparing the CPU and GPU structures, we can see that the CPU has many function modules and can adapt to complex computing environments: most of its transistors go into control circuits (such as branch prediction) and caches, and only a small fraction performs actual computation. The GPU's architecture, by contrast, is relatively simple: its control logic is simple and it needs little cache, so most of its transistors can be devoted to specialized compute circuits and many pipelines. This gives the GPU a breakthrough in computing speed and much stronger floating-point throughput. Today's top-end CPUs have only 4 or 6 cores, simulating 8 or 12 processing threads, while even a mainstream GPU contains hundreds of processing units, and high-end parts even more; for the massively repetitive operations in multimedia computing, this is an inherent advantage.

It is as if, when painting a picture, the CPU draws with a single pen while the GPU uses many pens to fill in different areas at the same time; efficiency naturally rises dramatically.
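
To see the contrast in code, here is a minimal sketch (using NumPy as a stand-in for data-parallel hardware; the image size and the +50 brightening are arbitrary) of the same per-pixel operation done one pixel at a time versus expressed as a single whole-array operation:

```python
# One pen vs. many pens: the same per-pixel brightness adjustment done
# serially (CPU-style, one pixel at a time) and as one data-parallel
# operation (GPU-style). NumPy's vectorized kernel stands in for the
# hundreds of GPU processing units mentioned above.

import numpy as np

# uint16 leaves headroom for +50 without overflow
image = np.random.randint(0, 256, size=(240, 320), dtype=np.uint16)

# CPU-style: visit every pixel in sequence
brightened = image.copy()
for i in range(image.shape[0]):
    for j in range(image.shape[1]):
        brightened[i, j] = min(int(image[i, j]) + 50, 255)

# GPU-style: express the operation over the whole array; the hardware
# (or here, the vectorized library) applies it to many pixels at once
brightened_parallel = np.minimum(image + 50, 255)

assert (brightened == brightened_parallel).all()
```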

Although the GPU was created for image processing, we can see from the above that it has no components designed exclusively for images; it is essentially an optimized and adjusted version of the CPU structure. So today GPUs not only play a major role in image processing but are also used in scientific computing, password cracking, numerical analysis, massive data processing (sorting, MapReduce), financial analysis, and other fields that require large-scale parallel computing. The GPU, too, can therefore be regarded as a fairly general-purpose chip.

 

FPGA came into being

As computing needs became more and more specialized, people wanted chips that better matched their particular workloads. But hardware, once built, cannot be changed, so people began to ask: can we make a chip whose hardware is programmable? That is to say:

At one moment we want a hardware system better suited to image processing, and at the next moment one better suited to scientific computing, yet we do not want to solder two separate boards. It was at this point that the FPGA came into being.

FPGA is short for Field Programmable Gate Array. It emerged as a semi-custom circuit within the application-specific integrated circuit (ASIC) field, addressing the inflexibility of fully custom circuits while overcoming the limited gate counts of earlier programmable logic devices.

An FPGA's logic circuits are described in a hardware description language (Verilog or VHDL); after logic synthesis and place-and-route tools do their work, the design can be quickly burned onto the FPGA for testing. Users connect the logic blocks inside the FPGA through programmable interconnect as needed. It is like a breadboard packed into a chip: the logic blocks and connections of a factory-fresh FPGA can be changed to suit the designer's requirements, so the FPGA can implement whatever logic functions are needed.
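
For a feel of what "programmable hardware" means, here is an illustrative Python sketch of the look-up tables (LUTs) from which FPGA logic blocks are commonly built; the 4-entry truth tables play the role of configuration bits. This is a software analogy, not real HDL:

```python
# FPGAs are built from small look-up tables (LUTs) joined by programmable
# interconnect. Here the same 2-input "logic block" becomes an AND gate or
# an XOR gate just by loading a different truth table -- reconfiguring the
# hardware rather than rewiring a board.

def make_lut(truth_table):
    """Return a 2-input logic block configured by a 4-entry truth table."""
    def block(a, b):
        return truth_table[(a << 1) | b]   # the inputs index into the table
    return block

and_gate = make_lut([0, 0, 0, 1])   # configured as AND
xor_gate = make_lut([0, 1, 1, 0])   # same block, new bits: now XOR

for a in (0, 1):
    for b in (0, 1):
        assert and_gate(a, b) == (a & b)
        assert xor_gate(a, b) == (a ^ b)
```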

The FPGA's hardware programmability made it very popular from the moment it launched, and many ASICs (application-specific integrated circuits) were replaced by FPGAs. We should explain what an ASIC is: an integrated circuit tailored to a particular product, designed and manufactured for specific user requirements and a specific electronic system. The reason we mention it is that the TPU introduced below is also an ASIC.

FPGA and ASIC chips each have their shortcomings. An FPGA is generally slower than an ASIC, cannot accommodate the most complex designs, and consumes more power; an ASIC, on the other hand, is expensive to produce and uneconomical at small shipment volumes. But once demand grows and shipments rise, producing an application-specific circuit becomes the natural course of history, and I think this was also an important starting point for Google to build the Tensor Processing Unit. And so the TPU stepped onto the stage.

As machine learning algorithms are applied in more and more fields and show superior performance, in Street View, smart email reply, and voice search, for example, hardware support for them becomes ever more necessary. At present, most machine learning and image processing algorithms run on GPUs and FPGAs, but as described above, both are still relatively general-purpose chips, so their performance and power consumption cannot be matched closely to machine learning algorithms. Google has always believed that great software shines even brighter on great hardware, so it asked: can we build a dedicated chip for machine learning algorithms? Thus the TPU was born.

 

 

Google wants to build a dedicated chip for machine learning algorithms: the TPU

 

From the name, we can see that the TPU takes its inspiration from TensorFlow, Google's open-source deep learning framework. Correspondingly, the TPU is currently a chip used only inside Google.

Google has been running TPUs in its internal data centers for more than a year, and the performance figures are superb: roughly a seven-year leap in hardware performance, about three generations of Moore's Law. As for performance, the two biggest factors limiting processor speed are heat and logic gate delay, and heat is the dominant one. Most current processors use CMOS technology, which dissipates energy on every clock cycle, so the faster the clock, the greater the heat. The following figure shows the relationship between CPU clock frequency and power consumption; we can see that the growth is roughly exponential.
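
As a rough guide (a standard result from CMOS circuit analysis, not from Google's materials), the dynamic power dissipated by a CMOS circuit is commonly approximated as

P ≈ α · C · V² · f

where α is the switching activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Since higher frequencies generally also demand higher supply voltages, power grows far faster than linearly with clock speed, which is why the curve in the figure looks exponential.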

Looking at the TPU itself, a large piece of metal stands out in the middle of the chip; it is there to guarantee good heat dissipation during the TPU's high-speed operation.

The TPU's high performance also comes from its tolerance for reduced computational precision. Each TPU operation needs fewer transistors, so with the total number of transistors unchanged, we can run more operations per unit time and use more complex, more powerful machine learning algorithms to get smarter results, faster. We can also see connectors on the TPU board: Google inserts boards carrying TPUs into the hard-drive slots of its data center racks.
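
Here is a minimal sketch of what trading precision for speed can look like: 8-bit quantization of the kind low-precision hardware exploits. The scales and sizes are illustrative choices, not Google's actual scheme. Weights and activations are mapped to int8, multiplied and accumulated as integers, then rescaled; the answer is close to the float32 result, but each multiply needs far fewer transistors:

```python
# Illustrative int8 quantization: integer multiply-accumulate with a
# final rescale approximates the float32 matrix-vector product.

import numpy as np

def quantize(x, scale):
    """Map real values to int8 with a fixed scale (illustrative)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=(64, 64)).astype(np.float32)   # "weights"
a = rng.normal(scale=0.5, size=(64,)).astype(np.float32)      # "activations"

w_scale, a_scale = 0.02, 0.02
wq, aq = quantize(w, w_scale), quantize(a, a_scale)

# integer multiply-accumulate, widened to int32 to avoid overflow
acc = wq.astype(np.int32) @ aq.astype(np.int32)
approx = acc * (w_scale * a_scale)        # rescale back to real values

exact = w @ a
print(np.max(np.abs(approx - exact)))     # small quantization error
```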

In addition, I think the TPU's high performance also comes from data locality. A GPU spends considerable time fetching instructions and data from memory, but machine learning workloads, most of the time, do not need to fetch data from a global cache; a more localized design therefore also speeds up TPU computation.
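
The spirit of that reuse pattern can be sketched as follows. This is an illustrative analogy, not the TPU's actual microarchitecture (which Google had not detailed at the time): the weights are fetched from slow memory once and then reused across a whole batch of inputs, rather than being re-fetched for every input:

```python
# Data locality in spirit: one expensive fetch, many cheap reuses.

import numpy as np

def slow_fetch(weights_in_dram):
    """Stand-in for an expensive trip to external memory."""
    return weights_in_dram.copy()

dram_weights = np.random.rand(256, 256).astype(np.float32)
batch = np.random.rand(1000, 256).astype(np.float32)

# Poor locality: fetch the weights again for every single input
out_naive = np.stack([slow_fetch(dram_weights) @ x for x in batch])

# Good locality: fetch once, then a thousand reuses from local storage
local_weights = slow_fetch(dram_weights)
out_local = batch @ local_weights.T

assert np.allclose(out_naive, out_local, atol=1e-3)
```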

(The server rack containing TPUs, used in the AlphaGo vs. Lee Sedol match; somehow the Go artwork on its side looks rather cute. Via: googleblog.com)

Over the past year in Google's data centers, the TPU has actually done a great deal. There is RankBrain, the machine-learning system that helps Google process search queries and return more relevant results; there is Street View, where it improves the accuracy of maps and navigation; and of course there is the Go-playing program AlphaGo. This last one is an interesting case: the Nature paper describing AlphaGo shows it running on CPUs and GPUs only. The paper says the full version of AlphaGo used 40 search threads running on 48 CPUs and 8 GPUs, while the distributed version used more machines, with 40 search threads running on 1202 CPUs and 176 GPUs. That was the configuration used in the match against Fan Hui, and after watching that match, Lee Sedol was very confident going into his own man-machine match. But in just a few months, Google had switched AlphaGo's hardware platform to TPUs, and the match turned into a hard fight.

Beyond making machine learning algorithms run better and faster, what else might Google intend by announcing the TPU? I think Google may be playing a longer game.

Google says its goal is to lead the industry in machine learning and make this innovation benefit every user, letting users make better use of TensorFlow and Cloud Machine Learning. In fact, just as Microsoft equipped its HoloLens augmented-reality headset with a Holographic Processing Unit (HPU), specialized hardware like the TPU is only a small step in a far more ambitious journey, one that aims, among other things, to overtake the market leader Amazon Web Services (AWS) in the public cloud. Over time, Google will release more machine learning APIs; it has already launched the Cloud Machine Learning platform and a Vision API. We can reasonably believe that leadership in both machine learning technology and its market is Google's larger goal.

Refer to: https://www.leiphone.com/news/201605/xAiOZEWgoTn7MxEx.html

 
