At last week's Google I/O conference, Google officially announced the second-generation TPU, also known as the Cloud TPU or TPU2. However, Google did not introduce the new chip in any detail, showing only a few photos.
The Next Platform published an article today that uses Google's photos and the few details it has released to take a closer look at TPU2. QbitAI's compilation follows:
The first thing to say is that Google is unlikely to sell TPU chips, boards, or servers to the public; for now, TPU2 remains an internal-use-only product. Only a handful of people can access TPU2 hardware directly, through the TensorFlow Research Cloud (TRC), a "highly selective" program for researchers who, in return, are expected to share what they learn about the kinds of code TPU2 can accelerate. Google has also launched a Google Compute Engine Cloud TPU Alpha program, which we can assume is equally selective.
Google's primary purpose in designing TPU2 is surely to accelerate deep learning in its consumer-facing software, such as Search, Maps, voice recognition, and self-driving car research. A rough reading of the TRC is that Google wants to recruit outsiders to find other workloads suited to the TPU2 mesh.
Google says the TRC program, though small at first, will expand over time. If the research extends to general-purpose applications, others may gain direct access to TPU2 as well; Google would then add TensorFlow accelerator hardware instances to the public cloud of its Google Cloud Platform.
The TPU2 we see today owes much to the first-generation TPU announced at last year's Google I/O. That first-generation TPU was also a special-purpose chip for machine learning, and it sits behind the machine learning models used in AlphaGo, Search, Translate, Photos, and more. The first TPU connects to its host as a coprocessor through two PCI-E 3.0 x8 edge connectors (see the lower-left corner of the two photos below). It consumes up to 40 watts, within the PCI-E slot power budget, and delivers 92 trillion operations per second for 8-bit integer math, or 23 trillion operations per second for 16-bit integer math. For comparison, Google claims that TPU2 achieves 45 trillion floating-point operations per second at FP16 precision.
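A quick back-of-the-envelope comparison in Python, using only the throughput and power figures quoted above (all of them peak numbers claimed by Google, not measurements):

```python
# Peak figures quoted above for the first-generation TPU and TPU2.
TPU1_INT8_TOPS = 92      # trillion 8-bit integer ops/s
TPU1_INT16_TOPS = 23     # trillion 16-bit integer ops/s
TPU1_WATTS = 40          # reported power draw of the first-generation TPU
TPU2_FP16_TFLOPS = 45    # Google's claimed FP16 rate for TPU2

# First-generation efficiency at 16-bit precision: roughly 0.58 TOPS per watt.
tpu1_ops_per_watt = TPU1_INT16_TOPS / TPU1_WATTS

# TPU2 runs roughly twice as fast as TPU1 at 16-bit precision,
# while also moving from integer to floating-point math.
speedup_16bit = TPU2_FP16_TFLOPS / TPU1_INT16_TOPS

print(f"TPU1 peak rates: {TPU1_INT8_TOPS} TOPS int8, {TPU1_INT16_TOPS} TOPS int16")
print(f"TPU1 16-bit efficiency: {tpu1_ops_per_watt:.2f} TOPS/W")
print(f"TPU2 vs TPU1 16-bit rate: {speedup_16bit:.2f}x")
```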
The TPU has no built-in scheduling and cannot be virtualized; it is a simple matrix-multiply coprocessor attached directly to a server board. Google's first-generation TPU card: photo A without a heatsink; photo B with a heatsink.
Google has never disclosed how many TPUs can be attached to a single server board before the host's processing capacity or its PCI-E throughput becomes the bottleneck. A coprocessor that does only one thing demands a lot from its host processor, which must handle setup and teardown for every task and manage the data bandwidth to and from each TPU.
Google designed TPU2 for deployment in a four-rack unit that it calls a pod. A pod is a standard rack configuration for a given set of workloads (anywhere from half a rack to many racks), and it makes procurement, installation, and deployment easier and cheaper for large data center operators; Microsoft's Azure Stack standard half-rack, for example, is one such unit.
The size of the four-rack pod is determined mainly by the type of copper cable Google is using and the maximum length over which that cable can run at full speed. The figure below shows the pod's high-level organization.
The first thing we noticed is that Google connects each TPU2 board to a server processor board with two cables. It could be that Google connects each TPU2 board to two different processor boards instead, but even Google would probably not want to complicate the topology's installation, programming, and scheduling that way; a one-to-one connection between server board and TPU2 board is much simpler. Google's TPU2 pod: A is a CPU rack, B is a TPU2 rack, C is a TPU2 rack, D is a CPU rack; solid boxes (blue) are the uninterruptible power supply (UPS); dotted boxes (red) are the power supplies; dashed boxes (green) are the top-of-rack network switches and rack-level switches.
Google has shown three different photos of the TPU2 pod, and in all three the configuration and cabling look the same. The color coding of the TPU2 cables helps when comparing the photos.
Three Google TPU2 pods
Google also showed a top view of the TPU2 board and a close-up of the board's front-panel connectors. All four quadrants of the TPU2 board share the same power distribution. We believe the four board quadrants also share network connectivity through a simple network switch; each quadrant appears to be a separate subsystem, and the four subsystems are otherwise not connected to one another. Top view of the TPU2 board: A is the four TPU2 chips with heatsinks; B is two BlueLink 25 GB/s cables per TPU2; C is two Omni-Path Architecture (OPA) cables; D is the board power connector; E is probably a network switch.
The front-panel connectors look like a QSFP-style network interface, but one I have not seen anywhere else. The IBM BlueLink specification defines eight 25 Gb/s lanes in each direction (16 lanes in total) as its minimum 200 Gb/s configuration, called a "sub-link". Google is a member of OpenCAPI and a founding member of the OpenPOWER Foundation, so it would be reasonable for it to use the BlueLink specification. TPU2 front-panel connectors
The two connectors in the center of the front panel look like QSFP interfaces cabled with copper rather than fiber. That suggests one of two network configurations: 10 Gb/s Ethernet or 100 Gb/s Intel OPA. Two 100 Gb/s OPA links can be ganged to provide 25 GB/s of aggregate bidirectional bandwidth, which matches the BlueLink speed, so we think Google is using 100 Gb/s OPA connections.
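A one-line sanity check of that bandwidth match, assuming nothing beyond the unit conversion from gigabits to gigabytes:

```python
# Check that two ganged 100 Gb/s OPA links match the 25 GB/s BlueLink figure.
opa_link_gbps = 100                 # one Omni-Path link, gigabits per second
ganged_gbps = 2 * opa_link_gbps     # two links ganged together = 200 Gb/s
ganged_GBps = ganged_gbps / 8       # 8 bits per byte -> gigabytes per second

print(f"Two OPA links: {ganged_GBps:.0f} GB/s")   # 25 GB/s, matching BlueLink
```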
However, to avoid signal attenuation problems, these copper BlueLink or OPA cables cannot be longer than about 3 meters, which means the CPU boards and the TPU2 boards cannot be more than 3 meters apart. Google uses color-coded cables, I suspect to make the wiring easier and less error-prone; as you can see, the sticker below each front-panel connector matches the color of its cable. We believe the color coding hints that Google plans to deploy these TPU2 pods at a larger scale.
The white cables are most likely 1 Gb/s Ethernet used for system management. In the photos we cannot see how Google connects the management network to the TPU2 boards, but judging from the routing of the white cables, Google appears to connect the management network to the processor boards from the back of the rack. Perhaps the processor boards manage the TPU2 boards over the OPA connection and assess their health that way.
Google's TPU2 pod is mirror-symmetric. In the image below, we flipped the photo of CPU rack D and compared it with CPU rack A: the two racks look identical, simply mirrored. In the picture below that, you can see that racks B and C mirror each other as well.
Comparing two TPU2 racks
Google's photos do not show enough connection information to determine the exact network topology between the boards. But this is likely to be a very complex mesh network.
We believe the CPU boards are standard Intel Xeon dual-socket motherboards that fit Google's 1.5-inch server rack unit. They are current-generation motherboards; given the OPA support, they may be Skylake boards (see the power discussion below). We suspect dual-socket simply because I have not heard of any vendor in Intel's supply chain shipping single-socket motherboards in volume. That will change as AMD's "Naples" Epyc X86 server chips and Qualcomm's Centriq Arm server chips, both of which emphasize single-socket configurations, come to market.
We believe Google uses two OPA cables to connect each CPU board to exactly one TPU2 board, for 25 GB/s of aggregate bandwidth. This one-to-one pairing answers a key question about TPU2: in the pod design, Google pairs TPU2 chips with Xeon sockets at a 2:1 ratio, that is, four TPU2 chips per dual-socket Xeon server.
That is a much tighter coupling between accelerator and processor than the 4:1 or 6:1 ratios typically used with GPU accelerators in deep learning. The 2:1 ratio suggests Google has kept the philosophy behind the first-generation TPU: "the TPU is closer in spirit to an FPU (floating-point unit) than it is to a GPU." In Google's TPU2 architecture, the processors do a great deal of the work and offload all of the matrix math to the TPU2.
We do not see any storage in the TPU2 pods. That is presumably what the large bundles of blue optical fiber in the photo below are for: the data center network connects to the CPU boards, no fiber runs to racks B and C, and there are no network connections on the TPU2 boards. A lot of fiber bandwidth connects back to the rest of Google's data center.
Each rack, whether CPU or TPU2, holds 32 compute units (boards). A pod therefore has 64 CPU boards and 64 TPU2 boards, for a total of 128 CPU chips and 256 TPU2 chips.
Google says the TRC contains 1,000 TPU2 chips, but that figure has been rounded down: four pods contain 1,024 TPU2 chips. Four pods are therefore a lower bound on how many TPU2 chips Google has deployed. In the photos shown at Google I/O you can see three pods, possibly four.
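A small tally of those counts, taking the 32-boards-per-rack figure at face value:

```python
# Boards and chips in one TPU2 pod, plus the number of pods implied by
# Google's "1,000 TPU2 chips" claim for the TensorFlow Research Cloud.
boards_per_rack = 32
cpu_racks, tpu_racks = 2, 2            # racks A/D are CPU, racks B/C are TPU2

cpu_boards = cpu_racks * boards_per_rack   # 64 dual-socket Xeon boards
tpu_boards = tpu_racks * boards_per_rack   # 64 TPU2 boards

cpu_chips = cpu_boards * 2     # 2 Xeon sockets per board -> 128 CPU chips
tpu2_chips = tpu_boards * 4    # 4 TPU2 chips per board    -> 256 TPU2 chips

pods_for_trc = 1024 // tpu2_chips      # 1,024 chips rounds down to "1,000"
print(cpu_chips, tpu2_chips, pods_for_trc)   # 128 256 4
```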
It is not yet clear how the CPUs and TPU2 chips within a pod are associated so that the TPU2 chips can share data efficiently over the mesh. We are fairly certain that the TRC cannot run a single task that spans pods (beyond the 256 TPU2 chips of one pod). The first-generation TPU is a simple coprocessor, so the CPU handles all data traffic; in this architecture, the CPUs reach remote storage over the data center network.
Google has not described the pod's memory model. Can the TPU2 chips use remote direct memory access (RDMA) over OPA to load their own data from memory on the processor boards? It seems likely.
The CPU boards can probably do the same across the pod, creating a large shared memory pool. That shared pool would not be as fast as the memory pool in Hewlett Packard Enterprise's shared-memory system prototype, but with 25 GB/s links it would not be slow either, and its capacity sits in the double-digit terabyte range (16 GB per DIMM, 8 DIMMs per processor, two processors per board, and 64 boards yields 16 TB of RAM).
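Here is that memory arithmetic written out, using the DIMM configuration assumed in the parenthetical above:

```python
# Rough size of the speculated pod-wide shared memory pool, based on the
# article's assumed DIMM configuration (16 GB DIMMs, 8 per socket).
gb_per_dimm = 16
dimms_per_socket = 8
sockets_per_board = 2
cpu_boards_per_pod = 64

ram_per_board_gb = gb_per_dimm * dimms_per_socket * sockets_per_board  # 256 GB
pod_ram_tb = ram_per_board_gb * cpu_boards_per_pod / 1024              # 16 TB
print(f"{pod_ram_tb:.0f} TB of RAM per pod")
```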
We speculate that scheduling a task that needs multiple TPU2 chips within a single pod looks something like this (a rough code sketch follows the list):
A pool of processors keeps track of the pod's mesh topology and which TPU2 chips are available to run tasks.
The processor group programs each TPU2 chip so that the mesh links between the TPU2 chips are set up explicitly.
Each processor board loads the data and instructions onto the four TPU2 chips on its paired TPU2 board, including the flow control of the mesh interconnect.
The processors synchronize the task kickoff across the interconnected TPU2 chips.
When a task completes, the processors collect the result data from the TPU2 chips (the data may already have been moved to the global memory pool via RDMA) and mark the TPU2 chips as available for another task.
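A minimal, hypothetical simulation of this speculated host-driven flow. Nothing here is a real Google API; the classes and method names are invented purely to illustrate the assumed division of labor between the CPU boards and their paired TPU2 boards.

```python
# Hypothetical sketch of the speculated scheduling loop; all names are invented.
from dataclasses import dataclass, field


@dataclass
class TPU2Chip:
    chip_id: int
    busy: bool = False
    mesh_links: list = field(default_factory=list)


@dataclass
class Pod:
    chips: list                      # 256 TPU2Chip objects in a real pod

    def allocate(self, n):
        # 1. The processor pool tracks which TPU2 chips are free.
        free = [c for c in self.chips if not c.busy][:n]
        for c in free:
            c.busy = True
        return free

    def run_job(self, n_chips, task):
        chips = self.allocate(n_chips)
        # 2. Program each chip to link the mesh between chips explicitly.
        for c in chips:
            c.mesh_links = [other.chip_id for other in chips if other is not c]
        # 3/4. Paired CPU boards load data and instructions, then kick off the
        #      task in sync across the interconnected chips (modeled as calls).
        results = [task(c.chip_id) for c in chips]
        # 5. Collect results and mark the chips as available again.
        for c in chips:
            c.busy = False
        return results


pod = Pod(chips=[TPU2Chip(i) for i in range(256)])
print(pod.run_job(8, task=lambda chip_id: chip_id * chip_id))
```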
The advantage of this approach is that the TPU2 chips never need to know about multitasking, virtualization, or multitenancy; all of that is handled by the CPUs across the pod.
It also means that if Google wants to offer Cloud TPU instances as IaaS under its Google Cloud custom machine types, those instances will have to include both processors and TPU2 chips.
It is not yet clear whether workloads can scale across pods while keeping the mesh's low latency and high throughput. While researchers can access some of the TRC's 1,024 TPU2 chips, scaling a computation across the whole TRC looks challenging. Researchers may be able to connect to clusters of up to 256 TPU2 chips, which is still impressive given that cloud GPU connectivity currently scales to 32 interconnected devices.
Google's first-generation TPU consumes 40 watts while performing 16-bit integer matrix multiplies at 23 TOPS. TPU2 raises that rate to 45 TFLOPS, roughly doubling it, while also increasing computational complexity by moving to 16-bit floating-point operations. A rough rule of thumb says power consumption doubles for each of those changes: double once for the higher rate and double again for the upgrade to FP16, so the TPU2 should draw at least 160 watts.
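The rule of thumb written out, with both doubling factors treated as assumptions rather than measurements:

```python
# The article's rough power rule of thumb: doubling the 16-bit rate and
# moving from integer to floating-point math are each assumed to double power.
tpu1_watts = 40
rate_doubling = 2        # 23 TOPS int16 -> 45 TFLOPS FP16 is roughly 2x
fp16_penalty = 2         # assumed cost of going from int16 to FP16 math

tpu2_watts_min = tpu1_watts * rate_doubling * fp16_penalty
print(f"TPU2 lower-bound estimate: {tpu2_watts_min} W")   # 160 W
```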
Judging from the size of the heatsinks, TPU2's power consumption may be even higher, perhaps above 200 watts.
The TPU2 board carries huge finned heatsinks on top of the TPU2 chips, the tallest air-cooled heatsinks I have seen in years, and they appear to contain sealed internal liquid-cooling loops as well. In the illustration below, we compare the TPU2 heatsinks to the largest heatsinks we have seen in recent months: A is the TPU2 board in profile with its four chips; B is a dual IBM Power9 Zaius motherboard; C is a dual IBM Power8 "Minsky" motherboard; D is a dual Intel Xeon Facebook "Yosemite" motherboard; and E is an Nvidia P100 SXM2 module with heatsink on a Facebook "Big Basin" motherboard.
These heatsink sizes are all shouting "over 200 watts." They are clearly much larger than the heatsink on the 40-watt previous-generation TPU, and they stand roughly two rack units tall, close to 3 inches (a Google rack unit is 1.5 inches, slightly shorter than the industry-standard 1.75-inch U).
Where does the added power consumption come from?

One likely contributor is memory: we speculate that the TPU2 chip's memory capacity has grown, which helps throughput but also increases power consumption.
In addition, Google has moved from a single TPU powered by a PCI-E slot to a TPU2 board whose four chips share dual OPA ports and a switch, plus two dedicated BlueLink ports per TPU2 chip. OPA and BlueLink both add to board-level power consumption.
Google's Open Compute Project rack specification shows 6 kW, 12 kW, and 20 kW power delivery profiles, with the 20 kW profile supporting 90-watt CPUs. We speculate that racks A and D use 20 kW power delivery, assuming Skylake-generation Xeon processors and given that the TPU2 chips handle most of the computational load.
Racks B and C are different. A 30 kW power delivery per rack would provide 200 watts per TPU2 socket, and 36 kW per rack would provide 250 watts per TPU2 socket; 36 kW is a common high-performance-computing power delivery specification. A power budget of up to 250 watts per chip would also explain the enormous heatsinks Google fitted to TPU2. That puts a single TPU2 pod's power delivery somewhere between 100 kW and 112 kW, and probably closer to the higher number.
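The pod and TRC power estimates, written out under the rack power options quoted above (the 30/36 kW figures for racks B and C are the speculative part):

```python
# Speculated pod and TRC power budgets from Google's OCP rack power options.
cpu_rack_kw = 20                 # racks A and D (Xeon boards)
tpu_rack_kw = (30, 36)           # racks B and C (TPU2 boards), two options

pod_kw = [2 * cpu_rack_kw + 2 * kw for kw in tpu_rack_kw]
print(f"Per-pod power delivery: {pod_kw[0]}-{pod_kw[1]} kW")    # 100-112 kW

trc_kw = [4 * kw for kw in pod_kw]       # four pods in the TRC
print(f"TRC at full load: {trc_kw[0]}-{trc_kw[1]} kW")          # ~400-450 kW
```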
That means the TRC draws nearly 500 kilowatts when running at full capacity. Although the four pods are expensive to deploy, that is a one-time capital cost, and they do not occupy much data center floor space. Continuously funding 500 kilowatts of power for academic research, however, is not a trivial expense even for a company of Google's scale. If the TRC is still running a year from now, it will indicate that Google is serious about finding new use cases for TPU2.
A TPU2 pod contains 256 TPU2 chips. At 45 TFLOPS per chip, each pod delivers a total of 11.5 petaflops of deep learning accelerator performance; that is peak FP16 performance, but impressive nonetheless. Deep learning training usually requires higher precision, so FP32 matrix multiply performance is probably a quarter of the FP16 figure, or roughly 2.9 petaflops per pod and about 11.5 petaflops of FP32 across the TRC.
In terms of peak performance, that works out to between roughly 100 and 115 gigaflops per watt across a pod for FP16 operations (not counting the CPUs' performance contribution or any storage outside the pod).
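These throughput and efficiency figures follow from the per-chip claim and the speculated power budget; a short check, where the FP32 quarter-rate factor and the 100-112 kW pod power range are the assumptions carried over from above:

```python
# Pod-level peak throughput and efficiency implied by the figures above.
chips_per_pod = 256
fp16_tflops_per_chip = 45        # Google's claimed per-chip peak

pod_fp16_pflops = chips_per_pod * fp16_tflops_per_chip / 1000   # ~11.5 PFLOPS
pod_fp32_pflops = pod_fp16_pflops / 4     # assumed FP32 at 1/4 the FP16 rate
trc_fp32_pflops = 4 * pod_fp32_pflops     # four pods in the TRC

print(f"Per pod: {pod_fp16_pflops:.1f} PFLOPS FP16, ~{pod_fp32_pflops:.1f} PFLOPS FP32")
print(f"TRC FP32 total: ~{trc_fp32_pflops:.1f} PFLOPS")

# Peak FP16 operations per watt across a pod, for the 100-112 kW power range.
for pod_kw in (100, 112):
    gflops_per_watt = pod_fp16_pflops * 1e6 / (pod_kw * 1000)
    print(f"{pod_kw} kW pod: {gflops_per_watt:.0f} GFLOPS/W peak FP16")
```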
Once Intel releases core counts and power figures for the dual-socket Skylake Xeons, it will be possible to compute the FP16 and FP32 performance of the Xeon processors and fold it into the overall performance-per-watt figures.
There is not enough information about Google's TPU2 pod to compare it meaningfully with commercial products such as Nvidia's Volta. The architectures are simply too different to compare without benchmarks. Comparing FP16 peak performance alone is like comparing two PCs with different processors, memory, storage, and graphics cards on CPU frequency alone.
That said, we think the real competition is not at the chip level. The real challenge is scaling out these accelerators. Nvidia is taking its first steps with NVLink, pursuing greater independence of its chips from the CPU, and is extending its software infrastructure and workloads from single GPUs to GPU clusters.
When Google launched its first-generation TPU, it chose to use it as a coprocessor tied to the CPU; with TPU2 it has only moved to a 2:1 accelerator-to-processor ratio. However, the TPU2 programming model does not yet appear to have many workload types that scale well, and Google is looking for third-party help to find workloads that can scale on the TPU2 architecture.