160413. Neural network processor

Source: Internet
Author: User
Tags: svm, xeon e5

Chen Yunji

http://novel.ict.ac.cn/ychen/
Chen Yunji, male, born in 1983 in Nanchang, Jiangxi, is a researcher and doctoral supervisor at the Institute of Computing Technology, Chinese Academy of Sciences (CAS). He also serves as a distinguished researcher at the CAS Center for Excellence in Brain Science and as an adjunct professor at the University of Chinese Academy of Sciences. He currently leads his laboratory in developing the Cambricon series of deep learning processors. Before that, he worked on domestic processor development for more than ten years, leading or participating in the design of several Godson (Loongson) processors. He has published more than 60 papers in academic conferences and journals including ISCA, HPCA, MICRO, ASPLOS, ICSE, ISSCC, Hot Chips, IJCAI, FPGA, SPAA, IEEE Micro, and eight IEEE/ACM Transactions. Chen Yunji won the first National Natural Science Foundation "Outstanding Youth Fund", was selected among the first national Ten Thousand Talents Program "Young Top-Notch Talents", and received the China Computer Federation Young Scientist Award and the CAS Young Talent Award. He also led his research team to win the national "Youth Civilization" title and the corresponding title for central state organs.

Smart Apps
    • Intelligent processing is the core problem
    • The human brain runs on only about 20 W of power
    • Multilayer large-scale neural network ≈ convolutional neural network + LRN (local response normalization: different feature maps extract different features, then are normalized across maps) + pooling (down-sampling) + classifier (fully connected, 2-3 layers)
    • DeepMind: deep learning + reinforcement learning = mastered 49 games
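The pipeline named above (convolution + pooling + a small fully connected classifier) can be illustrated with a minimal NumPy sketch. The shapes, layer sizes, and random weights here are illustrative assumptions for a toy forward pass, not the architecture of any chip discussed in these notes:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: slide the kernel over the image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Down-sample by taking the max of each size x size tile."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def classifier(features, weights, bias):
    """A one-layer fully connected classifier with softmax."""
    logits = weights @ features + bias
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))
fmap = np.maximum(conv2d(image, kernel), 0)    # convolution + ReLU
pooled = max_pool(fmap)                        # pooling (down-sampling)
feat = pooled.ravel()
probs = classifier(feat, rng.standard_normal((3, feat.size)), np.zeros(3))
print(probs)  # three class probabilities summing to 1
```

A real network stacks many such layers with many feature maps per layer; the sketch only shows how the stages compose.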
Requirements for neural network processors
    • Google cat: 16,000 CPU cores × 7 days = one cat-face recognition task
    • Google Brain: 100 billion synapses, versus roughly 100 trillion synapses in the human brain
Dedicated Neural network processor
    • Each computer requires a dedicated neural network processor
Cambrian 2008-2016
    • Architectural methods for computing neural networks
    • 2012: Results
      Using one-tenth the chip area of a CPU (Xeon E5-4620) and of a GPU (K20M), they achieved 117× the CPU's performance and 1.1× the GPU's performance, respectively.
    • 2013: First deep learning processor - DianNao
      Traditional neural network chips mapped hardware units one-to-one onto the algorithm's neurons, so each chip could compute only one fixed network. DianNao instead time-shares a small-scale hardware unit across the network, supporting neural networks of arbitrary scale and greatly improving the chip's adaptability to different algorithms.
    • 2014: Multicore deep learning processor - DaDianNao
    • 2015: General-purpose machine learning processor - PuDianNao (artificial neural networks, k-NN, SVM, Bayes, etc.)
    • 2016: Smart recognition IP for cameras - ShiDianNao
    • 2016: General-purpose neural network instruction set - DianNaoYu
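The central DianNao idea — a fixed small hardware unit time-shared over an arbitrarily large layer — can be modeled in software. Below is a minimal NumPy sketch assuming a hypothetical 16-wide unit; the real hardware's tiling, buffering, and dataflow are far more involved:

```python
import numpy as np

TILE = 16  # assumed width of the fixed hardware unit (illustrative, not the real Tn)

def layer_on_fixed_unit(weights, inputs):
    """Compute y = W @ x on a unit that can only multiply-accumulate
    a TILE x TILE block at a time, by time-sharing it over tiles."""
    n_out, n_in = weights.shape
    y = np.zeros(n_out)
    for o in range(0, n_out, TILE):          # tile over output neurons
        for i in range(0, n_in, TILE):       # tile over input synapses
            w_tile = weights[o:o+TILE, i:i+TILE]
            x_tile = inputs[i:i+TILE]
            y[o:o+TILE] += w_tile @ x_tile   # one pass through the small unit
    return y

rng = np.random.default_rng(1)
W = rng.standard_normal((100, 70))   # a layer far larger than the unit
x = rng.standard_normal(70)
assert np.allclose(layer_on_fixed_unit(W, x), W @ x)
```

Because partial sums accumulate in `y`, the same small unit handles any layer size — the software analogue of decoupling hardware scale from network scale.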
Methodological innovation
    • Fixing small-scale hardware + time-sharing = support for arbitrarily large, variable neural networks
    • Optimize the storage hierarchy to minimize the number of accesses to off-chip memory
    • Increase memory-access bandwidth
Time-sharing of hardware operation unit
    • DianNao: computes the neurons of a large network in sequence, in small-scale batches
    • DaDianNao:
      • eDRAM technology
      • Neural networks are structured, so scheduling can raise efficiency
      • Up to 21× faster than the K20
      • In a 28 nm process, DaDianNao runs at 606 MHz with an area of 67.7 mm² and a power consumption of about 16 W. A single chip delivers 21× the performance of a mainstream GPU while consuming only 1/330 of the energy. A high-performance computing system built from 64 chips can be up to 450× faster than a mainstream GPU, at only 1/150 of the total energy consumption.
      • Cons: 1. eDRAM worked well at 28 nm, but looks difficult at the 7 nm node (transistor leakage). 2. A fully connected network also requires full all-to-all communication between the chips. So the design has some flaws.
    • PuDianNao

      • Small-sample learning methods - Bayesian methods (large-sample learning is not a panacea)
      • Fields such as economics have no large samples to use
      • Because algorithms themselves keep evolving, the hardware chip must be strongly general-purpose
      • Vector inner product (SVM), vector distance, counting, nonlinear functions, and sorting ≈ 95% of the arithmetic involved in machine learning algorithms; the MLU (machine learning unit) was designed around these operations
    • ShiDianNao

      • Both the input/output and the model are kept on-chip, with no need for off-chip memory accesses
      • Essentially, it is still a von Neumann architecture
      • A supercomputer in the phone
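To see why the five primitives listed under PuDianNao (inner product, distance, counting, nonlinear functions, sorting) cover classic algorithms such as SVM and k-NN, here is an illustrative Python sketch. The function names and data are hypothetical examples, not the MLU's actual design:

```python
import numpy as np

# The five primitive operations the notes say cover ~95% of ML arithmetic.
dot = lambda a, b: float(np.dot(a, b))            # vector inner product (SVM)
dist = lambda a, b: float(np.sum((a - b) ** 2))   # squared vector distance (k-NN)
def count(labels):                                # counting (Bayes / voting)
    votes = {}
    for label in labels:
        votes[label] = votes.get(label, 0) + 1
    return votes
sign = lambda v: 1 if v >= 0 else -1              # nonlinear function (threshold)

def linear_svm_predict(w, b, x):
    """Linear SVM decision = inner product + nonlinear threshold."""
    return sign(dot(w, x) + b)

def knn_predict(train_x, train_y, x, k=3):
    """k-NN = distances + sorting + counting (majority vote)."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    votes = count(train_y[i] for i in order[:k])
    return max(votes, key=votes.get)
```

For example, with four training points in two clusters, `knn_predict` classifies a query near one cluster by sorting distances and counting the nearest labels — every step reduces to one of the five primitives.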

Brain-like computers and the von Neumann architecture
    • Given a universal hardware architecture, brain-like computers are not a breakthrough in essence
