Reprinting is welcome; please credit the source:
http://blog.csdn.net/neighborhoodguo/article/details/47449257
This lecture is again given by a guest speaker, and it is about parallel computing. As the saying goes, "three cobblers with their wits combined equal one Zhuge Liang." Haha.
Because the processing power of a single computer or processor is limited, parallel computing can greatly increase computation speed and save time during debugging. Since our neural networks are so complex, and sometimes very large, parallel computing is necessary.
This talk is divided into five parts:
1. Efficient formulations
2. CPUs and GPUs
3. Parallelism
4. Asynchronous SGD
5. Easy implementations and current directions
Efficient formulations
Structured vs. unstructured computation
A structured graph is one where the connections between units are very regular, as in a CNN.
The advantage of this representation is that cache access is contiguous, it is easy to load, and it uses less memory. The disadvantage is that it is less flexible.
The other kind is an unstructured graph.
Its advantage is greater expressive power, but cache access is not contiguous, it is harder to load, and memory usage is high (the opposite of the above).
Our goal is to make the formulation more structured without compromising performance.
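As a minimal NumPy sketch of the contrast (the names here are illustrative, not from the lecture): the same layer computed edge by edge from an unstructured connection list, versus as one structured dense matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.standard_normal((n_out, n_in))
x = rng.standard_normal(n_in)

# Unstructured view: the layer as an explicit list of (target, source, weight)
# connections, accumulated one edge at a time -- flexible, but memory access
# jumps around and nothing is vectorized.
edges = [(i, j, W[i, j]) for i in range(n_out) for j in range(n_in)]
h_unstructured = np.zeros(n_out)
for i, j, w in edges:
    h_unstructured[i] += w * x[j]

# Structured view: the same connections stored as one contiguous weight
# matrix, so the whole layer is a single dense matrix-vector product.
h_structured = W @ x

assert np.allclose(h_unstructured, h_structured)
```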
Block Operations and BLAS
One of the simplest examples of block operations is matrix multiplication and addition: similar operations are packed together into one large block and then computed in a single batch.
BLAS (Basic Linear Algebra Subprograms) is a highly optimized parallel computing library; other excellent parallel computing tools were also recommended in class.
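A sketch of why block operations help, assuming NumPy is linked against a BLAS implementation such as MKL or ATLAS: many separate vector-matrix products are packed into one matrix-matrix product, which BLAS executes as a single optimized call.

```python
import numpy as np

rng = np.random.default_rng(1)
n_examples, n_in, n_out = 128, 300, 100
W = rng.standard_normal((n_in, n_out))
xs = rng.standard_normal((n_examples, n_in))

# Many separate calls: one small vector-matrix product per example.
outs_loop = np.stack([x @ W for x in xs])

# Block operation: pack the examples into one matrix and make a single
# matrix-matrix (GEMM) call; BLAS handles blocking, caching and threading.
outs_block = xs @ W

assert np.allclose(outs_loop, outs_block)
```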
Batching
Mini-batch gradient descent was covered earlier, so it will not be repeated here.
CPUs and GPUs
The lecturer said that CPU and GPU performance has peaked.
Memory size is very limited, and the slow communication between CPU and GPU is a bottleneck.
CPUs have fewer cores, but each core is faster.
GPUs have many more cores, but each individual core is slower.
Still, the GPU has several advantages: taken as a whole, GPU computation is faster than CPU computation.
At first glance it looks as if one should just use the GPU for everything.
But because of the communication bottleneck, the CPU actually has the advantage when the amount of computation is small, while the GPU has a clear advantage when the amount of computation is large.
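A rough timing sketch of this tradeoff, assuming PyTorch and a CUDA-capable GPU are available (the sizes 64 and 4096 are arbitrary illustrations): the small product is dominated by launch and transfer overhead, the large one by raw throughput.

```python
import time
import torch

def avg_matmul_time(n, device, repeats=10):
    """Average time for an n x n matrix product on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()          # finish any pending GPU work first
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()          # wait for the queued products
    return (time.perf_counter() - start) / repeats

if torch.cuda.is_available():
    for n in (64, 4096):                  # small vs. large workload
        cpu_t = avg_matmul_time(n, torch.device("cpu"))
        gpu_t = avg_matmul_time(n, torch.device("cuda"))
        print(f"n={n}: cpu {cpu_t:.6f}s  gpu {gpu_t:.6f}s")
```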
Data parallelism
This is used to speed up the mini-batch gradient descent described earlier:
1. First designate one master core and several worker cores; the master assigns a share of the computation to each worker.
2. Each worker core then does its computation independently.
3. When the computation is done, the results are gathered at the master, which computes the final result.
The parallelism here is synchronous.
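A minimal synchronous data-parallel sketch using Python's multiprocessing, with an illustrative linear-regression gradient standing in for the network's gradient (all names are made up for the example): the master splits each mini-batch, the workers compute gradients on their shards, and the master averages them before updating.

```python
import numpy as np
from multiprocessing import Pool

def worker_grad(args):
    """One worker computes the gradient on its shard of the mini-batch."""
    w, X_shard, y_shard = args
    err = X_shard @ w - y_shard
    return X_shard.T @ err / len(y_shard)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X, y = rng.standard_normal((512, 10)), rng.standard_normal(512)
    w, lr, n_workers = np.zeros(10), 0.1, 4

    with Pool(n_workers) as pool:
        for step in range(20):
            # The master splits the mini-batch and hands each worker a shard.
            shards = [(w, Xs, ys)
                      for Xs, ys in zip(np.array_split(X, n_workers),
                                        np.array_split(y, n_workers))]
            # Synchronous: wait for every worker, then combine at the master.
            grads = pool.map(worker_grad, shards)
            w = w - lr * np.mean(grads, axis=0)
```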
Model parallelism
Here the model itself is split into blocks; each block is assigned to a different core for computation, and the results are then combined.
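A toy sketch of the idea: the weight matrix of one layer is split into column blocks, each block's share of the output is computed separately (as it would be on separate cores), and the pieces are concatenated.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_out, n_cores = 8, 6, 3
W = rng.standard_normal((n_in, n_out))
x = rng.standard_normal(n_in)

# Model parallelism: split the layer's weight matrix into column blocks,
# one block per core; each core computes its slice of the output.
blocks = np.array_split(W, n_cores, axis=1)
partial_outputs = [x @ block for block in blocks]   # would run on separate cores

# The partial results are then gathered and concatenated into the full output.
h = np.concatenate(partial_outputs)
assert np.allclose(h, x @ W)
```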
The computing power of a single computer is limited; can several computers be used to help with the computation at the same time? They can, but Ethernet communication between machines is too slow, so faster interconnects between computers need to be developed.
Asynchronous SGD
The synchronous scheme described above has to wait for every worker core to finish before the results can be combined, so part of the time is spent waiting.
To address this, asynchronous SGD was proposed.
Tasks are assigned just as before, but whichever worker finishes first uploads its result to the master, and as soon as the master has folded it in, the updated parameters are immediately sent back to that worker core.
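A simplified, Hogwild-style sketch of the idea using threads that read and update a shared parameter vector without waiting for one another; this stands in for the master/worker exchange and is not the exact scheme from the lecture.

```python
import numpy as np
import threading

rng = np.random.default_rng(4)
X, y = rng.standard_normal((400, 10)), rng.standard_normal(400)
w = np.zeros(10)     # shared parameters, playing the role of the master copy
lr = 0.01

def async_worker(X_shard, y_shard):
    """Push an update as soon as one example's gradient is ready,
    reading whatever parameter values are currently shared -- no waiting."""
    global w
    for xi, yi in zip(X_shard, y_shard):
        grad = (xi @ w - yi) * xi        # gradient on a single example
        w = w - lr * grad                # update without waiting for other workers

threads = [threading.Thread(target=async_worker, args=(Xs, ys))
           for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```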
Directions
There are three areas that can be improved:
1. Modify the model to minimize the unstructured parts and enlarge the structured parts; increase the width of the model as much as possible and reduce its depth.
2. Try to keep the neurons from saturating, so that activations stay in the near-linear region as much as possible (see the sketch after this list).
3. Find better optimization methods.
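A tiny sketch of point 2: the gradient of tanh is close to 1 in its near-linear region and close to 0 once the unit saturates, which is why keeping pre-activations small helps learning.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh: near 1 in the linear region, near 0 when saturated."""
    return 1.0 - np.tanh(z) ** 2

for z in (0.1, 1.0, 5.0):
    print(f"pre-activation {z:>4}: tanh' = {tanh_grad(z):.4f}")
# Small pre-activations keep the unit near its linear region, where gradients
# flow; large ones saturate it (tanh' ~ 0) and learning stalls.
```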
Several open-source parallelism packages
1. BLAS
2. For CPUs: Intel MKL, ATLAS, GotoBLAS; for GPUs: CUDA, OpenACC, clBLAS
3. Theano, Torch
Copyright notice: this is the blogger's original article; please do not reproduce it without the blogger's permission.
CS224D Lecture 15 Notes