CS224D Lecture 15 Notes


Reprints are welcome; please credit the source:

http://blog.csdn.net/neighborhoodguo/article/details/47449257


This is another guest lecture, this time on parallel computing. As the saying goes, "three cobblers with their wits combined equal Zhuge Liang." Ha ha.

Because the processing power of a single computer or processor is limited, parallel computing can greatly improve computation speed and save debugging time. Our neural networks are complex and sometimes very large, so parallel computing is necessary.

This talk is divided into five main parts: 1. Efficient formulations; 2. CPUs and GPUs; 3. Parallelism; 4. Asynchronous SGD; 5. Easy implementations and current packages.

Efficient formulations

Structured vs. unstructured computation

A structured graph means the connections between units are very orderly, as in a CNN.

The advantage of this representation is that cache access is contiguous, loading is easy, and memory usage is low. The disadvantage is poor flexibility.

The other kind is the unstructured graph.

Its advantage is greater expressive power, but cache access is not contiguous, loading is harder, and memory usage is higher (the opposite of the above).

Our goal is to make the representation more structured without compromising performance.

Block Operations and BLAS

The simplest examples of block operations are matrix multiplication and addition: similar operations are packed together into one block and then computed as a batch.

BLAS (Basic Linear Algebra Subprograms) is a family of highly optimized routines for exactly these block operations; other good parallel computing tools were also recommended in class.
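As a concrete illustration (not from the lecture itself), here is a minimal NumPy sketch contrasting a single BLAS-backed block operation with an explicit element-by-element loop; NumPy's `@` delegates to a BLAS `gemm` routine when linked against one (e.g. Intel MKL or OpenBLAS):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((64, 32))
B = rng.random((32, 16))

# One block operation: the whole product in a single BLAS call.
C_blas = A @ B

# Equivalent triple loop: the same arithmetic, but orders of
# magnitude slower for large matrices, since it cannot exploit
# cache-friendly blocking and vectorized kernels.
C_loop = np.zeros((64, 16))
for i in range(64):
    for j in range(16):
        for k in range(32):
            C_loop[i, j] += A[i, k] * B[k, j]
```

Both paths compute the same matrix, which is exactly why packing work into block operations is "free" in terms of correctness.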

Batching

Batched gradient descent was covered before, so it will not be repeated here.
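The connection between batching and block operations can be sketched as follows (a minimal example, not from the lecture): stacking a mini-batch of inputs into one matrix turns many matrix-vector products into a single matrix-matrix product, which BLAS executes far more efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 50))   # layer weights
b = rng.standard_normal(100)         # layer bias
X = rng.standard_normal((128, 50))   # a mini-batch of 128 inputs

# One example at a time: 128 separate matrix-vector products.
H_loop = np.stack([np.tanh(W @ x + b) for x in X])

# Batched: one matrix-matrix product over the whole mini-batch.
H_batch = np.tanh(X @ W.T + b)
```

The two results are identical; only the amount of work handed to BLAS per call changes.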

CPUs and GPUs

The lecturer said that CPU and GPU per-core performance has plateaued.

Memory is limited in size, and communication between CPU and GPU is very slow; this is a major bottleneck.

CPUs have fewer cores, but each core is faster.

GPUs have many more cores, but each core is slower.

Still, the GPU has a number of advantages: as a whole, GPU computation is faster than CPU computation.

At first glance it seems best to use the GPU exclusively.

But because of the communication bottleneck, the CPU actually has the advantage for small computations, while the GPU has a clear advantage for large ones.

Data parallelism

This is used to speed up the batched gradient descent described earlier.

1. First designate a master core and several worker cores; the master assigns computation tasks to each worker.

2. Each worker core then computes independently.

3. When computation finishes, the results are gathered at the master, which computes the final result.

The parallelism here is synchronous.
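The three steps above can be sketched in Python (a toy example under assumptions of my own: a least-squares loss and a thread pool standing in for separate cores; real systems use processes, machines, or GPUs):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(X, y, w):
    """Worker step: gradient of mean squared error on one data shard."""
    n = len(y)
    return 2.0 * X.T @ (X @ w - y) / n

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = rng.standard_normal(1000)
w = np.zeros(10)

# Step 1: the master splits the batch into 4 equal shards, one per worker.
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

# Step 2: each worker computes its shard gradient independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    grads = list(pool.map(lambda s: shard_gradient(s[0], s[1], w), shards))

# Step 3: the master averages the shard gradients and applies the update.
g = np.mean(grads, axis=0)
w = w - 0.1 * g
```

Because the shards are equal-sized, the averaged gradient equals the full-batch gradient, so data parallelism changes where the work happens but not the result.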

Model parallelism

This divides the model itself into blocks; each block is assigned to a core for computation, and the results are then combined.
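A minimal sketch of the idea (my own illustration, done serially; in practice each block would live on a separate device): split one large layer's weight matrix by output units, let each "core" compute its slice of the activations, and concatenate.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((200, 50))  # one large layer
x = rng.standard_normal(50)

# Model parallelism: split the layer's rows (output units) into
# blocks; each core computes only its slice of the activations.
blocks = np.array_split(W, 4, axis=0)
partial = [Wb @ x for Wb in blocks]  # done on separate devices in practice

# Combine: concatenating the slices recovers the full layer output.
h = np.concatenate(partial)
```

Unlike data parallelism, every core here sees the same input but holds only part of the model's parameters.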

The computing power of a single computer is limited; can multiple computers help compute at the same time?

The problem is that Ethernet communication between computers is too slow, so faster inter-machine communication needs to be developed.

Asynchronous SGD

The synchronous method described earlier requires waiting for every worker core to finish before the results can be aggregated, so part of the time is spent idle, waiting.

To address this, asynchronous SGD was proposed.

Tasks are assigned as before, but whichever worker finishes first uploads its result to the master; as soon as the master finishes aggregating it, the updated parameters are immediately sent back to that worker core, without waiting for the others.
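A rough sketch of the lock-free flavor of this idea (my own toy example in the Hogwild style, using threads on a noiseless least-squares problem; the races that lose an occasional update are tolerated by design):

```python
import numpy as np
import threading

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 5))
true_w = np.arange(5.0)
y = X @ true_w          # noiseless labels, so SGD can converge exactly
w = np.zeros(5)         # shared parameters, updated without any barrier

def worker(Xs, ys, seed, steps=200, lr=0.01):
    """Each worker reads the current shared w, computes a per-sample
    gradient, and writes its update immediately -- no waiting."""
    global w
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = local_rng.integers(len(ys))
        g = 2.0 * (Xs[i] @ w - ys[i]) * Xs[i]
        w = w - lr * g  # lock-free; occasional overwritten updates are OK

shards = zip(np.array_split(X, 4), np.array_split(y, 4))
threads = [threading.Thread(target=worker, args=(Xs, ys, s))
           for s, (Xs, ys) in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite the unsynchronized writes, the shared parameters still converge toward `true_w`, because each applied update is a valid descent step from some recent iterate.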

Directions

There are three areas that can be improved:

1. Modify the model to minimize unstructured computation and increase the structured portion; increase the model's width as much as possible and reduce its depth.

2. Try to keep the neurons unsaturated, so the data stays in the activation function's linear region as much as possible.

3. Find a better approach to optimization.
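Point 2 is easy to see numerically (a small illustration of my own): tanh's derivative is 1 - tanh(x)^2, so a unit near zero passes the gradient through at full strength, while a saturated unit passes almost nothing, stalling learning.

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

print(tanh_grad(0.0))  # 1.0: linear region, full gradient flows
print(tanh_grad(5.0))  # ~1.8e-4: saturated, gradient almost vanishes
```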


Several open-source parallelism packages

1. BLAS

2. For CPUs: Intel MKL, ATLAS, GotoBLAS; for GPUs: CUDA, OpenACC, clBLAS

3. Theano, Torch


Copyright notice: this is an original article by the blog author; do not reproduce without permission.

