
The first thing to think about is what problems need to be solved. The most important are three: scale, efficiency, and some intrinsic requirements of machine learning itself.

Scale

The so-called scale problem has three aspects. First, the volume of data is growing rapidly: public cloud and video data grow by more than 60% each year. Second, the absolute amount of data is huge: Qiniu, for example, holds 200 billion images and more than 1 billion hours of video, and mining the intrinsic value of that data is a hard problem in itself. Third, the throughput is large: a single 1080p camera produces about 1.8 GB of data per hour, a city has hundreds of thousands of cameras, Beijing even millions, and three months of footage reaches the EB level. The system must therefore be designed to keep pace with newly arriving data.
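
To make the throughput figures concrete, here is a quick back-of-envelope calculation in Python (a minimal sketch; the camera count and the three-month window are illustrative assumptions, not exact numbers):

# Rough check of the throughput claim above. The camera count and retention
# window are illustrative assumptions, not exact figures.

GB_PER_CAMERA_HOUR = 1.8          # one 1080p camera, as stated above
CAMERAS = 1_000_000               # "even millions" of cameras in a large city
HOURS_IN_THREE_MONTHS = 24 * 90   # roughly three months of continuous recording

total_gb = GB_PER_CAMERA_HOUR * CAMERAS * HOURS_IN_THREE_MONTHS
total_eb = total_gb / 1e9         # 1 EB = 10^9 GB in decimal units

print(f"{total_eb:.1f} EB over three months")   # prints about 3.9 EB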

Efficiency

Many people say that the Internet today is cloud computing. In fact, the cloud is not a single cloud: the Internet is not only the public cloud, but many clouds, such as storage, log servers, compute clusters, and so on. Our system needs to act as a bridge between these clouds and connect them. Often they are not in the same data center, or even in the same city, and the system must deliver data to the final training cluster reliably, with enough speed and bandwidth.
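
As a rough illustration of what reliable delivery between clouds can mean in practice, here is a minimal Python sketch of a chunked transfer with checksum verification and retries; the source and destination interfaces are assumptions for illustration, not a description of Qiniu's actual pipeline.

# Minimal sketch of reliable cross-cluster transfer: copy data in chunks,
# verify each chunk with a checksum, and retry on failure. The source and
# destination objects are assumed interfaces, invented for illustration.

import hashlib
import time

def transfer(source, destination, path, chunk_size=8 * 1024 * 1024, retries=3):
    offset = 0
    while True:
        chunk = source.read(path, offset, chunk_size)   # read the next chunk
        if not chunk:
            break                                       # end of object
        digest = hashlib.sha256(chunk).hexdigest()
        for attempt in range(retries):
            destination.write(path, offset, chunk)
            if destination.checksum(path, offset, len(chunk)) == digest:
                break                                   # chunk verified
            time.sleep(2 ** attempt)                    # back off and retry
        else:
            raise IOError(f"chunk at offset {offset} failed verification")
        offset += len(chunk)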

Intrinsic requirements of machine learning at scale

The third aspect is the intrinsic requirements of machine learning itself. Qiniu abstracts the computational process of machine learning into two kinds of operations: data operations and training operations. The intrinsic requirements cover many aspects; Figure 1 lists several of them: the whole training process should iterate quickly, it should be restartable at any time, the data must be secure, individual training tasks need their resources isolated from one another, and training jobs need to run in a distributed fashion. There are further requirements, such as visualization, model blending, and model management, which come from the training process of machine learning itself. A sketch of how such requirements might surface in a job specification follows below.
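
As a rough illustration only, the hypothetical training-job specification below shows how requirements like checkpoint-based restart, resource isolation, distribution, data security, and visualization could be expressed; the field names and values are invented for this sketch and do not describe Qiniu's actual platform API.

# A hypothetical training-job spec illustrating the requirements above.
# Every field name here is an assumption made for illustration.

job_spec = {
    "name": "example-training-job",
    "image": "registry.example.com/train/base:latest",       # assumed registry
    "distributed": {"workers": 8, "parameter_servers": 2},   # distributed training
    "resources": {"gpus": 1, "cpus": 8, "memory_gb": 32},    # per-task isolation
    "checkpoint": {                                          # restart at any time
        "interval_minutes": 10,
        "path": "store://checkpoints/example/",              # assumed location
    },
    "data": {"encrypted": True},                             # data security
    "monitoring": {"visualization": True},                   # training visualization
}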

How?

Figure 2 shows the deep learning platform architecture. At the bottom are the file systems introduced earlier, above which sits a caching IO layer, that is, a distributed memory server: the data needed for computation is fetched through it. You can compare this with the open-source project Alluxio; Qiniu's design is fairly similar. The abstraction above the IO layer is Docker, which handles compute tasks, resource allocation, and scheduling. On top of that, the orchestration system also draws on open-source projects. Qiniu actually built this part quite early, roughly one or two years ago, mainly out of concern that an open-source project might simply stop being maintained, so the team wrote its own.
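
To illustrate the role of the caching IO layer, here is a minimal read-through cache sketch in Python. It assumes generic memory-tier and backing-store interfaces invented for this example; it is not Alluxio's or Qiniu's real API.

# Minimal read-through cache sketch: serve training data from a distributed
# memory tier when possible, and fall back to the underlying file system
# otherwise. The memory_tier and backing_store objects are assumed interfaces.

class CachingIO:
    def __init__(self, memory_tier, backing_store):
        self.memory_tier = memory_tier      # e.g. a distributed memory server
        self.backing_store = backing_store  # e.g. the underlying file system

    def read(self, path: str) -> bytes:
        data = self.memory_tier.get(path)         # hot path: already in memory
        if data is None:
            data = self.backing_store.read(path)  # cold path: fetch from storage
            self.memory_tier.put(path, data)      # populate the cache for reuse
        return data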

With such an orchestration system, it becomes easy to build a distributed system whose core is a parameter server. A recommended option here is Poseidon, a distributed framework based on Caffe, which optimizes intra-cluster communication for deep learning tasks, reducing traffic while keeping iteration effective. With the distributed system and the orchestration system in place, the jobs above them are abstracted into separate images: for data cleansing, for example, five cleansing methods become five images. Data augmentation is handled similarly, and training and inference likewise call their corresponding base images. On this basis, a job is abstracted into the form of a graph, and the whole job is expressed as a data flow.
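
As a sketch of what abstracting a job into a graph could look like, the Python snippet below wires pipeline stages, each backed by a container image, into a simple data-flow DAG and orders them for execution; the stage names and image tags are illustrative assumptions, not the platform's actual ones.

# A simple data-flow graph sketch: each stage is backed by a container image,
# and upstream edges describe how data flows between stages. Stage names and
# image tags are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    image: str                                   # container image for this stage
    upstream: list = field(default_factory=list)

cleanse = Stage("cleanse", "registry.example.com/pipeline/cleanse:v1")
augment = Stage("augment", "registry.example.com/pipeline/augment:v1",
                upstream=[cleanse])
train   = Stage("train",   "registry.example.com/pipeline/train:v1",
                upstream=[augment])
infer   = Stage("infer",   "registry.example.com/pipeline/infer:v1",
                upstream=[train])

def topo_order(stages):
    # Return stages in an order where every upstream stage runs first.
    ordered, seen = [], set()
    def visit(stage):
        if stage.name in seen:
            return
        for upstream_stage in stage.upstream:
            visit(upstream_stage)
        seen.add(stage.name)
        ordered.append(stage)
    for stage in stages:
        visit(stage)
    return ordered

print([s.name for s in topo_order([infer])])  # ['cleanse', 'augment', 'train', 'infer']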
