"Csdn Live Report" December 2014 12-14th, sponsored by the China Computer Society (CCF), CCF large data expert committee contractor, the Chinese Academy of Sciences and CSDN jointly co-organized to promote large data research, application and industrial development as the main theme of the 2014 China Data Technology Conference (big Data Marvell Conference 2014,BDTC 2014) and the second session of the CCF Grand Symposium was opened at Crowne Plaza Hotel, New Yunnan, Beijing.
At the plenary session on the first day of the 2014 China Big Data Technology Conference, Eric Xing (Xing Bo), professor at Carnegie Mellon University and ICML 2014 Program Chair, delivered a keynote titled "A New Platform for Cloud-based Distributed Machine Learning on Big Data." Xing observed that on current big data processing platforms, a large share of resources is wasted on cluster communication and synchronization. Even on the better platforms, computation accounts for only about 20% of the time and communication for 80%; on Hadoop, communication can account for 90% or even more.
Eric Xing (Xing Bo), professor at Carnegie Mellon University and ICML 2014 Program Chair
The following is a transcript of the speech:
Eric Xing:
First, I would like to thank the organizing committee for inviting me to speak here. My talk may differ in style from the previous ones: it is denser and perhaps drier, with some formal experimental results and even a few mathematical formulas, and I am happy to share results that my students have only just obtained.
I want to discuss platforms for distributed machine learning on big data. When we face big data, the first question we usually ask is what we can mine from it, what big data is good for. That question has already been covered in many presentations, so I will not repeat or expand on it here.
Instead I would like to talk about a more boring but more basic question: how do we carry out large-scale data computation, and how do we do it correctly? Why does this matter? Now that the volume of data is so large, the key question becomes how to understand the data correctly, and what the right tool for that is. At least in the current experience of computer scientists, many agree that machine learning, and the statistical learning algorithms it represents, may be an effective way to mine data.
I am going to explore how to move this tool onto a big data platform, and what effect the word "big" has on previous research. It is worth emphasizing why this question matters. There is a lot of hype around big data now; many articles will tell you that data is money, that with enough data you become rich, even smart. But without a good, effective system for analyzing that data, data does not equal knowledge: like a tree falling in a forest, if you do not see it, you do not know whether it fell. Today I am going to talk about these technical issues.
Why is machine learning and mining on big data hard? First, the sheer growth in data challenges the limits of storage, communication, and even processing, so you have to distribute the data across a large number of machines in a data center rather than keep it on a single machine. But the challenge goes further: as the data grows, your problems become more complex and require a smarter brain, a more sophisticated model, to understand.
The models used in large companies have hundreds of millions or even billions of parameters; they overflow the memory of a single machine and require parallel processing, which is not a simple matter. This raises the third question: where are the packaged tools? You just saw the previous speaker show IBM's system, and Mr. Yu Kai will show Baidu's system. Big data problems are currently the privilege of large enterprises, while smaller or non-IT companies have no way to tackle them. Can this situation not be changed? I believe that making a library of big data tools widely available and easy to use will change it.
A big data toolkit cannot consist only of simple tools such as decision trees and k-NN; those tools have 20 or 30 years of history and are still in use, but they are not very effective here. We want the high-end tools: deep learning, topic models, collaborative filtering, the things that appear in the recent literature and that very advanced companies have begun to use actively. What technical challenges prevent ordinary people, or most other companies, from using them?
Let me pose a question here. When your data or your model is large enough to overflow a single machine's memory, what we hope for is obviously this curve: as I keep adding machines, the computing power keeps growing. That is everyone's expectation. But developers, especially engineers with real experience, will tell you that when you are given 1,000 machines, your capability does not increase 1,000-fold. The machines sit idle much of the time, and your resources are wasted on unproductive work such as communication, waiting, and coordination. So the curve we actually see is this one: the computing power we actually obtain does not keep growing with the number of machines, but quickly plateaus and can even fall. For computer scientists, while it is important to get big data, mine it, and be a data-mining player in person, I think another important task is to provide the methodology and the tools to lift this curve from here to there, and that is what my talk centers on.
Why do I say that existing systems cannot deliver the capability I just described? Let me give a few examples. There are many machine learning researchers who are obviously very interested in big data, but because of the limits of their training or force of habit, they usually do not understand systems. To them, the difference between 100 machines and one machine is just a factor of 100; they do not think much about the communication cost in between or the probability of machine failure. Their algorithms are designed mainly for mathematical correctness or for reducing the number of iterations, but they do not dig into how the algorithm actually runs on a real cluster. They write "parallel operation" in the paper and then naively assume that it happens, that you can put these algorithms on many machines and all will be well.
What I have actually seen is, at best, the following situation. If you do a small experiment and measure the time spent on computation versus the time spent on communication, the best result looks like this: only about 20% of the time is spent computing and 80% is spent waiting, and it is often worse. The ideal state I mentioned does not exist, so these algorithms usually cannot be applied in practice.
On the other hand, systems engineers are not very familiar with the principles or techniques of machine learning and statistical learning. Their goals are to maximize the throughput of each iteration and to correct losses caused by machine failures, so they develop very reliable, very high-throughput technologies. But they usually make a very simple assumption about machine learning algorithms: I just have to run them, and they will surely run. If the system has a special programming model, such as RDDs in Spark or the vertex programs in GraphLab, they assume that any machine learning algorithm can be converted into that model, and that doing so is the engineer's task. I do not know whether those engineers have ever tried to convert a topic model into RDDs or a vertex program; it is very difficult and requires a deep mastery of machine learning principles. So you do not see complex machine learning being applied pervasively on these systems.
So systems work usually simplifies and idealizes the machine learning theory. At best this wastes resources; at worst it causes the algorithm itself to fail and diverge, and the result you get is useless. To sum up, because of the gap between the two fields, each side builds its systems under overly simplistic assumptions about the other side's framework; in practice you lose many opportunities and also make many mistakes.
Ideally there would be a way: you have many tasks and various hardware devices, and we have a general algorithm or framework that can support all machine learning programs on all known hardware, with the middle layer of machine coordination, communication, and error handling abstracted away and automated, so that as programmers we never have to touch these things.
At what level can this problem be solved? First, I want to point out that this is definitely not just a software problem. To design such an interface you must understand not only system design but also machine learning theory and algorithm design, so the set of skills required is very broad. That is what makes the problem hard, and it also means not everyone will touch it: the risk is high and the rate of return is not. But there will always be someone doing this kind of thing, developing such an operating platform or framework. Our team at CMU is working on it, and I would like to share some of the more interesting ideas and lessons we have learned while developing the framework.
First we need to study the design of the distribution strategy: how to distribute the work. When you have more than one machine, obviously the task is split across the machines; each machine moves forward on its own, intermediate results live in local storage, and the machines do not finish at the same time. To guarantee the correctness of the program you set up a barrier: every machine shakes hands with what is called the server, or the leader, before they all move on together. Machine learning algorithms have a particular characteristic: they are not one-shot, they need to iterate, and that makes this hard. When you choose different approaches to this system behavior, you get different results. Some results look like this: a lot of time is wasted on communication, or on waiting for every endpoint to line up; that guarantees your result is correct, slow but correct. Another option is to simplify the communication protocol so that machines do not have to wait for one another; sometimes you get a fast result, but that result is often a divergent, incorrect one.
Hadoop used to be regarded as the platform for writing parallel programs, but is Hadoop really suited to machine learning algorithms? I do not know whether you have had this experience in your own development; my own experience, as a visiting professor and at Facebook, was rather tragic. You control 1,000 machines on Hadoop and write a Hadoop job, but bottlenecks such as intermediate reads and writes to disk severely limit the program's effectiveness. When you have many machines it is hard to keep them in sync, and most of the time is spent waiting. Disk I/O is a fairly expensive operation, so it consumes a great deal of time and often makes it very hard for the program to move forward.
How we start from this point and solve the problem is a very interesting question. To explain our ideas, I need to show a little of what is special about machine learning, what its characteristics are. How does it differ from an ordinary program? Let me try a comparison. An ordinary computer program is expected to execute precisely, like constructing a building: you take a blueprint and follow the steps exactly so the building goes up, and no step may go wrong. Machine learning is not the precise execution of a pre-set plan; it is usually a mathematical optimization problem, not a compulsory routine as in gymnastics. You can climb the hill from this path or from that path, so there is error tolerance, and error tolerance creates new opportunities. Machine learning can be written as a mathematical formula, and whether you have reached the highest point can be evaluated with that formula; the solution is usually an iterative procedure, and the iterative process itself has the property of converging automatically.
When you iterate, whether the precision of each step is especially high or only moderately high, it can still converge under certain conditions. That is what makes machine learning algorithms both hard and easy, depending on the angle from which you attack them. Here is a comparison regarding error tolerance. Take sorting, for example: we know sorting is not error-tolerant; if a mistake here is not corrected, the final result is wrong. That is a common feature of traditional computer programs: once a critical step goes wrong, you must fix it. A machine learning algorithm runs more like a slightly drunk person climbing a mountain: he is drunk, but he knows where the summit is, he can see it, his legs still work, and he can more or less climb up. He may not climb quickly, not every step is upward, and after a wrong step he does not necessarily go back, he just keeps going. In this respect it is unlike a traditional computer program.
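To make the "error tolerance" point concrete, here is a minimal sketch of my own (not from the talk): gradient descent on a least-squares objective still converges to roughly the right answer even when every step is perturbed by noise, which stands in for the imprecision introduced by stale or reordered updates.

```python
# Minimal illustration (assumption: my own example, not the speaker's code).
# Gradient descent on f(w) = ||Xw - y||^2 with noise injected into each step
# still lands near the true weights: iterative ML tolerates imprecise steps.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(10)
lr = 0.05
for t in range(300):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    noise = 0.1 * rng.normal(size=10)   # simulated imprecision in the update
    w -= lr * (grad + noise)            # a "drunk" but downhill-biased step

print("distance to true weights:", np.linalg.norm(w - w_true))
```

The printed distance ends up small relative to the size of the weights, even though no individual step was exact.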
There are two dimensions here, data and model. To a systems engineer, data and model are no different; both are just numbers in memory, so it seems unproblematic to say that the data sits on these machines and the model sits over here. By "model" I mean the parameters inside, for example the parameters of a neural network, which can also be partitioned; once data and model grow large you can do so-called data parallelism and model parallelism.
In classical system design there is no difference between these two kinds of parallelism, and sometimes you see Hadoop or Spark doing parallel processing without distinguishing them. But if you look closely at the characteristics of a machine learning program, you find the two behave very differently. Under data parallelism, the sub-tasks are independent of each other, so you do not need to reconcile them while they run; you only need to aggregate their results at the end, not coordinate them during the process.
Under model parallelism, the pieces are actually correlated with one another, so if you do not coordinate during the process the final result will be wrong. In this case you find that data parallelism and model parallelism require different communication and system designs; there are a few other aspects I will not discuss in detail, but the contrast is sketched below.
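The following sketch is my own illustration of that contrast (not Petuum code), using a least-squares update of the form w ← w − lr · grad(w).

```python
# Assumption: hypothetical helper names; a sketch of the two parallelism styles.
import numpy as np

def grad(X, y, w):
    return 2 * X.T @ (X @ w - y) / len(y)

# Data parallel: each worker holds a disjoint shard of the DATA and computes a
# partial gradient independently; results are simply aggregated at the end, so
# no coordination is needed while the workers compute.
def data_parallel_step(shards, w, lr):
    partial = [grad(X, y, w) for X, y in shards]   # independent work
    return w - lr * np.mean(partial, axis=0)       # aggregate once at the end

# Model parallel: each worker owns a block of the PARAMETERS. Updating one
# block changes the residual that every other block depends on, so the blocks
# are coupled and the workers must coordinate during the computation.
def model_parallel_step(X, y, w, blocks, lr):
    residual = X @ w - y                           # depends on *all* blocks
    for idx in blocks:
        w[idx] -= lr * 2 * X[:, idx].T @ residual / len(y)
        residual = X @ w - y                       # refresh the shared state
    return w
```

If the model-parallel workers ignored the refresh of the shared residual, they would be updating against stale information, which is exactly the coordination problem described above.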
Let me summarize. Machine learning algorithms have their own characteristics: they are based on optimization, implemented with iteration (recursion), and have some error tolerance; they have dynamic structure; and they are not homogeneous, in that some parameters converge very quickly and some very slowly, so you can stop updating the converged ones and spend those resources elsewhere. Exploiting this requires the programmer, or the program, to have some understanding of machine learning algorithms, and in return there is an opportunity to accelerate. None of this exists in a traditional program, which normally demands perfectly correct execution at the instruction level and therefore adds many technical safeguards that are not necessarily needed in machine learning.
Let us see how existing systems address this challenge. We all know Spark is essentially an upgraded Hadoop. Its starting point is a special data structure called the RDD, which lives in memory, so reads and writes are very fast and iteration is handled well. An RDD also maintains a lineage graph, so Spark can recover quickly when something goes wrong during execution. It is especially effective for database-style processing and for non-iterative data-parallel processing.
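Here is a minimal sketch of my own (not from the talk) of why an in-memory RDD suits iterative, data-parallel work: the training data is cached once and every pass of the loop re-scans it from memory instead of re-reading it from disk.

```python
# Assumption: illustrative PySpark usage, not the speaker's benchmark code.
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-gd-sketch")
data = [(np.random.randn(10), float(np.random.randn())) for _ in range(10000)]
points = sc.parallelize(data).cache()      # keep the training set in memory
n = points.count()

w = np.zeros(10)
for _ in range(20):                        # each iteration reuses the cached RDD
    g = points.map(lambda xy: (xy[0] @ w - xy[1]) * xy[0]) \
              .reduce(lambda a, b: a + b) / n
    w -= 0.1 * g
sc.stop()
```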
Model parallelism, however, requires global coordination, which carries a significant cost, and Spark has no special mechanism for this need. GraphLab uses a graph of vertices to represent the model and the data, with the edges reflecting how strongly they are coupled; you write a vertex program, and the system automatically performs asynchronous communication while still keeping the program converging correctly. This is also a good idea and in many cases produces better results, even better than Spark. But it has problems of its own: when the data volume becomes very large, the graph becomes very heavy and the efficiency is not very high.
Our group is developing such a platform, called Petuum. It includes both data-parallel and model-parallel functionality, and it studies the characteristics of machine learning in more depth and exploits them in a targeted way, so the system is tuned specifically to the internal properties of machine learning programs. It has some very interesting features and functions, which I will summarize.
The rough structure is as follows. It contains a parameter server; as we all know, a parameter server provides a very convenient programming abstraction, a shared virtual distributed memory, so that when programming you do not have to communicate with each machine individually. We also have a scheduler, which can effectively partition the model, even dynamically, and then distribute the pieces and balance the load. The basic idea for the machine learning engineer writing an algorithm is to express it as iterative updates of the quantities in the formula, applied stochastically rather than as deterministic, exhaustive refreshes, which differs from the traditional approach.
The parameter server offers a programming interface under which you do not need to issue special instructions for reading and writing memory on each machine; you use ordinary instructions, similar to programming a single computer. It uses a rather ingenious semi-synchronous coordination mechanism, which significantly reduces the time spent on communication and increases the share spent on computation. As we tune the staleness parameter, you can see our communication time drop significantly, down below the computation time, which lets the computing resources be used to the fullest.
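To make the idea of such an interface concrete, here is a hypothetical sketch; the class and method names are my own invention, not Petuum's actual API. Workers read and update a shared table as if it were local memory, and a bounded-staleness rule decides when a reader may proceed without waiting for fresher values.

```python
# Assumption: hypothetical parameter-server client, for illustration only.
class ParamServerClient:
    def __init__(self, server, worker_id, staleness=3):
        self.server = server          # remote table of shared parameters
        self.worker_id = worker_id
        self.clock = 0
        self.staleness = staleness    # the "window" of allowed asynchrony

    def get(self, key):
        # May return a slightly stale value, but never one older than
        # (current clock - staleness); otherwise the call blocks.
        self.server.wait_until(min_clock=self.clock - self.staleness)
        return self.server.read(key)

    def inc(self, key, delta):
        # Additive update; the order of application across workers may vary.
        self.server.accumulate(key, delta)

    def tick(self):
        # Signal the end of one local iteration.
        self.clock += 1
        self.server.report_clock(self.worker_id, self.clock)
```

The point of the sketch is that the worker code looks like single-machine code: reads, additive writes, and a per-iteration clock, with the communication protocol hidden underneath.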
In the scheduler we also use a design based on machine learning considerations. The scheduler automatically discovers structure in the machine learning model, finds out which parameters are correlated and which are not, and distributes them accordingly, so that the distribution does not violate correctness constraints. This results in faster convergence.
Why does this produce such good results? There are some deeper technical and scientific principles; time permitting, I can speak about them for a few minutes. A parallel system is never ideal. With several machines you obviously cannot expect perfectly synchronous computation and communication; even machines in the same room at different temperatures behave differently, and the end result is the situation we saw. How do we accommodate such imperfections? To a skilled programmer this is of course not a problem: he can hand-tune every machine and avoid every trap. But for ordinary programmers and less sophisticated users, that expensive, time-consuming development process is unaffordable. We need a very simple interface, beneath which the supporting framework makes the communication decisions. Those decisions can be summarized as a coordination, or synchronization, protocol for data-parallel execution. One such protocol, as we all know, is the fully synchronous one used by Spark or Hadoop: everyone finishes before anyone moves on, which is mathematically proven correct but loses efficiency.
Another, somewhat dirtier, approach is not to synchronize at all and let everyone run freely, which gives no guarantee of convergence or correctness. The approach we take is a middle route, a semi-synchronous one: machines compute locally within a bounded window, using local versions of the parameter values without communicating, and only when the window is exhausted must they stop and wait. The time at which each thread hits the window boundary is random, so in the end all threads can exploit the window to the fullest. What we care about is this: it obviously gets faster, like the unsynchronized approach, but does it still guarantee correctness? The answer is that because we are boundedly asynchronous, semi-synchronous, we can produce a proof that the convergence obtained is the same as under full synchronization. This is, to my knowledge, the first theoretical proof of correctness for this kind of operating scheme, so the system has theoretical value as well as good practical value.
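In my own notation (the symbols are not from the talk), the bounded-staleness rule described here can be written as follows: a worker at local clock $c$ may keep computing on cached parameters only if no other worker has fallen more than $s$ clocks behind.

```latex
\text{worker } p \text{ at clock } c \text{ may proceed} \iff \min_{q} c_q \;\ge\; c - s
```

With $s = 0$ this reduces to fully synchronous (barrier) execution; as $s \to \infty$ it becomes fully asynchronous; the middle route keeps the staleness, and hence the error in each update, bounded.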
Its programming interface is quite ordinary: you express different machine learning algorithms, such as topic models, matrix factorization, or linear or logistic regression, in the parameter-server representation and operate on them with a very simple high-level language. What is the result? We found that across different programs we achieve faster and better convergence than either the fully synchronous or the fully asynchronous approach.
Now to model parallelism, where the mathematics imposes a stronger coordination requirement. Let me illustrate this with linear regression. In linear regression we need to estimate a parameter vector that may be very high dimensional, perhaps ten billion dimensions; it may not fit in an ordinary machine's memory, and even if it does, solving it serially is very slow, so you need to parallelize. Can you just split it up arbitrarily? It turns out you cannot. Mathematically you can derive the following: if you pick two parameters at random, the update of one depends on the current value of the other, so there is a prerequisite. If you update the two in parallel, that dependency is broken and the iteration may fail to converge. The strength of the correlation depends on the data, so it can be computed quantitatively: when two parameters are strongly correlated we need to put them on the same machine. But this is quite expensive: in principle you would build a dependency graph by estimating the correlation for every pair of parameters, and with ten billion dimensions that graph is enormous and not operable.
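As a worked example in my own notation (assuming standardized features, so $x_j^\top x_j = 1$), the coordinate update for regularized linear regression makes the dependency explicit:

```latex
w_j \;\leftarrow\; \mathcal{S}\!\Big( x_j^\top y \;-\; \sum_{k \neq j} (x_j^\top x_k)\, w_k,\ \lambda \Big),
\qquad \mathcal{S}(z,\lambda) = \operatorname{sign}(z)\max(|z|-\lambda, 0).
```

The update of $w_j$ depends on every other $w_k$ through the correlation $x_j^\top x_k$. If two highly correlated coordinates are updated in parallel against each other's stale values, that dependency is violated and the iteration can diverge; if $x_j^\top x_k \approx 0$, the two coordinates can safely live on different machines.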
Between these two options, pure graph partitioning and random partitioning, we take a middle route. Not all parameters are equally important: some matter, some do not, some converge quickly. We use a method to roughly evaluate the parameters and handle only the important ones. This structure-aware parallelization is called SAP, and it too is used through a simple programming interface. The operation samples according to a certain formula, picks out some important parameters, analyzes their structure, and distributes them to the machines.
In the next iteration the structure may have changed, and this step can be performed again; that is a choice left to the user. There are many possible distribution strategies: the priority-based one I just described, deciding which parameters are important and which are not; a modular one, dividing the parameters into blocks; and the safe, semi-synchronous coordination can also be applied. A small sketch of the prioritization idea follows.
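The sketch below is my own, hypothetical rendering of that idea (the function names and the greedy grouping heuristic are assumptions, not Petuum's implementation): parameters that changed a lot recently are more likely to be selected for the next round, and strongly coupled parameters are kept on the same worker so the dependency is not broken.

```python
# Assumption: illustrative prioritized, structure-aware scheduling sketch.
import numpy as np

def prioritize(delta, n_pick, rng):
    """Sample parameter indices with probability ~ recent change magnitude."""
    p = np.abs(delta) + 1e-8
    p /= p.sum()
    return rng.choice(len(delta), size=n_pick, replace=False, p=p)

def assign(picked, coupling, n_workers, threshold=0.5):
    """Greedy grouping: a picked parameter joins a worker that already holds a
    parameter it is strongly coupled with; otherwise it goes to the least
    loaded worker."""
    workers = [[] for _ in range(n_workers)]
    for j in picked:
        target = None
        for w, group in enumerate(workers):
            if any(coupling[j, k] > threshold for k in group):
                target = w
                break
        if target is None:
            target = min(range(n_workers), key=lambda w: len(workers[w]))
        workers[target].append(j)
    return workers
```

Re-running `prioritize` and `assign` each round corresponds to the dynamic re-scheduling mentioned above: as convergence progresses, the set of "important" parameters shrinks and the assignment changes.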
As for experimental results: with a good dynamic parallelization we achieve fast, well-behaved convergence, whereas random parallelization either does not converge or converges very slowly. We also have theoretical as well as experimental evidence that not only the mean but also the variance of the iterates has good convergence guarantees. Time only allowed me to cover the scientific principles, but we have also seen the results: substantial improvements across different algorithms. Let me use a few slides to sum up the current landscape of large-scale machine learning platforms. It is a challenging but exciting area with many players already in the game, and we are a latecomer.
With today's big data models, you can hardly join the conversation without talking about billions of parameters. Looking at recent achievements, you can see people using 10,000 machines to reach 10 billion parameters for topic model estimation, and other work on neural networks or topic models has reached a similar order of magnitude. The pattern is: the more machines I have, the bigger the model I can build. That is a trend, and the investment has paid off to some extent. Getting that many machines to run a distributed program together is a great achievement in itself. But at the same time you want to improve efficiency, and that is what our results show: we recently ran a topic model, matrix factorization, and a convolutional neural network whose sizes exceed the largest reported in the literature so far, while using an order of magnitude fewer machines. The slide also shows the speedups.
We have a collaboration with Microsoft in which they used 24 fairly high-end machines to train a topic model with 10^12 parameters, one million topics by a one-million-word vocabulary, which is to our knowledge the largest topic model to date, roughly 10 to 50 times bigger than the well-known Tencent Peacock, while using far fewer machines than they do.
Finally, in designing a platform or framework, if you take into account both operating system principles and machine learning principles, you will be rewarded. Petuum is an open source project. Based on our observations so far, it can not only handle very large model sizes, basically matching or exceeding the best current systems, but it also greatly reduces the required cluster size and hardware. Just before this talk my students received the latest results from the NIPS conference, and we were pleasantly surprised: a group that knew of our system had made an independent comparison of it against Spark and GraphLab, and I am glad to see that our convergence curve is the lowest and fastest. I would like to end my talk with that result. Lastly, you may want to know exactly where the Petuum system stands. Our vision covers the upper-layer software, the lower-layer software, and ecosystem interface support; it is now a member of the Hadoop ecosystem, and you can download it later for your own development or use the software in our library. For this project I would like to thank my colleagues and students for their support over the past two years, and thank you all for your attention. Thank you!
For more coverage, follow the live topic page for the 2014 China Big Data Technology Conference (BDTC), the Sina Weibo account @CSDN云计算, or subscribe to the CSDN Big Data WeChat account.