"Csdn Live Report" December 2014 12-14th, sponsored by the China Computer Society (CCF), CCF large data expert committee contractor, the Chinese Academy of Sciences and CSDN jointly co-organized to promote large data research, application and industrial development as the main theme of the 2014 China Data Technology Conference (big Data Marvell Conference 2014,BDTC 2014) and the second session of the CCF Grand Symposium was opened at Crowne Plaza Hotel, New Yunnan, Beijing.
On the first day of the 2014 China Big Data Technology Conference, Sabri Skhiri, chief R&D architect at the Huawei EU Research Center, delivered a keynote titled "Lambda Architecture 2.0: Converging Real-time Analytics, Context-awareness and Online Learning." He pointed out two major drawbacks of current big data frameworks. First, MapReduce-style models are not well suited to machine learning: key-based operators limit flexibility, iterations are complex and computationally expensive, and batch jobs scan all the data. Second, they offer no incremental learning that updates models as new slices of data arrive.
Sabri Skhiri, chief R&D architect of the Huawei EU Research Center
The following is a transcript of the speech:
Sabri Skhiri:
My name is Sabri Skhiri, and I am from Belgium. Today I would like to talk about CEP and PME for real-time analytics. This is the outline of my talk: first a brief introduction, then an overview of the Lambda architecture and the Lambda 2.0 architecture, then some examples, and finally a summary.
I often attend conferences and give talks on machine learning. I am also the coordinator of an open source project. Participating in open source is very rewarding: it puts you in close, front-line contact with the developers.
Let's take a look at how the big data era has changed things. We can see a number of intelligent developments, including machine learning, which existed well before the big data era. The figure lists different technologies, including Hadoop. In the telecommunications field, many companies are very large and can collect rich, reliable data about their users, but there is a gap between their mining requirements and their existing capabilities. Huawei wants to help telecom operators close that gap.
Let's look specifically at real-time analytics in the telecommunications field. The main reason operators are entering the big data era is that they want to further enhance the user experience, improve quality, and optimize operational efficiency. They also hope to better exploit the value of their data, monetize it, and create a stable ecosystem covering data producers, data users, and operators, so that new data can be put to use in every direction.
We have some new business use cases, such as real-time advertising, where we can push precisely targeted ads to users; dynamic network management; and proactive management of the user experience. All of these share a few major requirements. First, we need to be able to detect scenarios and context, and act on them, in real time. What does that mean from an architectural perspective? It means our direction has changed. Look at the chart on the left: the vertical axis is data value and the horizontal axis is time. The pipeline covers event correlation, data storage, information delivery, and the final action, and decisions must be made within a bounded time. As time passes, the data loses value, so we must collect and detect data, analyze it, and decide quickly. We also need ways to generate knowledge and context on top of the raw data.
As the picture on the left shows, we really use the data to drive a change: first we explore and understand the interesting background and context, and on that basis we can make a decision directly from the context. We want to react directly to the understood context, which helps us improve efficiency.
What do we need to achieve this? We need new capabilities that let us detect context, detect scenarios, and do batch processing, and that help us build efficient models. I find this genuinely interesting, because it requires more efficient computing patterns; we are going to extend the computational model, and I will introduce that in detail later.
First, let's look at how to detect a scenario. We are of course looking for patterns of events, and we have to find the relationships between events, for example using time, or other metrics, as the basis for connecting them. Briefly, this context can be used in many cases. In marketing, if a user searches on a website, you can directly send him a precisely targeted promotion; these are what we call correlated events, and in such a scenario we want to make the corresponding decision, for example providing him with online phone purchase links and ads. Google does this: when you search on Google Maps, it shows corresponding suggestions, based on events, for marketing and promotion.
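As a rough illustration of time-based event correlation, here is a minimal Python sketch (my own, not from the talk) that joins a user's search event with a later follow-up event inside a five-minute window; the event fields and window length are assumptions.

```python
from datetime import datetime, timedelta

# Toy event stream: (timestamp, user_id, event_type, payload)
events = [
    (datetime(2014, 12, 12, 10, 0), "u1", "search", "smartphone"),
    (datetime(2014, 12, 12, 10, 3), "u1", "view_product", "phone-x"),
    (datetime(2014, 12, 12, 11, 0), "u2", "search", "headphones"),
]

WINDOW = timedelta(minutes=5)  # assumed correlation window

def correlate(events):
    """Yield (user, query, follow-up) triples within WINDOW of a search."""
    pending = {}  # user_id -> (time, query) of the last search event
    for ts, user, etype, payload in sorted(events):
        if etype == "search":
            pending[user] = (ts, payload)
        elif user in pending:
            t0, query = pending[user]
            if ts - t0 <= WINDOW:
                yield user, query, etype, payload

for match in correlate(events):
    print("correlated:", match)  # e.g. push a targeted ad here
```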
This case is about IT systems. When systems are integrated, they allow us to do matching: for example, if there are two IT systems, one a marketing system and one an operations system, these systems can interact and converge, helping you better exploit the value across their data.
To realize the goal on the first slide, we need the following: for the event streams I just mentioned, we need complex event processing, that is CEP, and a pattern matching engine. What exactly is pattern matching? It may sound a bit obscure, but it means we first process the event stream, and on top of that stream processing we do complex event processing; I will show you the difference between the three in a moment. If we can identify specific events in a larger context, we can build a more integrated, predictive model, which is great, because we can build better models and make more accurate recommendations, for example real-time recommendations: even if you are only searching for a product and not buying it right away, we collect that information and give you a more accurate push next time.
So it means we don't need to recompute everything; instead we can combine these computational capabilities to make better use of the model, which increases user stickiness as predictions become more and more accurate.
Let's look at the differences between the techniques I just mentioned, one by one. First, stream processing: IBM, Yahoo, and others all have stream processing technologies. Their goal is to process streaming information, and they use a DFG, a data-flow graph, as the processing topology. Typically these systems have no temporal support: you cannot directly express a time relationship between events in the graph, even though you need such relationships between the different elements. So let's look at the second one, CEP, complex event processing.
Complex event processing builds on the stream processing I just mentioned: you need a mainstream language in which patterns can be expressed, such as SQL. On that basis you can do KPI calculations, for example computing a KPI over the past 10 minutes, defining it, and raising an alert when a condition holds, say fewer than 10 people entering. That is what we call CEP.
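As a rough illustration of this kind of windowed KPI, here is a minimal Python sketch (my own; the talk does not show code or name a CEP engine). It counts "enter" events in a sliding 10-minute window and raises an alert when the count drops below 10; the event format and threshold follow the example in the talk.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
THRESHOLD = 10  # alert when fewer than 10 entries in the window

window = deque()  # timestamps of 'enter' events inside the window

def on_enter_event(ts: datetime):
    """Update the sliding window and evaluate the KPI."""
    window.append(ts)
    # Evict events older than 10 minutes.
    while window and ts - window[0] > WINDOW:
        window.popleft()
    kpi = len(window)
    if kpi < THRESHOLD:
        print(f"{ts}: ALERT - only {kpi} entries in the last 10 minutes")

# Example: one entry every 90 seconds keeps the KPI below threshold.
start = datetime(2014, 12, 12, 9, 0)
for i in range(10):
    on_enter_event(start + i * timedelta(seconds=90))
```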
Then there is the pattern matching engine. A pattern matching engine needs to connect different events to find patterns across them, for example: within five minutes, go from A to B and then from B to D. Some of these events are connected and some are not, so we need to find the connections and the matches between them, and the events that occur over time must match the pattern. This is how we really define and determine the relationships between events. There are various research projects in this field, for example at universities in the United States; Dartmouth is doing network management and prediction projects, and Huawei has the PME project. These technologies are what we need to define the relationships between events.
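To make the idea concrete, here is a minimal Python sketch (my own, not Huawei's PME) that matches the sequence A → B → D within a five-minute window, implemented as a small per-user state machine; the event format is an assumption.

```python
from datetime import datetime, timedelta

PATTERN = ["A", "B", "D"]          # the event sequence to detect
WINDOW = timedelta(minutes=5)       # whole sequence must fit in 5 minutes

state = {}  # user_id -> (index into PATTERN, time of first matched event)

def on_event(ts: datetime, user: str, etype: str):
    """Advance the per-user state machine; report completed matches."""
    idx, t0 = state.get(user, (0, ts))
    if idx > 0 and ts - t0 > WINDOW:
        idx, t0 = 0, ts  # window expired: restart the pattern
    if etype == PATTERN[idx]:
        if idx == 0:
            t0 = ts  # first element matched: start the window
        idx += 1
        if idx == len(PATTERN):
            print(f"{ts}: pattern A->B->D matched for {user}")
            idx = 0
    state[user] = (idx, t0)

t = datetime(2014, 12, 12, 10, 0)
for offset, etype in [(0, "A"), (60, "B"), (120, "C"), (180, "D")]:
    on_event(t + timedelta(seconds=offset), "u1", etype)
```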
Let me show you what Huawei has achieved. Building on this research, we developed a language, and in the process we found that this language can actually express CEP, that is, it can express the relationships between events, and it can be extended further. We can apply algebraic transformations, and we found that we can integrate both languages directly on the same platform. Within this integration we can also run optimizations, which help improve the execution rate.
After merging the algebras of these two models, we can better reduce the limitations of data processing. We can also do static generation and execute it as a state machine. This is the architecture we have built. First, it is more flexible, because it fuses different languages; and we can reach 80 events per core per second, which is the advantage of PME.
When should you use stream processing, and when PME? You might ask this question. Stream processing is what you want, for example, for filtering; but when we need to find defect areas and blind spots in the network, or when marketing or dynamic management requires us to detect that a scenario has changed, we need PME to express that context.
Now that we have the tools and the framework, the next core task is to generate the content, which requires a very accurate description of the context in each scenario. For this we need to do a lot of data mining, which is very important. We have to build the corresponding architecture, and I will give you some simple user examples of how to use it. In addition, if you look at the architecture, note the red part: that is where we make decisions and do batch processing. Let's see how we can extend it so that its functionality reaches a higher level.
The Lambda architecture was mentioned this afternoon, so let's look at it. In 2010, Nathan Marz published a description of the Lambda architecture. His idea is to take incoming data, for example tweets from Twitter or Weibo posts, compute KPIs over it, and also support real-time processing. Alongside the real-time layer, we precompute over the full data set: we take the data, process it ahead of time, and create a set of batch views, each a combination of KPIs. At the front end, a query then combines the batch views with the most recent real-time values, so the client always gets the latest result. That is what we call the Lambda architecture.
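As a rough sketch of the query-time merge the Lambda architecture performs, here is a minimal Python example (my own, not from the talk): a batch view holds precomputed KPI counts, a speed layer holds counts for events that arrived after the last batch run, and a query combines the two; all names and numbers are assumptions.

```python
# Batch layer: precomputed KPI view (e.g. recomputed nightly over all data).
batch_view = {"page_views:u1": 120, "page_views:u2": 45}

# Speed layer: incremental counts for events since the last batch run.
realtime_view = {"page_views:u1": 3}

def query(kpi_key: str) -> int:
    """Serving layer: merge the batch view with the real-time delta."""
    return batch_view.get(kpi_key, 0) + realtime_view.get(kpi_key, 0)

def on_event(kpi_key: str):
    """Speed layer update: cheap increment, discarded after the next batch."""
    realtime_view[kpi_key] = realtime_view.get(kpi_key, 0) + 1

on_event("page_views:u2")
print(query("page_views:u1"))  # 123 = 120 batch + 3 real-time
print(query("page_views:u2"))  # 46 = 45 batch + 1 real-time
```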
Our architecture builds on this structure, and we spent time extending it. What kind of prediction model do we need? Let's take a closer look.
The first step: in addition to the stream processing layer just mentioned, we have other layers, such as PME, which help us detect interesting scenarios. We do not wait for instructions from the client; instead we generate the context directly from the situation. That is the first step.
The second step: we add a machine learning system and build a predictive model on top of it. If we can do that, it is great, because we can keep defining new, reliable predictive models. We do not base them only on a user's recent search results, which may not capture everything: for example, if you want to change your hairstyle, you need to know what suggestions different people received and what happened after they changed theirs, and all of this comes from data processing.
Here is an example. Michael Jackson has a new album coming out, and we see that you searched for Celine Dion's album released last month. In this case we have to respond, and the real-time layer may ask: what offer should we give this person? We can give him advice about, say, at which store his album is selling at 75 percent, how to buy it online, and so on. This is what we can do: provide real-time information about the relevant discounts.
Finally, we need to be able to do error correction: being able to correct the model is a really nice thing. People buy discounted albums online, and sometimes the recommendation is wrong, so we introduce a feedback loop: customers can give feedback on our suggestions, and we feed that back to correct the model, so we have a real capacity to absorb customer feedback. This poses a problem for machine learning: we must update the model constantly, because you cannot keep telling consumers, based on last year's data, that an album is on sale; the model has to be updated with data in real time. So we need incremental, continuous learning to really capture all of this information, including information about all of our buyers, and a machine learning backend to handle it. We started this work about three years ago, asking whether it could deliver the required performance, whether it was fast and efficient enough, and whether it could achieve incremental learning. From what we have learned, the conclusion we reached is the same one reached this morning:
MapReduce has a problem: as a model, it does not make incremental learning easy, because models need updating, and both machine learning in general and incremental learning in particular are hard to express in MapReduce. The related literature suggests two lines of work to solve these problems. First, we can design a new architecture optimized for machine learning, improving the performance of the machine learning platform both from an operational point of view and from a machine learning point of view; this is essentially platform optimization for machine learning, and that is the work we did.
Second, we also redesigned the distribution of the machine learning architecture, basing the design of each computation on the corresponding model. We have designed 25 machine learning algorithms so that the system can do incremental online learning and keep adjusting itself with the feedback it receives online.
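As a rough illustration of the kind of incremental update such a platform performs (the talk shows no code; this is a sketch using scikit-learn's partial_fit API, not Huawei's platform), here is a Python example that updates a linear classifier one mini-batch at a time as feedback arrives:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online linear classifier: state lives in the model, updates are cheap.
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared on the first partial_fit

rng = np.random.default_rng(0)

def feedback_batch(n=32):
    """Simulated feedback: features plus a clicked / not-clicked label."""
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # hidden true rule
    return X, y

# Each incoming batch of feedback updates the model; no full retrain.
for step in range(100):
    X, y = feedback_batch()
    clf.partial_fit(X, y, classes=classes)

X_test, y_test = feedback_batch(1000)
print("accuracy after incremental updates:", clf.score(X_test, y_test))
```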
I won't go into detail here; I will be around afterwards, so you can ask me further. But let's look at data partitioning: after partitioning the data, we build a local model on each partition. The initial models are not very accurate; you have to wait for convergence, and after convergence the results become more and more accurate. This slide shows an experiment where we compared different configurations and then scaled out further, and you can see that our speed holds up. Here I want to emphasize something: customers sometimes ask whether we are faster than Spark or Hadoop. What really matters is real-time behavior, because you need to feed the relevant feedback back into continuous, incremental learning.
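As a rough sketch of the partition-then-merge idea (my own, under the assumption that the local models are combined by parameter averaging, which the talk does not specify), here is a small numpy example fitting one linear model per data partition and averaging the coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0, 0.5])

def make_partition(n=200):
    """One data partition: features and noisy linear targets."""
    X = rng.normal(size=(n, 3))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def fit_local(X, y):
    """Local model: ordinary least squares on this partition only."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

partitions = [make_partition() for _ in range(8)]
local_models = [fit_local(X, y) for X, y in partitions]

# Merge step: average the local parameters into one global model.
w_global = np.mean(local_models, axis=0)
print("global model:", np.round(w_global, 3))  # close to w_true
```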
You can also look at market simulation: if you want to simulate market competition, you need to do the related computations. Being able to simulate is very valuable, based not only on population samples but on the whole market; with such a market simulation you know how to invest, which is why high-performance machine learning is so useful to us.
Let's see what kind of solution we can offer for real-time marketing. From the network perspective, we can have a number of different IP sources; we let the decision layer make decisions; and on top of that, different models and layers decide, based on the situation, what to recommend and what action to take.
I won't go through this case, as I am a bit short on time, but it is all on this slide, so you can look at it later.
All right, let me conclude with a summary and our future work. Basically, facing these new use cases, we need a technology shift: on one hand a pattern matching engine, and on the other a platform for incremental, high-performance machine learning. For future work, real-time analytics requires us to understand and learn consumer behavior, and to analyze the data to understand what it represents and how it can solve specific problems. Another very interesting point is using this data to predict future scenarios: given different events, predicting future events or patterns that may be problematic or likely to occur even though they have not happened so far, because similar situations have occurred before. That is interesting.
Another very cool research area is contextual, situational learning. We need to study the situation, and when a problem arises within it we need to respond. Of course, this is very complex and full of challenges. Thank you!
For more highlights, please follow the live coverage of the 2014 China Big Data Technology Conference (BDTC), the Sina Weibo account @CSDN Cloud Computing, and subscribe to the CSDN Big Data WeChat account.