From August 19 to 20, 2014, the "2014 China International Data Conference" was held at the Ambassador Hotel in Beijing. The following is the speech by Zhu, general manager of the Information Management Software department at IBM's China Development Lab.
Key points:
1. If big data has a life cycle, I believe it has not yet passed its infancy.
2. The appeal of big data lies not in its size, but in the ever-greater value that can be extracted from such large volumes of data.
3. Big data does not just mean new data. The most valuable data in society is still the data that enterprises themselves have accumulated over decades, produced in traditional core data management systems. It is not necessarily the largest or the trendiest data, but it is the most commercially valuable.
4. We now say data is a resource, and it is: whoever masters the data and the information stands on the high ground of competition. But oil is also a resource, and oil is useless before it is refined. A barrel of crude from Saudi Arabia is of no use to you, and if you accidentally spill it you cannot wash it off, yet once refined it produces enormous value; most human wars have essentially been fought over it. Information is the same: unprocessed information is a resource, but a useless one.
5. The most widely adopted big data scenarios to date fall into five areas: big data exploration, the 360-degree customer view, operations and operational analysis, data warehouse expansion and enhancement, and security and risk capabilities.
6. We must not miss the new opportunity big data brings us, but neither should we over-hype it; we should approach it with a scientific, pragmatic spirit.
7. Business decisions are not yet fully dominated by prescriptive analytics, but prescriptive analytics is an increasingly clear trend.
The original speech, with slides:
Zhu: Hello everyone! I am from IBM's China Development Lab. My name is Zhu, and I currently lead about 500 engineers in Beijing's Zhongguancun Software Park, working on all of IBM's information management and big data products.
I returned to China in 2007, and in 2008 I began working in the data field and on Hadoop, setting up my first team that year to develop a Hadoop accelerator.
Here is why I came today: I think we are past the stage of basic big data concepts; we have a general understanding of them now. As someone with a technical background who has worked in information management for nearly 20 years, I am going to share, on behalf of IBM, some of my own thoughts on the matter.
First: if big data has a life cycle, I believe it has not yet passed its infancy, so many of its ideas and views are immature. If in my short 20 minutes today I can raise a few questions of some value to you, I will consider our mission accomplished.
You may have seen different versions of this slide; it is basically about where big data comes from. Let me share a story. In 2008, when I started a small team doing Hadoop research, I went back to our Silicon Valley lab. You probably know the relational database was invented at IBM, and for more than 30 years we have had a group of grandmasters there, the so-called IBM Fellows, whose status is unshakable, so I went to ask for guidance. When I asked, back in 2008, they said: why are you looking at big data? The problem of big data has never gone away since database technology began; every year we have been trying to solve the problem of ever-growing data, so why is big data suddenly a new thing? That was the thinking in 2008.
By about four or five years ago, I felt more and more people were willing to ask about big data: what is it? At that time we began to talk about where big data comes from, structured versus unstructured data, the annual growth in data volume and what order of magnitude it reaches, and our estimates of roughly how much data there would be by 2020. That is what we talked about four years ago.
About two years ago, I found this conversation was no longer needed. When we talked to enterprise CIOs and CTOs, they already knew; two years ago the understanding was already good, and leadership considered it important. The question had become: how do we actually get something out of it?
So I think today, especially given that our topic is the smart city, we need to look first at the latest understanding of this matter, and second at what we think is now possible.
I think there is no doubt that unstructured big data is increasingly generated by social media and, above all, by more and more sensing devices and sensors. Moreover, as people's ability to process data grows, many people think up new schemes every day: where else can I put a sensor, and what more data can I get from it? But do not forget, and this may be a common misunderstanding right now: big data does not just mean new data. The most valuable data in society is still the data that enterprises themselves have accumulated over decades, produced in traditional core data management systems. It is not necessarily the largest or the trendiest data, but it is the most commercially valuable.
What problems do we have to solve? Put it in the smart city context. A city has a great deal of hardware, environment, and equipment producing data through sensors; that is one part of the data we have. The city also has many people living in it, and their behavior patterns, emotional expressions, and comments on social media are certainly something we need to consider. And the businesses that operate the city and provide livelihood services to its people, banks, telecoms, insurance, public security, all the data held by government units and enterprises, are also part of the big data a smart city needs to care about.
So for these three sources of data: how do we share them? How do we integrate them into a common platform for analysis? These are the problems we need to spend time thinking about and solving with big data technology.
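As a minimal sketch of what "a common platform" implies at the data level, the Python below normalizes records from the three source types into one shared schema before any analysis runs. The record shapes, field names, and sample values are illustrative assumptions, not any specific IBM product design.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# A minimal common record that city data from very different sources
# (sensors, social media, enterprise systems) can be normalized into.
@dataclass
class CityRecord:
    source: str         # "sensor" | "social" | "enterprise"
    timestamp: datetime
    location: str       # coarse location tag; a real system would use geo-coordinates
    payload: dict       # source-specific fields, kept for downstream analysis

def from_sensor(reading: dict) -> CityRecord:
    # Hypothetical sensor feed: {"ts": epoch_seconds, "pm25": ..., "district": ...}
    return CityRecord("sensor", datetime.fromtimestamp(reading["ts"], tz=timezone.utc),
                      reading.get("district", "unknown"), {"pm25": reading["pm25"]})

def from_social(post: dict) -> CityRecord:
    # Hypothetical social post: {"ts": ..., "text": ..., "city": ...}
    return CityRecord("social", datetime.fromtimestamp(post["ts"], tz=timezone.utc),
                      post.get("city", "unknown"), {"text": post["text"]})

def from_enterprise(row: dict) -> CityRecord:
    # Hypothetical core-system row: {"ts": ..., "amount": ..., "branch": ...}
    return CityRecord("enterprise", datetime.fromtimestamp(row["ts"], tz=timezone.utc),
                      row.get("branch", "unknown"), {"amount": row["amount"]})

# Once everything shares one schema, a single analysis pass can cover all three sources.
records = [
    from_sensor({"ts": 1408406400, "pm25": 87, "district": "Chaoyang"}),
    from_social({"ts": 1408406500, "text": "terrible traffic on the 3rd ring", "city": "Beijing"}),
    from_enterprise({"ts": 1408406600, "amount": 120.5, "branch": "Haidian"}),
]
for r in records:
    print(r.source, r.timestamp.isoformat(), r.location)
```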
This slide shows a lot of data. Big is not in itself good: the appeal of big data lies not in its size, but in the ever-greater value that can be extracted from such large volumes of data.
Last week I was with a good friend of mine, the CIO of a Chinese commercial bank in Beijing, and he told me, "You guys are killing me. Ever since you started going on about big data all day, we have had a real headache." I asked what the headache was. He said that the leadership now feels data is a resource, that every piece of data must be kept and nothing may be lost, while his performance is measured against his budget. "So slow down with the hype, or I won't be able to do my job." So data is not only a problem but also an opportunity; the question is how to pan the gold out of it, and that is what we need to think about.
This is also a very standard diagram you may have seen on many occasions: data at massive scale, data in diverse forms, data that is large but not necessarily certain. Panning value out of it is heavy, difficult work.
Then there is the dimension of speed. People complain that Hadoop is not fast enough, but the deeper issue is that data is a flow that never stops, so you can never finish processing it. Our general expectation of big data is that it should improve efficiency, improve real-time responsiveness, deliver analysis reports faster, and support decisions that affect the business. All of this, on top of a premise of scale, complexity, and uncertainty, plus the demand for speed, makes big data something easy to see but hard to land and very hard to succeed at. I think that is basically where we stand today.
Let me quote a writer I greatly admire, John Naisbitt, the author of Megatrends. When we were teenagers, Megatrends was almost a bible of popular science for us. He was the first to propose treating information as a resource: for the first time, we have an economy built around a key resource, information, and the future economy will revolve around it.
But the more critical point he made is that information generates itself endlessly, and we must not drown in it. I think that is crucial. Data is endless, but if we lack good methods and merely talk about big data all day as if talking guaranteed success, we are wrong. Lacking the actual ability to process big data is the problem we have to watch out for.
The corresponding point is just as important. We now say data is a resource, and it is: whoever masters the data and the information stands on the high ground of competition. But oil is also a resource, and oil is useless before it is refined. A barrel of crude from Saudi Arabia is of no use to you, and if you accidentally spill it you cannot wash it off, yet once refined it produces enormous value; most human wars have essentially been fought over it. Information is the same: unprocessed information is a resource, but a useless one.
So I think one of the things we have been reflecting on in recent years is how to position the value of big data properly in top-level design, and how to process it well technically so that it produces value. That is the problem that the companies, organizations, and individuals working with big data need to solve next.
When it comes to processing data, I have to introduce the key stages that data analysis has gone through.
Descriptive analysis is the reporting we are most familiar with: a pile of historical data producing annual, quarterly, monthly, weekly, and daily reports that carry commercial value. Predictive analysis uses data mining and statistical algorithms on historical data to make predictions and judgments about where the data is heading. I think we are still basically at descriptive analysis, which accounts for about 90% of data processing; predictive analysis perhaps 5%; and the remaining 5% is just entering prescriptive analysis and cognitive analysis. The difference between predictive and prescriptive analysis is that predictive analysis tells you this stock is likely to go down, the market may fall, but it is not an instruction; the output of prescriptive analysis is an instruction: buy this stock when it reaches 21.5, sell when it reaches 22.6. I do not think we have reached a stage where prescriptive analysis dominates business decisions, but prescriptive analysis is an increasingly clear trend.
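To make the three stages concrete, here is a toy Python sketch: a descriptive summary, a crude predictive trend, and a prescriptive rule echoing the 21.5-in, 22.6-out stock example above. The price series and the linear extrapolation are illustrative stand-ins for real statistical models.

```python
# Toy daily closing prices (illustrative numbers, not real market data).
prices = [21.9, 21.7, 21.8, 21.5, 21.6, 21.4, 21.3]

# Descriptive: summarize what already happened (the "report" stage).
avg = sum(prices) / len(prices)
print(f"descriptive: mean close = {avg:.2f}, last close = {prices[-1]}")

# Predictive: extrapolate from history. A crude linear trend over the
# window stands in for a real data-mining or statistical model.
trend = (prices[-1] - prices[0]) / (len(prices) - 1)
forecast = prices[-1] + trend
print(f"predictive: trend {trend:+.3f}/day, next close ~ {forecast:.2f}")

# Prescriptive: turn the prediction into an instruction. The thresholds
# echo the speaker's example: enter at 21.5, exit at 22.6.
BUY_AT, SELL_AT = 21.5, 22.6
if forecast <= BUY_AT:
    print(f"prescriptive: buy when price reaches {BUY_AT}, sell at {SELL_AT}")
else:
    print("prescriptive: hold; entry condition not met")
```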
At the forefront, and the area IBM now pays the most attention to, is cognitive analysis. Its rationale is another dimension of big data that we cannot ignore: data has exceeded the limits of what humans can process. The world now produces in two days as much data as existed in total before 2003, which is remarkable. But let me ask: how much of the data produced today is actually analyzed and processed? A very, very small proportion. Data has gone beyond the limits of human analysis, and the trend is only getting more severe. What do we do? We need machines that can learn and recognize on their own, and that is now the hottest area.
IBM has released a product called Watson, a self-learning analysis engine. Watson first came to prominence in a US general-knowledge quiz show, where it competed against the two strongest champions in the show's history. Questions were posed in natural language; the machine had to understand them through natural language analysis and then answer, and it won by an absolute margin. But that was a very early stage. Just two weeks ago we released a new chip that is the closest yet to the human brain, with 5.4 billion transistors on it, a billion more than Intel's current Xeon chip. It is designed specifically for self-learning, self-cognizing supercomputers: a dozen or so of these chips can make up such a supercomputer, and it consumes less energy than a single hearing-aid battery. That is very powerful; humanity has not produced such processing capability before.
But what does this cognitive capability amount to? There are only a million neurons on this little chip, roughly the capacity of a bee's brain; the human brain is many orders of magnitude beyond that. Still, chips and CPUs follow a rule: once the technology breaks through, progress comes at a rapid, geometric pace. So we have a lot of confidence that this will become the support point for our perception and cognitive analysis capability.
Earlier speakers described much fruitful work on how to use information and data and how to develop applications suited to smart city development, which I greatly admire. From IBM's perspective, we look more at the process in between, from information to application: the technical problems that must be solved and the platform-level technology. That is what interests us most.
Information needs a governance zone, a capture zone, a real-time analysis zone, an exploration and landing zone, and timely analysis in the data warehouse. I will not go into the details; this is a long topic in its own right.
Finally, on product technology: when we talk about big data we cannot forget the cloud. I personally believe the delivery of big data must go through the cloud. Why? Combine the points I just made: data has exceeded the limits humans can handle, it is difficult and complex, and it requires a very high level of technical support. In my view, the companies that can genuinely provide big data analysis capabilities will become more concentrated in the future. Think about it: we said information is a resource; resources get fought over and do not stay free, and those who hold resources want to make money from them. So I personally believe those who own information will be the ones delivering big data services and their value; that is an open business-model question. The beneficiaries of this kind of delivery are the public: many small and medium-sized companies, and ordinary people like us. We cannot each go out and buy a system; only cloud delivery lets me enjoy the world's leading data analysis capabilities at an affordable price.
Let me ask a question. I am often invited to visit all kinds of industrial parks; the farthest I have been is Baicheng in Jilin, a place I had never heard of, but very beautiful, with excellent, green agricultural products. They built a big data center and showed me around: very impressive, a building dozens of stories tall. My first question was: do you have any data? Not yet. Do you have any applications? Not many; mainly the three major telecom operators putting their subsidiaries' applications on top. So I have been thinking about a problem. China now has all kinds of big data parks, and in those places, where does the data come from? Where do the applications come from? Why should the information be put there? I think this is a problem that smart city planners, and especially park planners in government and business, need to consider.
Let me give a more extreme example. If I am Bank XXX and I put a disaster recovery center in your park, with a backup of my data there, is that information yours? Is it mine or yours? If it is not yours, can you run applications on it? Can you analyze it? Can you enjoy the value that information brings? So how does it even touch the edge of your smart city? How does it contribute to it? Is there any relationship? It has some relationship to the green economy. But does that data have a direct impact on your smarter city? That depends on what data you can actually use and what value it can produce. So I think questions like these, given the current rush to big data, especially in China, need deeper and calmer thought.
For the sake of time, I will jump straight to the two remaining points I think matter most:
To conclude: the examples I cite today are basically global in scope, because I want to take this opportunity to give you some references from abroad. The most widely adopted big data scenarios to date, we believe, fall into five areas: big data exploration, the 360-degree customer view, operations and operational analysis, data warehouse expansion and enhancement, and enhanced security and risk capabilities.
It is interesting that the largest application scenario for big data is big data exploration. What does that mean? It means the biggest challenge in big data right now is finding real applications that can genuinely produce value and be put into practice, as in the joke I told earlier: the leadership says the budget is not the problem; the problem is what can actually be done with the data. That is precisely why exploration dominates today's big data scenarios. Exploration is a process that starts slowly, from an initial understanding.
Let me give an example. Since 1993, IBM has been the technology partner of the four Grand Slam tennis tournaments. In 1993 we started with one thing: using sensing and high-speed camera technology to measure serve speed; the 120 km/h or 90 km/h you see on screen is where we started, and from 1993 we began collecting tennis-related data. From 2005 we developed a deeper understanding of the relationship between the data and the tournaments, and began massive collection: over the eight years from 2005 to 2012, across the four Grand Slams, we collected more than 1,800 matches, each match yielding 41 million data points from sensors, high-speed cameras, and video. With more and more data, we had a group of statisticians and mathematicians analyze what relationships it contained, and we really found some. Eventually we built a piece of software called SlamTracker, initially to enhance the spectator experience and give commentators results based on big data analysis. But its biggest application now is for coaches and the athletes themselves. For example, what did we find about Li Na? Across the four tournaments, once she and her opponent entered a prolonged rally of more than 20 shots, her scoring rate began to decline, and with every additional 10 shots it fell geometrically, until past a certain rally length she almost never won the point. A very strange phenomenon, and the statistical pattern was especially stable for Li Na. We then looked at more data and found it was very pronounced among Chinese players generally.
So I consulted some professional tennis coaches. One of Li Na's later coaches told me that Chinese players' basic training is exceptionally solid: from childhood, they spend two or three hours a day hitting rally balls, so from a young age the brain's response mechanism is formed, and once she enters a rally it does not change. Have you watched table tennis players warm up? After a while it looks like a robotic arm; the swing is the body's natural reaction, not a decision made by the head. That is, the Chinese athlete's solid training means that past 20 shots she enters another state: her brain stops doing much thinking and deciding. At that point she no longer varies the placement and depth of the ball in combination; she falls into a state of simply hitting the ball back.
Another interesting statistic: in last year's French Open final, when Serena Williams played, we calculated the scoring rate on the first ball of each service exchange. For Serena to win, her rate had to exceed 36%, while for her opponent to beat Serena, exceeding 28% was enough. When a player failed to reach her threshold scoring rate, her win rate for the whole match fell in a predictable pattern.
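As a sketch of the kind of analysis behind the Li Na finding, the Python below buckets toy rally records by length and computes the scoring rate per 10-shot band. The rally data is invented to mirror the described pattern; SlamTracker's real pipeline and features are not disclosed in this talk.

```python
from collections import defaultdict

# Toy rally records: (rally_length_in_shots, point_won_by_player).
# Invented numbers shaped like the pattern described for Li Na:
# scoring rate drops once rallies stretch past ~20 shots.
rallies = [
    (8, True), (12, True), (15, False), (18, True), (22, False),
    (24, True), (27, False), (31, False), (33, False), (36, False),
    (9, True), (14, True), (21, False), (26, False), (11, True),
]

def bucket(length: int) -> str:
    # Group rally lengths into 10-shot bands.
    lo = (length // 10) * 10
    return f"{lo}-{lo + 9} shots"

won = defaultdict(int)
total = defaultdict(int)
for length, point_won in rallies:
    b = bucket(length)
    total[b] += 1
    won[b] += point_won

# Scoring rate per band: the decline with rally length is the kind of
# signal the analysts surfaced from tournament data.
for b in sorted(total):
    print(f"{b}: {won[b] / total[b]:.0%} points won ({total[b]} rallies)")
```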
In 1993 we were doing just one thing: collecting each player's serve speed, only that. Only in 2005 did we realize we probably ought to be collecting the data broadly, without knowing what to expect from it. By 2012 we knew only that we had collected an enormous amount of data, not what to do with it. Only in the last couple of years, through algorithms, have we really come to feel these insights.
Here is another example of data exploration, and the technique is not new; emergency rooms are full of patients hooked up to such monitors. But what did we do in the program we ran in Ontario, Canada? A newborn baby has one great limitation: it has no way to describe its illness, no way to tell you where it hurts; the most it can do is cry. Doctors can judge a newborn's condition from only one source: the data detected by the instruments. The problem is that the detected changes are sometimes very small; its temperature shifts hour by hour, and sometimes a person simply cannot see what is going on. So by taking the vital-sign data of many newborns of the same age and the same ward, over their first few days, at 15-minute intervals, and summarizing and comparing it, we can discover problems in advance: this child's temperature rose 0.1 degrees and its breathing rate rose by some percentage, so it is likely to develop respiratory problems or other symptoms. What we can now see is that this analysis identifies real symptoms earlier than various other examinations, by as much as 24 hours. What do I mean? By the time an X-ray can show you have tuberculosis, many of the symptoms forming the TB happened 48 hours earlier. For a newborn, being able to intervene 24 hours in advance often means you can save its life.
This proves the information was not absent; we actually had it all along. But only when you put a lot of information together, and then do some processing and algorithmic research, can you produce results like this.
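A minimal sketch of the idea, not the actual Ontario system: compare a newborn's 15-minute readings against a cohort baseline and flag small but statistically unusual drifts. All numbers and the 2-sigma threshold below are illustrative assumptions, not clinical values.

```python
import statistics

# Toy cohort baseline: respiration rate (breaths/min) of healthy newborns
# at a given age, built from historical records.
cohort_rates = [44, 46, 45, 47, 43, 45, 46, 44, 45, 46]
baseline_mean = statistics.mean(cohort_rates)
baseline_sd = statistics.stdev(cohort_rates)

def flag(sample_rate: float, threshold_sd: float = 2.0) -> bool:
    # Flag a reading that drifts beyond `threshold_sd` standard deviations
    # from the cohort baseline -- the kind of small, early shift a clinician
    # may not spot on a bedside chart.
    return abs(sample_rate - baseline_mean) > threshold_sd * baseline_sd

# One baby's readings, one every 15 minutes, slowly trending upward.
readings = [45, 46, 46, 47, 48, 49, 50]
for i, rate in enumerate(readings):
    if flag(rate):
        print(f"alert at t+{i * 15} min: rate {rate} vs baseline "
              f"{baseline_mean:.1f} +/- {baseline_sd:.1f}")
```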
There are other cases, helping the retail industry and so on; each is a company in itself, so I will not go on. Jumping straight to this slide: Dublin, where we did intelligent transportation planning for the city. The first step was actually very simple, and several cities in China have started it too: passengers at a bus stop can know exactly when the next bus arrives, and the bus management unit, based on each bus's actual position, speed, and punctuality, can dispatch vehicles more accurately.
Looking back today, this is not particularly novel. Even in Beijing, AutoNavi and Baidu Maps show so-called real-time traffic. But let me ask: where does that real-time traffic information come from? From the transportation department, whose information comes from camera feeds across the city, in real time. Because it flows through a unified government department, which is of course good, it also has a drawback: it basically covers only the few big ring roads. What we practiced in Dublin was different: millions of people are driving, and each phone effectively reports its speed and position. If I can compare and aggregate the GPS positions of everyone who is driving, I can know the forward speed on any road and where each vehicle is. This opens a new possibility for real-time traffic. Of course there is a privacy problem, something big data can never avoid at this stage, but in principle it can provide true real-time traffic conditions.
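The aggregation described here can be sketched very simply: pool anonymized (segment, speed) probes from many phones and average per road segment. The segment names, speeds, and congestion threshold below are invented for illustration.

```python
from collections import defaultdict

# Toy probe data: anonymized (road_segment, speed_kmh) pairs derived from
# phone GPS traces. All values are illustrative.
probes = [
    ("3rd-ring-east", 18), ("3rd-ring-east", 22), ("3rd-ring-east", 15),
    ("airport-expwy", 72), ("airport-expwy", 65),
    ("chang-an-ave", 34), ("chang-an-ave", 40), ("chang-an-ave", 38),
]

# Aggregate per segment: with enough drivers, the average probe speed
# approximates the true travel speed on that stretch of road -- including
# roads no traffic camera covers.
speeds = defaultdict(list)
for segment, kmh in probes:
    speeds[segment].append(kmh)

for segment, samples in sorted(speeds.items()):
    avg = sum(samples) / len(samples)
    state = "congested" if avg < 25 else "flowing"
    print(f"{segment}: {avg:.0f} km/h ({state}, {len(samples)} probes)")
```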
Let me give another example that is not on this slide. Last year at our annual global user conference in Las Vegas, they showed me an example I found very interesting. For the US CDC, the Centers for Disease Control, influenza is an epidemic under year-round surveillance. Their monitoring and prediction of influenza used to come mainly from data collected from medical institutions and doctors: how many patients I received today, with what flu symptoms, reported the next day, then aggregated into a national picture. We ran a small experiment in a few coastal US states, analyzing posts on social media such as Twitter and Facebook, the kind that say "hard to get up this morning, sneezing nonstop, sob!" or "the queue at the hospital for my cold is so long, sob!". From posts like these you can infer the person has caught a cold. The curve from this analysis matched the shape of the US CDC's curve exactly. Social media reacts in real time, the moment someone says their nose is blocked. So big data gives us many cases like this, which are, of course, still experimental.
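A toy version of that experiment's first step might look like the following: naive keyword matching over posts to produce a flu-like signal that could then be tracked over time and compared against the clinician-reported curve. The keyword list and posts are invented; the real work presumably used proper natural language processing.

```python
# Toy posts; in the real experiment these were Twitter/Facebook messages.
posts = [
    "hard to get up this morning, sneezing nonstop",
    "the queue at the hospital for my cold is so long",
    "great weather for a run today",
    "fever and a sore throat, staying home",
    "new phone arrived!",
]

# Crude keyword matching stands in for real NLP: count posts that look
# flu-related; tracked day by day, that count becomes the epidemic signal.
FLU_TERMS = {"sneez", "cold", "fever", "flu", "sore throat", "cough"}

def looks_flu_related(text: str) -> bool:
    t = text.lower()
    return any(term in t for term in FLU_TERMS)

flu_posts = [p for p in posts if looks_flu_related(p)]
print(f"flu-like posts: {len(flu_posts)}/{len(posts)}")
# Plotted over time, a curve like this is what was compared with the
# CDC's clinician-reported influenza curve.
```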
For the sake of time I will stop here, and I want to conclude by saying this. I am very happy to see so many people here today for a big data topic, and having heard several of the speeches, I do think big data has taken root in China. But as someone who has long worked in this field, I still believe we must look at it scientifically. Our understanding of big data is very early, at the dawn, so to speak. What kind of returns and value big data will bring cannot be generalized and is not universal; it must be distilled slowly through the gradual evolution of technology and through the practice of real projects. So I believe we must not miss the new opportunity big data brings us, but neither should we over-hype it; we should do this with a scientific, pragmatic spirit.