In "2013 Zhongguancun Big Data Day" Big Data Wisdom City Forum, cloud Human Science and Technology CEO Wu Zhuhua brings to the theme "about intelligent city thinking-real-time large data processing opportunities and challenges" speech. He believes that the opportunities for large data in various industries are as follows: Financial securities (high-frequency transactions, quantitative transactions), telecommunications services (support systems, unified tents, business intelligence), Energy (Power plant power grid Monitoring, information collection and analysis of electricity), Internet and electricity business (user behavior analysis, commodity model analysis, credit analysis), other industries such as Intelligent city, Internet of things.
Wu Zhuhua: It is a great honor to speak at this forum. I am the CEO of Shanghai Cloud Human Technology Co., Ltd. Since our company has spent the past two years focused on real-time big data analysis, I would like to share, from our perspective, the opportunities and challenges of big data, how to approach big data analysis, and a few representative cases.
My name is Wu Zhuhua. From 2006 to 2009 I worked at the IBM China Research Lab in Zhongguancun Software Park on cloud operating system development. At the end of 2009 I left IBM China Research and returned to Shanghai in 2010. In 2010 I wrote a book called "Core Technology Analysis of Cloud Computing." From 2011 through today I have been building the Cloud Human Technology team in Shanghai, concentrating on the field of data analysis, and we have launched a product called YunTable.
Next, let me talk about some opportunities and scenarios for big data analysis. What our two years of practice tells us is that big data is already changing: it used to be simply "big," and it is gradually becoming "big and fast." Real-time processing is therefore a new requirement in many industries.
We find that as long as a company holds big data assets, it generally needs real-time analysis. Take the smart city, the Internet of Things, and the Internet of Vehicles as examples. A city may have roughly a hundred thousand cameras, each sending data to the cloud data center every second; terabytes of data have to be processed every day, and real-time feedback is needed. This scenario requires real-time processing technology.
Take the Internet of Vehicles: one of our customers in this field installs a terminal in every car in a city. Each terminal sends traffic information to the cloud, on the order of 100 million records, and calculations run every minute to judge road conditions in real time and give users the best driving advice.
Financial securities also need this kind of analysis. Quantitative trading, for example, is a mainstream direction. We built a very large cloud platform for a securities institution with tens of billions of records in the back end; it provides real-time data analysis and data interfaces so that their strategies can run quickly.
In telecom, we have a case with China Mobile. In one province we loaded all of the province's Internet access records into our cluster, and the cluster feeds statistical results back to them, supporting their business support systems, business intelligence, and related statistics.
Energy analysis, for example, is mainly used in power grid monitoring and the collection and analysis of electricity consumption data.
Internet and e-commerce companies can do real-time analysis, pushing advertising to users in real time, and they can analyze commodity models to recommend the most suitable products to users. There is also credit analysis: a friend of mine does credit analysis that, within ten seconds or so, analyzes a person's data, gives the user a rating, and quickly decides whether the user is worth lending to.
In short, as long as an industry or a company holds big data assets, it generally needs big data analysis capability to strengthen its overall competitiveness.
Let me explain why real-time big data analysis is needed. First, real-time decision-making: in quantitative trading, I can compute on the data in real time and quickly decide whether or not to buy a stock.
Second, it improves business efficiency.
Third, we can freely try new algorithms or new strategies on the data. In this way we can quickly discover new ideas and opportunities through real-time experimentation.
Fourth, it provides operational output. I have found that more and more industries are starting to need this capability.
What are the challenges of big data? What is real-time big data analysis? It means completing the processing and analysis of billions of records within a few seconds, or even within one second. Fast: results within 10 seconds, and ideally within 100 milliseconds. Internet companies such as Baidu want results within 100 milliseconds, and some financial institutions want results at the microsecond level; they all demand real-time capability. So the first criterion is fast: real-time analysis.
The second criterion is big: the target data volume is on the order of a billion rows, at the terabyte level, far larger than our previous understanding of data. We used to think that anything over 10 million rows was big. The largest cluster we have encountered is at a level approaching a trillion records.
Third, we hope to support a variety of analytical operations: the simplest is a query, but it can also be a logically complex algorithm or data mining task. These are the three most important criteria for big data. The first is fast, within 10 seconds at most, and for some industries microseconds. The second is big, targeting more than a billion rows and tens of terabytes of data. The third is variety in analytical operations. On top of these three, some industries also demand concurrency; a securities company we built a big data platform for insisted that responses come back within 100 milliseconds. So achieving real-time analysis of big data is a very great challenge.
What technologies are available to choose from? The first is Hadoop. Hadoop grew out of Google's MapReduce design, and on the "big" dimension it has no problem: it handles at least terabytes of data. Its operations are also diverse, because its ecosystem offers many good algorithm tools. But on "fast" it is awkward: even a small job takes no less than a minute, and a workload with many MapReduce rounds takes a long time.
The second is NoSQL. On the "big" dimension it holds up: HBase can satisfy the "big" requirement. But HBase is a database that only supports simple queries; it is hard to do logically complex data analysis and mining on it. A company like Taobao may be rich enough to spend heavily on hardware and development to build data analysis on top of an HBase cluster, but for small and medium-sized enterprises and for traditional enterprises, using NoSQL for analysis is not very suitable: it requires huge hardware and development costs.
Do traditional databases such as Oracle support big data analysis? Their analytical capabilities are fine, but scaling up to truly big data is difficult.
What I want to introduce is our product, YunTable. It supports memory-based computing and, in terms of product positioning, can be regarded as a new-generation data warehouse. In designing YunTable we focused on two aspects. The first is memory: memory capacity keeps growing along with Moore's law, and we have done solid optimization for memory and SSD. But memory is expensive, so we introduced columnar storage, whose compression ratio is high enough to squeeze the data down very small. Columnar storage makes up for the high cost of memory just mentioned.
As hardware keeps developing, our advantages in big data keep growing with it. The third characteristic is speed: we can process massive data quickly and run statistics and analysis on it. In terms of performance, our single-node performance is dozens of times that of Oracle, and we do real-time analysis of big data. These are some of our core features.
First, we can compute on big data: a billion rows, even tens of billions of rows.
Second, it runs on ordinary x86 hardware.
Third, it can scale out to clusters of hundreds of nodes.
Fourth, it offers PB-level storage.
Fifth, it provides multi-platform SQL drivers and supports the R data mining language.
Look at our overall architecture: at the top is the driver. In the middle there is a virtual IP, which allows two master nodes to share management for high availability. Below that are the data nodes; data is distributed to each node by certain algorithms, and these nodes are developed entirely by ourselves. The core of our performance value is realized here.
Let me talk about our core technologies: why is it so fast?
First, parallel processing.
Second, hybrid row-column storage.
Third, compression.
Fourth, in-memory computing.
What is parallel processing? When data enters the cluster, it is automatically distributed to each node. For example, with a 10-node cluster, a piece of data is divided into 10 parts and each node processes its own part, which gives roughly a 10x speedup. Each piece of data can also have multiple replicas, so the failure of any node does not affect the integrity or continuity of the business data. We support the latest instruction sets for further optimization and are trying to support GPUs; this is a gradual process for us.
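To make the idea concrete, below is a minimal Python sketch of hash-partitioning rows across nodes with a replication factor; the node count, replication factor, and names are illustrative assumptions, not YunTable's actual implementation.

from collections import defaultdict

NUM_NODES = 10          # assumed cluster size, matching the 10-node example above
REPLICATION_FACTOR = 2  # assumed: each row is kept on two different nodes

def place_row(row_key, num_nodes=NUM_NODES, replicas=REPLICATION_FACTOR):
    # Return the node ids that should store this row (primary plus replicas).
    primary = hash(row_key) % num_nodes
    return [(primary + i) % num_nodes for i in range(replicas)]

def distribute(rows):
    # Distribute rows to nodes; each node later scans only its own shard.
    shards = defaultdict(list)
    for row in rows:
        for node in place_row(row["id"]):
            shards[node].append(row)
    return shards

data = [{"id": i, "value": i * 10} for i in range(100)]
shards = distribute(data)
# Each of the 10 nodes processes roughly a tenth of the data (times the replicas),
# and losing any single node still leaves a copy of every row elsewhere.
print({node: len(rows) for node, rows in sorted(shards.items())})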
The second is hybrid row-column storage. Take a simple table with three columns: name, age, and sex. Data first arrives at our front end, where we partition it. The front end is row-based, with row partitioning, which saves a great deal of development cost. Then, at the bottom layer, we transform it: a traditional row is stored as (Zhao, 25, male), whereas our back end stores all the names together, all the genders together, and all the ages together. If I run a query that only needs age, I do not have to read name and gender; I read only the columns I need, which immediately cuts the data read by two-thirds and reduces IO. There is also efficient aggregation: we keep per-block statistics, for instance a maximum of 31 and a minimum of 24. If a query asks for values greater than 32 and the maximum is only 31, the whole block can be skipped, which speeds up the whole operation.
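Here is a minimal Python sketch of that row-to-column transformation with per-block min/max statistics; the table follows the name/age/sex example above, and the layout is an illustrative assumption rather than YunTable's actual format.

rows = [("Zhao", 25, "male"), ("Li", 31, "female"), ("Wang", 24, "male")]

# Row-to-column transformation: each column is stored contiguously.
columns = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "sex":  [r[2] for r in rows],
}

# Per-block min/max statistics let a query skip whole blocks without reading them.
age_stats = {"min": min(columns["age"]), "max": max(columns["age"])}  # 24 and 31

def ages_greater_than(threshold):
    # Only the "age" column is touched; name and sex are never read.
    if threshold >= age_stats["max"]:
        return []  # e.g. "age > 32": the maximum is 31, so the block is skipped
    return [a for a in columns["age"] if a > threshold]

print(ages_greater_than(32))  # [] -- block skipped via the min/max statistics
print(ages_greater_than(24))  # [25, 31]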
Finally, on top of the columnar storage we add some new index structures. In our understanding, traditional position-based indexes are very costly and are not necessarily suitable for big data scenarios.
Next is efficient compression. On top of columnar storage we have done some optimization, so the compression ratio is very high, roughly 7 to 20 times, and we support a variety of compression algorithms, both lightweight and deep. For hot data we can use a lightweight algorithm, while rarely used data can be compressed down very small.
Normally, compressed data has to be decompressed before it can be processed. We now want to operate on the compressed data directly, which improves performance further.
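As an illustration of operating on compressed data directly, the sketch below uses run-length encoding, so a sum can be computed without decompressing; this is a generic example of the technique, not YunTable's actual codec.

def rle_encode(values):
    # Compress a column into (value, run_length) pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def sum_compressed(runs):
    # Sum the column without decompressing: one multiplication per run.
    return sum(value * length for value, length in runs)

column = [3, 3, 3, 3, 7, 7, 1, 1, 1, 1, 1]      # a low-cardinality column compresses well
runs = rle_encode(column)                        # [(3, 4), (7, 2), (1, 5)]
print(runs, sum_compressed(runs), sum(column))   # same sum, far fewer operations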
The last technique is in-memory computing. The current trend is that servers have more and more memory, so we can build a very large cluster that holds a great deal of data, with a company's core business data all inside it. We once built a big data platform for a financial institution with about 10 machines and roughly 2.5 TB of memory in total; with a compression ratio of about 10x, it can hold 25 TB of data in memory, which covers the institution's business data for processing. First, the data can be processed very quickly; and because it all sits in memory, the system can sustain a large number of concurrent operations.
Finally, compare this with Oracle: to analyze 1 TB of data, it needs a few hours just to move the data from disk into memory. With a table of 50 columns where the query really only needs 5, I read just those 5 columns and skip the other 45, which greatly reduces the amount of IO.
Add compression and in-memory computing on top of that, and 1 TB of IO drops from a few hours to a few seconds.
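A rough back-of-envelope calculation of how these factors multiply, using the figures quoted above (5 of 50 columns read, roughly 10x compression); the arithmetic is only indicative.

table_size_tb = 1.0          # raw table size
column_fraction = 5 / 50     # only 5 of the 50 columns are actually read
compression_ratio = 10       # ~10x columnar compression (the 7-20x range quoted above)

io_after_pruning = table_size_tb * column_fraction           # 0.1 TB
io_after_compression = io_after_pruning / compression_ratio  # 0.01 TB, about 10 GB

print(f"Data actually read: ~{io_after_compression * 1024:.0f} GB instead of 1 TB")
# Combined with keeping the working set in memory, this is how hours of disk IO
# can shrink to seconds.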
Let me describe a few simple cases. The first is the Internet: social networks and e-commerce companies can use our real-time technology for ad delivery and real-time monitoring. An advertiser buys ads on a platform such as Taobao, and the ads are published to sites such as QQ and Sina. When users see the ads they generate logs; the logs are sent to the monitoring platform, which forwards them into YunTable, and YunTable analyzes the data.
With this we can analyze, for example, how many distinct people clicked a given ad; by analyzing typical click patterns we can also take anti-fraud measures. We can likewise do overlap analysis and multidimensional analysis.
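A minimal sketch of such a click analysis is shown below; the table and column names are assumed for illustration, and sqlite3 stands in for the SQL interface so the example is self-contained.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_clicks (ad_id TEXT, user_id TEXT, site TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO ad_clicks VALUES (?, ?, ?, ?)",
    [("ad1", "u1", "qq", "2013-12-01 10:00:00"),
     ("ad1", "u1", "qq", "2013-12-01 10:00:05"),   # repeated click by the same user
     ("ad1", "u2", "sina", "2013-12-01 10:01:00"),
     ("ad2", "u2", "sina", "2013-12-01 10:02:00")],
)

# How many distinct people clicked each ad; a user with many repeated clicks
# can be flagged for anti-fraud checks.
for ad_id, users, clicks in conn.execute(
    "SELECT ad_id, COUNT(DISTINCT user_id), COUNT(*) FROM ad_clicks GROUP BY ad_id"
):
    print(ad_id, users, clicks)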
The second is telecom operators; many operators can use this system. For China Mobile in one province, the scenario is roughly an online usage complaint system: a user finds that his data consumption is unexpectedly heavy and calls 10086 to ask why. 10086 immediately pulls up his last month of Internet access records and can tell him, for example, that he spent a lot of traffic on Youku or some other site. But 10086 wanted the answer within 10 seconds, and they had tried many solutions before. Finally they tried ours: on about 6 nodes holding 115.3 billion records, our queries return results within one or two seconds. That is very fast.
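Below is a sketch of the per-user traffic lookup behind such a complaint call; the schema and names are assumed for illustration, and sqlite3 again stands in for the SQL interface so the example runs on its own.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_records (msisdn TEXT, site TEXT, bytes INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO usage_records VALUES (?, ?, ?, ?)",
    [("13800000000", "youku.com", 800000000, "2013-11-03"),
     ("13800000000", "weibo.com", 50000000, "2013-11-05"),
     ("13800000000", "youku.com", 600000000, "2013-11-20")],
)

# One month of a single subscriber's traffic, grouped by site and sorted by volume:
# the answer the 10086 agent needs within a few seconds.
query = """
    SELECT site, SUM(bytes) AS total_bytes
    FROM usage_records
    WHERE msisdn = ? AND day BETWEEN '2013-11-01' AND '2013-11-30'
    GROUP BY site
    ORDER BY total_bytes DESC
"""
for site, total in conn.execute(query, ("13800000000",)):
    print(site, f"{total / 1e6:.0f} MB")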
Compare this with Intel's HBase solution: where they loaded one unit of data, our six nodes held more than a thousand times as much, a gap of roughly 1000x.
The third is finance. We built a quantitative trading platform for a domestic securities institution, with 80 billion records, about 2.5 TB, at the bottom, and we provide interfaces that help them process it in real time. On these 10 nodes we aggregate the trades into 30-second windows and compute the price and its weighted average. I can also take a time range and query, say, the first 50 records. All of these queries finish within 100 milliseconds, on the order of 50 milliseconds. And we support concurrency: roughly 1,000 people can run similar operations at the same time.
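A minimal sketch of that kind of 30-second window aggregation follows, computing a volume-weighted average price per window; the field names and the weighting are illustrative assumptions, written in plain Python rather than on the production platform.

from collections import defaultdict

trades = [
    {"ts": 0,  "price": 10.00, "volume": 200},
    {"ts": 12, "price": 10.05, "volume": 100},
    {"ts": 31, "price": 10.10, "volume": 300},   # falls into the next 30-second window
    {"ts": 55, "price": 10.08, "volume": 150},
]

WINDOW = 30  # seconds

buckets = defaultdict(list)
for t in trades:
    buckets[t["ts"] // WINDOW].append(t)

for window_id in sorted(buckets):
    ticks = buckets[window_id]
    volume = sum(t["volume"] for t in ticks)
    vwap = sum(t["price"] * t["volume"] for t in ticks) / volume
    print(f"window {window_id * WINDOW}-{(window_id + 1) * WINDOW}s: "
          f"volume={volume}, weighted average price={vwap:.3f}")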
Our product is YunTable. It can process big data in real time, and we already have mature cases in telecommunications. You are welcome to get in touch and exchange ideas.