"Csdn Live Report" December 2014 12-14th, sponsored by the China Computer Society (CCF), CCF large data expert committee contractor, the Chinese Academy of Sciences and CSDN jointly co-organized to promote large data research, application and industrial development as the main theme of the 2014 China Data Technology Conference (big Data Marvell Conference 2014,BDTC 2014) and the second session of the CCF Grand Symposium was opened at Crowne Plaza Hotel, New Yunnan, Beijing.
Transwarp Technology CTO Sun Yuanhao's talk was titled "Big Data Infrastructure Technology Trends for 2015." In it he summarized four trends: SQL-on-Hadoop technology is improving the completeness and performance of its SQL support significantly, and hybrid architectures will gradually disappear; computing is moving from in-memory computing to on-SSD computing, with solid-state disks replacing memory as the cache; data is being produced, and needs to be processed, ever faster, so real-time big data technology is drawing attention; and with the rapid evolution of virtualization technology and the increasing platformization of Hadoop, cloud computing and big data will finally converge. He also shared a statistic about Spark: nearly 50 companies around the world now offer Spark products and services, and 11 of them offer commercial Spark distributions.
Transwarp Technology CTO Sun Yuanhao
The following is a transcript of the speech
Sun Yuanhao:
Thank you, Dr. Cha. My talk today is on big data technology trends for 2015. We have been doing big data work in practice over the past few years and have some experience to share with you. We have made a forecast for next year and invite everyone to verify it together.
The first trend is that with the rapid development of SQL on Hadoop, and the significant gains in SQL completeness and performance, we believe the hybrid architecture will gradually begin to disappear.
Let me first explain why the hybrid architecture exists. Hadoop started about ten years ago with the internet companies and has since been used more and more inside enterprises; it is very good at handling unstructured and semi-structured data. Its handling of structured data, however, was incomplete, so users felt they still needed a database, or an MPP database, to deal with the structured data. The second reason is that Hadoop was designed for hundreds of terabytes or several petabytes of data; when the data volume is small, below 100 TB or even 10 TB, we found that Hadoop's performance is not as good as a traditional MPP database's. So people felt they needed a hybrid architecture: put all the data on Hadoop and move part of it into an MPP database for computation, or put the real-time data into the MPP database and the historical data into Hadoop, and let Hadoop do the computation once the accumulated volume becomes very large. That is the typical deployment of the hybrid architecture.
Looking at Hadoop's development over the past three years, many companies have moved quickly on SQL support and performance has improved dramatically. We count roughly four kinds of SQL-on-Hadoop technology on the market, by which I mean companies and technologies that build the SQL engine natively within the Hadoop system. The first is Impala, which uses an MPP-like engine. The second is Tez, which absorbs some of Spark's design ideas; that project took shape around May or June of 2012. The third is our own product, Transwarp Inceptor, a SQL engine built on Spark; we released the first version last October, and it currently supports SQL 2003, including functions, cursors, and other features, so its SQL support is currently the most complete among all Hadoop offerings. Alongside these there are Spark SQL and Drill. Each of the four engines is developing its own technology independently, and we believe Spark will become the mainstream. We can already pass all the TPC-DS test queries; TPC-DS measures data warehouse performance and contains a large number of non-equi join statements, which makes it harder for a SQL engine to support.
So the first prediction we make is that the hybrid architecture will fade away. The MPP database used to have three advantages. The first was complete SQL support; now our SQL support is close to that of an MPP database. The second was performance higher than Hadoop's; but we now see Hadoop performance exceeding MPP databases by several times. The third was that the BI and other tools on top were very complete; but the traditional BI vendors have been turning to Hadoop, the BI tools in the Hadoop ecosystem are getting richer and richer, and new startups are building BI tools that support Hadoop natively. From this perspective, Hadoop's ecosystem will soon surpass that of the traditional MPP database.
We feel that over the next year or two, Hadoop will gradually replace the MPP database: you will not need a hybrid architecture, and you will not need to migrate between different databases. Some people say they are also migrating onto Hadoop, and it is indeed true that the MPP database as a whole is slowly disappearing into Hadoop. The end result we hope for is that all the data sits on Hadoop; whether it is a few hundred gigabytes or at the 10 PB level, it can all be processed on Hadoop, with truly unlimited linear scaling.
We have also observed that Spark is now the most popular computing engine. Impala has been under development for three years and its SQL support is still incomplete, whereas with Spark you can parallelize SQL quickly and grow the completeness of SQL support quickly. At the same time, with the Spark engine we have shown that a new engine's performance can exceed that of the MPP database. The Spark community has grown very fast this year; at the Spark Summit this June, the original Hadoop ecosystem vendors and projects all announced full support for Spark. I did a simple count: nearly 50 companies around the world have Spark products and services, and 11 of them offer a commercial Spark distribution; they are all shown here, and we are also a certified Spark distribution vendor.
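To make the "SQL on Spark" idea concrete, here is a minimal sketch using the plain Spark 1.x SQLContext (stock Spark, not Inceptor itself); the Order schema, HDFS path, and table name are invented for the example. The SQL is planned into ordinary Spark jobs, so every stage of the query runs as parallel tasks across the cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative record type; the schema, path, and table name are assumptions.
case class Order(id: Int, amount: Double)

object SqlOnSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SqlOnSpark"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // Spark 1.x implicit: RDD[Order] -> SchemaRDD

    val orders = sc.textFile("hdfs:///data/orders.csv")
      .map(_.split(","))
      .map(f => Order(f(0).toInt, f(1).toDouble))
    orders.registerTempTable("orders")

    // The SQL text below is planned and executed as ordinary parallel Spark jobs.
    sqlContext.sql("SELECT COUNT(*) AS cnt, SUM(amount) AS total FROM orders")
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```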
The second trend used to talk about memory computation. At that time, everyone thought that this is a very good direction, put data into memory for caching, memory speed is a disk hundred times to thousands of times, we apply this technology to the reality found that small memory and high price is a relatively large limit conditions, all the data in full memory time , like Spark, and Hadoop are all running on the JVM, and when the memory is large, the GC impact is very serious, can we cache the data in a better way? With the development of hardware technology, we find that memory can be replaced by large capacity SSD for caching, which is also a very obvious trend.
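As one generic illustration of the GC problem and the "better cache" idea (this is stock Spark, not the Holodesk approach described later), serialized caching keeps far fewer live objects on the JVM heap, and Spark's spill directory can be pointed at an SSD volume; the application name and paths below are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SerializedCacheOnSsd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SerializedCacheOnSsd")
      // Point Spark's spill/shuffle directory at an SSD-backed volume (path is illustrative).
      .set("spark.local.dir", "/mnt/ssd/spark-local")
    val sc = new SparkContext(conf)

    val records = sc.textFile("hdfs:///logs/2014/12/*")

    // Serialized caching keeps far fewer live objects on the heap than MEMORY_ONLY,
    // which eases GC pressure; blocks that do not fit spill to the SSD-backed local dir.
    records.persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(records.count())

    sc.stop()
  }
}
```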
This chart lists the development of SSD hardware. At the bottom is the hard drive; SSDs are now several levels above it. Take the Intel P3700 PCIe SSD on the market as an example: its read performance is about 460,000 operations per second, roughly 1,000 times a hard drive, and its throughput is more than 10 times a hard drive's. Some vendors put SSDs into memory slots to form SSD DIMMs; their performance is not as good as the P3700's, but it will gradually improve. Compared with memory, the SSD's performance gap is no longer that large: memory throughput is about 8.5 GB/s and PCIe SSD about 2.8 GB/s, only a three- to four-fold gap, so SSD performance is starting to approach memory. Meanwhile the price is falling rapidly; there are many SSD companies in the Chinese market and the overall price drop is very obvious. Today in China you can buy a 1 TB SSD for 10,000 to 20,000 yuan, whereas the same money buys only about 128 GB of memory, so we think replacing memory with SSD is the better solution.

Here is a set of our own data comparing disk, memory, and SSD. Taking the data-on-disk performance as 1, we found that memory performance is still the highest; the blue line is performance with the data in memory, and the red line is PCIe SSD performance. SSD performance is close to memory's: looking at the comparison chart for each test scenario, the gap is at most 30%, and the average difference is 9.6%, basically within 10%. So at one-tenth of the price of memory you get a product that differs from it by only about 10% in performance. Where previously you could only put a few hundred gigabytes or a few terabytes of data in memory, now you can put dozens of terabytes on SSD for data analysis.
Hadoop 2.6 introduces a concept called storage tiers, which provides several tiers of storage in HDFS: a disk tier, an SSD tier, and a memory tier. You can specify which tier a file is placed on, at the granularity of 128 MB blocks, with the idea that performance can be improved quickly this way. We soon found it is not that simple. Hadoop was originally designed for large-capacity, low-speed disks; an SSD has roughly 10 times the sequential read/write performance of an ordinary disk and roughly 1,000 times its random access performance, and if you cannot exploit that random access performance, the improvement will not be as significant as the hardware specs suggest. We tried the ORC format, and performance was only about 3 times better than on an ordinary hard drive. We see two trends for next year. The first is that disk-oriented Hadoop is slowly starting to be optimized for SSDs, and more SSD-specific optimization will be done in the future. The second is that in-memory database vendors are beginning to find memory insufficient: they cannot fit all the data in memory, and for dozens of terabytes they need a larger medium, for which SSD is the ideal replacement; many traditional database vendors are now optimizing their databases specifically for SSD. We designed a new data format called Holodesk. Spark data used to be placed in memory; we first stripped the data out of Spark and placed it on external media, then stored it on the SSD with compression and our own proprietary encoding, plus some indexes on top. After this change, performance improved considerably. Here is a test comparison of four combinations, starting from text format on disk and running part of TPC-DS on SSD. We chose a subset of TPC-DS scenarios because some scenarios are CPU-bound, where disk performance is not the bottleneck and there may be no improvement at all, so we measured IO-intensive scenarios. You quickly find that if you do not change the file format, the same file is at most 1.5 times faster on SSD than on disk; likewise, when we switched a Hadoop cluster's DataNodes to SSD two or three years ago, we got roughly a 40% improvement. Converting the data to ORC format does help: it can filter out a lot of data and make fuller use of the SSD, and that format improves performance by about 2.7 times. But that still does not fully exploit the SSD, so we adopted the Holodesk storage format we designed, whose encoding is somewhat different; it gives more than twice the improvement of ORC, and in purely IO-intensive test scenarios it can be 5 to 10 times faster. With the new columnar storage we can be 8 to 10 times faster than disk. I believe more software will be optimized specifically for the characteristics of SSDs in the future.
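For reference, here is a minimal sketch of how a file can be pinned to the SSD tier using the HDFS storage policies introduced around Hadoop 2.6, as described at the start of this section (this is stock HDFS, not Holodesk; the path and policy choice are illustrative).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

object PinToSsdTier {
  def main(args: Array[String]): Unit = {
    // Assumes a Hadoop 2.6+ client with core-site.xml/hdfs-site.xml on the classpath.
    val conf = new Configuration()
    val fs = FileSystem.get(conf).asInstanceOf[DistributedFileSystem]

    // Ask HDFS to keep this directory's block replicas on SSD DataNode volumes.
    // Other built-in policies include HOT (disk), ONE_SSD, and LAZY_PERSIST (memory).
    val hotTable = new Path("/warehouse/traffic_current") // path is illustrative
    fs.setStoragePolicy(hotTable, "ALL_SSD")

    fs.close()
  }
}
```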
The third trend: with the development of sensor networks and the Internet of Things, data is being generated faster and faster, and on the internet data has of course always been generated in real time. As a result, real-time big data technology is gradually beginning to receive more attention, and we expect more applications next year.
How do you handle real-time data and historical data together? Nathan Marz proposed the Lambda architecture: real-time data enters a streaming system for detection and analysis, and also flows into Hadoop, where the full data set is used to analyze history; the results are merged, and applications access a serving database for analysis. So far no single technology has been able both to handle real-time data and to process large amounts of historical data, which is why Nathan proposed this hybrid architecture, and many people have followed it. But it has weaknesses. First, the real-time data that flows through the streaming system is still discarded; only the results are kept, so you cannot run ad hoc queries over the real-time data. Second, the real-time data and the historical data are kept separate, so forming a unified view and stitching the two together at the end is rather difficult. Third, the serving DB can answer fast lookups but cannot do statistical analysis. People soon recognized these three weaknesses and came up with a remedy: a project called Druid, which has received a lot of attention and is now used by Twitter and Yahoo for real-time data analysis. Druid merges real-time data and historical data into a single view, collecting the real-time data and folding it into a historical view for analysis. It solves the problems of fast ingestion and the unified view, but it does not solve the problem of complex statistics and mining.

The ideal architecture is for data to enter a database directly after stream processing, a database that can fully combine real-time and historical data and support both high-speed queries and iterative analysis. That would be ideal: you are spared maintaining two architectures, and you can analyze not only the real-time data but also the historical data. We are still thinking about how to achieve it, and we made an attempt. This is a common problem we find when applying big data technology to various industries in China; the example comes from real-time analysis of traffic flow. With the whole cluster deployed, they want to see the real situation at every intersection at any moment of the morning rush hour; when a traffic accident occurs, a local congestion has a ripple effect, a chain reaction, and its impact needs to be analyzed quickly. This is not easy to implement well with the Lambda architecture. We built a distributed cache, Holodesk, over memory or SSD. When the real-time data streams in, we first use Spark Streaming for real-time detection and alerting. It is a micro-batch system whose interval can be as short as 100 milliseconds; smaller delays are still not achievable, because the Spark framework itself has relatively long latency. Once the data is in Spark's memory we can do a great deal of real-time detection and even real-time mining on top of it, and then the results, together with the raw data, are mapped into a two-dimensional relational table, converted with SQL, and written into the historical store. In the past we could only keep that in memory; when memory capacity is not enough we put the full data on SSD. Holodesk supports fast insertion, so all the real-time data and the historical data can be cached on SSD. A cluster of perhaps 10 or 20 nodes can store several years of traffic data, so the historical data and the real-time data can be analyzed together in full.
This conversion is also very fast. We support high-speed data insertion, and data persistence is guaranteed: because the data is on SSD, even a power failure is not a problem. The solution addresses three problems: first, there is a unified view over both historical and real-time data; second, you can do arbitrarily complex analysis through standard SQL or the R language; third, data persistence is taken care of. One problem remains unsolved: if this data is opened up to online users, the concurrency is not sufficient. The way to improve it is to keep reducing query latency and to expand the cluster to increase concurrency. The solution handles the problems facing the transportation industry well, and the problem is quite common: beyond transportation, website click logs can be analyzed this way, and sensor data, such as data from factory sensors, can be analyzed rapidly, with all the sensor data held in a single table.
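Here is a minimal sketch of the ingestion path described above, using plain Spark Streaming: micro-batches are scanned for congestion alerts, and each batch could then be appended to the memory/SSD-backed store for later SQL analysis. The host, port, record layout, and speed threshold are all made up for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TrafficStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TrafficStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // micro-batch interval; Spark cannot go much shorter

    // Assume each line arrives as "intersectionId,avgSpeedKmh" from a collector service.
    val lines = ssc.socketTextStream("collector-host", 9999)
    val congested = lines
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))
      .filter { case (_, speed) => speed < 10.0 } // crude congestion rule, just for the sketch

    congested.print() // real-time detection/alerting stage
    congested.foreachRDD { rdd =>
      // In the architecture above, each batch would also be mapped into a relational table
      // and appended to the memory/SSD cache so that history stays queryable with SQL.
      rdd.count() // placeholder action that materializes the batch
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```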
The fourth trend: with the rapid evolution of virtualization technology, we believe cloud computing and big data can finally converge.
The value of virtual machines for rapid deployment has been validated over time: virtualization splits one machine into many small machines, each used separately. Big data goes the other way: one machine is not enough, and I need hundreds or thousands of machines combined into one to do the processing. How do the two blend together? Do I build the cluster out of virtual machines instead of physical machines? That attempt has basically failed, because the IO bottleneck is very serious; moreover, big data applications running in virtual machines often push CPU utilization to 99%, which very few people do on a virtual machine, so it is a big test for the hypervisor and stability becomes a big problem. Virtualization technology has developed rapidly in the last two years, amounting to nothing less than a new technological revolution. First, lightweight Linux container technology appeared; containers provide resource isolation, which makes the "virtual machine" very lightweight. Then a company called Docker noticed that packaging, migrating, and installing applications was still inconvenient, and built a tool that makes packaging and migrating an application very easy. That is still not enough, because while creating a single container or a single application is easy, applications spanning multiple containers remain cumbersome. Google therefore developed an open-source project called Kubernetes, which simplifies the creation of container clusters: you can easily create a Hadoop cluster or deploy traditional applications, it handles deployment of multi-container clusters and provides basic services such as scheduling, and it is beginning to look like the prototype of a distributed operating system. In another direction, the big data field launched YARN, the resource management framework of Hadoop 2.0, last year. That is genuinely revolutionary, because with resource management at the bottom, all kinds of computing frameworks can run on top. But we then found that YARN's resource isolation is not good enough: memory, disk, and IO are not managed well. So Hortonworks is trying to use Google's Kubernetes as an application manager on YARN, with Docker used for resource scheduling. Meanwhile another company, Mesosphere, has risen suddenly: taking Mesos as the core of resource scheduling and Docker as the basic container management tool, it has built a distributed resource management framework and proposed the concept of a data center operating system. The company recently raised tens of millions of dollars. Although the underlying technology is changing rapidly, that has not stopped some companies from already offering Hadoop as a service, such as Altiscale, BlueData, and Xplenty.
There has been a revolution in this field over the past year or two, with very big changes from the underlying virtualization technology on up, gradually leading to the concept of the data center operating system. We divide the data center operating system into three layers. The bottom layer is like an operating system kernel: it can easily create and destroy computing resources, covering CPU, network, memory, and storage. At the same time we need a mechanism for services to discover one another; that mechanism is still lacking, and more basic services need to be added at this layer. Above it is the platform services layer, where we can create Hadoop, Spark, and so on, and also deploy traditional applications. For this architecture there are currently two technical directions in the market, and we do not know which will win. One direction uses YARN as the basis of resource scheduling, with Kubernetes as an application framework running on YARN, although in practice Kubernetes ends up tied to the same level as YARN. The other direction abstracts the scheduler as a plugin: YARN or Mesos can serve as Kubernetes' scheduler, or you can implement your own; Docker or CoreOS handles container management; and distributed services such as Hadoop run on top of Kubernetes, which can then provide resource isolation and management for all the services, including the Hadoop ecosystem. This may be next year's trend. It is hard to say who will win, but I prefer the second direction; we can try both options first and see which proves more viable.
To sum up, we group next year's trends into four. First, the hybrid architecture will gradually disappear. Second, SSDs are slowly replacing memory as the cache, because their price/performance is better. Third, real-time big data technology will gain widespread attention and application. Fourth, cloud computing and big data can finally converge. Let me also put in an advertisement: you can visit our booth outside to see our new release, and anyone interested is welcome to join our company.
For more highlights, please follow the live coverage of the 2014 China Big Data Technology Conference (BDTC), follow @CSDN Cloud Computing on Sina Weibo, and subscribe to the CSDN Big Data WeChat account.