Trend One: Hybrid architectures will fade away
Hadoop was originally created to make it easier to process unstructured and semi-structured data, but its support for structured data is incomplete, so users still need traditional databases or MPP (massively parallel processing) databases to handle structured workloads alongside Hadoop. In addition, Hadoop is designed for data volumes in the hundreds of terabytes to petabytes; when the data set is smaller than about 10 TB, Hadoop often does not perform as well as an MPP database.
To work around these problems, users have tended to deploy hybrid architectures: real-time data goes into the MPP database and historical data into Hadoop, or most of the data is kept in Hadoop while a small portion is placed in the MPP database for computation.
Over the past three years, Hadoop has grown rapidly, many companies have begun developing SQL on Hadoop, and its performance has improved considerably. Currently there are four main types of SQL engines developed natively within the Hadoop ecosystem: the first is Impala, which uses an MPP-like engine; the second is Tez, which absorbs some of Spark's design ideas; the third is Transwarp Inceptor, a SQL engine built on Spark; and the fourth is Spark SQL and Drill.
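To make the SQL-on-Hadoop approach concrete, here is a minimal sketch in Scala using the Spark 1.x API of the era: a SQL query runs as a distributed Spark job over a Hive table stored on HDFS rather than inside an MPP database. The table name "sales" and its columns are hypothetical placeholders, not taken from the article.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SqlOnHadoopSketch {
  def main(args: Array[String]): Unit = {
    // Spark 1.x style: a SparkContext plus a HiveContext for SQL over Hive tables on HDFS.
    val sc = new SparkContext(new SparkConf().setAppName("sql-on-hadoop-sketch"))
    val hiveContext = new HiveContext(sc)

    // Hypothetical structured table; the aggregation is executed by the Spark engine
    // across the cluster, illustrating SQL support natively inside the Hadoop stack.
    val topRegions = hiveContext.sql(
      """SELECT region, SUM(amount) AS total
        |FROM sales
        |GROUP BY region
        |ORDER BY total DESC
        |LIMIT 10""".stripMargin)

    topRegions.show()
    sc.stop()
  }
}
```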
With the rapid development of SQL on Hadoop, the increasing completeness of its SQL support, and its improving performance, Yuanhao believes that the hybrid architecture is fading away. He makes this prediction because the three traditional advantages of the MPP database are gradually weakening as SQL on Hadoop matures. First, traditional MPP databases have relatively complete SQL support, but Hadoop's SQL support is now approaching that level. Second, traditional MPP databases offered high processing performance, but Hadoop's performance has now surpassed MPP databases several times over. Third, traditional MPP databases have a rich set of surrounding tools; now many traditional BI vendors already support Hadoop, emerging startups have built new BI tools on Hadoop, and the tooling around the Hadoop ecosystem keeps growing richer. The Hadoop ecosystem will soon surpass that of the traditional MPP database.
In the future, Hadoop will gradually replace the MPP database; users will no longer need a hybrid architecture or have to migrate data between different databases. The MPP database will fade away and slowly merge into Hadoop, and the amount of data a user can handle on Hadoop, large or small, will scale linearly without practical limit.
Trend Two: SSDs will replace memory
With the development of hardware technology, Yuanhao has found that memory used as a cache can be replaced by large-capacity SSDs (solid-state drives). Memory reads data a hundred or even a thousand times faster than disk, but SSD performance has already begun to approach that of memory. At the same time, SSD prices are falling rapidly: today a 1 TB SSD can be bought on the Chinese market for 10,000 to 20,000 yuan. Yuanhao believes that replacing memory with SSDs is a good solution at the moment.
Hadoop 2.6 introduces the concept of storage tiers, providing three tiers of storage in HDFS (the Hadoop Distributed File System): a disk tier, an SSD tier, and a memory tier. At the granularity of 128 MB data blocks, the user can place a file in a specified tier to speed up data access. However, users quickly discover that things are not that simple. Hadoop was originally designed for large-capacity, low-speed disks; an SSD's read/write throughput is about 10 times that of an ordinary disk, and its random-access performance is roughly 1,000 times that of a disk. Without exploiting the benefits of random access, the observed performance gain is far less significant than the hardware specifications suggest.
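As a rough sketch of how these HDFS tiers are used in practice, the following Scala snippet (calling the Hadoop 2.6 Java client API) pins a hypothetical directory of hot data onto the SSD tier by assigning it the ALL_SSD storage policy. The path and cluster setup are illustrative assumptions, not part of the article, and the snippet presumes the DataNodes already expose storage directories tagged with the [SSD] storage type.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

object StoragePolicySketch {
  def main(args: Array[String]): Unit = {
    // Assumes HDFS client configuration (core-site.xml / hdfs-site.xml) is on the classpath.
    val conf = new Configuration()
    val fs = FileSystem.get(conf).asInstanceOf[DistributedFileSystem]

    // Hypothetical directory holding frequently accessed data.
    val hotPath = new Path("/warehouse/hot_orders")

    // Keep every replica of new blocks under this path on SSD media.
    // Other built-in policies in Hadoop 2.6 include HOT (disk), ONE_SSD, and LAZY_PERSIST (memory).
    fs.setStoragePolicy(hotPath, "ALL_SSD")

    println(s"Storage policy for $hotPath set to ALL_SSD")
    fs.close()
  }
}
```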
As a result, Yuanhao believes that Hadoop, built around disk reads and writes, will slowly begin to be optimized for SSDs in 2015, with more SSD-specific optimization to follow. In addition, in-memory database vendors will start to feel the bottleneck of limited memory capacity, and SSDs will become the best alternative to memory.
Trend Three: Real-time big data gets more attention
With the development of sensor networks and the Internet of Things, data is being generated faster and faster, so real-time big data technology is receiving more and more attention.
To date, no single technology has been able to handle both real-time data and large volumes of historical data. Yuanhao noted that for processing real-time and historical data together, Nathan Marz proposed the Lambda architecture (a stream-processing design built on MapReduce and Storm): real-time data enters a stream-processing system for analysis, historical data is analyzed on Hadoop, the results of the two analyses are then fused, and applications query the merged results.
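A minimal sketch of the "fusion" step in the Lambda architecture, assuming the batch layer and the speed layer each produce per-key counts (the view contents and key names below are illustrative, not from the article): a query in the serving layer combines the precomputed batch view with the incremental real-time view.

```scala
object LambdaMergeSketch {
  // Batch view: counts precomputed on Hadoop over historical data (hypothetical values).
  val batchView: Map[String, Long] = Map("page_a" -> 120000L, "page_b" -> 45000L)

  // Real-time view: counts maintained by the stream processor since the last batch run.
  val realtimeView: Map[String, Long] = Map("page_a" -> 37L, "page_c" -> 12L)

  // The serving layer answers a query by adding the two views together.
  def query(key: String): Long =
    batchView.getOrElse(key, 0L) + realtimeView.getOrElse(key, 0L)

  def main(args: Array[String]): Unit = {
    println(query("page_a")) // 120037: historical count plus real-time increment
    println(query("page_c")) // 12: seen only in the real-time view so far
  }
}
```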
However, this hybrid architecture has three problems. First, the stream-processing system discards the real-time data after processing it, keeping only the analysis results, so users cannot run ad hoc queries over the real-time data. Second, because real-time data and historical data are kept separate, it is hard to form a unified view and to stitch the two together. Third, the fused analysis results can be queried quickly, but they do not support complex statistical analysis or data mining.
The emergence of the Druid project solves not only the problem of fast ingestion but also the problem of the unified view: real-time data and historical data are stitched together into a single view, with ingested real-time data rolled into the historical view offline. However, Druid still cannot handle complex statistical analysis and data mining.
Yuanhao points out that the ideal architecture is one in which the full volume of data is streamed directly into a single database that integrates real-time and historical data and supports both high-speed queries and iterative analysis. In this way, IT staff are freed from the burden of maintaining two architectures and can analyze real-time and historical data together.
Trend Four: Cloud computing and big data will eventually converge
In the last year or two, the rapid development of virtualization technology has amounted to nothing less than a new technological revolution. First came lightweight Linux Containers (LXC, a kernel-level virtualization technology), which isolate resources between containers and make virtual machines very lightweight. Building on this, Docker developed tooling that makes it easy for users to package and move a single container or application. When many containers or applications are involved, however, deployment and migration are still difficult. This is where Google's open-source project Kubernetes comes in: it simplifies the creation of Hadoop clusters and traditional applications, provides deployment of multi-container clusters, and offers some basic services such as scheduling.
In 2013, YARN, a revolutionary resource-management framework for Hadoop 2.0, was born. YARN places resource management at the bottom of the stack and can run multiple computing frameworks on top of it. In practice, however, users found that YARN does not isolate memory, disk, and I/O resources well enough. To address this, Hortonworks has experimented with using Google's Kubernetes as a YARN application manager, with Docker (an open-source application container engine) handling resource scheduling. Meanwhile, Mesosphere has built a distributed resource-management framework with Mesos (a cluster manager) at its core for resource scheduling and Docker as the container-management tool, and has put forward the concept of a data center operating system.
Yuanhao points out that the data center operating system can be divided into three tiers. The bottom tier, like an operating system kernel, manages CPU, network, memory, and storage and can quickly create and release compute resources. The middle tier adds basic services on top of that foundation. The top tier provides platform services for creating and deploying applications such as Hadoop and Spark.
Based on this concept of the data center operating system, there are two major technical directions on the market at present. The first uses YARN as the foundation for resource scheduling, with Kubernetes running on YARN as an application framework, at the same layer as the other computing frameworks. The second abstracts the scheduler as a plug-in, so that YARN or Mesos can serve as the Kubernetes scheduler (or a custom scheduler can be implemented), with Docker or CoreOS (a lightweight Linux-based operating system) used for container management, and distributed services such as Hadoop running on top of Kubernetes. The second direction pushes resource isolation and management down to the bottom and exposes a variety of services to the top layer. Yuanhao believes the second direction is likely to become the dominant trend next year.
Big data convergence: Transwarp Technology CTO Yuanhao on the evolution of big data infrastructure technology