The Evolution of the Apache Kylin Big Data Analytics Platform
Source: http://mt.sohu.com/20160628/n456602429.shtml
I am Li Yang, co-founder and CTO of Shanghai Kyligence. Today I would mainly like to share with you the new features and architecture changes in Apache Kylin 1.5.
What is Apache Kylin?
Kylin is an open source project that has grown over the last two years; it is not yet well known abroad, but it is widely known in China. Kylin is positioned as a multidimensional analysis tool on the Hadoop big data platform. It was first incubated at eBay's research lab in Shanghai, and it provides an ANSI SQL interface that supports very large datasets and aims to return query results within seconds. Open-sourced in October 2014, Kylin is now one of the few Apache top-level projects driven primarily by Chinese developers.
1. SQL Interface
Most Hadoop analysis tools are SQL-friendly, so a SQL interface is especially important for Apache Kylin. Kylin's ANSI SQL can take over a large part of Hive's workload; as long as you avoid Hive-native dialect, Kylin and Hive are almost completely compatible, and Kylin belongs to the SQL-on-Hadoop family.
The main difference between Kylin and other SQL-on-Hadoop engines is the offline index. Before querying, the user selects a set of Hive tables and builds an offline cube on top of them; once the cube is built, SQL queries run against it. The relational model exposed to SQL is identical to the original Hive tables, so existing Hive queries can be migrated to Kylin unchanged.
Offline computing takes the place of online computing: the complex, computation-heavy work is finished in the offline phase, so the online computation becomes small and query results come back quickly. In this way Kylin spends less computation per query and sustains higher throughput.
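To make this concrete, here is a minimal sketch of issuing such a query through Kylin's standard JDBC driver (the host, project, credentials, table, and columns in this example are made up for illustration):

```java
import java.sql.*;

// Minimal sketch: run an ordinary aggregate SQL against Kylin over JDBC.
// Kylin answers it from the pre-built cube rather than scanning Hive data.
public class KylinQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.kylin.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:kylin://kylin-host:7070/my_project", "ADMIN", "KYLIN");
             Statement stmt = conn.createStatement();
             // The same SQL that would run on Hive, unchanged.
             ResultSet rs = stmt.executeQuery(
                 "SELECT part_dt, SUM(price) FROM sales GROUP BY part_dt")) {
            while (rs.next())
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
    }
}
```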
2. Big Data
eBay announced in 2015 that Kylin already held billions of rows, and in 2016 the figure is sure to exceed a hundred billion. But this may not be Kylin's biggest deployment: according to figures we got from China Mobile, they may load tens of billions of incremental rows into the Kylin system every day, perhaps surpassing that total in little more than ten days. Many of China's first-tier Internet companies also use Kylin for multidimensional data analysis.
3. Low Latency
Kylin's query performance is quite good, which was one of the original design goals. Our goal is to return query results within seconds; in actual production systems, 90% of Kylin queries return stably within three seconds. And this is not achieved on just one or two special SQL statements, but across tens of thousands of varied and complex queries.
You can see that in one day of Kylin query latency there is a spike, so it is not the case that every query becomes fast simply because Kylin is used; but after tuning, most queries will be very fast.
4. Integration of BI Tools
Kylin provides standard ODBC and JDBC interfaces that integrate well with traditional BI tools. Analysts can keep using the tools they are most familiar with and still enjoy the speed Kylin brings.
5. Scalable Throughput
Because Kylin substitutes offline computation for online computation, it does less work online than other tools and can deliver higher throughput on a fixed hardware configuration.
This is an experiment examining Kylin's linear scalability under two relatively complex queries. We increased the number of Kylin query engines on fairly ordinary machines: throughput grows linearly from one instance to four, and Kylin can support about 250 queries per second. Of course, this experiment did not hit the whole system's bottleneck; in theory, Kylin's bottleneck will eventually fall on its storage engine. So, as long as storage keeps up, we can extend Kylin's throughput by scaling the storage engine.
New Features in Apache Kylin 1.5
1. Extensible Architecture
Kylin uses an extensible architecture. The user's data first lands in Hive; then, based on the cube description defined in the metadata, an offline cube build runs, and the finished cube is stored in HBase. When a query arrives from above, whether through the SQL interface or the REST API, the query engine routes it to the pre-built cube and returns results without touching the original Hive data, which greatly improves system performance.
The so-called extensible architecture means abstracting Kylin's three dependencies behind interfaces so that each can be replaced. The three dependencies are the Hive data source, the MapReduce distributed computation engine, and the HBase storage engine. Everything is metadata-driven: the data source, build engine, and storage are declared in the cube's metadata, and the three dependencies are instantiated through factory classes. On their own they have no coupling and cannot see each other, so they cannot work together; a form of the adapter pattern connects them. Imagine the MapReduce engine as a motherboard with an input slot and an output slot that connect the data source on the left and the storage on the right. Hive and HBase each contribute an adapter component that plugs into the motherboard; once the three parts are connected, data flows from left to right and the cube build completes.
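As a rough illustration of this adapter idea, the sketch below uses simplified, hypothetical interface names (not Kylin's exact SPI) to show how a factory-created source, engine, and storage can be wired together without referencing each other:

```java
// Simplified sketch of the plug-in idea: source, build engine, and storage
// know nothing of each other; the build engine acts as a motherboard with
// an input slot (left) and an output slot (right).
interface DataSource {                       // left slot, e.g. Hive or Kafka
    Iterable<String[]> readRows(String table);
}

interface StorageEngine {                    // right slot, e.g. HBase
    void writeCuboids(String cubeName, Iterable<byte[]> rows);
}

interface BuildEngine {                      // the motherboard in the middle
    void buildCube(String cubeName, DataSource in, StorageEngine out);
}

final class CubeBuildJob {
    // In Kylin the concrete classes are declared in the cube's metadata and
    // instantiated through factories; they never reference each other directly.
    static void run(String cube, DataSource src, BuildEngine eng, StorageEngine st) {
        eng.buildCube(cube, src, st);
    }
}
```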
With this foundation, we can try different build engines, data sources, and storage engines on Kylin. We tried Spark as the cube build engine, but from our experimental results the Spark engine does not yet bring a particularly large performance boost. Besides Hive, data sources can now include Spark and Kafka. Storage is what people care about most: when we initially chose HBase as Kylin's storage engine, many people were puzzled and asked why we did not try Kudu or another storage engine; with this extensible architecture, you can try different storage engines yourself.
The extensible architecture brings many benefits. First, freedom: Kylin used to be tied to the Hadoop platform, depending on Hive, MapReduce, and HBase; with this architecture you can try alternative technologies. Second, scalability: the system can accept multiple data sources, such as Kafka, as well as better distributed computation engines, such as Spark. Third, flexibility: different build algorithms suit different datasets; the system now offers several cube-building algorithms, and users can pick one according to the characteristics of their dataset.
2. Layered Cubing
MRv1 is the older cube engine, using a very simple cube-building algorithm. The figure shows a layered build: first group by all four dimensions A, B, C, D; then compute the three-dimension cuboids from the four-dimension result, and so on down to the two-dimension and one-dimension results.
This layered mode exploits MapReduce's shuffle and merge sort to do much of the aggregation, which reduces development effort. But it also brings problems: because aggregation happens on the reduce side, the mappers throw raw data straight onto the network, and MapReduce's shuffle gathers it at the reducers. That produces heavy network overhead, and the network is the bottleneck of most Hadoop systems. Our data shows that layered cubing puts network pressure equivalent to about 100 times the cube size; that is, if the cube is 10 TB, the network traffic may reach 1000 TB.
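The layering itself is easy to picture in code. This small sketch (illustrative only, not Kylin's implementation) enumerates the cuboids level by level, each child dropping one dimension from its parent:

```java
import java.util.*;

// Enumerate cuboid layers for dimensions A,B,C,D, encoded as bitmasks.
public class LayeredCuboids {
    public static void main(String[] args) {
        String[] dims = {"A", "B", "C", "D"};
        int full = (1 << dims.length) - 1;             // base cuboid ABCD
        List<Integer> layer = List.of(full);
        for (int level = dims.length; level >= 1; level--) {
            System.out.println("Layer " + level + ": " + names(layer, dims));
            Set<Integer> next = new TreeSet<>();
            for (int cuboid : layer)                   // each child drops one dimension
                for (int d = 0; d < dims.length; d++)
                    if ((cuboid & (1 << d)) != 0) next.add(cuboid & ~(1 << d));
            layer = new ArrayList<>(next);
        }
    }
    static List<String> names(List<Integer> cuboids, String[] dims) {
        List<String> out = new ArrayList<>();
        for (int c : cuboids) {
            StringBuilder sb = new StringBuilder();
            for (int d = 0; d < dims.length; d++)
                if ((c & (1 << d)) != 0) sb.append(dims[d]);
            out.add(sb.toString());
        }
        return out;
    }
}
```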
3. Fast Cubing
How do we solve this bottleneck? Let me share the new algorithm, Fast Cubing. It reverses the thinking: since aggregating on the reduce side creates heavy network pressure, move the aggregation into the map side, transmit only the aggregated results over the network, and do the final aggregation on the reduce side; the reducers then receive much less data, and the network pressure lightens. Classical multidimensional analysis engines compute in memory, and we use a similar technique: allocate relatively large memory on the map side and spend more CPU doing in-mem cubing, so the effect of the layering happens inside the mapper. The already-aggregated data is then shuffled to the reduce side for the final aggregation. The drawback of this approach is a more complex algorithm that is harder to develop and maintain, but it does reduce network pressure.
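The core idea, aggregation moved into the mapper, looks roughly like the following Hadoop sketch (illustrative only, not Kylin's actual Fast Cubing code; the CSV row layout is an assumption):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper aggregation: fold measures into an in-memory table and emit
// only the aggregated results, so the shuffle carries far less data.
public class InMapperAggMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Map<String, Long> buffer = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context ctx) {
        // Assume CSV rows: dimA,dimB,dimC,dimD,measure
        String[] f = value.toString().split(",");
        String dims = f[0] + "," + f[1] + "," + f[2] + "," + f[3];
        buffer.merge(dims, Long.parseLong(f[4]), Long::sum); // pre-aggregate
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Emit once per distinct key; reducers only do the final merge.
        for (Map.Entry<String, Long> e : buffer.entrySet())
            ctx.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
}
```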
We compared the two algorithms in real production environments and found that Fast Cubing is not always faster. We expect the map-side pre-aggregation to reduce shuffle traffic, but that is not guaranteed, because it depends on the data distribution. For example, suppose the expected result is how much Li Yang spent in total on October 1: it matters whether those purchase records sit in a single data split or are spread across every mapper's split. If the records appear in only one mapper, the aggregated result needs no second aggregation with other mappers, and the network distribution is fast. But if, unfortunately, the transactions are spread evenly across all the mappers, the data is still shipped over the network many times and aggregated a second time in the reducers, so there is little improvement over the earlier layered cubing.
If each mapper's data split is unique, every mapper produces different cube data and nothing is distributed twice, so Fast Cubing genuinely reduces network transmission. Conversely, if every mapper's data is nearly identical, the network pressure remains, so the final MRv2 engine is a hybrid algorithm. It first samples the data, decides from the sample whether keys are unique to a mapper or duplicated across mappers, and then chooses layered cubing or Fast Cubing accordingly. By testing in 500 different production environments, we found this hybrid algorithm about 1.5 times faster than the original MRv1.
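The selection step can be pictured with a small heuristic like the one below (a hypothetical rule for illustration; Kylin's actual sampling criteria may differ):

```java
// Choose a cubing algorithm from a sample of the input: estimate how often
// the same aggregation key shows up in multiple splits.
enum CubingAlgorithm { LAYERED, FAST }

final class AlgorithmChooser {
    /**
     * @param distinctKeysInSample  distinct aggregation keys in the sample
     * @param keySplitPairsInSample distinct (key, split) pairs in the sample
     */
    static CubingAlgorithm choose(long distinctKeysInSample, long keySplitPairsInSample) {
        // If a key appears in many splits, map-side pre-aggregation still
        // needs a second merge in the reducers, so layering loses nothing.
        double overlap = (double) keySplitPairsInSample / distinctKeysInSample;
        return overlap > 2.0 ? CubingAlgorithm.LAYERED   // threshold is illustrative
                             : CubingAlgorithm.FAST;
    }
}
```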
4. Parallel Scan
Parallel scanning is a very intuitive improvement. In previous Kylin versions, the aggregated data was very dense and small, so a SQL query could return without scanning an overly large dataset. But for some complex or slow queries, even after aggregation the data can still run to millions or tens of millions of rows, and a simple serial scan at run time is clearly unsuitable. The fix is to adjust the storage layout and partition the data: the materialized view that used to be scanned on one node is spread evenly over multiple nodes, and the serial scan becomes a parallel scan.
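The idea can be sketched as follows (illustrative only; the Shard interface stands in for whatever region or partition the storage engine exposes):

```java
import java.util.*;
import java.util.concurrent.*;

// Split one large key range into shards on different nodes and scan them
// concurrently, merging the partial results.
public class ParallelScan {
    interface Shard { List<long[]> scan(byte[] startKey, byte[] endKey); }

    static List<long[]> scanAll(List<Shard> shards, byte[] start, byte[] end)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<long[]>>> futures = new ArrayList<>();
            for (Shard s : shards)                    // one scan per shard, in parallel
                futures.add(pool.submit(() -> s.scan(start, end)));
            List<long[]> merged = new ArrayList<>();
            for (Future<List<long[]>> f : futures)    // gather partial results
                merged.addAll(f.get());
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```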
This improvement can make slow queries five to ten times faster, though the gain in practice is smaller, because most Kylin queries were already fairly fast and scan little data. Comparing roughly 10,000 production queries, we found the parallel scanning technique improves speed by about a factor of two.
5. Near real-time
Another feature of Apache Kylin 1.5 is the near-real-time build, a continuation of the earlier incremental build. Like many big data systems, Kylin preprocesses data incrementally: instead of recomputing all historical data every day, it computes only today's data and joins it to the history. So the first step is to divide the whole dataset along the time line: the most distant data gets the largest partitions, perhaps by year; the middle may be by month; the smallest partition is today. To reach near real time, we only need to shrink the daily increment further, from a day to an hour and from an hour to minutes; following this idea, near-real-time cube building can be completed quite smoothly.
This is a case we tried in 1.5: the data source is Kafka and the algorithm is Fast Cubing. The pairing looks perfect, but in practice it produces many cube fragments; for example, each five-minute window of today is a separate dataset and yields a separate cube fragment. As the fragments multiply, query performance drops, because a single query has to hit many fragments, each triggering a scan at the storage layer.
The solution is also very simple: merge the cube fragments, but automatically, with no manual trigger needed. In the new version, users can configure auto-merge: five-minute fragments merge into half an hour, half-hours into four hours, four hours into a day, days into weeks, and weeks into months.
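The tiered merging can be pictured with a sketch like this (hypothetical logic for illustration; in Kylin the thresholds are cube-level configuration, and segments are assumed contiguous and sorted by time):

```java
import java.util.*;

// Whenever adjacent small segments fill the next tier, merge them into one.
public class AutoMerge {
    record Segment(long startMs, long endMs) { long len() { return endMs - startMs; } }

    // Tiers in milliseconds: 30 min, 4 h, 1 day, 1 week, ~1 month.
    static final long[] TIERS = {30L*60*1000, 4L*3600*1000, 24L*3600*1000,
                                 7L*24*3600*1000, 28L*24*3600*1000};

    /** Returns the [start, end) range to merge next, or null if none. */
    static Segment pickMerge(List<Segment> segs) {
        for (long tier : TIERS) {
            long start = -1, acc = 0;
            for (Segment s : segs) {
                if (s.len() >= tier) { start = -1; acc = 0; continue; } // already this big
                if (start < 0) start = s.startMs;
                acc += s.len();
                if (acc >= tier) return new Segment(start, s.endMs);
            }
        }
        return null;
    }
}
```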
If five-minute near real time still does not meet the demand, the design can evolve into a lambda architecture: besides the cube storage, a real-time in-memory store records the last five minutes of data. The cube covers everything up to five minutes ago, the latest five minutes live in memory, and a hybrid query interface hits both the memory engine and the cube storage, so the combined result set is genuinely real-time. Unfortunately, however, this idea has not yet been implemented.
Among the use cases eBay has published there is an SEO Dashboard built on the new near-real-time Kylin cube, which monitors the user traffic that search engines bring in. It watches the user records arriving from Google or Yahoo in real time and monitors traffic fluctuation; as soon as user traffic jitters within a five-minute window, appropriate measures are taken immediately to keep eBay's traffic revenue stable.
6. User-Defined Aggregation Types
Another new feature of 1.5 is User Defined Aggregation Types, i.e., custom aggregation types. Previously Kylin had HyperLogLog (an approximate count-distinct algorithm). On top of this, the new version adds TopN, a community-contributed exact count distinct based on bitmap, and a Raw measure that retains the lowest-level raw records. Users can implement an abstract interface to extend the aggregation functions they want. For example, you could aggregate many user events to extract a user's access pattern, or cluster many point samples and compute the cluster as an aggregated data type, so this customization can extend into many areas.
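The shape of such an extension point can be pictured as follows (the names are hypothetical, not Kylin's exact API):

```java
// A user-defined aggregation folds raw values into a running state and can
// merge partial states built on different nodes.
interface Aggregator<V, S> {
    S init();                 // empty aggregation state
    S add(S state, V value);  // fold one raw value into the state
    S merge(S a, S b);        // combine partial states (e.g. across mappers)
}
```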
TopN uses a very classical algorithm called SpaceSaving, widely used in stream processing. We brought it into Kylin and defined it as a custom aggregate function. The standard SpaceSaving is a single-threaded algorithm, but Kylin uses a parallel variant.
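For reference, here is a minimal single-threaded SpaceSaving sketch (the classical algorithm the text names; simplified, without per-counter error tracking, and not Kylin's parallel implementation):

```java
import java.util.*;

// Keep at most `capacity` counters; when a new item arrives and the table
// is full, evict the minimum counter and inherit its count (an overestimate).
public class SpaceSaving<T> {
    private final int capacity;
    private final Map<T, Long> counters = new HashMap<>();

    public SpaceSaving(int capacity) { this.capacity = capacity; }

    public void offer(T item) {
        Long c = counters.get(item);
        if (c != null) { counters.put(item, c + 1); return; }
        if (counters.size() < capacity) { counters.put(item, 1L); return; }
        // Table full: replace the minimum counter, inheriting its count + 1.
        T minKey = null; long minCount = Long.MAX_VALUE;
        for (Map.Entry<T, Long> e : counters.entrySet())
            if (e.getValue() < minCount) { minKey = e.getKey(); minCount = e.getValue(); }
        counters.remove(minKey);
        counters.put(item, minCount + 1);
    }

    public List<Map.Entry<T, Long>> topN(int n) {
        List<Map.Entry<T, Long>> all = new ArrayList<>(counters.entrySet());
        all.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return all.subList(0, Math.min(n, all.size()));
    }
}
```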
A user's TopN query, say fetching the top 100, is written as an ordinary SQL statement, something like SELECT seller_id, SUM(price) FROM sales GROUP BY seller_id ORDER BY SUM(price) DESC LIMIT 100 (the column names here are illustrative). Kylin automatically matches such SQL to the pre-aggregated results, so at run time it simply returns the pre-computed top 1,000 or 10,000 items; there is almost no online computation, and the query is very fast.
7. Integration of Analysis Tools
In the new version, Kylin also extended its ODBC interface, primarily for integration with Tableau 9, MS Excel, and MS Power BI.
The Zeppelin integration module has also been contributed to the Zeppelin open source community; you can find it in the latest Zeppelin release, and you can also query Kylin data directly from Zeppelin.
Summary
Overall, Apache Kylin 1.5 has several new highlights:
1. Extensible architecture: the new architecture opens Kylin's door to alternative technologies. We can choose a parallel computation engine other than MapReduce, such as Spark, choose a different data source, or even different storage. This ensures Kylin can evolve together with other parallel computing and big data technologies instead of being locked to one platform.
2. The new cube engine: with the new Fast Cubing algorithm, build speed improves to about 1.5 times the original.
3. Parallel scanning: the storage-structure improvement makes queries about twice as fast.
4. Near-real-time analysis: still at the testing stage, but you are welcome to try it and report any problems to the community promptly.
5. User-defined aggregation types: this part should have a lot of room for future development.
6. More analysis tools are integrated.
That is what I wanted to share with you. Kylin is an open source product, so everyone who is interested is welcome to use it and interact with us in the community; whatever problems you run into, our community will be very happy to help you solve them.