kylin-Practice OLAP

Source: Internet
Author: User
Tags ldap hadoop ecosystem olap cube

The history and basic concepts of OLAP

OLAP is all called online analytical applications, and is a solution for multidimensional data analysis queries. Typical OLAP applications include sales, marketing, management and other business statements, budget accounts, economic statements, and so on.

The earliest OLAP query tool was the Express published in 1970, but the complete OLAP concept was presented in 1993 by the parent of the relational database edgarf.codd, accompanied by the famous "Twelvelaws of online analytical Processing ". 1998 Microsoft released Microsoftanalysis Services and introduced the MDX query language through the OLE DB for OLAP API in the early years, and 2001 Microsoft and Hyperion published XML foranalysis Become the de facto standard for OLAP queries. Today, MDX has become a SQL-matched OLAP query language, supported by various OLAP vendors.

Olapcube is a typical multidimensional data analysis technique in which the cube itself can be thought of as a dataset of different dimension data, an OLAP Cube can have multiple dimensions (Dimension), and multiple facts (Factor Measure). The user uses OLAP tools to perform multidimensional analysis of data from multiple angles. OLAP is generally considered to include three basic analytical operations: roll Up (Rollup), Drill Down (drilldown), Slice cut (Slicingand dicing), and the original data is aggregated and organized into a view of one or more dimensions.

ROLAP and MOLAP

Traditional OLAP is divided into ROLAP (relational OLAP) and MOLAP (MULTI-DIMENSIONOLAP) based on the different data storage methods

ROLAP is stored as a relational model for multidimensional analysis of data, the advantage is that the storage volume is small, the query method is flexible, but the shortcomings are obvious, each query needs to aggregate data, in order to improve the short board, ROLAP using the column, parallel query, query optimization, bitmap indexing technology

MOLAP physically stores the data of the analysis as a multidimensional array, which forms the cube structure. The attribute value of the dimension is mapped to the subscript or subscript range of a multidimensional array, the fact that the value of the multidimensional array is stored in the array cell, the advantage is that the query is fast, the disadvantage is that the data volume is not easy to control, the problem of dimension explosion may occur.

The challenge of OLAP in the Big Data era

In the last 20 years, the ROLAP technology with the development of MPP parallel database technology, especially with the support of the storage technology, realized the big leap of analysis ability, along with the further reduction of memory cost, the expansion of single node memory, the query performance of cluster single node leap, The usefulness of the in-memory database has stepped up to a new level, and the result of these technological advancements is that similar technologies basically cover terabytes of data analysis needs. The advent of Hadoop and related big Data technologies provides a near-limitless data platform, with the support of relevant technologies, and the data of each application has broken through the upper bound of capacity that traditional OLAP can support. Tens of hundreds of millions of of data per day, provide a number of dimensions of the analysis model, Big data OLAP is the most urgent to solve the problem is a large number of real-time operations caused by the response time lag.

2. Apache Kylin OLAP solution under Big Data

Apache Kylin Background

Apache Kylin is a MOLAP system under the Hadoop ecosystem and is a distributed OLAP analysis engine that supports TB-to-PB-level data development and open source from the ebay Big Data division from 2014 onwards. The features include:

    • Scalable, ultra-fast OLAP engine
    • Ansi-sql interface Available
    • Interactive query Capabilities
    • The concept of MOLAP Cube
    • Seamless integration with BI tools

Typical application scenarios for Apache Kylin are as follows:

    • User data exists in Hadoop HDFs, using hive to access HDFs file data as relational data with a large amount of data over 500G
    • Up to a few g or even dozens of g of data incremental imports per day
    • There are 10 or so fixed analysis dimensions

Apachekylin's core idea is to use space to change time, because of the query aspect has developed a variety of flexible strategies to further improve the space utilization, making such a balance strategy in the application is worth using.

The overall architecture of Apache Kylin

Apache Kylin, as an OLAP engine, completes a series of tasks, such as fetching data from a data source, ETL to its own storage engine, providing rest services, and its architecture:

Apache Kylin's biosphere includes:

    • The framework, queries, tasks, and storage engines of the Kylin Core:kylin engine are all focused on this, in addition to a rest server to respond to various client requests.
    • Extensions: Various plugins that provide additional features, such as security authentication, SSO, etc.
    • Integrity components: Job Manager, ETL, monitoring, and alerting
    • Interactive interface: User interface based on Kylin core
    • Drive: Provides a way to connect JDBC and ODBC

The computation of multidimensional data of Apache Kylin cube

The multidimensional computation of Apache Kylin is mainly embodied in the calculation of Olapcube. The cube is composed of multiple cuboid, and the data on the cuboid is the data aggregated by the raw data, so creating a cube can be thought of as a process of precomputed preprocessing when the raw data is imported. The power of Kylin is to take advantage of Hadoop's mapreduce parallel processing capabilities to efficiently process imported data.

Apache Kylin data comes from Hive, and as a hive accelerator hopefully the final query SQL is similar to querying directly on hive. So Kylin needed to get the hive table metadata from hive when building the cube. Although there is the process of building a cube, it is not intended to expose the existence of cube to ordinary query users.

Apache Kylin The process of creating a cube is as follows:

    1. Using hive to create a wide table based on the fact table and dimension table defined by cube
    2. Extracts the distinct value of a dimension on a fact table, encodes the dimension on the fact table into a directory as a dictionary tree, encodes the dimension table as a dictionary tree
    3. Using MapReduce from the first step to get the wide table file as input, create N-dimension cuboid, and then each time based on the results of the previous step generated N-1 cuboid, N-2 cuboid ... 0-cuboid
    4. Calculates Htable's region segmentation strategy based on the amount of generated cuboid data, creates htable, imports hfile into it

Apache Kylin, like traditional OLAP, cannot cope with the data update (updating the data will invalidate the cube and need to rebuild the entire cube). Kylin uses an incremental cubing technique to respond quickly to incremental data that is fixed over a daily, even every two hours.

Apache Kylin's cube can be divided into multiple segment depending on the time period. After the first build of cube, there will be a segment, and a new segment will be generated after each incremental build. Incremental cubing relies on existing cubesegments and incremental raw data. The steps for incremental cubing are similar to the steps for creating a cube, and segment are distinguished by time periods.

Incremental cubing need to face a smaller amount of raw data, so the incremental cubing speed is very fast. However, as the number of cubesegments increases, the query will be affected to a certain extent, so after the number of segments to a certain number of cubesegments merging operations may be required, In fact Mergecube is the synthesis of a new large cubesegment to replace, the merge operation is an asynchronous online operation, will not affect the front-end query business.

The merge operation steps are as follows:

    1. Traverse the specified cube Segment
    2. Merging dimension dictionary catalogs and dimension table snapshots
    3. Using MapReduce to merge their n-dimension cuboid
    4. Convert cuboid to hfile, generate new htable, replace multiple htable

Apache Kylin improvements to traditional MOLAP

Calculating the storage cost of cube and calculating cost are relatively large, and the problem of the dimension explosion of traditional OLAP is kylin also encountered. Kylin provides users with some optimization measures that can, to a certain extent, reduce the problem of dimension explosions:

    1. Cube Optimization:
    • Hierachy Dimension
    • Derived Dimension
    • Aggregation Group

Hierachy Dimension, a series of hierarchical relations Dimension composed of a hierachy, such as the year, the month, the day composed of a hierachy, in the cube, if not set hierarchy, will be years, months, days, months, year, month, day 6 cuboid, but after setting up the hierarchy, cuboid added a constraint, hoping that the low level dimension must accompany the dimension of the height. The Hierachy Dimension is set to halve the dimension combinations that need to be computed.

Derived Dimension, if you have more than one dimension on a dimension table, you can set it to Derived Dimension, which is replaced by the primary key of the dimension table in Kylin, so as to reduce the number of dimension combinations. Of course, to a certain extent derived Dimension will reduce the efficiency of the query, when the query, Kylin using the primary key of the dimension table to aggregate, and then through the primary key and the true Dimension column mapping relationship to do a conversion, in the kylin inside the result set once aggregated and returned to the user

Aggregation Group, which is a means of grouping dimensions in order to reduce the number of dimension combinations. The number of cuboid between the dimensions of different groupings will be greatly reduced, the dimension combination from 2 (k+m+n) power can be reduced to 2 of the K Power plus 2 m power plus 2 n power. The group's optimization measures are closely dependent on query SQL, which can be said to be a custom optimization for queries. If the dimension of the query is a kua group, then Kylin needs to aggregate the desired query results from the n-cuboid at a significant cost, which requires the cube builder to carefully weigh in modeling.

    1. Data compression:

Apache Kylin uses a special compression algorithm for dimension dictionaries and dimension table snapshots, which leverages Hadoop's lzo or snappy for aggregated computational data in hbase, ensuring that data stored in hbase and in-memory is as small as possible. The compression of dimension dictionaries and dimension table snapshots takes into account that there are very many duplicate dimension member values in Datacube, the most straightforward way is to map dimension values to IDs using the data dictionary and encode the dimension values using the Trie tree in Kylin

    1. Distinct Count aggregation query optimizations:

Apache Kylin uses the Hypeloglog method to calculate the DistinctCount. The advantage is that the speed is fast, the disadvantage is that the result is an approximate value, there will be some error. It is generally acceptable to apply the statistical error of DistinctCount in non-billable and other common scenarios.

The specific algorithm can be seen paper, this article will not repeat:

Http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

Apache Kylin SQL query implementation

ANSI SQL query is a very obvious advantage of Apache Kylin. Kylin's SQL parsing relies on another open source data management framework Apachecalcite, calcite, the previous Optiq, is a database without a storage module, that is, an algorithm that does not manage data storage, does not contain processing, and does not contain meta-information storage. It is therefore well suited to make an application to the middle tier between the storage engines. On the basis of calcite, a dedicated adapter (Adapter) is written for the storage engine to form a feature-rich "class database" that supports DML or even DDL.

Kylin completed a custom adapter, which is queried by Kylin to define the execution rules for each node of the syntax tree after calcite completes the SQL parsing and forms the syntax tree (AST). Calcite in the traversal of the syntax tree node into a kylin to describe the query model digest, Kylin will be digest to determine whether there is a matching cube. If there is a cube that matches the query, select a cube with the least cost of the query (the query cost calculation for Kylincube is currently an open interface, which can be calculated based on the number of dimensions, depending on the size of the data)

Kylin the current multidimensional data storage engine is HBase, Kylin takes advantage of HBase's coprocessor mechanism to complete partial aggregation and all filtering operations in HBase regionserver, which is calculated in advance of Hbasescan, Accelerate Kylin SQL queries with the compute power of hbase multiple region servers. At present, Kylin still have some query syntax is not supported, especially the filter where part of the constraints are more, there are certain requirements for SQL, but if targeted to the coprocessor part of the transformation believe that the SQL compatibility can be significantly improved.

Apache Kylin and Rtolap

Apachekylin can be said to be with the market popular Presto, Sparksql, Impala and other directly on the original data query system (for the moment to Rtolap) took a completely different road. The former is optimized in terms of how to obtain the predicted results quickly, and to optimize the query parsing so that more queries can be used in the prediction results. Subsequent versions of Kylin will improve the precomputed engine, optimize the estimated speed, and enable Kylin to become an approximate real-time analysis engine. And such as Presto,sparksql is focused on the process of optimizing the query data, like some other data warehouse, the use of column storage, compression, parallel query technology, optimization of the query. The benefit of this approach is that it is scalable and adaptable to a wider range of queries. However, in the query speed, it can be said that Apache Kylin than ROLAP at least an order of magnitude, so the query response time requirements for a higher application, Apachekylin is the best choice.

3. Apache Kylin in NetEase

Kylin of Service

At NetEase, Apache Kylin, as the OLAP query module of the big Data platform, can provide services for the company's various analytical needs and applications. All data exists on Hadoop Hive and can be accelerated by the Kylin OLAP engine. Within the company Kylin as a unified platform, with the product data Warehouse to connect.

The current deployment architecture for Kylin is as follows:

The Kylin cluster consists of multiple query nodes and control nodes. Control node unique, responsible for cluster projects, task scheduling and cube additions and deletions to check and change. Many query nodes before using Nginx to do load balancing, the latter node can be expanded on demand level. The frontend can support both JDBC and ODBC client queries.

Kylin Performance

Before Kylin on-line, we selected some of the company's original report business performance comparison, compared with the same data, Kylin query and Mondrian combined with Oracle query comparison.

The test results are compared by a datastream report with a large amount of data:

Then look at the throughput of the Kylin, using Haproxy for the performance of the increased throughput of the Kylin server after request forwarding:

On the improvement of Kylin by NetEase

Native Community Edition Aapche Kylin is required to be deployed on a unified bottom-level Hadoop, Hive, hbase cluster. NetEase internal Big Data platform for a variety of reasons, divided into multiple Hadoop clusters, the application will be in different Hadoop clusters to establish a hive data warehouse. The most primitive and natural idea is to deploy a set of Kylin services on every Hadoop environment to meet different needs, but the complexity of cluster resource management, computational resource scheduling and management operations is a prominent problem. For example, the user data is on the hive of a room, and the Hadoop cluster in a room does not have enough computing resources to ensure the efficient operation of the kylinolap. Therefore, according to the company's actual large data platform distribution and computer room construction situation, will kylin to create a unified service platform within the company is a better choice. The OLAP team developed the open source version of Kylin two times and submitted the improved patches to the community and received positive feedback.

The current improvements mainly include:

    • Kylin Support for Kerberos authentication
    • Kylin deployment support for non-Hadoop nodes
    • Support for multiple data sources

In the company, due to performance and security considerations, different departments of the application will build their own hive for data analysis, and because the company does not have a cross-room Hadoop cluster, so there will be user data in a place hive, The Hadoop cluster in a room does not have enough computing resources to keep the kylinolap running efficiently.

After a comprehensive analysis of realistic scenarios, we chose the largest Hadoop cluster in the company as a cluster of compute engines for kylinolap, guaranteeing ample storage and computing resources. HBase uses a separate cluster to avoid interaction between hbase queries and Hadoop cluster tasks. The data source hive allows user customization and currently supports the use of KYLINOLAP services with different hive nodes under different hive and different Hadoop clusters under the Hadoop cluster. According to the actual configuration of the user Data warehouse may appear across the cluster of data source extraction calculation, because the company with the city computer room has a dedicated network, Data Warehouse hive in the amount of source data is much smaller than kylin actual aggregated data storage (in HBase, The size of the data is typically more than 10 times times that of the data source hive, so this overhead can be thought of as less of an impact and is proven in our testing.

Kylin OLAP with mammoths and countless combinations

Mammoth is NetEase internal unified Big Data portal platform, in order to let Kylin faster and better integration into the big platform, OLAP team has planned to complete with mammoth Big data platform in the near future to get through and integration, Kylin OLAP will be deep embedded in mammoth, The user can complete the simplified management of KYLINOLAP based on mammoth platform. Mammoth platform docking Control node, as a professional data Modeler operation Portal

    • Kylin will take advantage of Mammoth's user management capabilities
    • Mammoth will take over the creation of user projects and the management of cubes
    • Mammoth will be the original hive data source completely and Kylin, easy to kylin manage user's data source

Kylin native user management is based on LDAP, if you do not use the LDAP Service need to use springsecurity re-development set, NetEase's internal mammoth data platform has a mature and complete user access control system, so can use the existing mechanism of access to Kylin and modify the protective restrictions.

Kylin's data cube modeling, especially some advanced cube optimization functions such as Rowkey order, dimension grouping, layering and other needs higher learning costs, so it is not suitable for the general data analysts to directly operate, we designed a simplified version of the cube modeling process to the user application- Operation and approval of the way to access data.

There are many of NetEase's internal important report analysis platform, there are several kylinolap as a separate data source to support. Existing and potential hive query customers can easily migrate reports to Kylinolap, making it possible to analyze interactive reports under large data volumes.

    • You can create a report based on a cube created on a mammoth.
    • Several dimensions and measures that proactively identify Kylin cube definitions
    • The user is free to work within the scope of Kylin OLAP to complete the editing and querying of the report.

Kylin query results with a number of combinations can be presented to data analysts in more and richer charts:

kylin-Practice OLAP

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.