OLAP Engine Kylin: An Introduction

Source: Internet
Author: User
Tags: database sharding

Kylin is an OLAP system originally developed at eBay. Unlike Mondrian, Kylin is a MOLAP system mainly used to support data analysis in the big data ecosystem: it pre-computes user-defined multi-dimensional cubes and caches them in HBase (currently the only supported storage). Having used both Mondrian and Kylin, we can see that the two systems make opposite trade-offs between time and space. Mondrian is a ROLAP system: every query is answered by querying the database in real time, with no pre-computation, so it saves a great deal of storage space (although it does cache query results, currently in program memory, which can easily lead to OOM). Kylin, on the other hand, pre-computes and caches every result that might be queried, which requires a large amount of storage, often 10+ times the original data volume.

The data we want to analyze generally lives in relational databases (MySQL, Oracle, and so on; business data written by applications, possibly requiring table sharding or even database sharding), on HDFS (structured data, usually business log information, queried through Hive), in text files, in Excel, and so on. Kylin is mainly used to pre-compute Hive data, implemented on Hadoop's MapReduce framework. Mondrian can in theory sit on top of any data source with a SQL interface. Because relational databases generally have indexes, Mondrian's query performance against them is acceptable: with our current Oracle database, tens of millions of records can be queried in minutes. Querying a Hive data source this way, however, is far too slow.
System Architecture

As a result, we started trying Kylin to meet the TB-level data analysis needs of our big data systems. Pre-computing data that already sits in a relational database is usually inappropriate, since indexes keep queries reasonably fast there even at large volumes, unless the SQL is poorly written. In the process of using Kylin we gradually built up an understanding of it. First, let's look at Kylin's system architecture:
(Figure: Kylin system architecture)

Kylin consists of the following parts:
· REST Server: provides RESTful interfaces for creating, building, refreshing, and merging cubes; for metadata management of projects, tables, and cubes; for user access permissions; and for dynamic modification of system configuration. SQL queries can also be issued through this interface. These interfaces can be called by third-party programs and are also used by the Kylin web UI.
· JDBC/ODBC interface: Kylin provides a JDBC driver; the driver class name is org.apache.kylin.jdbc.Driver and the URL prefix is jdbc:kylin:. A query issued through the JDBC interface goes through the same path as one issued through the RESTful interface. These interfaces also make Kylin compatible with Tableau and even Mondrian.
· Query engine: Kylin uses the open-source Calcite framework to parse SQL; it serves as the SQL engine layer.
· Routing: this module converts the execution plan produced by SQL parsing into queries against the cached cube. Because the cube is pre-computed and cached in HBase, this part of a query can complete in seconds or even milliseconds. Some operations instead query the raw data (stored on Hadoop, queried through Hive), which has high latency.
· Metadata: Kylin maintains a large amount of metadata, including cube definitions, star schema definitions, job information, job output, and dimension dictionary information. Both metadata and cubes are stored in HBase, as JSON strings; metadata can also be stored in the local file system.
· Cube build engine: this module is the foundation of all the others. It is responsible for pre-computing and creating cubes: the build process reads raw data through Hive, generates an HTable via MapReduce jobs, and loads it into HBase.
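As a concrete illustration of the REST path described above, here is a hedged sketch of how a client might assemble a query request for Kylin's query endpoint (POST /kylin/api/query with HTTP Basic auth). The host, credentials, table, and project names are placeholders, not values from this article; adjust them for your deployment.

```python
import base64
import json

def build_query_request(sql, project, host="http://kylin-host:7070",
                        user="ADMIN", password="KYLIN", limit=100):
    # Assemble URL, auth header, and JSON body for Kylin's REST query API.
    url = host + "/kylin/api/query"
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    headers = {
        "Authorization": "Basic " + token,
        "Content-Type": "application/json",
    }
    body = json.dumps({"sql": sql, "project": project, "limit": limit})
    return url, headers, body

# The same SQL could equally be sent through the JDBC driver
# (org.apache.kylin.jdbc.Driver, URL prefix jdbc:kylin:).
url, headers, body = build_query_request(
    "select part_dt, sum(price) from kylin_sales group by part_dt",
    "learn_kylin")
```

The actual HTTP call (e.g. via urllib or requests) is omitted; the point is that a REST query and a JDBC query carry the same SQL through the same server-side path.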
Key Processes

In Kylin, the two most critical processes are the pre-computation of the cube and the conversion of SQL queries into cube lookups. Cube construction can be divided into building a cube and merging cubes. First you must create a cube definition, which includes the cube name, the star schema structure of the cube, the dimension information, the measure information, a where condition, the incremental cube settings based on the partition column defined on the Hive fact table, and the rowkey settings. Most of these settings also exist in Mondrian: a cube contains dimensions and measures, and the where condition, which determines the size of the source data, can be implemented through a view in Mondrian. In addition, Kylin provides incremental computation; although it cannot meet real-time requirements, it is basically sufficient for data analysis needs.

Query parsing mainly uses the Calcite framework to parse the user's SQL and convert it into key-value lookups against HBase to obtain the results. However, we found its SQL support relatively poor. Joins cannot be written in the form "from A, B where ..."; the "inner (left, right) join ... on" form must be used, or parsing fails. This means the SQL generated by Mondrian cannot be run through Kylin at all, because Mondrian generates joins in the former style. Another limitation is that you can only query the tables and columns related to the cube: for example, you can group by the defined dimensions and query the defined measures, and anything else returns an error. This is not a big problem either; after all, the cube model has been defined, and we are unlikely to query anything outside it. What is harder to accept is that Kylin's support for subqueries is poor.
Testing shows that such queries often return NULL (no rows) even though the same query returns results in Hive, which is hard to explain to product teams. For example, the product side may need the total sales per month in regions whose annual sales exceed some threshold. We would normally write SQL like:

select month, sum(sales) from fact where location in (select location from fact group by location, year having sum(sales) > 1000) group by month;

Some time ago we found this SQL is a disaster for relational databases, because the in clause prevents the subquery result from being cached. Rewriting it as

select month, sum(sales) from fact as A inner join (select location from fact group by location, year having sum(sales) > 1000) as B on A.location = B.location group by month;

improves performance there, but testing shows that Kylin returns null for this form, while Kylin handles plain in queries very efficiently (after all, everything is cached). So we have to execute the subquery first to obtain the set of locations, and then issue a second SQL statement with "where location in (...)" (Kylin supports the in clause quite well) to obtain the result. This is something that should be improved.
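The two-step workaround described above can be sketched as follows. The table, column names, and run_query client are placeholders standing in for whatever REST/JDBC access you use; the stand-in location list simulates the inner query's result.

```python
def build_outer_sql(locations):
    # Splice the materialized subquery result into an IN list,
    # since Kylin handles IN clauses well but not this subquery form.
    in_list = ", ".join("'%s'" % loc for loc in locations)
    return ("select month, sum(sales) from fact "
            "where location in (%s) group by month" % in_list)

inner_sql = ("select location from fact "
             "group by location, year having sum(sales) > 1000")

# Step 1: locations = [row[0] for row in run_query(inner_sql)]
locations = ["Shanghai", "Beijing"]   # stand-in for the inner result
# Step 2: issue the outer query with the IN list spliced in
outer_sql = build_outer_sql(locations)
```

In production you would also need to quote or escape the location values properly rather than interpolate them directly.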
Cube Model

The preceding section listed the settings required when creating a cube; here we look at what each setting means. First we set the cube name and the notification list: the former must be globally unique, and the latter is a set of email addresses to be notified when certain cube events occur. Next we define the star schema. As in a typical data warehouse model, we specify one fact table and any number of dimension tables; if there are dimension tables, we must also specify how the fact table joins to each of them.

The next step is to define the dimensions. When defining a dimension you can choose its type: Normal, Hierarchy, or Derived (introduced below). The dimension definitions determine the size of the cube, so you need a good understanding of the original tables. Then we define the measures. Kylin creates a measure with the aggregate function count(1) for every cube, which does not need to be bound to any column; custom measures can use SUM, COUNT, DISTINCT COUNT, MIN, and MAX. For each measure you also choose the aggregate function's parameter, either a constant or a column of the fact table; generally it is a column. Note that Kylin does not provide more complex aggregate functions such as AVG (let alone variance or mean deviation), mainly because it must incrementally compute cached cubes and merge them into new cubes, and for these complex functions the merged value cannot be computed from the two cached values alone. For example, if the sums of a key in the two cubes being merged are A and B, the merged sum is simply A + B; but if the function were AVG, we would need to know the count and sum of the key before aggregation. Users therefore have to find a way to compute such measures themselves.
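The mergeability argument above can be made concrete with a tiny sketch: SUM and COUNT of two segments merge by simple addition, and AVG is then derived on top, whereas two precomputed AVG values alone cannot be merged correctly.

```python
def merge(a, b):
    # a, b are (sum, count) pairs precomputed for the same key
    return (a[0] + b[0], a[1] + b[1])

def avg(pair):
    s, c = pair
    return s / c

seg1 = (300.0, 3)          # sum=300 over 3 rows, avg = 100.0
seg2 = (50.0, 1)           # sum=50 over 1 row,  avg = 50.0
merged = merge(seg1, seg2) # (350.0, 4)
# avg(merged) == 87.5, while naively averaging the two AVGs
# would give (100 + 50) / 2 == 75 -- wrong, hence no AVG measure
```

This is exactly why the user is asked to define SUM and COUNT measures and compute the average at query time.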
After defining the measures, we set the where condition. This step filters the raw data: for example, if regions whose sales fall below some threshold are out of scope for the analysis, you can add "location in (subquery)" to the where condition, and the generated cube filters out those locations, effectively cleaning out invalid data. In Kylin, however, this condition is fixed and cannot easily be changed. If today I want to exclude regions with low sales, and tomorrow I want to exclude users whose annual spending is below some threshold, I have to create a separate cube each time, differing only in the where condition. The cubes will share a lot of duplicate cached data, wasting storage space, but this is unavoidable in a MOLAP system. So when filter conditions change frequently, a better approach is to build one complete cube (with no where condition) and filter out the unwanted dimension members with subqueries at query time.

The next step is to set the incremental cube information. First choose a time-typed partition column of the fact table (it appears you can only partition by day), then specify the time range (start time and end time) of the cube to build. This range is used as the where condition on the raw-data query, ensuring that the cube built this time contains only data within the interval. If the fact table has no time-typed column, or no partition column is selected, the data cannot be updated dynamically and cubes cannot be built incrementally.

The last step is to set the rowkey. We recommend not modifying the rowkey settings unless you have a deep understanding of Kylin's internals. One thing you can usefully change here is the mandatory setting: if a dimension appears in every query, marking it as mandatory saves a lot of storage space. You can also divide the dimensions into groups; dimensions that are never queried together can be put into different groups, which also reduces storage.
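The incremental-build setting described above effectively turns the segment's time range into a filter on the Hive source scan. A hedged sketch, with illustrative column and table names (the exact inclusivity of the bounds depends on the Kylin version; a half-open interval is shown here):

```python
def segment_filter(partition_col, start_date, end_date):
    # Build the WHERE condition that restricts the raw-data scan
    # to this segment's time range.
    return "%s >= '%s' and %s < '%s'" % (
        partition_col, start_date, partition_col, end_date)

where = segment_filter("part_dt", "2015-01-01", "2015-02-01")
sql = "select * from fact_table where " + where
```

Each incremental build would issue such a scan for its own [start, end) range, producing one segment per build.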

Dimension Introduction

In a cube, the number of dimensions determines the number of possible dimension combinations, and the size of each dimension's member set determines the number of members in each combination. For example, with three normal dimensions A, B, and C whose distinct member counts are 10, 100, and 1000 respectively, there are 2^3 = 8 possible combinations: {null, A, B, C, AB, BC, AC, ABC}. Each combination of dimensions is called a cuboid, and the member counts of these eight cuboids are 1, 10, 100, 1000, 10*100, 100*1000, 10*1000, and 10*100*1000 respectively. The number of distinct members in a dimension is called its cardinality, and we should try to avoid storing cuboids over dimensions with high cardinality. In the example above we could choose not to cache the BC and C cuboids, and instead compute any member value of BC or C from the members of ABC; this is again a trade-off between time and space. Note, however, that Kylin's special dimension types are used to reduce the number of cuboids, not to control whether each individual cuboid is cached: currently Kylin computes and stores every cuboid of the defined combination space. With ordinary dimensions only, as the example shows, N dimensions produce 2^N cuboids, and Kylin's special dimension settings exist to shrink that number. Of course, this requires a good understanding of which dimensions you need and what you are likely to group by. Let's look at Kylin's three special dimension types and what they do (reference: http://www.slideshare.net/YangLi43/design-cube-in-apache-kylin).

1. Mandatory dimension: this dimension appears in the group by of every query.
Setting a dimension as mandatory halves the number of cuboids. For example, if we confirm that every group by will include dimension A, we can drop all the cuboids that do not contain A.

2. Hierarchy dimension: this is the most common special type, especially in Mondrian. Multi-dimensional analysis often involves operations such as drill-up and drill-down, which require hierarchical relationships between dimensions: country/province/city, year/quarter/month, and so on. Hierarchy dimensions can also greatly reduce the number of cuboids. For a hierarchy A/B/C, where A is year, B is quarter, and C is month, only the hierarchy prefixes are allowed as groupings: {year}, {year, quarter}, and {year, quarter, month}. In other words, quarter and month cannot be aggregated on their own; you cannot write "group by month" alone in a query, only "group by year, quart, month". If you do need to aggregate month by itself, you must define a separate normal dimension on the month column.

3. Derived dimension: this type marks dimensions whose values can be deduced. The column or columns of such a dimension map one-to-one to the primary key of a dimension table, and this can also greatly reduce the number of cuboids. For example, suppose time_id is the primary key of the time dimension table (a lookup table joined from the fact table), and the time is only accurate to the day. Then each combination of the year, month, and day columns corresponds to one time_id, and time_id is the foreign key in the fact table, so we can declare year, month, and day as a derived dimension. In actual storage the dimension combination is determined from the time_id value alone, but this requires that a query always specify all columns of the derived dimension set together in its group by.
Finally, let's briefly work through counting cuboids. Suppose we have two normal dimensions, brand and product; one hierarchy of four levels, year, quart, month, and day; and one derived dimension carrying the location information, with columns country, province, and city. That is nine columns in total, but by the analysis above we do not need 2^9 = 512 cuboids:

Layer 0 (no dimensions, no group by): 1 cuboid, whose member count is also 1.
Layer 1 (one dimension): 4 cuboids: {brand}, {product}, {year}, {location} (quart is the second level of the hierarchy and cannot be grouped on its own, and the three location columns count as a single unit); the member counts are the cardinalities of the respective dimensions.
Layer 2: 7 cuboids: {brand, product}, {brand, year}, {brand, location}, {product, year}, {product, location}, {year, location}, {year, quart}.
Layer 3: 8 cuboids: {brand, product, year}, {brand, product, location}, {brand, year, location}, {product, year, location}, {brand, year, quart}, {product, year, quart}, {location, year, quart}, {year, quart, month}.
Layer 4: 8 cuboids: {brand, product, year, location}, {brand, product, year, quart}, {brand, location, year, quart}, {product, location, year, quart}, {brand, year, quart, month}, {product, year, quart, month}, {location, year, quart, month}, {year, quart, month, day}.
Layer 5: 7 cuboids: {brand, product, year, quart, location}, {brand, product, year, quart, month}, {brand, location, year, quart, month}, {product, location, year, quart, month}, {brand, year, quart, month, day}, {product, year, quart, month, day}, {location, year, quart, month, day}.
Layer 6: 4 cuboids: {brand, product, year, quart, month, location}, {brand, product, year, quart, month, day}, {brand, location, year, quart, month, day}, {product, location, year, quart, month, day}.
Layer 7: 1 cuboid: {brand, product, year, quart, month, day, location}.

That makes 40 cuboids in total (Kylin reports 39, presumably excluding the empty layer-0 cuboid).
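The enumeration above has a closed form: each normal dimension contributes a factor of 2 (present or absent), an n-level hierarchy contributes n + 1 (nothing, or one of its n prefixes), and a derived-dimension group acts as a single on/off unit. A small sketch:

```python
def count_cuboids(normal, hierarchies, derived_groups):
    # normal: number of normal dimensions
    # hierarchies: list of hierarchy depths, e.g. [4] for year/quart/month/day
    # derived_groups: number of derived-dimension groups (each on/off as a unit)
    total = 2 ** normal
    for levels in hierarchies:
        total *= levels + 1
    total *= 2 ** derived_groups
    return total

# The example above: brand + product (normal), one 4-level hierarchy,
# one derived group (country/province/city)
n = count_cuboids(normal=2, hierarchies=[4], derived_groups=1)
# n == 40; excluding the empty cuboid gives the 39 that Kylin reports
```

Mandatory dimensions would multiply the total by 1 instead of 2, which is exactly the halving described earlier.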
Incremental cube

Because Kylin's core is pre-computation of cached data, its support for real-time queries is not as good as Mondrian's. In practice, though, our data analysis is rarely fully real-time; a data delay of several hours or even a day is acceptable, and Kylin provides an incremental cube interface for this. In Kylin's implementation, a cube (here meaning the logical cube) can contain multiple segments; each segment corresponds to a physical cube, stored as one HBase table. The user defines the increment on a field (currently only a time field is supported, and it must be a Hive partition column): when defining the cube, you can designate a time partition column as the field the incremental cube depends on. This choice effectively becomes the raw-data selection condition; for example, if you select the data from time A to time B, the cube built will contain only the aggregated values for that period. After a cube has been built, you can build it again on top of the previous one; each build produces a new segment, built the same way but over different raw data (according to the time range specified for that build). Each query aggregates values across all segments before returning, similar to HBase's tablet storage model. When there are too many segments, however, query efficiency drops, so segments need to be merged. A merge specifies a time interval; all segments falling inside that interval are selected and merged, and afterwards the new segment replaces the merged ones. The merge runs very quickly, since it does not need to read data from HDFS: it directly aggregates rows with the same key from the two HBase tables. Note, however, that the HBase tables backing the merged segments are not deleted after a merge.
In actual use, you can schedule a job to build the incremental cube every morning, and run a merge once a certain number of segments accumulates (running a merge after every build is also possible).
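The merge step above can be modeled in a few lines: each segment maps rowkeys to aggregated measure values (SUM-style here), and merging simply re-aggregates rows that share a key across the two HBase tables. The keys and values are illustrative.

```python
def merge_segments(seg_a, seg_b):
    # Combine two segments' key -> aggregate maps; rows sharing a key
    # are re-aggregated, others pass through unchanged.
    merged = dict(seg_a)
    for key, value in seg_b.items():
        merged[key] = merged.get(key, 0) + value
    return merged

jan = {("Li-Ning", "shoe"): 120, ("Anta", "shoe"): 80}  # segment [Jan, Feb)
feb = {("Li-Ning", "shoe"): 30}                          # segment [Feb, Mar)
q1 = merge_segments(jan, feb)  # new segment replaces the two inputs
```

This also shows why only additively mergeable measures (SUM, COUNT, MIN, MAX) fit this model, echoing the earlier AVG discussion.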
Cube dictionary tree

Kylin stores cube data in HBase as key-value pairs. The key is the combination of dimension member values, and the key structure differs from cuboid to cuboid. For example, under cuboid = {brand, product, year}, one key might be brand='Li-Ning', product='shoe', year=2015, which could be written as Li-Ning:shoe:2015. Stored this way, however, there would be a great deal of repetition, so the usual technique is to collect all of a dimension's members into an array and compose the key from array indexes instead, which greatly reduces key storage. Kylin uses the same idea, but with a dictionary tree (trie): each dimension's dictionary tree is stored in binary form in HBase as part of the cube's metadata, and a copy is maintained in memory at all times.
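A minimal dictionary-encoding sketch of the idea above, using a plain sorted array of members rather than Kylin's actual trie; the member values are illustrative.

```python
class Dictionary:
    # Assign each distinct member a small integer id; rowkeys then
    # store ids instead of the raw strings.
    def __init__(self, members):
        self.id_of = {m: i for i, m in enumerate(sorted(members))}
        self.member_of = {i: m for m, i in self.id_of.items()}

brand_dict = Dictionary(["Li-Ning", "Anta", "Nike"])
product_dict = Dictionary(["shoe", "shirt"])

# "Li-Ning:shoe:2015" becomes a compact tuple of small integers
key = (brand_dict.id_of["Li-Ning"], product_dict.id_of["shoe"], 2015)
```

A trie gives the same id/value mapping with better memory behavior for large, prefix-heavy member sets, which is why Kylin keeps one trie per dimension in metadata and in memory.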
The above summarizes Kylin's overall framework and the workflow of several of its modules. Since we have mainly been working with cube operations such as creating, building, and merging, I do not yet know much about the query side, although it is of course one of Kylin's cores. Next we will look at how Kylin builds and merges cubes from a source-code perspective, and then at the query process.

Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.

