OLAP Engine--kylin Introduction

Source: Internet
Author: User
Tags sql using

Kylin is a set of OLAP systems developed by ebay, and unlike Mondrian, it is a MOLAP system that is primarily used to support data analytics businesses in the big data ecosystem, It mainly uses the Precomputed method to cache the user-defined multidimensional cubes into hbase (HBase is currently only supported), which is used for both Mondrian and Kylin, and finds that the two systems are a trade-off between time and space, Mondrian is a ROLAP system , all queries can be completed by real-time database queries, without any expected calculation, greatly saving storage space requirements (but there will be a cache of query results, is currently cached in the program memory, it is easy to cause oom), and Kylin is a MOLAP system, All the data results that need to be queried are cached in an precomputed way, requiring a large amount of storage space (the number of times the original data is doubled). In general, the data we want to analyze may be stored in a relational database (MySQL, Oracle, typically some business data written inside a program, there may be a need for sub-tables or even libraries), data on HDFS (structured data, typically business log information, query via hive), text files, Excel, and so on. Kylin is primarily a pre-calculation of data in hive, implemented using the MapReduce framework of Hadoop. And Mondrian can theoretically support arbitrary SQL interface data, because the relational database is generally indexed, so even if using Mondrian to query performance is acceptable, currently we use the Oracle database, thousands level of records, the query can be completed at the minute level, But for Hive,This kind of data source query is too slow, too slow to accept.
System ArchitectureSo, we started to try to use Kylin,kylin to solve the data analysis requirements of terabytes of data in big Data systems, and it might be a bit inappropriate to pre-compute data analysis in relational databases (relational databases are usually indexed so that even large amounts of data are not too slow to query, Unless the SQL is poorly written). In the process of using kylin, but also gradually to kylin have a certain understanding, first look at the Kylin system architecture:
kylin system ArchitectureThe Kylin consists of the following parts: · REST Server: Provides some restful interfaces, such as creating a cube, building a cube, refreshing a cube, merging cubes, and so on, project, table, cube and other metadata management, user access rights, system configuration dynamic modification, and so on. In addition to this interface can be implemented SQL query, these interfaces on the one hand through the call of third-party programs, the other side is also used by the Kylin Web interface. Jdbc/odbc interface: Kylin provides JDBC driver, driver classname for org.apache.kylin.jdbc.Driver, prefix of URL used Jdbc:kylin:, The flow of queries using the JDBC interface is the same as the internal process of querying with a restful interface. This kind of interface also makes Kylin very good compatible Tebleau even Mondrian. Query Engine: Kylin uses an open-source calcite framework for SQL parsing, which is equivalent to the SQL engine layer. Routing: The module is responsible for translating the execution plan generated by the parse SQL into a cube-cached query, which is done in HBase by the precomputed cache, which can be completed in seconds or even milliseconds. There are some operations that use querying raw data (stored on Hadoop on a hive query), which is a relatively high latency query. Metadata:kylin has a large number of metadata information, including the definition of cube, the definition of a star model, job information, job output information, the directory information of the dimension, and so on, metadata and cube are stored in hbase, stored in the format is a JSON string, In addition, you can also choose to store metadata in the local file system. Cube Build Engine: This module is the basis for all modules, it is responsible for the precomputed creation of the cube, the process of creating is to read the raw data through hive and then generate htable through some mapreduce computation and then load into hbase.
Key Processes      in Kylin, the most critical two processes are the cube's pre-calculation process and the SQL query into cube, the cube's construction can be divided into cube's build and cube merge, first need to create a cube definition, This includes setting the cube name, the star model structure of cube, dimension information, measure information, setting the Where condition, setting the increment cube based on the partition defined in hive, setting the Rowkey, These settings are also visible in Mondrian, where a cube contains some dimension and measure,where conditions that determine the size of the source data, which can be implemented in Mondrian by view. In addition, Kylin provides incremental computing capabilities, although not up to the need for real-time computing, but can basically meet the needs of data analysis. The       query parsing process primarily uses the calcite framework to parse and convert user-entered SQL into Key-value query operations on HBase to get results, but it is used to find that support for SQL is relatively poor, All SQL cannot use join methods such as from A, a where xxx, and must use the inner (left and right) join, otherwise parsing will be an error. This causes Mondrian generated SQL to not be able to use the Kylin query at all (because the SQL generated by Mondrian is the previous way), and another limitation is to find that only cube-related tables and columns can be queried, such as group by dimension By querying the metrics that are defined, and other queries returning errors, this is not a big problem, after all, the cube model has been defined and we are unlikely to query for anything other than this model. It's a little bit unacceptable that kylin support for subqueries is weak, the results of a test discovery query often return empty (no row), and the same query has results in hive, which is not good for some product-side needs For example, the product side may need to inquire about the total sales for each month in a region with annual sales greater than XX. We would normally write such a sql:select month, sum (sales) from the fact where location in (the Select location from fact Group is having sum ( Sales) >) Group by Month; The test found that SQL was a disaster for relational databasesThe in statement causes the subsequent subquery to have no cached results and is written as select month, sum (sales) from fact  as A inner JOIN (select location from Fact GROUP by year Having sum (sales) > +) as B on a.location = B.location Group BY month; performance can be improved, but test discovery Kylin returns a null result, and Kylin is very efficient when querying in statements (after all, go to cache), then we have to execute the subquery first to get the location collection, and then write a SQL using where location in XXX (Kylin for query support using the IN clause is still pretty good) way to get results, This should be the place to improve.
Cube Model      The settings required by cube during the creation process are described here, and here are some specific meanings of each setting, first we will set the cube name and the notification list, the former need to be guaranteed to be globally unique, The latter are some of the events that email is used to inform cube. Then we need to define a star-like model, and as with the general Data Warehouse model, you need to specify a fact table and any number of dimension tables, and if there is a dimension table, you also need to specify the correlation between the fact table and the dimension table, that is, the join method. Next is the definition of dimension, in the definition of dimension can choose the type of dimension, divided into normal, hierachy and derived, this later introduced, dimension definition determines the size of the cube, Users are also required to know the original table very well.       Next is the definition Measure,kylin will create a metric for each cube that has an aggregate function of COUNT (1), which does not need to be associated with any columns, and the user-defined metric can choose Sum, COUNT, DISTINCT COUNT, MIN, MAX, and each measure definition can also select the parameters of these aggregate functions, you can select a constant or a column of the fact table, in general we certainly choose a column. Here we find that Kylin does not provide relatively complex aggregation functions such as AVG (Variance, average difference is not even more), mainly because it requires the cache-based cube to do incremental calculations and merge into the new cube, and these complex aggregate functions can not simply calculate two values after the new value, For example, the sum value corresponding to a key in the two cubes that need to be incrementally merged is a and B respectively, then a+b after the merge, and if the aggregate function at this time is AVG, then we must know the count and sum of the key to do the aggregation. This requires the user to find a way to calculate their own.       Define the Where condition after measure, this step is to filter the original data, for example, we set sales less than XXX region is not within the scope of this analysis, then you can set the location in the Where condition in XXX (subquery), then the generated cube will filter out the location, this step is equivalent to the invalid data cleaning, but in the Kylin this is curable, not easy to change, for example, I would like to sell less than XX in the region to clean out, Tomorrow you may want to spend less than XXX users to remove, which requires each time to create a same cube, the difference is only the where condition, they will haveA lot of duplicate cache data can also lead to waste of storage space, but this is also the MOLAP system is unavoidable, so when the filter conditions change more, the better solution is to create a complete cube (not set any where condition), using a subquery to filter out some of the non-desirable dimension members.       The next step is to set up the incremental cube information by first selecting a partition column of one of the time types in the fact table (which seems to be partitioned only by day), and then specifying the time range (starting point and end point) of the cube that was built. The result of this step will be used as the where condition of the original data query, and ensure that the cube of this build contains only the data of this closed interval time, if the fact table does not have the difference of the time type or if no partition is selected, the data will not be updated dynamically, and the cube cannot be created incrementally.       The last step of setting up Rowkey, this step of the recommendation is to look at it, do not make changes, unless you have a deep understanding of kylin internal implementation to know how to modify. Of course, there is a can be modified mandatory dimension, if a dimension needs to appear every time the query, then you can set this dimension for mandatory, can save a lot of storage space, but also can be divided into all the dimensions group , dimension that do not combine queries can be divided into different groups, which also reduces storage space.

Dimension IntroductionIn a multidimensional data collection, the number of dimensions determines the number of possible combinations between dimensions, and the size of the member collection in each dimension determines the number of possible combinations, such as three normal dimensions A, B, and C, with different members of 10/100/ 1000, then a combination of a dimension has 2 of 3 of the square, is {empty, A, B, C, AB, BC, AC, ABC}, each member we call cuboid (the combination of dimensions), and these sets of the number of member combinations are 1, 10, 100, 1000, 10*100, 100 *1000, 10*1000 and 10*100*1000. We call each dimension the number of different members of the cardinatily, we should try to avoid the storage cardinatily higher dimensions of the combination, in the above example we can not cache BC and C these two cuboid,     You can calculate the value of a member combination in BC or C by calculating the value of a member in ABC, which is equivalent to a trade-off between time and space. The four dimensions that exist in Kylin are designed to reduce the number of cuboid, rather than whether each dimension is cached, and the current kylin is calculated and stored for all the combinations in all cuboid, for normal dimension,     From the above example, we can see that n dimensions of the number of cuboid is 2 of the N-square, and the Kylin set some dimensions can reduce the number of cuboid, of course, this requires the user to the dimension of their own need to understand, know what they may be based on the group by. Well, let's take a look at the three special dimension in Kylin and their role, here's a reference: Http://www.slideshare.net/YangLi43/design-cube-in-apache-kylin 1, Mandatory dimensionThis dimension means that each query is carried in the group by, and setting one dimension to mandatory can reduce the number of cuboid by half, such as: mandatory dimension This is because we are sure that every time the group By will carry a, then you can save all the cuboid that do not contain the dimension of A. 2, Hierarchy dimensionThis dimension is the most common, especially in the Mondrian, we often have the operation of the multi-dimensional data, such as the roll drill, which requires a hierarchical relationship between the dimensions, such as country, province, city, year, quarter, month, etc. A hierarchical dimension can also significantly reduce the number of cuboid. Such as: Hierarchy dimensionThis is only limited to a/b/c is a hierarchy, such as a is the year, B is the quarter, C is the month, then the query may be combined only the year, XX years of the quarter, xx xx quarter of XX months, which means that we can no longer separate quarterly and monthly aggregation, For example, we cannot use GROUP by month when we query and must use GROUP by Year,quart,month. If you need to aggregate the month separately, you also need to define a separate normal dimension with the month column. 3, derived dimensionThis kind of dimension meaning is the dimension that can be deduced, need one or more columns corresponding to this dimension can and the dimension table's primary key is one-on, this dimension can greatly reduce cuboid number, such as: derived dimensionFor example, Timeid is the primary key of the dimension table, that is, the fact table, the time is only accurate to days, then the year, month, and day three columns can only correspond to a time_id, and time_id is the foreign key of the fact table, then we can specify year, month, Day is a derived dimension, and the actual storage can only determine the combination of dimensions based on the value of Timeid, but this requires that we use the group by at query time to specify all the columns in the derived dimension collection.       Finally, a brief introduction to how to calculate the number of cuboid, assuming we have two common dimensions brand, product, there is a hierarchy, containing four dimensions for year, quart, month, and day , a derived dimension that specifies the location information, including the country, province, and city columns, which is equivalent to a total of 9 dimensions, but according to the above analysis we do not need 512 points cuboid. The No. 0 layer cuboid (does not contain any dimensions, does not contain group by), the number of cuboid is 1, the number of members of the cuboid is 1, the 1th layer cuboid contains a dimension, a total of 4 combinations (brand, product, year, Location, because Quart is the second level of hierarchy, cannot be a separate group by, and the three columns of the location can be considered as a whole, the number of members has each dimension of the cardinality, the 2nd layer of the cuboid there are 7 kinds, respectively { Brand, Product}, {brand, year}, {brand, location}, {product, year}, {product, location}, {year, location}, and {year, quart} There are 8 types of cuboid on the 3rd floor, {brand, product, year}, {brand, product, location}, {product, year, location}, {brand, product, year}, { Brand, year, quart}, {product, year, quart}, {location, year, quart}, {year, quart, month}, there are 8 types of cuboid on the 4th floor, respectively {brand, Product, year, location}, {brand, product, year, quart}, {brand, LocatiOn, year, quart}, {product, location, year, quart}, {brand, year, Quart, month}, {product, year, quart, month}, {location, There are 7 cuboid in the 5th tier of year, Quart, month, and {year, quart, month and day}, respectively {brand, product, year, quart, location}, {brand, product, Year, Quart, momth}, {brand, location, year, quart, month}, {product, location, year, quart, month}, {brand, year, Quart, There are 5 cuboid in the 6th tier of month, day}, {product, year, quart, month, day}, {location, year, quart, month, day}, respectively {brand, product, year, Quart, month, location}, {brand, product, year, quart, momth, day}, {brand, location, year, quart, month, day}, {product, Location, year, quart, month, day} The 7th level of cuboid has 1, for {brand, product, year, quart, month, day, location} So altogether 40 cuboid (Kylin calculated is 39, should not have the No. 0 layer of calculation included).
Incremental CubeSince the core of Kylin is to calculate the cache data, then for real-time data query support is not as good as Mondrian, but in general, our data analysis is not completely real-time requirements, data delay a few hours or even a day is acceptable, Kylin provides an incremental cube interface, Kylin implementation is a cube (here refers to the logical cube) can contain multiple segment, each segment corresponds to a physical cube, on the actual storage corresponding to an hbase table, user-defined based on a field increment (currently only support time , and this field must be a partition field of hive), when using the first need to define the definition of cube, you can specify a time of the partition field as a dependent field of the increment cube, in fact, this choice is as the original data selection criteria, For example, select the data from a to B from the start time then the cube created will only contain data aggregation values for that time period, and once the cube is created, it can be built again based on the previous cube, and each build will generate a new segment. Only the original data is different (according to the time interval specified by each build), each query will be queried every time after the segment aggregation of values to return, a bit similar to the way the tablet is stored, but when the segment there are too many times the query efficiency will fall, Therefore, it is necessary to merge them when there are multiple segment, when the merge is actually specified a time interval, the interior selects all the segment in this time interval for merging, and after the merge is complete, replace the merged segment with the new segment. The merge is executed very quickly, and the data does not need to be fetched from HDFs, and the data of the same key in the two hbase tables can be aggregated directly. One thing to note, however, is that when the merge is complete, the hbase tables corresponding to the merged segment are not deleted. The actual use of the cube for the increment can be written a timed task every morning to build, when a number of the merge (in fact, each build after the completion of the merge should also be possible).
the dictionary tree of CubeKylin cube data is stored as Key-value structure in HBase, key is the combined value of each dimension member, the structure of key under different cuboid is not the same, for example cuboid={brand,product,year} The following key may be brand= ' Nike ', product= ' shoe ', year=2015, then this key can be written as nike:shoe:2015, but if you use this way there will be a lot of repetition, So normally we take all the members of a dimension and then save it in an array, using the array's subscript combination as a key, which can save the key storage space, Kylin also uses the same method, but uses the dictionary tree (trie tree), The dictionary tree for each dimension is stored in HBase in binary form as metadata for the cube, and it remains in memory.
SummaryThe above describes the overall framework of Kylin and part of the process of the module, because the previous focus on the cube's operations, such as create, build, merge, etc., for the query this piece of understanding less, of course, this piece is one of the core of Kylin. The next step is to see how Kylin is built and Mergecube from a source code perspective, and the process of executing a query.

Copyright NOTICE: This article for Bo Master original article, without Bo Master permission not reproduced.

OLAP Engine--kylin Introduction

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.