Kylin Building Cube Optimization

Objective


The following introduces cube optimization ideas by walking through the steps of the Kylin cube build process.





Create Hive Intermediate Table


In the first step of a cube build, Kylin creates an intermediate Hive table by joining the fact table with all of its dimension tables, producing a single wide (flat) table.



Optimization points:



1. Hive table partitioning. When building the wide table, Kylin has to scan the source Hive tables; if the fact table and dimension tables are partitioned, the scan time is reduced.



2. Hive configuration tuning, such as join-related settings, MapReduce-related settings, and so on.






After the wide table is created, Kylin runs a redistribution (re-balance) step on the Hive table to avoid uneven file sizes, controlled by



'kylin.engine.mr.mapper-input-rows=1000000'; by default each file contains about one million rows.






Code: 'CreateFlatHiveTableStep'


Find the cardinality of all dimensions


Kylin uses the HyperLogLog algorithm to estimate the cardinality of each dimension column. If a dimension's cardinality is very large, it is called an Ultra High Cardinality (UHC) column. So how do you deal with such dimensions?


Handling UHC at the Business Layer


For example, a timestamp dimension may have a cardinality in the billions; converting it to a date drops the cardinality to the hundreds of thousands.
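
A minimal sketch of this idea, assuming the raw value is an epoch-millisecond timestamp (the class and method names here are illustrative, not part of Kylin):

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DateTruncation {
    // Collapse an epoch-millisecond timestamp into its calendar date (UTC) before the data
    // reaches the Hive fact table, so the dimension column holds dates instead of timestamps
    // and its cardinality drops dramatically.
    public static LocalDate toDate(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).toLocalDate();
    }
}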





Handling UHC at the Technical Layer


Kylin performs this step with MapReduce. On the reduce side, each dimension is deduplicated by a single reducer, so when a dimension's cardinality is very large, that reducer runs very slowly or even runs out of memory. To cope with this, Kylin provides special handling for two kinds of columns:



1. Dimensions that must be globally unique, i.e. dimensions used in a count_distinct measure configured with 0% error rate (exact count distinct).



2. Dimensions used as 'shard by' columns, configured when the rowkey is built.



You can then set 'kylin.engine.mr.uhc-reducer-count=1' to declare how many reducers each of these columns should be split across.






Kylin can also allocate the reducer count based on the number of cuboids, via 'kylin.engine.mr.hll-max-reducer-number=1'. By default this feature is effectively off; raise this upper limit to enable it, and then tune the actual reducer count with 'kylin.engine.mr.per-reducer-hll-cuboid-number'.


 
 
int nCuboids = cube.getCuboidScheduler().getAllCuboidIds().size();
int shardBase = (nCuboids - 1) / cube.getConfig().getHadoopJobPerReducerHLLCuboidNumber() + 1;

int hllMaxReducerNumber = cube.getConfig().getHadoopJobHLLMaxReducerNumber();
if (shardBase > hllMaxReducerNumber) {
    shardBase = hllMaxReducerNumber;
}
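// 'shardBase' is the number of reducers allocated to the cuboid-level HLL statistics:
// one reducer per 'kylin.engine.mr.per-reducer-hll-cuboid-number' cuboids (rounded up),
// capped at 'kylin.engine.mr.hll-max-reducer-number'.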





The final number of reducers is the sum of the UHC part and the cuboid part; for the exact logic, see the constructor of 'FactDistinctColumnsReducerMapping'.
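
As a rough illustration of that sum (a hypothetical sketch, not the actual constructor logic; all the numbers and variable names below are made up, except 'shardBase' from the snippet above):

// hypothetical numbers, for illustration only
int numNormalColumns = 10;            // non-UHC distinct-value columns, assumed one reducer each
int numUhcColumns = 2;                // columns declared as UHC
int uhcReducerCount = 3;              // kylin.engine.mr.uhc-reducer-count
int cuboidStatsReducers = shardBase;  // cuboid/HLL reducers computed above

int totalReducers = numNormalColumns + numUhcColumns * uhcReducerCount + cuboidStatsReducers;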






# To enable the additional UHC dictionary-building step, you must also configure the ZooKeeper address (used as a global distributed lock).



# Because Kylin does not ship hbase-site.xml and other configuration files to YARN when running MapReduce, the ZooKeeper address has to be configured again in kylin.properties.



kylin.engine.mr.build-uhc-dict-in-additional-step=true



kylin.env.zookeeper-connect-string=host:port,host:port






Code: 'FactDistinctColumnsJob', 'UHCDictionaryJob'


Building a Dimension Dictionary


After the cardinality of all dimensions has been computed, Kylin builds a dictionary for each dimension. The dictionary metadata is stored in HDFS, while the actual dictionary data is stored in HBase.



The dictionary path pattern in HDFS is:



kylin/kylin_meta_data/kylin-$jobid/$cubeid/metadata/dict/$catalog.$table/$dimension/$uuid.dict






The rowkey pattern for dictionary data in HBase is:



/dict/$catalog.$table/$dimension/$uuid.dict





Rowkey length


A long rowkey takes up a lot of storage space, so the rowkey length needs to be controlled.



Currently Kylin performs dictionary encoding in-process, i.e. each string value is mapped to an int. If a dimension column's cardinality is very large (above roughly 10 million distinct values), this can cause an out-of-memory error. In that case, consider switching the dimension column's encoding to 'fixed_length' or another non-dictionary encoding.
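
As a toy illustration of why 'fixed_length' avoids the in-memory dictionary (this is not Kylin's implementation, just the idea): the value itself is truncated or padded to a fixed number of bytes and written into the rowkey directly, so no value-to-int mapping has to be held in memory.

import java.nio.charset.StandardCharsets;

public class FixedLengthEncodingDemo {
    // Truncate or pad the raw value to 'length' bytes; the value is stored directly
    // in the rowkey, so no dictionary is built for this column.
    public static byte[] encode(String value, int length) {
        byte[] out = new byte[length];   // unused tail bytes stay as 0x00 padding
        byte[] src = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(src, 0, out, 0, Math.min(src.length, length));
        return out;
    }
}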





Rowkey Construction


The rowkey layout also matters. In general, high-cardinality dimensions should be placed at the front of the rowkey so that scans can skip as many rowkeys as possible.



On the other hand, placing low-cardinality columns at the end of the rowkey reduces repeated computation during the build. Some cuboids can be aggregated from more than one parent cuboid, and in that case Kylin chooses the parent cuboid with the smallest ID. For example, AB can be generated by aggregating either ABC (ID: 1110) or ABD (ID: 1101); ABD is chosen as the parent because its ID is smaller than ABC's. If the cardinality of D is small, this aggregation is cheap. Therefore, when designing the rowkey order of a cube, remember to place low-cardinality dimension columns at the tail. This is good not only for the cube build but also for cube queries, since post-aggregation (which should refer to locating the corresponding cuboid in HBase) follows the same rule.
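
The parent-cuboid choice can be illustrated with cuboid IDs as bitmasks over the dimensions A, B, C, D (a simplified sketch, not Kylin's actual scheduler code):

// Cuboid IDs as bitmasks: one bit per dimension, in the order A, B, C, D.
long ABCD = 0b1111, ABC = 0b1110, ABD = 0b1101, AB = 0b1100;
long[] candidates = {ABCD, ABC, ABD};

long target = AB;
long bestParent = -1;
for (long parent : candidates) {
    // a valid parent must contain every dimension of the target cuboid
    boolean covers = (parent & target) == target;
    if (covers && (bestParent == -1 || parent < bestParent)) {
        bestParent = parent;
    }
}
// bestParent is ABD (1101): it covers AB and its ID is smaller than ABC (1110).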





Choosing the Cube Build Engine


You can choose either Spark or MapReduce to build the cube. The build engine is typically chosen as follows:


    1. Memory-intensive cubes, for example those with Count Distinct or Top-N measures, use MapReduce.
    2. Simple cubes, for example those with only Sum/Min/Max/Count measures, use Spark.





Spark Engine



The Spark build engine uses the 'by-layer' algorithm, i.e. it computes the cube layer by layer.



For example, with 3 dimensions A, B and C, the cube contains the combinations A, B, C, AB, AC, BC and ABC. These are computed in 3 layers, starting from the base cuboid:



Layer 1: ABC (the base cuboid)



Layer 2: AB, AC, BC



Layer 3: A, B, C



For Spark, each layer's computation is one action, and the RDD of a layer is computed from the RDD of the previous layer, which avoids a lot of repeated computation.
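
A heavily simplified sketch of the layer-by-layer idea using Spark's Java API (illustrative only, not the real 'SparkCubingByLayer'; 'baseCuboidRdd', 'totalLevels', 'dropOneDimension' and 'saveLayerToHdfs' are hypothetical and assumed to be defined elsewhere):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;
import java.util.List;

// The first layer is the base cuboid (all dimensions); each record is (dimension values, measure).
JavaPairRDD<List<String>, Long> layer = baseCuboidRdd;
for (int level = 1; level <= totalLevels; level++) {
    layer = layer
            .flatMapToPair(rec -> dropOneDimension(rec))   // hypothetical: emit child-cuboid keys
            .reduceByKey(Long::sum);                       // re-aggregate the measure per child cuboid
    layer.persist(StorageLevel.MEMORY_AND_DISK_SER());     // keep this layer around for the next one
    saveLayerToHdfs(layer, level);                         // hypothetical: one Spark action per layer
}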






Code: 'SparkCubingByLayer'


Design Patterns


Refer to the cube design patterns covered in the Kylin introduction.





Convert Data to HFile


Kylin loads the computed cube into HBase by generating HFiles; the following HBase-related parameters can be tuned for this step.


    1. The number of regions defaults to 1; if the data volume is large, the number of regions can be increased.
    2. The region size defaults to 5 GB, which is the size officially recommended by HBase; if the cube is smaller than this, you can reduce the size of a single region.
    3. The HFile size defaults to 1 GB. Because the HFiles are written by MapReduce, smaller files are faster to write but slower to read, while larger files are slower to write but faster to read.





Code: 'CubeHFileJob'


Cleanup
    1. Clean up the intermediate tables in Hive
    2. Clean up the HBase tables
    3. Clean up the HDFS data





Cleanup command



# View the data you need to clean up



./bin/kylin.sh org.apache.kylin.tool.StorageCleanupJob --delete false



# Clean up



./bin/kylin.sh org.apache.kylin.tool.StorageCleanupJob --delete true






Cleanup reference:



http://kylin.apache.org/docs20/howto/howto_cleanup_storage.html





Summary


The Kylin UI shows how long each step of a cube build takes, and optimization can be driven by these timings. A common approach is to start with the most time-consuming steps, for example:


    1. If creating the Hive intermediate table takes a long time, consider partitioning the Hive tables and switching the table file format to a high-performance format such as ORC or Parquet.
    2. If the cube build itself takes too long, check whether the cube design is reasonable, whether the number of dimension combinations can be reduced, and whether the build engine can be tuned.





The overall idea is to optimize the entire life cycle of the cube, with the cube at the center: every component involved is a potential optimization point, weighed against the actual data, dimensions, and business requirements.





Reference


Official documentation



http://kylin.apache.org/docs20/howto/howto_optimize_build.html






Official documentation, Cube performance optimization



http://kylin.apache.org/docs23/tutorial/cube_build_performance.html




