http://blog.csdn.net/yu616568/article/details/50548967
Why you need to use views
Kylin uses Hive tables as the input to a cube, but the table definitions and data in Hive do not always meet the needs of analysis: the values of some columns may need preprocessing, the types of some columns may not match what is required, and sometimes, for convenience, every column of a Hive table has simply been defined as string. In many such cases the data in Hive needs to be reshaped before Kylin can use it. However, modifying Hive's original schema with ALTER TABLE may affect other components that depend on the table (for example, it may cause data imports to fail), so using Hive views is a good solution here.
Besides the limitations of the Hive source schema itself, Kylin imposes some restrictions of its own on Hive tables, which also indirectly pushes us toward views:
1. If different cubes are created from the same table under the same project (for example, filtered by different conditions, or with different dimension sets), queries may be routed to the wrong cube.
2. Only the star schema is supported.
How to use views
Hive currently supports only logical views. What we need is simply to adjust Hive's original schema information, not to speed up queries through materialized views (although better performance would of course be welcome), so Hive's current view support is sufficient for Kylin. The following describes, scenario by scenario, how to create a view to serve as Kylin's input source.
1. Sub-tables: two tables share the same structure but hold different data, for example one table stores access data from the Android side and the other stores access data from iOS. These can be combined with a view.
For example, in our environment one user has two tables, product_android and product_ios, with identical structure, and needs to analyze with the platform (Android or iOS) as a dimension, so we created the following view for them:
CREATE VIEW product_cube AS
SELECT userid, eventid, label, day, 'android' AS platform FROM product_android WHERE category = 'a'
UNION ALL
SELECT userid, eventid, label, day, 'ios' AS platform FROM product_ios WHERE category = 'a';
The user can then build a cube with product_cube as the fact table and platform as one of the dimensions.
2. Custom functions: Apache Calcite supports user-defined functions, but the cost of supporting them in Kylin is relatively high, so if you need a custom function, you can instead create a view in Hive that computes each desired field.
3. Snowflake schema support: Kylin currently supports only the star schema, but a view can easily flatten a snowflake schema into a star schema, or even generate a single wide table.
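As a sketch (the table and column names here are hypothetical, not from the original post), a snowflake schema in which a dimension table itself references a second-level lookup table can be flattened into a single star-schema dimension with a view:

```sql
-- Hypothetical snowflake: dim_product references dim_category.
-- The view joins them into one flat dimension table for Kylin,
-- turning the snowflake into a star schema.
CREATE VIEW dim_product_flat AS
SELECT
  p.product_id,
  p.product_name,
  c.category_name,   -- pulled up from the second-level lookup table
  c.category_level
FROM dim_product p
JOIN dim_category c ON p.category_id = c.category_id;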
4. Frequently renamed table fields: Kylin uses Hive field names directly as metadata, so frequently renaming fields of the fact table or a dimension table corrupts the metadata ([https://issues.apache.org/jira/browse/KYLIN-1173]). Adding a mapping layer through a view is therefore a good way to make the native Hive table's field names transparent to the Kylin cube; subsequent renames then do not affect the cube, and only the view definition needs to be modified.
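A minimal sketch of such a mapping layer (table and column names are hypothetical): the view pins stable names for the cube, so a rename in the underlying table only requires updating the view's SELECT list.

```sql
-- The cube is defined against the stable names exposed by the view.
-- If fact_access later renames usr_id, only this view changes.
CREATE VIEW fact_access_cube AS
SELECT
  usr_id AS user_id,   -- stable name seen by Kylin
  evt_id AS event_id,
  day
FROM fact_access;
```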
5. Ultra-high-cardinality (UHC) dimensions: when a dimension has very high cardinality, many problems can arise. First, the "extract fact columns" step may cause reducers to run out of memory; second, building the dimension dictionary may consume a large amount of memory; in addition, the cube builds slowly and consumes a lot of storage. It is therefore worth asking whether such a UHC dimension is really necessary, and whether some information can be extracted to reduce its cardinality. For example, does a timestamp dimension need to be finer than 5 minutes? Does a detailed-address dimension need to be finer than county or town level? Coarsening such dimensions greatly reduces cardinality, and the extra detail usually has little statistical value. Similarly, for a URL dimension, the query parameters can often be stripped so that only the access path is kept. In our practice, when users need fine-grained time statistics we recommend truncating the time to 5-minute or 1-minute granularity; the time column is derived as follows when creating the view (flooring the timestamp to a multiple of 300 seconds):
from_unixtime(unix_timestamp(wanproxy_stat.ts) - unix_timestamp(wanproxy_stat.ts) % 300, 'yyyy-MM-dd HH:mm:ss') as ts
6. Dimension as a measure: in one cube we created, the table fields include time, position, and speed, where time and position are dimensions and speed is a measure of type double. The user needs, for each dimension or dimension combination, the average of speeds greater than M and less than N. Since this requires filtering on the measure, speed must also be usable as a dimension, but speed's cardinality is far too high, so we round it down to a multiple of 5, which greatly reduces the number of distinct values in this dimension. For this user's speed-filtering requirement we created the following dimension column:
floor(cast(wanproxy_stat.datasize / wanproxy_stat.uploaderusetime as int) / 5) * 5 as velocity
while the original speed column, cast(wanproxy_stat.datasize / wanproxy_stat.uploaderusetime as int) as speed, is kept directly as the measure. At query time the following SQL can then be executed:
select ts, sum(speed) / count(1) from table where velocity < N and velocity > M group by ts;
7. Hive table type conversion: sometimes, for convenience, all fields of a Hive table are defined as string, but Calcite is strict about types; for example, the EXTRACT function only operates on types such as date. So it is often a good idea to create a view that converts the Hive table's fields to the proper types before defining the cube.
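A sketch of such a type-converting view (the table and column names here are hypothetical):

```sql
-- Hypothetical all-string table access_log(day, user_id, bytes).
-- The view casts each column to the type the cube actually needs.
CREATE VIEW access_log_typed AS
SELECT
  cast(day AS date)     AS day,    -- lets Calcite apply EXTRACT etc.
  user_id,
  cast(bytes AS bigint) AS bytes   -- numeric measure instead of string
FROM access_log;
```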
8. Complex data types: Hive allows complex column types such as map and struct, but Kylin needs to see a flat table, so fields of complex types must be split out into separate dimension and measure columns.
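As a sketch (table and field names are hypothetical), struct fields can be flattened with dot notation and map entries extracted by key lookup:

```sql
-- Hypothetical table:
--   events(user struct<id:string, city:string>,
--          props map<string, string>, day string)
CREATE VIEW events_flat AS
SELECT
  user.id     AS user_id,   -- struct field -> flat column
  user.city   AS city,
  props['os'] AS os,        -- map entry -> flat column
  day
FROM events;
```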
In our current practice, every fact table that a cube depends on is created as a view. This extra mapping layer reduces the cube's coupling to the original tables and improves flexibility. We have found no data suggesting that Hive views perform poorly at query time, and in our actual use we have not observed views having any significant impact on the speed of Kylin's cube builds.
Restrictions on using views
Because of Hive's limitations on views, the following issues need attention:
1. Hive cannot read data from a view through HCatalog ([https://issues.apache.org/jira/browse/HIVE-10851]), so when you load a view, the Kylin job that computes the table's cardinality usually fails ([https://issues.apache.org/jira/browse/KYLIN-916]). The user therefore needs to know the cardinality of each column in advance; if unsure, you can query it in Hive (querying only a slice of the data, e.g. one time period, to get an estimate).
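A sketch of such a cardinality estimate (table, columns, and partition value are hypothetical), restricted to a single day's partition to keep the scan cheap:

```sql
-- Approximate each column's cardinality from one day of data.
SELECT
  count(DISTINCT user_id) AS user_id_cardinality,
  count(DISTINCT city)    AS city_cardinality
FROM access_log
WHERE day = '2016-01-15';
```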
2. Generally, when creating a cube, we specify a partition field (say day) that Kylin uses to incrementally build each day's new data. For a cube built incrementally, some table in the star schema, usually the fact table, must contain a date field, and that date field is normally one of the Hive table's partition keys. For log data, however, each row usually records the log's time as a timestamp. If time is needed as a dimension, say with the date as one of the dimensions, then a view is needed to convert the table's timestamp into a date. At that point the view already contains a date column, but we still recommend that the user also include the table's partition time field (the day field) in the view and use it as the cube's partition field: this greatly speeds up the first build step, since Hive then scans one partition instead of the whole table.
Given the partition field, can the view then keep only Hive's partition field as the date, without converting the timestamp field? No, because the partition field merely identifies a partition: its value is not necessarily equal to the day part of the timestamps, even though it usually is, and there is no guarantee that all timestamp values within a partition fall on the same date (at least for the Hive tables we use). For example, in the 2016-01-15 partition, the first 100 rows might have timestamps of 2016-01-14 23:59:59.xxx. So both fields need to be kept in the view: one as a dimension (derived by converting the table's timestamp field, see scenario 5), and one as the cube's partition field.
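A sketch of a view keeping both columns, reusing the wanproxy_stat table and its ts column from scenario 5 (the remaining columns are elided):

```sql
-- day: Hive partition key, kept as the cube's partition field.
-- ts:  row timestamp truncated to 5 minutes, kept as a time dimension.
CREATE VIEW wanproxy_stat_cube AS
SELECT
  day,
  from_unixtime(
    unix_timestamp(ts) - unix_timestamp(ts) % 300,
    'yyyy-MM-dd HH:mm:ss'
  ) AS ts
  -- plus the remaining dimension and measure columns
FROM wanproxy_stat;
```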
The problems we have encountered while using Kylin could all be solved nicely with Hive views, and some requirements can be implemented well through views without changing Hive tables or Kylin metadata at all. If you have other scenarios, please feel free to add them.
"Go" kylin Practice using Hive View