Apache Kylin Advanced Section: Using Hive Views


In this chapter we explain why Hive views are needed when creating a cube in Kylin: what benefits they bring, what problems they solve, how to use them, and what restrictions apply.

1. Why you need to use a view

Kylin uses Hive table data as the input source when creating a cube. However, in some cases the table definitions and data in Hive do not meet the needs of analysis: the values of some columns need processing, the types of some columns do not meet requirements, or, as sometimes happens when Hive tables are created for convenience, every column is defined as String. In many cases, therefore, the data in Hive needs to be processed appropriately before Kylin can use it.

If we solved these problems by modifying the original Hive table, for example changing its schema with ALTER TABLE, we might affect other components that depend on Hive (it could break data imports via Sqoop, for instance) or cause previously working jobs to fail. We therefore need a solution that leaves the original table unchanged, and the natural candidate is Hive's view.

Beyond the schema limitations of the Hive data source itself, Kylin imposes some restrictions on how Hive is used, which also push us toward views:

A. Creating different cubes from the same table under the same project (for example with different filter conditions or different dimensions) can cause queries to be routed to the wrong cube, among other anomalies (see the sketch after this list).

B. Only star models are supported

Our source data may contain multiple fact tables and multiple dimension tables, so the fact tables need to be consolidated into one large wide table.
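For instance (a minimal sketch; the table and view names here are hypothetical), limitation A can be worked around by exposing the same underlying table to Kylin under two different view names, one per cube:

-- Two logical views over one underlying table, so each cube in the
-- project binds to a differently named "table".
CREATE VIEW sales_online AS
SELECT * FROM sales WHERE channel = 'online';

CREATE VIEW sales_retail AS
SELECT * FROM sales WHERE channel = 'retail';

Each cube then binds to its own view, so queries route to the intended cube.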

2. How to use views

Hive currently supports only logical views. Since we just need to adjust the schema information Hive exposes, rather than speed up queries through materialized views, Hive's current view support is sufficient for Kylin's needs.

Here is how to create a view as an input source for Kylin in several different scenarios:

A. Sub-tables

Two tables have the same structure but hold different data, for example one table storing access data from Android and another storing access data from iOS. This can be resolved with a Hive view.

For example, a user has two tables, palearn_android and palearn_ios, with identical structure, and needs to analyze the platform (Android or iOS) as a dimension, so we created a view for it:

CREATE VIEW palearn_cube AS
SELECT userid, eventid, label, day, 'Android' AS platform
FROM palearn_android
WHERE category = 'gc001'
UNION ALL
SELECT userid, eventid, label, day, 'iOS' AS platform
FROM palearn_ios
WHERE category = 'gc001';

This lets the user create a cube with palearn_cube as the fact table and platform as one of the dimensions.

B. Custom Functions

Apache Calcite is used as Kylin's SQL query engine, but adding custom functions to Kylin is costly, so if fields need custom processing, a view can be created in Hive to handle them.
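For instance (a sketch, with hypothetical table and column names), fields can be pre-processed in the view with Hive's built-in functions instead of adding a custom function to Kylin:

-- Pre-process fields with Hive built-ins instead of a Kylin UDF.
CREATE VIEW v_events AS
SELECT userid,
       lower(eventid)                     AS eventid,  -- normalize case
       regexp_replace(label, '\\s+', ' ') AS label,    -- collapse whitespace
       day
FROM events;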

C. Support for Snowflake models

Currently Kylin supports only star models. By creating views in Hive we can easily convert a snowflake model into a star model, or even generate one large wide table.
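As a sketch (the snowflaked tables product_tbl and category_tbl here are hypothetical), a view can pre-join a chained dimension into a single flat dimension table:

-- Flatten a snowflaked dimension chain (product -> category) into
-- one flat dimension table that Kylin can use in a star schema.
CREATE VIEW dim_product_flat AS
SELECT p.productid,
       p.productname,
       c.categoryid,
       c.categoryname
FROM product_tbl p
JOIN category_tbl c ON p.categoryid = c.categoryid;

Kylin then sees dim_product_flat as a single dimension table in a star model.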

D. Frequently changing table field names

Kylin uses the field names in Hive directly as metadata, so frequently changing the field names of the fact table or dimension tables can produce metadata errors (https://issues.apache.org/jira/browse/KYLIN-1173). Adding a mapping layer through a view is therefore a good approach: the native Hive table's field names become transparent to Kylin's cube, and when a field name has to change, only the view definition needs to be modified, without affecting the cube.
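A minimal sketch (hypothetical names): the view pins the column names that Kylin sees, so a later rename in the underlying table only requires redefining the view:

-- The view pins stable column names for Kylin's metadata.
CREATE VIEW v_fact AS
SELECT order_id   AS orderid,
       order_date AS v_date,
       amount     AS pv
FROM raw_orders;

-- If raw_orders later renames order_date to dt, only the view changes:
ALTER VIEW v_fact AS
SELECT order_id   AS orderid,
       dt         AS v_date,
       amount     AS pv
FROM raw_orders;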

E. UHC (ultra-high-cardinality) dimensions

When the cardinality of a dimension is very large, several problems can arise. First, the "Extract Fact Table Distinct Columns" step may cause reducers to run out of memory (OOM). Second, building the dimension dictionary may consume a large amount of memory, slow down the cube build, and take up a lot of storage space. At this point, consider whether such a UHC dimension is really necessary and whether some information can be extracted to reduce its cardinality. For example, a timestamp dimension might only need 5-minute precision, and a detailed-address dimension might only need county or town precision; this greatly reduces cardinality, and the finer detail usually has little statistical significance anyway. Likewise, for a URL dimension, the query parameters can often be removed so that only the access path is preserved.
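For example (a sketch; the table access_log and its columns are hypothetical), both reductions described above can be done directly in the view:

-- Reduce dimension cardinality in the view.
CREATE VIEW v_access AS
SELECT from_unixtime(floor(ts / 300) * 300) AS ts_5min,  -- 5-minute buckets
       regexp_extract(url, '^[^?]*', 0)     AS url_path, -- drop ?params
       pv
FROM access_log;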

F. Changing a field's type in Hive

For example, we needed to compute a count distinct value for a metric. Although the field holds integer values, its type in the Hive table is String, and Kylin's count distinct aggregate function does not support character types. So we created a view in Hive that converts the string type to an integer type.
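A minimal sketch (raw_tbl and its columns are hypothetical):

-- Cast the string field to an integer so Kylin's exact count
-- distinct (integer-only) can be applied to it.
CREATE VIEW v_tbl AS
SELECT CAST(uid AS BIGINT) AS uid,
       day,
       pv
FROM raw_tbl;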

G. Handling complex data types

Hive allows columns of complex types such as map and struct, which Kylin cannot handle, so a view is needed to flatten complex-typed fields into plain dimension and measure columns.
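A sketch with hypothetical names, assuming props is a map<string,string> and device is a struct with an os field:

-- Flatten complex-typed columns (map/struct) into scalar columns.
CREATE VIEW v_flat AS
SELECT userid,
       props['channel'] AS channel,   -- map lookup
       device.os        AS device_os, -- struct field access
       pv
FROM raw_events;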

In our current practice, some of the fact tables that cubes rely on are created as views. This extra mapping layer reduces a cube's dependence on the original tables and increases flexibility.

3. Limitations of using views

Due to a Hive limitation, HCatalog cannot read data from views (https://issues.apache.org/jira/browse/HIVE-10851). As a result, when you load a view in Kylin, the job that calculates table cardinality cannot obtain cardinality values. The user therefore needs to know the approximate cardinality of each column; if unsure, it can be queried in Hive.
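For example, a column's cardinality can be checked in Hive with a simple query (view and column names here are placeholders):

-- Check the cardinality of a dimension column directly in Hive.
SELECT COUNT(DISTINCT some_column) FROM some_view;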

4. Using views in practice

Here is a brief demonstration of using a view; in our projects, views come up quite often.

You may recall that we created three tables earlier (see the "Cube Creation Practice for Multidimensional Analysis" section of the Apache Kylin advanced section):

Fact table: KYLIN_FLAT_DB.WEB_ACCESS_FACT_TBL

Dimension table: KYLIN_FLAT_DB.REGION_TBL

Dimension table: KYLIN_FLAT_DB.CITY_TBL

If you have been following the hands-on practice in this book, the kylin_flat_db database in Hive should contain these three tables. We will now create a view over them in Hive and build the cube on the view, roughly in the following steps:

Step One: Create a view

Execute the following SQL in the Hive CLI:

USE kylin_flat_db;

CREATE VIEW v_pvuv AS
SELECT a.day AS v_date,
       b.regionname,
       c.cityname,
       hash(a.cookieid) AS cookieid,
       a.pv
FROM web_access_fact_tbl a
JOIN region_tbl b ON a.regionid = b.regionid
JOIN city_tbl c ON a.cityid = c.cityid;

Here the cookieid field (string type) is processed with the hash function and the result is treated as an integer. We do this so that exact Count Distinct (which currently supports only integer types) can be used on it.

Note: a hash is irreversible and can collide; that is, hashing different strings may produce the same result, so the method above is not fully reliable. In our actual project we used a Hive custom function to convert the field that needed count distinct into an integer type (a field that is already an integer would not need processing).
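As an alternative sketch (this is not the custom function we used; it is just one way to get a collision-free integer id with built-in Hive features), a mapping view can assign each distinct cookieid a unique integer with a window function:

-- Assign each distinct cookieid a unique integer id; unlike
-- hash(), ROW_NUMBER() cannot produce collisions.
-- Note: the ids are recomputed on each read; materialize the
-- mapping as a table if stable ids across builds are needed.
CREATE VIEW cookie_map AS
SELECT cookieid,
       ROW_NUMBER() OVER (ORDER BY cookieid) AS cookie_int
FROM (SELECT DISTINCT cookieid FROM web_access_fact_tbl) t;

Joining cookie_map back into v_pvuv would then supply an integer column suitable for exact count distinct.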

Step Two: Create a project

Create a "View_project" project, and of course you can use any project that already exists.

Step three: Import the data source

Import the view v_pvuv that you just created under the kylin_flat_db database.

Step four: Create the model

For this demonstration, only the view is selected as the fact table, with no dimension tables:

Dimension fields: v_date, regionname, cityname

Measure fields: cookieid, pv

Partition field: v_date

Step five: Create a cube

Since you are already familiar with creating a cube, only a brief list of steps is given here.

The dimensions selected for creating the cube are as follows:



The metrics selected for creating the cube are as follows:

We use the exact count_distinct function to compute values on the cookieid field, so the return type is bitmap.

Step six: Build the cube

Set the end time for the build, and then build the cube.

Step seven: Query the cube

Execute the following SQL:

SELECT v_date,
       regionname,
       cityname,
       COUNT(DISTINCT cookieid)
FROM v_pvuv
GROUP BY v_date,
         regionname,
         cityname;

The result is:

Many of the issues mentioned above that arise while using Kylin can be resolved through Hive views. If you run into a problem that cannot be solved within Kylin itself, try a Hive view; it may surprise you.
