The Practice of a Data Warehouse Based on the Hadoop Ecosystem -- Advanced Technology (III)


III. Dimension Subsets
Some requirements do not need the most fine-grained data. For example, you may want one record per month rather than per day, or, relative to the entire sales data, you may only be interested in data with some specific status. Such dimensions are built from rows selected out of the detail dimension, so they are called dimension subsets. A dimension subset contains less data than the detail dimension, which makes it easier to use and faster to query.
In this section you will build two specific dimensions derived from existing ones: the month dimension (a subset of the date dimension) and the Pennsylvania (PA) customer dimension (a subset of the customer dimension).
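As a preview of why a subset is easier to use, the month-level rows can always be derived from the detail dimension with a DISTINCT scan, but the pre-built subset returns them directly. A sketch against the date_dim and month_dim tables defined below:

```sql
-- deriving the month level from the detail dimension: works, but scans date_dim
SELECT DISTINCT month, month_name, quarter, year
  FROM date_dim;

-- the pre-built subset holds the same rows at the month grain
SELECT month, month_name, quarter, year
  FROM month_dim;
```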

1. Establishing the Month dimension table
Execute the following script to create the month dimension table. Note that the month dimension does not contain the promo_ind column: a promotion flag does not work at the month level, because one month may contain more than one promotion period and not every day of such a month belongs to one. Promotion flags apply at the day level.
USE dw;

CREATE TABLE month_dim (
    month_sk INT COMMENT 'surrogate key',
    month TINYINT COMMENT 'month',
    month_name VARCHAR(9) COMMENT 'month name',
    quarter TINYINT COMMENT 'quarter',
    year SMALLINT COMMENT 'year'
)
COMMENT 'month dimension table'
CLUSTERED BY (month_sk) INTO 8 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional' = 'true');
To load the month dimension in step with the date dimension, month loading is embedded in the pre-load script of the date dimension. You need to modify the script that generates the date dimension data in the "build the data warehouse sample model" step. The contents of the modified date_dim_generate.sh file are shown.

The contents of the modified create_table_date_dim.sql file are as follows:
-- set variables to support transactions
SET hive.support.concurrency=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;

USE dw;

-- the first execution creates the date dimension staging table
CREATE TABLE IF NOT EXISTS date_dim_tmp (
    date DATE COMMENT 'date, yyyy-mm-dd',
    month TINYINT COMMENT 'month',
    month_name VARCHAR(9) COMMENT 'month name',
    quarter TINYINT COMMENT 'quarter',
    year SMALLINT COMMENT 'year',
    promo_ind CHAR(1) COMMENT 'promotion index'
)
COMMENT 'date dimension table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- the first execution creates the date dimension table
CREATE TABLE IF NOT EXISTS date_dim (
    date_sk INT COMMENT 'surrogate key',
    date DATE COMMENT 'date, yyyy-mm-dd',
    month TINYINT COMMENT 'month',
    month_name VARCHAR(9) COMMENT 'month name',
    quarter TINYINT COMMENT 'quarter',
    year SMALLINT COMMENT 'year',
    promo_ind CHAR(1) COMMENT 'promotion index'
)
COMMENT 'date dimension table'
CLUSTERED BY (date_sk) INTO 8 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional' = 'true');

-- the first execution creates the month dimension table
CREATE TABLE IF NOT EXISTS month_dim (
    month_sk INT COMMENT 'surrogate key',
    month TINYINT COMMENT 'month',
    month_name VARCHAR(9) COMMENT 'month name',
    quarter TINYINT COMMENT 'quarter',
    year SMALLINT COMMENT 'year'
)
COMMENT 'month dimension table'
CLUSTERED BY (month_sk) INTO 8 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional' = 'true');
The new append_date.sql file contains the following:
-- set variables to support transactions
SET hive.support.concurrency=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;

USE dw;

-- append data to the date dimension table
INSERT INTO date_dim
SELECT ROW_NUMBER() OVER (ORDER BY date) + t2.sk_max,
       t1.date,
       t1.month,
       t1.month_name,
       t1.quarter,
       t1.year,
       t1.promo_ind
  FROM (SELECT * FROM date_dim_tmp) t1
 CROSS JOIN (SELECT COALESCE(MAX(date_sk), 0) sk_max FROM date_dim) t2;

-- append data to the month dimension table
INSERT INTO month_dim
SELECT ROW_NUMBER() OVER (ORDER BY t1.year, t1.month) + t2.sk_max,
       t1.month,
       t1.month_name,
       t1.quarter,
       t1.year
  FROM (SELECT DISTINCT month, month_name, quarter, year FROM date_dim_tmp) t1
 CROSS JOIN (SELECT COALESCE(MAX(month_sk), 0) sk_max FROM month_dim) t2;
The following three modifications were made to the pre-load process:
    • When the CSV file is generated, the surrogate key column is removed and the promotion period flag column is added.
    • Before generating the date_dim.csv date data file, the create_table_date_dim.sql script is called to create the tables, and the append_date.sql script is called afterwards to append the data.
    • The data is loaded into the temporary table date_dim_tmp, and append_date.sql then handles loading from date_dim_tmp into date_dim. There are two reasons for this: first, later runs may need to append dates rather than regenerate all of the data; second, date_dim is stored in the binary ORC format, so it cannot be loaded directly from a text file, and its data can only be inserted from a table stored in a plain-text format.
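The staging step in the last point can be sketched as follows; the file path is hypothetical, and the actual LOAD DATA statement is issued by date_dim_generate.sh:

```sql
-- LOAD DATA only moves files, so it can target the TEXTFILE staging table
-- but not the binary, transactional ORC table
LOAD DATA LOCAL INPATH '/tmp/date_dim.csv' OVERWRITE INTO TABLE date_dim_tmp;
-- date_dim itself is then populated with the INSERT ... SELECT in append_date.sql
```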
With the modified script, whenever date records are added, the months those dates fall in are loaded into the month dimension if they are not already there. Test the pre-load of the date and month dimension tables as follows.
(1) Drop the date_dim_tmp, date_dim, and month_dim tables
USE dw;
DROP TABLE date_dim_tmp;
DROP TABLE date_dim;
DROP TABLE month_dim;
(2) Run the pre-load to generate date data from January 1, 2000 through December 31, 2010
./date_dim_generate.sh 2000-01-01 2010-12-31
This run creates the dimension tables from scratch; the resulting date and month dimension data are as shown.


(3) Run the pre-load again to generate date data from January 1, 2011 through December 31, 2020
./date_dim_generate.sh 2011-01-01 2020-12-31
This run appends dates to the existing dimension tables; the resulting date and month dimension data are as shown.
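A row-count sanity check after the two runs follows directly from the date ranges: 2000-01-01 through 2020-12-31 covers 21 years with six leap days, i.e. 21 * 365 + 6 = 7671 days and 21 * 12 = 252 months. A sketch:

```sql
USE dw;
-- expected: 7671
SELECT COUNT(*) FROM date_dim;
-- expected: 252
SELECT COUNT(*) FROM month_dim;
```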


(4) Execute the on_demand.sh script created in the previous section to update the promotion flag column and restore the original data.
./on_demand.sh
2. Establishing the PA customer dimension table
The month dimension is a rolled-up dimension, containing data at a higher level than its base dimension. A specific dimension subset, in contrast, consists of specific rows selected from the base dimension. Execute the following script to create the specific dimension table for the Pennsylvania (PA) customer dimension subset.
USE dw;

CREATE TABLE pa_customer_dim (
    customer_sk INT COMMENT 'surrogate key',
    customer_number INT COMMENT 'number',
    customer_name VARCHAR(50) COMMENT 'name',
    customer_street_address VARCHAR(50) COMMENT 'address',
    customer_zip_code INT COMMENT 'zipcode',
    customer_city VARCHAR(30) COMMENT 'city',
    customer_state VARCHAR(2) COMMENT 'state',
    shipping_address VARCHAR(50) COMMENT 'shipping_address',
    shipping_zip_code INT COMMENT 'shipping_zip_code',
    shipping_city VARCHAR(30) COMMENT 'shipping_city',
    shipping_state VARCHAR(2) COMMENT 'shipping_state',
    version INT COMMENT 'version',
    effective_date DATE COMMENT 'effective date',
    expiry_date DATE COMMENT 'expiry date'
)
CLUSTERED BY (customer_sk) INTO 8 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional' = 'true');
Note that the PA customer dimension subset differs from the month dimension subset in two respects:
    • The pa_customer_dim table has exactly the same columns as the customer_dim table, whereas month_dim does not contain the date column of the date_dim table.
    • The surrogate keys in the pa_customer_dim table are the customer dimension's surrogate keys, whereas the surrogate key of the month_dim table does not come from the date dimension.
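Because pa_customer_dim retains the customer dimension's surrogate keys, it joins the fact table exactly as customer_dim does. A sketch, using the sales_order_fact table from the earlier sections:

```sql
-- total order amount per Pennsylvania customer
SELECT p.customer_name,
       SUM(f.order_amount) order_amount
  FROM sales_order_fact f
 INNER JOIN pa_customer_dim p ON f.customer_sk = p.customer_sk
 GROUP BY p.customer_name;
```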
3. Modifying the regular load file regular_etl.sql
Processing of the PA customer dimension needs to be added to the regular load. Here the PA customer dimension data is simply reloaded in full after customer_dim is loaded. The modified regular_etl.sql file is as follows:
-- set variables to support transactions
SET hive.support.concurrency=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;

USE dw;

-- set the effective date and expiry date for SCD
SET hivevar:cur_date = CURRENT_DATE();
SET hivevar:pre_date = DATE_ADD(${hivevar:cur_date}, -1);
SET hivevar:max_date = CAST('2200-01-01' AS DATE);

-- set the CDC upper bound time
INSERT OVERWRITE TABLE rds.cdc_time SELECT last_load, ${hivevar:cur_date} FROM rds.cdc_time;

-- load the customer dimension
-- set expiry on deleted records and SCD2 on the address-related columns;
-- the <=> operator handles NULL values
UPDATE customer_dim
   SET expiry_date = ${hivevar:pre_date}
 WHERE customer_dim.customer_sk IN
       (SELECT a.customer_sk
          FROM (SELECT customer_sk,
                       customer_number,
                       customer_street_address,
                       customer_zip_code,
                       customer_city,
                       customer_state,
                       shipping_address,
                       shipping_zip_code,
                       shipping_city,
                       shipping_state
                  FROM customer_dim
                 WHERE expiry_date = ${hivevar:max_date}) a
          LEFT JOIN rds.customer b ON a.customer_number = b.customer_number
         WHERE b.customer_number IS NULL
            OR !(a.customer_street_address <=> b.customer_street_address)
            OR !(a.customer_zip_code <=> b.customer_zip_code)
            OR !(a.customer_city <=> b.customer_city)
            OR !(a.customer_state <=> b.customer_state)
            OR !(a.shipping_address <=> b.shipping_address)
            OR !(a.shipping_zip_code <=> b.shipping_zip_code)
            OR !(a.shipping_city <=> b.shipping_city)
            OR !(a.shipping_state <=> b.shipping_state));

-- handle SCD2 new rows on the address-related columns
INSERT INTO customer_dim
SELECT ROW_NUMBER() OVER (ORDER BY t1.customer_number) + t2.sk_max,
       t1.customer_number,
       t1.customer_name,
       t1.customer_street_address,
       t1.customer_zip_code,
       t1.customer_city,
       t1.customer_state,
       t1.shipping_address,
       t1.shipping_zip_code,
       t1.shipping_city,
       t1.shipping_state,
       t1.version,
       t1.effective_date,
       t1.expiry_date
  FROM (SELECT t2.customer_number customer_number,
               t2.customer_name customer_name,
               t2.customer_street_address customer_street_address,
               t2.customer_zip_code customer_zip_code,
               t2.customer_city customer_city,
               t2.customer_state customer_state,
               t2.shipping_address shipping_address,
               t2.shipping_zip_code shipping_zip_code,
               t2.shipping_city shipping_city,
               t2.shipping_state shipping_state,
               t1.version + 1 version,
               ${hivevar:pre_date} effective_date,
               ${hivevar:max_date} expiry_date
          FROM customer_dim t1
         INNER JOIN rds.customer t2
            ON t1.customer_number = t2.customer_number
           AND t1.expiry_date = ${hivevar:pre_date}
          LEFT JOIN customer_dim t3
            ON t1.customer_number = t3.customer_number
           AND t3.expiry_date = ${hivevar:max_date}
         WHERE (!(t1.customer_street_address <=> t2.customer_street_address)
            OR  !(t1.customer_zip_code <=> t2.customer_zip_code)
            OR  !(t1.customer_city <=> t2.customer_city)
            OR  !(t1.customer_state <=> t2.customer_state)
            OR  !(t1.shipping_address <=> t2.shipping_address)
            OR  !(t1.shipping_zip_code <=> t2.shipping_zip_code)
            OR  !(t1.shipping_city <=> t2.shipping_city)
            OR  !(t1.shipping_state <=> t2.shipping_state))
           AND t3.customer_sk IS NULL) t1
 CROSS JOIN (SELECT COALESCE(MAX(customer_sk), 0) sk_max FROM customer_dim) t2;

-- handle SCD1 on the customer_name column
-- because the SET clause of Hive's UPDATE does not yet support subqueries, a temporary
-- table stores the records that need updating, and DELETE plus INSERT replace UPDATE
-- because SCD1 itself keeps no history, all versions of a changed customer_name
-- are updated here, not just the current version
DROP TABLE IF EXISTS tmp;
CREATE TABLE tmp AS
SELECT a.customer_sk,
       a.customer_number,
       b.customer_name,
       a.customer_street_address,
       a.customer_zip_code,
       a.customer_city,
       a.customer_state,
       a.shipping_address,
       a.shipping_zip_code,
       a.shipping_city,
       a.shipping_state,
       a.version,
       a.effective_date,
       a.expiry_date
  FROM customer_dim a, rds.customer b
 WHERE a.customer_number = b.customer_number
   AND !(a.customer_name <=> b.customer_name);
DELETE FROM customer_dim WHERE customer_dim.customer_sk IN (SELECT customer_sk FROM tmp);
INSERT INTO customer_dim SELECT * FROM tmp;

-- handle new customer records
INSERT INTO customer_dim
SELECT ROW_NUMBER() OVER (ORDER BY t1.customer_number) + t2.sk_max,
       t1.customer_number,
       t1.customer_name,
       t1.customer_street_address,
       t1.customer_zip_code,
       t1.customer_city,
       t1.customer_state,
       t1.shipping_address,
       t1.shipping_zip_code,
       t1.shipping_city,
       t1.shipping_state,
       1,
       ${hivevar:pre_date},
       ${hivevar:max_date}
  FROM (SELECT t1.*
          FROM rds.customer t1
          LEFT JOIN customer_dim t2 ON t1.customer_number = t2.customer_number
         WHERE t2.customer_sk IS NULL) t1
 CROSS JOIN (SELECT COALESCE(MAX(customer_sk), 0) sk_max FROM customer_dim) t2;

-- reload the PA customer dimension
TRUNCATE TABLE pa_customer_dim;
INSERT INTO pa_customer_dim
SELECT customer_sk,
       customer_number,
       customer_name,
       customer_street_address,
       customer_zip_code,
       customer_city,
       customer_state,
       shipping_address,
       shipping_zip_code,
       shipping_city,
       shipping_state,
       version,
       effective_date,
       expiry_date
  FROM customer_dim
 WHERE customer_state = 'PA';

-- load the product dimension
-- set expiry on deleted records and SCD2 on the product_name, product_category columns
UPDATE product_dim
   SET expiry_date = ${hivevar:pre_date}
 WHERE product_dim.product_sk IN
       (SELECT a.product_sk
          FROM (SELECT product_sk, product_code, product_name, product_category
                  FROM product_dim
                 WHERE expiry_date = ${hivevar:max_date}) a
          LEFT JOIN rds.product b ON a.product_code = b.product_code
         WHERE b.product_code IS NULL
            OR (a.product_name <> b.product_name OR a.product_category <> b.product_category));

-- handle SCD2 new rows on the product_name, product_category columns
INSERT INTO product_dim
SELECT ROW_NUMBER() OVER (ORDER BY t1.product_code) + t2.sk_max,
       t1.product_code,
       t1.product_name,
       t1.product_category,
       t1.version,
       t1.effective_date,
       t1.expiry_date
  FROM (SELECT t2.product_code product_code,
               t2.product_name product_name,
               t2.product_category product_category,
               t1.version + 1 version,
               ${hivevar:pre_date} effective_date,
               ${hivevar:max_date} expiry_date
          FROM product_dim t1
         INNER JOIN rds.product t2
            ON t1.product_code = t2.product_code
           AND t1.expiry_date = ${hivevar:pre_date}
          LEFT JOIN product_dim t3
            ON t1.product_code = t3.product_code
           AND t3.expiry_date = ${hivevar:max_date}
         WHERE (t1.product_name <> t2.product_name OR t1.product_category <> t2.product_category)
           AND t3.product_sk IS NULL) t1
 CROSS JOIN (SELECT COALESCE(MAX(product_sk), 0) sk_max FROM product_dim) t2;

-- handle new product records
INSERT INTO product_dim
SELECT ROW_NUMBER() OVER (ORDER BY t1.product_code) + t2.sk_max,
       t1.product_code,
       t1.product_name,
       t1.product_category,
       1,
       ${hivevar:pre_date},
       ${hivevar:max_date}
  FROM (SELECT t1.*
          FROM rds.product t1
          LEFT JOIN product_dim t2 ON t1.product_code = t2.product_code
         WHERE t2.product_sk IS NULL) t1
 CROSS JOIN (SELECT COALESCE(MAX(product_sk), 0) sk_max FROM product_dim) t2;

-- load the order dimension
INSERT INTO order_dim
SELECT ROW_NUMBER() OVER (ORDER BY t1.order_number) + t2.sk_max,
       t1.order_number,
       t1.version,
       t1.effective_date,
       t1.expiry_date
  FROM (SELECT order_number order_number,
               1 version,
               order_date effective_date,
               '2200-01-01' expiry_date
          FROM rds.sales_order, rds.cdc_time
         WHERE entry_date >= last_load AND entry_date < current_load) t1
 CROSS JOIN (SELECT COALESCE(MAX(order_sk), 0) sk_max FROM order_dim) t2;

-- load the sales order fact table
INSERT INTO sales_order_fact
SELECT order_sk,
       customer_sk,
       product_sk,
       date_sk,
       order_amount,
       order_quantity
  FROM rds.sales_order a,
       order_dim b,
       customer_dim c,
       product_dim d,
       date_dim e,
       rds.cdc_time f
 WHERE a.order_number = b.order_number
   AND a.customer_number = c.customer_number
   AND a.order_date >= c.effective_date
   AND a.order_date < c.expiry_date
   AND a.product_code = d.product_code
   AND a.order_date >= d.effective_date
   AND a.order_date < d.expiry_date
   AND TO_DATE(a.order_date) = e.date
   AND a.entry_date >= f.last_load
   AND a.entry_date < f.current_load;

-- update the last_load field of the timestamp table
INSERT OVERWRITE TABLE rds.cdc_time SELECT current_load, current_load FROM rds.cdc_time;
4. Testing
(1) Execute the following SQL script to add one PA customer and four OH customers to the customer source data.
USE source;

INSERT INTO customer
       (customer_name, customer_street_address, customer_zip_code,
        customer_city, customer_state, shipping_address,
        shipping_zip_code, shipping_city, shipping_state)
VALUES ('PA Customer', '1111 Louise Dr', '17050', 'Mechanicsburg', 'PA', '1111 Louise Dr', '17050', 'Mechanicsburg', 'PA'),
       ('Bigger Customers', '7777 Ridge Rd.', '44102', 'Cleveland', 'OH', '7777 Ridge Rd.', '44102', 'Cleveland', 'OH'),
       ('Smaller Stores', '8888 Jennings Fwy.', '44102', 'Cleveland', 'OH', '8888 Jennings Fwy.', '44102', 'Cleveland', 'OH'),
       ('Small-Medium Retailers', '9999 Memphis Ave.', '44102', 'Cleveland', 'OH', '9999 Memphis Ave.', '44102', 'Cleveland', 'OH'),
       ('OH Customer', '6666 Ridge Rd.', '44102', 'Cleveland', 'OH', '6666 Ridge Rd.', '44102', 'Cleveland', 'OH');
COMMIT;
(2) Perform regular loading and view the results.
Use the following command to perform the regular load.
./regular_etl.sh
Use the following query to verify the results.
USE dw;
SELECT customer_name, customer_state, effective_date, expiry_date
  FROM pa_customer_dim;
As a result, the pa_customer_dim table now contains 19 records, as shown.

This section illustrated the implementation of two typical kinds of dimension subsets through the examples of the month dimension and the PA customer dimension.

