ETL design issues
http://blog.csdn.net/lili72
Background 1: As data access and processing volumes grow, more and more production scripts accumulate. Because the early developers did not manage them in a standardized way, the scripts are messy.
Solution:
1) Use standard directories on Linux: store scripts in directories organized by platform and business module.
2) Do version management properly: every script submitted to production must be committed to the SVN server.
3) The directory layout on Linux mirrors the directory structure in SVN.
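As a minimal sketch of such a layout (the root path, platform, and module names here are hypothetical examples, not the author's actual tree):

```shell
#!/bin/sh
# Sketch: a standard per-platform/per-module script layout on Linux.
# ETL_ROOT, platform names, and module names are assumed examples.
ETL_ROOT="/tmp/etl_demo"
for platform in pc mobile; do
  for module in user_stats order_stats; do
    mkdir -p "${ETL_ROOT}/${platform}/${module}"
  done
done
# The same tree is what gets committed to SVN, so a script at
#   ${ETL_ROOT}/pc/user_stats/load_orders.sh
# maps one-to-one to .../etl/pc/user_stats/load_orders.sh in the repository.
ls "${ETL_ROOT}/pc"
```

Because the local tree and the SVN tree are identical, deploying to production can be a plain checkout or export of the matching repository path.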
Background 2: Scripts contain many variable parts, such as time ranges, indicators, and parameter values. How can we make these operations flexible instead of hard-coding them?
Solution:
1) Map Chinese or English values to numbers wherever possible; this saves storage and makes programs more flexible.
For example, if the platform covers PCs and mobile phones, use 1 for PC, 2 for mobile phone, and 3 for others, and maintain a code table that documents the mapping.
Homepage        0
Radio station   1   b
Language        2   c
Chinese         3   d
Multi-level directory parsing:
/Homepage/Radio station/Language/Chinese
/0/1/2/3
/1/2/3
/Xxx
If the multi-level directory structure changes, how do we adapt to the change automatically? Reduce each path to its encoded essence and express it with the numeric codes.
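The mapping above can be sketched as follows. Here the code table (Homepage=0, Radio station=1, Language=2, Chinese=3) is hard-coded inline for illustration; in production it would be read from the code table in the database:

```shell
#!/bin/sh
# Sketch: translate a readable multi-level directory into its numeric form
# using the code table. The inline mapping mirrors the table above.
path="/Homepage/Radio station/Language/Chinese"
echo "$path" | awk -F'/' '
BEGIN {
  code["Homepage"]      = 0
  code["Radio station"] = 1
  code["Language"]      = 2
  code["Chinese"]       = 3
}
{
  out = ""
  for (i = 2; i <= NF; i++)     # field 1 is empty (path starts with "/")
    out = out "/" code[$i]
  print out                     # prints /0/1/2/3
}'
```

When a directory is renamed, only the code-table row changes; every script that works with the numeric form keeps working.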
2) Apply parameter lists flexibly: keep the parameters in a code table and generate the HQL statements dynamically from it.
For example, you need to count on how many days each user appears in different time periods. The periods are:
6 am-8 am
8 am-12 pm
12 pm-2 pm
2 pm-6 pm
6 pm-11 pm
11 pm-midnight
midnight-6 am
Build a code table:

id  time_region     val  par1  par2
1   6 am-8 am       1    6     8
2   8 am-12 pm      2    8     12
3   12 pm-2 pm      3    12    14
4   2 pm-6 pm       4    14    18
5   6 pm-11 pm      5    18    23
6   11 pm-midnight  6    23    24
7   midnight-6 am   7    0     6
Read this table and assemble the HQL dynamically; val identifies each time period. However the periods change later, you only need to modify the table, not the script.
For example, instead of hand-writing the following Hive SQL:

```sql
insert overwrite table common.order
select
  userid,
  case
    when hour_time >= 0  and hour_time <= 2  then '00_03'
    when hour_time >= 3  and hour_time <= 5  then '03_06'
    when hour_time >= 6  and hour_time <= 7  then '06_08'
    when hour_time >= 8  and hour_time <= 11 then '08_12'
    when hour_time >= 12 and hour_time <= 13 then '12_14'
    when hour_time >= 14 and hour_time <= 17 then '14_18'
    when hour_time >= 18 and hour_time <= 23 then '18_24'
    else '0'
  end as hour_time,
  count(distinct dt) hour_dt_num
-- the FROM clause is missing in the original post
where dt >= '2014-10-01' and dt <= '2014-10-30'
group by
  userid,
  case
    when hour_time >= 0  and hour_time <= 2  then '00_03'
    when hour_time >= 3  and hour_time <= 5  then '03_06'
    when hour_time >= 6  and hour_time <= 7  then '06_08'
    when hour_time >= 8  and hour_time <= 11 then '08_12'
    when hour_time >= 12 and hour_time <= 13 then '12_14'
    when hour_time >= 14 and hour_time <= 17 then '14_18'
    when hour_time >= 18 and hour_time <= 23 then '18_24'
    else '0'
  end
```
it can be generated by a script like this:

```shell
#!/bin/bash
# add by lishc
mysql=$(which mysql)
user="root"
password="123"
database="test"
table="parm"
command="select par1, par2, id from ${database}.${table}"
${mysql} -u${user} -p${password} -e "${command}" > m.txt

### initialize the generated HQL file
echo "insert overwrite table common.order" >  mysql.sql
echo "select"                              >> mysql.sql
echo "userid"                              >> mysql.sql
echo ",case"                               >> mysql.sql

sed -i -e '1d' m.txt    # drop the column-header line of the mysql output
cat m.txt | while read line
do
    par1=$(echo "${line}" | awk '{print $1}')
    par2=$(echo "${line}" | awk '{print $2}')
    id=$(echo "${line}" | awk '{print $3}')
    echo "par1: ${par1}"
    echo "par2: ${par2}"
    echo "when hour_time>=${par1} and hour_time<=${par2} then '${id}'" >> mysql.sql
done
# (the closing end / count / where / group by tail of the statement
#  would be appended to mysql.sql in the same way)
```
3) Alternatively, store all scripts in the database and let a program parse the parameters, then invoke and execute the scripts.
Refer to Kettle's design: each step is a component, such as input, output, script execution, SQL execution, and execution-order management.
An ETL process or data-analysis model is then an ordered sequence of SQL or script operations.
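The idea above can be sketched with a tiny runner. Here a flat file stands in for the database table of steps; the file name, columns, and step commands are all hypothetical:

```shell
#!/bin/sh
# Sketch of Kettle-style step management: steps are rows (order, type, command)
# in a table, and a runner executes them in order. A flat file stands in for
# the database table here; names and commands are illustrative only.
cat > /tmp/etl_steps.txt <<'EOF'
2 sql   echo "select count(*) from common.order"
1 shell echo "extracting source data"
3 shell echo "loading results"
EOF

# Execute the steps ordered by their first column.
sort -n /tmp/etl_steps.txt | while read ord type cmd; do
  echo "step ${ord} [${type}]: $(eval "${cmd}")"
done
```

Adding, removing, or reordering steps then means editing rows in the step table, not editing the process script itself.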