Etl design issues

Source: Internet
Author: User
Tags lunix

Etl design issues

Http://blog.csdn.net/lili72

Background 1: With the increase of data access and processing, more and more production scripts are generated. Due to the lack of standardized management by the early-stage developers, the scripts are messy.

Solution:

1) Standard Directories on lunix are stored in directories Based on platforms and business modules.

2) Do a Good Job of version management. The scripts submitted to production must be commit to the svn server.

3) The directory on lunix reflects the directory ing to svn.

Background2: There are many areas in the script, such as the range, indicator, and parameter value. How can we make these operations more flexible, instead of writing them to death?

Solution:

1) map Chinese or English to numbers as much as possible, saving storage resources and making programs more flexible.

For example, if the platform has pc computers and mobile phones, you can use 1 to represent pc computers, 2 to represent mobile phones, and 3 to represent others. At the same time, make a code table to explain the corresponding relationship.

Homepage 0

Radio station 1 B

Language 2 c

Chinese 3 d

 

Multi-level directory Parsing

/Homepage/radio station/Language/Chinese

/0/1/2/3

/1/2/3

/Xxx

If the multi-level directory changes, how to automatically adapt to directory changes or return to the binary nature, expressed in binary.

2) flexibly apply the parameter list and create a parameter code table to dynamically generate hql statements.

For example, you need to calculate the number of user appearances in different time periods. The time periods are as follows:

Am-Am

Am-Am

PM-PM

PM-PM

PM-PM

Midnight-

Am-Am

Make a code table id comment time_region val par1 par2

1 am-PM 1 6 8

2 Am-Am 2 8 12

3 pm-PM 3 12 14

4 pm-Pm 4 14 18

5 pm-PM 5 18 23

6 midnight-6 23 24

7 Am-Am 7 0 6

Read the table and dynamically spell hql. The following val represents these time periods respectively. You only need to modify the table no matter how you modify it later.


For example, write the following hive_ SQL

insert overwrite table common.orderselect   userid  ,case         when hour_time>=0  and hour_time<=2  then '00_03'        when hour_time>=3  and hour_time<=5  then '03_06'         when hour_time>=6  and hour_time<=7  then '06_08'        when hour_time>=8  and hour_time<=11  then '08_12'        when hour_time>=12 and hour_time<=13  then '12_14'        when hour_time>=14 and hour_time<=17  then '14_18'        when hour_time>=18 and hour_time<=23  then '18_24'        else '0'      end         as hour_time    ,count(distinct dt) hour_dt_num where dt >='2014-10-01' and dt<='2014-10-30' group by userid, case when hour_time>=0  and hour_time<=2  then '00_03'        when hour_time>=3  and hour_time<=5  then '03_06'         when hour_time>=6  and hour_time<=7  then '06_08'        when hour_time>=8  and hour_time<=11 then '08_12'        when hour_time>=12 and hour_time<=13 then '12_14'        when hour_time>=14 and hour_time<=17 then '14_18'        when hour_time>=18 and hour_time<=23 then '18_24'        else '0'      end 


 

It can be written as follows:

#! /Bin/bash # add by lishc mysql = 'which mysql' user = "root" password = "123" database = "test" table = "parm" command = "select par1, par2, id from test. $ table "$ mysql-u $ {user}-p $ {password}-e" $ {command} "> m.txt ### initialize echo" insert overwrite table common. order "> mysql. sqlecho "select"> mysql. sqlecho "userId"> mysql. sqlecho "case"> mysql. SQL sed-I-e '1d 'm.txt cat m.txt | while read line do par1 =1 (echo "$ {line}" | awk-f''' {print $1} ') par2 = $ (echo "$ {line}" | awk-F ''' {print $2 }') id = $ (echo "$ {line}" | awk-F ''' {print $3} ') echo "par1 :1 {par1}" echo "par2: $ {par2} "echo" when hour_time >=$ {par1} and hour_time <=$ {par2} then '$ {id}' "> mysql. sqlDone



 

3) All scripts are stored in the database, and parameters are parsed and called and executed by the program.

Refer to kettle design:

Each step is componentized: input, output, script execution, SQL Execution, and execution order management.

The ETL process or data analysis model is an ordered SQL or script operation.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.