Where Hadoop data is prone to errors

Source: Internet
Author: User

Recently I reviewed several data analysis projects and summarized them.

What follows is the flow of data through the system and the points where errors easily occur.

1. Data enters the Hadoop warehouse
There are four sources. They supply the most basic data (the ODS, short for original data source); all later data is derived from combinations of them.
A. Log files
B. HTTP interface
C. DB query
D. Table creation pointing
In every case the data ends up stored in Hadoop as a Hadoop file.

Log File:

    • A new machine is brought online without the data analysis group being notified, so its logs are never collected.
    • Logs are fetched according to the agreed convention, but the provider does not follow it. For example, the convention is to fetch GZ-compressed logs, but the file that arrives is not compressed.
    • The data provider's rsync of the logs has problems.
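
The GZ convention mentioned above can be checked before the logs are loaded. The following is only a sketch: the function names `is_gzip` and `check_logs` are hypothetical and not part of the original pipeline. It looks at the first two bytes of each file, which for gzip data are always `1f 8b`.

```shell
# is_gzip FILE: succeeds only if FILE starts with the gzip magic bytes 1f 8b.
# (Hypothetical helper name, used here for illustration.)
is_gzip() {
    local magic
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# check_logs FILE...: report every file that breaks the "logs are GZ" convention.
# Returns non-zero if at least one file is not gzip-compressed.
check_logs() {
    local bad f
    bad=0
    for f in "$@"; do
        if ! is_gzip "$f"; then
            echo "not gzip-compressed: $f" >&2
            bad=$((bad + 1))
        fi
    done
    [ "$bad" -eq 0 ]
}
```

A magic-byte check only catches files that are not gzip at all. Running `gzip -t` on each file is slower but also detects truncated or corrupted archives.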

HTTP interface:

    • The interface is unstable and often returns HTTP 500.
    • The data returned by the interface is itself incorrect.
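
One common way to cope with an interface that intermittently returns 500 is to retry a limited number of times before giving up. This is a sketch under assumptions, not the method used in the original projects: `http_get` and `fetch_with_retry` are hypothetical names, and the retry count and delay are arbitrary defaults.

```shell
# http_get URL OUTFILE: save the response body to OUTFILE and print the
# HTTP status code. Thin wrapper around curl (hypothetical helper).
http_get() {
    curl -s -o "$2" -w '%{http_code}' "$1"
}

# fetch_with_retry URL OUTFILE [MAX_ATTEMPTS] [DELAY_SECONDS]
# Succeeds as soon as one attempt returns HTTP 200; fails after MAX_ATTEMPTS.
fetch_with_retry() {
    local url out max delay attempt status
    url=$1
    out=$2
    max=${3:-3}
    delay=${4:-5}
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # If curl itself fails (for example, connection refused), record 000.
        status=$(http_get "$url" "$out") || status=000
        if [ "$status" = "200" ]; then
            return 0
        fi
        echo "attempt $attempt/$max: HTTP $status from $url" >&2
        if [ "$attempt" -lt "$max" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

Retrying only helps with the first problem in the list. A 200 response carrying wrong data still needs a separate content check.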

DB:

    • The data analysis group is not notified of changes in the data structure in a timely manner.

Table creation pointing:

    • The data is not provided by the agreed time.

2. Hadoop files
Reference: hadoop.apache.org

3. Hive
Reference: hive.apache.org
Hive is a Hadoop-based data warehouse tool. It maps structured data files onto database tables and provides SQL-like querying, converting the SQL statements into MapReduce jobs for execution. Its advantage is the low learning cost: simple MapReduce statistics can be written quickly in SQL-like statements without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.
Create a Hive table and load the data into it:

    drop table if exists rpt_crm_cube_kpi_reserve_room_gb_seq;
    create external table rpt_crm_cube_kpi_reserve_room_gb_seq (
        report_date string,
        area_name string,
        manager_name string,
        manager_user_id string,
        assistant_name string,
        hotel_seq string,
        hotel_name string,
        hotel_grade string,
        tree_code string,
        city_name string,
        confirmed bigint,
        reserve_room bigint,
        instant_confirmed bigint
    )
    partitioned by (dt string)
    ROW FORMAT DELIMITED                                            -- *
      FIELDS TERMINATED BY '\001'                                   -- *
      COLLECTION ITEMS TERMINATED BY '\002'                         -- *
      MAP KEYS TERMINATED BY '\003'                                 -- *
      LINES TERMINATED BY '\n'                                      -- *
    STORED AS INPUTFORMAT                                           -- *
      'com.hadoop.mapred.DeprecatedLzoTextInputFormat'              -- *
    OUTPUTFORMAT                                                    -- *
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'  -- *
    location '/user/qhstats/rpt/rpt_crm_cube_kpi_reserve_room_gb_seq';

The lines marked with * follow an agreed convention and must not be wrong; otherwise loading the data fails.

    insert overwrite table rpt_crm_cube_kpi_gb_sales partition (dt = '$DATE', kpi = 'all_lose')
    select
        3 as target_id,
        peer.report_date,
        peer.area_name,
        peer.tree_code,
        peer.manager_name,
        peer.manager_user_id,
        peer.object,
        peer.completed,
        rank() over (
            partition by peer.tree_code
            order by if(peer.object = 0, -1, 1 - peer.completed * 1.0 / peer.object) desc
        ) as peer_rank,
        count(1) over (partition by peer.tree_code) as peer_count,
        parent.peer_rank  as parent_rank,
        parent.peer_count as parent_count
    from (
        select
            report_date,
            area_name,
            manager_name,
            manager_user_id,
            tree_code,
            sum(1) as object,
            sum(if(is_lose = 1, 0, 1)) as completed
        from
            rpt_crm_cube_kpi_lose_gb_seq
        where
            dt = '$DATE' and type = 'ALL'
        group by report_date, area_name, manager_name, manager_user_id, tree_code
    ) peer
    inner join (
        select
            *
        from
            rpt_crm_cube_kpi_gb_tree_code
        where
            dt = '$DATE' and kpi = 'all_lose'
    ) parent
    on peer.tree_code = parent.tree_code;
    EOF
    }

Errors:

    • The data type of every column must be declared correctly; otherwise values are lost when the strings in the file are converted to the Hive column type. For example, if the file contains 'xiaoqiang' and the column type is bigint, the value that ends up in the table is NULL.
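
Because Hive turns an unparseable value into NULL rather than failing the load, it can help to scan the source file before loading it. The sketch below counts the rows whose value in a given column is not a plain integer. It assumes the `\001` field separator from the table definition; the function name `count_non_bigint` is hypothetical, and the integer pattern only approximates what Hive accepts as a bigint (it does not check the 64-bit range).

```shell
# count_non_bigint FILE COLUMN: print how many rows have a value in COLUMN
# (1-based) that is not an optionally signed run of digits.
# Fields are separated by the \001 character.
count_non_bigint() {
    awk -F '\001' -v col="$2" '
        $col !~ /^-?[0-9]+$/ { bad++ }
        END { print bad + 0 }
    ' "$1"
}
```

For example, `count_non_bigint part-00000 11` would check the eleventh column of a hypothetical file `part-00000`. Empty fields and the `\N` null marker are counted too, so a non-zero result is a prompt to look at the data, not proof of an error.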

4. Import Hive tables into the DB

Data in Hive can be exported and loaded into the DB:

    function export_to_crm_cube {
        $HIVE -e "select * from rpt_crm_cube_kpi_gb_sales where dt = '$DATE' and kpi = 'all_lose'" > $TMP_FILE
        $crm_cube_DEV_STR << EOF
        delete from crm_cube_kpi_gb_sales where report_date = '$FORMAT_DATE' and target_id = 3;
        load data local infile '$TMP_FILE'
        into table crm_cube_kpi_gb_sales (
            target_id,
            report_date,
            area,
            tree_code,
            manager_name,
            manager_user_id,
            object_cnt,
            completed_cnt,
            peer_rank,
            peer_cnt,
            parent_rank,
            parent_cnt
        );
    EOF
    }

Errors:

    • Data type conversion: by the data analysis group's count, more than 50% of the errors at this step are data type conversion errors.
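
One concrete conversion problem at this step concerns NULL. The Hive command line prints a NULL value as the literal text `NULL`, whereas MySQL's `load data infile` (assuming the DB is MySQL, as the load statement suggests) reads `\N` as NULL and treats an unquoted `NULL` as an ordinary string. A filter between the two steps can translate one into the other. The function name `hive_nulls_to_mysql` is hypothetical, and the sketch assumes tab-separated output.

```shell
# hive_nulls_to_mysql: read tab-separated rows on stdin and rewrite every
# field that is exactly NULL as \N, which LOAD DATA reads as SQL NULL.
hive_nulls_to_mysql() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             for (i = 1; i <= NF; i++)
                 if ($i == "NULL") $i = "\\N"
             print
         }'
}
```

It would sit in the export pipeline as `$HIVE -e "..." | hive_nulls_to_mysql > $TMP_FILE`. A string column that legitimately contains the text NULL would also be rewritten, so the filter is only safe when that cannot happen.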

5. DB to app

By the time data moves from the DB to the app, it is already fixed in the DB; what remains is presenting it to users. At this point the accuracy of the data must be ensured.

Data accuracy depends on the following:

The numbers themselves are correct.
The data logic is correct.
The data logic has been translated into code correctly.
The front end presents the data to users correctly.

 

Summary: follow the data through the whole process from Hadoop to the user, know where errors occur and that they can occur, and then you will know how to handle them.
