Research and implementation of distributed ETL based on Hadoop platform

Keywords: Hadoop, distributed ETL, research and implementation


Donghua University Gang

This paper makes the following contributions. First, it designs a distributed ETL framework. Building on dimensional modeling theory for data warehouses and on an analysis of the MapReduce execution mechanism and job scheduling on the Hadoop platform, the framework covers parallel processing of dimensions and facts as well as HDFS data block allocation. Second, it studies the parallel processing of facts. Approaching the problem from two angles, surrogate key lookup for fact tables and multi-granularity fact pre-aggregation, it proposes a multi-way parallel lookup algorithm and an algorithm that aggregates fact data at different granularities. According to the reported experiments, both algorithms process data warehouse fact data in parallel more efficiently than the Hive data warehouse. Third, it studies HDFS data block assignment. Based on the minimum-cost maximum-flow theory of network flows, it uses an improved shortest augmenting path method to compute the maximum flow and presents an algorithm that assigns HDFS data blocks to the distributed data warehouse, taking network distance and node load balancing as the costs. The reported experiments indicate that this algorithm is more effective than existing ones. Finally, the paper describes the implementation of the distributed ETL system on the Hadoop platform, which the author reports to be superior to existing distributed ETL systems.
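The block assignment idea can be illustrated with a small min-cost max-flow model. The following is only a sketch under stated assumptions, not the paper's algorithm: it uses the textbook successive shortest path method rather than the improved shortest augmenting path method the abstract mentions, and the cost model (network distance on block-to-node edges, a linearly growing penalty per extra block on a node) is an assumption made for illustration. The names `MinCostFlow`, `assign_blocks`, `slots_per_node`, and `load_penalty` are hypothetical.

```python
from collections import deque


class MinCostFlow:
    """Successive shortest paths with a queue-based Bellman-Ford search."""

    def __init__(self, n):
        self.n = n
        self.graph = [[] for _ in range(n)]  # per-vertex lists of edge indices
        self.to, self.cap, self.cost = [], [], []

    def add_edge(self, u, v, cap, cost):
        # Forward edge gets an even index; its residual twin is index ^ 1.
        self.graph[u].append(len(self.to))
        self.to.append(v); self.cap.append(cap); self.cost.append(cost)
        self.graph[v].append(len(self.to))
        self.to.append(u); self.cap.append(0); self.cost.append(-cost)

    def solve(self, s, t):
        flow = total = 0
        while True:
            dist = [float("inf")] * self.n
            prev_edge = [-1] * self.n
            in_queue = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:  # cheapest path in the residual graph
                u = queue.popleft()
                in_queue[u] = False
                for e in self.graph[u]:
                    v = self.to[e]
                    if self.cap[e] > 0 and dist[u] + self.cost[e] < dist[v]:
                        dist[v] = dist[u] + self.cost[e]
                        prev_edge[v] = e
                        if not in_queue[v]:
                            in_queue[v] = True
                            queue.append(v)
            if prev_edge[t] == -1:  # sink unreachable: flow is maximal
                return flow, total
            push, v = float("inf"), t
            while v != s:  # bottleneck capacity along the path
                e = prev_edge[v]
                push = min(push, self.cap[e])
                v = self.to[e ^ 1]
            v = t
            while v != s:  # augment along the path
                e = prev_edge[v]
                self.cap[e] -= push
                self.cap[e ^ 1] += push
                v = self.to[e ^ 1]
            flow += push
            total += push * dist[t]


def assign_blocks(distance, slots_per_node, load_penalty=1):
    """distance[block][node] is the assumed network distance of reading block on node."""
    blocks = sorted(distance)
    nodes = sorted({n for d in distance.values() for n in d})
    source, sink = 0, 1
    b_id = {b: 2 + i for i, b in enumerate(blocks)}
    n_id = {n: 2 + len(blocks) + i for i, n in enumerate(nodes)}
    mcf = MinCostFlow(2 + len(blocks) + len(nodes))
    choice = {}
    for b in blocks:
        mcf.add_edge(source, b_id[b], 1, 0)
        for n, d in distance[b].items():
            choice[(b, n)] = len(mcf.to)  # index of the forward edge added next
            mcf.add_edge(b_id[b], n_id[n], 1, d)
    for n in nodes:
        # The k-th block placed on a node costs k * load_penalty (load balancing).
        for k in range(slots_per_node):
            mcf.add_edge(n_id[n], sink, 1, k * load_penalty)
    flow, cost = mcf.solve(source, sink)
    assignment = {b: n for (b, n), e in choice.items() if mcf.cap[e] == 0}
    return assignment, flow, cost


# Made-up example: three blocks are local to n1, one is local to n2.
distance = {
    "b1": {"n1": 0, "n2": 2},
    "b2": {"n1": 0, "n2": 2},
    "b3": {"n1": 0, "n2": 2},
    "b4": {"n1": 2, "n2": 0},
}
assignment, flow, cost = assign_blocks(distance, slots_per_node=3, load_penalty=3)
print(assignment, flow, cost)
```

With a sufficiently large `load_penalty`, the solver moves one of the blocks local to `n1` over to `n2`, trading a longer network distance for a more even load. Setting `load_penalty=0` reduces the model to a pure locality-based assignment.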

