Research and implementation of distributed ETL based on Hadoop platform
Donghua University Gang
This paper makes the following contributions. First, it designs a distributed ETL framework. Building on the theory of dimensional modeling in data warehousing, and on an analysis of the MapReduce working mechanism and job scheduling on the Hadoop platform, the framework covers parallel processing of dimensions and facts as well as HDFS data block allocation. Second, it studies the parallel processing of facts. Approaching the problem from two angles, surrogate key lookup for fact tables and multi-granularity fact pre-aggregation, the paper proposes a multi-way parallel lookup algorithm and an algorithm for aggregating fact data at different granularities. Experimental results show that, compared with the Hive data warehouse, both algorithms process data warehouse fact data in parallel more efficiently. Third, it studies an HDFS data block assignment algorithm. Based on the minimum-cost maximum-flow theory of network flows, an improved shortest augmenting path method is used to solve for the maximum flow, and an algorithm is presented for assigning HDFS data blocks to the distributed data warehouse, with network distance and node load balancing as the cost. Experimental results show that the proposed algorithm is more effective than existing ones. Finally, the paper gives the implementation process of the distributed ETL system on the Hadoop platform and reports that it outperforms the existing distributed ETL system.
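To make the block assignment idea concrete, the following is a minimal sketch of casting it as a minimum-cost maximum-flow problem. It is not the paper's improved shortest augmenting path method: it uses the textbook successive-shortest-path approach with a queue-based Bellman-Ford search. The names `MinCostMaxFlow`, `assign_blocks`, `distance`, and `max_blocks_per_node` are hypothetical, and modeling load balancing as a fixed per-node capacity is an assumption made only for illustration.

```python
from collections import deque


class MinCostMaxFlow:
    """Textbook successive-shortest-path solver (illustrative, not the paper's variant)."""

    def __init__(self, n):
        self.n = n
        self.graph = [[] for _ in range(n)]  # per-vertex lists of edge indices
        self.to, self.cap, self.cost = [], [], []

    def add_edge(self, u, v, cap, cost):
        # Forward edge gets an even index; its residual twin is index ^ 1.
        self.graph[u].append(len(self.to))
        self.to.append(v); self.cap.append(cap); self.cost.append(cost)
        self.graph[v].append(len(self.to))
        self.to.append(u); self.cap.append(0); self.cost.append(-cost)

    def solve(self, s, t):
        flow = total_cost = 0
        while True:
            # Queue-based Bellman-Ford: cheapest augmenting path in the residual graph.
            dist = [float("inf")] * self.n
            prev_edge = [-1] * self.n
            in_queue = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                in_queue[u] = False
                for e in self.graph[u]:
                    v = self.to[e]
                    if self.cap[e] > 0 and dist[u] + self.cost[e] < dist[v]:
                        dist[v] = dist[u] + self.cost[e]
                        prev_edge[v] = e
                        if not in_queue[v]:
                            in_queue[v] = True
                            queue.append(v)
            if prev_edge[t] == -1:  # sink unreachable: flow is maximal
                return flow, total_cost
            # Find the bottleneck capacity along the path, then push it.
            push, v = float("inf"), t
            while v != s:
                e = prev_edge[v]
                push = min(push, self.cap[e])
                v = self.to[e ^ 1]
            v = t
            while v != s:
                e = prev_edge[v]
                self.cap[e] -= push
                self.cap[e ^ 1] += push
                v = self.to[e ^ 1]
            flow += push
            total_cost += push * dist[t]


def assign_blocks(distance, max_blocks_per_node):
    """distance[b][n]: assumed network distance from block b to node n."""
    num_blocks, num_nodes = len(distance), len(distance[0])
    source, sink = 0, num_blocks + num_nodes + 1
    net = MinCostMaxFlow(sink + 1)
    block_edges = {}
    for b in range(num_blocks):
        net.add_edge(source, 1 + b, 1, 0)  # each block is assigned once
        for n in range(num_nodes):
            block_edges[(b, n)] = len(net.to)
            net.add_edge(1 + b, 1 + num_blocks + n, 1, distance[b][n])
    for n in range(num_nodes):
        # Assumption: load balancing is modeled as a hard cap per node.
        net.add_edge(1 + num_blocks + n, sink, max_blocks_per_node, 0)
    flow, cost = net.solve(source, sink)
    # A saturated block->node edge means the block was assigned to that node.
    assignment = {b: n for (b, n), e in block_edges.items() if net.cap[e] == 0}
    return assignment, flow, cost


if __name__ == "__main__":
    # Four blocks, two nodes, at most two blocks per node.
    print(assign_blocks([[0, 2], [0, 2], [2, 0], [0, 1]], 2))
```

In this toy input three blocks are closest to node 0, but the capacity of two forces one of them onto node 1; the solver moves the block whose extra distance is smallest. A cost that blends distance with current node load, as the abstract describes, could replace the plain `distance[b][n]` edge cost.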