Data integration: Flume and Sqoop

Source: Internet
Author: User
Tags: sqoop

Flume and Sqoop are both data integration and collection systems for Hadoop, but they are positioned differently. Below is an introduction based on my own experience and understanding, shared with everyone:

Flume was developed by Cloudera, and there are two major product lines: Flume-og and Flume-ng. The Flume-og architecture was too complex and could lose data in operation, so it was abandoned. What we use now is Flume-ng, mainly for log collection. The logs can be TCP log data streamed from a system, or file data (typically from servers, gathered through the interfaces that collection agents attach to, or logs collected through a firewall). Flume stores the data in HDFS, and it can also integrate with Kafka. That is Flume's function.

The Flume architecture is distributed, and the number of Flume nodes can be scaled as needed. Scaling has two meanings here: one is horizontal, adding nodes as the number and types of data sources grow; the other is vertical, adding more aggregation layers so that more of the data processing happens in the pipeline, rather than as a transformation load on the database before loading.
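As a concrete illustration, here is a minimal sketch of a single-agent Flume-ng configuration; the agent name (a1), the tailed log path, and the HDFS URL are placeholder assumptions, not details from this article. It tails a log file with an exec source, buffers events in a memory channel, and writes them to HDFS:

    # Minimal single-agent Flume-ng pipeline (all names/paths are examples)
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: tail a log file
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1

    # Channel: in-memory buffer between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Sink: write events into date-partitioned HDFS directories
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

The agent would then be started with something like flume-ng agent --name a1 --conf-file example.conf. Scaling horizontally means running more such agents for more sources; scaling vertically means pointing their sinks (for example, Avro sinks) at an aggregation-tier agent instead of directly at HDFS.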


Flume offers high reliability and high availability. Reliability shows up in two ways. On one hand, if a piece of data is important, you can point two agents at it and configure them for failover: if one agent fails, the other takes over the transfer. On the other hand, an agent can buffer data durably: received data can be persisted to disk, or even to a database, so even if the agent has a problem, the data still exists.
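A sketch of how those two mechanisms look in a Flume-ng configuration (agent, sink, and directory names are illustrative assumptions): a file channel persists events to disk, and a failover sink group lets a standby sink take over when the primary fails.

    # Durable file channel: events are checkpointed to disk and survive
    # an agent restart (directories are example paths)
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data

    # Failover sink group: k1 is primary (higher priority); if it fails,
    # traffic is transferred to k2
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5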

Flume handles log collection, but much of our data comes from structured databases, and for that we need Sqoop. Sqoop is a bridge between relational databases and HDFS; it can transfer data in either direction. So when do we move data into HDFS? Mainly when new transactions or new accounts need to be loaded. Besides writing to HDFS, Sqoop can also write to Hive, and it can even create the Hive table directly. From the source database it can import the entire database, a single table, or specific columns; this is the ETL pattern commonly seen in data warehousing.
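For example, a basic Sqoop import might look like the following; the connection string, credentials, table, and column names are placeholder assumptions. The first command lands a table in HDFS; the second imports selected columns and creates the Hive table in one step:

    # Import a whole table from a relational database into HDFS
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password-file /user/etl/.pw \
      --table transactions \
      --target-dir /warehouse/transactions

    # Import selected columns straight into a newly created Hive table
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password-file /user/etl/.pw \
      --table accounts \
      --columns "id,name,created_at" \
      --hive-import --create-hive-table \
      --hive-table sales.accounts

(Importing every table of a database at once is also possible, with sqoop import-all-tables.)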


Sqoop allows incremental imports, and there are two kinds. One is direct append: new orders or transactions can simply be appended. The other deals with changing state. For example, a customer may start out on the whitelist while repayments are on time, move to the blacklist after going a month overdue, and return to the whitelist once the overdue amount is repaid; the state keeps changing. Such rows cannot be treated like append-only transactions; what you need instead is a "zipper table" (slowly changing dimension) approach. You need a last-modified date column to tell whether a row has changed. And if it has changed, what happens to the copy loaded earlier? Sqoop can be configured to update those records in Hadoop at load time. We know that an HDFS file cannot be updated in place, so at this point files are merged: the data is cleaned up by merging the old and new records into a fresh file.
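A sketch of both incremental modes using Sqoop's built-in flags (table, column, and value names are illustrative): append mode for insert-only data, and lastmodified mode with a merge key for rows whose state changes, which performs the file merge described above.

    # Append mode: only rows with id greater than the saved
    # last-value are imported
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --table orders \
      --incremental append \
      --check-column id \
      --last-value 1000

    # Lastmodified mode: rows changed since the last run are imported,
    # then merged into the existing files by key (old copies replaced)
    sqoop import \
      --connect jdbc:mysql://dbhost/crm \
      --table customers \
      --incremental lastmodified \
      --check-column last_update \
      --last-value "2016-12-01 00:00:00" \
      --merge-key customer_id \
      --target-dir /warehouse/customers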

And when is data exported? Exporting means moving data out of Hadoop: for example, we may need to build a data mart and run reporting on it, so Sqoop can also export data. Sqoop's export mechanism works like this: the default path goes through JDBC, which is not very efficient; the second way is direct mode, which uses the export tools the database itself provides. If even those tools are not fast enough, the fastest option is a professional custom connector; custom connectors currently exist for MySQL, Postgres, Netezza, Teradata, and Oracle.
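A hedged export sketch (host, table, and directory names are placeholders): the first command exports over the default JDBC path; the second adds --direct to delegate to the database's own bulk tools where a direct connector is available.

    # Default JDBC-based export from HDFS back into a relational table
    sqoop export \
      --connect jdbc:mysql://marthost/datamart \
      --table daily_summary \
      --export-dir /warehouse/daily_summary

    # Direct mode: use the database's native bulk load utilities
    sqoop export \
      --connect jdbc:mysql://marthost/datamart \
      --table daily_summary \
      --export-dir /warehouse/daily_summary \
      --direct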

The above is a summary of what I have learned about Flume and Sqoop from my own study and work experience; many specifics are not covered here, and if you want to know more you can study them yourself. I usually follow the WeChat public accounts "Big data cn" and "Big Data Times Learning Center", which share information and knowledge points that have helped me a lot. I recommend them to everyone, and look forward to making progress together!


This article is from the "11872756" blog; please be sure to keep this source: http://11882756.blog.51cto.com/11872756/1881938
