Use Flume to extract MySQL table data to HDFS in real time
Transferred from: http://blog.csdn.net/wzy0623/article/details/73650053
I. Why use Flume
In the past, when building the HAWQ data warehouse experimental environment, I used Sqoop to incrementally extract data from MySQL to HDFS and then accessed it through HAWQ external tables. That approach needs only a little configuration to get the extraction job done, but its drawback is just as obvious: timeliness. Sqoop reads and writes data with MapReduce, and MapReduce is designed for batch scenarios, aiming at high throughput rather than low latency. In that experiment the data was extracted on a fixed schedule, once a day.
Flume is a system for high-volume log collection, aggregation, and transmission. It supports customizing all kinds of data senders in a logging system to gather data, and it can do simple processing on the data and write it to a variety of data receivers. Flume processes data in a streaming fashion and can run continuously as an agent: as soon as new data arrives, Flume picks it up and delivers it to the target, which largely solves the timeliness problem.
Flume started out as just a log collector, but with the advent of the flume-ng-sql-source plugin it became possible for Flume to collect data from relational databases as well. The rest of this post gives a brief introduction to Flume and then describes in detail how to configure Flume to extract MySQL table data to HDFS in real time.
II. Introduction to Flume
1. The concept of Flume
Flume is a distributed log collection system that gathers data from various servers and delivers it to a designated destination, such as HDFS. Simply put, Flume collects logs; its architecture is shown in Figure 1.
Figure 1
2. The concept of the event
Here it is necessary to introduce the concept of the event in Flume. The core of Flume is to collect data from a data source (source) and deliver it to a specified destination (sink). To guarantee that delivery succeeds, the data is cached in a channel before being sent to the sink, and only after the data has actually reached the sink does Flume delete the cached copy.
Throughout the transmission, what flows is the event; in other words, transactions are guaranteed at the event level. So what is an event? An event encapsulates the transmitted data and is the basic unit in which Flume moves data; for a text file, it is usually one line of a record. An event is also the basic unit of a transaction. From source to channel to sink, an event is essentially a byte array that can also carry header information. An event represents the smallest complete unit of data on its way from an external data source to an external destination.
3. Flume architecture
What makes Flume so versatile comes from one of its own design choices: the agent. An agent is a Java process that runs on the log collection node, that is, the server node. An agent contains three core components: source, channel, and sink, which work much like a producer, a warehouse, and a consumer.
- Source: the source component collects data and can handle log data in various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
- Channel: after the source collects data, it is temporarily stored in the channel; that is, the channel component holds temporary data inside the agent and acts as a simple cache for the collected data. It can be backed by memory, JDBC, file, and so on.
- Sink: the sink component sends the data on to its destination, which can be HDFS, logger, avro, thrift, IPC, file, null, HBase, Solr, or a custom sink.
4. Flume operating mechanism
The core of Flume is the agent. An agent interacts with the outside world in two places: the source, which accepts incoming data, and the sink, which is responsible for sending data out to the specified external destination. After the source receives data, it passes it to the channel; the channel acts as a buffer that holds the data temporarily, and the sink then delivers the data from the channel to the designated place, such as HDFS. Note that the channel deletes the temporary data only after the sink has successfully delivered it, which guarantees the reliability and safety of the transmission.
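To make this wiring concrete, the sketch below defines a minimal, throwaway agent and runs it in the foreground. The agent name a1, the netcat source, the memory channel, the logger sink, and the file paths are illustrative assumptions only, not part of the MySQL setup described later.

# Minimal agent: netcat source -> memory channel -> logger sink (illustration only).
cat > /tmp/netcat-example.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read lines from a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory

# Sink: print events to the agent's log
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
EOF

# Run the agent in the foreground; the --conf directory is environment-dependent.
flume-ng agent --conf /etc/flume/conf --conf-file /tmp/netcat-example.conf \
  --name a1 -Dflume.root.logger=INFO,console

Every line typed into "nc localhost 44444" then shows up as one event in the agent's console output.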
III. Installing Hadoop and Flume
My experiment was performed on HDP 2.5.0. Flume is included in the HDP installation; you only need to configure the Flume service. For the HDP installation steps, see "HAWQ Technical Analysis (II): Installation and Deployment".
IV. Configuration and testing
1. Build the MySQL database table
Create a test table and add some data.
use test;
create table wlslog
(id int not null,
 time_stamp varchar(255),
 category varchar(255),
 type varchar(255),
 servername varchar(255),
 code varchar(255),
 msg varchar(255),
 primary key (id)
);
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (1,'apr-8-2014-7:06:16-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to standby');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (2,'apr-8-2014-7:06:17-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to starting');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (3,'apr-8-2014-7:06:18-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to admin');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (4,'apr-8-2014-7:06:19-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to resuming');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (5,'apr-8-2014-7:06:20-pm-pdt','notice','weblogicserver','adminserver','bea-000361','started weblogic adminserver');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (6,'apr-8-2014-7:06:21-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to running');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (7,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
commit;
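As an optional sanity check (not part of the original steps), you can confirm the row count from the command line; this assumes the mysql client is available and reuses the host and credentials that appear in flume.conf below.

# Should return 7 after the inserts above (adjust host/user/password as needed).
mysql -h 172.16.1.127 -u root -p123456 -e "select count(*) from test.wlslog;"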
2. Establish the relevant directories and files
(1) Create a local status file
mkdir -p /var/lib/flume
cd /var/lib/flume
touch sql-source.status
chmod -R 777 /var/lib/flume
(2) Create the HDFS target directory
hdfs dfs -mkdir -p /flume/mysql
hdfs dfs -chmod -R 777 /flume/mysql
3. Prepare the jar packages
Download the flume-ng-sql-source-1.3.7.jar file from http://book2s.com/java/jar/f/flume-ng-sql-source/download-flume-ng-sql-source-1.3.7.html and copy it to the Flume library directory.
cp flume-ng-sql-source-1.3.7.jar /usr/hdp/current/flume-server/lib/
Copy the MySQL JDBC driver jar package to the Flume Library directory as well.
cp mysql-connector-java-5.1.17.jar /usr/hdp/current/flume-server/lib/mysql-connector-java.jar
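A quick way to confirm that both jars are in place before restarting the agent (an optional check, not in the original procedure):

# Both the SQL source plugin and the JDBC driver should be listed.
ls /usr/hdp/current/flume-server/lib/ | grep -Ei 'flume-ng-sql-source|mysql-connector'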
4. Create HAWQ External Table
create external table ext_wlslog
(id int,
 time_stamp varchar(255),
 category varchar(255),
 type varchar(255),
 servername varchar(255),
 code varchar(255),
 msg varchar(255)
)
location ('pxf://mycluster/flume/mysql?profile=hdfstextmulti') format 'csv' (quote=e'"');
5. Configure Flume
Configure the following properties in flume.conf (Ambari -> Flume -> Configs -> flume.conf):
agent.channels.ch1.type = memory
agent.sources.sql-source.channels = ch1
agent.channels = ch1
agent.sinks = HDFS
agent.sources = sql-source
agent.sources.sql-source.type = org.keedio.flume.source.SQLSource
agent.sources.sql-source.connection.url = jdbc:mysql://172.16.1.127:3306/test
agent.sources.sql-source.user = root
agent.sources.sql-source.password = 123456
agent.sources.sql-source.table = wlslog
agent.sources.sql-source.columns.to.select = *
agent.sources.sql-source.incremental.column.name = id
agent.sources.sql-source.incremental.value = 0
agent.sources.sql-source.run.query.delay = 5000
agent.sources.sql-source.status.file.path = /var/lib/flume
agent.sources.sql-source.status.file.name = sql-source.status
agent.sinks.HDFS.channel = ch1
agent.sinks.HDFS.type = hdfs
agent.sinks.HDFS.hdfs.path = hdfs://mycluster/flume/mysql
agent.sinks.HDFS.hdfs.fileType = DataStream
agent.sinks.HDFS.hdfs.writeFormat = Text
agent.sinks.HDFS.hdfs.rollSize = 268435456
agent.sinks.HDFS.hdfs.rollInterval = 0
agent.sinks.HDFS.hdfs.rollCount = 0
Flume specifies the source, channel, and sink-related configurations in the flume.conf file, and the properties are described in Table 1.
| Property | Description |
| --- | --- |
| agent.channels.ch1.type | Channel type of the agent |
| agent.sources.sql-source.channels | Channel name corresponding to the source |
| agent.channels | Channel name |
| agent.sinks | Sink name |
| agent.sources | Source name |
| agent.sources.sql-source.type | Source type |
| agent.sources.sql-source.connection.url | Database URL |
| agent.sources.sql-source.user | Database user name |
| agent.sources.sql-source.password | Database password |
| agent.sources.sql-source.table | Database table name |
| agent.sources.sql-source.columns.to.select | Columns to query |
| agent.sources.sql-source.incremental.column.name | Incremental column name |
| agent.sources.sql-source.incremental.value | Initial incremental value |
| agent.sources.sql-source.run.query.delay | Interval between queries, in milliseconds |
| agent.sources.sql-source.status.file.path | Status file path |
| agent.sources.sql-source.status.file.name | Status file name |
| agent.sinks.HDFS.channel | Channel name corresponding to the sink |
| agent.sinks.HDFS.type | Sink type |
| agent.sinks.HDFS.hdfs.path | Sink path |
| agent.sinks.HDFS.hdfs.fileType | File type of the streamed data |
| agent.sinks.HDFS.hdfs.writeFormat | Data write format |
| agent.sinks.HDFS.hdfs.rollSize | Target file roll size, in bytes |
| agent.sinks.HDFS.hdfs.rollInterval | Interval, in seconds, after which the temporary file is rolled into the final target file; 0 means do not roll by time |
| agent.sinks.HDFS.hdfs.rollCount | Number of events after which the temporary file is rolled into the target file; 0 means do not roll by event count |

Table 1
6. Run the Flume agent
Save the settings from the previous step and restart the Flume service, as shown in Figure 2.
Figure 2
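While the service restarts, it can help to watch the agent log and confirm that the SQL source starts polling; the log location below is an assumption based on a typical HDP layout and may differ on your installation.

# Follow the Flume agent log (path and file name vary by installation).
tail -f /var/log/flume/*.log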
After the restart, the status file records the latest id value, 7, as shown in Figure 3.
Figure 3
Looking at the target path, a temporary file containing the 7 records has been generated, as shown in Figure 4.
Figure 4
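The same checks can be done from the command line, using the paths configured above; the in-progress file carries the HDFS sink's default .tmp suffix.

# Last id recorded by the SQL source
cat /var/lib/flume/sql-source.status
# Files written so far under the HDFS target path
hdfs dfs -ls /flume/mysql
hdfs dfs -cat /flume/mysql/*.tmp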
Querying the HAWQ external table returns all 7 rows, as shown in Figure 5.
Figure 5
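For reference, the query behind Figure 5 looks like this; it assumes a psql session connected to whichever HAWQ database holds ext_wlslog.

# Run from a HAWQ psql session (database name depends on your environment).
psql -c "select * from ext_wlslog order by id;"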
At this point, the initial data load is complete.
7. Test quasi-real-time incremental extraction
Insert three new records, with ids 8, 9, and 10, into the source table.
use test;
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (8,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (9,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (10,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
commit;
Query the HAWQ external table again about 5 seconds later. As Figure 6 shows, all 10 rows are now returned; the quasi-real-time incremental extraction works.
Figure 6
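A final spot check, under the same assumptions as before: the status file should now have advanced to id 10, and the external table should count 10 rows.

# Last id recorded after the incremental poll
cat /var/lib/flume/sql-source.status
# Row count seen through the HAWQ external table
psql -c "select count(*) from ext_wlslog;"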
V. Advantages and disadvantages of the scheme
The greatest advantage of using Flume to collect relational database table data is that the configuration is simple and requires no programming. Compared with the complexity of Tungsten Replicator, setting the source, channel, and sink properties in flume.conf presents no difficulty at all. Compared with the currently very popular Canal, it is less flexible, but at least not a single line of code has to be written. Moreover, the scheme works by ordinary SQL polling, so it is generic and applies to any relational database source.
The shortcomings of this scheme are as prominent as its merits, mainly in the following aspects.
- Queries are executed directly on the source database, which is intrusive.
- Increments are captured by polling, so the result is only quasi-real-time, and the shorter the polling interval, the greater the load on the source database.
- Only newly inserted data is detected; deletes and updates are not.
- The source table must have a column that can serve as the incremental key.
Despite these limitations, extracting relational database data with Flume is a worthwhile solution, especially in scenarios that call for rapid deployment and minimal programming while still meeting the requirements; it is an effective complement to the traditional Sqoop approach.
References:
- Flume Architecture and Application Introduction
- Streaming MySQL Database Table Data to HDFS with Flume
- How to read data from Oracle using Flume to Kafka broker
- https://github.com/keedio/flume-ng-sql-source