Use Flume to extract MySQL table data to HDFS in real time
Transferred from: http://blog.csdn.net/wzy0623/article/details/73650053
I. Why use Flume
In the past, when building the HAWQ data warehouse experimental environment, I used Sqoop to incrementally extract data from MySQL to HDFS and then accessed it through HAWQ external tables. That approach needs only a little configuration to get the extraction job done, but its drawback is just as obvious: timeliness. Sqoop reads and writes data with MapReduce, and MapReduce is designed for batch scenarios, aiming at high throughput rather than low latency. In that experiment the data was extracted on a fixed schedule, once a day.
Flume is a system for high-volume log collection, aggregation, and transmission. It supports customizing all kinds of data senders in a logging system to gather data, and it can do simple processing on the data and write it to a variety of data receivers. Flume processes data in a streaming fashion and can run continuously as an agent: as soon as new data arrives, Flume picks it up and delivers it to the target, which largely solves the timeliness problem.
Flume started out as just a log collector, but with the advent of the flume-ng-sql-source plugin it became possible for Flume to collect data from relational databases as well. The rest of this post gives a brief introduction to Flume and then describes in detail how to configure Flume to extract MySQL table data to HDFS in real time.
II. Introduction to Flume
1. The concept of Flume
Flume is a distributed log collection system that gathers data from various servers and delivers it to a designated destination, such as HDFS. Simply put, Flume collects logs; its architecture is shown in Figure 1.
Figure 1
2. The concept of the event
Here it is necessary to introduce the concept of the event in Flume. The core of Flume is to collect data from a data source (source) and deliver it to a specified destination (sink). To guarantee that delivery succeeds, the data is cached in a channel before being sent to the sink, and only after the data has actually reached the sink does Flume delete the cached copy.
Throughout the transmission, what flows is the event; in other words, transactions are guaranteed at the event level. So what is an event? An event encapsulates the transmitted data and is the basic unit in which Flume moves data; for a text file, it is usually one line of a record. An event is also the basic unit of a transaction. From source to channel to sink, an event is essentially a byte array that can also carry header information. An event represents the smallest complete unit of data on its way from an external data source to an external destination.
3. Flume architecture
What makes Flume so versatile comes from one of its own design choices: the agent. An agent is a Java process that runs on the log collection node, that is, the server node. An agent contains three core components: source, channel, and sink, which work much like a producer, a warehouse, and a consumer.
- Source: the source component collects data and can handle log data in various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
- Channel: after the source collects data, it is temporarily stored in the channel; that is, the channel component holds temporary data inside the agent and acts as a simple cache for the collected data. It can be backed by memory, JDBC, file, and so on.
- Sink: the sink component sends the data on to its destination, which can be HDFS, logger, avro, thrift, IPC, file, null, HBase, Solr, or a custom sink.
4. Flume operating mechanism
The core of Flume is the agent. An agent interacts with the outside world in two places: the source, which accepts incoming data, and the sink, which is responsible for sending data out to the specified external destination. After the source receives data, it passes it to the channel; the channel acts as a buffer that holds the data temporarily, and the sink then delivers the data from the channel to the designated place, such as HDFS. Note that the channel deletes the temporary data only after the sink has successfully delivered it, which guarantees the reliability and safety of the transmission.
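To make this wiring concrete, the sketch below defines a minimal, throwaway agent and runs it in the foreground. The agent name a1, the netcat source, the memory channel, the logger sink, and the file paths are illustrative assumptions only, not part of the MySQL setup described later.

# Minimal agent: netcat source -> memory channel -> logger sink (illustration only).
cat > /tmp/netcat-example.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read lines from a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory

# Sink: print events to the agent's log
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
EOF

# Run the agent in the foreground; the --conf directory is environment-dependent.
flume-ng agent --conf /etc/flume/conf --conf-file /tmp/netcat-example.conf \
  --name a1 -Dflume.root.logger=INFO,console

Every line typed into "nc localhost 44444" then shows up as one event in the agent's console output.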
III. Installing Hadoop and Flume
My experiment was performed on HDP 2.5.0. Flume is included in the HDP installation; you only need to configure the Flume service. For the HDP installation steps, see "HAWQ Technical Analysis (II): Installation and Deployment".
IV. Configuration and testing
1. Build the MySQL database table
Create a test table and add some data.
use test;
create table wlslog
(id int not null,
 time_stamp varchar(255),
 category varchar(255),
 type varchar(255),
 servername varchar(255),
 code varchar(255),
 msg varchar(255),
 primary key (id)
);
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (1,'apr-8-2014-7:06:16-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to standby');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (2,'apr-8-2014-7:06:17-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to starting');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (3,'apr-8-2014-7:06:18-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to admin');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (4,'apr-8-2014-7:06:19-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to resuming');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (5,'apr-8-2014-7:06:20-pm-pdt','notice','weblogicserver','adminserver','bea-000361','started weblogic adminserver');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (6,'apr-8-2014-7:06:21-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to running');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (7,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
commit;
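As an optional sanity check (not part of the original steps), you can confirm the row count from the command line; this assumes the mysql client is available and reuses the host and credentials that appear in flume.conf below.

# Should return 7 after the inserts above (adjust host/user/password as needed).
mysql -h 172.16.1.127 -u root -p123456 -e "select count(*) from test.wlslog;"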
2. Establish the relevant directories and files
(1) Create a local status file
mkdir -p /var/lib/flume
cd /var/lib/flume
touch sql-source.status
chmod -R 777 /var/lib/flume
(2) Create the HDFS target directory
hdfs dfs -mkdir -p /flume/mysql
hdfs dfs -chmod -R 777 /flume/mysql
3. Prepare the jar packages
Download the flume-ng-sql-source-1.3.7.jar file from http://book2s.com/java/jar/f/flume-ng-sql-source/download-flume-ng-sql-source-1.3.7.html and copy it to the Flume library directory.
cp flume-ng-sql-source-1.3.7.jar /usr/hdp/current/flume-server/lib/
Copy the MySQL JDBC driver jar package to the Flume Library directory as well.
cp mysql-connector-java-5.1.17.jar /usr/hdp/current/flume-server/lib/mysql-connector-java.jar
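A quick way to confirm that both jars are in place before restarting the agent (an optional check, not in the original procedure):

# Both the SQL source plugin and the JDBC driver should be listed.
ls /usr/hdp/current/flume-server/lib/ | grep -Ei 'flume-ng-sql-source|mysql-connector'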
4. Create HAWQ External Table
create external table ext_wlslog
(id int,
 time_stamp varchar(255),
 category varchar(255),
 type varchar(255),
 servername varchar(255),
 code varchar(255),
 msg varchar(255)
)
location ('pxf://mycluster/flume/mysql?profile=hdfstextmulti') format 'csv' (quote=e'"');
5. Configure Flume
Configure the following properties in flume.conf (Ambari -> Flume -> Configs -> flume.conf):
agent.channels.ch1.type = memory
agent.sources.sql-source.channels = ch1
agent.channels = ch1
agent.sinks = HDFS
agent.sources = sql-source
agent.sources.sql-source.type = org.keedio.flume.source.SQLSource
agent.sources.sql-source.connection.url = jdbc:mysql://172.16.1.127:3306/test
agent.sources.sql-source.user = root
agent.sources.sql-source.password = 123456
agent.sources.sql-source.table = wlslog
agent.sources.sql-source.columns.to.select = *
agent.sources.sql-source.incremental.column.name = id
agent.sources.sql-source.incremental.value = 0
agent.sources.sql-source.run.query.delay = 5000
agent.sources.sql-source.status.file.path = /var/lib/flume
agent.sources.sql-source.status.file.name = sql-source.status
agent.sinks.HDFS.channel = ch1
agent.sinks.HDFS.type = hdfs
agent.sinks.HDFS.hdfs.path = hdfs://mycluster/flume/mysql
agent.sinks.HDFS.hdfs.fileType = DataStream
agent.sinks.HDFS.hdfs.writeFormat = Text
agent.sinks.HDFS.hdfs.rollSize = 268435456
agent.sinks.HDFS.hdfs.rollInterval = 0
agent.sinks.HDFS.hdfs.rollCount = 0
Flume specifies the source, channel, and sink-related configurations in the flume.conf file, and the properties are described in Table 1.
| Property | Description |
| --- | --- |
| agent.channels.ch1.type | Channel type of the agent |
| agent.sources.sql-source.channels | Channel name corresponding to the source |
| agent.channels | Channel name |
| agent.sinks | Sink name |
| agent.sources | Source name |
| agent.sources.sql-source.type | Source type |
| agent.sources.sql-source.connection.url | Database URL |
| agent.sources.sql-source.user | Database user name |
| agent.sources.sql-source.password | Database password |
| agent.sources.sql-source.table | Database table name |
| agent.sources.sql-source.columns.to.select | Columns to query |
| agent.sources.sql-source.incremental.column.name | Incremental column name |
| agent.sources.sql-source.incremental.value | Initial incremental value |
| agent.sources.sql-source.run.query.delay | Interval between queries, in milliseconds |
| agent.sources.sql-source.status.file.path | Status file path |
| agent.sources.sql-source.status.file.name | Status file name |
| agent.sinks.HDFS.channel | Channel name corresponding to the sink |
| agent.sinks.HDFS.type | Sink type |
| agent.sinks.HDFS.hdfs.path | Sink path |
| agent.sinks.HDFS.hdfs.fileType | File type of the streamed data |
| agent.sinks.HDFS.hdfs.writeFormat | Data write format |
| agent.sinks.HDFS.hdfs.rollSize | Target file roll size, in bytes |
| agent.sinks.HDFS.hdfs.rollInterval | Interval, in seconds, after which the temporary file is rolled into the final target file; 0 means do not roll by time |
| agent.sinks.HDFS.hdfs.rollCount | Number of events after which the temporary file is rolled into the target file; 0 means do not roll by event count |

Table 1
6. Run the Flume agent
Save the settings from the previous step and restart the Flume service, as shown in Figure 2.
Figure 2
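While the service restarts, it can help to watch the agent log and confirm that the SQL source starts polling; the log location below is an assumption based on a typical HDP layout and may differ on your installation.

# Follow the Flume agent log (path and file name vary by installation).
tail -f /var/log/flume/*.log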
After the restart, the status file records the latest id value, 7, as shown in Figure 3.
Figure 3
Looking at the target path, a temporary file containing the 7 records has been generated, as shown in Figure 4.
Figure 4
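The same checks can be done from the command line, using the paths configured above; the in-progress file carries the HDFS sink's default .tmp suffix.

# Last id recorded by the SQL source
cat /var/lib/flume/sql-source.status
# Files written so far under the HDFS target path
hdfs dfs -ls /flume/mysql
hdfs dfs -cat /flume/mysql/*.tmp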
Querying the HAWQ external table returns all 7 rows, as shown in Figure 5.
Figure 5
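For reference, the query behind Figure 5 looks like this; it assumes a psql session connected to whichever HAWQ database holds ext_wlslog.

# Run from a HAWQ psql session (database name depends on your environment).
psql -c "select * from ext_wlslog order by id;"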
At this point, the initial data load is complete.
7. Test quasi-real-time incremental extraction
Insert three new records, with ids 8, 9, and 10, into the source table.
use test;
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (8,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (9,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values (10,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
commit;
Query the HAWQ external table again about 5 seconds later. As Figure 6 shows, all 10 rows are now returned; the quasi-real-time incremental extraction works.
Figure 6
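A final spot check, under the same assumptions as before: the status file should now have advanced to id 10, and the external table should count 10 rows.

# Last id recorded after the incremental poll
cat /var/lib/flume/sql-source.status
# Row count seen through the HAWQ external table
psql -c "select count(*) from ext_wlslog;"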
V. Advantages and disadvantages of the scheme
The greatest advantage of using Flume to collect relational database table data is that the configuration is simple and requires no programming. Compared with the complexity of Tungsten Replicator, setting the source, channel, and sink properties in flume.conf presents no difficulty at all. Compared with the currently very popular Canal, it is less flexible, but at least not a single line of code has to be written. Moreover, the scheme works by ordinary SQL polling, so it is generic and applies to any relational database source.
The shortcomings of this scheme are as prominent as its merits, mainly in the following aspects.
- Queries are executed directly on the source database, which is intrusive.
- Increments are captured by polling, so the result is only quasi-real-time, and the shorter the polling interval, the greater the load on the source database.
- Only newly inserted data is detected; deletes and updates are not.
- The source table must have a column that can serve as the incremental key.
Despite these limitations, extracting relational database data with Flume is a worthwhile solution, especially in scenarios that call for rapid deployment and minimal programming while still meeting the requirements; it is an effective complement to the traditional Sqoop approach.
References:
- Flume Architecture and Application Introduction
- Streaming MySQL Database Table Data to HDFS with Flume
- How to read data from Oracle using Flume to Kafka broker
- https://github.com/keedio/flume-ng-sql-source