Using Flume to extract MySQL table data to HDFS in real time


Transferred from: http://blog.csdn.net/wzy0623/article/details/73650053

I. Why use Flume

When I previously built the HAWQ data warehouse experimental environment, I used Sqoop to incrementally extract data from MySQL to HDFS and then accessed it through HAWQ external tables. That approach needs only a small amount of configuration to complete the extraction task, but its drawback is just as obvious: poor timeliness. Sqoop reads and writes data with MapReduce, and MapReduce is designed for batch scenarios, aiming at high throughput rather than low latency. As it was done in that experiment, data was extracted on a fixed schedule once a day.
Flume is a system for collecting, aggregating, and transporting large volumes of log data. It supports custom data senders in the log system for data collection, and it also provides simple processing of the data before writing it to various data receivers. Flume handles data as a stream and can run continuously as an agent: as soon as new data appears, Flume can pick it up and deliver it to the target, which largely solves the timeliness problem.
Flume started out as a pure log collector, but with the appearance of the flume-ng-sql-source plugin it became possible for Flume to collect data from relational databases as well. The rest of this article gives a brief introduction to Flume and then describes in detail how to configure it to extract MySQL table data to HDFS in real time.

II. Introduction to Flume

1. What Flume is
Flume is a distributed log collection system that gathers data from various servers and delivers it to a designated destination, such as HDFS. Simply put, Flume collects logs; its architecture is shown in Figure 1.
Figure 1
2. The concept of an event
Here it is necessary to introduce Flume's notion of an event. The core of Flume is to collect data from a data source and deliver it to a specified destination (sink). To guarantee that delivery succeeds, the data is buffered in a channel before being sent to the sink, and only after the data has actually arrived at the sink does Flume delete its buffered copy.
During the entire transfer it is events that flow, and transactions are guaranteed at the event level. So what is an event? An event encapsulates the data being transferred; it is the basic unit in which Flume moves data, usually one line of a record in the case of a text file. It is also the basic unit of a transaction. An event travels from source to channel to sink; it is itself a byte array and can carry header information. An event represents the smallest complete unit of data, moving from an external source to an external destination.

3. Flume architecture
What makes Flume so capable is its design around the agent. An agent is a Java process that runs on the log collection node, that is, on the server node. An agent contains three core components: source, channel, and sink, which resemble a producer, a warehouse, and a consumer.
    • Source: the source component collects data and can handle log data in many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
    • Channel: once the source has collected the data, it is stored temporarily in the channel. The channel component is a simple cache for collected data inside the agent and can be backed by memory, JDBC, file, and so on.
    • Sink: the sink component sends data to the destination; supported sinks include HDFS, logger, avro, thrift, IPC, file, null, HBase, Solr, and custom sinks.

4. How Flume operates
The core of Flume is the agent. An agent has two points of external interaction: one accepts incoming data (the source) and the other emits data (the sink), which is responsible for delivering the data to the specified external destination. After the source receives data it hands it to the channel, which acts as a buffer that temporarily holds the data; the sink then sends the data from the channel to the specified place, for example HDFS. Note that the channel deletes the buffered data only after the sink has successfully delivered it, which guarantees the reliability and safety of the transfer.
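To make the wiring concrete, the following is a minimal sketch of an agent definition, assuming a hypothetical agent named a1 with a netcat source, a memory channel, and a logger sink (this example is illustrative only and is not part of the MySQL setup below):
[Plain]
    # hypothetical agent "a1" with one source, one channel, and one sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # netcat source listening on a local port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # in-memory channel that buffers events between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000

    # logger sink that simply writes each event to the log
    a1.sinks.k1.type = logger

    # wire the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
Each event received by the netcat source is committed to the channel and removed only after the logger sink has taken it, which is exactly the transactional hand-off described above.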

III. Installing Hadoop and Flume
My experiment was performed on HDP 2.5.0; Flume is bundled with HDP, so it only needs to be added and configured as a service. For the HDP installation steps, see "HAWQ Technical Analysis (II): Installation and Deployment".
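To confirm that the Flume client is available on the node before continuing, the bundled command-line tool can be queried (a quick sanity check, not required by the walkthrough):
[Plain]
    flume-ng version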

IV. Configuration and testing
1. Create the MySQL table
Create a test table and add some data.
[SQL]
    use test;

    create table wlslog
    (id int not null,
     time_stamp varchar(255),   -- column lengths are assumed here
     category varchar(255),
     type varchar(255),
     servername varchar(255),
     code varchar(255),
     msg varchar(255),
     primary key (id)
    );

    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(1,'apr-8-2014-7:06:16-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to standby');
    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(2,'apr-8-2014-7:06:17-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to starting');
    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(3,'apr-8-2014-7:06:18-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to admin');
    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(4,'apr-8-2014-7:06:19-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to resuming');
    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(5,'apr-8-2014-7:06:20-pm-pdt','notice','weblogicserver','adminserver','bea-000361','started weblogic adminserver');
    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(6,'apr-8-2014-7:06:21-pm-pdt','notice','weblogicserver','adminserver','bea-000365','server state changed to running');
    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(7,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');

    commit;
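A quick check of the seed data (an illustrative query, not part of the original walkthrough) confirms the seven rows before wiring up Flume:
[SQL]
    -- expect 7 rows, ids 1 through 7
    select count(*) from wlslog;
    select id, code, msg from wlslog order by id;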

2. Create the related directories and files
(1) Create the local status file
[Plain]
    mkdir -p /var/lib/flume
    cd /var/lib/flume
    touch sql-source.status
    chmod -R 777 /var/lib/flume

(2) Create the HDFS target directory
[Plain]
    hdfs dfs -mkdir -p /flume/mysql
    hdfs dfs -chmod -R 777 /flume/mysql

3. Prepare the jar packages
Download flume-ng-sql-source-1.3.7.jar from http://book2s.com/java/jar/f/flume-ng-sql-source/download-flume-ng-sql-source-1.3.7.html and copy it to the Flume library directory.
[Plain]
    cp flume-ng-sql-source-1.3.7.jar /usr/hdp/current/flume-server/lib/
Copy the MySQL JDBC driver jar package to the Flume library directory as well.
[Plain]
    cp mysql-connector-java-5.1.17.jar /usr/hdp/current/flume-server/lib/mysql-connector-java.jar
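A simple listing can confirm that both jars landed in the Flume library directory (path as used above):
[Plain]
    ls -l /usr/hdp/current/flume-server/lib/ | grep -E 'flume-ng-sql-source|mysql-connector'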

4. Create the HAWQ external table
[SQL]
    create external table ext_wlslog
    (id int,
     time_stamp varchar(255),
     category varchar(255),
     type varchar(255),
     servername varchar(255),
     code varchar(255),
     msg varchar(255)
    )
    location ('pxf://mycluster/flume/mysql?profile=hdfstextmulti') format 'csv' (quote=e'"');

5. Configure Flume
In Ambari -> Flume -> Configs, set the following properties in flume.conf:
[Plain]
    agent.channels.ch1.type = memory
    agent.sources.sql-source.channels = ch1
    agent.channels = ch1
    agent.sinks = HDFS
    agent.sources = sql-source
    agent.sources.sql-source.type = org.keedio.flume.source.SQLSource
    agent.sources.sql-source.connection.url = jdbc:mysql://172.16.1.127:3306/test
    agent.sources.sql-source.user = root
    agent.sources.sql-source.password = 123456
    agent.sources.sql-source.table = wlslog
    agent.sources.sql-source.columns.to.select = *
    agent.sources.sql-source.incremental.column.name = id
    agent.sources.sql-source.incremental.value = 0
    agent.sources.sql-source.run.query.delay = 5000
    agent.sources.sql-source.status.file.path = /var/lib/flume
    agent.sources.sql-source.status.file.name = sql-source.status
    agent.sinks.HDFS.channel = ch1
    agent.sinks.HDFS.type = hdfs
    agent.sinks.HDFS.hdfs.path = hdfs://mycluster/flume/mysql
    agent.sinks.HDFS.hdfs.fileType = DataStream
    agent.sinks.HDFS.hdfs.writeFormat = Text
    agent.sinks.HDFS.hdfs.rollSize = 268435456
    agent.sinks.HDFS.hdfs.rollInterval = 0
    agent.sinks.HDFS.hdfs.rollCount = 0
This flume.conf specifies the source-, channel-, and sink-related configuration; the properties are described in Table 1.

Property | Description
agent.channels.ch1.type | Channel type of the agent
agent.sources.sql-source.channels | Channel name used by the source
agent.channels | Channel name(s)
agent.sinks | Sink name(s)
agent.sources | Source name(s)
agent.sources.sql-source.type | Source type
agent.sources.sql-source.connection.url | Database connection URL
agent.sources.sql-source.user | Database user name
agent.sources.sql-source.password | Database password
agent.sources.sql-source.table | Database table name
agent.sources.sql-source.columns.to.select | Columns to query
agent.sources.sql-source.incremental.column.name | Name of the incremental column
agent.sources.sql-source.incremental.value | Initial value of the incremental column
agent.sources.sql-source.run.query.delay | Interval between queries, in milliseconds
agent.sources.sql-source.status.file.path | Status file path
agent.sources.sql-source.status.file.name | Status file name
agent.sinks.HDFS.channel | Channel name used by the sink
agent.sinks.HDFS.type | Sink type
agent.sinks.HDFS.hdfs.path | HDFS target path of the sink
agent.sinks.HDFS.hdfs.fileType | File type of the streamed data
agent.sinks.HDFS.hdfs.writeFormat | Data write format
agent.sinks.HDFS.hdfs.rollSize | Roll the target file when it reaches this size, in bytes
agent.sinks.HDFS.hdfs.rollInterval | Roll the temporary file into the final target file after this many seconds; 0 means do not roll based on time
agent.sinks.HDFS.hdfs.rollCount | Roll the temporary file into the target file after this many events; 0 means do not roll based on the number of events

Table 1


6. Run the Flume agent
Save the settings from the previous step and restart the Flume service, as shown in Figure 2.
Figure 2
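If Flume is not managed through Ambari, an equivalent agent can also be started from the command line. A rough sketch, assuming the configuration above is saved in flume.conf under /etc/flume/conf and the agent is named agent:
[Plain]
    flume-ng agent \
      --conf /etc/flume/conf \
      --conf-file /etc/flume/conf/flume.conf \
      --name agent \
      -Dflume.root.logger=INFO,console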
After the restart, the status file already records the latest id value, 7, as shown in Figure 3.
Figure 3
Looking at the target path, a temporary file has been generated containing the 7 records, as shown in Figure 4.
Figure 4
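The same checks can be made from the shell, assuming the status file path and HDFS target directory configured above:
[Plain]
    # the status file kept by flume-ng-sql-source records the last extracted index
    cat /var/lib/flume/sql-source.status

    # list and inspect the temporary file written by the HDFS sink
    hdfs dfs -ls /flume/mysql
    hdfs dfs -cat /flume/mysql/*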
Querying the HAWQ external table returns all 7 rows, as shown in Figure 5.
Figure 5
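The verification behind Figure 5 is simply a select against the external table, for example:
[SQL]
    select * from ext_wlslog order by id;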
At this point, the initial data load is complete.

7. Test quasi-real-time incremental extraction
Add three new records with id 8, 9, and 10 to the source table.
[SQL]
    use test;
    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(8,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(9,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
    insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(10,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');
    commit;
Five seconds later, a query against the HAWQ external table returns all 10 rows, as seen in Figure 6; the quasi-real-time incremental extraction succeeded.
Figure 6
V. Advantages and disadvantages of the scheme
The greatest advantage of using Flume to collect relational database table data is that the configuration is simple and requires no programming. Compared with the complexity of tungsten-replicator, configuring the source, channel, and sink properties in flume.conf presents no difficulty at all. Compared with the currently very popular Canal it is less flexible, but at least not a single line of code has to be written. The scheme works by ordinary SQL polling, so it is generic and applicable to any relational data source.
The shortcomings of this scheme are as prominent as its merits, mainly in the following respects.
    • Queries are executed against the source database, which is intrusive.
    • Increments are obtained by polling, so the extraction is only quasi-real-time, and the shorter the polling interval, the greater the load on the source database.
    • Only newly added rows are recognized; deletes and updates are not detected.
    • The source table must have a column that represents the increment.
Even with these many limitations, extracting relational database data with Flume is a valuable solution, especially in scenarios that call for rapid deployment and minimal programming; it is an effective complement to the traditional Sqoop approach.

References:
    • Flume Architecture and Application Introduction
    • Streaming MySQL Database Table Data to HDFS with Flume
    • How to read data from Oracle using Flume to a Kafka broker
    • https://github.com/keedio/flume-ng-sql-source

