Processing of flume_kafka_hdfs_hive data

Source: Internet
Author: User
Tags: hadoop, fs

Flume is used to collect the data and deliver it to both Kafka and HDFS. The data in Kafka can feed real-time computation built with Storm, while the data on HDFS can be processed with MapReduce and then imported into Hive.

Environment: hadoop 1.2.1, hive 0.13.1, maven 3.2.5, flume 1.4, kafka 0.7.2, eclipse luna, jdk 1.7.0_75; mysql-connector-java-5.1.26-bin.jar, flume-kafka-master.zip.

Description: All services are set up on a single machine.

1: Install Hadoop: this is covered more thoroughly elsewhere; see: Ubuntu 12.10 Installation JDK, Hadoop whole process.

During installation I hit the error "Does not contain a valid host:port authority: file:///". I went over core-site.xml, hdfs-site.xml, and mapred-site.xml without finding a mistake, and also checked the hosts configuration, but finally found online that the value of fs.default.name had been written incorrectly. After fixing it, Hadoop started.
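For reference, the property in question lives in core-site.xml; for a single-machine Hadoop 1.x setup it usually looks like the snippet below (the host and port are the conventional single-node values, adjust them to your environment):

<property>
  <!-- the HDFS namenode URI; a malformed value here causes the
       "Does not contain a valid host:port authority" error -->
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>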

2: Install Hive: after downloading and extracting, set HIVE_HOME, add HIVE_HOME/bin to the PATH variable, and type hive to start. By default Hive uses an embedded Derby database for its metastore; it is lightweight and also an Apache project, but it allows only a single session and cannot be shared by multiple users. Following the article "hive integrated MySQL as metadata", I stored the metadata in MySQL instead. I ran into a problem here: I could only connect as localhost and could not connect as a remote user@host. I tried to find my.cnf as the article suggested but did not find the relevant configuration; for an experimental environment, connecting as localhost is good enough.
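If remote access to the metastore database is needed, the usual fix is a MySQL grant for a non-localhost host (and checking bind-address in my.cnf). A minimal sketch, run in the MySQL client as root; the database name, user, and password below are assumptions, not values from the original setup:

-- metastore database referenced by hive-site.xml's javax.jdo.option.ConnectionURL
CREATE DATABASE IF NOT EXISTS hive;
-- allow the hive user to connect from any host, not only localhost
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hive_password';
FLUSH PRIVILEGES;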

3: Install Flume:

Flume is a distributed, reliable, and highly available service from Cloudera for efficiently collecting, aggregating, and moving large volumes of log data. It has a simple and flexible architecture based on streaming data flows, uses reliable coordination, fault-tolerant transfer, and recovery mechanisms to stay robust, and offers a simple, extensible data model that allows for online analytic applications.

Flume is organized into agents. An agent consists of three parts: source, channel, and sink. The source fetches data (here, from a web server log) and pushes it to a channel; the sink pulls data from the channel. An agent can have multiple channels and sinks.

Set FLUME_HOME, add Flume's bin directory to the PATH, rename conf/flume-conf.properties.template to flume-conf.properties, and then configure it for the single-machine case:

# add one source (r1), one sink (s1) and one channel (c1) to the agent
agent.sources = r1
agent.sinks = s1
agent.channels = c1

# wire them together: r1 pushes events into c1, s1 pulls events from c1
agent.sources.r1.channels = c1
agent.sinks.s1.channel = c1

# describe the source: type exec, tailing the log file
agent.sources.r1.type = exec
agent.sources.r1.command = tail -f /root/input/loginfo

# use a channel which buffers events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 100

# sink type: logger
agent.sinks.s1.type = logger

When done, start Flume to test:

bin/flume-ng agent --conf conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent

4: Install Kafka: for an introduction, see Kafka Quick Start. Simply put, a server in a Kafka cluster is a broker; messages are grouped by name into topics; producers publish messages and consumers read them. Kafka is installed the same way as above. Because the default configuration is used, nothing under config needs to be changed, and it can be tried out directly by starting:

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-console-producer.sh --zookeeper localhost:2181 --topic test
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

For more details, see Apache Kafka.

5: Integration: data processing covers data collection, data cleansing, data storage, data analysis, and data presentation. Here collection is handled by Flume, which regularly gathers log information from the web server. For real-time processing, the data is sent directly to Kafka and then on to Storm (not done here). For the offline part, after simple MapReduce processing the data is stored on HDFS and then worked on with Hive.

Overall architecture diagram:

Flume's design:
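As a rough illustration of that design (once the flume-kafka plugin from the next step is installed), a single agent can replicate one source into two channels: one feeding an HDFS sink for the offline path, one feeding a Kafka sink for the real-time path. The sketch below uses standard Flume 1.4 HDFS sink properties; the Kafka sink type, the HDFS URI, and the paths are placeholders, not values from the original setup:

agent.sources = r1
agent.channels = c1 c2
agent.sinks = hdfsSink kafkaSink

# one exec source, replicated into both channels
agent.sources.r1.type = exec
agent.sources.r1.command = tail -f /root/input/loginfo
agent.sources.r1.channels = c1 c2
agent.sources.r1.selector.type = replicating

agent.channels.c1.type = memory
agent.channels.c2.type = memory

# offline path: raw events to HDFS for later MapReduce/Hive processing
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = c1
agent.sinks.hdfsSink.hdfs.path = hdfs://localhost:9000/myflume
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.writeFormat = Text

# real-time path: events handed to Kafka through the flume-kafka plugin
# (placeholder type; use the sink class that ships with the plugin)
agent.sinks.kafkaSink.type = <KafkaSink class from the flume-kafka plugin>
agent.sinks.kafkaSink.channel = c2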

Before setting this up, install Maven; the installation steps are the same as for Flume and Kafka above.

After installation, echo $PATH shows:

/usr/lib/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/java/jdk1.7.0_75/bin:/root/hadoop-1.2.1/bin:/root/apache-hive-0.13.1/bin:/root/apache-flume-1.4.0/bin:/root/kafka-0.7.2/bin:/root/bin:/usr/java/jdk1.7.0_75/bin:/root/hadoop-1.2.1/bin:/root/apache-hive-0.13.1/bin:/root/apache-flume-1.4.0/bin:/root/kafka-0.7.2/bin:/root/downloads/apache-maven-3.2.5/bin

Integration between Flume and Kafka requires a plugin; here a flume-kafka plugin based on Flume 1.4 and Kafka 0.7.2 is used. Download the code, enter the directory, and package it into a jar with Maven. Put the resulting jar into Flume's lib (or related) directory, together with hadoop-1.2.1-*.jar, kafka-0.7.2.jar, scala-compiler.jar (2.8), scala-library.jar (2.8) and zkclient-0.1.jar. mvn package may fail with an error that kafka-0.7.2.jar cannot be found; in that case put the extra dependencies under ~/.m2/repository/com/linkedin/kafka/kafka/0.7.2/ and package again.
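Roughly, the build-and-install step looks like the sketch below; the directory and jar names are assumptions and depend on how the plugin and its extra dependencies are laid out in your download:

cd flume-kafka-master
# if mvn package cannot resolve kafka 0.7.2, seed the local repository first
mkdir -p ~/.m2/repository/com/linkedin/kafka/kafka/0.7.2/
cp <extra kafka 0.7.2 jars shipped with the plugin> ~/.m2/repository/com/linkedin/kafka/kafka/0.7.2/
mvn package
# put the plugin jar (and the scala/zkclient/kafka jars listed above) where Flume can load them
cp target/flume-kafka-*.jar $FLUME_HOME/lib/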

For the topic MYGGG, open 6 terminals:

The data sent to Kafka is handled downstream; the focus now is the data in Hadoop, which is first processed with MapReduce to produce formatted text.

6: Next: unzip Eclipse, put the previously prepared hadoop-eclipse-plugin-1.2.1.jar into Eclipse's plugins directory, connect to the machine using VNC, and write the MapReduce program.

View the collected file with hadoop fs -cat /myflume/flumedata.1426320728464:

1,b
2,c
3,d
4,e
5,f
6,g
7,z
0,o

Write a MapReduce job that splits each line of the record into a key and a value:

public static class MyMapper extends Mapper<Object, Text, IntWritable, Text> {
    private IntWritable hello = new IntWritable();
    private Text world = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] array = value.toString().split(",");
        if (array.length == 2) {
            hello.set(Integer.parseInt(array[0]));
            world.set(array[1]);
            context.write(hello, world);
        }
    }
}
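The mapper needs a driver to run as a job; since the output in part-r-00000 is just the mapped pairs sorted by key, the default (identity) reduce behavior is enough. A minimal sketch of the enclosing job class follows, where the /myflume and /myoutput paths come from this walkthrough but the class name and job name are assumptions:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlumeDataSplit {

    // MyMapper is the nested mapper class shown above

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "split flume data");   // Hadoop 1.x style job setup
        job.setJarByClass(FlumeDataSplit.class);
        job.setMapperClass(MyMapper.class);
        // no reducer set: the default Reducer passes pairs through,
        // and the shuffle sorts them by the IntWritable key
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/myflume"));    // files written by Flume
        FileOutputFormat.setOutputPath(job, new Path("/myoutput")); // read later by Hive
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}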

View results:

hadoop fs -cat /myoutput/part-r-00000
0    o
1    b
2    c
3    d
4    e
5    f
6    g
7    z

Use Hive to create an external table to view the data (the table and column names here are illustrative):

create external table flumedata (id int, letter string)
row format delimited fields terminated by '\t'
lines terminated by '\n'
location '/myoutput';

Then you can do the relevant query and processing.
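For example, a quick check against the illustrative table above:

select * from flumedata where id >= 5;
select count(*) from flumedata;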
