Lesson 87: Pushing Data from Flume to Spark Streaming, with Source Code Analysis


Contents of this lesson:

1. Review of the Flume-to-HDFS case

2. Hands-on: pushing data from Flume to Spark Streaming

3. Analysis of the underlying mechanism

1. Review of the Flume-to-HDFS Case

The last lesson asked everyone to install and configure Flume and to test transferring data; the assignment was to transfer data onto HDFS.

File configuration:

~/.bashrc:

export FLUME_HOME=/usr/local/flume/apache-flume-1.6.0-bin
export FLUME_CONF_DIR=$FLUME_HOME/conf

Add to PATH: ${FLUME_HOME}/bin

Copy conf/flume-conf.properties.template to conf/flume-conf.properties and keep only the following:

# agent1 is the name of the agent
agent1.sources=source1
agent1.sinks=sink1
agent1.channels=channel1

# configure source1
agent1.sources.source1.type=spooldir
agent1.sources.source1.spoolDir=/usr/local/flume/tmp/testdir
agent1.sources.source1.channels=channel1
agent1.sources.source1.fileHeader=false
agent1.sources.source1.interceptors=i1
agent1.sources.source1.interceptors.i1.type=timestamp

# configure sink1
agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path=hdfs://master:9000/library/flume
agent1.sinks.sink1.hdfs.fileType=DataStream
agent1.sinks.sink1.hdfs.writeFormat=Text
agent1.sinks.sink1.hdfs.rollInterval=1
agent1.sinks.sink1.channel=channel1
# the timestamp interceptor above adds the header that lets %Y-%m-%d resolve
agent1.sinks.sink1.hdfs.filePrefix=%Y-%m-%d

#agent1.sinks.sink1.type=avro
#agent1.sinks.sink1.channel=channel1
#agent1.sinks.sink1.hostname=master
#agent1.sinks.sink1.port=9999

# configure channel1
agent1.channels.channel1.type=file
agent1.channels.channel1.checkpointDir=/usr/local/flume/tmp/checkpointdir
agent1.channels.channel1.dataDirs=/usr/local/flume/tmp/datadirs

flume-env.sh configuration:

# export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_60

# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
# export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"

Create the local folder /usr/local/flume/tmp/testdir.

Create the /library/flume folder on HDFS.

Start Flume from the bin folder of the Flume installation:

./flume-ng agent -n agent1 -c conf -f /usr/local/flume/apache-flume-1.6.0-bin/conf/flume-conf.properties -Dflume.root.logger=DEBUG,console

Copy a test file, for example NOTICE, into /usr/local/flume/tmp/testdir.

The Flume console will then print messages like the following:

16/04/22 11:03:49 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /usr/local/flume/tmp/testdir/NOTICE to /usr/local/flume/tmp/testdir/NOTICE.COMPLETED
16/04/22 11:03:51 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
16/04/22 11:03:51 INFO hdfs.BucketWriter: Creating hdfs://master:9000/library/flume/2016-04-22.1461294231806.tmp
16/04/22 11:03:52 INFO hdfs.BucketWriter: Closing hdfs://master:9000/library/flume/2016-04-22.1461294231806.tmp
16/04/22 11:03:52 INFO hdfs.BucketWriter: Renaming hdfs://master:9000/library/flume/2016-04-22.1461294231806.tmp to hdfs://master:9000/library/flume/2016-04-22.1461294231806

You will find that the local NOTICE file has been renamed to NOTICE.COMPLETED.

Query in a browser: http://localhost:50070/explorer.html#/library/flume. You can see that Flume has written the NOTICE file into /library/flume on HDFS under the name 2016-04-22.1461294231806. Open the file to verify its contents. This shows that whenever a new file appears in the source folder watched by Flume, Flume automatically imports it into the HDFS folder specified in the Flume configuration.
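Besides the web UI, you can also read the file back programmatically to verify its contents. The following is only a small illustrative sketch using the standard Hadoop FileSystem API; the class name is made up for this example, and the file name is the one produced by the run above, so substitute your own.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFlumeOutput {
    public static void main(String[] args) throws Exception {
        // Connect to the same HDFS namenode that the Flume sink writes to.
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:9000"), new Configuration());
        // File name taken from the run above; replace it with the one Flume created for you.
        InputStream in = fs.open(new Path("/library/flume/2016-04-22.1461294231806"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}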

In a normal production setup, Flume data should generally be pushed into Kafka first, so that different consumers can consume the data independently. Whether to put Kafka between Flume and the consumers depends on whether your business produces data continuously: if data keeps flowing in, you should choose Kafka. If the data arrives unevenly, sometimes a lot, sometimes a little, or even none at all for stretches of time, then Kafka is unnecessary and you can save the resources.

2. Hands-On: Pushing Data from Flume to Spark Streaming

Instead of importing Flume data into HDFS, we now push the data into Spark Streaming.

Modify the conf/flume-conf.properties file: comment out the HDFS sink and replace it with an Avro sink that points at the Spark Streaming receiver.

# configure sink1

#agent1.sinks.sink1.type=hdfs
#agent1.sinks.sink1.hdfs.path=hdfs://master:9000/library/flume
#agent1.sinks.sink1.hdfs.fileType=DataStream
#agent1.sinks.sink1.hdfs.writeFormat=Text
#agent1.sinks.sink1.hdfs.rollInterval=1
#agent1.sinks.sink1.channel=channel1
#agent1.sinks.sink1.hdfs.filePrefix=%Y-%m-%d

agent1.sinks.sink1.type=avro
agent1.sinks.sink1.channel=channel1
agent1.sinks.sink1.hostname=master
agent1.sinks.sink1.port=9999

Write a Java program for the Spark Streaming application (to build it, the spark-streaming-flume artifact matching your Spark version, for example spark-streaming-flume_2.10 for the Spark 1.6.x line used here, needs to be on the classpath together with spark-core and spark-streaming):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

import scala.Tuple2;

public class FlumePushData2SparkStreaming {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("FlumePushData2SparkStreaming");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(30));

        // The receiver listens on master:9999, where the Flume avro sink pushes events.
        JavaReceiverInputDStream<SparkFlumeEvent> lines = FlumeUtils.createStream(jsc, "master", 9999);

        // Note: the type of the events received here is SparkFlumeEvent.
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<SparkFlumeEvent, String>() {
            @Override
            public Iterable<String> call(SparkFlumeEvent event) throws Exception {
                String line = new String(event.event().getBody().array());
                return Arrays.asList(line.split(" "));
            }
        });

        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        wordsCount.print();

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}

FlumeUtils is used in the code above. Let's dissect it.

The FlumeUtils method used in the code above is createStream.

It in turn calls a more general createStream method inside spark-streaming-flume.

You can see that the received stream is stored by default in serialized form, both in memory and on disk, and replicated to two machines, that is, StorageLevel.MEMORY_AND_DISK_SER_2.
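For reference, here is a minimal sketch of the two Java-facing createStream overloads (assuming the spark-streaming-flume 1.x API; the class name below is made up for illustration). If no StorageLevel is passed, MEMORY_AND_DISK_SER_2 is used, which is exactly the default behavior described above.

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class CreateStreamOverloads {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("CreateStreamOverloads");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(30));

        // Default storage level: MEMORY_AND_DISK_SER_2, i.e. serialized,
        // kept in memory and on disk, and replicated to two nodes.
        JavaReceiverInputDStream<SparkFlumeEvent> defaultStream =
                FlumeUtils.createStream(jsc, "master", 9999);

        // The same call with the storage level spelled out explicitly.
        JavaReceiverInputDStream<SparkFlumeEvent> explicitStream =
                FlumeUtils.createStream(jsc, "master", 9999, StorageLevel.MEMORY_AND_DISK_SER_2());
    }
}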

Following the createStream call chain further down:

What is actually returned is a FlumeInputDStream, and its element type is SparkFlumeEvent, a wrapper around the Avro event defined by Flume. That is why, in the flatMap of the Java code above, the input type of the FlatMapFunction must be SparkFlumeEvent.
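Because SparkFlumeEvent wraps Flume's Avro event, inside the FlatMapFunction you can get at both the body and the headers. A small sketch of that (the helper class and method names are just for illustration):

import java.util.Map;

import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class FlumeEventUtil {

    // Pull the payload and headers out of a received SparkFlumeEvent.
    public static String describe(SparkFlumeEvent sparkEvent) {
        // event() exposes the underlying AvroFlumeEvent
        String body = new String(sparkEvent.event().getBody().array());
        Map<CharSequence, CharSequence> headers = sparkEvent.event().getHeaders();
        return "headers=" + headers + ", body=" + body;
    }
}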

Now look at the FlumeInputDStream code:

You can see that getReceiver() returns a FlumeReceiver object, which is what actually receives the data. Looking at FlumeReceiver in turn:

You can see that the receiver is built on Netty (via Avro's NettyServer). If you do distributed programming, Netty is worth paying attention to.
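In essence, the receiver stands up an Avro RPC server on Netty and registers itself as the handler for Flume's AvroSourceProtocol, which is what lets the avro sink configured above connect to master:9999 and push events. The following is only an illustrative sketch of that idea built directly on Avro's NettyServer, not Spark's actual FlumeReceiver code; the class name is made up and the port simply matches this lesson's configuration.

import java.net.InetSocketAddress;
import java.util.List;

import org.apache.avro.ipc.NettyServer;
import org.apache.avro.ipc.specific.SpecificResponder;
import org.apache.flume.source.avro.AvroFlumeEvent;
import org.apache.flume.source.avro.AvroSourceProtocol;
import org.apache.flume.source.avro.Status;

// Illustrative only: an Avro RPC server on Netty that accepts events
// pushed by a Flume avro sink, similar in spirit to what FlumeReceiver does.
public class MiniAvroFlumeServer implements AvroSourceProtocol {

    @Override
    public Status append(AvroFlumeEvent event) {
        // FlumeReceiver would wrap the event as a SparkFlumeEvent and store it in Spark;
        // here we just print the payload.
        System.out.println(new String(event.getBody().array()));
        return Status.OK;
    }

    @Override
    public Status appendBatch(List<AvroFlumeEvent> events) {
        for (AvroFlumeEvent event : events) {
            append(event);
        }
        return Status.OK;
    }

    public static void main(String[] args) {
        // Listen on port 9999 on all interfaces (the avro sink points at master:9999).
        NettyServer server = new NettyServer(
                new SpecificResponder(AvroSourceProtocol.class, new MiniAvroFlumeServer()),
                new InetSocketAddress(9999));
        server.start();
    }
}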

Run the Spark Streaming Java program above, and confirm that Flume is running as well. Because this is the push model, the Spark Streaming receiver must already be up and listening on master:9999 before the avro sink tries to deliver events to it.

Now find some files and copy them into the testdir folder, for example a few text files from the Flume directory. In the console where the Java program is running, the word-count results then appear.

This shows that Flume pushed the data to Spark Streaming, and Spark Streaming processed the data in a timely manner.

3. Analysis of the Underlying Mechanism

To summarize:

With Spark Streaming you can handle all kinds of data sources: databases, HDFS, server logs, network streams, and so on. It is more powerful than you might imagine, yet many people never use it, and the real reason is usually that they do not understand Spark and Spark Streaming itself.

Written by: IMF Spark Streaming Enterprise Development Practice Team (Xiayang and others)

Editor-in-chief: Liaoliang

Note:

Material from: DT Big Data DreamWorks (IMF Legendary Action course)

For more exclusive content, please follow the WeChat public account: DT_Spark

If you are interested in big data and Spark, you can listen to teacher Liaoliang's free public Spark class every evening at 20:00 in YY room 68917580.

Life is short, you need Spark!

