Most of the Spark code you find online, and most of the books, are written in Scala. That is no surprise, since Spark itself is written in Scala. I have not studied Scala systematically, though, so I write my Spark programs in Java. Spark supports Java, and Scala runs on the JVM anyway, so let's go straight to the code.
This is the example from the official documentation, the classic big-data learning case: word count.
On Linux, open a terminal and run $ nc -lk 9999
Then run the following code
package com.tg.spark.stream;

import java.util.Arrays;

import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;

import scala.Tuple2;

/**
 * @author Soup High
 */
public class SparkStream {
    public static void main(String[] args) {
        // Create a local StreamingContext with 4 working threads and a batch interval of 1 second
        SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
                .set("spark.testing.memory", "2147480000");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        System.out.println(jssc);

        // Create a DStream that will connect to hostname:port, like localhost:9999
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("master", 9999);
        // For the second case further below, read from an HDFS directory instead:
        // JavaDStream<String> lines = jssc.textFileStream("hdfs://master:9000/stream");

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String x) {
                System.out.println(Arrays.asList(x.split(" ")).get(0));
                return Arrays.asList(x.split(" "));
            }
        });

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        System.out.println(pairs);

        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();
        // Also save each batch to HDFS as text files
        wordCounts.dstream().saveAsTextFiles("hdfs://master:9000/testfile/", "spark");
        // Note: this prints the DStream object itself, not a materialized count
        System.out.println(wordCounts.count());

        jssc.start();             // Start the computation
        jssc.awaitTermination();  // Wait for the computation to terminate
    }
}
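If you want to run it on a cluster instead of straight from the IDE, you can package the class into a jar and hand it to spark-submit. Here is a minimal sketch; the jar name and master URL below are placeholders, adjust them to your own build and cluster:

# package with mvn package / sbt package first; jar name and master are placeholders
$ spark-submit \
    --class com.tg.spark.stream.SparkStream \
    --master local[4] \
    spark-stream-wordcount.jar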
Then type hello world into the netcat terminal:
# TERMINAL 1:
# Running Netcat
$ nc -lk 9999
hello world
You can then see the counts in the console:
-------------------------------------------
Time: 1357008430000 ms
-------------------------------------------
(hello,1)
(world,1)
...
The files generated in real time by the computation can also be seen on HDFS, for example with the commands below.
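You can check the output of saveAsTextFiles with the HDFS shell. A small sketch, assuming the output prefix hdfs://master:9000/testfile/ used in the code above; the batch-timestamp path is a placeholder that will differ on every run:

# list the per-batch output directories written by saveAsTextFiles
# (each directory name combines the prefix, the batch time in milliseconds, and the "spark" suffix)
$ hdfs dfs -ls hdfs://master:9000/testfile/
# then look inside one of the listed directories, e.g. (path is a placeholder):
$ hdfs dfs -cat hdfs://master:9000/testfile/-<batch-time-ms>.spark/part-00000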
In the second case, the input data source is not a socket read with socketTextStream, but a directory on HDFS read with textFileStream.
package com.tg.spark.stream;

import java.util.Arrays;

import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;

import scala.Tuple2;

/**
 * @author Soup High
 */
public class SparkStream2 {
    public static void main(String[] args) {
        // Create a local StreamingContext with 4 working threads and a batch interval of 1 second
        SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
                .set("spark.testing.memory", "2147480000");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        System.out.println(jssc);

        // This time the DStream reads new files from an HDFS directory instead of a socket
        JavaDStream<String> lines = jssc.textFileStream("hdfs://master:9000/stream");
        // JavaReceiverInputDStream<String> lines = jssc.socketTextStream("master", 9999);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String x) {
                System.out.println(Arrays.asList(x.split(" ")).get(0));
                return Arrays.asList(x.split(" "));
            }
        });

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        System.out.println(pairs);

        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();
        // Also save each batch to HDFS as text files
        wordCounts.dstream().saveAsTextFiles("hdfs://master:9000/testfile/", "spark");

        jssc.start();             // Start the computation
        jssc.awaitTermination();  // Wait for the computation to terminate
    }
}
Spark Streaming keeps monitoring that directory; as soon as a new file appears in it, the contents are read immediately. Run the program, then manually add a file to the directory (one way is shown in the commands below), and you will see the output.
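For example, copying a local text file into the monitored directory with the HDFS shell is enough to trigger a batch. A small sketch, assuming the hdfs://master:9000/stream directory from the code above; words.txt is just a placeholder for any local text file:

# create the monitored directory if it does not exist yet
$ hdfs dfs -mkdir -p hdfs://master:9000/stream
# copy a new file in; textFileStream only picks up files newly added to the directory
$ hdfs dfs -put words.txt hdfs://master:9000/stream/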
Writing this up took effort; if you repost it, please cite the source: http://blog.csdn.net/tanggao1314/article/details/51606721
References
Spark Programming Guide
Spark real-time stream computing, a Java example