Common ways to program Spark in Java


Source: http://blog.sina.com.cn/s/blog_628cc2b70102w9up.html

1. Initializing the SparkContext

    System.setProperty("hadoop.home.dir", "d:\\spark-1.6.1-bin-hadoop2.6\\spark-1.6.1-bin-hadoop2.6");
    SparkConf conf = new SparkConf().setAppName("Spark test1").setMaster("local[2]");
    JavaSparkContext context = new JavaSparkContext(conf);

(The snippets below share this `context` and assume the relevant classes are imported, e.g. org.apache.spark.api.java.*, org.apache.spark.api.java.function.*, and scala.Tuple2.)

2. Using the parallelize method

The simplest way to create an RDD is to pass an existing collection in your program to SparkContext's parallelize() method.

    JavaRDD<String> lines = context.parallelize(Arrays.asList("pandas", "i like pandas"));
    System.out.println(lines.collect());

Output: [pandas, i like pandas]

3. RDD operations (the filter method)

RDDs support two types of operations: transformations and actions. A transformation returns a new RDD, e.g. map() and filter(). An action returns a result to the driver program or writes it to an external system and triggers the actual computation, e.g. count() and first().

    JavaRDD<String> inputRDD = context.textFile("D:\\log\\521.txt");
    JavaRDD<String> errorsRDD = inputRDD.filter(
        new Function<String, Boolean>() {
            @Override
            public Boolean call(String x) throws Exception {
                return x.contains("error");
            }
        });
    System.out.println("Errors display as: " + errorsRDD.collect());
    System.out.println("Number of errors: " + errorsRDD.count());

The file 521.txt is an Android logcat dump that contains many error messages.

4. Using lambda expressions

Starting with Java 8, lambda expressions can be used to implement the function interfaces concisely.

    JavaRDD<String> inputRDD = context.textFile("D:\\log\\521.txt");
    JavaRDD<String> errors = inputRDD.filter(s -> s.contains("error"));
    System.out.println(errors.count());

Output: 23

5. Using the map method

map() applies a function to each element of the RDD; the return values form a new RDD.

    JavaRDD<Integer> rdd = context.parallelize(Arrays.asList(1, 3, 5, 7));
    JavaRDD<Integer> result = rdd.map(
        new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer x) throws Exception {
                return x * x;
            }
        });
    System.out.println(StringUtils.join(result.collect(), ","));

Output: 1,9,25,49

6. Using the flatMap method

flatMap() applies a function to each element of the RDD and turns all the elements of the returned iterators into one new RDD; it is often used to split lines into words. The difference from map() is that the lists returned by the function are flattened, so the original nesting is removed.

    JavaRDD<String> lines = context.parallelize(Arrays.asList("hello world", "hi"));
    JavaRDD<String> words = lines.flatMap(
        new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" "));
            }
        });
    System.out.println(words.collect());
    System.out.println(words.first());

Output:
[hello, world, hi]
hello
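The same word-splitting can be written more compactly with the lambda syntax from section 4. A minimal sketch, assuming Spark 1.6.x, where FlatMapFunction.call returns an Iterable (Spark 2.x expects an Iterator instead):

    JavaRDD<String> lines = context.parallelize(Arrays.asList("hello world", "hi"));
    // The lambda stands in for the anonymous FlatMapFunction above and returns an Iterable<String>.
    JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
    System.out.println(words.collect()); // expected: [hello, world, hi]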
7. Using pair RDDs

Spark provides special operations for RDDs that contain key/value pairs, called pair RDDs. To convert a regular RDD into a pair RDD, call mapToPair() with a PairFunction.

    JavaRDD<String> lines = context.parallelize(Arrays.asList("Hello World", "Hangtian is from Hangzhou", "Hi", "Hi"));
    PairFunction<String, String, String> keyData = new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String x) throws Exception {
            return new Tuple2<String, String>(x.split(" ")[0], x);
        }
    };
    JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
    System.out.println(pairs.collect());

Output: [(Hello,Hello World), (Hangtian,Hangtian is from Hangzhou), (Hi,Hi), (Hi,Hi)]

8. Counting words

    JavaRDD<String> input = context.textFile("D:\\test.txt");
    JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String x) throws Exception {
            return Arrays.asList(x.split(" "));
        }
    });
    JavaPairRDD<String, Integer> wordsPair = words.mapToPair(new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String x) throws Exception {
            return new Tuple2<String, Integer>(x, 1);
        }
    });
    JavaPairRDD<String, Integer> result = wordsPair.reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer x, Integer y) throws Exception {
            return x + y;
        }
    });
    System.out.println(result.sortByKey().collect());

Output: [(,2), (are,1), (can,1), (go,1), (i,2), (love,1), (me,1), (much,1), (ok?,1), (should,1), (so,2), (with,1), (you,3)]

9. Using accumulators

Spark has two kinds of shared variables: accumulators and broadcast variables. Accumulators are used to aggregate information, while broadcast variables are used to distribute large objects efficiently. An accumulator provides a simple syntax for aggregating values from the worker nodes back into the driver program.

    JavaRDD<String> rdd = context.textFile("D:\\test.txt");
    final Accumulator<Integer> blankLines = context.accumulator(0);
    JavaRDD<String> callSigns = rdd.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String line) throws Exception {
            if (line.equals("")) {
                blankLines.add(1);
            }
            return Arrays.asList(line.split(" "));
        }
    });
    System.out.println(callSigns.collect());
    System.out.println("Blank lines: " + blankLines.value());

Output:
[i, love, you, so, much, , so, i, should, you, can, go, with, me, , are, you, ok?]
Blank lines: 2

10. Using Spark SQL

Spark SQL is used to work with structured and semi-structured data. Put simply, you can use SQL statements to query JSON and text files.

    JavaRDD<String> rdd = context.textFile("D:\\test.json");
    SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
    DataFrame dataFrame = sqlContext.read().json(rdd);
    dataFrame.registerTempTable("person");
    DataFrame resultDataFrame = sqlContext.sql("select * from person where lovesPandas=true");
    resultDataFrame.show(false);

Output:
+-----------+---------+
|lovesPandas|name     |
+-----------+---------+
|true       |nanchang |
|true       |qier     |
|true       |kongshuai|
+-----------+---------+

11. Using Spark Streaming

Spark Streaming is used to process data in near real time. Its constructor takes a batch interval that specifies how often newly received data is processed as a batch. The code below could not be run locally by the author; the idea is to use the netcat tool (listening on port 7778) as the input source and have the program print whatever it receives.

    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000)); // batch interval in ms; the original does not give a value, 1000 is illustrative
    JavaDStream<String> lines = jssc.socketTextStream("localhost", 7778);
    lines.print();
    jssc.start();
    jssc.awaitTermination();

To run this code, the SparkContext initialization from section 1 must also be removed, because the JavaStreamingContext creates its own SparkContext from the same conf and only one SparkContext may be active at a time.
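As a closing example, the word count from section 8 can also be written with the lambda syntax from section 4. A minimal, untested sketch, assuming Spark 1.6.x with Java 8 and the same `context` and test file as above:

    JavaRDD<String> input = context.textFile("D:\\test.txt");
    // Split lines into words, pair each word with a count of 1, then sum the counts per word.
    JavaRDD<String> words = input.flatMap(line -> Arrays.asList(line.split(" ")));
    JavaPairRDD<String, Integer> ones = words.mapToPair(word -> new Tuple2<String, Integer>(word, 1));
    JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);
    System.out.println(counts.sortByKey().collect());

This should produce the same result as the anonymous-class version in section 8.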
