Common ways to use Java programming in Spark


Source: http://blog.sina.com.cn/s/blog_628cc2b70102w9up.html

1. Initializing the SparkContext

System.setProperty("hadoop.home.dir", "D:\\spark-1.6.1-bin-hadoop2.6\\spark-1.6.1-bin-hadoop2.6");
SparkConf conf = new SparkConf().setAppName("Spark test1").setMaster("local[2]");
JavaSparkContext context = new JavaSparkContext(conf);

2. Using the parallelize() method

The simplest way to create an RDD is to pass an existing collection in your program to SparkContext's parallelize() method.

JavaRDD<String> lines = context.parallelize(Arrays.asList("Pandas", "I like Pandas"));
System.out.println(lines.collect());

Output: [Pandas, I like Pandas]

3. RDD operations (the filter() method)

RDDs support two types of operations: transformations and actions. A transformation, such as map() or filter(), returns a new RDD. An action, such as count() or first(), returns a result to the driver program or writes it to an external system, and it triggers the actual computation.

JavaRDD<String> inputRDD = context.textFile("D:\\log\\521.txt");
JavaRDD<String> errorsRDD = inputRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String x) throws Exception {
        return x.contains("error");
    }
});
System.out.println("Errors: " + errorsRDD.collect());
System.out.println("Number of errors: " + errorsRDD.count());

521.txt is an Android logcat file that contains many error messages.

4. Using lambda expressions

Java 8 supports lambda expressions, which can be used to implement the function interfaces concisely.

JavaRDD<String> inputRDD = context.textFile("D:\\log\\521.txt");
JavaRDD<String> errors = inputRDD.filter(s -> s.contains("error"));
System.out.println(errors.count());

Output: 23

5. Using the map() method

map() applies a function to each element of the RDD; the return values form a new RDD.

JavaRDD<Integer> rdd = context.parallelize(Arrays.asList(1, 3, 5, 7));
JavaRDD<Integer> result = rdd.map(new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer x) throws Exception {
        return x * x;
    }
});
System.out.println(StringUtils.join(result.collect(), ","));

Output: 1,9,25,49

6. Using the flatMap() method

flatMap() applies a function to each element of the RDD and turns all elements of the returned iterators into a new RDD; it is often used to split lines into words. The difference from map() is that the returned lists are flattened, so the original nesting is lost.

JavaRDD<String> lines = context.parallelize(Arrays.asList("hello world", "hi"));
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String line) throws Exception {
        return Arrays.asList(line.split(" "));
    }
});
System.out.println(words.collect());
System.out.println(words.first());

Output: [hello, world, hi]
hello
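Note (not from the original post, which targets Spark 1.6.1): starting with Spark 2.0 the FlatMapFunction interface returns an Iterator instead of an Iterable, so the flatMap examples above need a small adjustment. A minimal sketch, reusing the lines RDD from the flatMap example above:

// Spark 2.x variant: FlatMapFunction.call() returns an Iterator (java.util.Iterator),
// not an Iterable, so wrap the list with .iterator().
JavaRDD<String> words2x = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterator<String> call(String line) throws Exception {
        return Arrays.asList(line.split(" ")).iterator();
    }
});

// The same operation written as a Java 8 lambda:
JavaRDD<String> wordsLambda = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());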
7. Using pair RDDs

Spark provides special operations for RDDs that contain key/value pairs; these are called pair RDDs. To turn a regular RDD into a pair RDD in Java, call mapToPair() with a PairFunction.

JavaRDD<String> lines = context.parallelize(Arrays.asList("Hello World", "Hangtian are from Hangzhou", "Hi", "Hi"));
PairFunction<String, String, String> keyData = new PairFunction<String, String, String>() {
    @Override
    public Tuple2<String, String> call(String x) throws Exception {
        return new Tuple2<String, String>(x.split(" ")[0], x);
    }
};
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
System.out.println(pairs.collect());

Output: [(Hello,Hello World), (Hangtian,Hangtian are from Hangzhou), (Hi,Hi), (Hi,Hi)]

8. Counting words

JavaRDD<String> input = context.textFile("D:\\test.txt");
JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String x) throws Exception {
        return Arrays.asList(x.split(" "));
    }
});
JavaPairRDD<String, Integer> wordsPair = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String x) throws Exception {
        return new Tuple2<String, Integer>(x, 1);
    }
});
JavaPairRDD<String, Integer> result = wordsPair.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer x, Integer y) throws Exception {
        return x + y;
    }
});
System.out.println(result.sortByKey().collect());

Output: [(,2), (are,1), (can,1), (go,1), (i,2), (love,1), (me,1), (much,1), (ok?,1), (should,1), (so,2), (with,1), (you,3)]

9. Using accumulators

Spark has two kinds of shared variables: accumulators and broadcast variables. Accumulators are used to aggregate information, while broadcast variables are used to efficiently distribute large objects. An accumulator provides a simple syntax for aggregating values from the worker nodes back into the driver program.

JavaRDD<String> rdd = context.textFile("D:\\test.txt");
final Accumulator<Integer> blankLines = context.accumulator(0);
JavaRDD<String> callSigns = rdd.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String line) throws Exception {
        if (line.equals("")) {
            blankLines.add(1);
        }
        return Arrays.asList(line.split(" "));
    }
});
System.out.println(callSigns.collect());
System.out.println("Blank lines: " + blankLines.value());

Output: [i, love, you, so, much, , so, i, should, you, can, go, with, me, , are, you, ok?]
Blank lines: 2

10. Using Spark SQL

Spark provides Spark SQL for working with structured and semi-structured data. Simply put, you can use SQL statements to query data in JSON and text files.

JavaRDD<String> rdd = context.textFile("D:\\test.json");
SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
DataFrame dataFrame = sqlContext.read().json(rdd);
dataFrame.registerTempTable("person");
DataFrame resultDataFrame = sqlContext.sql("select * from person where lovespandas=true");
resultDataFrame.show(false);

Output:
+-----------+---------+
|lovespandas|name     |
+-----------+---------+
|true       |nanchang |
|true       |qier     |
|true       |kongshuai|
+-----------+---------+

11. Using Spark Streaming

Spark Streaming is used to process data in near real time; the streaming context's constructor takes a batch interval that specifies how often newly arrived data is processed as a batch. The code below could not be run successfully on the author's local machine; the idea is to use the netcat tool as the input source and have the program print and process whatever is typed in.

JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000)); // batch interval in milliseconds
JavaDStream<String> lines = jssc.socketTextStream("localhost", 7778);
lines.print();
jssc.start();
jssc.awaitTermination();

Running the code above also requires removing the SparkContext initialization at the top of this post, since only one SparkContext can be active in a JVM at a time.
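The original post does not show a complete streaming run, so here is a minimal, self-contained sketch of how the example is usually wired up end to end. It assumes a netcat listener started beforehand (for example with nc -lk 7778) and a local[2] master; the class name NetworkWordCount is illustrative only and does not come from the original post.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class NetworkWordCount { // illustrative class name, not from the original post
    public static void main(String[] args) throws InterruptedException {
        // At least two local threads: one to receive from the socket, one to process batches.
        SparkConf conf = new SparkConf().setAppName("Spark streaming test").setMaster("local[2]");
        // 1000 ms batch interval: each second of socket input becomes one mini-batch.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));
        // Text stream from a netcat server started beforehand, e.g.: nc -lk 7778
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 7778);
        lines.print();            // print the first elements of each batch
        jssc.start();             // start receiving and processing
        jssc.awaitTermination();  // block until the streaming job is stopped
    }
}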
