Use of broadcast variables and accumulators in Spark


One. Broadcast variables and accumulators

1.1 Broadcast variables:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with every task. They can be used, for example, to give every node a copy of a large input dataset in an efficient way. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Spark actions are executed through a series of stages, separated by distributed shuffle operations. Spark automatically broadcasts the common data needed by tasks within each stage; data broadcast this way is cached in serialized form and deserialized before each task runs. This means that explicitly creating a broadcast variable is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.
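
As a minimal sketch of the API (assuming an existing SparkContext named sc and a made-up blacklist; both are illustrative, not from the demos below), creating a broadcast variable and reading it inside tasks looks like this:

// Minimal broadcast-variable sketch, assuming an existing SparkContext `sc`.
val blacklist = sc.broadcast(Set("Hadoop", "Hive"))   // shipped once per executor, read-only

val words = sc.parallelize(Seq("Spark", "Hadoop", "Flink"))
// Tasks read the cached copy via .value instead of capturing a driver-side variable.
val allowed = words.filter(word => !blacklist.value.contains(word))
allowed.collect().foreach(println)                    // prints: Spark, Flink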

1.2 Accumulator:

An accumulator is a variable that is only "added to" through an associative operation and can therefore be efficiently supported in parallel. It can be used to implement counters and sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If you specify a name when creating an accumulator, it is displayed in Spark's web UI, which helps in understanding the progress of each execution stage. (Naming accumulators is not yet supported in Python.)
An accumulator is created by calling SparkContext.accumulator(v) on an initial value v. Tasks running on the cluster can then add to it using the add method or the += operator. However, tasks cannot read its value; only the driver program can, using the accumulator's value method.
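
A minimal sketch of that create/add/read cycle (again assuming an existing SparkContext sc; the Spark 1.x accumulator API shown here matches the demos below):

// Minimal accumulator sketch, assuming an existing SparkContext `sc` (Spark 1.x API).
val errorCount = sc.accumulator(0, "errorCount")      // the name appears in the Spark UI

val lines = sc.parallelize(Seq("ok", "ERROR", "ok", "ERROR"))
lines.foreach { line =>
  if (line == "ERROR") errorCount += 1                // tasks may only add, never read
}
println(errorCount.value)                             // only the driver reads it: prints 2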

Two. Java and Scala demos

2.1 Java version:
package com.streaming;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

/**
 * Blacklist filtering with a broadcast variable and an accumulator.
 * Neither is as simple as it looks, and used together they are very powerful.
 */
public class BroadcastAccumulator {

    /**
     * The broadcast blacklist; instantiated from the context on the driver.
     */
    private static volatile Broadcast<List<String>> broadcastList = null;

    /**
     * The counter; also instantiated from the context on the driver.
     */
    private static volatile Accumulator<Integer> accumulator = null;

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]")
                .setAppName("WordCountOnlineBroadcast");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));

        /*
         * Broadcast the blacklist to every executor. Note that without an
         * action, the broadcast is never actually sent out.
         */
        broadcastList = jsc.sparkContext().broadcast(Arrays.asList("Hadoop", "Mahout", "Hive"));

        /*
         * Global counter: counts how many blacklisted words are filtered online.
         */
        accumulator = jsc.sparkContext().accumulator(0, "OnlineBlacklistCounter");

        JavaReceiverInputDStream<String> lines = jsc.socketTextStream("Master", 9999);

        /*
         * flatMap is omitted here because each line is treated as a single word.
         */
        JavaPairDStream<String, Integer> pairs = lines.mapToPair(
                new PairFunction<String, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(String word) {
                        return new Tuple2<String, Integer>(word, 1);
                    }
                });

        JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(
                new Function2<Integer, Integer, Integer>() {
                    @Override
                    public Integer call(Integer v1, Integer v2) {
                        return v1 + v2;
                    }
                });

        /*
         * In Function2, the leading type parameters are the inputs and the
         * last one is the return type; all of them show up in the call method.
         * Here we work directly on the underlying RDD.
         */
        wordsCount.foreachRDD(new Function2<JavaPairRDD<String, Integer>, Time, Void>() {
            @Override
            public Void call(JavaPairRDD<String, Integer> rdd, Time time) throws Exception {
                rdd.filter(new Function<Tuple2<String, Integer>, Boolean>() {
                    @Override
                    public Boolean call(Tuple2<String, Integer> wordPair) throws Exception {
                        if (broadcastList.value().contains(wordPair._1)) {
                            /*
                             * The accumulator need not be used only for counting;
                             * its value can also be written to a database or Redis.
                             */
                            accumulator.add(wordPair._2);
                            return false;
                        } else {
                            return true;
                        }
                    }
                /*
                 * The broadcast and the accumulator only take effect once an
                 * action is executed, hence the collect() here.
                 */
                }).collect();

                System.out.println("Broadcast value: " + broadcastList.value());
                System.out.println("Accumulator value: " + accumulator.value());
                return null;
            }
        });

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}
2.2 Scala version:
package com.streaming

import java.util

import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.{Accumulator, SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

/**
 * Created by lxh on 2016/6/30.
 */
object BroadcastAccumulatorStreaming {

  /**
   * Declare a broadcast variable and an accumulator.
   */
  private var broadcastList: Broadcast[List[String]] = _
  private var accumulator: Accumulator[Int] = _

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local[4]").setAppName("broadcasttest")
    val sc = new SparkContext(sparkConf)

    /**
     * Duration is in milliseconds (the 5000 ms batch interval here is an
     * assumed value).
     */
    val ssc = new StreamingContext(sc, Duration(5000))

    // broadcastList = ssc.sparkContext.broadcast(util.Arrays.asList("Hadoop", "Spark"))
    broadcastList = ssc.sparkContext.broadcast(List("Hadoop", "Spark"))
    accumulator = ssc.sparkContext.accumulator(0, "broadcasttest")

    /**
     * Get the data.
     */
    val lines = ssc.socketTextStream("localhost", 9999)

    /**
     * What to do with the data:
     * 1. flatMap splits each line into words.
     * 2. map turns each word into a tuple (word, 1).
     * 3. reduceByKey accumulates the values.
     *    (Optionally sortByKey to rank the results.)
     * 4. filter: values of blacklisted words go into the accumulator.
     * 5. Print the result.
     */
    val words = lines.flatMap(line => line.split(" "))
    val wordPair = words.map(word => (word, 1))
    // wordPair.filter(record => broadcastList.value.contains(record._1)) // result unused

    val pair = wordPair.reduceByKey(_ + _)

    /**
     * Why go through foreachRDD first? Because pair is a
     * DStream[(String, Int)], and foreachRDD exposes each underlying RDD.
     */
    /* pair.foreachRDD(rdd => {
      rdd.filter(record => {
        if (broadcastList.value.contains(record._1)) {
          accumulator.add(1)
          true
        } else {
          false
        }
      })
    }) */

    val filteredPair = pair.filter(record => {
      if (broadcastList.value.contains(record._1)) {
        accumulator.add(record._2)
        true
      } else {
        false
      }
    })
    filteredPair.print()
    println("Accumulator value: " + accumulator.value)

    // pair.filter(record => broadcastList.value.contains(record._1))
    /* val keyPair = pair.map(pair => (pair._2, pair._1)) */

    /**
     * If a DStream lacks the operator you need, go through transform.
     */
    /* keyPair.transform(rdd => {
      rdd.sortByKey(false) // TODO
    }) */

    pair.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
