Use of broadcast variables and accumulators in Spark


One. Broadcast variables and accumulators

1.1 Broadcast variables:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with every task. They can be used, for example, to give every node a copy of a large input dataset in an efficient way. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Spark actions are executed through a series of stages, separated by distributed shuffle operations. Spark automatically broadcasts the common data needed by tasks within each stage; data broadcast this way is cached in serialized form and deserialized before each task runs. This means that explicitly creating a broadcast variable is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.
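
As a minimal sketch of the API (assuming an existing SparkContext named sc and a made-up blacklist; both are illustrative, not from the demos below), creating a broadcast variable and reading it inside tasks looks like this:

// Minimal broadcast-variable sketch, assuming an existing SparkContext `sc`.
val blacklist = sc.broadcast(Set("Hadoop", "Hive"))   // shipped once per executor, read-only

val words = sc.parallelize(Seq("Spark", "Hadoop", "Flink"))
// Tasks read the cached copy via .value instead of capturing a driver-side variable.
val allowed = words.filter(word => !blacklist.value.contains(word))
allowed.collect().foreach(println)                    // prints: Spark, Flink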

1.2 Accumulator:

An accumulator is a variable that is only "added to" through an associative operation and can therefore be efficiently supported in parallel. It can be used to implement counters and sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If you specify a name when creating an accumulator, it is displayed in Spark's web UI, which helps in understanding the progress of each execution stage. (Naming accumulators is not yet supported in Python.)
An accumulator is created by calling SparkContext.accumulator(v) on an initial value v. Tasks running on the cluster can then add to it using the add method or the += operator. However, tasks cannot read its value; only the driver program can, using the accumulator's value method.
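
A minimal sketch of that create/add/read cycle (again assuming an existing SparkContext sc; the Spark 1.x accumulator API shown here matches the demos below):

// Minimal accumulator sketch, assuming an existing SparkContext `sc` (Spark 1.x API).
val errorCount = sc.accumulator(0, "errorCount")      // the name appears in the Spark UI

val lines = sc.parallelize(Seq("ok", "ERROR", "ok", "ERROR"))
lines.foreach { line =>
  if (line == "ERROR") errorCount += 1                // tasks may only add, never read
}
println(errorCount.value)                             // only the driver reads it: prints 2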

Two. Java and Scala demos

2.1 Java version:
package com.streaming;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

/**
 * Blacklist filtering with a broadcast variable and an accumulator.
 * Neither is as simple as it looks, and used together they are very powerful.
 */
public class BroadcastAccumulator {

    /**
     * The broadcast blacklist; instantiated from the context on the driver.
     */
    private static volatile Broadcast<List<String>> broadcastList = null;

    /**
     * The counter; also instantiated from the context on the driver.
     */
    private static volatile Accumulator<Integer> accumulator = null;

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]")
                .setAppName("WordCountOnlineBroadcast");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));

        /*
         * Broadcast the blacklist to every executor. Note that without an
         * action, the broadcast is never actually sent out.
         */
        broadcastList = jsc.sparkContext().broadcast(Arrays.asList("Hadoop", "Mahout", "Hive"));

        /*
         * Global counter: counts how many blacklisted words are filtered online.
         */
        accumulator = jsc.sparkContext().accumulator(0, "OnlineBlacklistCounter");

        JavaReceiverInputDStream<String> lines = jsc.socketTextStream("Master", 9999);

        /*
         * flatMap is omitted here because each line is treated as a single word.
         */
        JavaPairDStream<String, Integer> pairs = lines.mapToPair(
                new PairFunction<String, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(String word) {
                        return new Tuple2<String, Integer>(word, 1);
                    }
                });

        JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(
                new Function2<Integer, Integer, Integer>() {
                    @Override
                    public Integer call(Integer v1, Integer v2) {
                        return v1 + v2;
                    }
                });

        /*
         * In Function2, the leading type parameters are the inputs and the
         * last one is the return type; all of them show up in the call method.
         * Here we work directly on the underlying RDD.
         */
        wordsCount.foreachRDD(new Function2<JavaPairRDD<String, Integer>, Time, Void>() {
            @Override
            public Void call(JavaPairRDD<String, Integer> rdd, Time time) throws Exception {
                rdd.filter(new Function<Tuple2<String, Integer>, Boolean>() {
                    @Override
                    public Boolean call(Tuple2<String, Integer> wordPair) throws Exception {
                        if (broadcastList.value().contains(wordPair._1)) {
                            /*
                             * The accumulator need not be used only for counting;
                             * its value can also be written to a database or Redis.
                             */
                            accumulator.add(wordPair._2);
                            return false;
                        } else {
                            return true;
                        }
                    }
                /*
                 * The broadcast and the accumulator only take effect once an
                 * action is executed, hence the collect() here.
                 */
                }).collect();

                System.out.println("Broadcast value: " + broadcastList.value());
                System.out.println("Accumulator value: " + accumulator.value());
                return null;
            }
        });

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}
2.2 Scala version:
package com.streaming

import java.util

import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.{Accumulator, SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

/**
 * Created by lxh on 2016/6/30.
 */
object BroadcastAccumulatorStreaming {

  /**
   * Declare a broadcast variable and an accumulator.
   */
  private var broadcastList: Broadcast[List[String]] = _
  private var accumulator: Accumulator[Int] = _

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local[4]").setAppName("broadcasttest")
    val sc = new SparkContext(sparkConf)

    /**
     * Duration is in milliseconds (the 5000 ms batch interval here is an
     * assumed value).
     */
    val ssc = new StreamingContext(sc, Duration(5000))

    // broadcastList = ssc.sparkContext.broadcast(util.Arrays.asList("Hadoop", "Spark"))
    broadcastList = ssc.sparkContext.broadcast(List("Hadoop", "Spark"))
    accumulator = ssc.sparkContext.accumulator(0, "broadcasttest")

    /**
     * Get the data.
     */
    val lines = ssc.socketTextStream("localhost", 9999)

    /**
     * What to do with the data:
     * 1. flatMap splits each line into words.
     * 2. map turns each word into a tuple (word, 1).
     * 3. reduceByKey accumulates the values.
     *    (Optionally sortByKey to rank the results.)
     * 4. filter: values of blacklisted words go into the accumulator.
     * 5. Print the result.
     */
    val words = lines.flatMap(line => line.split(" "))
    val wordPair = words.map(word => (word, 1))
    // wordPair.filter(record => broadcastList.value.contains(record._1)) // result unused

    val pair = wordPair.reduceByKey(_ + _)

    /**
     * Why go through foreachRDD first? Because pair is a
     * DStream[(String, Int)], and foreachRDD exposes each underlying RDD.
     */
    /* pair.foreachRDD(rdd => {
      rdd.filter(record => {
        if (broadcastList.value.contains(record._1)) {
          accumulator.add(1)
          true
        } else {
          false
        }
      })
    }) */

    val filteredPair = pair.filter(record => {
      if (broadcastList.value.contains(record._1)) {
        accumulator.add(record._2)
        true
      } else {
        false
      }
    })
    filteredPair.print()
    println("Accumulator value: " + accumulator.value)

    // pair.filter(record => broadcastList.value.contains(record._1))
    /* val keyPair = pair.map(pair => (pair._2, pair._1)) */

    /**
     * If a DStream lacks the operator you need, go through transform.
     */
    /* keyPair.transform(rdd => {
      rdd.sortByKey(false) // TODO
    }) */

    pair.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
