Shared Variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
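For example, the following sketch (assuming an existing SparkContext named sc) increments an ordinary local variable inside foreach; each task works on its own copy of the variable, so the update never reaches the driver:

// Minimal sketch, assuming an existing SparkContext named sc.
var counter = 0
val data = sc.parallelize(1 to 100)

// The closure ships a copy of `counter` to each task; updates made on the
// executors are not propagated back to the driver.
data.foreach(x => counter += x)

// On a cluster this still prints 0 (local mode may behave differently).
println(counter)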
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Spark actions are executed through a set of stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.
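As a rough illustration of the cross-stage case, the sketch below broadcasts a small lookup table once and reuses it in two separate actions (two jobs); the table and variable names such as countryNames are only illustrative, not part of any particular API:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("BroadcastLookupSketch"))

    // Small lookup table that several jobs need.
    val countryNames = Map("CN" -> "China", "US" -> "United States", "DE" -> "Germany")
    val countryNamesBc = sc.broadcast(countryNames)

    val visits = sc.parallelize(Seq("CN", "US", "CN", "DE", "US", "US"))

    // First action: resolve codes to names using the broadcast copy on each executor.
    visits.map(code => countryNamesBc.value.getOrElse(code, "Unknown")).foreach(println)

    // Second action (a separate job): the same broadcast value is reused instead of
    // being shipped again with every task.
    val counts = visits.map(code => (countryNamesBc.value.getOrElse(code, "Unknown"), 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}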
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
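A minimal sketch of this API in the Scala shell (assuming an existing SparkContext named sc):

// Wrap the value; tasks should reference the wrapper, not the original variable.
val broadcastVar = sc.broadcast(Array(1, 2, 3))

// Access the wrapped value through the value method.
broadcastVar.value   // Array(1, 2, 3)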
Java version:
package cn.rzlee.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;

import java.util.Arrays;
import java.util.List;

/**
 * @author ^_^
 * @create 2018/11/3
 */
public class BroadcastVariable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Persist").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // In Java, a shared variable is created by calling the broadcast() method of the SparkContext.
        // The returned result is of type Broadcast<T>.
        final int factor = 3;
        final Broadcast<Integer> factorBroadcast = sc.broadcast(factor);

        List<Integer> numberList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
        JavaRDD<Integer> numbers = sc.parallelize(numberList);

        // Multiply each number in the collection by the externally defined factor.
        JavaRDD<Integer> multipleNumbers = numbers.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer v1) throws Exception {
                // When using the shared variable, call its value() method to obtain the value.
                int factor = factorBroadcast.value();
                return v1 * factor;
            }
        });

        multipleNumbers.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer integer) throws Exception {
                System.out.println(integer);
            }
        });

        sc.close();
    }
}
Scala version:
package cn.rzlee.spark.scala

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastVariable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)

    val factor = 3
    val factorBroadcast: Broadcast[Int] = sc.broadcast(factor)

    val numbersArray = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
    val numbers: RDD[Int] = sc.parallelize(numbersArray, 1)

    val multipleNumbers: RDD[Int] = numbers.map(num => num * factorBroadcast.value)
    multipleNumbers.foreach(num => println(num))

    sc.stop()
  }
}
Accumulators
Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they are displayed in the Spark UI, which is useful for understanding the progress of running stages (note: this is not yet supported in Python).
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, tasks cannot read its value; only the driver program can read the accumulator's value, using its value method.
The following code shows an accumulator being used to add up the elements of an array:
Java version:
package cn.rzlee.spark.core;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

import java.util.Arrays;
import java.util.List;

/**
 * @author ^_^
 * @create 2018/11/3
 */
public class AccumulatorVariable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("AccumulatorVariable").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create the accumulator variable by calling the accumulator() method of the SparkContext.
        Accumulator<Integer> sum = sc.accumulator(0);

        List<Integer> numbersList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 9, 8);
        JavaRDD<Integer> numbers = sc.parallelize(numbersList);

        numbers.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer integer) throws Exception {
                // Inside the function, call the add() method to accumulate the value.
                sum.add(integer);
            }
        });

        // In the driver program, call the accumulator's value() method to obtain its value.
        System.out.println(sum.value());
    }
}
Scala version:
package cn.rzlee.spark.scala

import org.apache.spark.{Accumulator, SparkConf, SparkContext}

object AccumulatorVariable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local")
    val sc = new SparkContext(conf)

    val sum: Accumulator[Int] = sc.accumulator(0)

    val numbersArray = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
    val numbers = sc.parallelize(numbersArray, 1)
    numbers.foreach(number => sum += number)

    println(sum.value)
  }
}
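As mentioned above, an accumulator created with a name shows up in the Spark UI. A small sketch of that variant, using the same pre-2.0 accumulator API as the examples above (assuming an existing SparkContext named sc; the name "elements sum" is only illustrative):

// Named accumulator: listed in the stage view of the Spark UI under "elements sum".
val namedSum = sc.accumulator(0, "elements sum")

sc.parallelize(1 to 9).foreach(n => namedSum += n)

// Only the driver can read the value.
println(namedSum.value)   // 45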