Shared Variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
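For example, the following sketch (assuming an existing SparkContext named sc) increments an ordinary local variable inside foreach; each task works on its own copy of the variable, so the update never reaches the driver:

// Minimal sketch, assuming an existing SparkContext named sc.
var counter = 0
val data = sc.parallelize(1 to 100)

// The closure ships a copy of `counter` to each task; updates made on the
// executors are not propagated back to the driver.
data.foreach(x => counter += x)

// On a cluster this still prints 0 (local mode may behave differently).
println(counter)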
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Spark actions are executed through a set of stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.
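As a rough illustration of the cross-stage case, the sketch below broadcasts a small lookup table once and reuses it in two separate actions (two jobs); the table and variable names such as countryNames are only illustrative, not part of any particular API:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("BroadcastLookupSketch"))

    // Small lookup table that several jobs need.
    val countryNames = Map("CN" -> "China", "US" -> "United States", "DE" -> "Germany")
    val countryNamesBc = sc.broadcast(countryNames)

    val visits = sc.parallelize(Seq("CN", "US", "CN", "DE", "US", "US"))

    // First action: resolve codes to names using the broadcast copy on each executor.
    visits.map(code => countryNamesBc.value.getOrElse(code, "Unknown")).foreach(println)

    // Second action (a separate job): the same broadcast value is reused instead of
    // being shipped again with every task.
    val counts = visits.map(code => (countryNamesBc.value.getOrElse(code, "Unknown"), 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}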
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
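A minimal sketch of this API in the Scala shell (assuming an existing SparkContext named sc):

// Wrap the value; tasks should reference the wrapper, not the original variable.
val broadcastVar = sc.broadcast(Array(1, 2, 3))

// Access the wrapped value through the value method.
broadcastVar.value   // Array(1, 2, 3)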
Java version:
package cn.rzlee.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;

import java.util.Arrays;
import java.util.List;

/**
 * @author ^_^
 * @create 2018/11/3
 */
public class BroadcastVariable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Persist").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // In Java, a shared variable is created by calling the broadcast() method of the SparkContext.
        // The returned result is of type Broadcast<T>.
        final int factor = 3;
        final Broadcast<Integer> factorBroadcast = sc.broadcast(factor);

        List<Integer> numberList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
        JavaRDD<Integer> numbers = sc.parallelize(numberList);

        // Multiply each number in the collection by the externally defined factor.
        JavaRDD<Integer> multipleNumbers = numbers.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer v1) throws Exception {
                // When using the shared variable, call its value() method to obtain the value.
                int factor = factorBroadcast.value();
                return v1 * factor;
            }
        });

        multipleNumbers.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer integer) throws Exception {
                System.out.println(integer);
            }
        });

        sc.close();
    }
}
Scala version:
package cn.rzlee.spark.scala

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastVariable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)

    val factor = 3
    val factorBroadcast: Broadcast[Int] = sc.broadcast(factor)

    val numbersArray = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
    val numbers: RDD[Int] = sc.parallelize(numbersArray, 1)

    val multipleNumbers: RDD[Int] = numbers.map(num => num * factorBroadcast.value)
    multipleNumbers.foreach(num => println(num))

    sc.stop()
  }
}
Accumulators
Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they are displayed in the Spark UI, which is useful for understanding the progress of running stages (note: this is not yet supported in Python).
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, tasks cannot read its value; only the driver program can read the accumulator's value, using its value method.
The following code shows an accumulator being used to add up the elements of an array:
Java version:
package cn.rzlee.spark.core;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

import java.util.Arrays;
import java.util.List;

/**
 * @author ^_^
 * @create 2018/11/3
 */
public class AccumulatorVariable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("AccumulatorVariable").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create the accumulator variable by calling the accumulator() method of the SparkContext.
        Accumulator<Integer> sum = sc.accumulator(0);

        List<Integer> numbersList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 9, 8);
        JavaRDD<Integer> numbers = sc.parallelize(numbersList);

        numbers.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer integer) throws Exception {
                // Inside the function, call the add() method to accumulate the value.
                sum.add(integer);
            }
        });

        // In the driver program, call the accumulator's value() method to obtain its value.
        System.out.println(sum.value());
    }
}
Scala version:
package cn.rzlee.spark.scala

import org.apache.spark.{Accumulator, SparkConf, SparkContext}

object AccumulatorVariable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local")
    val sc = new SparkContext(conf)

    val sum: Accumulator[Int] = sc.accumulator(0)

    val numbersArray = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
    val numbers = sc.parallelize(numbersArray, 1)
    numbers.foreach(number => sum += number)

    println(sum.value)
  }
}
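As mentioned above, an accumulator created with a name shows up in the Spark UI. A small sketch of that variant, using the same pre-2.0 accumulator API as the examples above (assuming an existing SparkContext named sc; the name "elements sum" is only illustrative):

// Named accumulator: listed in the stage view of the Spark UI under "elements sum".
val namedSum = sc.accumulator(0, "elements sum")

sc.parallelize(1 to 9).foreach(n => namedSum += n)

// Only the driver can read the value.
println(namedSum.value)   // 45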