Spark Programming Guide


I studied Spark for a while last year; picking it up again this year, I found that I had forgotten a lot. So here I go back over the material on the official website to review it and record some notes.

Overview

At a high level, every Spark application consists of a driver program that runs the user's main function and executes a variety of parallel operations on a cluster. The core abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created from files in the Hadoop file system (or any other Hadoop-supported file system) or from existing Scala collections in the driver program, and they can be transformed into new RDDs. Users can also ask Spark to persist an RDD in memory so that it can be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

The second abstraction in Spark is shared variables. By default, when Spark runs a function in parallel as a set of tasks, it ships a copy of each variable used in the function to every task. Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables:
* Broadcast variables, which cache a value in memory on all nodes
* Accumulators, which are variables that can only be added to, such as counters and sums

Spark initialization

The first thing a Spark program must do is create a JavaSparkContext object, which tells Spark how to connect to the cluster. To create a JavaSparkContext you first need to build a SparkConf object, which contains information about your application.

/* Think of Spark as a supercar (it is very fast). The driver, where the user submits the
 * application, is like the pilot of a rocket; the SparkContext is the engine, which must be
 * started before anything can take off; and SparkConf provides the controls (steering wheel,
 * throttle, and so on) used to configure it. With these pieces in place, the vehicle is
 * essentially ready to go.
 */
SparkConf conf = new SparkConf()
        .setAppName("estimator test");
JavaSparkContext sc = new JavaSparkContext(conf);
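When experimenting outside a cluster it can be convenient to set the master URL directly on the SparkConf. The sketch below is only an illustration; in a real deployment the master URL is normally supplied by spark-submit rather than hard-coded.

// Sketch for local experimentation only; setMaster("local[*]") is an assumption for running
// outside a cluster, where spark-submit would normally supply the master URL.
SparkConf conf = new SparkConf()
        .setAppName("estimator test")
        .setMaster("local[*]");            // run locally using all available cores
JavaSparkContext sc = new JavaSparkContext(conf);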
Resilient Distributed Datasets (RDDs)

Spark revolves around the concept of the RDD, a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs (sketched in the code below):
* Parallelizing an existing collection in the driver program
* Referencing a dataset in external storage, such as a shared file system, HDFS, HBase, or any other data source that offers a Hadoop InputFormat
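A minimal sketch of both creation paths, assuming a JavaSparkContext named sc already exists, that java.util.Arrays and java.util.List are imported, and that a file data.txt is reachable from every worker:

// Assumes `sc` is an existing JavaSparkContext.
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);   // from a collection in the driver
JavaRDD<String> distFile = sc.textFile("data.txt"); // from external storage (local FS, HDFS, ...)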

RDD operations

RDDs support two types of operations:
* Transformations, which create a new RDD from an existing one. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. Functionally it resembles the map function in Java, except that the result here is an RDD.
* Actions, which return a value to the driver program after running a computation on the dataset. For example, the map above produces a new RDD; to actually do something with the results you can use reduce, an action that aggregates all the elements of the RDD with some function and returns the final result to the driver program. (The parallel reduceByKey, by contrast, returns a distributed result.)

Note that all transformations are lazy: when you call a transformation, nothing is computed; Spark merely remembers the transformations applied to the dataset. The transformations are only executed when an action requires a result to be returned to the driver program. To use a metaphor: suppose we are lazy students and the teacher assigns homework. We just write the assignments down and do nothing that evening. A few days later the teacher asks us to hand the homework in (a result is needed), and only then do we work through the list we wrote down.

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);

By default, each transformed RDD is recomputed every time you run an action on it. However, you can also persist an RDD in memory using the persist (or cache) method, in which case Spark keeps the elements around on the cluster so that the next query is much faster. You can also persist RDDs on disk, or replicate them across multiple nodes.

lineLengths.persist(StorageLevel.MEMORY_ONLY());
Passing functions to Spark

Spark's API relies heavily on passing functions from the driver program to run on the cluster. In Java, such a function is written by implementing one of the interfaces in the org.apache.spark.api.java.function package. There are two ways to create such functions:
* Implement the Function interface in your own class, either as an anonymous inner class or as a named class, and pass an instance of it to Spark.
* In Java 8, use a lambda expression to define the interface implementation concisely.
For example:

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
    public Integer call(String s) { return s.length(); }
});
int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});

The following style, using named top-level classes, is more verbose and not as convenient:

class GetLength implements Function<String, Integer> {
    public Integer call(String s) { return s.length(); }
}
class Sum implements Function2<Integer, Integer, Integer> {
    public Integer call(Integer a, Integer b) { return a + b; }
}

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
int totalLength = lineLengths.reduce(new Sum());
Understanding closures

One of the harder things about Spark is understanding the scope and life cycle of variables and methods when code is executed across a cluster. Modifying variables outside of their scope from RDD operations is a frequent source of confusion. The example below uses foreach() to increment a counter; similar issues arise with other RDD operations as well.

Example
int counter = 0;
JavaRDD<Integer> rdd = sc.parallelize(data);

// Wrong: don't do this!!
rdd.foreach(x -> counter += x);

System.out.println("Counter value: " + counter);

This is a simple example that sums the elements of an RDD, but its behaviour differs depending on whether execution happens inside a single JVM. For example, running it in local mode and running it on a cluster can give noticeably different results.
The code above may not work as expected. Spark breaks the processing of the RDD into tasks, each of which is executed by an executor. Before execution, Spark computes each task's closure.
The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). That is the official definition of a closure; it is awkward to translate, so it is quoted directly.
In the example above, the counter referenced inside the executors is not the counter on the driver node. A counter variable still exists on the driver node, but it is no longer visible to the executors. Each executor only sees the copy of counter shipped with its task's closure. After the program runs, the counter inside each executor has been updated, but the counter variable on the driver is still 0.

In local mode, the executors in the example above actually run in the same JVM as the driver, so the counter they see is the driver's counter, and the update appears to work.

If a well-defined, cluster-wide aggregation is needed, an accumulator should be used instead, as sketched below.
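A minimal sketch of the accumulator-based fix, assuming the same sc and data as in the example above and using the Accumulator API described later in this guide:

// Sketch: safely summing RDD elements across executors with an accumulator.
// Assumes `sc` (JavaSparkContext) and `data` (List<Integer>) from the example above.
Accumulator<Integer> counter = sc.accumulator(0);
JavaRDD<Integer> rdd = sc.parallelize(data);

rdd.foreach(x -> counter.add(x));                          // each task adds into the accumulator

System.out.println("Counter value: " + counter.value());  // read the merged result on the driver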

In general, closures (constructs such as loops or locally defined methods) should not be used to mutate global state. Spark does not define or guarantee the behaviour of mutations to objects referenced from outside a closure. Some code that does this may happen to work in local mode, but that is accidental and cannot be relied upon in distributed mode. Use an accumulator instead when a global aggregate is needed.

Printing elements of an RDD

Another common mistake is trying to print the elements of an RDD with rdd.foreach(println) or rdd.map(println). On a single machine this prints the expected output. In cluster mode, however, the stdout being written to is the stdout of the executors, not of the driver, so the driver's stdout shows no data at all. To print all elements on the driver, one approach is to first collect() the RDD to the driver node: rdd.collect().foreach(println). This can cause the driver to run out of memory, because collect() brings the entire RDD to one machine. If you only need to inspect a few elements, a safer approach is to use take(): rdd.take(n).foreach(println).
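In the Java API the same idea looks roughly like this sketch, assuming rdd is a JavaRDD<String>; the sample size of 10 is an arbitrary choice for illustration:

// Bring only a small, arbitrary sample of the RDD back to the driver before printing.
for (String element : rdd.take(10)) {
    System.out.println(element);
}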

Working with key-value pairs

While most Spark operations work on RDDs containing any type of object, a few special operations are only available on RDDs of key-value pairs.
In Java, a key-value pair is represented by the scala.Tuple2 class. You create a tuple with new Tuple2(a, b) and access its fields with tuple._1() and tuple._2().
RDDs of key-value pairs are represented by the JavaPairRDD class. You build JavaPairRDDs from ordinary RDDs using special versions of map, such as mapToPair and flatMapToPair.

For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line occurs in a text file:

JavaRDD<String> lines = sc.textFile("data.txt"new1));JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

We could also use counts.sortByKey() to sort the pairs by key, and finally counts.collect() to bring the results back to the driver program as an array.
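Continuing the example above, a short sketch of sorting the counts pair RDD and printing the collected results on the driver (assumes java.util.List and scala.Tuple2 are imported):

// Sort by line text, then bring the (line, count) pairs back to the driver and print them.
List<Tuple2<String, Integer>> output = counts.sortByKey().collect();
for (Tuple2<String, Integer> pair : output) {
    System.out.println(pair._1() + ": " + pair._2());
}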

Transformations
Transformation: Meaning
map(func): Returns a new RDD formed by passing each element of the source RDD through the function func.
filter(func): Returns a new RDD formed by selecting the elements of the source on which func returns true; elements for which it returns false are discarded.
flatMap(func): Similar to map, but each input element can be mapped to zero or more output elements, so func returns a collection.
mapPartitions(func): Similar to map, but runs separately on each partition, so func receives an iterator over the elements of a partition.
mapPartitionsWithIndex(func): Similar to mapPartitions, but func also receives an integer value representing the index of the partition.
sample(withReplacement, fraction, seed): Samples a fraction of the data, with or without replacement, using the given random seed, and returns the sample as a new RDD.
union(otherDataset): Returns a new RDD containing the union of the elements of two RDDs of the same type.
intersection(otherDataset): Returns a new RDD containing the intersection of the elements of two RDDs of the same type.
distinct([numTasks]): Removes duplicate elements from the RDD.
groupByKey([numTasks]): Returns an RDD of (K, Iterable) key-value pairs. Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will give much better performance. Note: by default, the level of parallelism of the output depends on the number of partitions of the parent RDD; you can pass an optional numTasks argument to set a different number of tasks.
reduceByKey(func, [numTasks]): For an RDD of (K, V) pairs, merges the values for each key using the function func, producing a single value per key.
sortByKey([ascending], [numTasks]): Returns an RDD of key-value pairs sorted by key.
join(otherDataset, [numTasks]): Joins two key-value RDDs. It is built on cogroup: after the cogroup, the elements under each key are combined with a per-key Cartesian product, and the results are flattened.
cogroup(otherDataset, [numTasks]): Co-partitions the two RDDs; for each key, the elements from each RDD are aggregated into a collection, and an RDD of (key, (Iterable, Iterable)) pairs over both inputs is returned.
cartesian(otherDataset): Performs a Cartesian product over all elements of the two RDDs.
pipe(command, [envVars]): Pipes each partition of the RDD through the given shell command; RDD elements are written to the process's stdin, and lines written to its stdout are returned as an RDD of strings.
coalesce(numPartitions): Reduces the number of partitions of the RDD, which allows the data to be operated on more efficiently, for example after filtering down a large dataset.
repartition(numPartitions): Changes the number of partitions of the RDD's data.
repartitionAndSortWithinPartitions(partitioner): Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by key; this is more efficient than calling repartition and then sorting.
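A short sketch exercising a few of the transformations above. It assumes a JavaSparkContext named sc and the Spark 1.x Java API, in which flatMap takes a function returning an Iterable:

// Assumes `sc` is an existing JavaSparkContext (Spark 1.x Java API).
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<String> nonEmpty = lines.filter(s -> !s.isEmpty());                  // drop blank lines
JavaRDD<String> words = nonEmpty.flatMap(s -> Arrays.asList(s.split(" "))); // one element per word
JavaRDD<String> uniqueWords = words.distinct();                              // remove duplicates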
Actions

Essentially, an action triggers execution of the RDD DAG: calling an action submits a job through SparkContext's runJob operation.

Action: Meaning
reduce(func): Aggregates the elements of the dataset using the function func, which should be commutative and associative so that it can be computed correctly in parallel.
collect(): Returns all the elements of the dataset as an array to the driver program; typically used after an operation such as filter that returns a sufficiently small subset of the data.
count(): Returns the number of elements in the dataset.
first(): Returns the first element of the dataset.
take(n): Returns the first n elements of the dataset.
takeSample(withReplacement, num, [seed]): Returns a random sample of num elements of the dataset.
takeOrdered(n, [ordering]): Returns the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path): Writes the elements of the dataset as text files in the given directory of the local file system, HDFS, or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text.
saveAsSequenceFile(path) (Java and Scala): Writes the elements of the dataset as a Hadoop SequenceFile in the given path on the local file system, HDFS, or any other Hadoop-supported file system. This is useful for key-value data that Hadoop can read back.
saveAsObjectFile(path) (Java and Scala): Writes the elements of the dataset in a simple format using Java serialization; they can be read back with SparkContext.objectFile().
countByKey(): For RDDs of key-value pairs, returns the number of elements for each distinct key.
foreach(func): Applies the function func to each element of the RDD; it returns neither an RDD nor an array, just Unit.
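A brief sketch of a few actions from the table, continuing with the counts pair RDD built earlier; the output path is purely illustrative:

// Assumes `counts` is the JavaPairRDD<String, Integer> built earlier.
long numberOfLines = counts.count();                 // how many distinct lines were seen
Tuple2<String, Integer> firstPair = counts.first();  // one (line, count) pair
counts.saveAsTextFile("/tmp/line-counts");           // illustrative output directory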
Shuffle operation

Spark has one particularly important operation: the shuffle. The shuffle is Spark's mechanism for redistributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, which makes the shuffle a complex and costly operation.

Background

To understand what happens during a shuffle, consider the reduceByKey operation as an example. reduceByKey produces a new RDD in which all the values for a single key are combined into one pair: the key and the result of running the reduce function over all values associated with that key. The problem is that not all values for a key necessarily sit in the same partition, or even on the same machine, yet they must be brought together in the same partition to compute the result.

During a Spark computation, a single task operates on a single partition. To organize all the data for a single reduceByKey reduce task, Spark must therefore perform an all-to-all operation: it reads from every partition to find all the values for all keys, and then brings the values for each key together across partitions to compute the final result for that key. This is called the shuffle.

Although the set of elements in each partition of the newly shuffled data is deterministic, and so is the ordering of the partitions themselves, the ordering of the elements within a partition is not. If you need predictably ordered data after a shuffle, the following can help (see the sketch after this list):

* mapPartitions, to sort each partition, for example with .sorted
* repartitionAndSortWithinPartitions, to sort partitions efficiently while simultaneously repartitioning
* sortBy, to produce a globally ordered RDD
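A minimal sketch of the second option, assuming pairs is a JavaPairRDD<String, Integer> as in the earlier example; the choice of a HashPartitioner with 4 partitions is only illustrative:

// Repartition into 4 hash partitions and sort records by key within each partition.
// Requires org.apache.spark.HashPartitioner; the partition count of 4 is arbitrary.
JavaPairRDD<String, Integer> sortedWithinPartitions =
        pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4));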

Operations that involve a shuffle include:
* Repartitioning operations: repartition and coalesce
* ByKey operations: groupByKey and reduceByKey
* Join operations: cogroup and join

Performance

The shuffle involves a large amount of disk I/O, data serialization, and network I/O. Its behaviour can be tuned by adjusting a variety of configuration parameters; see the Spark Configuration guide for details, and the sketch below for how such parameters are set.
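Shuffle-related settings are applied through SparkConf (or via spark-submit --conf). A hypothetical sketch using the existing spark.shuffle.compress key as an example; the value shown is only illustrative:

// Illustrative only: tune one shuffle-related setting via SparkConf.
// spark.shuffle.compress is an existing Spark configuration key; "true" is its usual default.
SparkConf conf = new SparkConf()
        .setAppName("shuffle tuning example")
        .set("spark.shuffle.compress", "true");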

RDD Persistence

One of the most important capabilities of Spark is persisting (caching) a dataset in memory across operations. When you persist an RDD, each node stores the partitions of it that it computes in memory and reuses them in other operations on that dataset. This makes future operations much faster (often by around 10x). Caching is a key tool for iterative algorithms and for fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() method. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.

In addition, each persisted RDD can be stored using a different storage level: for example, persisted on disk, persisted in memory but as serialized Java objects, replicated across nodes, or stored off-heap in the Tachyon in-memory file system. These levels are set by passing a StorageLevel object to persist(). The cache() method is shorthand for the default storage level, StorageLevel.MEMORY_ONLY (deserialized objects stored in memory). The full set of storage levels is as follows:

Storage Level: Meaning
MEMORY_ONLY: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
MEMORY_AND_DISK: Store the RDD as deserialized Java objects in the JVM. Partitions that do not fit in memory are spilled to disk and read from there when needed.
MEMORY_ONLY_SER: Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than storing deserialized objects, especially with a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed on the fly.
DISK_ONLY: Store the RDD partitions only on disk.
OFF_HEAP (experimental): Store the RDD in serialized form in Tachyon.
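A sketch contrasting cache() with an explicit storage level; the choice of MEMORY_AND_DISK_SER here is only an example:

// Explicitly choose a storage level instead of the cache() default (MEMORY_ONLY).
JavaRDD<String> lines = sc.textFile("data.txt");
lines.persist(StorageLevel.MEMORY_AND_DISK_SER()); // serialized in memory, spilling to disk
// lines.cache();                                  // equivalent to persist(StorageLevel.MEMORY_ONLY())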
Which storage level should you choose?

Spark's storage levels offer different trade-offs between memory usage and CPU efficiency. A few recommendations:
* If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option and lets operations on the RDDs run as fast as possible.
* Otherwise, try MEMORY_ONLY_SER together with a fast serialization library; this is much more space-efficient and still reasonably fast to access.
* Don't spill to disk unless the functions that compute your datasets are expensive or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
* Use the replicated storage levels if you want fast fault recovery (for example, when using Spark to serve requests from a web application). All storage levels provide full fault tolerance by recomputing lost data, but replication lets you keep running tasks on the RDD without waiting for a lost partition to be recomputed.
* In environments with large amounts of memory or many applications, the experimental OFF_HEAP mode has several advantages:
  * It allows multiple executors to share the same pool of memory in Tachyon
  * It significantly reduces garbage collection costs
  * Cached data is not lost if an individual executor crashes

Removing data

Spark automatically monitors cache usage and drops old data partitions in a least-recently-used (LRU) fashion. If you want to remove an RDD from memory manually, use the RDD.unpersist() method.

Shared variables: broadcast variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data needed by tasks within each stage. Data broadcast this way is cached in serialized form and deserialized before each task runs. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.

A broadcast variable is created from a variable v by calling SparkContext.broadcast(v). Its value can be read by calling value(). The code is as follows:

Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
broadcastVar.value();  // returns [1, 2, 3]
Accumulator

Accumulators are variables that can only be "added" to through an associative operation, and can therefore be efficiently supported in parallel. They can be used to implement counters or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If an accumulator is created with a name, it will be displayed in Spark's UI.

Accumulator<Integer> accum = sc.accumulator(0);

sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> accum.add(x));
// ...
// 10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

accum.value();  // returns 10
class VectorAccumulatorParam implements AccumulatorParam<Vector> {
  public Vector zero(Vector initialValue) {
    return Vector.zeros(initialValue.size());
  }
  public Vector addInPlace(Vector v1, Vector v2) {
    v1.addInPlace(v2);
    return v1;
  }
}

// Then, create an Accumulator of this type:
Accumulator<Vector> vecAccum = sc.accumulator(new Vector(...), new VectorAccumulatorParam());
