Brief introduction
In general, each Spark application consists of a driver program that runs the user's main function and performs a variety of parallel operations on a cluster. The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs can be created from a file in the Hadoop file system (or any other Hadoop-supported file system), or by transforming an existing Scala collection in the driver program. Users can also ask Spark to persist an RDD in memory so that it can be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
The second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to every task. Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which cache a value in memory on all nodes, and accumulators, which are variables that can only be added to, such as counters and sums.
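As a quick illustration, here is a minimal sketch of creating each kind of shared variable through the Java API; it assumes a JavaSparkContext named sc, which is created as described later in this guide.
import org.apache.spark.Accumulator;
import org.apache.spark.broadcast.Broadcast;
Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});  // read-only value cached on every node
int[] values = broadcastVar.value();                                // access the broadcast value
Accumulator<Integer> accum = sc.accumulator(0);                     // add-only counter
accum.add(5);                                                       // tasks add to it; only the driver reads accum.value()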
This guide shows each of these features in each of the languages supported by Spark. It is easiest to follow along if you launch Spark's interactive shell: either bin/spark-shell for the Scala shell or bin/pyspark for the Python one.
Linking with Spark (Java)
Spark 1.0.2 works with Java 6 or higher. If you are using Java 8, Spark supports lambda expressions for concisely writing functions; otherwise you can use the classes in the org.apache.spark.api.java.function package.
To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central with the following coordinates:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.0.2
In addition, if you want to access an HDFS cluster, you need to add a dependency on hadoop-client that matches your version of HDFS. Some commonly used HDFS version tags are listed on the third-party distributions page.
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
Finally, you need to import some Spark classes into your program by adding the following lines:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
Initializing Spark (Java)
The first thing a Spark program must do is create a JavaSparkContext object, which tells Spark how to access a cluster. To create a JavaSparkContext you first need to build a SparkConf object that contains information about your application.
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
The appName parameter is the name of your application and will be shown in the cluster UI. master is a Spark, Mesos, or YARN cluster URL, or the special string "local" to run in local mode. In practice, when running on a cluster you will not want to hardcode master in the program; instead, launch the application with spark-submit and receive the master there. However, for local tests and unit tests, you can pass "local" to run Spark in-process.
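A minimal sketch of that recommendation, using a hypothetical application name "MyApp": omit setMaster() in the code and let spark-submit supply the master URL.
SparkConf conf = new SparkConf().setAppName("MyApp");  // no setMaster(): spark-submit provides the master
JavaSparkContext sc = new JavaSparkContext(conf);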
Resilient distributed datasets (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing a collection that already exists in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Parallel collections
Parallelized collections are created by calling JavaSparkContext's parallelize method on a collection that already exists in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 through 5:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we can call distData.reduce((a, b) -> a + b) to add up the elements of the list. We describe operations on distributed datasets later.
Note: in this guide we often use the concise Java 8 lambda syntax to define Java functions, but in older Java versions you can instead implement the interfaces in the org.apache.spark.api.java.function package. We describe passing functions to Spark in more detail below.
Another important parameter for a parallelized collection is the number of slices to cut the dataset into. Spark runs one task for each slice of the cluster. Typically you want 2 to 4 slices for each CPU in your cluster. Normally, Spark tries to set the number of slices automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (for example: sc.parallelize(data, 10)).
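Putting this together, a short sketch (Java 8 syntax; the slice count of 10 is purely illustrative) that parallelizes a list with an explicit slice count and then reduces it:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data, 10);  // cut the dataset into 10 slices
int sum = distData.reduce((a, b) -> a + b);            // 15, computed in parallel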
External data sets
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, and so on. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext's textFile method. This method takes the URI of a file (either a local path on the machine, or a URI such as hdfs:// or s3n://) and reads it as a collection of lines. Here is an example invocation:
JavaRDD<String> distFile = sc.textFile("data.txt");
Once created, distFile can be acted on by dataset operations. For example, we can add up the lengths of all the lines using map and reduce, like so: distFile.map(s -> s.length()).reduce((a, b) -> a + b).
Some notes on reading files with Spark:
- If you use a path on the local file system, the file must also be accessible at the same path on the worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
- All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
- The textFile method also takes an optional second parameter to control the number of slices for the file. By default, Spark creates one slice for each block of the file (blocks are 64MB by default in HDFS), but you can request more slices by passing a larger value. Note that you cannot have fewer slices than blocks. A short sketch follows this list.
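For example, a hedged sketch of requesting extra slices when reading a file (the path and the slice count are hypothetical):
JavaRDD<String> distFile = sc.textFile("hdfs://namenode:9000/data.txt", 8);  // ask for at least 8 slices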
In addition to text files, Spark's Java API also supports several other data formats:
- JavaSparkContext.wholeTextFiles lets you read a directory containing multiple small text files and returns each of them as (filename, content) pairs. This is in contrast with textFile, which returns one record per line of each file.
- For SequenceFiles, use JavaSparkContext's sequenceFile method, where K and V are the types of the key and value in the file. They must be subclasses of Hadoop's Writable interface, such as IntWritable and Text. (See the sketch after this list.)
- For other Hadoop InputFormats, you can use the JavaSparkContext.hadoopRDD method, which takes an arbitrary JobConf and the input format class, key class, and value class. Set these the same way you would for a Hadoop job with your input source. You can also use JavaSparkContext.newAPIHadoopRDD for InputFormats based on the "new" MapReduce API (org.apache.hadoop.mapreduce).
- JavaRDD.saveAsObjectFile and JavaSparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats such as Avro, it offers an easy way to save any RDD.
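As a hedged illustration of the first two bullets above, the sketch below reads a directory of small files and a SequenceFile through the Java API; the paths are hypothetical, and Text and IntWritable come from org.apache.hadoop.io.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
JavaPairRDD<String, String> files = sc.wholeTextFiles("/my/directory");  // (filename, content) pairs
JavaPairRDD<Text, IntWritable> seqData = sc.sequenceFile("/my/seqfile", Text.class, IntWritable.class);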
RDD operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each element of the dataset through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of an RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy; that is, they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset, such as a file. The transformations are only computed when an action requires a result to be returned to the driver program. This design lets Spark run more efficiently: for example, a dataset created through map and used only in a reduce returns just the result of the reduce to the driver, rather than the entire large mapped dataset.
By default, each transformed RDD is recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark keeps the relevant elements around on the cluster so that the next time you access the RDD, it is much faster. Persisting datasets on disk, or replicating them across the cluster, is also supported.
Basic operations
To illustrate RDD basics, consider the simple program below:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
The first line defines a base RDD from an external file. This dataset is not loaded into memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not computed immediately, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines; each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
If we want to re-use lineLengths later, we can add:
lineLengths.persist();
before the reduce; this would cause lineLengths to be saved in memory after the first time it is computed.
Passing functions to Spark
Spark's API relies heavily on passing functions in the driver program to run on the cluster. In Java, functions are represented by classes implementing the interfaces in the org.apache.spark.api.java.function package. There are two ways to create such functions:
- Implement the Function interfaces in your own class, either as an anonymous inner class or as a named class, and pass an instance of it to Spark.
- In Java 8, use lambda expressions to concisely define an implementation.
While much of this guide uses the lambda syntax for brevity, it is easy to use all the same APIs in long form. For example, we could have written the code above as follows:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});
int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
Or, if writing the functions inline is unwieldy:
class GetLength implements Function<String, Integer> {
  public Integer call(String s) { return s.length(); }
}
class Sum implements Function2<Integer, Integer, Integer> {
  public Integer call(Integer a, Integer b) { return a + b; }
}
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
int totalLength = lineLengths.reduce(new Sum());
Note that anonymous inner classes in Java can also access variables in the enclosing scope as long as they are marked final. Spark will ship copies of these variables to each worker node, as it does for other languages.
Working with key-value pairs
While most Spark operations work on RDDs containing objects of any type, a few special operations are only available on RDDs of key-value pairs. The most common are distributed "shuffle" operations, such as grouping or aggregating elements by key.
In Java, key-value pairs are represented using the Tuple2 class from the Scala standard library. You can simply call new Tuple2(a, b) to create a tuple and access its fields with tuple._1() and tuple._2().
RDDs of key-value pairs are represented by the JavaPairRDD class. You can construct JavaPairRDDs from JavaRDDs using special versions of the map operations, such as mapToPair and flatMapToPair. A JavaPairRDD has both the standard RDD functions and special key-value functions.
For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.
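Continuing the example above, a small sketch of those two calls (the variable names are only for illustration):
JavaPairRDD<String, Integer> sorted = counts.sortByKey();  // sort the pairs alphabetically by key
List<Tuple2<String, Integer>> output = sorted.collect();   // bring the (line, count) pairs back to the driver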
Note: when using custom objects as the key in key-value pair operations, you must ensure that a custom equals() method is accompanied by a matching hashCode() method. For full details, see the contract outlined in the Object.hashCode() documentation.
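For instance, a minimal sketch of a hypothetical custom key class that keeps equals() and hashCode() consistent:
class UserKey implements java.io.Serializable {
  private final String id;
  UserKey(String id) { this.id = id; }
  @Override public boolean equals(Object o) {
    return o instanceof UserKey && ((UserKey) o).id.equals(this.id);
  }
  @Override public int hashCode() { return id.hashCode(); }  // must agree with equals()
}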
RDD Persistence
One of the most important capabilities of Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any slices of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This makes future actions much faster (often more than 10 times faster). Caching is a key tool for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using its persist() or cache() methods. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is shorthand for the default storage level, StorageLevel.MEMORY_ONLY (store deserialized objects in memory); other levels include options such as MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY, and replicated variants.
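As a hedged sketch of choosing a non-default level, reusing the lineLengths RDD from the basic example (MEMORY_AND_DISK is just one possible choice):
import org.apache.spark.storage.StorageLevel;
lineLengths.persist(StorageLevel.MEMORY_AND_DISK());  // keep in memory, spill to disk if it does not fit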
Spark also automatically persists some intermediate data in shuffle operations (for example, reduceByKey), even when the user does not call persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend that users call persist on the resulting RDD if they plan to reuse it.