Spark Learning -- RDD

Tags: foreach, anonymous function, closure

Before introducing the RDD, a few preliminaries:

Because I'm using the Java API, the first thing to do is create a JavaSparkContext object, which tells Spark how to access the cluster:

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);

The appName parameter is the name shown for your application on the cluster UI. master is the URL of a Spark, Mesos, or YARN cluster, or the special string "local" to run in local mode. In practice, when running on a cluster, you do not want to hardcode master in the program; instead, launch the application with spark-submit and receive it there. For local runs and unit tests, however, you can pass "local" to run Spark in-process.

Spark-related shell operations are not elaborated on here.

Here's an introduction to Spark's most important concept, the RDD:

An RDD (Resilient Distributed Dataset) is the central abstraction of Spark: a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create an RDD: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections:

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);

You can create a parallelized collection by calling SparkContext's parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

One important parameter for parallelized collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically, you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
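As a minimal sketch, assuming the sc and data variables from the snippet above, the partition count can be set and checked like this:

// Request 10 partitions explicitly instead of relying on the automatic default.
JavaRDD<Integer> distData10 = sc.parallelize(data, 10);
// partitions() returns the list of partitions, so its size is the partition count.
System.out.println(distData10.partitions().size()); // prints 10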

External Datasets:

JavaRDD<String> distFile = sc.textFile("data.txt");

Spark can create distributed datasets from any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, Amazon S3, and so on. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
You can create an RDD from a text file using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines.

Some notes on reading files with Spark:
If you use a path on the local file system, the file must also be accessible at the same path on the worker nodes. Either copy the file to all workers, or use a network-mounted shared file system.
All of Spark's file-based input methods, including textFile, support directories, compressed files, and wildcards. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
The textFile method also takes an optional second parameter to control the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks are 64MB by default in HDFS), but you can also request a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
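For illustration only (sc is the JavaSparkContext created earlier, and the directory path reuses the example above), both options look like this:

// Wildcards and compressed files are handled by textFile.
JavaRDD<String> gzLines = sc.textFile("/my/directory/*.gz");
// The optional second parameter requests a minimum number of partitions.
JavaRDD<String> moreParts = sc.textFile("data.txt", 8);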

RDD Operations:

RDDs support two types of operations: transformations and actions.

1. Transformations: create a new dataset from an existing one.
2. Actions: return a value to the driver program after running a computation on the dataset.

It is important to note that RDD transformations are lazily evaluated. This means Spark does not start computing until an action is invoked; it only records internally what work has been requested. We can think of each RDD as a list of instructions, built up through transformations, that records how the data is to be computed. Reading data into an RDD is likewise lazy.
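A minimal sketch of this laziness, reusing the line-length example that appears later in this article (sc is the JavaSparkContext from above, and Function comes from org.apache.spark.api.java.function):

JavaRDD<String> lines = sc.textFile("data.txt");
// map() is a transformation: nothing is read or computed yet, Spark only records it.
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});
// count() is an action: only at this point does Spark actually read the file and run the job.
long numLines = lineLengths.count();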

The RDD API provides many transformations. Each transformation generates a new RDD that depends on the original one, and these dependencies between RDDs eventually form a DAG (directed acyclic graph).

There are two types of dependencies between RDDs: NarrowDependency and ShuffleDependency. With a ShuffleDependency, each partition of the child RDD depends on all partitions of the parent RDD, whereas with a NarrowDependency it depends on only one or a few of them. For example, groupBy and join produce a ShuffleDependency, while map and union produce a NarrowDependency.
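The lineage behind this DAG can be inspected from the Java API with toDebugString(); a small sketch, assuming the lines RDD from the examples in this article (PairFunction comes from the Spark Java API and Tuple2 from Scala):

// mapToPair() only adds a NarrowDependency to the lineage.
JavaPairRDD<Integer, String> byLength = lines.mapToPair(new PairFunction<String, Integer, String>() {
  public Tuple2<Integer, String> call(String s) { return new Tuple2<Integer, String>(s.length(), s); }
});
// groupByKey() introduces a ShuffleDependency.
JavaPairRDD<Integer, Iterable<String>> grouped = byLength.groupByKey();
// toDebugString() prints the chain of RDDs (the lineage) from which the DAG is built.
System.out.println(grouped.toDebugString());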


When an action is executed, the chain of dependencies is walked backwards and each transformation is carried out. By default, each transformed RDD is recomputed every time you run an action on it. However, you can persist an RDD in memory using the persist (or cache) method, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. Persisting RDDs on disk, or replicating them across multiple nodes, is also supported.

Persistent storage:

Spark RDDs are lazily evaluated, and sometimes we want to use the same RDD multiple times. If we simply call actions on the RDD naively, Spark recomputes the RDD and all of its dependencies each time. This can be very expensive in iterative algorithms.
In that case, we can ask Spark to persist the data. When we ask Spark to persist an RDD, the nodes that compute it each store the RDD partitions they produce. If a node holding persisted data fails, Spark recomputes the lost partitions when the cached data is needed again. We can also replicate the data to multiple nodes to guard against this.
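A minimal sketch of persisting an RDD, assuming the lineLengths RDD from the examples in this article (StorageLevel comes from org.apache.spark.storage; MEMORY_ONLY is just one of the available levels):

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY()).
lineLengths.persist(StorageLevel.MEMORY_ONLY());
// The first action computes the RDD and stores its partitions in memory.
long first = lineLengths.count();
// Later actions reuse the persisted partitions instead of recomputing the whole lineage.
long second = lineLengths.count();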

Passing functions to Spark:

When the driver runs on a cluster, Spark's API relies heavily on passing functions to run on the cluster. There are two recommended ways to do this:

1. Anonymous inner class syntax, which can be used for short pieces of code.

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});
int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
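If Java 8 or later is available, the same pair of operations can also be written with lambda expressions; a more concise but otherwise equivalent sketch:

JavaRDD<String> lines = sc.textFile("data.txt");
// The lambdas are converted to the Function and Function2 interfaces used above.
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);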
2. Static methods in a global singleton object, or named top-level classes implementing the function interfaces. For example, you can define the classes GetLength and Sum below and pass instances of them.

class GetLength implements Function<String, Integer> {
  public Integer call(String s) { return s.length(); }
}
class Sum implements Function2<Integer, Integer, Integer> {
  public Integer call(Integer a, Integer b) { return a + b; }
}

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
int totalLength = lineLengths.reduce(new Sum());


A key issue: understanding closures in Spark cluster mode:

One of the harder things to understand when executing code across a cluster with Spark is the scope and life cycle of variables and methods.
A common source of confusion is RDD operations that modify variables outside their scope. In the example below, we look at code that uses foreach() to increment a counter, but similar issues can occur with other operations as well.

int counter = 0;
JavaRDD<Integer> rdd = sc.parallelize(data);

// Wrong: don't do this!!
rdd.foreach(x -> counter += x);

System.out.println("Counter value: " + counter);

The behavior of the code above is undefined and may not work as intended. When Spark executes a job, it breaks the RDD operations into tasks, each of which is executed by an executor. Before execution, Spark computes the task's closure. The closure is the set of variables and methods that must be visible to the executor to perform its computations on the RDD (in this case, foreach()). The closure is serialized and sent to each executor.
Each executor receives a copy of the closure's variables, so when counter is referenced inside the foreach function, it is no longer the counter on the driver node. There is still a counter in the driver's memory, but it is not visible to the executors; they see only the copy from the serialized closure. The final value of counter on the driver therefore remains 0, because all operations on counter referred to values inside the serialized closure.
In local mode, in some circumstances, the foreach function actually executes within the same JVM as the driver and references the same original counter, which may then actually be updated.
To ensure well-defined behavior in scenarios like this, an accumulator should be used. Spark's accumulators provide a mechanism for safely updating a variable when execution is split across worker nodes in a cluster. We will discuss accumulators later; a sketch appears below.
In general, closures (constructs like loops or locally defined methods) should not be used to mutate global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside a closure. Code that does this may happen to work in local mode, but that is only by accident, and it will not behave as expected in distributed mode. Use an accumulator instead if some global aggregation is needed.
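As a hedged sketch of the accumulator alternative mentioned above, using the older JavaSparkContext.accumulator API (newer Spark releases expose a different accumulator interface), the counter example could be rewritten as:

// Accumulators are Spark's mechanism for safely aggregating values across executors.
Accumulator<Integer> accum = sc.accumulator(0);
rdd.foreach(x -> accum.add(x));
// The accumulated value is read back on the driver after the action completes.
System.out.println("Counter value: " + accum.value());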
