About RDD
Behind the cluster sits a very important distributed data structure: the resilient distributed dataset (Resilient Distributed Dataset, RDD). The RDD is Spark's most basic abstraction: an abstraction of distributed memory that lets users manipulate a distributed dataset the way they would a local collection.
Actions trigger computation and either return a result to the driver or persist the RDD, such as count, collect, and save.

RDD Dependency Relationships
Depending on the nature of the operation, different dependencies may be generated. There are two types of dependencies between RDDs:
Narrow dependency (Narrow Dependencies): each partition of the parent RDD is used by at most one partition of the child RDD, as the sketch below illustrates.
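A minimal sketch contrasting the two dependency types (assuming an existing SparkContext sc): mapValues preserves partitioning and yields a narrow dependency, while groupByKey shuffles data and yields the other kind, a wide (shuffle) dependency.

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))

// Narrow dependency: each child partition reads exactly one parent partition.
val narrow = pairs.mapValues(_.toUpperCase)

// Wide (shuffle) dependency: a child partition may read many parent partitions.
val wide = pairs.groupByKey()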
The Conversion of the RDD
Spark builds the dependencies between RDDs from the transformations and actions in the user-submitted computation logic, and the compute chain forms a logical DAG. Next, take "Word Count" as an example to walk through in detail how this DAG is built. The Spark Scala version of the word count program begins as follows:

val file = sc.textFile("hdfs://...")
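The snippet is cut off; the canonical Spark word count, reconstructed here as a sketch (the hdfs:// paths are placeholders carried over from the original), looks like this:

val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))                            // pair each word with a count of 1
  .reduceByKey(_ + _)                                // sum the counts per word (introduces a shuffle)
counts.saveAsTextFile("hdfs://...")                  // action: materializes the whole DAG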
(transformation) and the action. The main difference between the two types of functions is that a transformation accepts an RDD and returns an RDD, while an action accepts an RDD and returns a non-RDD value. Transformations are deferred (lazy): an operation that derives a new RDD from an existing one is only recorded, and is not actually executed until an action needs its result.
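A minimal sketch of that distinction (assuming a SparkContext sc; the path is a placeholder):

import org.apache.spark.rdd.RDD

val lines: RDD[String] = sc.textFile("hdfs://...")  // creates an RDD; nothing is read yet
val lengths: RDD[Int] = lines.map(_.length)         // transformation: RDD in, RDD out, still lazy
val total: Long = lengths.count()                   // action: RDD in, non-RDD value (a Long) out; the job runs here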
What is an RDD?
The official explanation of RDD is the resilient distributed dataset, in full: Resilient Distributed Datasets. An RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on data in stable physical storage or on other existing RDDs.
What follows is explained according to the author's own understanding.
This dependency arises only with the PartitionPruningRDD. "Prune" means to clip: here only a portion of the parent RDD's partitions is depended on by the child RDD, the rest having been filtered out. The filtering is usually done on the partition index, as in the following code.
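A minimal sketch of partition pruning (PartitionPruningRDD is a Spark developer API; parent is assumed to be an existing RDD):

import org.apache.spark.rdd.PartitionPruningRDD

// Keep only the even-numbered partitions of the parent RDD;
// the pruned-out parent partitions are never computed.
val pruned = PartitionPruningRDD.create(parent, partitionIndex => partitionIndex % 2 == 0)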
Operations on the RDD
The RDD supports two types of operations: transformations and actions.
1) Transformation: creates a new dataset from an existing one.
2) Action: runs a computation on the dataset and returns a value to the driver program.
For example, map is a transformation that passes each element of a dataset through a function and returns a new distributed dataset representing the result. Reduce, on the other hand, is an action that aggregates all elements of the dataset with a function and returns the final result to the driver program.
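For instance (a sketch, assuming a SparkContext sc):

val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)     // transformation: a new distributed dataset, not yet computed
val sum = doubled.reduce(_ + _)   // action: aggregates the elements and returns 30 to the driver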
The Cache of the RDD
One of the reasons Spark is fast is that it can persist (or cache) a dataset in memory across operations. When an RDD is persisted, each node stores its computed partitions in memory and reuses them in other actions on that dataset (or on datasets derived from it). This makes subsequent actions much faster (often 10x faster).
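A classic illustration of reuse (a sketch, assuming a SparkContext sc; the path is a placeholder):

val logs = sc.textFile("hdfs://...")
logs.cache()                                  // mark the RDD for in-memory persistence
logs.filter(_.contains("ERROR")).count()      // first action: computes the partitions and caches them
logs.filter(_.contains("WARN")).count()       // later actions reuse the cached partitions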
The Creation of an RDD
There are two ways to create an RDD:
1) from an already existing Scala collection;
2) from a dataset in an external storage system, including the local file system and any storage supported by Hadoop, such as HDFS, Cassandra, HBase, Amazon S3, etc.
RDDs can only be created through deterministic operations on data in stable physical storage or on other existing RDDs. Both creation paths are sketched below.
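A minimal sketch of both ways (assuming a SparkContext sc; the path is a placeholder):

// 1) From an already existing Scala collection
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))

// 2) From a dataset in an external storage system such as HDFS
val fromStorage = sc.textFile("hdfs://...")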
Changing the persistence (duration) of the RDD, for example via the cache() function. By default an RDD is cleared from memory after it is computed; through the cache() function, the computed RDD is kept cached in memory.

Two Kinds of Operators on the RDD
There are two kinds of computational operators on the RDD: transformation and action.
Checkpoints for the RDD
After the first computation completes, the RDD cache can be kept in memory, on the local file system, or in Tachyon. With caching, Spark avoids repeated computation of the RDD and can greatly increase speed. However, if the cache is lost, the RDD must be recomputed; if the computation is particularly complex or time-consuming, the impact of a cache loss on the whole job can be significant. To avoid this, the RDD can be checkpointed to reliable storage.
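A checkpointing sketch (assuming a SparkContext sc; the paths are placeholders and expensiveTransform is a hypothetical function standing in for a costly computation):

sc.setCheckpointDir("hdfs://...")   // reliable storage for checkpoint files

val derived = sc.textFile("hdfs://...").map(expensiveTransform)
derived.cache()        // cache first so checkpointing does not recompute the whole lineage
derived.checkpoint()   // request a checkpoint; it is written when an action runs
derived.count()        // action: materializes, caches, and checkpoints the RDD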
replicating collections between nodes, or storing collections in Tachyon. These storage levels are set by passing a StorageLevel object to the persist() method; the cache() method uses the default storage level, StorageLevel.MEMORY_ONLY. The storage levels are described below (a usage sketch follows the table):
Storage Level | Meaning
MEMORY_ONLY | Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
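The usage sketch mentioned above (assuming an RDD named lines already exists):

import org.apache.spark.storage.StorageLevel

lines.persist(StorageLevel.MEMORY_AND_DISK)   // explicit storage level: spill to disk when memory is full
// lines.cache() would be shorthand for lines.persist(StorageLevel.MEMORY_ONLY)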
RDD is the most basic and fundamental data abstraction of Spark. http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf is the paper that introduced RDDs; if reading the English feels too time-consuming, you can read this article instead.
This article also analyzes the implementation of RDD based on this paper and the source code.
First, what is an RDD?
Transformations on key-value data fall broadly into three groups: one-to-one mapping between input and output partitions, aggregation, and join operations.

One-to-one input/output partitions: mapValues
mapValues applies a map operation to the value of each (key, value) pair, without touching the key. In the original figure, each box represents an RDD partition; the function a => a + 2 adds 2 to the value 1 of the pair (V1, 1), giving 3.
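A minimal sketch of mapValues (assuming a SparkContext sc):

val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 2)))
val bumped = pairs.mapValues(a => a + 2)   // values become 3 and 4; the keys are untouched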
The RDD is the foundation of Spark and its most fundamental data abstraction. http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf is the paper that introduced the RDD; if reading the English is too time-consuming, see: http://shiyanjun.cn/archives/744.html. This article, too, analyzes the implementation of the RDD based on that paper and the source code. First question: what is an RDD?
The main contents of this lesson:
1. Several ways to create an RDD
2. Hands-on RDD creation
3. RDD internals
There are many ways to create an RDD; among them:
1. Create an RDD from a program's collection (see the sketch below).
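A sketch of creating an RDD from a program's collection, with an explicit partition count (assuming a SparkContext sc):

val data = sc.parallelize(1 to 100, numSlices = 4)   // distribute a local collection over 4 partitions
println(data.getNumPartitions)                       // 4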
Problem
How does Spark's computational model work in parallel? If you have a box of bananas and ask three people to take them home to eat, it is very troublesome if you don't unpack the box: only one person can carry it away. Anyone with normal IQ knows to open the box, pour out the bananas, repack them into three small boxes, and let each person carry one home. Spark, like many other distributed computing systems, borrows this idea to achieve parallelism: the dataset is split into partitions that can be processed independently.