Spark RDD Introduction


About RDD

Behind a Spark cluster lies a very important distributed data abstraction, the Resilient Distributed Dataset (RDD), which is a logical collection of entities whose data is partitioned across multiple machines in a cluster. By controlling how the partitions of different RDDs are placed on machines, data shuffling between machines can be reduced. Spark provides the partitionBy operator, which creates a new RDD by redistributing the data of the original RDD across the machines of the cluster. The RDD is the core data structure of Spark: the dependency relationships between RDDs determine the scheduling order of Spark jobs, and the entire Spark program is expressed as a sequence of operations on RDDs.
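For illustration, here is a minimal sketch (the application name, key names, and variable names are made up for this example) of how partitionBy redistributes a key-value RDD so that a later aggregation with the same partitioner needs no additional shuffle. Later sketches in this article reuse the SparkContext sc created here.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical local setup; on a real cluster the master URL would differ.
val conf = new SparkConf().setAppName("rdd-intro-sketch").setMaster("local[*]")
val sc   = new SparkContext(conf)

// A small key-value RDD; partitionBy redistributes it once across 4 partitions.
val events = sc.parallelize(Seq(("user1", 1), ("user2", 1), ("user1", 1)))
val byUser = events.partitionBy(new HashPartitioner(4))

// Because byUser already carries HashPartitioner(4), this reduceByKey can
// aggregate within each partition without another shuffle across machines.
val counts = byUser.reduceByKey(_ + _)
counts.collect().foreach(println)
```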
There are four ways to create an RDD, as follows (a combined sketch appears after the list):
1. Created from a Hadoop file system input such as HDFS, or from another persistent storage system compatible with Hadoop, such as Hive, Cassandra, or HBase.
2. Converted from a parent RDD into a new RDD through a transformation.
3. Created by calling the parallelize() method of SparkContext, which parallelizes a data set held on the driver into a distributed RDD.
4. Created by changing the persistence of an RDD, for example with the cache() function. By default an RDD is cleared from memory after it is computed; cache() keeps the computed RDD in memory.
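A combined sketch of the four creation paths, assuming the SparkContext sc from the earlier snippet (the HDFS path is a placeholder):

```scala
// 1. From a Hadoop-compatible file system (placeholder HDFS path).
val fromHdfs = sc.textFile("hdfs:///tmp/input.txt")

// 2. From a parent RDD via a transformation.
val lineLengths = fromHdfs.map(line => line.length)

// 3. By parallelizing a collection held on the driver.
val numbers = sc.parallelize(1 to 1000, 4)

// 4. By changing persistence: cache() keeps the computed RDD in memory.
val cachedLengths = lineLengths.cache()
```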
Two types of operators on an RDD
There are two kinds of computational operators on an RDD: transformations and actions.
1. Transformation operators
Transformations are deferred (lazy): converting one RDD into another is not performed immediately; the computation is actually triggered only when an action is invoked.
2. Action operators
An action operator triggers Spark to submit a job and outputs data out of the Spark system, for example to storage or back to the driver.
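A small sketch of the difference (again assuming the SparkContext sc from above): the transformations only record what to compute, and the final action submits the job.

```scala
// Transformations are recorded but not executed yet.
val words      = sc.parallelize(Seq("spark rdd", "lazy evaluation", "spark action"))
val sparkLines = words.filter(_.contains("spark"))   // transformation: deferred
val upperCased = sparkLines.map(_.toUpperCase)       // transformation: deferred

// The action below submits a job and forces the chain above to run.
val matched = upperCased.count()
println(s"matched lines: $matched")
```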
Important internal properties of an RDD
1. A list of partitions.
2. A function for computing each partition.
3. A list of dependencies on parent RDDs.
4. For key-value RDDs, an optional partitioner that controls the partitioning policy and the number of partitions.
5. A list of preferred locations for each partition, such as the locations of the data blocks of an HDFS file.
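These properties can be observed from user code; the sketch below (hypothetical data, reusing sc) prints the public views that correspond to them:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
              .partitionBy(new HashPartitioner(4))

println(pairs.partitions.length)   // 1. the list of partitions
// 2. the per-partition compute function is invoked by the scheduler, not by user code
println(pairs.dependencies)        // 3. dependencies on parent RDDs
println(pairs.partitioner)         // 4. Some(HashPartitioner) for this key-value RDD
println(pairs.preferredLocations(pairs.partitions(0))) // 5. preferred locations of a partition
```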

The core of Spark's data storage is the Resilient Distributed Dataset (RDD). An RDD can be thought of abstractly as a large array, but one that is distributed across the cluster; logically, each slice of the RDD is called a partition. During the execution of a Spark program, an RDD passes through a series of transformation operators and is finally triggered by an action operator. Each transformation logically produces a new RDD, and the RDDs are linked by lineage, that is, by dependency relationships, which play a very important role in fault tolerance. Both the input and the output of a transformation are RDDs.

An RDD is divided into a number of partitions that are distributed across multiple nodes in the cluster. A partition is a logical concept: the old and new partitions before and after a transformation may physically be stored in the same piece of memory. This is an important optimization that prevents the unlimited growth of memory requirements that the functional immutability of the data would otherwise cause. Some RDDs are intermediate results of a computation, and their partitions do not necessarily have corresponding data in memory or on disk; if you want to reuse the data iteratively, you can call the cache() function to cache it.
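A short sketch of lineage and caching (reusing sc; the chain itself is arbitrary): toDebugString prints the recorded dependency chain, and cache() keeps the intermediate result for the second action.

```scala
val base    = sc.parallelize(1 to 100, 4)
val squared = base.map(x => x * x)       // new RDD, parent: base
val evens   = squared.filter(_ % 2 == 0) // new RDD, parent: squared

evens.cache()                     // keep this intermediate result for reuse
println(evens.toDebugString)      // prints the lineage (dependency chain)
println(evens.count())            // first action: computes the chain and fills the cache
println(evens.reduce(_ + _))      // second action: reads from the cached partitions
```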
Physically, an RDD object is essentially a metadata structure that stores the mapping between blocks and nodes, along with other metadata. An RDD is a set of partitions; in the physical data store, each partition of the RDD corresponds to a block, and a block is kept in memory, spilling to disk only when there is not enough memory. Each block stores a subset of all the data items in the RDD. What is exposed to the user can be an iterator over a block (for example, the user can obtain a per-partition iterator through mapPartitions), or it can be an individual data item (for example, the map function computes each data item in parallel). If an external storage system such as HDFS is used as the input data source, the data is partitioned according to the data distribution policy in HDFS, and one block in HDFS corresponds to one partition in Spark. At the same time, Spark supports repartitioning: a partitioner, either Spark's default or a user-defined one, determines which nodes the data blocks are distributed to. For example, Spark supports hash partitioning (the partition is chosen by the hash value of a data item's key, so elements with the same hash value fall into the same partition) and range partitioning (data items whose keys belong to the same range are placed in the same partition).
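The two built-in partitioning policies mentioned above can also be applied explicitly; a sketch (reusing sc, with an arbitrary partition count):

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

val kv = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("cherry", 3), ("apple", 4)))

// Hash partitioning: keys with the same hash value land in the same partition.
val hashed = kv.partitionBy(new HashPartitioner(3))

// Range partitioning: keys falling in the same sampled key range share a partition.
val ranged = kv.partitionBy(new RangePartitioner(3, kv))

println(hashed.partitioner)   // Some(org.apache.spark.HashPartitioner@...)
println(ranged.partitioner)   // Some(org.apache.spark.RangePartitioner@...)
```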
