"Spark" Rdd operation detailed 1--transformation and actions overview


The role of Spark operators

This section describes how Spark transforms RDDs through operators during a run. Operators are functions defined on RDDs that transform and manipulate the data inside them.

    1. Input: While a Spark program runs, data enters Spark from external data space (for example, distributed storage: textFile reads from HDFS, or the parallelize method takes a Scala collection or local data). The data enters Spark's runtime data space and is turned into blocks in Spark, managed by the BlockManager.
    2. Operation: Once the input data forms an RDD, it can be manipulated with transformation operators such as filter, which turn one RDD into a new RDD; action operators then trigger Spark to submit a job. If the data needs to be reused, it can be cached in memory with the cache operator.
    3. Output: When the program finishes, data leaves Spark's runtime space and is written to distributed storage (e.g. saveAsTextFile writes to HDFS) or returned as Scala data or collections (collect returns a Scala collection; count returns a Scala Long).
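A minimal end-to-end sketch of this input/operation/output flow, assuming Spark's Scala API with an existing SparkContext named sc; the HDFS paths are hypothetical:

    // Input: read lines from HDFS into an RDD; its blocks are managed by the BlockManager.
    val lines = sc.textFile("hdfs:///data/input.txt")      // hypothetical path

    // Operation: transformations build new RDDs lazily, without running anything yet.
    val errors = lines.filter(_.contains("ERROR"))
    errors.cache()                                         // keep the RDD in memory for reuse

    // Actions trigger job submission and return results to the driver.
    val n = errors.count()                                 // returns a Scala Long

    // Output: write the result back to distributed storage.
    errors.saveAsTextFile("hdfs:///data/errors")           // hypothetical path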

Spark's core data model is the RDD. The RDD itself is an abstract class, implemented by subclasses such as MappedRDD and ShuffledRDD; Spark translates common big-data operations into such RDD subclasses.
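A small sketch that makes these subclasses visible, assuming a SparkContext named sc; the exact class names printed depend on the Spark version (e.g. MappedRDD in older releases, MapPartitionsRDD in newer ones):

    val rdd    = sc.parallelize(1 to 10)    // backed by a ParallelCollectionRDD
    val mapped = rdd.map(_ * 2)             // map produces a new RDD subclass

    // toDebugString prints the lineage, one concrete RDD subclass per operator.
    println(mapped.toDebugString)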

Transformations and Actions overview

Transformations in detail
  • map(func): Returns a new distributed dataset formed by passing each element of the source through the function func.
  • filter(func): Returns a new dataset consisting of the source elements for which func returns true.
  • flatMap(func): Similar to map, but each input element is mapped to 0 or more output elements (so func returns a Seq rather than a single element).
  • sample(withReplacement, frac, seed): Randomly samples a fraction frac of the data, with or without replacement, using the given random seed.
  • union(otherDataset): Returns a new dataset formed by the union of the source dataset and the argument.
  • groupByKey([numTasks]): Called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default 8 parallel tasks are used for grouping; you can pass the optional numTasks parameter to set a different number of tasks depending on the amount of data.
  • reduceByKey(func, [numTasks]): Called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which all values for the same key are aggregated with the given reduce function. As with groupByKey, the number of tasks is configurable through an optional second parameter.
  • join(otherDataset, [numTasks]): Called on datasets of types (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
  • groupWith(otherDataset, [numTasks]): Called on datasets of types (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. In other frameworks this operation is called cogroup.
  • cartesian(otherDataset): Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs, i.e. all combinations of elements. (A combined usage sketch of these transformations follows the list.)
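A minimal combined sketch of the transformations listed above, assuming a SparkContext named sc; the sample data is made up:

    val nums  = sc.parallelize(Seq(1, 2, 3, 4))
    val lines = sc.parallelize(Seq("a b", "c"))

    val doubled = nums.map(_ * 2)                        // map
    val evens   = nums.filter(_ % 2 == 0)                // filter
    val words   = lines.flatMap(_.split(" "))            // flatMap: one line -> many words
    val sampled = nums.sample(false, 0.5, 42)            // sample(withReplacement, frac, seed)
    val more    = nums.union(sc.parallelize(Seq(5, 6)))  // union

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    val grouped = pairs.groupByKey()                     // (K, V) -> (K, Seq[V])
    val summed  = pairs.reduceByKey(_ + _)               // values per key reduced with func
    val joined  = pairs.join(other)                      // (K, (V, W)) for matching keys
    val cogrp   = pairs.groupWith(other)                 // a.k.a. cogroup
    val cross   = nums.cartesian(lines)                  // all (T, U) pairs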
Actions in detail
    • reduce(func): Aggregates all elements of the dataset using the function func, which takes two arguments and returns one value. The function must be associative so that it can be computed correctly in parallel.
    • collect(): Returns all elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation has produced a sufficiently small subset of the data; calling collect directly on an entire RDD can easily cause the driver program to OOM.
    • count(): Returns the number of elements in the dataset.
    • take(n): Returns an array with the first n elements of the dataset. Note that this operation is currently not executed in parallel on multiple nodes; instead, the machine where the driver program runs computes all the elements (memory pressure on that gateway machine will increase, so use with caution).
    • first(): Returns the first element of the dataset (similar to take(1)).
    • saveAsTextFile(path): Writes the elements of the dataset as a text file to the local file system, HDFS, or any other Hadoop-supported file system. Spark calls the toString method of each element to turn it into a line of text in the file.
    • saveAsSequenceFile(path): Writes the elements of the dataset as a SequenceFile to the given directory in the local file system, HDFS, or any other Hadoop-supported file system. The elements of the RDD must be key-value pairs that implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double, String, etc.).
    • foreach(func): Runs the function func on each element of the dataset. This is usually done to update an accumulator variable or to interact with an external storage system. (A combined usage sketch of these actions follows the list.)
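A minimal sketch of the actions listed above, assuming a SparkContext named sc; the output path is hypothetical, and the accumulator uses the classic sc.accumulator API of the Spark era this article describes:

    val nums = sc.parallelize(1 to 100)

    val sum   = nums.reduce(_ + _)               // func must be associative
    val first = nums.first()                     // like take(1)
    val five  = nums.take(5)                     // first 5 elements, computed at the driver
    val n     = nums.count()                     // returns a Long
    val small = nums.filter(_ < 10).collect()    // collect only small subsets to the driver

    nums.saveAsTextFile("hdfs:///out/nums")      // hypothetical path; one line per element via toString

    val acc = sc.accumulator(0)                  // accumulator updated from foreach
    nums.foreach(x => acc += 1)
    println(acc.value)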
Operator classification

Operators can be broadly divided into three types:
1. Transformation operators on Value-type data, which do not trigger job submission; the data items they process are of Value type.
2. Transformation operators on Key-Value-type data, which do not trigger job submission; the data items they process are Key-Value pairs.
3. Action operators, which trigger the SparkContext to submit a job.
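A minimal sketch of this classification in practice, assuming a SparkContext named sc: the two kinds of transformations only build the lineage, and nothing runs until the action is called:

    val data  = sc.parallelize(1 to 1000000)

    // Value-type transformation: no job submitted yet.
    val big   = data.filter(_ > 10)

    // Key-Value-type transformation: still no job submitted.
    val pairs = big.map(x => (x % 10, x)).reduceByKey(_ + _)

    // Action: this call makes the SparkContext submit a job.
    val result = pairs.collect()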

When reprinting, please credit the author, Jason Ding, and indicate the source:
GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
GitHub blog homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Or search Google for jasonding1354 to reach my blog homepage.

Copyright notice: This article is an original post by the blogger and may not be reproduced without the blogger's permission.

"Spark" Rdd operation detailed 1--transformation and actions overview

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.