First, the classification of Spark operators (described in detail at http://www.cnblogs.com/zlslch/p/5723857.html):
1. Transformation (conversion) operators
1. map operator
2. flatMap operator
3. mapPartitions operator
4. union operator
5. cartesian operator
6. groupBy operator
7. filter operator
8. sample operator
9. cache operator
10. persist operator
11. mapValues operator
12. combineByKey operator
13. reduceByKey operator
14. join operator
2. Action operators
1. foreach operator
2. saveAsTextFile operator
3. collect operator
4. count operator
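As a quick illustration, the following is a minimal Scala sketch that chains several of the transformation operators listed above and then fires action operators; the OperatorDemo object name, the input.txt path, and the local master setting are illustrative placeholders, not taken from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OperatorDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OperatorDemo").setMaster("local[*]") // placeholder config
    val sc   = new SparkContext(conf)

    // Transformation operators: each returns a new RDD and is evaluated lazily.
    val lines  = sc.textFile("input.txt")                      // input path is a placeholder
    val words  = lines.flatMap(_.split(" "))                   // flatMap operator
    val pairs  = words.map(word => (word, 1))                  // map operator
    val counts = pairs.reduceByKey(_ + _)                      // reduceByKey operator
    val longer = counts.filter { case (w, _) => w.length > 3 } // filter operator

    // Action operators: each triggers job submission.
    longer.foreach(println)                                    // foreach operator
    println(longer.count())                                    // count operator

    sc.stop()
  }
}
```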
To summarize briefly:
Spark submits a job only when an action operator is triggered.
Data is cached in memory through the cache operator.
Figure 1: Spark operators and data space
Figure 1 describes Spark's input, transformation, and output. During the transformation stage, an RDD is converted into a new RDD by operators; operators are functions defined on the RDD that transform and manipulate the data it contains.
1) Input: When a Spark program runs, data enters Spark from an external data space (for example, distributed storage read with textFile from HDFS, or a Scala collection or local data passed to the parallelize method). The data enters the Spark runtime data space, where it is turned into blocks managed by the BlockManager.
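As a sketch of the two input routes just described (assuming sc is the SparkContext from the earlier example; the HDFS URL is a placeholder):

```scala
// Reading from distributed storage: textFile returns an RDD[String] whose
// partitions are stored as blocks managed by the BlockManager.
val fromHdfs = sc.textFile("hdfs://namenode:9000/data/input.txt") // URL is a placeholder

// Turning a Scala collection into an RDD with the parallelize method.
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))
```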
2) Run: After the input data forms an RDD, it can be transformed into a new RDD through transformation operators such as filter, and an action operator then triggers Spark to submit the job. If the data needs to be reused, it can be cached in memory through the cache operator.
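A minimal sketch of this run stage, under the same assumptions: the filter transformation only builds a new RDD, cache marks it for in-memory reuse, and nothing executes until an action operator is called.

```scala
val numbers = sc.parallelize(1 to 100)

// Transformation: lazily defines a new RDD; no job is submitted yet.
val evens = numbers.filter(_ % 2 == 0)

// cache marks the RDD for in-memory storage on first computation,
// so later actions reuse it instead of recomputing the lineage.
evens.cache()

// Each action triggers a job; the second reads the cached data.
println(evens.count()) // computes the RDD and fills the cache
println(evens.sum())   // served from memory
```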
3) Output: When the program finishes, the data leaves the Spark runtime space and is stored in distributed storage (for example, saveAsTextFile writes to HDFS) or returned as Scala data or collections (collect outputs to a Scala collection; count returns a Scala Long). The core data model of Spark is the RDD, but RDD is an abstract class implemented by subclasses such as MappedRDD and ShuffledRDD; Spark translates common big-data operations into subclasses of RDD.
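The three output routes can be sketched as follows, continuing with the evens RDD from the previous sketch; the HDFS output path is a placeholder.

```scala
// Output to distributed storage.
evens.saveAsTextFile("hdfs://namenode:9000/data/output") // path is a placeholder

// Output to the driver as a Scala collection.
val asArray: Array[Int] = evens.collect()

// Output as a single Scala value (count returns a Long).
val total: Long = evens.count()
```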
The role of Apache Spark operators