Spark RDD Operation

There are two types of RDD operations: transformations and actions.

Transformation: converts one RDD into another RDD.

Action: evaluates an RDD and either returns a result to the driver or writes it to output.

These operations mainly apply to two kinds of RDDs: value RDDs (for example, RDDs of numbers) and key-value pair RDDs.

All RDD transformations are evaluated lazily: Spark only runs the computation when an action is called.
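
Lazy evaluation can be observed with a small sketch like the one below. It assumes Spark is on the classpath and runs in local mode; the object name, the `local[2]` master, and the accumulator used to count calls are illustrative choices, not part of the original text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls seen before the action, collected result, map calls seen after).
  def run(sc: SparkContext): (Long, Array[Int], Long) = {
    // Accumulator used only to observe when the map function actually runs.
    val calls = sc.longAccumulator("map calls")

    // map is a transformation: this line only records the lineage, nothing runs yet.
    val doubled = sc.parallelize(1 to 5).map { n => calls.add(1); n * 2 }
    val before = calls.sum

    // collect is an action: it triggers the job that evaluates the map.
    val result = doubled.collect()
    val after = calls.sum

    (before, result, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch")
    val sc = new SparkContext(conf)
    try {
      val (before, result, after) = run(sc)
      println(s"map calls before action: $before")
      println(s"result: ${result.mkString(", ")}")
      println(s"map calls after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

The count before the action is zero because no job has run. Accumulator updates made inside transformations can be applied more than once if a task is retried, so the count after the action is only a reliable indicator in a simple local run like this one.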

Common transformations:

1. def map[U: ClassTag](f: T => U): RDD[U] Applies f to each element of the RDD and returns a new RDD of the results.

2. def filter(f: T => Boolean): RDD[T] Returns a new RDD containing only the elements for which the predicate f returns true.

3. def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] Applies f to each element of the RDD. Each element produces a collection, and the elements of all these collections are flattened into a single RDD.

4. def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] Applies f to each partition of the RDD, running once per partition rather than once per element. f must accept an Iterator over the partition's elements and return an Iterator.

5. def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] Applies f to each partition of the RDD, running once per partition. f receives the index of the partition and an Iterator over all the data in that partition, and must return an Iterator.

6. def sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T] Returns an RDD containing a random sample of roughly the given fraction of the data, using seed to initialize the random number generator. withReplacement indicates whether to sample with replacement.

7. def union(other: RDD[T]): RDD[T] Returns a new RDD containing the elements of both RDDs. Duplicates are not removed.

8. def intersection(other: RDD[T]): RDD[T] Returns a new RDD containing the elements that appear in both RDDs.

9. def distinct(): RDD[T] Returns a new RDD with the duplicate elements of the current RDD removed.

10. def partitionBy(partitioner: Partitioner): RDD[(K, V)] Repartitions the RDD using the given partitioner and returns the new RDD.

11. def reduceByKey(func: (V, V) => V): RDD[(K, V)] Merges the values that share the same key using func and returns a new RDD with one (K, V) pair per key.

12. def groupByKey(): RDD[(K, Iterable[V])] Groups the values that share the same key and returns an RDD of type (K, Iterable[V]).

13. def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)] Aggregates the values of each key. Within a partition, createCombiner builds the initial combiner from the first value seen for a key and mergeValue folds further values into it; mergeCombiners then merges the per-partition results for each key.

14. def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)] Within each partition, seqOp folds the values of each key into an accumulator that starts at zeroValue; combOp then merges the accumulators that different partitions produced for the same key.

15. def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] A simplified form of aggregateByKey in which seqOp and combOp are the same function.

16. def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)] Called on a (K, V) RDD whose key type K has an implicit Ordering (for example, because K implements Ordered). Returns a (K, V) RDD sorted by key.

17. def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] Sorts the RDD by the key that f computes for each element. The underlying implementation still uses sortByKey.

18. def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] Called on RDDs of type (K, V) and (K, W). Returns an RDD of (K, (V, W)) in which all elements with the same key are paired together. Note that this is an inner join: only keys that exist in both RDDs appear in the result.

19. def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))] Called on RDDs of type (K, V) and (K, W). Returns an RDD of type (K, (Iterable[V], Iterable[W])). Note that even if V and W are the same type, the values from the two RDDs are not merged; they are kept in separate Iterables.

20. def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] Computes the Cartesian product of the two RDDs and returns an RDD of all (T, U) pairs.

21. def pipe(command: String): RDD[String] Pipes the elements of each partition through an external command, such as a Perl or shell script, and returns the command's output as an RDD of strings. Note that if the script lives on the local file system, it must be placed on every node.

22. def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[T] = null): RDD[T] Reduces the number of partitions. This is useful after filtering a large data set down to a small one, so that later operations on the smaller data set run more efficiently.

23. def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] Redistributes all the data into the given number of partitions by shuffling it over the network. This is an expensive operation.

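24. def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] Repartitions the RDD according to the given partitioner and sorts the records by key within each resulting partition. This performs better than calling repartition and then sorting within each partition, because the sorting is pushed down into the shuffle.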

25. def glom(): RDD[Array[T]] Collects the elements of each partition into an array, producing a new RDD of type RDD[Array[T]].

26. def mapValues[U](f: V => U): RDD[(K, U)] Applies f to the value v of each (k, v) pair, leaving the key unchanged, and returns the new RDD.

27. def subtract(other: RDD[T]): RDD[T] Computes the difference of two RDDs: returns the elements of this RDD that do not appear in other.
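
Several of the transformations listed above can be combined as in the following sketch. It assumes Spark is on the classpath and runs in local mode; the object name, the helper method names, and the sample data are hypothetical and chosen only for illustration. Results are returned as driver-side maps so they do not depend on partition order.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationSketch {
  // Word count built from flatMap, filter, map and reduceByKey.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split(" "))     // one line becomes many words, flattened into one RDD
      .filter(_.nonEmpty)        // drop empty tokens produced by repeated spaces
      .map(word => (word, 1))    // build a key-value pair RDD
      .reduceByKey(_ + _)        // merge the values of each key
      .collectAsMap()            // action: bring the result to the driver
      .toMap

  // Inner join: only keys present in both RDDs appear in the result.
  def innerJoin(sc: SparkContext): Map[String, (Int, String)] = {
    val ages = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
    val cities = sc.parallelize(Seq(("alice", "Paris"), ("carol", "Rome")))
    ages.join(cities).collectAsMap().toMap
  }

  // Per-key average using aggregateByKey with a (sum, count) accumulator.
  def averageByKey(sc: SparkContext, scores: Seq[(String, Int)]): Map[String, Double] =
    sc.parallelize(scores)
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),    // seqOp: fold a value into the accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2))    // combOp: merge accumulators across partitions
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collectAsMap()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("transformation-sketch")
    val sc = new SparkContext(conf)
    try {
      println(wordCounts(sc, Seq("a b a", "b c")))
      println(innerJoin(sc))
      println(averageByKey(sc, Seq(("x", 1), ("x", 3), ("y", 10))))
    } finally {
      sc.stop()
    }
  }
}
```

The aggregateByKey call uses the overload that takes only a zero value, which keeps the default partitioning; the signature in the list above additionally accepts an explicit Partitioner.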

Common actions:

1. def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T] Samples num elements, but unlike sample it returns them to the driver as a Scala array rather than as an RDD.

2. def reduce(f: (T, T) => T): T Aggregates all the elements of the RDD using f, which should be commutative and associative so that it can be computed correctly in parallel.

3. def collect(): Array[T] Returns all elements of the data set to the driver as an array.

4. def count(): Long Returns the number of elements in the RDD.

5. def first(): T Returns the first element of the RDD.

6. def take(num: Int): Array[T] Returns the first num elements of the RDD.

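7. def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] Returns the first num elements of the RDD after sorting according to ord, that is, the num smallest elements under that ordering.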

8. def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U Aggregates the elements of each partition with seqOp, starting from the initial value zeroValue, and then uses combOp to combine the per-partition results with zeroValue. The type returned by this function does not need to be the same as the element type of the RDD.

9. def fold(zeroValue: T)(op: (T, T) => T): T A simplified form of aggregate in which seqOp and combOp are the same function.

10. def saveAsTextFile(path: String): Unit Saves the RDD as a text file on the local file system or HDFS.

11. def saveAsObjectFile(path: String): Unit Saves the elements of the RDD as serialized objects on the local file system or HDFS.

12. def countByKey(): Map[K, Long] For RDDs of type (K, V), returns a map of (K, Long) giving the number of elements for each key.

13. def foreach(f: T => Unit): Unit Runs f on each element of the data set, typically for its side effects such as updating an accumulator or writing to external storage.
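
The actions above can be exercised on a small in-memory RDD, as in this sketch. It assumes Spark is on the classpath and runs in local mode; the object name, the Summary case class, and the sample numbers are hypothetical and only serve the illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionSketch {
  // Driver-side container for the results of several actions (illustrative only).
  case class Summary(
      total: Int,
      foldedTotal: Int,
      count: Long,
      firstTwo: Seq[Int],
      smallestTwo: Seq[Int],
      mean: Double,
      perKey: Map[String, Long])

  def summarize(sc: SparkContext): Summary = {
    val rdd = sc.parallelize(Seq(5, 3, 8, 1, 4), numSlices = 2)

    val total = rdd.reduce(_ + _)              // combine all elements with an associative function
    val foldedTotal = rdd.fold(0)(_ + _)       // like reduce, with a neutral zero value
    val count = rdd.count()                    // number of elements
    val firstTwo = rdd.take(2).toSeq           // first elements in partition order
    val smallestTwo = rdd.takeOrdered(2).toSeq // smallest elements under the implicit Ordering

    // aggregate may return a type different from the element type: here (sum, count).
    val (sum, n) = rdd.aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),    // seqOp: fold an element into the accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2))    // combOp: merge per-partition accumulators

    val perKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).countByKey().toMap

    Summary(total, foldedTotal, count, firstTwo, smallestTwo, sum.toDouble / n, perKey)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("action-sketch")
    val sc = new SparkContext(conf)
    try {
      println(summarize(sc))
    } finally {
      sc.stop()
    }
  }
}
```

The zero value passed to fold and aggregate is applied once in every partition and once more when the partition results are combined, so it should be neutral for the operation (0 for addition here); a non-neutral value would be counted several times.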

Note: When a function passed to an RDD operation uses a method or field of a class, the whole object is sent to the executors, so the class must implement the java.io.Serializable interface. Alternatively, assign the field to a local variable first, so that only that value is captured and the entire object is not transmitted.
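
The note above can be illustrated with the following sketch, which assumes Spark is on the classpath and runs in local mode. KeywordMatcher and its method names are hypothetical and exist only for this example; the class is deliberately not Serializable.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkException}
import org.apache.spark.rdd.RDD

// Hypothetical helper class; deliberately NOT Serializable.
class KeywordMatcher(val keyword: String) {

  // The closure reads this.keyword, so Spark has to serialize the whole
  // KeywordMatcher instance, which fails because the class is not Serializable.
  def matchingUnsafe(rdd: RDD[String]): RDD[String] =
    rdd.filter(line => line.contains(keyword))

  // Copying the field to a local variable means only the String is captured.
  def matching(rdd: RDD[String]): RDD[String] = {
    val kw = keyword
    rdd.filter(line => line.contains(kw))
  }
}

object SerializationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("serialization-sketch")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("spark rdd", "hadoop", "spark sql"))
      val matcher = new KeywordMatcher("spark")

      // Works: the closure only captures a local String.
      println(matcher.matching(lines).collect().mkString(", "))

      // Fails with a "Task not serializable" SparkException.
      try {
        matcher.matchingUnsafe(lines).collect()
      } catch {
        case e: SparkException => println(s"unsafe variant failed: ${e.getMessage}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Making KeywordMatcher extend Serializable would also let the unsafe variant run, at the cost of shipping the whole object with every task.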

