Spark programming--transformations


Map

Applies a function to each data item in the RDD, one-to-one: the resulting RDD has the same number of elements and the same number of partitions as the original.

Example:

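A minimal sketch in spark-shell style, assuming sc is the usual SparkContext and an illustrative data set:

    val rdd = sc.parallelize(List(1, 2, 3, 4), 2)  // data set: 4 elements, 2 partitions
    val mapped = rdd.map(_ * 10)                   // map operation: one output per input
    mapped.collect()                               // Array(10, 20, 30, 40)
    mapped.partitions.length                       // 2, same as the input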

FlatMap

Same as map, but the result of each map call is flattened afterwards; it can be understood as one-to-many. (Note: a String is treated as a sequence of characters, so it is split into individual characters.)

Example:
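A sketch contrasting map and flatMap on an illustrative data set (spark-shell, sc available):

    val rdd = sc.parallelize(List("a b", "c d"))
    rdd.map(_.split(" ")).collect()       // Array(Array(a, b), Array(c, d)) -- nested
    rdd.flatMap(_.split(" ")).collect()   // Array(a, b, c, d) -- flattened
    // a String itself acts as a sequence of characters:
    sc.parallelize(List("ab")).flatMap(s => s).collect()   // Array(a, b)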

Distinct

Removes duplicate data items from the RDD.

Example:
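A minimal sketch with illustrative data:

    val rdd = sc.parallelize(List(1, 1, 2, 2, 3))
    rdd.distinct().collect()   // Array(1, 2, 3) -- order may vary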

Coalesce

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

Function: re-partitions the RDD (a HashPartitioner is used to distribute the data when shuffling).

Parameter 1: the target number of partitions.

Parameter 2: whether to shuffle; defaults to false. (Note: the usual advantage of coalesce over repartition is that it can avoid a shuffle.)

Example: (I gave the virtual machine two cores, so the default is two partitions, and I could not raise the count above 2; the number of partitions may be tied to the number of available processors. Sources online say that when the target partition count is larger than the current one you must set the shuffle parameter to true, but setting the count from 1 to 2 also succeeded for me; I don't know whether this still holds on big data.)
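A sketch of the shuffle flag's effect, with illustrative data:

    val rdd = sc.parallelize(1 to 10, 4)
    rdd.coalesce(2).partitions.length                  // 2, shrinking needs no shuffle
    rdd.coalesce(8).partitions.length                  // still 4: growing without shuffle has no effect
    rdd.coalesce(8, shuffle = true).partitions.length  // 8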

Repartition

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

Also repartitions the RDD, but always performs a shuffle; just pass the target number of partitions. (For jobs made of many small tasks, Spark's own partitioning mechanism may create too many partitions; forcing a smaller partition count can sometimes speed up the program.)

Example:
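A minimal sketch; repartition(n) is equivalent to coalesce(n, shuffle = true):

    val rdd = sc.parallelize(1 to 10, 4)
    rdd.repartition(2).partitions.length   // 2, via a full shuffle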

RandomSplit

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

Randomly splits one RDD into multiple RDDs, sliced according to the weights in a Double array.

The second parameter is the seed for the random number generator and can usually be ignored.

Example:
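A sketch with illustrative weights; the resulting counts are only approximately proportional:

    val rdd = sc.parallelize(1 to 100)
    val parts = rdd.randomSplit(Array(0.7, 0.3), seed = 42L)
    parts(0).count()   // roughly 70
    parts(1).count()   // roughly 30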

Union

def union(other: RDD[T]): RDD[T]

Combines two RDDs without removing duplicates.

Example:
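A minimal sketch showing that duplicates survive:

    val a = sc.parallelize(List(1, 2, 3))
    val b = sc.parallelize(List(3, 4, 5))
    a.union(b).collect()   // Array(1, 2, 3, 3, 4, 5) -- the duplicate 3 is kept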

Intersection

Returns the intersection of two RDDs, with duplicates removed.

def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

numPartitions: specifies the number of partitions of the returned RDD.
partitioner: specifies the partition function.

Example:
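A minimal sketch; note that the result is deduplicated:

    val a = sc.parallelize(List(1, 2, 3, 3))
    val b = sc.parallelize(List(3, 3, 4, 5))
    a.intersection(b).collect()   // Array(3)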

Subtract

Returns the data items that appear in this RDD but not in otherRDD.

def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

Example:
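A minimal sketch with illustrative data:

    val a = sc.parallelize(List(1, 2, 3, 4))
    val b = sc.parallelize(List(3, 4, 5))
    a.subtract(b).collect()   // Array(1, 2) -- order may vary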

MapPartitions / MapPartitionsWithIndex / Zip

def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]

Effect: zips the two RDDs together, element by element, into an RDD of key/value pairs.

Note: by default the two RDDs must have the same number of partitions and the same number of elements in each partition, otherwise an exception is thrown.

Example:
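A sketch with matching partition and element counts (both requirements from the note above):

    val a = sc.parallelize(List(1, 2, 3), 2)
    val b = sc.parallelize(List("a", "b", "c"), 2)
    a.zip(b).collect()   // Array((1,a), (2,b), (3,c))
    // zipping RDDs whose partition or element counts differ throws a SparkException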

ZipPartitions / ZipWithIndex

def zipWithIndex(): RDD[(T, Long)]

Pairs each element of the RDD with its index (ID) in the RDD as key/value pairs (can be used for numbering). The indexes are consecutive across the whole RDD; when there is more than one partition, Spark must first run a job to count the elements in each partition.

Example:
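A minimal sketch; the indexes run consecutively across partitions:

    val rdd = sc.parallelize(List("a", "b", "c", "d"), 2)
    rdd.zipWithIndex().collect()   // Array((a,0), (b,1), (c,2), (d,3))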

ZipWithUniqueId

def zipWithUniqueId(): RDD[(T, Long)]

This function pairs each element of the RDD with a unique ID as key/value pairs; the unique ID is generated as follows:

The unique ID of the first element in each partition is the partition index;

The unique ID of the nth element in a partition is (the unique ID of the previous element) + (the total number of partitions of the RDD).

Example: (although each partition numbers its elements independently, the IDs are unique across the entire RDD; note the difference from zipWithIndex())
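A sketch of the ID algorithm above; with this input, parallelize puts a, b in partition 0 and c, d, e in partition 1:

    val rdd = sc.parallelize(List("a", "b", "c", "d", "e"), 2)
    rdd.zipWithUniqueId().collect()
    // Array((a,0), (b,2), (c,1), (d,3), (e,5))
    // partition 0 starts at 0, partition 1 starts at 1; both step by 2 (the partition count)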

PartitionBy

def partitionBy(partitioner: Partitioner): RDD[(K, V)]

The function generates a new ShuffledRDD according to the given partitioner function, re-partitioning the original RDD.
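Example (a minimal sketch using the built-in HashPartitioner):

    import org.apache.spark.HashPartitioner
    val pairs = sc.parallelize(List((1, "a"), (2, "b"), (3, "c")))
    val partitioned = pairs.partitionBy(new HashPartitioner(2))   // yields a ShuffledRDD
    partitioned.partitions.length   // 2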

MapValues

def mapValues[U](f: (V) => U): RDD[(K, U)]

Like map in the basic transformations, except that mapValues applies the map only to the V value in each [K, V] pair.

Example:
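A minimal sketch; the keys pass through untouched:

    val pairs = sc.parallelize(List(("a", 1), ("b", 2)))
    pairs.mapValues(_ * 10).collect()   // Array((a,10), (b,20))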

FlatMapValues

def flatMapValues[U](f: (V) => TraversableOnce[U]): RDD[(K, U)]

Like flatMap in the basic transformations, except that flatMapValues applies flatMap only to the V value in each [K, V] pair.

Example:
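A minimal sketch with illustrative data; the key is repeated for every value produced:

    val pairs = sc.parallelize(List(("a", "1 2"), ("b", "3")))
    pairs.flatMapValues(_.split(" ")).collect()   // Array((a,1), (a,2), (b,3))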

ReduceByKey

Merges the values of a key-value RDD by key: values whose keys are equal are combined.

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

This function computes, for each K in RDD[K, V], the combined V value according to the mapping function.

The parameter numPartitions specifies the number of partitions;

The parameter partitioner specifies the partition function.

Example:
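A minimal sketch summing the values of equal keys:

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
    pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2)) -- order may vary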

GroupByKey

def groupByKey(): RDD[(K, Iterable[V])]

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

This function merges the V values of each K in RDD[K, V] into a single Iterable[V].

The parameter numPartitions specifies the number of partitions;

The parameter partitioner specifies the partition function.

Example:
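A minimal sketch with illustrative data:

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
    pairs.groupByKey().collect()   // Array((a,CompactBuffer(1, 3)), (b,CompactBuffer(2)))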

ReduceByKeyLocally

def reduceByKeyLocally(func: (V, V) => V): Map[K, V]

This function combines the V values of each K in RDD[K, V] according to the mapping function, but returns the result to the driver as a Map[K, V] rather than an RDD[K, V]. (In Python, the result is a dictionary.)

Example:
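A minimal sketch; note the result is an ordinary Map on the driver, not an RDD:

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
    pairs.reduceByKeyLocally(_ + _)   // Map(a -> 4, b -> 2)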

CombineByKey / FoldByKey / SubtractByKey

subtractByKey is similar to subtract in the basic transformations, but it operates on the key: it returns the elements whose keys appear in this RDD but not in otherRDD.

The parameter numPartitions specifies the number of partitions of the result;

The parameter partitioner specifies the partition function.
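Example (a minimal sketch; only the keys are compared, values in the other RDD are ignored):

    val a = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
    val b = sc.parallelize(List(("b", 9)))
    a.subtractByKey(b).collect()   // Array((a,1), (c,3)) -- order may vary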

Cogroup

The example failed to run.

Join / LeftOuterJoin / RightOuterJoin
