1.1 RDD Interpretation (II)


(6) Transformation operations describe, through differently shaped RDDs, how data is to be processed internally. Operations of this type do not trigger job execution, which is why they are often called lazy operations.

Most transformations generate and return a new RDD. Note that even sortByKey, whose Scaladoc speaks of returning "this RDD sorted", builds new RDDs internally.


1) The map function: each input row produces exactly one output row after being processed by f.

It applies the function to every element of the RDD and returns a new RDD.

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  // apply the function to every partition of the parent RDD
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
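The semantics can be illustrated without Spark at all. The following is a plain-Python sketch (standard library only, not Spark code) of what map does: the function is applied lazily to every element of every partition, one output per input, and nothing is computed until an action forces it.

```python
# Plain-Python sketch (not Spark) of map's semantics.
partitions = [[1, 2], [3, 4, 5]]        # an RDD's data, split into partitions
f = lambda x: x * 10                    # the user function ("cleanF")

# like MapPartitionsRDD: wrap each partition's iterator in iter.map(f)
mapped = [map(f, iter(p)) for p in partitions]   # lazy: nothing computed yet
result = [list(it) for it in mapped]             # forcing, as an action would
```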

2) The flatMap function is similar to map, but each input row may produce zero or more output rows after being processed by f.

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
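Again as a plain-Python sketch (not Spark): each input element may expand to zero or more output elements, and the per-element results are flattened into a single stream.

```python
# Plain-Python sketch (not Spark) of flatMap's semantics.
from itertools import chain

partition = ["a b", "", "c d e"]
f = lambda line: line.split()           # one row in, zero or more rows out

flat = list(chain.from_iterable(map(f, partition)))  # like iter.flatMap(f)
```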

3) The filter function removes the data that does not satisfy the predicate and returns a new RDD.

def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
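A plain-Python sketch (not Spark) of the behaviour: elements failing the predicate are dropped, and each partition is filtered in place, which is why partitioning can be preserved.

```python
# Plain-Python sketch (not Spark) of filter's semantics.
partitions = [[1, 2, 3], [4, 5, 6]]
pred = lambda x: x % 2 == 0             # keep only even elements

filtered = [list(filter(pred, p)) for p in partitions]
```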

4) The distinct function removes duplicate elements and returns a new RDD containing only the distinct elements.

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}

The process is: map each element x to the pair (x, null), merge pairs with equal keys via reduceByKey, then map each surviving pair back to its key.

5) The repartition function repartitions the RDD and returns a new RDD.

This method increases or decreases the degree of parallelism of the RDD; in effect it redistributes the data through a shuffle.

If you only want to reduce the number of partitions, consider using the coalesce function instead, to avoid the shuffle.

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

6) The coalesce function repartitions the RDD and returns a new RDD.

This operation creates a narrow dependency. For example, merging 1000 partitions into 100 involves no shuffle: each of the 100 new partitions simply takes over 10 of the original partitions.

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    // start from a random partition and distribute the data evenly over the new partitions
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
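The shuffle branch can be sketched in plain Python (not Spark): each old partition starts at a pseudo-random position seeded by its index and then round-robins its elements, so that after hash-partitioning on the key the elements spread evenly over the new partitions.

```python
# Plain-Python sketch (not Spark) of distributePartition + HashPartitioner.
import random

num_partitions = 3
old_partitions = [["a", "b", "c"], ["d", "e"]]

keyed = []
for index, items in enumerate(old_partitions):
    position = random.Random(index).randrange(num_partitions)  # random start
    for t in items:
        position += 1                    # round-robin key per element
        keyed.append((position, t))

# HashPartitioner: new partition = key mod numPartitions
new_partitions = [[] for _ in range(num_partitions)]
for pos, t in keyed:
    new_partitions[pos % num_partitions].append(t)
```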

7) The sample function returns a random sample of the RDD's data.

def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = withScope {
  require(fraction >= 0.0, "Negative fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
  }
}
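The without-replacement (Bernoulli) case can be sketched in plain Python (not Spark): each element is kept independently with probability `fraction`, preserving the original order.

```python
# Plain-Python sketch (not Spark) of Bernoulli sampling without replacement.
import random

def bernoulli_sample(data, fraction, seed):
    rng = random.Random(seed)           # deterministic given the seed
    return [x for x in data if rng.random() < fraction]

sampled = bernoulli_sample(list(range(100)), fraction=0.2, seed=42)
```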

8) The sortBy function sorts the RDD by the given key function. Its Scaladoc describes it as returning "this RDD sorted", but as the implementation below shows, it chains keyBy, sortByKey, and values, each of which produces a new RDD.

def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}
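A plain-Python sketch (not Spark) of the keyBy, sortByKey, values chain:

```python
# Plain-Python sketch (not Spark) of sortBy = keyBy -> sortByKey -> values.
data = ["bb", "a", "ccc"]
f = len                                  # the key function

keyed = [(f(t), t) for t in data]        # keyBy(f)
keyed.sort(key=lambda kv: kv[0])         # sortByKey(ascending = true)
values = [t for _, t in keyed]           # .values
```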

9) The glom function collects the elements of each partition into an array and returns a new RDD of those arrays.

def glom(): RDD[Array[T]] = withScope {
  new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
}
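A plain-Python sketch (not Spark): each partition's iterator is drained into one array, so the result has exactly one element (the array) per partition.

```python
# Plain-Python sketch (not Spark) of glom's semantics.
partitions = [iter([1, 2]), iter([3]), iter([])]
glommed = [list(p) for p in partitions]  # Iterator(iter.toArray) per partition
```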

10) The groupBy function groups together the elements that share the same computed key and returns an RDD of (key, values) pairs.

This operation can be expensive. If you are grouping in order to compute a per-key sum or mean, PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will give much better performance.

def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])] = withScope {
  val cleanF = sc.clean(f)
  this.map(t => (cleanF(t), t)).groupByKey(p)
}
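A plain-Python sketch (not Spark): map each element to (f(element), element), then gather the values that share a key.

```python
# Plain-Python sketch (not Spark) of groupBy = map(t => (f(t), t)).groupByKey.
data = [1, 2, 3, 4, 5]
f = lambda x: x % 2                      # the grouping key function

groups = {}
for t in data:
    groups.setdefault(f(t), []).append(t)
```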

(7) Action operations trigger job execution and return a result to the user program.

1) The foreach function applies f to every element of the RDD.

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

2) The foreachPartition function applies f once to each partition's iterator. This lets a whole partition share one resource, such as a single database connection.

def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
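A plain-Python sketch (not Spark) of the one-connection-per-partition pattern this enables. FakeConnection is a hypothetical stand-in for a real database client:

```python
# Plain-Python sketch (not Spark): one shared resource per partition.
class FakeConnection:                    # hypothetical stand-in for a DB client
    def __init__(self):
        self.writes = []
    def write(self, x):
        self.writes.append(x)

connections = []

def handle_partition(iterator):          # the function passed to foreachPartition
    conn = FakeConnection()              # one connection for the whole partition
    connections.append(conn)
    for x in iterator:
        conn.write(x)

partitions = [[1, 2, 3], [4, 5]]
for p in partitions:                     # Spark would run this once per partition
    handle_partition(iter(p))
```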

3) The collect function returns all elements of the RDD as an array.

def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

4) The count function returns the number of elements in the RDD.

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

5) The take function returns the first num elements of the RDD. It first reads one partition and, if that does not yield enough elements, reads further partitions.

def take(num: Int): Array[T] = withScope {
  if (num == 0) {
    new Array[T](0)
  } else {
    val buf = new ArrayBuffer[T]
    val totalParts = this.partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      var numPartsToTry = 1
      if (partsScanned > 0) {
        if (buf.size == 0) {
          numPartsToTry = partsScanned * 4
        } else {
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * 4)
        }
      }
      val left = num - buf.size
      val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
      res.foreach(buf ++= _.take(num - buf.size))
      partsScanned += numPartsToTry
    }
    buf.toArray
  }
}
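The scan strategy can be reproduced in plain Python (not Spark): read one partition first, then escalate how many partitions to try (capped at 4x the partitions already scanned) until num elements have been collected.

```python
# Plain-Python sketch (not Spark) of take's escalating partition scan.
def take(partitions, num):
    if num == 0:
        return []
    buf, total, scanned = [], len(partitions), 0
    while len(buf) < num and scanned < total:
        to_try = 1
        if scanned > 0:
            if not buf:                              # nothing found yet: be aggressive
                to_try = scanned * 4
            else:                                    # extrapolate from the hit rate
                to_try = max(int(1.5 * num * scanned / len(buf)) - scanned, 1)
                to_try = min(to_try, scanned * 4)
        left = num - len(buf)
        res = [p[:left] for p in partitions[scanned:min(scanned + to_try, total)]]
        for r in res:                                # like res.foreach(buf ++= _.take(...))
            buf.extend(r[:num - len(buf)])
        scanned += to_try
    return buf[:num]
```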

6) The first function returns the first element of the RDD; it is effectively a take(1).

def first(): T = withScope {
  take(1) match {
    case Array(t) => t
    case _ => throw new UnsupportedOperationException("empty collection")
  }
}

7) The top function returns the top k elements of the RDD, sorted implicitly by Ordering[T] in descending order, i.e. the opposite of takeOrdered.

def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  takeOrdered(num)(ord.reverse)
}
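A plain-Python sketch (not Spark) of the relationship: top(num) is takeOrdered(num) under the reversed ordering, i.e. the num largest elements in descending order.

```python
# Plain-Python sketch (not Spark): top = takeOrdered under reversed ordering.
import heapq

data = [5, 1, 9, 3, 7]
top2 = heapq.nlargest(2, data)           # like takeOrdered(2)(ord.reverse)
```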

8) The saveAsTextFile function saves the RDD as a text file.

def saveAsTextFile(path: String): Unit = withScope {
  val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
  val textClassTag = implicitly[ClassTag[Text]]
  val r = this.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      (NullWritable.get(), text)
    }
  }
  RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}

9) The saveAsObjectFile function serializes the elements of the RDD and saves them to a file.

def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}

(8) Implicit conversions

Many implicit conversion functions are defined in the RDD companion object; they provide additional functionality that the RDD class does not have by itself.

For example, an RDD of key-value pairs is implicitly converted to PairRDDFunctions, which gives it reduceByKey and other pair-specific functions.

implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd)
}
