(6) Transformation operations express the internal data-processing steps through different external RDD representations. Operations of this type do not trigger job execution and are therefore often referred to as lazy operations.
Most transformations generate and return a new RDD, although not every one does; sortByKey, for example, is described below as not producing a new RDD.
1) The map function: each input row produces exactly one output row after being processed by the map function.
It applies the function f to every element of the RDD and returns a new RDD:
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  // the function is applied to every partition of the parent RDD
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
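A minimal usage sketch (assuming sc is an existing SparkContext):
val nums = sc.parallelize(Seq(1, 2, 3))
val doubled = nums.map(x => x * 2)   // new RDD containing 2, 4, 6: one output element per input element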
2) The flatMap function: similar to map, except that one input row can produce multiple output rows after being processed by the flatMap function.
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
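For comparison with map, a small sketch (sc is assumed to be an existing SparkContext):
val lines = sc.parallelize(Seq("a b", "c d e"))
val words = lines.flatMap(line => line.split(" "))   // 5 elements: one input line yields several output elements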
3) The filter function: filters out elements that do not satisfy the predicate and returns a new RDD.
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
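A short example (assuming sc is an existing SparkContext):
val evens = sc.parallelize(1 to 10).filter(_ % 2 == 0)   // keeps only the elements satisfying the predicate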
4) The distinct function: removes duplicate elements, keeps only the distinct ones, and returns a new RDD.
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
The process is as follows: each element x is first mapped to the pair (x, null), reduceByKey then keeps a single pair per key, and the final map extracts the keys.
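A small sketch of the behaviour (sc is assumed to be an existing SparkContext):
val deduped = sc.parallelize(Seq(1, 1, 2, 3, 3)).distinct(2)   // contains 1, 2, 3 spread over 2 partitions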
5) The repartition function: repartitions the RDD and returns a new RDD.
This method is used to increase or decrease the degree of parallelism of the RDD; it redistributes the data through a shuffle.
If you only want to reduce the number of partitions, consider using the coalesce function instead to avoid the shuffle.
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
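A brief sketch (assuming sc is an existing SparkContext):
val rdd = sc.parallelize(1 to 100, 4)
val wider = rdd.repartition(8)   // increases parallelism from 4 to 8 partitions; always shuffles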
6) The coalesce function: repartitions the RDD and returns a new RDD.
When shuffle is false this operation is a narrow dependency: for example, merging 1000 partitions into 100 partitions involves no shuffle; each of the 100 new partitions simply takes over 10 of the original partitions.
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    // start from a random partition and distribute the data evenly over the new partitions
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
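A short sketch contrasting the two modes (sc is assumed to be an existing SparkContext):
val rdd = sc.parallelize(1 to 1000, 100)
val merged = rdd.coalesce(10)                    // narrow dependency: each new partition absorbs 10 old ones
val balanced = rdd.coalesce(10, shuffle = true)  // with shuffle, equivalent to repartition(10)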
7) The sample function: randomly returns a sample of the data in the RDD.
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = withScope {
  require(fraction >= 0.0, "Negative fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
  }
}
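A minimal sketch (assuming sc is an existing SparkContext):
val data = sc.parallelize(1 to 1000)
val sampled = data.sample(withReplacement = false, fraction = 0.1, seed = 42L)   // roughly 10% of the elements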
8) The sortBy function: sorts the RDD according to the given key function and returns the result; note that it does not create a new RDD, which also shows that not all transformations create a new RDD.
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
      .sortByKey(ascending, numPartitions)
      .values
}
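A quick sketch (sc is assumed to be an existing SparkContext):
val sorted = sc.parallelize(Seq(3, 1, 2)).sortBy(x => x, ascending = false)   // 3, 2, 1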
9) The glom function: merges the elements of each partition into an array and returns a new RDD.
def glom(): RDD[Array[T]] = withScope {
  new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
}
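A small sketch (assuming sc is an existing SparkContext):
val arrays = sc.parallelize(1 to 6, 3).glom()   // RDD[Array[Int]] with one array per partition
arrays.collect()                                // Array(Array(1, 2), Array(3, 4), Array(5, 6))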
10) The groupBy function: groups the elements by key and returns an RDD in which the values sharing the same key are combined.
This operation can be expensive; if you only need a sum or a mean per key, PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will perform better.
def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])] = withScope {
  val cleanF = sc.clean(f)
  this.map(t => (cleanF(t), t)).groupByKey(p)
}
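A sketch showing the grouping and the cheaper reduceByKey alternative (sc is assumed to be an existing SparkContext):
val byParity = sc.parallelize(1 to 6).groupBy(x => x % 2)                    // (0, [2, 4, 6]) and (1, [1, 3, 5])
val sums = sc.parallelize(1 to 6).map(x => (x % 2, x)).reduceByKey(_ + _)    // per-key sums without building the full groups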
(7) Action operations trigger the execution of a job and return the result to the user program.
1) The foreach function: applies the given function to every element of the RDD.
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
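A minimal sketch (assuming sc is an existing SparkContext); note that the function runs on the executors, so the output appears in the executor logs rather than on the driver:
sc.parallelize(1 to 3).foreach(x => println(x))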
2) The foreachPartition function: applies the given function to every partition of the RDD; for example, when writing to a database, all elements of one partition can share a single connection.
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
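A sketch of the one-connection-per-partition pattern; rdd is any existing RDD, and createConnection and save are hypothetical user-defined helpers shown only to illustrate the idea:
rdd.foreachPartition { iter =>
  val conn = createConnection()          // one connection per partition, not per element (hypothetical helper)
  iter.foreach(row => save(conn, row))   // hypothetical helper that writes one row
  conn.close()
}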
3) The collect function: returns all elements contained in the RDD as an array.
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
4) The count function: returns the number of elements in the RDD.
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
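A short sketch covering collect and count (sc is assumed to be an existing SparkContext):
val rdd = sc.parallelize(1 to 5)
rdd.collect()   // Array(1, 2, 3, 4, 5): all elements are pulled back to the driver
rdd.count()     // 5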
5) The take function: takes the first num elements of the RDD. It first reads one partition; if that does not yield enough elements, it reads further partitions.
def take(num: Int): Array[T] = withScope {
  if (num == 0) {
    new Array[T](0)
  } else {
    val buf = new ArrayBuffer[T]
    val totalParts = this.partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      // number of partitions to try in this iteration
      var numPartsToTry = 1
      if (partsScanned > 0) {
        // if the previous pass found nothing, quadruple the number of partitions to scan;
        // otherwise interpolate how many are needed, overestimating by 50% and capping the growth
        if (buf.size == 0) {
          numPartsToTry = partsScanned * 4
        } else {
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * 4)
        }
      }
      val left = num - buf.size
      val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
      res.foreach(buf ++= _.take(num - buf.size))
      partsScanned += numPartsToTry
    }
    buf.toArray
  }
}
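A brief sketch (assuming sc is an existing SparkContext):
sc.parallelize(1 to 100, 10).take(3)   // Array(1, 2, 3): partitions are scanned incrementally until 3 elements are found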
6) The first function: returns the first element of the RDD; it is effectively a take(1) operation.
def first(): T = withScope {
  take(1) match {
    case Array(t) => t
    case _ => throw new UnsupportedOperationException("empty collection")
  }
}
7) The top function: returns the top k elements of the RDD, sorted by the implicit Ordering[T] in descending order, i.e. exactly the opposite of takeOrdered.
def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  takeOrdered(num)(ord.reverse)
}
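A sketch contrasting top with takeOrdered (sc is assumed to be an existing SparkContext):
sc.parallelize(Seq(5, 1, 9, 3)).top(2)          // Array(9, 5): descending order
sc.parallelize(Seq(5, 1, 9, 3)).takeOrdered(2)  // Array(1, 3): ascending order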
8) The saveAsTextFile function: saves the RDD as a text file.
def saveAsTextFile(path: String): Unit = withScope {
  val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
  val textClassTag = implicitly[ClassTag[Text]]
  val r = this.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      (NullWritable.get(), text)
    }
  }
  RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}
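A minimal sketch; the output path is a hypothetical example and sc is assumed to be an existing SparkContext:
sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("/tmp/out-text")   // writes one part file per partition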
9) The saveAsObjectFile function: serializes the elements of the RDD and saves them to a file.
def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}
(8) Implicit conversions
Many implicit conversion functions are defined in the RDD companion object; they provide additional functionality that the RDD class does not have by itself.
For example, an RDD of pairs is implicitly converted to PairRDDFunctions, which gives it reduceByKey and other functions.
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd)
}
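A sketch showing the conversion at work (sc is assumed to be an existing SparkContext):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// reduceByKey is not defined on RDD itself; the implicit conversion to PairRDDFunctions supplies it
val sums = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 2)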
1.1 RDD Interpretation (II)