1. Operator Classification
Broadly speaking, Spark operators can be divided into two types:

- Transformation: lazily evaluated. Converting one RDD into another RDD is not executed immediately; the computation only runs once an action triggers it.
- Action: triggers submission of a Spark job and outputs the data out of the Spark system.
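For example, the following minimal Scala sketch (not part of the original article; it assumes an existing SparkContext named sc) shows that transformations only record the lineage, and nothing runs until an action such as collect is called:

// A minimal sketch of lazy evaluation; assumes an existing SparkContext `sc`.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
// Transformations: nothing is computed yet, only the lineage is recorded.
val doubled = numbers.map(_ * 2)
val large = doubled.filter(_ > 4)
// Action: submits a job and actually executes the two transformations above.
val result = large.collect()   // Array(6, 8, 10)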
More specifically, Spark operators can be divided into three categories:

- Transformation operators on value data types
- Transformation operators on key-value data types
- Action operators

1.1 Transformation Operators for Value Data Types
| Type | Operators |
| --- | --- |
| Input and output partitions are one-to-one | map, flatMap, mapPartitions, glom |
| Input and output partitions are many-to-one | union, cartesian |
| Input and output partitions are many-to-many | groupBy |
| The output partition is a subset of the input partition | filter, distinct, subtract, sample, takeSample |
| Cache type | cache, persist |
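A few of the value-type operators above that are not covered in detail later can be illustrated with a minimal Scala sketch (not part of the original article; it assumes an existing SparkContext sc, and the commented results are illustrative since ordering may vary):

// Minimal sketch of some value-type transformation operators.
val a = sc.parallelize(Seq(1, 2, 2, 3), 2)
val b = sc.parallelize(Seq(2, 3, 4))
a.distinct().collect()     // Array(1, 2, 3)  - duplicates removed
a.subtract(b).collect()    // Array(1)        - elements of a that are not in b
a.cartesian(b).count()     // 12              - 4 * 3 pairs
a.glom().collect()         // Array(Array(1, 2), Array(2, 3)) - one array per partition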
1.2 Transformation Operators for Key-Value Data Types
| Type | Operators |
| --- | --- |
| Input and output partitions are one-to-one | mapValues |
| On a single RDD | combineByKey, reduceByKey, partitionBy |
| Aggregation of two RDDs | cogroup |
| Join | join, leftOuterJoin, rightOuterJoin |
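As a quick illustration (a minimal sketch, not part of the original article; it assumes an existing SparkContext sc, and result ordering may vary), a few of these key-value operators behave as follows:

// Minimal sketch of some key-value transformation operators.
val scores = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val names = sc.parallelize(Seq(("a", "Ann"), ("b", "Bob")))
scores.mapValues(_ * 10).collect()    // Array((a,10), (b,20), (a,30))
scores.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
scores.join(names).collect()          // Array((a,(1,Ann)), (a,(3,Ann)), (b,(2,Bob)))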
1.3 Action Operators
| Type | Operators |
| --- | --- |
| No output | foreach |
| Output to HDFS | saveAsTextFile, saveAsObjectFile |
| Output to Scala collections or data types | collect, collectAsMap, reduceByKeyLocally, lookup, count, top, reduce, fold, aggregate |
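A few of these action operators can be illustrated with a minimal Scala sketch (not part of the original article; it assumes an existing SparkContext sc, and the output path is only an example):

// Minimal sketch of some action operators.
val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5))
rdd.count()          // 5                     - number of elements, returned to the driver
rdd.collect()        // Array(3, 1, 4, 1, 5)  - all elements, returned to the driver
rdd.reduce(_ + _)    // 14                    - aggregate with the given function
rdd.top(2)           // Array(5, 4)           - the two largest elements
rdd.saveAsTextFile("/tmp/action-demo")        // writes the RDD to storage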
2. Transformation
2.1 Map
2.1.1 Overview
Syntax (Scala):
def map[U: ClassTag](f: T => U): RDD[U]
Description
Converts each element of the original RDD into a new element by applying the user-defined function f.

2.1.2 Java Example
/**
 * map operator
 * map and foreach: loop over every element, invoke the call function on it, and return the result.
 */
private static void map() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> datas = Arrays.asList(
            "{'id':1,'name':'xl1','pwd':'xl123','sex':2}",
            "{'id':2,'name':'xl2','pwd':'xl123','sex':1}",
            "{'id':3,'name':'xl3','pwd':'xl123','sex':2}");
    JavaRDD<String> datasRDD = sc.parallelize(datas);
    JavaRDD<User> mapRDD = datasRDD.map(new Function<String, User>() {
        public User call(String v) throws Exception {
            Gson gson = new Gson();
            return gson.fromJson(v, User.class);
        }
    });
    mapRDD.foreach(new VoidFunction<User>() {
        public void call(User user) throws Exception {
            System.out.println("id: " + user.id + " name: " + user.name + " pwd: " + user.pwd + " sex: " + user.sex);
        }
    });
    sc.close();
}
// Results:
// id:1 name:xl1 pwd:xl123 sex:2
// id:2 name:xl2 pwd:xl123 sex:1
// id:3 name:xl3 pwd:xl123 sex:2
2.1.3 Scala Sample
private def map(): Unit = {
  val conf = new SparkConf().setAppName(ScalaOperatorDemo.getClass.getSimpleName).setMaster("local")
  val sc = new SparkContext(conf)
  val datas: Array[String] = Array(
    "{'id':1,'name':'xl1','pwd':'xl123','sex':2}",
    "{'id':2,'name':'xl2','pwd':'xl123','sex':1}",
    "{'id':3,'name':'xl3','pwd':'xl123','sex':2}")
  sc.parallelize(datas)
    .map(v => new Gson().fromJson(v, classOf[User]))
    .foreach(user => println("id: " + user.id
      + " name: " + user.name
      + " pwd: " + user.pwd
      + " sex: " + user.sex))
}
2.2 Filter
2.2.1 Overview
Syntax (Scala):
def filter(f: T => Boolean): RDD[T]
Description
Filters the elements: the function f is applied to each element, elements for which f returns true are kept in the new RDD, and elements for which f returns false are filtered out.

2.2.2 Java Example
static void filter() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Integer> datas = Arrays.asList(1, 2, 3, 7, 4, 5, 8);
    JavaRDD<Integer> rddData = sc.parallelize(datas);
    JavaRDD<Integer> filterRDD = rddData.filter(
            // JDK 1.8: v1 -> v1 >= 3
            new Function<Integer, Boolean>() {
                public Boolean call(Integer v) throws Exception {
                    return v >= 3;
                }
            }
    );
    filterRDD.foreach(
            // JDK 1.8: v -> System.out.println(v)
            new VoidFunction<Integer>() {
                @Override
                public void call(Integer integer) throws Exception {
                    System.out.println(integer);
                }
            }
    );
    sc.close();
}
// Results: 3 7 4 5 8
2.2.3 Scala Sample
def filter(): Unit = {
  val conf = new SparkConf().setAppName(ScalaOperatorDemo.getClass.getSimpleName).setMaster("local")
  val sc = new SparkContext(conf)
  val datas = Array(1, 2, 3, 7, 4, 5, 8)
  sc.parallelize(datas)
    .filter(v => v >= 3)
    .foreach(println)
}
2.3 FlatMap
2.3.1 Overview
Syntax (Scala):
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
Description
Similar to map, but each input element of the RDD can produce 0 or more output elements.
2.3.2 Java Sample
static void flatMap() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> data = Arrays.asList("aa,bb,cc", "cxf,spring,struts2", "java,c++,javascript");
    JavaRDD<String> rddData = sc.parallelize(data);
    JavaRDD<String> flatMapData = rddData.flatMap(v -> Arrays.asList(v.split(",")).iterator()
            // Pre-JDK-1.8 equivalent:
            // new FlatMapFunction<String, String>() {
            //     @Override
            //     public Iterator<String> call(String t) throws Exception {
            //         List<String> list = Arrays.asList(t.split(","));
            //         return list.iterator();
            //     }
            // }
    );
    flatMapData.foreach(v -> System.out.println(v));
    sc.close();
}
// Results: aa bb cc cxf spring struts2 java c++ javascript
2.3.3 Scala Sample
// datas: the comma-separated strings defined in the Java example above
sc.parallelize(datas)
  .flatMap(line => line.split(","))
  .foreach(println)
2.4 MapPartitions
2.4.1 Overview
Syntax (Scala):
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]
Description
Similar to map, but while the function passed to map is applied to each element of the RDD, the function passed to mapPartitions is applied to an entire partition at a time. The type of f is therefore Iterator[T] => Iterator[U], where T is the element type of the input RDD. preservesPartitioning indicates whether the input partitioner should be preserved; it defaults to false.

2.4.2 Java Example
static void mapPartitions() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> names = Arrays.asList("Zhang 31", "Li 41", "Wang 51", "Zhang 32", "Li 42",
            "Wang 52", "Zhang 33", "Li 43", "Wang 53", "Zhang 34");
    JavaRDD<String> namesRDD = sc.parallelize(names, 3);
    JavaRDD<String> mapPartitionsRDD = namesRDD.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
        int count = 0;
        @Override
        public Iterator<String> call(Iterator<String> stringIterator) throws Exception {
            List<String> list = new ArrayList<String>();
            while (stringIterator.hasNext()) {
                list.add("Partition index: " + count++ + "\t" + stringIterator.next());
            }
            return list.iterator();
        }
    });
    // Fetch the data from the cluster into local memory
    List<String> result = mapPartitionsRDD.collect();
    result.forEach(System.out::println);
    sc.close();
}
// Results (the counter restarts in each partition):
// Partition index: 0  Zhang 31    Partition index: 1  Li 41    Partition index: 2  Wang 51
// Partition index: 0  Zhang 32    Partition index: 1  Li 42    Partition index: 2  Wang 52
// Partition index: 0  Zhang 33    Partition index: 1  Li 43    Partition index: 2  Wang 53    Partition index: 3  Zhang 34
2.4.3 Scala Sample
// datas: the list of names defined in the Java example above
// requires: import scala.collection.mutable.ArrayBuffer
sc.parallelize(datas, 3)
  .mapPartitions(
    n => {
      val result = ArrayBuffer[String]()
      while (n.hasNext) {
        result.append(n.next())
      }
      result.iterator
    }
  )
  .foreach(println)
2.5 MapPartitionsWithIndex
2.5.1 Overview
Syntax (Scala):
def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]
Description
Similar to mapPartitions, but the function additionally receives an integer representing the index of the partition, so the type of f is (Int, Iterator[T]) => Iterator[U].

2.5.2 Java Example
private static void mapPartitionsWithIndex() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> names = Arrays.asList("Zhang 31", "Li 41", "Wang 51", "Zhang 32", "Li 42",
            "Wang 52", "Zhang 33", "Li 43", "Wang 53", "Zhang 34");
    // Initialize with 3 partitions
    JavaRDD<String> namesRDD = sc.parallelize(names, 3);
    JavaRDD<String> mapPartitionsWithIndexRDD = namesRDD.mapPartitionsWithIndex(
            new Function2<Integer, Iterator<String>, Iterator<String>>() {
                private static final long serialVersionUID = 1L;
                public Iterator<String> call(Integer v1, Iterator<String> v2) throws Exception {
                    List<String> list = new ArrayList<String>();
                    while (v2.hasNext()) {
                        list.add("Partition index: " + v1 + "\t" + v2.next());
                    }
                    return list.iterator();
                }
            }, true);
    // Fetch the data from the cluster into local memory
    List<String> result = mapPartitionsWithIndexRDD.collect();
    result.forEach(System.out::println);
    sc.close();
}
// Results:
// Partition index: 0  Zhang 31    Partition index: 0  Li 41    Partition index: 0  Wang 51
// Partition index: 1  Zhang 32    Partition index: 1  Li 42    Partition index: 1  Wang 52
// Partition index: 2  Zhang 33    Partition index: 2  Li 43    Partition index: 2  Wang 53    Partition index: 2  Zhang 34
2.5.3 Scala Sample
// datas: the list of names defined in the Java example above
// requires: import scala.collection.mutable.ArrayBuffer
sc.parallelize(datas, 3)
  .mapPartitionsWithIndex(
    (m, n) => {
      val result = ArrayBuffer[String]()
      while (n.hasNext) {
        result.append("Partition index: " + m + "\t" + n.next())
      }
      result.iterator
    }
  )
  .foreach(println)
2.6 Sample
2.6.1 Overview
Syntax (Scala):
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T]
Description
Samples the RDD. withReplacement = true means sampling with replacement (an element can be sampled more than once); false means sampling without replacement. fraction is the sampling fraction, and seed is the random seed, for example the current timestamp.

2.6.2 Java Example
static void sample() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Integer> datas = Arrays.asList(1, 2, 3, 7, 4, 5, 8);
    JavaRDD<Integer> dataRDD = sc.parallelize(datas);
    JavaRDD<Integer> sampleRDD = dataRDD.sample(false, 0.5, System.currentTimeMillis());
    sampleRDD.foreach(v -> System.out.println(v));
    sc.close();
}
Results (the output varies with the random seed):
7
4
5
2.6.3 Scala Sample
sc.parallelize(datas)
  .sample(withReplacement = false, fraction = 0.5, seed = System.currentTimeMillis)
  .foreach(println)
2.7 Union
2.7.1 Overview
Syntax (Scala):
def union(other: RDD[T]): RDD[T]
Description
Merges two RDDs without removing duplicates; the two RDDs must have the same element type.

2.7.2 Java Example
static void union() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> datas1 = Arrays.asList