Examples of using Spark operators

Source: Internet
Author: User
Tags: foreach, arrays, split
1. Operator Classification

At a high level, Spark operators can be divided into two kinds:

Transformation: lazily evaluated. Converting one RDD into another RDD does not execute immediately; the computation only runs once an action triggers it.

Action: triggers the submission of a Spark job and outputs data out of the Spark system (for example, collecting results to the driver or saving them to storage).
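For example, a minimal Scala sketch of this lazy behaviour, assuming an existing SparkContext named sc:

    // Transformation: map only records the lineage; nothing is computed yet
    val squared = sc.parallelize(1 to 5).map(n => n * n)
    // Action: count submits a job, which finally executes the map
    println(squared.count())                   // 5
    println(squared.collect().mkString(", "))  // 1, 4, 9, 16, 25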

More specifically, Spark operators can be divided into three categories: transformation operators for Value data types, transformation operators for Key-Value data types, and action operators.

1.1 Transformation operators for Value data types

Type: Operators
One-to-one (input partition to output partition): map, flatMap, mapPartitions, glom
Many-to-one (input partitions to output partition): union, cartesian
Many-to-many (input partitions to output partitions): groupBy
Output partition is a subset of the input partition: filter, distinct, subtract, sample, takeSample
Cache type: cache, persist
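As a brief sketch, assuming an existing SparkContext named sc, two operators from this table that do not appear in the examples below, glom and distinct:

    // glom (one-to-one): turns each partition into an Array of its elements
    sc.parallelize(1 to 6, 2).glom().collect()               // Array(Array(1, 2, 3), Array(4, 5, 6))
    // distinct (output is a subset of the input): removes duplicate elements
    sc.parallelize(Seq(1, 2, 2, 3, 3)).distinct().collect()  // Array(1, 2, 3), order may vary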
1.2 Transformation operators for Key-Value data types
Type: Operators
One-to-one (input partition to output partition): mapValues
On a single RDD: combineByKey, reduceByKey, partitionBy
Aggregation of two RDDs: cogroup
Join: join, leftOuterJoin, rightOuterJoin
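As a brief sketch, assuming an existing SparkContext named sc, reduceByKey on a single key-value RDD and join connecting two key-value RDDs:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // reduceByKey: aggregates the values of a single key-value RDD per key
    pairs.reduceByKey(_ + _).collect()                        // Array((a,4), (b,2)), order may vary
    // join: connects two key-value RDDs on their keys
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
    pairs.join(other).collect()                               // Array((a,(1,x)), (a,(3,x)), (b,(2,y)))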
1.3 Action operator
Type: Operators
No output: foreach
HDFS: saveAsTextFile, saveAsObjectFile
Scala collections and data types: collect, collectAsMap, reduceByKeyLocally, lookup, count, top, reduce, fold, aggregate
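As a brief sketch, assuming an existing SparkContext named sc, a few of these actions on a small RDD:

    val nums = sc.parallelize(Seq(3, 1, 4, 1, 5))
    nums.count()          // 5
    nums.reduce(_ + _)    // 14
    nums.top(2)           // Array(5, 4)
    nums.collect()        // Array(3, 1, 4, 1, 5)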
2. Transformation

2.1 map

2.1.1 Overview

Syntax (Scala):

def map[U: ClassTag](f: T => U): RDD[U]

Description

Converts each data item of the original RDD into a new element by applying the user-defined function f.

2.1.2 Java Example

/**
 * map operator
 * <p>
 * map and foreach operators:
 * 1. Iterate over every element of the RDD;
 * 2. Invoke the call function on each element and return the result.
 * </p>
 */
private static void map() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
            .setMaster("local");

    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> datas = Arrays.asList(
            "{'id': 1, 'name': 'xl1', 'pwd': 'xl123', 'sex': 2}",
            "{'id': 2, 'name': 'xl2', 'pwd': 'xl123', 'sex': 1}",
            "{'id': 3, 'name': 'xl3', 'pwd': 'xl123', 'sex': 2}");

    JavaRDD<String> datasRDD = sc.parallelize(datas);
    JavaRDD<User> mapRDD = datasRDD.map(new Function<String, User>() {
        public User call(String v) throws Exception {
            Gson gson = new Gson();
            return gson.fromJson(v, User.class);
        }
    });
    mapRDD.foreach(new VoidFunction<User>() {
        public void call(User user) throws Exception {
            System.out.println("id: " + user.id + " name: " + user.name
                    + " pwd: " + user.pwd + " sex: " + user.sex);
        }
    });
    sc.close();
}
// Results
// id: 1 name: xl1 pwd: xl123 sex: 2
// id: 2 name: xl2 pwd: xl123 sex: 1
// id: 3 name: xl3 pwd: xl123 sex: 2
2.1.3 Scala Sample
private def map() {
    val conf = new SparkConf().setAppName(ScalaOperatorDemo.getClass.getSimpleName).setMaster("local")
    val sc = new SparkContext(conf)

    val datas: Array[String] = Array(
        "{'id': 1, 'name': 'xl1', 'pwd': 'xl123', 'sex': 2}",
        "{'id': 2, 'name': 'xl2', 'pwd': 'xl123', 'sex': 1}",
        "{'id': 3, 'name': 'xl3', 'pwd': 'xl123', 'sex': 2}")

    sc.parallelize(datas)
        .map(v => {
            new Gson().fromJson(v, classOf[User])
        })
        .foreach(user => {
            println("id: " + user.id
                + " name: " + user.name
                + " pwd: " + user.pwd
                + " sex: " + user.sex)
        })
}
2.2 filter

2.2.1 Overview

Syntax (Scala):

def filter(f: T => Boolean): RDD[T]

Description

Applies the function f to each element and keeps, in the returned RDD, the elements for which f returns true; elements for which f returns false are filtered out.

2.2.2 Java Example

static void filter() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
            .setMaster("local");

    JavaSparkContext sc = new JavaSparkContext(conf);

    List<Integer> datas = Arrays.asList(1, 2, 3, 7, 4, 5, 8);
    JavaRDD<Integer> rddData = sc.parallelize(datas);
    JavaRDD<Integer> filterRDD = rddData.filter(
            // JDK 1.8 lambda form: v1 -> v1 >= 3
            new Function<Integer, Boolean>() {
                public Boolean call(Integer v) throws Exception {
                    return v >= 3;
                }
            }
    );
    filterRDD.foreach(
            // JDK 1.8 lambda form: v -> System.out.println(v)
            new VoidFunction<Integer>() {
                @Override
                public void call(Integer integer) throws Exception {
                    System.out.println(integer);
                }
            }
    );
    sc.close();
}
// Results: 3 7 4 5 8
2.2.3 Scala Sample
def filter {
    val conf = new SparkConf().setAppName(ScalaOperatorDemo.getClass.getSimpleName).setMaster("local")
    val sc = new SparkContext(conf)

    val datas = Array(1, 2, 3, 7, 4, 5, 8)

    sc.parallelize(datas)
        .filter(v => v >= 3)
        .foreach(println)
}
2.3 flatMap

2.3.1 Overview

Syntax (Scala):

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

Description

Similar to map, but each input element can produce 0 or more output elements.

2.3.2 Java Sample

static void flatMap() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
            .setMaster("local");

    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> data = Arrays.asList("aa,bb,cc", "cxf,spring,struts2", "java,c++,javascript");
    JavaRDD<String> rddData = sc.parallelize(data);
    JavaRDD<String> flatMapData = rddData.flatMap(v -> Arrays.asList(v.split(",")).iterator()
            // Equivalent anonymous-class form:
            // new FlatMapFunction<String, String>() {
            //     @Override
            //     public Iterator<String> call(String t) throws Exception {
            //         List<String> list = Arrays.asList(t.split(","));
            //         return list.iterator();
            //     }
            // }
    );

    flatMapData.foreach(v -> System.out.println(v));
    sc.close();
}
// Results: aa bb cc cxf spring struts2 java c++ javascript
2.3.3 Scala Sample
sc.parallelize(datas)
        .flatMap(line => line.split(","))
        .foreach(println)
2.4 mapPartitions

2.4.1 Overview

Syntax (Scala):

def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U]

Description

Similar to map, but whereas the function passed to map is applied to each element of the RDD, the function passed to mapPartitions is applied to an entire partition of the RDD. The type of f is therefore Iterator[T] => Iterator[U], where T is the element type of the input RDD. preservesPartitioning indicates whether the input function preserves the partitioner; it is false by default.

2.4.2 Java Example

static void mapPartitions() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
            .setMaster("local");

    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> names = Arrays.asList("Zhang San 1", "Li Si 1", "Wang Wu 1", "Zhang San 2", "Li Si 2",
            "Wang Wu 2", "Zhang San 3", "Li Si 3", "Wang Wu 3", "Zhang San 4");
    JavaRDD<String> namesRDD = sc.parallelize(names, 3);
    JavaRDD<String> mapPartitionsRDD = namesRDD.mapPartitions(
            new FlatMapFunction<Iterator<String>, String>() {
                int count = 0;

                @Override
                public Iterator<String> call(Iterator<String> stringIterator) throws Exception {
                    List<String> list = new ArrayList<String>();
                    while (stringIterator.hasNext()) {
                        list.add("Partition index: " + count++ + "\t" + stringIterator.next());
                    }
                    return list.iterator();
                }
            }
    );
    // Fetch data from the cluster to local memory
    List<String> result = mapPartitionsRDD.collect();
    result.forEach(System.out::println);
    sc.close();
}
// Results (the counter restarts at 0 in each of the three partitions)
// Partition index: 0  Zhang San 1   Partition index: 1  Li Si 1    Partition index: 2  Wang Wu 1
// Partition index: 0  Zhang San 2   Partition index: 1  Li Si 2    Partition index: 2  Wang Wu 2
// Partition index: 0  Zhang San 3   Partition index: 1  Li Si 3    Partition index: 2  Wang Wu 3   Partition index: 3  Zhang San 4
2.4.3 Scala Sample
sc.parallelize(datas, 3)
        .mapPartitions(
            n => {
                val result = ArrayBuffer[String]()
                while (n.hasNext) {
                    result.append(n.next())
                }
                result.iterator
            }
        )
        .foreach(println)
2.5 mapPartitionsWithIndex

2.5.1 Overview

Syntax (Scala):

def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U]

Description

Similar to mapPartitions, but the function additionally receives an integer giving the index of the partition, so the type of f is (Int, Iterator[T]) => Iterator[U].

2.5.2 Java Example

private static void mapPartitionsWithIndex() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
            .setMaster("local");

    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> names = Arrays.asList("Zhang San 1", "Li Si 1", "Wang Wu 1", "Zhang San 2", "Li Si 2",
            "Wang Wu 2", "Zhang San 3", "Li Si 3", "Wang Wu 3", "Zhang San 4");
    // Initialize the RDD with 3 partitions
    JavaRDD<String> namesRDD = sc.parallelize(names, 3);
    JavaRDD<String> mapPartitionsWithIndexRDD = namesRDD.mapPartitionsWithIndex(
            new Function2<Integer, Iterator<String>, Iterator<String>>() {
                private static final long serialVersionUID = 1L;

                public Iterator<String> call(Integer v1, Iterator<String> v2) throws Exception {
                    List<String> list = new ArrayList<String>();
                    while (v2.hasNext()) {
                        list.add("Partition index: " + v1 + "\t" + v2.next());
                    }
                    return list.iterator();
                }
            }, true);
    // Fetch data from the cluster to local memory
    List<String> result = mapPartitionsWithIndexRDD.collect();
    result.forEach(System.out::println);
    sc.close();
}
// Results
// Partition index: 0  Zhang San 1   Partition index: 0  Li Si 1    Partition index: 0  Wang Wu 1
// Partition index: 1  Zhang San 2   Partition index: 1  Li Si 2    Partition index: 1  Wang Wu 2
// Partition index: 2  Zhang San 3   Partition index: 2  Li Si 3    Partition index: 2  Wang Wu 3   Partition index: 2  Zhang San 4
2.5.3 Scala Sample
sc.parallelize(datas, 3)
        .mapPartitionsWithIndex(
            (m, n) => {
                val result = ArrayBuffer[String]()
                while (n.hasNext) {
                    result.append("Partition index: " + m + "\t" + n.next())
                }
                result.iterator
            }
        )
        .foreach(println)
2.6 sample

2.6.1 Overview

Syntax (Scala):

def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T]

Description

Samples the RDD. withReplacement = true means sampling with replacement, so the same element can be drawn more than once; false means sampling without replacement. fraction is the sampling fraction, and seed is the random-number seed, for example the current timestamp.

2.6.2 Java Example

static void sample() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
            .setMaster("local");

    JavaSparkContext sc = new JavaSparkContext(conf);

    List<Integer> datas = Arrays.asList(1, 2, 3, 7, 4, 5, 8);

    JavaRDD<Integer> dataRDD = sc.parallelize(datas);
    JavaRDD<Integer> sampleRDD = dataRDD.sample(false, 0.5, System.currentTimeMillis());
    sampleRDD.foreach(v -> System.out.println(v));

    sc.close();
}

Results
7
4
5
2.6.3 Scala Sample
sc.parallelize(datas)
        .sample(withReplacement = false, 0.5, System.currentTimeMillis)
        .foreach(println)
2.7 union

2.7.1 Overview

Syntax (Scala):

def union(other: RDD[T]): RDD[T]

Description

Merges two RDDs without removing duplicates; the two RDDs must have the same element type.

2.7.2 Java Example

static void union() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
            .setMaster("local");

    JavaSparkContext sc = new JavaSparkContext(conf);

    List<String> datas1 = Arrays.asList
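A minimal Scala sketch of union, using two small assumed string RDDs and an existing SparkContext named sc:

    val rdd1 = sc.parallelize(Seq("aa", "bb", "cc"))
    val rdd2 = sc.parallelize(Seq("cc", "dd"))
    rdd1.union(rdd2).collect()   // Array(aa, bb, cc, cc, dd): duplicates are kept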
