1. Operator Classification
Broadly speaking, Spark operators can be divided into two types:

- Transformation: lazily evaluated. Converting one RDD into another RDD is not executed immediately; the computation only runs once an action triggers it.
- Action: triggers submission of a Spark job and outputs the data out of the Spark system.
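For example, the following minimal Scala sketch (not part of the original article; it assumes an existing SparkContext named sc) shows that transformations only record the lineage, and nothing runs until an action such as collect is called:

// A minimal sketch of lazy evaluation; assumes an existing SparkContext `sc`.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
// Transformations: nothing is computed yet, only the lineage is recorded.
val doubled = numbers.map(_ * 2)
val large = doubled.filter(_ > 4)
// Action: submits a job and actually executes the two transformations above.
val result = large.collect()   // Array(6, 8, 10)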
More specifically, Spark operators can be divided into three categories:

- Transformation operators on value data types
- Transformation operators on key-value data types
- Action operators

1.1 Transformation Operators for Value Data Types
| Type | Operators |
| --- | --- |
| Input and output partitions are one-to-one | map, flatMap, mapPartitions, glom |
| Input and output partitions are many-to-one | union, cartesian |
| Input and output partitions are many-to-many | groupBy |
| The output partition is a subset of the input partition | filter, distinct, subtract, sample, takeSample |
| Cache type | cache, persist |
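A few of the value-type operators above that are not covered in detail later can be illustrated with a minimal Scala sketch (not part of the original article; it assumes an existing SparkContext sc, and the commented results are illustrative since ordering may vary):

// Minimal sketch of some value-type transformation operators.
val a = sc.parallelize(Seq(1, 2, 2, 3), 2)
val b = sc.parallelize(Seq(2, 3, 4))
a.distinct().collect()     // Array(1, 2, 3)  - duplicates removed
a.subtract(b).collect()    // Array(1)        - elements of a that are not in b
a.cartesian(b).count()     // 12              - 4 * 3 pairs
a.glom().collect()         // Array(Array(1, 2), Array(2, 3)) - one array per partition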
1.2 Transformation Operators for Key-Value Data Types
| Type | Operators |
| --- | --- |
| Input and output partitions are one-to-one | mapValues |
| On a single RDD | combineByKey, reduceByKey, partitionBy |
| Aggregation of two RDDs | cogroup |
| Join | join, leftOuterJoin, rightOuterJoin |
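As a quick illustration (a minimal sketch, not part of the original article; it assumes an existing SparkContext sc, and result ordering may vary), a few of these key-value operators behave as follows:

// Minimal sketch of some key-value transformation operators.
val scores = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val names = sc.parallelize(Seq(("a", "Ann"), ("b", "Bob")))
scores.mapValues(_ * 10).collect()    // Array((a,10), (b,20), (a,30))
scores.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
scores.join(names).collect()          // Array((a,(1,Ann)), (a,(3,Ann)), (b,(2,Bob)))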
1.3 Action Operators
| Type | Operators |
| --- | --- |
| No output | foreach |
| Output to HDFS | saveAsTextFile, saveAsObjectFile |
| Output to Scala collections or data types | collect, collectAsMap, reduceByKeyLocally, lookup, count, top, reduce, fold, aggregate |
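A few of these action operators can be illustrated with a minimal Scala sketch (not part of the original article; it assumes an existing SparkContext sc, and the output path is only an example):

// Minimal sketch of some action operators.
val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5))
rdd.count()          // 5                     - number of elements, returned to the driver
rdd.collect()        // Array(3, 1, 4, 1, 5)  - all elements, returned to the driver
rdd.reduce(_ + _)    // 14                    - aggregate with the given function
rdd.top(2)           // Array(5, 4)           - the two largest elements
rdd.saveAsTextFile("/tmp/action-demo")        // writes the RDD to storage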
2. Transformation
2.1 Map
2.1.1 Overview
Syntax (Scala):
def map[U: ClassTag](f: T => U): RDD[U]
Description
Converts each element of the original RDD into a new element by applying the user-defined function f.

2.1.2 Java Example
/**
 * map operator
 * map and foreach: loop over every element, invoke the call function on it, and return the result.
 */
private static void map() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> datas = Arrays.asList(
            "{'id':1,'name':'xl1','pwd':'xl123','sex':2}",
            "{'id':2,'name':'xl2','pwd':'xl123','sex':1}",
            "{'id':3,'name':'xl3','pwd':'xl123','sex':2}");
    JavaRDD<String> datasRDD = sc.parallelize(datas);
    JavaRDD<User> mapRDD = datasRDD.map(new Function<String, User>() {
        public User call(String v) throws Exception {
            Gson gson = new Gson();
            return gson.fromJson(v, User.class);
        }
    });
    mapRDD.foreach(new VoidFunction<User>() {
        public void call(User user) throws Exception {
            System.out.println("id: " + user.id + " name: " + user.name + " pwd: " + user.pwd + " sex: " + user.sex);
        }
    });
    sc.close();
}
// Results:
// id:1 name:xl1 pwd:xl123 sex:2
// id:2 name:xl2 pwd:xl123 sex:1
// id:3 name:xl3 pwd:xl123 sex:2
2.1.3 Scala Sample
private def map(): Unit = {
  val conf = new SparkConf().setAppName(ScalaOperatorDemo.getClass.getSimpleName).setMaster("local")
  val sc = new SparkContext(conf)
  val datas: Array[String] = Array(
    "{'id':1,'name':'xl1','pwd':'xl123','sex':2}",
    "{'id':2,'name':'xl2','pwd':'xl123','sex':1}",
    "{'id':3,'name':'xl3','pwd':'xl123','sex':2}")
  sc.parallelize(datas)
    .map(v => new Gson().fromJson(v, classOf[User]))
    .foreach(user => println("id: " + user.id
      + " name: " + user.name
      + " pwd: " + user.pwd
      + " sex: " + user.sex))
}
2.2 Filter
2.2.1 Overview
Syntax (Scala):
def filter(f: T => Boolean): RDD[T]
Description
Filters the elements: the function f is applied to each element, elements for which f returns true are kept in the new RDD, and elements for which f returns false are filtered out.

2.2.2 Java Example
static void filter() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Integer> datas = Arrays.asList(1, 2, 3, 7, 4, 5, 8);
    JavaRDD<Integer> rddData = sc.parallelize(datas);
    JavaRDD<Integer> filterRDD = rddData.filter(
            // JDK 1.8: v1 -> v1 >= 3
            new Function<Integer, Boolean>() {
                public Boolean call(Integer v) throws Exception {
                    return v >= 3;
                }
            }
    );
    filterRDD.foreach(
            // JDK 1.8: v -> System.out.println(v)
            new VoidFunction<Integer>() {
                @Override
                public void call(Integer integer) throws Exception {
                    System.out.println(integer);
                }
            }
    );
    sc.close();
}
// Results: 3 7 4 5 8
2.2.3 Scala Sample
def filter(): Unit = {
  val conf = new SparkConf().setAppName(ScalaOperatorDemo.getClass.getSimpleName).setMaster("local")
  val sc = new SparkContext(conf)
  val datas = Array(1, 2, 3, 7, 4, 5, 8)
  sc.parallelize(datas)
    .filter(v => v >= 3)
    .foreach(println)
}
2.3 FlatMap
2.3.1 Overview
Syntax (Scala):
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
Description
Similar to map, but each input element of the RDD can produce 0 or more output elements.
2.3.2 Java Sample
static void flatMap() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> data = Arrays.asList("aa,bb,cc", "cxf,spring,struts2", "java,c++,javascript");
    JavaRDD<String> rddData = sc.parallelize(data);
    JavaRDD<String> flatMapData = rddData.flatMap(v -> Arrays.asList(v.split(",")).iterator()
            // Pre-JDK-1.8 equivalent:
            // new FlatMapFunction<String, String>() {
            //     @Override
            //     public Iterator<String> call(String t) throws Exception {
            //         List<String> list = Arrays.asList(t.split(","));
            //         return list.iterator();
            //     }
            // }
    );
    flatMapData.foreach(v -> System.out.println(v));
    sc.close();
}
// Results: aa bb cc cxf spring struts2 java c++ javascript
2.3.3 Scala Sample
// datas: the comma-separated strings defined in the Java example above
sc.parallelize(datas)
  .flatMap(line => line.split(","))
  .foreach(println)
2.4 MapPartitions
2.4.1 Overview
Syntax (Scala):
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]
Description
Similar to map, but while the function passed to map is applied to each element of the RDD, the function passed to mapPartitions is applied to an entire partition at a time. The type of f is therefore Iterator[T] => Iterator[U], where T is the element type of the input RDD. preservesPartitioning indicates whether the input partitioner should be preserved; it defaults to false.

2.4.2 Java Example
static void mapPartitions() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> names = Arrays.asList("Zhang 31", "Li 41", "Wang 51", "Zhang 32", "Li 42",
            "Wang 52", "Zhang 33", "Li 43", "Wang 53", "Zhang 34");
    JavaRDD<String> namesRDD = sc.parallelize(names, 3);
    JavaRDD<String> mapPartitionsRDD = namesRDD.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
        int count = 0;
        @Override
        public Iterator<String> call(Iterator<String> stringIterator) throws Exception {
            List<String> list = new ArrayList<String>();
            while (stringIterator.hasNext()) {
                list.add("Partition index: " + count++ + "\t" + stringIterator.next());
            }
            return list.iterator();
        }
    });
    // Fetch the data from the cluster into local memory
    List<String> result = mapPartitionsRDD.collect();
    result.forEach(System.out::println);
    sc.close();
}
// Results (the counter restarts in each partition):
// Partition index: 0  Zhang 31    Partition index: 1  Li 41    Partition index: 2  Wang 51
// Partition index: 0  Zhang 32    Partition index: 1  Li 42    Partition index: 2  Wang 52
// Partition index: 0  Zhang 33    Partition index: 1  Li 43    Partition index: 2  Wang 53    Partition index: 3  Zhang 34
2.4.3 Scala Sample
// datas: the list of names defined in the Java example above
// requires: import scala.collection.mutable.ArrayBuffer
sc.parallelize(datas, 3)
  .mapPartitions(
    n => {
      val result = ArrayBuffer[String]()
      while (n.hasNext) {
        result.append(n.next())
      }
      result.iterator
    }
  )
  .foreach(println)
2.5 MapPartitionsWithIndex
2.5.1 Overview
Syntax (Scala):
def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]
Description
Similar to mapPartitions, but the function additionally receives an integer representing the index of the partition, so the type of f is (Int, Iterator[T]) => Iterator[U].

2.5.2 Java Example
private static void mapPartitionsWithIndex() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> names = Arrays.asList("Zhang 31", "Li 41", "Wang 51", "Zhang 32", "Li 42",
            "Wang 52", "Zhang 33", "Li 43", "Wang 53", "Zhang 34");
    // Initialize with 3 partitions
    JavaRDD<String> namesRDD = sc.parallelize(names, 3);
    JavaRDD<String> mapPartitionsWithIndexRDD = namesRDD.mapPartitionsWithIndex(
            new Function2<Integer, Iterator<String>, Iterator<String>>() {
                private static final long serialVersionUID = 1L;
                public Iterator<String> call(Integer v1, Iterator<String> v2) throws Exception {
                    List<String> list = new ArrayList<String>();
                    while (v2.hasNext()) {
                        list.add("Partition index: " + v1 + "\t" + v2.next());
                    }
                    return list.iterator();
                }
            }, true);
    // Fetch the data from the cluster into local memory
    List<String> result = mapPartitionsWithIndexRDD.collect();
    result.forEach(System.out::println);
    sc.close();
}
// Results:
// Partition index: 0  Zhang 31    Partition index: 0  Li 41    Partition index: 0  Wang 51
// Partition index: 1  Zhang 32    Partition index: 1  Li 42    Partition index: 1  Wang 52
// Partition index: 2  Zhang 33    Partition index: 2  Li 43    Partition index: 2  Wang 53    Partition index: 2  Zhang 34
2.5.3 Scala Sample
// datas: the list of names defined in the Java example above
// requires: import scala.collection.mutable.ArrayBuffer
sc.parallelize(datas, 3)
  .mapPartitionsWithIndex(
    (m, n) => {
      val result = ArrayBuffer[String]()
      while (n.hasNext) {
        result.append("Partition index: " + m + "\t" + n.next())
      }
      result.iterator
    }
  )
  .foreach(println)
2.6 Sample
2.6.1 Overview
Syntax (Scala):
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T]
Description
Samples the RDD. withReplacement = true means sampling with replacement (an element can be sampled more than once); false means sampling without replacement. fraction is the sampling fraction, and seed is the random seed, for example the current timestamp.

2.6.2 Java Example
static void sample() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Integer> datas = Arrays.asList(1, 2, 3, 7, 4, 5, 8);
    JavaRDD<Integer> dataRDD = sc.parallelize(datas);
    JavaRDD<Integer> sampleRDD = dataRDD.sample(false, 0.5, System.currentTimeMillis());
    sampleRDD.foreach(v -> System.out.println(v));
    sc.close();
}
Results (the output varies with the random seed):
7
4
5
2.6.3 Scala Sample
sc.parallelize(datas)
  .sample(withReplacement = false, fraction = 0.5, seed = System.currentTimeMillis)
  .foreach(println)
2.7 Union
2.7.1 Overview
Syntax (Scala):
def union(other: RDD[T]): RDD[T]
Description
Merges two RDDs without removing duplicates; the two RDDs must have the same element type.

2.7.2 Java Example
static void union() {
    SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> datas1 = Arrays.asList