Spark map, mapPartitions, flatMap, and flatMapToPair: introduction, differences, and examples

Source: Internet
Author: User

Background:

The author has only recently started Spark development and is not yet very familiar with the API; the usage of the four methods above was often unclear, so this article was written as a reference.

If you have a different opinion, please feel free to leave a comment.


The test scenario imitates splitting sentences into words (splitting on spaces; this is the step before word-frequency counting).


Maven dependencies:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>


Method Introduction

map (not recommended here):

The map function applies the given operation to each input element and returns exactly one object per input.

Example code:

JavaRDD<String[]> mapResult = linesRdd.map(new Function<String, String[]>() {
    @Override
    public String[] call(String s) throws Exception {
        return s.split(" ");
    }
});


mapPartitions:

The mapPartitions function operates on all the data in a partition at once and returns an iterator over the output objects.

mapPartitions is recommended over map, for the following reasons:

Advantage 1:

For operations that need initialization (opening a database connection, for example), map may require one initialization per element, while mapPartitions needs only one initialization per partition, so resources are used more efficiently.

Advantage 2:

mapPartitions makes it easy to filter the returned results (for example, dropping bad data), which is harder to achieve with map.

mapPartitions can also achieve the same effect as flatMap (although the underlying implementation may differ); see below.
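The two advantages above can be illustrated without a Spark cluster. The following is a minimal plain-Java sketch of what a mapPartitions function body does: one initialization for the whole partition (a counter stands in for something expensive like a connection), plus per-record filtering. The class and method names (`PartitionSketch`, `processPartition`) are made up for illustration and are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of the work a mapPartitions function body does: one "expensive"
// initialization per partition, plus the option to drop bad records entirely.
public class PartitionSketch {

    static int initCount = 0; // counts how many times we "initialized"

    // Processes one partition's worth of lines, splitting each into words
    // and silently skipping empty lines (bad-data filtering).
    static List<String[]> processPartition(Iterator<String> lines) {
        initCount++; // one-time setup for the whole partition, not per element
        List<String[]> out = new ArrayList<>();
        while (lines.hasNext()) {
            String line = lines.next();
            if (line == null || line.trim().isEmpty()) {
                continue; // filtering: skip instead of emitting a bad record
            }
            out.add(line.split(" "));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> partition = Arrays.asList("You Were a Bad man", "", "Just a Test Job");
        List<String[]> result = processPartition(partition.iterator());
        System.out.println("initializations: " + initCount); // 1, not 3
        System.out.println("records kept: " + result.size()); // 2, empty line dropped
    }
}
```

With map, the equivalent per-element function would have run the initialization three times and would have had to emit something even for the empty line.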

Example code:

JavaRDD<String[]> mapPartitionsResult = linesRdd.mapPartitions(new FlatMapFunction<Iterator<String>, String[]>() {
    @Override
    public Iterator<String[]> call(Iterator<String> stringIterator) throws Exception {
        List<String[]> resultArr = new ArrayList<>();
        while (stringIterator.hasNext()) {
            String line = stringIterator.next();
            String[] tmpResult = line.split(" ");
            resultArr.add(tmpResult);
        }
        return resultArr.iterator();
    }
});


flatMap:

The flatMap function combines two operations; it "maps first, then flattens":

Step 1: the same as map: apply the given operation to each input element, returning one object per input.

Step 2: flatten all of the resulting collections into a single one.

The main difference between flatMap and map:

map converts one input record into exactly one output record.

flatMap converts one input record into a collection of records (returned as an iterator). It is mainly used to turn one record into many, such as splitting each line of an article into words and returning all the words in every line.
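The same one-to-one versus one-to-many distinction exists in plain java.util.stream, which makes it easy to see without a Spark cluster. This is a small illustrative sketch, not Spark code; the class and method names are made up:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// The map-vs-flatMap distinction with plain java.util.stream, mirroring the
// Spark semantics: map keeps one output element per input element, while
// flatMap flattens each per-input collection into a single stream of elements.
public class MapVsFlatMap {

    // map: 2 input lines -> 2 output elements (each a String[] of words)
    static List<String[]> mapSplit(List<String> lines) {
        return lines.stream()
                .map(l -> l.split(" "))
                .collect(Collectors.toList());
    }

    // flatMap: 2 input lines -> one flat list containing every word
    static List<String> flatMapSplit(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("You Were a Bad man", "Just a Test Job");
        System.out.println(mapSplit(lines).size());     // 2: one array per line
        System.out.println(flatMapSplit(lines).size()); // 9: all words, flattened
    }
}
```

The element counts make the difference concrete: map preserves the number of records (2 in, 2 out), while flatMap's output size depends on how many items each input produces (2 in, 9 out).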

Example code:

JavaRDD<String> flatMapResult = linesRdd.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterator<String> call(String s) throws Exception {
        return Arrays.asList(s.split(" ")).iterator();
    }
});

In fact, as noted above, mapPartitions can achieve the same effect as flatMap (although the underlying implementation may differ).

Suppose the task is to split each sentence on " " (a single space) and return the list of words:

System.out.println("mapPartitions simulating flatMap");
JavaRDD<String> mapPartitionLikeFlatMapResult = linesRdd.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterator<String> call(Iterator<String> stringIterator) throws Exception {
        List<String> resultList = new ArrayList<>();
        while (stringIterator.hasNext()) {
            String tmpLine = stringIterator.next();
            String[] tmpWords = tmpLine.split(" ");
            for (String tmpString : tmpWords) {
                resultList.add(tmpString);
            }
        }
        return resultList.iterator();
    }
});
mapPartitionLikeFlatMapResult.foreach(new VoidFunction<String>() {
    @Override
    public void call(String s) throws Exception {
        System.out.println(s);
    }
});
System.out.println("\n");


flatMapToPair:

flatMapToPair builds on flatMap by converting each returned element into a 2-tuple, i.e. key-value data. This makes subsequent per-key operations, such as counting occurrences of the same key, convenient.
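To see why the (key, value) shape is convenient, here is a plain-Java sketch of the two stages: emitting (word, 1) pairs as flatMapToPair does, then summing the values per key, which is the aggregation Spark's reduceByKey would perform on such a pair RDD. The class and method names (`WordCountSketch`, `toPairs`, `reduceByKey`) are made up for illustration; this is not the Spark API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the flatMapToPair -> reduceByKey word-count pattern in plain Java:
// emit one (word, 1) pair per word, then sum the values for identical keys.
public class WordCountSketch {

    // Stage 1 (flatMapToPair analog): one (word, 1) entry per word in any line.
    static List<Map.Entry<String, Integer>> toPairs(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.split(" ")))
                .<Map.Entry<String, Integer>>map(w -> new SimpleEntry<>(w, 1))
                .collect(Collectors.toList());
    }

    // Stage 2 (reduceByKey analog): sum the 1s for each distinct key.
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("You Were a Bad man", "Just a Test Job");
        Map<String, Integer> counts = reduceByKey(toPairs(lines));
        System.out.println(counts.get("a")); // 2: "a" appears once in each line
    }
}
```

In Spark, the same second stage would be a single call on the JavaPairRDD, e.g. flatMapToPairResult.reduceByKey((a, b) -> a + b).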

Example code:

JavaPairRDD<String, Integer> flatMapToPairResult = linesRdd.flatMapToPair(
        new PairFlatMapFunction<String, String, Integer>() {
            @Override
            public Iterator<Tuple2<String, Integer>> call(String s) throws Exception {
                List<Tuple2<String, Integer>> resultTuple = new ArrayList<>();
                String[] tmpList = s.split(" ");
                for (String tmpString : tmpList) {
                    resultTuple.add(new Tuple2<>(tmpString, 1));
                }
                return resultTuple.iterator();
            }
        });


The complete sample code:

package com.spark.test.batch.job;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/**
 * Created by Szh on 2018/5/2.
 *
 * @author Szh
 * @date 2018/5/2
 */
public class MultiMapCompare {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("MultiMapCompare").setMaster("local[2]");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        sparkContext.setLogLevel("ERROR");

        List<String> linesList = new ArrayList<>();
        linesList.add("You Were a Bad man");
        linesList.add("Just a Test Job");

        JavaRDD<String> linesRdd = sparkContext.parallelize(linesList);

        System.out.println("map Result");
        JavaRDD<String[]> mapResult = linesRdd.map(new Function<String, String[]>() {
            @Override
            public String[] call(String s) throws Exception {
                return s.split(" ");
            }
        });
        mapResult.foreach(new VoidFunction<String[]>() {
            @Override
            public void call(String[] strings) throws Exception {
                for (String tmp : strings) {
                    System.out.println(tmp);
                }
            }
        });

        System.out.println("\n");
        System.out.println("mapPartitions Result");
        JavaRDD<String[]> mapPartitionsResult = linesRdd.mapPartitions(new FlatMapFunction<Iterator<String>, String[]>() {
            @Override
            public Iterator<String[]> call(Iterator<String> stringIterator) throws Exception {
                List<String[]> resultArr = new ArrayList<>();
                while (stringIterator.hasNext()) {
                    String line = stringIterator.next();
                    String[] tmpResult = line.split(" ");
                    resultArr.add(tmpResult);
                }
                return resultArr.iterator();
            }
        });
        mapPartitionsResult.foreach(new VoidFunction<String[]>() {
            @Override
            public void call(String[] strings) throws Exception {
                for (String tmp : strings) {
                    System.out.println(tmp);
                }
            }
        });

        System.out.println("\n");
        System.out.println("flatMap Result");
        JavaRDD<String> flatMapResult = linesRdd.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String s) throws Exception {
                return Arrays.asList(s.split(" ")).iterator();
            }
        });
        flatMapResult.foreach(new VoidFunction<String>() {
            @Override
            public void call(String s) throws Exception {
                System.out.println(s);
            }
        });

        System.out.println("\n");
        System.out.println("flatMapToPair Result");
        JavaPairRDD<String, Integer> flatMapToPairResult = linesRdd.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
            @Override
            public Iterator<Tuple2<String, Integer>> call(String s) throws Exception {
                List<Tuple2<String, Integer>> resultTuple = new ArrayList<>();
                String[] tmpList = s.split(" ");
                for (String tmpString : tmpList) {
                    resultTuple.add(new Tuple2<>(tmpString, 1));
                }
                return resultTuple.iterator();
            }
        });
        flatMapToPairResult.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                System.out.println(stringIntegerTuple2);
            }
        });

        System.out.println("\n");
        sparkContext.close();
    }
}

========

Operation Result:

(With local[2], the two input lines land in different partitions, so the relative order of the two groups of words may vary between runs.)

map Result
Just
a
Test
Job
You
Were
a
Bad
man

mapPartitions Result
Just
a
Test
Job
You
Were
a
Bad
man

flatMap Result
Just
a
Test
Job
You
Were
a
Bad
man

flatMapToPair Result
(Just,1)
(a,1)
(Test,1)
(Job,1)
(You,1)
(Were,1)
(a,1)
(Bad,1)
(man,1)





