Spark 2.x In Depth, Series 6: The RDD Java API Explained (Part 1)


This article walks through the Java API for the three ways to create an RDD, the basic transformation API for single-type RDDs, the sampling API, and the pipe operation.


One. Three ways to create an RDD

    1. Create an RDD from a stable file storage system, such as the local file system or HDFS, as follows:

// Create an RDD from a file in HDFS
JavaRDD<String> textFileRDD = sc.textFile("hdfs://master:9999/users/hadoop-twq/word.txt");

// To read a file from the local file system, note that "file:" must be followed by at least
// three slashes (file:///...); four also work, but two are not enough.

// You can pass a second parameter, the minimum number of partitions of the resulting RDD;
// if the file has more blocks than the requested number of partitions, the number of blocks wins.
JavaRDD<String> textFileMinPartitionsRDD = sc.textFile("hdfs://master:9999/users/hadoop-twq/word.txt", 2);

2. Create a new RDD from an already existing RDD through a transformation API; here is the map transformation:

JavaRDD<String> mapRDD = textFileRDD.map(new Function<String, String>() {
    @Override
    public String call(String s) throws Exception {
        return s + "test";
    }
});
System.out.println("mapRDD = " + mapRDD.collect());

3. Create an RDD from an in-memory list. You can specify the number of partitions for the RDD; if you do not, the default is the total number of cores across all executors.

// Create a single-type JavaRDD
JavaRDD<Integer> integerJavaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 3, 4), 2);
System.out.println("integerJavaRDD = " + integerJavaRDD.glom().collect());

// Create a single-type JavaDoubleRDD of type double
JavaDoubleRDD doubleJavaDoubleRDD = sc.parallelizeDoubles(Arrays.asList(2.0, 3.3, 5.6));
System.out.println("doubleJavaDoubleRDD = " + doubleJavaDoubleRDD.collect());

// Create a key-value RDD
import scala.Tuple2;
JavaPairRDD<String, Integer> javaPairRDD = sc.parallelizePairs(Arrays.asList(new Tuple2<>("test", 3), new Tuple2<>("kkk", 3)));
System.out.println("javaPairRDD = " + javaPairRDD.collect());
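If you leave the partition count out, you can check which default was picked. The following is a minimal sketch, not from the original article, reusing the same JavaSparkContext sc; the exact default comes from spark.default.parallelism (for example N when running local[N]):

// Sketch: with no explicit partition count, the default comes from spark.default.parallelism,
// which is the number of cores for local[N] and typically the total executor cores on a cluster
JavaRDD<Integer> defaultPartitionRDD = sc.parallelize(Arrays.asList(1, 2, 3, 3, 4));
System.out.println("default partitions = " + defaultPartitionRDD.getNumPartitions());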


Note: for the third case, the Scala API additionally provides makeRDD, which lets you specify the preferred machine for each partition of the RDD; the principle behind this API is described in the Spark core RDD Scala API article.


Two. Basic transformation API for single-type RDDs

First, create an RDD from in-memory data:

JavaRDD<Integer> integerJavaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 3), 2);
    1. The map operation applies our custom function to each element of integerJavaRDD; here it adds 1 to each element, as follows:

JavaRDD<Integer> mapRDD = integerJavaRDD.map(new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer element) throws Exception {
        return element + 1;
    }
});
// Result: [2, 3, 4, 4]
System.out.println("mapRDD = " + mapRDD.collect());
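On Java 8 the same map can also be written as a lambda; a small equivalent sketch (not in the original article):

// Lambda form of the map above: add 1 to each element
JavaRDD<Integer> mapLambdaRDD = integerJavaRDD.map(element -> element + 1);
// Result: [2, 3, 4, 4]
System.out.println("mapLambdaRDD = " + mapLambdaRDD.collect());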

It is important to note that map can return an RDD of a different element type; the following example returns a custom User object:

public class User implements Serializable {
    private String userId;
    private Integer amount;

    public User(String userId, Integer amount) {
        this.userId = userId;
        this.amount = amount;
    }

    // getters and setters ...

    @Override
    public String toString() {
        return "User{" +
                "userId='" + userId + '\'' +
                ", amount=" + amount +
                '}';
    }
}

JavaRDD<User> userJavaRDD = integerJavaRDD.map(new Function<Integer, User>() {
    @Override
    public User call(Integer element) throws Exception {
        if (element < 3) {
            return new User("less than 3", 22);
        } else {
            return new User("greater than 3", 23);
        }
    }
});
// Result: [User{userId='less than 3', amount=22}, User{userId='less than 3', amount=22},
//          User{userId='greater than 3', amount=23}, User{userId='greater than 3', amount=23}]
System.out.println("userJavaRDD = " + userJavaRDD.collect());

2. The flatMap operation applies our custom FlatMapFunction to each element of integerJavaRDD. For each element the function returns a list of values, and flatMap flattens those lists into a single RDD:

JavaRDD<Integer> flatMapJavaRDD = integerJavaRDD.flatMap(new FlatMapFunction<Integer, Integer>() {
    @Override
    public Iterator<Integer> call(Integer element) throws Exception {
        // Output a list whose elements are 0 to element
        List<Integer> list = new ArrayList<>();
        int i = 0;
        while (i <= element) {
            list.add(i);
            i++;
        }
        return list.iterator();
    }
});
// Result: [0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3]
System.out.println("flatMapJavaRDD = " + flatMapJavaRDD.collect());

3. The filter operation applies our custom predicate to each element of integerJavaRDD and keeps only the elements for which it returns true; below, the elements equal to 1 are filtered out:

JavaRDD<Integer> filterJavaRDD = integerJavaRDD.filter(new Function<Integer, Boolean>() {
    @Override
    public Boolean call(Integer integer) throws Exception {
        return integer != 1;
    }
});
// Result: [2, 3, 3]
System.out.println("filterJavaRDD = " + filterJavaRDD.collect());

4. The glom operation shows which elements belong to each partition of integerJavaRDD:

JavaRDD<List<Integer>> glomRDD = integerJavaRDD.glom();
// Result: [[1, 2], [3, 3]], which shows that integerJavaRDD has two partitions:
// the first partition holds 1 and 2, the second holds 3 and 3
System.out.println("glomRDD = " + glomRDD.collect());


5. The mapPartitions operation applies our custom function to the data of each partition of integerJavaRDD. Suppose we need to add an initial value to every element and obtaining that initial value is very time consuming; in that case mapPartitions has a big advantage over map, as follows:

// An expensive method that fetches the initial value
public static Integer getInitNumber(String source) {
    System.out.println("get init number from " + source + ", may take much time...");
    try {
        TimeUnit.SECONDS.sleep(2);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    return 1;
}

JavaRDD<Integer> mapPartitionTestRDD = integerJavaRDD.mapPartitions(new FlatMapFunction<Iterator<Integer>, Integer>() {
    @Override
    public Iterator<Integer> call(Iterator<Integer> integerIterator) throws Exception {
        // The initial value is fetched once per partition; integerJavaRDD has two partitions,
        // so getInitNumber is called only twice.
        // For expensive initialization, such as opening a database connection, prefer
        // mapPartitions (initialize once per partition) over map.
        Integer initNumber = getInitNumber("mapPartitions");
        List<Integer> list = new ArrayList<>();
        while (integerIterator.hasNext()) {
            list.add(integerIterator.next() + initNumber);
        }
        return list.iterator();
    }
});
// Result: [2, 3, 4, 4]
System.out.println("mapPartitionTestRDD = " + mapPartitionTestRDD.collect());

JavaRDD<Integer> mapInitNumberRDD = integerJavaRDD.map(new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer integer) throws Exception {
        // The initial value is fetched for every element; integerJavaRDD has 4 elements,
        // so getInitNumber is called 4 times, which performs much worse than mapPartitions
        Integer initNumber = getInitNumber("map");
        return integer + initNumber;
    }
});
// Result: [2, 3, 4, 4]
System.out.println("mapInitNumberRDD = " + mapInitNumberRDD.collect());

6. The mapPartitionsWithIndex operation applies our custom function to the data of each partition of integerJavaRDD and also passes in the partition index, so inside the function you know which partition you are currently processing:

JavaRDD<Integer> mapPartitionWithIndex = integerJavaRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<Integer>, Iterator<Integer>>() {
    @Override
    public Iterator<Integer> call(Integer partitionId, Iterator<Integer> integerIterator) throws Exception {
        // partitionId is the index of the partition currently being processed
        System.out.println("partition id = " + partitionId);
        List<Integer> list = new ArrayList<>();
        while (integerIterator.hasNext()) {
            list.add(integerIterator.next() + partitionId);
        }
        return list.iterator();
    }
}, false);
// Result: [1, 2, 4, 4]
System.out.println("mapPartitionWithIndex = " + mapPartitionWithIndex.collect());

Three. Sampling API

First, create an RDD from in-memory data:

JavaRDD<Integer> listRDD = sc.parallelize(Arrays.asList(1, 2, 3, 3), 2);
    1. sample

// The first parameter is withReplacement:
//   withReplacement = true  means sampling with replacement, implemented with a Poisson sampler
//   withReplacement = false means sampling without replacement, implemented with a Bernoulli sampler
// The second parameter, fraction, is the probability that each element is selected as a sample,
// not the fraction of the data to extract. For example, when sampling from 100 elements with
// fraction = 0.2, you do not get exactly 100 * 0.2 = 20 elements; each element is selected with
// probability 0.2, so the sample size is not fixed but follows a binomial distribution.
//   When withReplacement = true,  fraction >= 0
//   When withReplacement = false, 0 < fraction < 1
// The third parameter, seed, is the seed for the random number generator; a random seed is
// derived from it for each partition of the RDD.
JavaRDD<Integer> sampleRDD = listRDD.sample(false, 0.5, 100);
// Result: [1, 3]
System.out.println("sampleRDD = " + sampleRDD.collect());
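Because fraction is a per-element probability rather than an exact proportion, the sample size varies from run to run when no seed is fixed. A small sketch illustrating this (not from the original article):

// Sketch: with fraction = 0.5, each element is kept with probability 0.5,
// so the number of sampled elements can differ between runs
for (int i = 0; i < 3; i++) {
    System.out.println("run " + i + " sample size = " + listRDD.sample(false, 0.5).collect().size());
}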

2. randomSplit

// randomSplit splits the RDD by random sampling according to the given weights:
// as many weights as you pass, that many RDDs come back.
// The random sampling is implemented with a Bernoulli sampler.
// Below there are two weights, so the RDD is split into two RDDs.
JavaRDD<Integer>[] splitRDDs = listRDD.randomSplit(new double[]{0.4, 0.6});
// Result: 2
System.out.println("splitRDDs.length = " + splitRDDs.length);
// Result: [2, 3] (the exact result is not deterministic)
System.out.println("splitRDD(0) = " + splitRDDs[0].collect());
// Result: [1, 3] (the exact result is not deterministic)
System.out.println("splitRDD(1) = " + splitRDDs[1].collect());
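A typical use of randomSplit is splitting a data set into training and test parts. A minimal sketch (the 0.8/0.2 weights and the seed 42 are illustrative, not from the original article):

// Sketch: an approximate 80/20 training/test split, with a fixed seed for reproducibility
JavaRDD<Integer>[] splitParts = listRDD.randomSplit(new double[]{0.8, 0.2}, 42L);
JavaRDD<Integer> trainingRDD = splitParts[0];
JavaRDD<Integer> testRDD = splitParts[1];
System.out.println("training = " + trainingRDD.collect() + ", test = " + testRDD.collect());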

3. takeSample

// takeSample randomly samples a specified number of elements.
// The first parameter is withReplacement:
//   withReplacement = true  means sampling with replacement, implemented with a Poisson sampler
//   withReplacement = false means sampling without replacement, implemented with a Bernoulli sampler
// The second parameter specifies how many elements to sample; exactly that many are returned.
// Result: [2, 3]
System.out.println(listRDD.takeSample(false, 2));

4. Stratified sampling, which samples a key-value RDD:

// Create a key-value RDD
import scala.Tuple2;
JavaPairRDD<String, Integer> javaPairRDD =
        sc.parallelizePairs(Arrays.asList(new Tuple2<>("test", 3),
                new Tuple2<>("kkk", 3), new Tuple2<>("kkk", 3)));

// Define a sampling fraction for every key
Map<String, Double> fractions = new HashMap<>();
fractions.put("test", 0.5);
fractions.put("kkk", 0.4);

// Sample each key
// Result: [(test,3), (kkk,3)]
// sampleByKey does not pass over the full data set, so the per-key sample sizes are only approximate
System.out.println(javaPairRDD.sampleByKey(true, fractions).collect());
// Result: [(test,3), (kkk,3)]
// sampleByKeyExact samples over the full data set, which consumes more computing resources,
// but the per-key sample sizes are exact
System.out.println(javaPairRDD.sampleByKeyExact(true, fractions).collect());

The principles behind these sampling algorithms are covered in detail in the Spark core RDD API article; they are hard to convey well in prose alone.

Four. The pipe operation, which runs an external script, such as a Python or shell script, as a step in the RDD processing flow:

JavaRDD<String> dataRDD = sc.parallelize(Arrays.asList("hi", "hello", "how", "are", "you"), 2);

// Environment variables needed by echo.py
Map<String, String> env = new HashMap<>();
env.put("env", "envtest");

List<String> commands = new ArrayList<>();
commands.add("python");
// On a real Spark cluster, echo.py must exist under the same path on every machine in the cluster
commands.add("/users/tangweiqun/spark/source/spark-course/spark-rdd-java/src/main/resources/echo.py");

JavaRDD<String> result = dataRDD.pipe(commands, env);
// Result: [slave1-hi-envtest, slave1-hello-envtest, slave1-how-envtest, slave1-are-envtest, slave1-you-envtest]
System.out.println(result.collect());

The contents of echo.py are as follows:

import sys
import os

# input = "test"
input = sys.stdin
env_keys = os.environ.keys()
env = ""
if "env" in env_keys:
    env = os.environ["env"]
for ele in input:
    output = "slave1-" + ele.strip('\n') + "-" + env
    print(output)
input.close()

For how pipe works under the hood, refer to the Spark core RDD API article, which also clearly describes how to avoid manually copying the script to every machine.
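One common way to avoid copying the script by hand is to ship it with the job via addFile. The sketch below is not from the original article; it assumes local mode, where the path returned by SparkFiles.get on the driver is also valid for the worker processes (on a real cluster the script is usually distributed with spark-submit --files and invoked by its bare file name instead):

import org.apache.spark.SparkFiles;

// Sketch (assumes local mode): let Spark distribute echo.py instead of copying it manually
sc.addFile("/users/tangweiqun/spark/source/spark-course/spark-rdd-java/src/main/resources/echo.py");

List<String> pipeCommand = new ArrayList<>();
pipeCommand.add("python");
pipeCommand.add(SparkFiles.get("echo.py"));   // local path where Spark placed the shipped file
JavaRDD<String> pipedRDD = dataRDD.pipe(pipeCommand, env);
System.out.println(pipedRDD.collect());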

