Various action operator operations in Spark (Java edition)

Source: Internet
Author: User

In my opinion, an action operator in Spark programming acts like a trigger that sets off the preceding transformation operators. Transformations are lazily evaluated: defining one does not execute it immediately; all of the preceding transformation operators run only when an action operator is executed. The common action operators are demonstrated in the code listing below (Java edition).
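To make the lazy-evaluation point concrete before the full listing, here is a minimal, self-contained sketch (the class and variable names are illustrative, not from the original example): the map transformation by itself submits no job; only the count action does.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class LazyEvaluationSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lazy").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Defining the map transformation does NOT run anything yet.
        JavaRDD<Integer> doubled = sc.parallelize(Arrays.asList(1, 2, 3))
                .map(new Function<Integer, Integer>() {
                    @Override
                    public Integer call(Integer v) throws Exception {
                        return v * 2;
                    }
                });

        // Only this action triggers Spark to actually execute the map above.
        long count = doubled.count();
        System.out.println(count);

        sc.close();
    }
}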
package cn.spark.study.core;

import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

import scala.Tuple2;

/**
* Hands-on examples of action operations
* @author DD
*
*/
public class ActionOperation {

    public static void main(String[] args) {
        reduceTest();
        collectTest();
        countTest();
        takeTest();
        countByKeyTest();
    }

    /**
     * reduce operator
     * Case: accumulate a sum
     */
    private static void reduceTest() {
        SparkConf conf = new SparkConf().setAppName("reduce").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> numberList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> numbersRDD = sc.parallelize(numberList);

        // Use the reduce operation to accumulate the numbers in the collection
        int sum = numbersRDD.reduce(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer arg0, Integer arg1) throws Exception {
                return arg0 + arg1;
            }
        });

        System.out.println(sum);

        sc.close();
    }

    /**
     * collect operator
     * Pulls the data on the cluster to the local driver for traversal (not recommended)
     */
    private static void collectTest() {
        SparkConf conf = new SparkConf().setAppName("collect").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> numberList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> numbersRDD = sc.parallelize(numberList);

        JavaRDD<Integer> doubleNumbers = numbersRDD.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer arg0) throws Exception {
                return arg0 * 2;
            }
        });

        // The foreach action traverses the elements of the RDD on the remote cluster,
        // whereas the collect action pulls the RDD's data from the distributed cluster
        // back to the local driver. collect is generally not recommended: if the RDD
        // holds a lot of data (say, more than 10,000 elements), performance suffers
        // because of the heavy network transfer, and an OOM (out-of-memory) exception
        // may occur. It is therefore recommended to use foreach to process the final RDD.
        List<Integer> doubleNumList = doubleNumbers.collect();
        for (Integer num : doubleNumList) {
            System.out.println(num);
        }

        sc.close();
    }

    /**
     * count operator
     * Counts the number of elements in the RDD
     */
    private static void countTest() {
        SparkConf conf = new SparkConf().setAppName("count").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> numberList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> numbersRDD = sc.parallelize(numberList);

        // Use the count operation to count the number of elements in the RDD
        long count = numbersRDD.count();
        System.out.println(count);

        sc.close();
    }

    /**
     * take operator
     * Pulls the first n elements of the remote RDD to the local driver
     */
    private static void takeTest() {
        SparkConf conf = new SparkConf().setAppName("take").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> numberList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> numbersRDD = sc.parallelize(numberList);

        // The take operation is similar to collect in that it also fetches RDD data
        // from the remote cluster, but while collect fetches all of the RDD's data,
        // take fetches only the first n elements.
        List<Integer> top3Numbers = numbersRDD.take(3);
        for (Integer num : top3Numbers) {
            System.out.println(num);
        }

        sc.close();
    }

    /**
     * saveAsTextFile operator
     */
    private static void saveAsTextFileTest() {
        SparkConf conf = new SparkConf().setAppName("saveAsTextFile");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> numberList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> numbersRDD = sc.parallelize(numberList);

        JavaRDD<Integer> doubleNumbers = numbersRDD.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer arg0) throws Exception {
                return arg0 * 2;
            }
        });

        // The saveAsTextFile operator stores the data in the RDD directly in HDFS.
        // Only the output folder (directory) can be specified; the data is actually
        // saved as part-00000 files under the double_number.txt directory.
        doubleNumbers.saveAsTextFile("hdfs://spark1:9000/double_number.txt");

        sc.close();
    }

    /**
     * countByKey operator
     */
    private static void countByKeyTest() {
        SparkConf conf = new SparkConf().setAppName("countByKey").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Tuple2<String, String>> studentsList = Arrays.asList(
                new Tuple2<String, String>("class1", "leo"),
                new Tuple2<String, String>("class2", "jack"),
                new Tuple2<String, String>("class1", "marry"),
                new Tuple2<String, String>("class2", "tom"),
                new Tuple2<String, String>("class2", "david"));

        JavaPairRDD<String, String> studentsRDD = sc.parallelizePairs(studentsList);

        // The countByKey operator counts the number of elements for each key.
        // countByKey returns a Map<String, Object> directly.
        Map<String, Object> studentsCounts = studentsRDD.countByKey();
        for (Map.Entry<String, Object> studentsCount : studentsCounts.entrySet()) {
            System.out.println(studentsCount.getKey() + ": " + studentsCount.getValue());
        }

        sc.close();
    }

}
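For reference, with the sample data in the listing above, reduceTest prints 55 (the sum of 1 through 10) and countByKeyTest prints class1: 2 and class2: 3.

The comment in collectTest recommends processing a large final RDD with foreach on the cluster instead of pulling everything back to the driver with collect. As a rough sketch of that alternative (a standalone class with illustrative names, not part of the original article):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

public class ForeachSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("foreach").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> doubleNumbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
                .map(new Function<Integer, Integer>() {
                    @Override
                    public Integer call(Integer v) throws Exception {
                        return v * 2;
                    }
                });

        // foreach runs on the executors, so the data is never collected to the driver;
        // on a real cluster the println output goes to the executor logs, while with
        // the "local" master it still shows up on the console.
        doubleNumbers.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer num) throws Exception {
                System.out.println(num);
            }
        });

        sc.close();
    }
}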
