Topic:
A file is approximately 100 GB in size, and each line of the file contains a single number. Sort all the numbers in the file.
Anyone who has studied Hadoop can smile knowingly at this one; even with Spark it is a very simple thing to accomplish.
Let's start with Hadoop. There is really not much to say: the map tasks read the numbers line by line, and the reducer simply writes them back out. It is almost embarrassingly simple.
Take a look at the code:
package com.zhyea.dev;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class NumberSort {

    public static class SplitterMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        private static final IntWritable intWritable = new IntWritable();

        @Override
        public void map(Object key, Text value, Context context) {
            try {
                // Parse the line as a number and emit it as both key and value;
                // the shuffle will sort the records by key.
                int num = Integer.valueOf(value.toString());
                intWritable.set(num);
                context.write(intWritable, intWritable);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static class IntegrateReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

        @Override
        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
            try {
                // Write the key once per occurrence so duplicate numbers are not lost.
                for (IntWritable ignored : values) {
                    context.write(key, key);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Number-Sort");
        job.setJarByClass(NumberSort.class);
        job.setMapperClass(SplitterMapper.class);
        job.setReducerClass(IntegrateReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
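Assuming the class is packaged into a jar (the jar name and paths here are illustrative), the job would be submitted along these lines:

hadoop jar number-sort.jar com.zhyea.dev.NumberSort /data/numbers /data/numbers-sorted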
For the value part of the map output I chose IntWritable. The value type could also be set to NullWritable, but that makes the map tasks slower to execute; the reduce tasks do run faster, yet on balance it is not worth the trade.
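For reference, a minimal sketch of that NullWritable variant (my reconstruction, not code from the original post); it would replace SplitterMapper, together with a job.setMapOutputValueClass(NullWritable.class) call and a matching change to the reducer's input value type:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Variant whose map output value is NullWritable: the shuffled records
// carry only the key, so there is less data to move during the shuffle.
public class NullValueMapper extends Mapper<Object, Text, IntWritable, NullWritable> {

    private static final IntWritable outKey = new IntWritable();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        outKey.set(Integer.parseInt(value.toString().trim()));
        context.write(outKey, NullWritable.get());
    }
}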
There is no explicit sorting anywhere in our program, yet the output comes out ordered, because the sorting is already completed during the shuffle phase (a quicksort on the map side, then a merge sort on the reduce side).
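One caveat worth spelling out: each reducer's output file is sorted internally, but with the default HashPartitioner the files are not ordered relative to one another, so a job with several reducers does not by itself yield one globally sorted result. A sketch of the usual fix (the reducer count and partition-file path are assumptions, not part of the original post) is to range-partition the keys with Hadoop's TotalOrderPartitioner, slotted into main() before the job is submitted:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Assumes a partition file of IntWritable split points has already been
// written to /tmp/partitions (e.g. by a pre-sampling pass over the data);
// with 4 reducers it holds the 3 keys that split the key space into 4 ranges.
job.setNumReduceTasks(4);
job.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("/tmp/partitions"));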
Now take a look at how the same thing is done in Spark:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object NumSortJob {

  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)
    val conf = new SparkConf().setAppName("Num Sort")
    val sc = new SparkContext(conf)
    // Read the lines, parse each one as an integer, sort ascending,
    // and write the sorted result back out.
    sc.hadoopFile[LongWritable, Text, TextInputFormat](inputPath)
      .map(p => p._2.toString.trim.toInt)
      .sortBy(n => n, ascending = true)
      .saveAsTextFile(outputPath)
  }
}
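Assuming the object is built into a jar (the jar name, paths, and master are illustrative), it would be launched with spark-submit:

spark-submit --class NumSortJob --master yarn num-sort.jar hdfs:///data/numbers hdfs:///data/numbers-sorted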
Spark, by contrast, has to sort explicitly, which is why the code above calls sortBy. Even if sort-based shuffle is chosen, its sorting happens only on the map side, and the final result set is not necessarily ordered.
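To make that concrete, here is a small self-contained sketch using Spark's Java API (a local master and made-up data, purely illustrative): an explicit sortBy is what produces a totally ordered result, because it range-partitions the data and then sorts within each partition.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SortByDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SortBy Demo").setMaster("local[2]");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> nums = jsc.parallelize(Arrays.asList(5, 3, 9, 1, 7), 2);
            // sortBy shuffles the data into range partitions and sorts each one,
            // so reading the partitions in order gives a globally sorted sequence.
            JavaRDD<Integer> sorted = nums.sortBy(n -> n, true, 2);
            System.out.println(sorted.collect()); // [1, 3, 5, 7, 9]
        }
    }
}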
#########
Twelve MapReduce Exercises –1– Sorting