Background:
In the big data field, you sometimes need to generate a test dataset yourself. Because such datasets are large, MapReduce is a natural tool for generating them. In this section, the author (mumuxinfei) draws on practical experience to explain in detail how to write a MapReduce program that generates a test dataset.
Scenario:
Assume that a service in the mobile telecom industry records call information (caller, receiver, call time, and base station). Real user data cannot be provided for testing, but the basic data specification is. The schema is as follows:
num1      varchar(13)  -- mobile phone number (130xxxx ~ 139xxxx)
num2      varchar(13)  -- mobile phone number (130xxxx ~ 139xxxx)
lac       varchar(16)  -- base station information
timestamp varchar(128) -- yyyy-MM-dd HH:mm:ss format
Comments: The distribution of the data along the time dimension is relatively easy to fabricate; in the other dimensions it is still difficult to simulate real user behavior.
Theoretical basis of MapReduce:
1) The principle architecture of MapReduce
Note: The diagram of MapReduce's execution flow (taken from the network) is not reproduced here, and the principle will not be elaborated in detail.
2) The class architecture of MapReduce
For details, see the basic articles on the MapReduce class hierarchy.
Solution Analysis:
After reviewing the basic architecture of MapReduce, we consider the following two schemes for data generation.
1) The traditional MapReduce data-generation scheme (map + reduce).
2) A map-only data-generation scheme (map, no reduce).
What is the difference between the two, and how is it controlled in the job settings?
1) The output of the map stage passes through sort/shuffle on its way to reduce, so data emerging from the reduce stage has a definite ordering, whereas data that stops at the map stage remains in random order. Care to guess which scheme we need? Bingo: if the generated data must follow a particular sort order, the traditional scheme is required; if the data should be random, the second scheme is the better choice.
2) In the job configuration, simply set the number of reduce tasks to 0:
job.setNumReduceTasks(0);
Comments: Isn't that easy? Surprised?... ^_^!
Based on the analysis of our actual case, the test data should be randomly distributed, so we choose scheme 2.
Solution:
The selected scheme works roughly as follows:
Generate the test data in the map stage, customizing the InputFormat rules.
Our goal is a MapReduce program that generates a data file in CSV format, with content organized as follows:
#num1,num2,lac,timestamp
1380001234,13800005678,1,2014-08-27 10:30:00
1380002058,13800005678,1,2014-08-28 11:30:00
1) Custom InputFormat, InputSplit, and RecordReader
The class definition of MyInputSplit is as follows:
// *) Inherit from InputSplit and implement the Writable interface
public static class MyInputSplit extends InputSplit implements Writable {

    private int number;

    // *) A no-argument constructor is required
    public MyInputSplit() {
    }

    public MyInputSplit(int number) {
        this.number = number;
    }

    @Override
    public long getLength() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public String[] getLocations() throws IOException, InterruptedException {
        return new String[] {};
    }

    public int getNumber() {
        return number;
    }

    // *) Deserialization
    public void readFields(DataInput in) throws IOException {
        number = WritableUtils.readVInt(in);
    }

    // *) Serialization
    public void write(DataOutput out) throws IOException {
        WritableUtils.writeVInt(out, number);
    }
}
Comments: MyInputSplit must implement the Writable interface, because the InputSplit is serialized/deserialized during the MapReduce process. At the same time, the InputSplit implementation class must provide a no-argument constructor, because reflection is used to instantiate the object. Please do not ask how I know this; I just want to say, "Let me be a living Lei Feng!"
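To see why the no-argument constructor matters, here is a minimal sketch (for illustration only, not the framework's actual code) of roughly what happens when a split is rebuilt on the task side: the class is instantiated reflectively, then its state is restored through the Writable contract. The serializedBytes parameter is a hypothetical stand-in for the bytes produced by write().

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class SplitRoundTrip {
    // serializedBytes: hypothetical placeholder for the output of MyInputSplit.write()
    static MyInputSplit rebuild(byte[] serializedBytes, Configuration conf) throws IOException {
        // 1) Instantiate reflectively -- this fails without a no-arg constructor
        MyInputSplit split = ReflectionUtils.newInstance(MyInputSplit.class, conf);
        // 2) Restore the split's state via Writable.readFields()
        split.readFields(new DataInputStream(new ByteArrayInputStream(serializedBytes)));
        return split;
    }
}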
The definition of MyRecordReader is as follows:
public static class MyRecordReader extends RecordReader<NullWritable, Text> {

    private int current = 0;
    private int number = 0;
    private Text valueText = new Text();

    // *) Initialization
    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        this.number = ((MyInputSplit) split).getNumber();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (current++ < number) {
            valueText.set(DataGeneratorUtility.generateData());
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return valueText;
    }

    // *) Report progress
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return current * 1.0f / number;
    }

    @Override
    public void close() throws IOException {
    }
}
Comments: MyRecordReader is relatively simple. Because a map task runs in a single thread by default, the stateful functions nextKeyValue(), getCurrentKey(), and getCurrentValue() can safely be used together. (Such a stateful API design... unreasonable, teacher!!!)
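The DataGeneratorUtility.generateData() helper called above is not shown in the original. Here is a minimal sketch of what it might look like, assuming each call fabricates one random CSV line in the num1,num2,lac,timestamp format described earlier; the number prefix, LAC range, and time window are illustrative choices, not part of the original spec.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;

// Hypothetical helper: fabricates one random CSV record
// matching the num1,num2,lac,timestamp format above.
public class DataGeneratorUtility {
    private static final Random RAND = new Random();

    public static String generateData() {
        // Random 138xxxxxxxx-style phone numbers (prefix is illustrative)
        String num1 = "138" + String.format("%08d", RAND.nextInt(100000000));
        String num2 = "138" + String.format("%08d", RAND.nextInt(100000000));
        int lac = RAND.nextInt(10); // base station id, small illustrative range
        // Random timestamp within roughly the last 24 hours
        long ts = System.currentTimeMillis() - RAND.nextInt(24 * 3600) * 1000L;
        String time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(ts));
        return num1 + "," + num2 + "," + lac + "," + time;
    }
}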
Finally, here is the implementation of MyInputFormat, which ties together the InputSplit and RecordReader above.
public class MyInputFormat extends InputFormat<NullWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
        int splitNumber = Integer.parseInt(context.getConfiguration().get("data.split_number"));
        int dataNumber = Integer.parseInt(context.getConfiguration().get("data.data_number"));
        List<InputSplit> results = new ArrayList<InputSplit>();
        for (int i = 0; i < splitNumber; i++) {
            results.add(new MyInputSplit(dataNumber));
        }
        return results;
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new MyRecordReader();
    }
}
Note: MyInputFormat produces the split information and supplies the corresponding RecordReader; it serves as the bridge between our data generator and the MapReduce framework.
2) The map definition
public class MyMap extends Mapper<NullWritable, Text, NullWritable, Text> {

    @Override
    protected void map(NullWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}
Note: MyMap is very simple; it just writes out each key/value pair as-is.
3) Job configuration options
public class MyJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        Path outputDir = new Path(args[0]);
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setJobName("myjob");
        job.setJarByClass(MyJob.class);
        job.setMapperClass(MyMap.class);
        // *) Set the number of reduce tasks to 0
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // *) Set MyInputFormat
        job.setInputFormatClass(MyInputFormat.class);
        // *) Pass in the input-related parameters
        job.getConfiguration().set("data.split_number", args[1]);
        job.getConfiguration().set("data.data_number", args[2]);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}
Note: Some parameter validation is omitted here. The main points are to call setNumReduceTasks(0) and to set the InputFormat class to MyInputFormat. OK, let it run!!!
Test:
After packaging the program into a jar, run it.
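The original does not show the exact command line. Assuming the jar is named datagen.jar (hypothetical) and recalling MyJob's argument order (output directory, split number, rows per split), the invocation would look roughly like:

hadoop jar datagen.jar MyJob /tmp/data_gen 2 10

Here /tmp/data_gen is an assumed output path; the arguments 2 and 10 match the result reported below (2 maps, 10 rows each).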
Result: The MapReduce job runs successfully. There are two map tasks in total, and each map generates 10 rows of records.
Verify the number of map output files:
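Assuming the output path used above, the generated files can be listed with:

hadoop fs -ls /tmp/data_gen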
Commentary: part-m-00000 and part-m-00001 are the output files generated in the map phase.
Verify the file content:
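Again assuming the path above, the content can be inspected with:

hadoop fs -cat /tmp/data_gen/part-m-00000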
Comment: The data results meet expectations.
Summary:
This section described how to use MapReduce to generate a test dataset. It also serves as a note to myself; I hope it leaves you with a clearer understanding of MapReduce's internal mechanisms.