1. Hadoop programming: reading and writing HDFS
The starting point of the Hadoop file API is the FileSystem class.
Calling the factory method FileSystem.get(Configuration conf) returns a FileSystem instance.
In code:
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
A FileSystem object dedicated to the local file system is obtained with the factory method FileSystem.getLocal(Configuration conf):
FileSystem local = FileSystem.getLocal(conf);
The Hadoop file API uses Path objects to name files and directories and FileStatus objects to store their metadata. The PutMerge program merges all files from a local directory into one HDFS file.
Path inputDir = new Path(args[0]);
FileStatus[] inputFiles = local.listStatus(inputDir);
The PutMerge program reads each file through an FSDataInputStream obtained from its Path:
FSDataInputStream in = local.open(inputFiles[i].getPath());
FSDataInputStream extends Java's standard DataInputStream, so it can be read like an ordinary input stream (with additional support for random access).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutMerge {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        // Set the input directory and the output file
        Path inputDir = new Path(args[0]);
        Path hdfsFile = new Path(args[1]);

        try {
            // Get the list of local files
            FileStatus[] inputFiles = local.listStatus(inputDir);
            // Create the HDFS output stream
            FSDataOutputStream out = hdfs.create(hdfsFile);

            for (int i = 0; i < inputFiles.length; i++) {
                System.out.println(inputFiles[i].getPath().getName());
                // Open the local input stream
                FSDataInputStream in = local.open(inputFiles[i].getPath());
                byte[] buffer = new byte[256];
                int bytesRead = 0;
                while ((bytesRead = in.read(buffer)) > 0) {
                    out.write(buffer, 0, bytesRead);
                }
                in.close();
            }
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The program's flow:
1. Set the local source directory and the HDFS target file from the user-supplied arguments.
2. Get the metadata of each file in the local input directory.
3. Create an output stream for writing to the HDFS file.
4. Loop over each file in the local directory, opening an input stream to read it.
5. Copy the bytes with the standard Java file-copy loop.
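Assuming the class has been compiled and packaged into a jar (the jar name and paths below are only examples, not from the original text), it could be invoked as:
hadoop jar putmerge.jar PutMerge /path/to/local/dir /user/someone/merged.txt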
2. MapReduce Program
The MapReduce program processes data by manipulating key-value pairs, in the general form:
map:    (K1, V1)       -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
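For instance, in a hypothetical word-count job (an illustration, not part of the original text) the general form would be instantiated as:
map:    (LongWritable offset, Text line)      -> list(Text word, IntWritable one)
reduce: (Text word, list(IntWritable counts)) -> list(Text word, IntWritable sum)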
Hadoop data types:
The MapReduce framework only allows classes that it can serialize to act as keys or values.
Specifically, a class that implements the Writable interface can be a value, while a class that implements the WritableComparable<T> interface can be either a key or a value. Keys are sorted in the reduce phase, whereas values are simply passed through.
A class that implements the WritableComparable interface:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class Edge implements WritableComparable<Edge> {

    private String departureNode;
    private String arrivalNode;

    public String getDepartureNode() { return departureNode; }

    /**
     * Specifies how to read the data in.
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        departureNode = in.readUTF();
        arrivalNode = in.readUTF();
    }

    /**
     * Specifies how to write the data out.
     */
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(departureNode);
        out.writeUTF(arrivalNode);
    }

    /**
     * Defines the ordering of the data.
     */
    @Override
    public int compareTo(Edge o) {
        return (departureNode.compareTo(o.departureNode) != 0)
                ? departureNode.compareTo(o.departureNode)
                : arrivalNode.compareTo(o.arrivalNode);
    }
}
3. Mapper
A mapper needs to extend the MapReduceBase base class and implement the Mapper interface. MapReduceBase is the base class for both mappers and reducers.
MapReduceBase provides:
void configure(JobConf job) --- extracts parameters from the XML configuration files or from the application's main class; it is called before data processing starts.
void close() --- called before the map task ends; it completes any cleanup work, such as closing database connections.
The map() method of the Mapper class processes the data split assigned to the map task one key/value pair at a time.
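As an illustration (not from the original text), a minimal mapper for the hypothetical word-count job, written against the same old org.apache.hadoop.mapred API, might look like this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example: emits (word, 1) for every word in a line of text.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // key is the byte offset of the line; value is the line itself
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
    }
}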
4. Reducer
A reducer implementation also needs to extend the MapReduceBase base class and implement the Reducer interface, which gives it the same configure() and close() hooks.
When the reducer task receives the output of the mappers, the input data is sorted by key and all values sharing a key are grouped together. The reduce() method is then called once per key; it iterates over the values associated with that key and generates a (possibly empty) list of (K3, V3) pairs. Finally, the output of the reduce phase is collected by the OutputCollector and written to the output file.
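A matching reducer for the hypothetical word-count mapper sketched above (again using the old mapred API, as an illustration only) could be:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example: sums the 1s emitted for each word.
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        // Iterate over all values grouped under this key
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}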
5. Partitioner
Partitioner: directing the output of the mappers to the right reducers is the job of the Partitioner. An application-tailored Partitioner ensures that all mapper output records that belong together (here, all edges with the same departure node) are sent to the same reducer; otherwise the related records end up split across two reducers, and both produce wrong results.
A class that implements the Partitioner interface:
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class EdgePartitioner implements Partitioner<Edge, Writable> {

    @Override
    public int getPartition(Edge key, Writable value, int numPartitions) {
        // Send all edges with the same departure node to the same reducer
        return key.getDepartureNode().hashCode() % numPartitions;
    }

    @Override
    public void configure(JobConf conf) {}
}
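To wire the custom partitioner into a job (a sketch; the JobConf variable name is assumed), the old mapred API provides:
conf.setPartitionerClass(EdgePartitioner.class);
Without this call the framework falls back to the default HashPartitioner.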
6. The basic principles of MapReduce processing:
1. Split the input data into chunks and process them in parallel across the Hadoop cluster; each chunk is called an input split.
2. Splitting the input data is what enables parallelism: because FSDataInputStream supports random reads, each chunk can be turned into a split and processed by a machine on which that split resides, so parallel execution happens automatically.
3. MapReduce operates on key/value pairs, but Hadoop also supports many other data formats and allows custom formats to be defined.
Summary: the MapReduce framework involves the operations of splitting, shuffling, partitioning, and combining the data, plus, of course, the core map and reduce operations.
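To tie the pieces together, a minimal driver for the hypothetical word-count job used in the earlier sketches (old mapred API; class names and paths are assumptions, not from the original text) could look like:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // Output key/value types produced by the mapper and reducer
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        // Input splits are read from args[0]; results are written to args[1]
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}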