The way Hadoop cuts and reads input files is defined in an implementation of the InputFormat interface. TextInputFormat is the default implementation, used when you want each line of content as input data without any meaningful key. The key returned by TextInputFormat is the byte offset of each line within the file, which is rarely useful in practice.
In the Mapper, the key and value types must match the InputFormat. With TextInputFormat the key is a byte offset, so it is a LongWritable and the value is Text. With KeyValueTextInputFormat, the parts before and after the first separator (a tab by default) are both of type Text, so you have to change the Mapper's type parameters and the signature of its map() method to accommodate the new key type.
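The first-tab split that KeyValueTextInputFormat performs on each line can be illustrated in plain Java (a minimal sketch of the splitting logic only, not Hadoop's actual implementation; the class and method names here are made up):

```java
public class KeyValueSplit {
    // Mimics KeyValueTextInputFormat: key = text before the first tab,
    // value = everything after it. With no tab, the whole line is the key
    // and the value is empty.
    public static String[] splitLine(String line) {
        int pos = line.indexOf('\t');
        if (pos == -1) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitLine("http://example.com\t2012-07-22");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```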
The input to a MapReduce job is not necessarily external data; it is often the output of other MapReduce jobs, and you can define the output format as well. The default output format stays consistent with the data format that KeyValueTextInputFormat can read (each record line is a tab-delimited key and value), but Hadoop also provides a more efficient binary compressed file format, called a sequence file, that is optimized for Hadoop processing. When multiple MapReduce jobs are chained together, sequence files are the preferred intermediate format: SequenceFileInputFormat is the class that reads them, and the key and value classes of a sequence file can be defined by the user. The output types of one job must match the input types of the next.
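The wiring for such a chain might look like the following sketch using the old mapred API (the paths are hypothetical, and the mapper/reducer settings that a real job needs are omitted for brevity):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class ChainedJob {
    public static void main(String[] args) throws Exception {
        // First job writes a sequence file instead of plain text
        JobConf first = new JobConf(ChainedJob.class);
        first.setOutputFormat(SequenceFileOutputFormat.class);
        first.setOutputKeyClass(Text.class);
        first.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(first, new Path("input"));        // hypothetical
        FileOutputFormat.setOutputPath(first, new Path("intermediate")); // hypothetical
        JobClient.runJob(first);

        // Second job reads the sequence file back; its input key/value
        // types must match the first job's output types
        JobConf second = new JobConf(ChainedJob.class);
        second.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(second, new Path("intermediate"));
        FileOutputFormat.setOutputPath(second, new Path("output"));     // hypothetical
        JobClient.runJob(second);
    }
}
```

This is a job-configuration fragment; running it requires a working Hadoop cluster and complete mapper/reducer settings.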
To define your own InputFormat, implement two methods:
getSplits() determines all the files used as input data and cuts the input data into input splits; each map task processes one split.
getRecordReader() returns a reader that loops over the records in a given split and parses each record into a key and value of the predefined types.
In practice, a split is usually sized to a block; the default block size in HDFS is 64 MB.
FileInputFormat's isSplitable() method checks whether a given file can be split, and returns true by default. Sometimes you may want a file to remain a single chunk, in which case you can override it to return false.
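For example, to keep each file whole, a subclass can override isSplitable() as in this sketch against the old mapred API (the class name is made up):

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical InputFormat that refuses to split files:
// every input file becomes exactly one split, i.e. one map task.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // the default is true; false keeps the file whole
    }
}
```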
LineRecordReader implements RecordReader; the implementation is based on encapsulation, and most of the work happens in next().
We build our InputFormat class by extending FileInputFormat and implement a factory method that returns the RecordReader.
In addition to building that class, TimeUrlRecordReader implements the six methods of RecordReader; it is largely a wrapper around KeyValueTextInputFormat's reader, except that it converts the Text value of each record to URLWritable.
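The factory method described above might look like this sketch (URLWritable and TimeUrlRecordReader are the user-defined classes mentioned in the text; their implementations are assumed to exist):

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class TimeUrlTextInputFormat extends FileInputFormat<Text, URLWritable> {
    // Factory method: hand each split to our own RecordReader,
    // which parses records into Text keys and URLWritable values
    public RecordReader<Text, URLWritable> getRecordReader(
            InputSplit input, JobConf job, Reporter reporter) throws IOException {
        return new TimeUrlRecordReader(job, input);
    }
}
```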
When outputting data to a file, OutputFormat is used. Since each reducer writes its output to its own file, the output does not need to be split.
The output files are placed in a common directory and are usually named part-nnnnn, where nnnnn is the partition ID of the reducer. RecordWriter formats the output records, just as RecordReader parses the input format.
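The part-nnnnn name is just a zero-padded partition number, as this quick plain-Java illustration shows (the helper name is made up):

```java
public class PartName {
    // Reducer output files are named part-nnnnn,
    // with the partition ID zero-padded to five digits
    public static String partFileName(int partition) {
        return String.format("part-%05d", partition);
    }

    public static void main(String[] args) {
        System.out.println(partFileName(0)); // part-00000
    }
}
```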
NullOutputFormat is a trivial implementation of OutputFormat that produces no output, and it does not need to inherit from FileOutputFormat. More fundamentally, an OutputFormat (or InputFormat) may be dealing with a database, not a file.
For personalized output, you can extend FileOutputFormat and supply a wrapped RecordWriter whose write() method you override, assuming plain file output is not enough.
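Such a custom write() might format each key/value pair like this sketch of a RecordWriter in the old mapred API (the class name and comma-separated output format are illustrative, not Hadoop's defaults):

```java
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical writer that emits "value,key" lines instead of the
// default tab-separated "key<TAB>value" records.
public class CommaRecordWriter implements RecordWriter<Text, Text> {
    private final DataOutputStream out;

    public CommaRecordWriter(DataOutputStream out) {
        this.out = out;
    }

    public void write(Text key, Text value) throws IOException {
        out.writeBytes(value.toString() + "," + key.toString() + "\n");
    }

    public void close(Reporter reporter) throws IOException {
        out.close();
    }
}
```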
jar -xvf ./example.jar unpacks the JAR package.
When migrating local files to HDFS, make sure the address in the program is written correctly and does not point at some other, unrelated machine.
Complete the program in Eclipse and build the jar package; then, in the Hadoop directory, execute the hadoop command to see the results.
If you use the third-party plug-in FatJar, you can combine the MapReduce jar and the Jedis jar into one jar for Hadoop, without changing the manifest configuration.
We export the jar package (without including Hadoop's own jars) into the Hadoop directory and execute the hadoop command with the fully qualified class name.
package com.kane.hdfs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class FindFileOnHDFS {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        getHDFSNodes();
        getFileLocal();
    }

    // List the data nodes in the HDFS cluster
    public static void getHDFSNodes() throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Get the distributed file system
        DistributedFileSystem hdfs = (DistributedFileSystem) fs;
        // Get stats for all data nodes
        DatanodeInfo[] dataNodeStats = hdfs.getDataNodeStats();
        // Loop and print each node name
        for (int i = 0; i < dataNodeStats.length; i++) {
            System.out.println("DataNode_" + i + "_name: " + dataNodeStats[i].getHostName());
        }
    }

    /**
     * Find the location of a file in the HDFS cluster
     * @throws IOException
     */
    public static void getFileLocal() throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path fPath = new Path("user/hadoop/20120722"); // word.txt
        // Get the file information from the file system
        FileStatus fileStatus = hdfs.getFileStatus(fPath);
        // Get the block locations of the file
        BlockLocation[] blkLocations = hdfs.getFileBlockLocations(fileStatus, 0, 1000);
        int blockLen = blkLocations.length;
        for (int i = 0; i < blockLen; i++) {
            String[] hosts = blkLocations[i].getHosts();
            System.out.println("block_" + i + "_location: " + hosts[0]);
        }
    }
}
Hadoop can be set up in three modes. The default is standalone mode: it does not use HDFS and does not load any daemons; it is mainly for development and debugging.
Pseudo-distributed mode runs Hadoop on a single-node cluster, with all daemons on one machine. It adds debugging capability on top of standalone mode, allowing you to check memory usage, HDFS input and output, and interactions with the other daemons.
Fully distributed mode is the real production situation. It emphasizes distributed storage and distributed computing, and explicitly declares the host names on which the NameNode and JobTracker daemons reside.
The HDFS replication count is also increased to gain the benefits of distributed storage.
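The replication count is controlled by the dfs.replication property (typically 1 in pseudo-distributed mode, 3 on a real cluster). A minimal sketch of setting it programmatically through Hadoop's Configuration API (a configuration fragment; it is usually set in hdfs-site.xml instead):

```java
import org.apache.hadoop.conf.Configuration;

public class ReplicationConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.replication controls how many copies HDFS keeps of each block
        conf.set("dfs.replication", "3");
        System.out.println(conf.get("dfs.replication"));
    }
}
```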
Hadoop learning: user-defined InputFormat/OutputFormat; referencing the Mapper class in mapreduce; the three modes.