Hadoop learning: custom InputFormat/OutputFormat, the mapreduce.Mapper class, and the three running modes


The way Hadoop splits and reads input files is defined by an implementation of the InputFormat interface. TextInputFormat is the default implementation, used when you want to receive one line of content at a time as input data and no particular key is required. The key returned by TextInputFormat is the byte offset of each line, which usually has no direct use.

Inside the Mapper, LongWritable (key) and Text (value) are used with TextInputFormat, because the key is a byte offset and can therefore be a LongWritable. With KeyValueTextInputFormat, the parts before and after the first separator are both of type Text, so you have to change the Mapper implementation and its map() method to accommodate the new key type, as in the sketch below.
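As a minimal, hedged sketch (the class names OffsetLineMapper and KeyValueMapper are made up for illustration, using the org.apache.hadoop.mapreduce API), this shows how the Mapper's input key type follows the chosen InputFormat:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat the key is the byte offset (LongWritable) and the value is the line (Text).
class OffsetLineMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(Long.toString(key.get())), value);
    }
}

// With KeyValueTextInputFormat both key and value are Text (split at the first separator).
class KeyValueMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}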



The input of a MapReduce job is not necessarily external data; it is often the output of other MapReduce jobs, and the output format can be defined as well. The default output format stays consistent with the data format that KeyValueTextInputFormat can read (each record is a line containing a tab-separated key and value), but Hadoop also provides a more efficient binary, compressible file format called a sequence file, which is optimized for Hadoop processing. When several MapReduce jobs are chained together, sequence files are the preferred format; the class that reads them is SequenceFileInputFormat, and the key and value types of a sequence file can be defined by the user. Output and input types must match, as in the driver sketch below.
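A hedged driver sketch (not code from the original article, assuming a Hadoop 2.x-style Job API; the class name ChainedJobs, job names, and paths are made up). It chains two jobs through a sequence file and relies on the identity mapper/reducer so the default (LongWritable, Text) types line up between the jobs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: read text input, write the intermediate result as a sequence file.
        Job first = Job.getInstance(conf, "first-job");
        first.setJarByClass(ChainedJobs.class);
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(first, new Path(args[0]));
        FileOutputFormat.setOutputPath(first, new Path(args[1]));
        first.waitForCompletion(true);

        // Job 2: read the sequence file back; its input types must match job 1's output types.
        Job second = Job.getInstance(conf, "second-job");
        second.setJarByClass(ChainedJobs.class);
        second.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(second, new Path(args[1]));
        FileOutputFormat.setOutputPath(second, new Path(args[2]));
        second.waitForCompletion(true);
    }
}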

To define your own InputFormat, implement two methods (their shape in the older API is sketched after this list):

getSplits() determines all the files used as input data and cuts the input into input splits; each map task processes one split.

getRecordReader() returns a reader that loops over the records of a given split and parses each record into a key and a value of the types defined in advance.
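For reference, these two methods have roughly the following shape in the older org.apache.hadoop.mapred API, whose names match the ones used here (the newer org.apache.hadoop.mapreduce API calls the second one createRecordReader):

import java.io.IOException;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public interface InputFormat<K, V> {
    // Cut the input data into splits; each map task processes one split.
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    // Produce a RecordReader that turns the records of one split into (key, value) pairs.
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
            throws IOException;
}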

In practice, a split is usually sized as a block, and the default block size in HDFS is 64 MB.

FileInputFormat's isSplitable() method checks whether a given file may be split; it returns true by default. Sometimes you may want a whole file to form a single split, in which case you can make it return false, as in the sketch below.
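A minimal sketch against the old mapred API (the class name WholeFileTextInputFormat is hypothetical):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Never split input files, so each file becomes exactly one split
// and is processed by a single map task.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }
}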

LineRecordReader implements RecordReader; the implementation is based on encapsulation, and most of the work happens in next().

We build our own InputFormat class by extending FileInputFormat and implementing a factory method that returns the RecordReader.

In addition to building that class, TimeUrlRecordReader implements the six methods of RecordReader. It is largely a wrapper around the record reader used by KeyValueTextInputFormat, but it converts the Text value of each record into a URLWritable; a sketch follows.
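The following is a hedged reconstruction in the spirit of that description, using the old mapred API. The class names follow the text, but the bodies are illustrative rather than the article author's exact code, and URLWritable is reduced to a thin Text wrapper:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueLineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Illustrative value type: a URL stored as text.
class URLWritable implements Writable {
    private final Text url = new Text();
    public void set(String s) { url.set(s); }
    public void write(DataOutput out) throws IOException { url.write(out); }
    public void readFields(DataInput in) throws IOException { url.readFields(in); }
    public String toString() { return url.toString(); }
}

// The InputFormat's factory method returns the custom RecordReader.
public class TimeUrlTextInputFormat extends FileInputFormat<Text, URLWritable> {
    public RecordReader<Text, URLWritable> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        return new TimeUrlRecordReader(job, (FileSplit) split);
    }
}

// Wraps KeyValueLineRecordReader and converts the Text value to URLWritable.
class TimeUrlRecordReader implements RecordReader<Text, URLWritable> {
    private final KeyValueLineRecordReader lineReader;
    private final Text lineKey;
    private final Text lineValue;

    TimeUrlRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new KeyValueLineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    public boolean next(Text key, URLWritable value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        key.set(lineKey);
        value.set(lineValue.toString());
        return true;
    }

    public Text createKey() { return new Text(); }
    public URLWritable createValue() { return new URLWritable(); }
    public long getPos() throws IOException { return lineReader.getPos(); }
    public float getProgress() throws IOException { return lineReader.getProgress(); }
    public void close() throws IOException { lineReader.close(); }
}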

When outputting data to files, an OutputFormat is used. Since each reducer only needs to write its output to its own file, the output does not need to be split.

The output files are placed in a common folder and are usually named part-nnnnn, where nnnnn is the partition ID of the reducer. RecordWriter formats the output results, just as RecordReader parses the input format.

NullOutputFormat is a simple implementation of OutputFormat that produces no output, and it does not need to inherit from FileOutputFormat. More fundamentally, an OutputFormat (or InputFormat) may be dealing with a database rather than a file.

To personalize the output, you can extend FileOutputFormat and override the write() method of the RecordWriter it wraps, for example when the result should not just be exported to a plain file; a sketch follows.
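A hedged sketch of that idea against the old mapred API (the class name CsvOutputFormat and the comma-separated format are made up for the example):

import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Writes "key,value" lines instead of the default tab-separated records.
public class CsvOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        // One output file per reducer; the framework passes the part-nnnnn name.
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        final DataOutputStream out = fs.create(file, progress);

        return new RecordWriter<Text, Text>() {
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(key.toString() + "," + value.toString() + "\n");
            }
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}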


jar -xvf ./example.jar   (unpack the jar package)

Local files can be migrated to HDFS; make sure the address in the program is written correctly and does not point to some other, unrelated machine.


Complete the program in Eclipse and build it into a jar package; then, in the Hadoop directory, execute the hadoop command to see the results.

If you use the third-party Fat Jar plug-in, the MapReduce jar and the Jedis jar can be combined into one jar for Hadoop; this way the manifest configuration does not need to be changed.


We export the jar package (without bundling the Hadoop jars) into the Hadoop directory and execute the hadoop command, giving the class by its fully qualified name.
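For example (assuming the exported jar is named example.jar and sits in the Hadoop directory), the command with the fully qualified class name would look like:

hadoop jar example.jar com.kane.hdfs.FindFileOnHDFS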

package com.kane.hdfs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class FindFileOnHDFS {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        getHDFSNodes();
        getFileLocal();
    }

    // Print the DataNodes of the HDFS cluster
    public static void getHDFSNodes() throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Get the distributed file system
        DistributedFileSystem hdfs = (DistributedFileSystem) fs;
        // Get all DataNodes
        DatanodeInfo[] dataNodeStats = hdfs.getDataNodeStats();
        // Print each node
        for (int i = 0; i < dataNodeStats.length; i++) {
            System.out.println("DataNode_" + i + "_name: " + dataNodeStats[i].getHostName());
        }
    }

    /**
     * Find the location of a file in the HDFS cluster
     * @throws IOException
     */
    public static void getFileLocal() throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        Path fPath = new Path("user/hadoop/20120722"); // word.txt
        // Get the file status from the file system
        FileStatus fileStatus = hdfs.getFileStatus(fPath);
        // Get the block locations of the file
        BlockLocation[] blkLocations = hdfs.getFileBlockLocations(fileStatus, 0, 1000);
        int blockLen = blkLocations.length;
        for (int i = 0; i < blockLen; i++) {
            String[] hosts = blkLocations[i].getHosts();
            System.out.println("Block_" + i + "_location: " + hosts[0]);
        }
    }
}



Hadoop can be set up in three modes. The default is standalone (local) mode: it does not use HDFS and does not load any daemons; it is mainly used for development and debugging.

Pseudo-distributed mode runs Hadoop on a single-node cluster where all daemons run on one machine. It adds code-debugging capability and lets you inspect memory usage, HDFS input and output, and interaction with the other daemons.

Fully distributed mode is the real production setup. It emphasizes distributed storage and distributed computing, and explicitly declares the host names on which the NameNode and JobTracker daemons run.

It also increases the HDFS replication count to gain the benefits of distributed storage. A minimal configuration sketch follows.
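The following is a minimal sketch only, assuming Hadoop 1.x-era property names (fs.default.name, mapred.job.tracker, dfs.replication); the host names, ports, and replication value are placeholders. Pseudo-distributed mode points everything at localhost with replication 1, while fully distributed mode uses real host names and a higher replication count such as 3.

<!-- core-site.xml: where the NameNode runs (localhost for pseudo-distributed) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: where the JobTracker runs -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication count (1 for pseudo-distributed, e.g. 3 when fully distributed) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>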

