How Hadoop splits and reads input files is defined by an implementation of the InputFormat interface. TextInputFormat is the default implementation: it fetches one line of content at a time as input data, and since a line has no natural key, the key TextInputFormat returns is the byte offset of the line within the file, which has little apparent use on its own.
The Mapper used earlier declared LongWritable (key) and Text (value), which matches TextInputFormat: because its key is a byte offset, it can be a LongWritable. With KeyValueTextInputFormat, each line is split at the first separator into a Text key and a Text value, so the Mapper class and its map() method must be modified to accommodate the new key type.
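As a sketch of that signature change (using the old org.apache.hadoop.mapred API that matches the getRecordReader()/JobTracker terminology in these notes; the class name and the swap logic are illustrative, not from the original program):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// With KeyValueTextInputFormat the key is no longer a LongWritable byte
// offset; both key and value arrive as Text, split at the first tab.
public class SwapKeyValueMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Example logic: emit (value, key), i.e. invert each pair.
        output.collect(value, key);
    }
}
```

The only required change from the TextInputFormat version is the first type parameter (Text instead of LongWritable) and the matching map() parameter type.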
The input to a MapReduce job is not necessarily external data; it is often the output of another MapReduce job, and the output format can likewise be customized. The default output format is consistent with the format that KeyValueTextInputFormat can read back in (each record is a line containing a tab-separated key and value). Hadoop also provides a more efficient binary, compressible file format called a sequence file. Sequence files are optimized for Hadoop processing and are the preferred format when chaining multiple MapReduce jobs; the class that reads them is SequenceFileInputFormat. The key and value types of a sequence file can be user-defined, but the output types of one job must match the input types of the next.
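A minimal sketch of chaining via sequence files, using the old mapred API (the class name and the "intermediate" path are placeholders; the point is that the (Text, IntWritable) types must match on both sides):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class ChainedJobConfig {
    // First job writes its (Text, IntWritable) output as a sequence file.
    public static JobConf firstJob() {
        JobConf conf = new JobConf(ChainedJobConfig.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(conf, new Path("intermediate"));
        return conf;
    }

    // Second job reads the same sequence file back in.
    public static JobConf secondJob() {
        JobConf conf = new JobConf(ChainedJobConfig.class);
        conf.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("intermediate"));
        return conf;
    }
}
```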
To customize an InputFormat, two methods must be implemented:
getSplits() identifies all files used as input data and splits the input into input splits; each map task processes one split
getRecordReader() loops over a given split, extracting records and parsing each one into key and value objects of the predefined types
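The two methods above correspond to the old org.apache.hadoop.mapred.InputFormat interface, whose shape is roughly:

```java
import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Shape of the old-API InputFormat contract (declared here for illustration).
public interface InputFormat<K, V> {
    // Splits the input files into InputSplits, roughly numSplits of them.
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    // Returns a reader that extracts (key, value) records from one split.
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                       Reporter reporter) throws IOException;
}
```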
In practice, splits are sized in units of blocks, and the default block size in HDFS is 64 MB
FileInputFormat has an isSplitable() method that checks whether a given file can be split; it returns true by default. Sometimes you may want a file to remain whole, as its own single split; in that case override it to return false
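A minimal sketch of such an override (the class name is illustrative; KeyValueTextInputFormat is used as the base here, but any FileInputFormat subclass works the same way):

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

// A whole-file input format: overriding isSplitable() to return false
// forces each input file to be processed as a single split by one mapper.
public class NonSplittableTextInputFormat extends KeyValueTextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
```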
LineRecordReader implements the RecordReader interface; the implementation is a thin wrapper, with most of the work done in next()
We build our own InputFormat class by extending FileInputFormat and implementing a factory method that returns our RecordReader
In addition to that class, TimeUrlRecordReader implements the six methods of the RecordReader interface; it is essentially a wrapper around KeyValueTextInputFormat's record reader, except that it converts each record's Text value into a URLWritable
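A sketch of the whole arrangement under those assumptions (the class names TimeUrlTextInputFormat, TimeUrlLineRecordReader, and URLWritable follow the notes above, but the bodies here are my reconstruction, not the original code):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueLineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// A custom value type holding a URL string.
class URLWritable implements Writable {
    protected Text url = new Text();
    public void set(String s) { url.set(s); }
    public void write(DataOutput out) throws IOException { url.write(out); }
    public void readFields(DataInput in) throws IOException { url.readFields(in); }
    public String toString() { return url.toString(); }
}

// The InputFormat itself is just a factory for the record reader.
public class TimeUrlTextInputFormat extends FileInputFormat<Text, URLWritable> {
    public RecordReader<Text, URLWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new TimeUrlLineRecordReader(job, (FileSplit) split);
    }
}

// Wraps KeyValueLineRecordReader, converting the Text value to URLWritable.
class TimeUrlLineRecordReader implements RecordReader<Text, URLWritable> {
    private final KeyValueLineRecordReader inner;
    private final Text innerValue;

    TimeUrlLineRecordReader(JobConf job, FileSplit split) throws IOException {
        inner = new KeyValueLineRecordReader(job, split);
        innerValue = inner.createValue();
    }
    public boolean next(Text key, URLWritable value) throws IOException {
        if (!inner.next(key, innerValue)) return false;
        value.set(innerValue.toString());
        return true;
    }
    public Text createKey() { return inner.createKey(); }
    public URLWritable createValue() { return new URLWritable(); }
    public long getPos() throws IOException { return inner.getPos(); }
    public float getProgress() throws IOException { return inner.getProgress(); }
    public void close() throws IOException { inner.close(); }
}
```

The six delegating methods (next, createKey, createValue, getPos, getProgress, close) are exactly the RecordReader interface; only next() does anything beyond forwarding.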
When writing data out to files, OutputFormat is used. Because each reducer only needs to write its output to its own file, the output does not need to be split.
The output files are placed in a common directory and usually named part-nnnnn, where nnnnn is the partition ID of the reducer. RecordWriter formats the output, just as RecordReader parses the input
NullOutputFormat is a trivial implementation of OutputFormat: it produces no output and does not need to extend FileOutputFormat. More significant are the OutputFormat (and InputFormat) implementations that deal with a database rather than files
Personalized output can be achieved by subclassing FileOutputFormat and providing a wrapped RecordWriter whose write() method is overridden; the same approach applies if you want to send output somewhere other than a file
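A minimal sketch of that pattern under the old mapred API (the class name and the comma-separated record layout are illustrative choices, not from the original program):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Writes each record as "key,value" instead of the default tab separation.
public class CommaTextOutputFormat extends FileOutputFormat<Text, Text> {
    @Override
    public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored,
            JobConf job, String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        final FSDataOutputStream out = fs.create(file, progress);
        return new RecordWriter<Text, Text>() {
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(key + "," + value + "\n");
            }
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}
```

All the personalization lives in write(); getRecordWriter() is the factory that OutputFormat requires.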
jar -xvf ./example.jar unpacks the jar package
When migrating local files to HDFS, make sure the address in the program is written correctly; do not point it at some other, unrelated machine
Complete the program in Eclipse, build it into a jar, put it in the Hadoop folder, and run it with the hadoop command to see the results
If you use the third-party plug-in FatJar, you can merge the MapReduce jar and the Jedis jar into a single jar for Hadoop, so there is no need to modify the manifest configuration information
Hadoop can be set up in three modes. The default is standalone (local) mode: it does not use HDFS and does not launch any daemons; it is mainly for development and debugging
Pseudo-distributed mode runs Hadoop on a "cluster" of a single node, with all daemons on one machine. It adds debugging capability, allowing you to examine memory usage, HDFS input and output, and the interaction between daemons
Fully distributed mode reflects the real deployment scenario, emphasizing distributed storage and distributed computation. The host names of the NameNode and JobTracker daemons are declared explicitly, and the HDFS replication parameter is increased to gain the benefits of distributed storage
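As an illustration of the configuration differences, a minimal pseudo-distributed setup might look like the following (property names are from the pre-2.x Hadoop configuration that matches the JobTracker terminology above; the host and port values are placeholders):

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: run the JobTracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can only hold one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

In fully distributed mode, localhost would be replaced by the declared NameNode and JobTracker host names, and dfs.replication would be raised (commonly to 3).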