In Hadoop, the commonly used TextInputFormat treats the line break as the record separator.
In practice, however, a single record often spans multiple lines, for example:

    <doc>
    ...
    </doc>

In this case, we need to extend TextInputFormat to support such records.
Let's take a look at the original implementation:
    public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            // By default, textinputformat.record.delimiter = "\n" (set in the configuration)
            String delimiter = context.getConfiguration().get(
                    "textinputformat.record.delimiter");
            byte[] recordDelimiterBytes = null;
            if (null != delimiter)
                recordDelimiterBytes = delimiter.getBytes();
            return new LineRecordReader(recordDelimiterBytes);
        }

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            CompressionCodec codec = new CompressionCodecFactory(
                    context.getConfiguration()).getCodec(file);
            return codec == null;
        }
    }
Looking at the code above, it is easy to see that the record separator is actually controlled by the "textinputformat.record.delimiter" configuration property.
This gives us two options:
(1) Set "textinputformat.record.delimiter" to the custom separator directly in the job configuration. This approach is rather hacky: the setting applies to the whole job, so it can easily affect other code that expects the default line-based behavior.
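For reference, option (1) would look roughly like the following driver fragment. This is a sketch, not code from the original article; the `</doc>` delimiter is an assumption based on the multi-line record example above, and the job name is made up. Note that the setting affects every TextInputFormat-based input in the job, which is exactly why it is fragile.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical driver fragment for option (1): override the record
// delimiter globally via the job configuration. Any other input in this
// job that relies on the default "\n" behavior would silently break.
Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "</doc>"); // assumed separator
Job job = Job.getInstance(conf, "doc-job");
job.setInputFormatClass(TextInputFormat.class);
```

This is a configuration fragment only; it requires a Hadoop installation and the rest of a normal job driver (mapper, reducer, input/output paths) to run.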
(2) Subclass TextInputFormat and construct the returned LineRecordReader with a custom separator.
This article takes the second approach; the code is as follows:
    public class DocInputFormat extends TextInputFormat {

        // Custom record separator; the example records above end with </doc>
        private static final String RECORD_DELIMITER = "</doc>";

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext tac) {
            byte[] recordDelimiterBytes = RECORD_DELIMITER.getBytes();
            return new LineRecordReader(recordDelimiterBytes);
        }

        @Override
        public boolean isSplitable(JobContext context, Path file) {
            CompressionCodec codec = new CompressionCodecFactory(
                    context.getConfiguration()).getCodec(file);
            return codec == null;
        }
    }
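To see what the custom delimiter buys us, here is a plain-Java sketch, with no Hadoop dependencies, of how a delimiter-cut stream turns into records. The class and constant names (`RecordSplitter`, `DELIMITER`) are illustrative, not Hadoop APIs; the sketch assumes, as LineRecordReader does, that the delimiter itself is consumed and each piece between delimiters becomes one record handed to a single map() call.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of what LineRecordReader does with a custom separator.
public class RecordSplitter {
    static final String DELIMITER = "</doc>"; // assumed custom separator

    public static List<String> split(String input) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = input.indexOf(DELIMITER, start)) >= 0) {
            // The delimiter is consumed, like the '\n' in the default case
            records.add(input.substring(start, idx));
            start = idx + DELIMITER.length();
        }
        if (start < input.length()) {
            records.add(input.substring(start)); // trailing partial record
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "<doc>first record\nline two</doc><doc>second record</doc>";
        for (String r : split(data)) {
            System.out.println("RECORD: " + r.replace("\n", "\\n"));
        }
    }
}
```

With the default "\n" delimiter the first document above would be cut into two records mid-document; with "</doc>" each document arrives in the mapper whole.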
It should be noted that the InputFormat only splits the original HDFS file into String records. If a record contains further structured data, you still need to implement the deserialization logic yourself inside map().
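As a minimal sketch of that deserialization step: with "</doc>" as the record delimiter, each value arrives in the mapper as "<doc>payload" (the closing tag was consumed by the reader), so the map logic has to strip the opening tag itself. `DocParser.parseDoc` is a hypothetical helper, not part of any Hadoop API, and it assumes the simple `<doc>...</doc>` record shape used in this article.

```java
// Hypothetical helper for the parsing you would run inside map().
public class DocParser {
    public static String parseDoc(String record) {
        String trimmed = record.trim();
        // The closing </doc> was already consumed as the record delimiter;
        // strip the opening tag to recover the payload.
        if (trimmed.startsWith("<doc>")) {
            trimmed = trimmed.substring("<doc>".length());
        }
        return trimmed.trim();
    }

    public static void main(String[] args) {
        System.out.println(DocParser.parseDoc("<doc>\nhello world\n"));
    }
}
```

Inside a real Mapper, you would call a helper like this on `value.toString()` before emitting anything.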
Original article: "How to extend Hadoop's InputFormat to other delimiters". Thanks to the original author for sharing.