Processing HDF files with Hadoop
1. Preface
HDF files are a common data format in remote sensing applications. Because they are highly structured, I was troubled for a long time about how to process HDF files with Hadoop, and Googling for solutions did not turn up an ideal one. I also referred to a post officially published by the HDF Group (the website is here), which outlines ideas for processing large, medium, and small HDF files with Hadoop. Although that approach would certainly solve the problem, I personally feel it is complicated and requires a deep understanding of the HDF data format, so it is not easy to implement. I therefore kept looking and finally found a workable method, which is described in detail below.
2. MapReduce main program
Here we mainly use the netcdf library to deserialize the HDF byte stream (the library can be obtained from the netcdf project). Unlike the Java library officially provided by the HDF Group, netcdf reads and writes HDF files in pure Java, and it supports multiple scientific data formats, including HDF4 and HDF5. The official HDF Java library, by contrast, still uses C underneath for HDF file operations.
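As a quick sanity check that the library really can parse HDF4 in pure Java, independent of Hadoop, the sketch below simply opens a local HDF4 file and prints its variable names. The class name and local path are placeholders of my own, and it assumes a netcdf-java build that bundles the HDF4 reader (for example the netcdfAll jar):

package example;

import java.io.IOException;

import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class ListHdfVariables {
    public static void main(String[] args) throws IOException {
        // Hypothetical local path; point it at any HDF4 file you have.
        String path = "/data/MOD13A3.A2005274.h00v10.005.2008079143041.hdf";
        NetcdfFile file = NetcdfFile.open(path);
        try {
            // List every variable to confirm the HDF structure is readable.
            for (Variable v : file.getVariables()) {
                System.out.println(v.getFullName());
            }
        } finally {
            file.close();
        }
    }
}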
Below is the Mapper code of the MapReduce job:
package example;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URI;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import ucar.ma2.ArrayShort;
import ucar.nc2.Dimension;
import ucar.nc2.Group;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;
public class ReadMapper extends
        Mapper<Text, BytesWritable, Text, BytesWritable> {

    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        String fileName = key.toString();
        NetcdfFile file = NetcdfFile.openInMemory("hdf4", value.getBytes());
        Group dataGroup = file.findGroup("MOD_Grid_monthly_1km_VI").findGroup("Data_Fields");
        // Read the variable 1_km_monthly_red_reflectance
        Variable redVar = dataGroup.findVariable("1_km_monthly_red_reflectance");
        short[][] data = new short[1200][1200];
        if (dataGroup != null) {
            ArrayShort.D2 dataArray;
            // Read the image data in redVar
            dataArray = (ArrayShort.D2) redVar.read();
            List<Dimension> dimList = file.getDimensions();
            // Obtain the number of pixels in the y direction of the image
            Dimension ydim = dimList.get(0);
            // Obtain the number of pixels in the x direction of the image
            Dimension xdim = dimList.get(1);
            // Traverse the entire image and read the pixel values
            for (int i = 0; i < xdim.getLength(); i++) {
                for (int j = 0; j < ydim.getLength(); j++) {
                    data[i][j] = dataArray.get(i, j);
                }
            }
        }
        System.out.print(file.getDetailInfo());
    }
}
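Before moving on to the driver, it may help to see the deserialization step in isolation. The sketch below is a standalone program of my own (the HDFS URI is hypothetical, borrowed from the main program further down); it reads an entire HDF file from HDFS into a byte[] and hands it to NetcdfFile.openInMemory, which is essentially what the Mapper receives from WholeFileInputFormat:

package example;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import ucar.nc2.NetcdfFile;

public class OpenFromHdfsDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical HDFS URI; adjust to your own cluster and file layout.
        String uri = "hdfs://192.168.168.101:9000/user/hduser/hdf/MOD13A3.A2005274.h00v10.005.2008079143041.hdf";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        // Pull the whole file into memory, just as a whole-file input format
        // would do before handing the bytes to the Mapper as a BytesWritable.
        int len = (int) fs.getFileStatus(path).getLen();
        byte[] raw = new byte[len];
        FSDataInputStream in = fs.open(path);
        try {
            in.readFully(0, raw);
        } finally {
            in.close();
        }

        // Deserialize the byte[] back into an HDF file; the first argument is
        // only a display name for the in-memory dataset.
        NetcdfFile file = NetcdfFile.openInMemory("hdf4", raw);
        System.out.print(file.getDetailInfo());
        file.close();
    }
}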
The key here is the NetcdfFile.openInMemory method: this static method constructs an HDF file from a byte[], which is what makes the deserialization of HDF files possible. The following is the sample code of the main program:
package example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

import example.WholeFileInputFormat;
public class ReadMain {
    public boolean runJob(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // conf.set("mapred.job.tracker", Utils.JOBTRACKER);
        String rootPath = "/opt/hadoop-2.3.0/etc/hadoop/";
        conf.addResource(new Path(rootPath + "yarn-site.xml"));
        conf.addResource(new Path(rootPath + "core-site.xml"));
        conf.addResource(new Path(rootPath + "hdfs-site.xml"));
        conf.addResource(new Path(rootPath + "mapred-site.xml"));
        Job job = new Job(conf);
        job.setJobName("Job name:" + args[0]);
        job.setJarByClass(ReadMain.class);
        job.setMapperClass(ReadMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        boolean flag = job.waitForCompletion(true);
        return flag;
    }

    public static void main(String[] args) throws ClassNotFoundException,
            IOException, InterruptedException {
        String[] inputPaths = new String[] { "normalizeJob",
                "hdfs://192.168.168.101:9000/user/hduser/hdf/MOD13A3.A2005274.h00v10.005.2008079143041.hdf",
                "hdfs://192.168.168.101:9000/user/hduser/test/" };
        ReadMain test = new ReadMain();
        test.runJob(inputPaths);
    }
}
There are several points worth noting about the MapReduce main program:
1. The input format of the MapReduce job is WholeFileInputFormat.class, which means the input data is not split. For details about this format, refer to another article: How to submit a Yarn computing task through a Java program. A minimal sketch of such an input format is also given after this list.
2. I use Yarn 2.3.0 to execute the computing task. If you use an earlier version of Hadoop, such as 1.2.0, you can delete the conf.addResource lines in the main program above.
3. In the above MapReduce program, only the Map function is used, and the Reduce function is not set.
4. The above program uses data in the HDF4 format. In principle, data in the HDF5 format should be supported as well, since the netcdf library reads both.
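The WholeFileInputFormat class itself is not listed in this post (see the article referenced in point 1). As a rough guide only, a minimal whole-file input format matching the Mapper's <Text, BytesWritable> signature could look like the sketch below; it follows the common pattern of emitting the file name as the key and the raw file bytes as the value, and the author's actual class may differ in detail.

package example;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        // Never split: each HDF file must reach a single map task intact.
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new WholeFileRecordReader();
    }

    // Reads one entire file as a single (file name, file bytes) record.
    public static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit fileSplit;
        private TaskAttemptContext context;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = fileSplit.getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            byte[] contents = new byte[(int) fileSplit.getLength()];
            FSDataInputStream in = fs.open(path);
            try {
                in.readFully(0, contents);
            } finally {
                in.close();
            }
            key.set(path.getName());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}

Because isSplitable returns false, each HDF file is delivered whole to a single map task, which is what allows NetcdfFile.openInMemory in the Mapper to reconstruct it from the byte array.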