Processing HDF files with Hadoop
1. Preface
HDF files are a common data format in remote sensing applications. Because they are highly structured, I was troubled for a long time about how to process HDF files with Hadoop, and Googling for solutions did not turn up an ideal one. I also referred to a post officially published by the HDF Group (the website is here), which outlines ideas for processing large, medium, and small HDF files with Hadoop. Although that approach would certainly solve the problem, I personally feel it is complicated and requires a deep understanding of the HDF data format, so it is not easy to implement. I therefore kept looking and finally found a workable method, which is described in detail below.
2. MapReduce main program
Here we mainly use the netcdf library to deserialize the HDF byte stream (the library can be obtained from the netcdf project). Unlike the Java library officially provided by the HDF Group, netcdf reads and writes HDF files in pure Java, and it supports multiple scientific data formats, including HDF4 and HDF5. The official HDF Java library, by contrast, still uses C underneath for HDF file operations.
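As a quick sanity check that the library really can parse HDF4 in pure Java, independent of Hadoop, the sketch below simply opens a local HDF4 file and prints its variable names. The class name and local path are placeholders of my own, and it assumes a netcdf-java build that bundles the HDF4 reader (for example the netcdfAll jar):

package example;

import java.io.IOException;

import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class ListHdfVariables {
    public static void main(String[] args) throws IOException {
        // Hypothetical local path; point it at any HDF4 file you have.
        String path = "/data/MOD13A3.A2005274.h00v10.005.2008079143041.hdf";
        NetcdfFile file = NetcdfFile.open(path);
        try {
            // List every variable to confirm the HDF structure is readable.
            for (Variable v : file.getVariables()) {
                System.out.println(v.getFullName());
            }
        } finally {
            file.close();
        }
    }
}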
Below is the Mapper code of the MapReduce job:
package example;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URI;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import ucar.ma2.ArrayShort;
import ucar.nc2.Dimension;
import ucar.nc2.Group;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;
public class ReadMapper extends
        Mapper<Text, BytesWritable, Text, BytesWritable> {

    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        String fileName = key.toString();
        NetcdfFile file = NetcdfFile.openInMemory("hdf4", value.getBytes());
        Group dataGroup = file.findGroup("MOD_Grid_monthly_1km_VI").findGroup("Data_Fields");
        // Read the variable 1_km_monthly_red_reflectance
        Variable redVar = dataGroup.findVariable("1_km_monthly_red_reflectance");
        short[][] data = new short[1200][1200];
        if (dataGroup != null) {
            ArrayShort.D2 dataArray;
            // Read the image data in redVar
            dataArray = (ArrayShort.D2) redVar.read();
            List<Dimension> dimList = file.getDimensions();
            // Obtain the number of pixels in the y direction of the image
            Dimension ydim = dimList.get(0);
            // Obtain the number of pixels in the x direction of the image
            Dimension xdim = dimList.get(1);
            // Traverse the entire image and read the pixel values
            for (int i = 0; i < xdim.getLength(); i++) {
                for (int j = 0; j < ydim.getLength(); j++) {
                    data[i][j] = dataArray.get(i, j);
                }
            }
        }
        System.out.print(file.getDetailInfo());
    }
}
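Before moving on to the driver, it may help to see the deserialization step in isolation. The sketch below is a standalone program of my own (the HDFS URI is hypothetical, borrowed from the main program further down); it reads an entire HDF file from HDFS into a byte[] and hands it to NetcdfFile.openInMemory, which is essentially what the Mapper receives from WholeFileInputFormat:

package example;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import ucar.nc2.NetcdfFile;

public class OpenFromHdfsDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical HDFS URI; adjust to your own cluster and file layout.
        String uri = "hdfs://192.168.168.101:9000/user/hduser/hdf/MOD13A3.A2005274.h00v10.005.2008079143041.hdf";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        // Pull the whole file into memory, just as a whole-file input format
        // would do before handing the bytes to the Mapper as a BytesWritable.
        int len = (int) fs.getFileStatus(path).getLen();
        byte[] raw = new byte[len];
        FSDataInputStream in = fs.open(path);
        try {
            in.readFully(0, raw);
        } finally {
            in.close();
        }

        // Deserialize the byte[] back into an HDF file; the first argument is
        // only a display name for the in-memory dataset.
        NetcdfFile file = NetcdfFile.openInMemory("hdf4", raw);
        System.out.print(file.getDetailInfo());
        file.close();
    }
}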
The key here is the NetcdfFile.openInMemory method: this static method constructs an HDF file from a byte[], which is what makes the deserialization of HDF files possible. The following is the sample code of the main program:
package example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

import example.WholeFileInputFormat;
public class ReadMain {
    public boolean runJob(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // conf.set("mapred.job.tracker", Utils.JOBTRACKER);
        String rootPath = "/opt/hadoop-2.3.0/etc/hadoop/";
        conf.addResource(new Path(rootPath + "yarn-site.xml"));
        conf.addResource(new Path(rootPath + "core-site.xml"));
        conf.addResource(new Path(rootPath + "hdfs-site.xml"));
        conf.addResource(new Path(rootPath + "mapred-site.xml"));
        Job job = new Job(conf);
        job.setJobName("Job name:" + args[0]);
        job.setJarByClass(ReadMain.class);
        job.setMapperClass(ReadMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        boolean flag = job.waitForCompletion(true);
        return flag;
    }

    public static void main(String[] args) throws ClassNotFoundException,
            IOException, InterruptedException {
        String[] inputPaths = new String[] { "normalizeJob",
                "hdfs://192.168.168.101:9000/user/hduser/hdf/MOD13A3.A2005274.h00v10.005.2008079143041.hdf",
                "hdfs://192.168.168.101:9000/user/hduser/test/" };
        ReadMain test = new ReadMain();
        test.runJob(inputPaths);
    }
}
There are several points worth noting about the MapReduce main program:
1. The input format of the MapReduce job is WholeFileInputFormat.class, which means the input data is not split. For details about this format, refer to another article: How to submit a Yarn computing task through a Java program. A minimal sketch of such an input format is also given after this list.
2. I use Yarn 2.3.0 to execute the computing task. If you use an earlier version of Hadoop, such as 1.2.0, you can delete the conf.addResource lines in the main program above.
3. In the above MapReduce program, only the Map function is used, and the Reduce function is not set.
4. The above program uses data in the HDF4 format. In principle, data in the HDF5 format should be supported as well, since the netcdf library reads both.
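The WholeFileInputFormat class itself is not listed in this post (see the article referenced in point 1). As a rough guide only, a minimal whole-file input format matching the Mapper's <Text, BytesWritable> signature could look like the sketch below; it follows the common pattern of emitting the file name as the key and the raw file bytes as the value, and the author's actual class may differ in detail.

package example;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        // Never split: each HDF file must reach a single map task intact.
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new WholeFileRecordReader();
    }

    // Reads one entire file as a single (file name, file bytes) record.
    public static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit fileSplit;
        private TaskAttemptContext context;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = fileSplit.getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            byte[] contents = new byte[(int) fileSplit.getLength()];
            FSDataInputStream in = fs.open(path);
            try {
                in.readFully(0, contents);
            } finally {
                in.close();
            }
            key.set(path.getName());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}

Because isSplitable returns false, each HDF file is delivered whole to a single map task, which is what allows NetcdfFile.openInMemory in the Mapper to reconstruct it from the byte array.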