Processing HDF Files with Hadoop


1. Preface

HDF files are a common data format in remote sensing applications. Because of their highly structured nature, I was troubled for a long time by how to process HDF files with Hadoop, and Googling for solutions did not turn up an ideal one. I also referred to a post officially published by the HDF Group (the website is here), which describes approaches for processing large, medium, and small HDF files with Hadoop. Although the approach illustrated in that post would certainly solve the problem, I personally feel the method is complicated and requires a deep understanding of the HDF data format, so it is not easy to implement. As a result, I kept looking and finally found another method, which is described in detail below.

2. MapReduce main program

Here we mainly use the netcdf library to deserialize the HDF byte stream (the library is available from the netcdf project). Unlike the Java library officially provided by the HDF Group, netcdf uses only Java to read and write HDF files, and it supports multiple scientific data formats, including HDF4 and HDF5. The official HDF Java library, by contrast, still relies on C underneath for HDF file operations.
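Before wiring the library into MapReduce, it is worth confirming locally that it can open an HDF4 granule at all. Below is a minimal standalone sketch (the class name and the local file path are placeholders, not part of the actual job) that opens a file and lists its variables:

package example;

import java.io.IOException;
import java.util.List;

import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

// Quick local check that the netcdf library can read an HDF4 file.
// The file path below is only a placeholder.
public class ListHdfVariables {
    public static void main(String[] args) throws IOException {
        NetcdfFile file = NetcdfFile.open("/data/MOD13A3.A2005274.h00v10.005.2008079143041.hdf");
        try {
            List<Variable> vars = file.getVariables();
            for (Variable v : vars) {
                // Print each variable's name and its dimensions
                System.out.println(v.getFullName() + " " + v.getDimensionsString());
            }
        } finally {
            file.close();
        }
    }
}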


Below is the Mapper code of the MapReduce job:

package example;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import ucar.ma2.ArrayShort;
import ucar.nc2.Dimension;
import ucar.nc2.Group;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class ReadMapper extends
        Mapper<Text, BytesWritable, Text, BytesWritable> {

    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        String fileName = key.toString();
        // Deserialize the raw bytes of the HDF4 file back into a NetcdfFile
        NetcdfFile file = NetcdfFile.openInMemory("hdf4", value.get());
        Group dataGroup = file.findGroup("MOD_Grid_monthly_1km_VI").findGroup("Data_Fields");
        short[][] data = new short[1200][1200];
        if (dataGroup != null) {
            // Read the variable 1_km_monthly_red_reflectance
            Variable redVar = dataGroup.findVariable("1_km_monthly_red_reflectance");
            // Read the image data in redVar
            ArrayShort.D2 dataArray = (ArrayShort.D2) redVar.read();
            List<Dimension> dimList = file.getDimensions();
            // Number of pixels in the y direction of the image
            Dimension ydim = dimList.get(0);
            // Number of pixels in the x direction of the image
            Dimension xdim = dimList.get(1);
            // Traverse the entire image and read the pixel values
            for (int i = 0; i < xdim.getLength(); i++) {
                for (int j = 0; j < ydim.getLength(); j++) {
                    data[i][j] = dataArray.get(i, j);
                }
            }
        }
        System.out.print(file.getDetailInfo());
    }
}

Note the NetcdfFile.openInMemory method in the program. This static method constructs a NetcdfFile from a byte[], which is how the HDF byte stream is deserialized back into an HDF file.
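The same deserialization step can be verified outside Hadoop with a short standalone test such as the sketch below (the class name and the local file path are placeholders):

package example;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import ucar.nc2.NetcdfFile;

// Standalone check of the deserialization used in the Mapper: read an HDF4
// file into a byte[] and reconstruct a NetcdfFile from it in memory.
// The local file path is only a placeholder.
public class OpenInMemoryTest {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(
                Paths.get("/data/MOD13A3.A2005274.h00v10.005.2008079143041.hdf"));
        NetcdfFile file = NetcdfFile.openInMemory("hdf4", bytes);
        try {
            // Same detail dump as in the Mapper
            System.out.print(file.getDetailInfo());
        } finally {
            file.close();
        }
    }
}

The following is the sample code of the main program: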

package example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ReadMain {
    public boolean runJob(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // conf.set("mapred.job.tracker", Utils.JOBTRACKER);
        String rootPath = "/opt/hadoop-2.3.0/etc/hadoop/";
        conf.addResource(new Path(rootPath + "yarn-site.xml"));
        conf.addResource(new Path(rootPath + "core-site.xml"));
        conf.addResource(new Path(rootPath + "hdfs-site.xml"));
        conf.addResource(new Path(rootPath + "mapred-site.xml"));
        Job job = new Job(conf);

        job.setJobName("Job name:" + args[0]);
        job.setJarByClass(ReadMain.class);

        job.setMapperClass(ReadMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);

        // The whole-file input format hands each HDF file to the Mapper unsplit
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        boolean flag = job.waitForCompletion(true);
        return flag;
    }

    public static void main(String[] args) throws ClassNotFoundException,
            IOException, InterruptedException {
        String[] inputPaths = new String[] { "normalizeJob",
                "hdfs://192.168.168.101:9000/user/hduser/hdf/MOD13A3.A2005274.h00v10.005.2008079143041.hdf",
                "hdfs://192.168.168.101:9000/user/hduser/test/" };
        ReadMain test = new ReadMain();
        test.runJob(inputPaths);
    }
}

There are several points worth noting about the MapReduce main program:

1. The input format of the MapReduce job is WholeFileInputFormat.class, which means input files are not split: each map task receives one complete HDF file. For details about this format, refer to another article: How to submit a Yarn computing task through a Java program. A minimal sketch of such an input format is given after this list.

2. I use Yarn 2.3.0 to execute the computing task. If you use an earlier version of Hadoop, such as 1.2.0, you can delete the conf.addResource lines in the main program above.

3. The above MapReduce program uses only the Map function; no Reduce function is set.

4. The above program uses data in the HDF4 format. In principle, data in the HDF5 format should also be supported.
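Hadoop does not ship a WholeFileInputFormat, so one has to be written. The sketch below follows the common pattern of disabling splits and reading each file into a single BytesWritable record keyed by the file name; it matches the <Text, BytesWritable> key/value types used by ReadMapper, but it is only one possible implementation, not necessarily the exact class referenced above.

package example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits one record per file: the key is the file name and the value is the
// file's entire contents, so each Mapper call receives a complete HDF file.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // never split an HDF file
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new WholeFileRecordReader();
    }

    private static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path file = split.getPath();
            key.set(file.getName());
            // Read the whole file into a byte[] sized to the split length
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}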

