esProc helps Java process diverse data sources: HDFS

Source: Internet
Author: User

Java can access HDFS easily enough through the APIs that Hadoop provides, but computing over the files stored there is cumbersome. Operations such as grouping, filtering, and sorting are complex to implement in Java. esProc helps Java with these computations and also encapsulates HDFS access, so structured and semi-structured data in HDFS files can be computed with little effort. The following example shows how.

The employee data is stored in HDFS in the text file employee.gz. We want to read the employee records and find the employees born on or after January 1, 1981. The file is compressed in HDFS and cannot be loaded into memory all at once.

The data in the text file employee.gz looks like this:

EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY
1 Rebecca Moore F California 1974-11-20 2005-03-11 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000
3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000
4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5 Ashley Smith F Texas 1975-05-13 2004-07-30 16000
6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000
7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000
10 Ryan Johnson M Pennsylvania 1976-03-12 2006-03-12 13000
11 Jacob Moore M Texas 1974-12-16 2004-12-16 Sales 12000
12 Jessica Davis F New York 1980-09-11 2008-09-11 Sales 7000
13 Daniel Davis M Florida 1982-05-14 2010-05-14 Finance 10000
...
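
For comparison, here is a minimal sketch of what the same filter involves in plain Java: stream the gzip file line by line so it never has to fit in memory, split each line on tabs, and compare the BIRTHDAY column with a cutoff date. The class and method names are hypothetical, and an in-memory byte array stands in for the HDFS file. In a real program the InputStream would come from Hadoop, for example from FileSystem.open(Path).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of esProc or Hadoop.
class GzipEmployeeFilter {

    // Streams a gzip-compressed, tab-delimited file line by line and returns
    // the rows whose BIRTHDAY column is on or after the cutoff date.
    static List<String[]> bornOnOrAfter(InputStream gzipped, LocalDate cutoff) {
        List<String[]> matches = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String header = reader.readLine();
            if (header == null) {
                return matches; // empty file
            }
            int birthdayCol = Arrays.asList(header.split("\t")).indexOf("BIRTHDAY");
            if (birthdayCol < 0) {
                throw new IllegalArgumentException("No BIRTHDAY column in header");
            }
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (!LocalDate.parse(fields[birthdayCol]).isBefore(cutoff)) {
                    matches.add(fields);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    // Demo helper: gzip a string in memory (stands in for the file in HDFS).
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Two rows from the sample data above, tab-delimited.
        String sample = "EID\tNAME\tSURNAME\tGENDER\tSTATE\tBIRTHDAY\tHIREDATE\tDEPT\tSALARY\n"
                + "2\tAshley\tWilson\tF\tNew York\t1980-07-19\t2008-03-16\tFinance\t11000\n"
                + "4\tEmily\tSmith\tF\tTexas\t1985-03-07\t2006-08-15\tHR\t7000\n";
        List<String[]> rows = bornOnOrAfter(
                new ByteArrayInputStream(gzip(sample)), LocalDate.of(1981, 1, 1));
        for (String[] row : rows) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

Even this single hard-coded filter needs its own parsing and type handling, and every new condition means a code change. That is the part the esProc script takes over.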

The approach is to have a Java program call an esProc script, which reads and computes the data and then returns the result to the Java program as a ResultSet.

First, write and debug the script in esProc's integrated development environment. To prepare, copy the Hadoop core and configuration jars into "esProc installation directory\esproc\lib", for example commons-configuration-1.6.jar, commons-lang-2.4.jar, and hadoop-core-1.0.4.jar (for Hadoop 1.0.4).

Because esProc supports dynamic expression parsing and evaluation, a Java program can filter the data in an HDFS file as flexibly as it would with SQL. For example, to query female employees born on or after January 1, 1981, the esProc program can take an input parameter "where" from outside as the condition:

where is a string with the value: BIRTHDAY>=date(1981,1,1) && GENDER=="F".

The esProc script consists of three cells:

A1: Defines a cursor over the HDFS file object. The first row holds the column titles, and the field delimiter defaults to tab. The compression method is determined by the file extension, here gzip; esProc supports other compression methods as well. UTF-8 is the character set; if none is given, the JVM's default character set is used.

A2: Filters the cursor by the condition. A macro is used here to parse the expression dynamically, and where is the passed-in parameter. esProc first evaluates the expression inside ${...}, substitutes the result as a macro string in place of ${...}, and then interprets and executes the resulting statement. In this example, the statement finally executed is: =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F").

A3: Returns the cursor.
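
The macro step in A2 amounts to textual substitution before execution. The following Java sketch imitates only that substitution, under the simplifying assumption that the content of ${...} is a bare parameter name. It is not how esProc is implemented, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of macro expansion; not esProc code.
class MacroSketch {
    // Matches ${name} where name is a simple identifier.
    private static final Pattern MACRO = Pattern.compile("\\$\\{(\\w+)\\}");

    // Replaces every ${name} in the template with the parameter's string value.
    static String expand(String template, Map<String, String> params) {
        Matcher m = MACRO.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for parameter: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String where = "BIRTHDAY>=date(1981,1,1) && GENDER==\"F\"";
        // prints: =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F")
        System.out.println(expand("=A1.select(${where})", Map.of("where", where)));
    }
}
```

The expanded text is what then gets interpreted, which is why a new condition needs no change to the script itself.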

To change the filter condition, change only the where parameter; the code stays the same. For example, suppose the condition becomes: female employees born on or after January 1, 1981, or employees whose NAME+SURNAME equals "RebeccaMoore". The where parameter value can then be written as: BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore".
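
On the Java side, condition strings like this are easy to get wrong when the quotes are escaped by hand. The sketch below shows one possible way to assemble them. WhereBuilder and its methods are hypothetical helpers, not part of esProc. It adds explicit parentheses around each combined condition, on the assumption that esProc accepts grouping parentheses in expressions, and it rejects values containing double quotes rather than guess at an escaping rule.

```java
// Hypothetical helper for composing filter strings; not part of esProc.
class WhereBuilder {

    // Wraps a value in double quotes for use as a string literal in the condition.
    static String str(String value) {
        if (value.indexOf('"') >= 0) {
            throw new IllegalArgumentException("Embedded quotes are not handled in this sketch");
        }
        return "\"" + value + "\"";
    }

    // Combines two conditions with logical AND, parenthesized to make grouping explicit.
    static String and(String left, String right) {
        return "(" + left + " && " + right + ")";
    }

    // Combines two conditions with logical OR, parenthesized to make grouping explicit.
    static String or(String left, String right) {
        return "(" + left + " || " + right + ")";
    }

    public static void main(String[] args) {
        String where = or(
                and("BIRTHDAY>=date(1981,1,1)", "GENDER==" + str("F")),
                "NAME+SURNAME==" + str("RebeccaMoore"));
        // prints: ((BIRTHDAY>=date(1981,1,1) && GENDER=="F") || NAME+SURNAME=="RebeccaMoore")
        System.out.println(where);
    }
}
```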

The Java code that invokes this program through esProc JDBC is as follows (save the esProc program as test.dfx and put the Hadoop jars that HDFS access requires on the Java classpath):
// Uses java.sql.Connection, java.sql.DriverManager, and java.sql.ResultSet
// Establish an esProc JDBC connection
Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
// Call the esProc program as a stored procedure; test is the name of the DFX file
com.esproc.jdbc.InternalCStatement st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call test(?)");
// Set the parameter: the dynamic filter condition
st.setObject(1, "BIRTHDAY>=date(1981,1,1) && GENDER==\"F\" || NAME+SURNAME==\"RebeccaMoore\"");
// Execute the esProc stored procedure
st.execute();
// Get the result set: the collection of matching employees
ResultSet set = st.getResultSet();
