esProc assists Java with grouping and aggregating structured text

Implementing grouping and aggregation over a text file directly in Java runs into the following problems:

1. A file is not a database and cannot be queried with SQL. When the grouping or aggregation expressions change, the code has to be rewritten. Supporting flexible expressions requires implementing dynamic expression parsing and evaluation yourself, which is a great deal of programming work.

2. The grouping results have to be recorded while the data is traversed. A small result can be kept in memory, but a result too large for memory must be cached in temporary files and then merged, which is very complex to implement.
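
To make the first problem concrete, here is a minimal sketch of hand-written grouping in plain Java. The class name, the method name, and the assumption that every row carries all nine tab-separated columns are illustrative only. The grouping key (DEPT) and the aggregates (count and total SALARY) are fixed in the code, so changing either one means editing and recompiling.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HardCodedGrouping {
    /** Groups tab-separated rows by the DEPT column and returns {count, total salary} per group. */
    static Map<String, long[]> groupByDept(List<String> rows) {
        Map<String, long[]> result = new LinkedHashMap<>();
        for (String row : rows) {
            String[] f = row.split("\t");
            String dept = f[7];                 // DEPT is the 8th column
            long salary = Long.parseLong(f[8]); // SALARY is the 9th column
            long[] agg = result.computeIfAbsent(dept, k -> new long[2]);
            agg[0]++;          // count
            agg[1] += salary;  // sum of SALARY
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
            "2\tAshley\tWilson\tF\tNew York\t1980-07-19\t2008-03-16\tFinance\t11000",
            "3\tRachel\tJohnson\tF\tNew Mexico\t1970-12-17\t2010-12-01\tSales\t9000",
            "6\tMatthew\tJohnson\tM\tCalifornia\t1984-07-07\t2005-07-07\tSales\t11000");
        groupByDept(rows).forEach((dept, agg) ->
            System.out.println(dept + "\t" + agg[0] + "\t" + agg[1]));
    }
}
```

Grouping by a second field, or adding another aggregate, would require changing both the key construction and the accumulator layout in this method.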

With esProc assisting the Java program, ready-made class libraries take care of these problems. Let's look at a concrete example.

The text file Employee.txt holds employee information. We want to group by DEPT and find the employee count and the total salary for each group.

The format of the text file Employee.txt is as follows:

EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY
1 Rebecca Moore F California 1974-11-20 2005-03-11 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000
3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000
4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5 Ashley Smith F Texas 1975-05-13 2004-07-30 16000
6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000
7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000
10 Ryan Johnson M Pennsylvania 1976-03-12 2006-03-12 13000
11 Jacob Moore M Texas 1974-12-16 2004-12-16 Sales 12000
12 Jessica Davis F New York 1980-09-11 2008-09-11 Sales 7000
13 Daniel Davis M Florida 1982-05-14 2010-05-14 Finance 10000
...

The idea is for the Java program to call an esProc script that reads and computes the data, then returns the result to the Java program as a ResultSet. Because esProc supports dynamic expression parsing and evaluation, the Java program can work with the data in the text file as flexibly as it would with SQL.

For example, to group by DEPT and find the employee count and total salary for each group, the esProc program can receive an input parameter "groupBy" from outside as the dynamic grouping and aggregation condition.

With "groupBy" set to DEPT:dept;count(~):count,sum(SALARY):salary, the esProc script consists of the following cells:

A1: Defines a file cursor object. The first row holds the column headers, and the field delimiter is tab by default. The esProc integrated development environment can display the imported data visually.

A2: Groups and aggregates by the specified fields. A macro implements the dynamic expression, where groupBy is the passed-in parameter. esProc first evaluates the expression inside ${...}, substitutes the result for ${...} as the macro string, and then interprets and executes the resulting statement. In this example, the statement finally executed is: =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary).

A3: Returns the result set to the external program.
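
The macro step described for A2 can be pictured as plain string substitution followed by evaluation. The sketch below covers only the substitution, and only for the simple case where the macro body is a parameter name. `expandMacros` and the parameter map are hypothetical helpers written for this illustration, not part of esProc.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MacroSketch {
    // Matches ${name}, where name is a simple parameter name (an assumption of this sketch)
    private static final Pattern MACRO = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Replaces each ${name} in the template with the value of the parameter called name. */
    static String expandMacros(String template, Map<String, String> params) {
        Matcher m = MACRO.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Unknown parameter: " + m.group(1));
            }
            // quoteReplacement stops '$' and '\' in the value from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String cell = "=A1.groups(${groupBy})";
        String groupBy = "DEPT:dept;count(~):count,sum(SALARY):salary";
        System.out.println(expandMacros(cell, Map.of("groupBy", groupBy)));
    }
}
```

The expanded string is what then gets interpreted, which is why changing the parameter changes the statement without touching the script.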

To change the grouping fields, there is no need to change the code; just change the groupBy parameter. For example, to group by the two fields DEPT and GENDER and find the employee count and total salary for each group, the groupBy parameter value can be written as: DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary.
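
On the Java side, the parameter value can be assembled from a list of grouping fields instead of being typed by hand. This is a minimal sketch: `buildGroupBy` is a hypothetical helper, the aggregates are fixed to count and total SALARY, and the lower-cased alias after each colon simply follows the convention used in this article's examples.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class GroupByParam {
    /** Builds a value such as "DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary". */
    static String buildGroupBy(List<String> groupFields) {
        String groups = groupFields.stream()
            .map(f -> f + ":" + f.toLowerCase(Locale.ROOT)) // FIELD:alias
            .collect(Collectors.joining(","));
        // The part before ';' lists the grouping fields, the part after it the aggregates
        return groups + ";count(~):count,sum(SALARY):salary";
    }

    public static void main(String[] args) {
        System.out.println(buildGroupBy(List.of("DEPT")));
        System.out.println(buildGroupBy(List.of("DEPT", "GENDER")));
    }
}
```

The returned string is what would be passed as the single parameter of the esProc call.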

A plain aggregation over all the data can be treated as a special case of grouping. For example, to find the total number of employees and the total salary, write the groupBy parameter value as: ;count(~):count,sum(SALARY):salary. The grouping part is left empty, which is equivalent to putting all the data into a single group. The benefit is that several summary values can be computed in one pass over the data.
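
The same idea can be shown in plain Java: one grouping routine serves both cases, and a constant key stands in for the empty grouping part. This is only an analogy under assumed names (`SingleGroupSketch`, `Emp`, `aggregate`); it does not show how esProc evaluates the expression.

```java
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class SingleGroupSketch {
    static final class Emp {
        final String dept;
        final long salary;
        Emp(String dept, long salary) { this.dept = dept; this.salary = salary; }
    }

    /** Groups by the given key and collects count and sum of salary in a single pass. */
    static Map<String, LongSummaryStatistics> aggregate(List<Emp> emps, Function<Emp, String> key) {
        return emps.stream().collect(
            Collectors.groupingBy(key, Collectors.summarizingLong((Emp e) -> e.salary)));
    }

    public static void main(String[] args) {
        List<Emp> emps = List.of(
            new Emp("Finance", 11000), new Emp("Sales", 9000), new Emp("HR", 7000));
        // A constant key puts every row into the same group
        LongSummaryStatistics all = aggregate(emps, e -> "").get("");
        System.out.println(all.getCount() + "\t" + all.getSum());
    }
}
```

Passing `e -> e.dept` instead of the constant key gives the per-department figures from the same method.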

In the Java program, this esProc program (saved as the file test.dfx) is called through esProc JDBC, as follows:

    // Requires java.sql.Connection, java.sql.DriverManager and java.sql.ResultSet
    // Establish the esProc JDBC connection
    Class.forName("com.esproc.jdbc.InternalDriver");
    Connection con = DriverManager.getConnection("jdbc:esproc:local://");
    // Call the esProc program like a stored procedure; test is the name of the dfx file
    com.esproc.jdbc.InternalCStatement st;
    st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call test(?)");
    // Set the parameter: the dynamic grouping and aggregation expression
    st.setObject(1, "DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary");
    // Execute the esProc stored procedure
    st.execute();
    // Get the result set
    ResultSet set = st.getResultSet();

For a script this simple, the expression can also be written directly in the Java program that calls esProc JDBC, with no separate script file such as test.dfx:

    st = (com.esproc.jdbc.InternalCStatement) con.createStatement();
    ResultSet set = st.executeQuery("=file(\"d:/employee.txt\").cursor@t().groups(DEPT:dept,"
        + "GENDER:gender;count(~):count,sum(SALARY):salary)");

This Java code invokes a one-line esProc script directly: it reads the data from the text file, groups and aggregates it, and returns the result to the ResultSet object set.

If the grouped result set is too large to fit in memory, the groupx function can be used instead; it returns the grouped result as a file cursor. The esProc script is adjusted to call groupx in place of groups, passing a buffer row count of 1000000.

The groups function keeps the complete aggregated result in memory, whereas groupx writes the partial result to a temporary file and reuses the memory whenever the number of result rows exceeds the buffer row count. It then merges the temporary files to produce the final result. The parameter 1000000 here is the buffer row count. The principle is to use as much memory as possible and so keep the number of temporary files to a minimum. The right value depends on the size of physical memory and the size of each record, so it has to be estimated when programming; values from several hundred thousand to several million are generally recommended.
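
The buffer, spill, and merge cycle can be sketched in plain Java. This illustrates the general technique only and is not esProc's implementation; the class and method names are hypothetical, the input is reduced to {key, salary} pairs, and keys are assumed to contain no tab characters.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;
import java.util.function.BiConsumer;

class SpillingGroupSketch {
    /** Groups {key, salary} rows into key -> {count, sum}, keeping at most bufferRows groups in memory. */
    static void group(Iterable<String[]> rows, int bufferRows, BiConsumer<String, long[]> out)
            throws IOException {
        TreeMap<String, long[]> buffer = new TreeMap<>();
        List<Path> spills = new ArrayList<>();
        for (String[] row : rows) {
            if (!buffer.containsKey(row[0]) && buffer.size() >= bufferRows) {
                spills.add(spill(buffer)); // buffer full: write it out and reuse the memory
            }
            long[] agg = buffer.computeIfAbsent(row[0], k -> new long[2]);
            agg[0]++;                          // count
            agg[1] += Long.parseLong(row[1]);  // sum
        }
        if (spills.isEmpty()) {
            buffer.forEach(out); // everything fit in memory, no merge needed
            return;
        }
        spills.add(spill(buffer));
        merge(spills, out);
    }

    /** Writes the buffered groups to a temporary file in key order, then empties the buffer. */
    private static Path spill(TreeMap<String, long[]> buffer) throws IOException {
        Path tmp = Files.createTempFile("group-sketch", ".tmp");
        try (BufferedWriter w = Files.newBufferedWriter(tmp)) {
            for (Map.Entry<String, long[]> e : buffer.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue()[0] + "\t" + e.getValue()[1]);
                w.newLine();
            }
        }
        buffer.clear();
        return tmp;
    }

    /** Merges the sorted spill files, adding up partial aggregates that share a key. */
    private static void merge(List<Path> spills, BiConsumer<String, long[]> out) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        // Each queue entry is {key, count, sum, readerIndex}; the smallest key comes out first
        PriorityQueue<String[]> heads = new PriorityQueue<>((a, b) -> a[0].compareTo(b[0]));
        try {
            for (Path p : spills) {
                readers.add(Files.newBufferedReader(p));
                advance(readers, readers.size() - 1, heads);
            }
            String key = null;
            long[] agg = null;
            while (!heads.isEmpty()) {
                String[] h = heads.poll();
                if (!h[0].equals(key)) {
                    if (key != null) out.accept(key, agg);
                    key = h[0];
                    agg = new long[2];
                }
                agg[0] += Long.parseLong(h[1]);
                agg[1] += Long.parseLong(h[2]);
                advance(readers, Integer.parseInt(h[3]), heads);
            }
            if (key != null) out.accept(key, agg);
        } finally {
            for (BufferedReader r : readers) r.close();
            for (Path p : spills) Files.deleteIfExists(p);
        }
    }

    /** Reads the next line of spill file i, if any, and queues it for merging. */
    private static void advance(List<BufferedReader> readers, int i, PriorityQueue<String[]> heads)
            throws IOException {
        String line = readers.get(i).readLine();
        if (line != null) {
            String[] f = line.split("\t");
            heads.add(new String[] {f[0], f[1], f[2], String.valueOf(i)});
        }
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = List.of(
            new String[] {"Finance", "11000"}, new String[] {"Sales", "9000"},
            new String[] {"HR", "7000"}, new String[] {"Sales", "11000"},
            new String[] {"Sales", "9000"}, new String[] {"Marketing", "11000"},
            new String[] {"HR", "3000"});
        // A buffer of 2 groups forces spilling on this tiny sample
        group(rows, 2, (dept, agg) -> System.out.println(dept + "\t" + agg[0] + "\t" + agg[1]));
    }
}
```

A larger buffer means fewer spill files and less merging, which is the trade-off behind choosing the buffer row count. During the merge only one pending row per spill file is held in memory.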

Although cell A3 now returns a cursor to Java instead of a result set, the calling Java program needs no change. When Java traverses the data through the ResultSet, esProc reads the contents of the cursor automatically.

This program can be extended to filter both before and after grouping, similar to WHERE and HAVING in SQL. For example, restrict the statistics to female employees (GENDER=="F") and, after grouping and aggregation, keep only the departments with more than 10 female employees. The script does this with a select call before grouping and another after it.

For ease of understanding, the grouping expression is written out in that script rather than passed as a parameter; it could be written as a macro exactly as before: A2.groupx(${groupBy}). The conditions of the select function can likewise be written as macros and passed in from the Java program.
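
For comparison, the two filtering stages look like this in plain Java streams. The sketch is an analogy for the WHERE and HAVING roles, not the esProc script; `WhereHavingSketch`, `Emp`, and `countByDept` are hypothetical names, and the threshold in `main` is lowered to 1 only because the sample is tiny.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WhereHavingSketch {
    static final class Emp {
        final String gender;
        final String dept;
        Emp(String gender, String dept) { this.gender = gender; this.dept = dept; }
    }

    /** Counts employees of one gender per department and keeps departments with more than minCount. */
    static Map<String, Long> countByDept(List<Emp> emps, String gender, long minCount) {
        Map<String, Long> counts = emps.stream()
            .filter(e -> e.gender.equals(gender))          // filter before grouping (WHERE)
            .collect(Collectors.groupingBy((Emp e) -> e.dept, TreeMap::new, Collectors.counting()));
        counts.values().removeIf(c -> c <= minCount);      // filter after grouping (HAVING)
        return counts;
    }

    public static void main(String[] args) {
        List<Emp> emps = List.of(
            new Emp("F", "Finance"), new Emp("F", "Sales"), new Emp("F", "HR"),
            new Emp("M", "Sales"), new Emp("F", "Sales"), new Emp("F", "Marketing"),
            new Emp("F", "HR"));
        System.out.println(countByDept(emps, "F", 1));
    }
}
```

Here both conditions are compiled into the method, whereas the macro approach lets the calling program supply them as strings at run time.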
