esProc assists Java in grouping and summarizing structured text files
Grouping and summarizing text files in Java runs into the following difficulties:
1. Files are not databases and cannot be queried with SQL. When the grouping or summary expressions change, the code has to be rewritten. To support flexible expressions, you must parse and evaluate dynamic expressions yourself, which is a heavy programming workload.
2. The group results have to be recorded during the traversal. If the results are small, they can be kept in memory. If they are too large, the intermediate results must be cached in temporary files and then merged, which is very complicated to implement.
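To make the first difficulty concrete, here is a minimal sketch of hand-written grouping in plain Java. The class name ManualGroupBy and the reduced column set in main are hypothetical and used only for illustration; the sample values are taken from rows 1, 2 and 5 of the example file shown later. Note that the grouping column (DEPT) and the two aggregates are fixed in the code, so any change to the grouping means editing and recompiling it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hand-written grouping: counts rows and sums SALARY per DEPT for tab-separated lines. */
final class ManualGroupBy {

    /** Returns DEPT -> {count, total salary}; the first line must be the header. */
    static Map<String, long[]> countAndSumByDept(List<String> lines) {
        String[] header = lines.get(0).split("\t", -1);
        int deptCol = indexOf(header, "DEPT");
        int salaryCol = indexOf(header, "SALARY");
        Map<String, long[]> groups = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split("\t", -1);
            long[] acc = groups.computeIfAbsent(fields[deptCol], k -> new long[2]);
            acc[0]++;                                     // number of employees
            acc[1] += Long.parseLong(fields[salaryCol]);  // total salary
        }
        return groups;
    }

    private static int indexOf(String[] header, String name) {
        for (int i = 0; i < header.length; i++) {
            if (header[i].equals(name)) {
                return i;
            }
        }
        throw new IllegalArgumentException("No such column: " + name);
    }

    public static void main(String[] args) {
        // Reduced, hypothetical column set; values come from the example rows.
        List<String> lines = List.of(
            "EID\tNAME\tDEPT\tSALARY",
            "1\tRebecca\tR&D\t7000",
            "2\tAshley\tFinance\t11000",
            "5\tAshley\tR&D\t16000");
        countAndSumByDept(lines).forEach((dept, acc) ->
            System.out.println(dept + "\t" + acc[0] + "\t" + acc[1]));
    }
}
```

Grouping by a second field, or adding another aggregate, would require changing the map key and the accumulator layout, which is exactly the rewriting the first point describes.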
With esProc assisting the Java program, these problems can be solved with ready-made class libraries. The following examples show the specific practice.
Employee information is stored in the text file employee.txt. We need to group the records by DEPT and obtain the number of employees and the total SALARY of each group.
The format of the example file employee.txt is as follows:
EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY
1 Rebecca Moore F California 1974-11-20 2005-03-11 R & D 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000
3 Rachel Johnson F New Mexico 1970-12-17 Sales 9000
4 Emily Smith F Texas 1985-03-07 HR 7000
5 Ashley Smith F Texas 1975-05-13 2004-07-30 R & D 16000
6 Matthew Johnson M California 1984-07-07 Sales 11000
7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000
10 Ryan Johnson M Pennsylvania 1976-03-12 2006-03-12 R & D 13000
11 Jacob Moore M Texas 1974-12-16 2004-12-16 Sales 12000
12 Jessica Davis F New York Sales 7000
13 Daniel Davis M Florida 1982-05-14 2010-05-14 Finance 10000
...
The approach is as follows: the Java program calls an esProc script, which reads and computes the data and then returns the result to the Java program as a ResultSet. Because esProc supports parsing and evaluating dynamic expressions, the Java program can process data in text files as flexibly as if it were using SQL.
For example, we need to group the employees by DEPT and find the number of employees (COUNT) and the total SALARY of each group. The esProc program can receive an external input parameter "groupBy" as the dynamic grouping and summarizing condition, for example:
The value of "groupBy" is: DEPT:dept;count(~):count,sum(SALARY):salary. The esProc code is as follows:
A1: defines a file cursor object. The first line of the file is the title row, and the field separator is tab by default. esProc's integrated development environment can display the imported data intuitively, as shown on the right.
A2: groups and summarizes by the specified fields. A macro is used here to parse the expression dynamically, and groupBy is the input parameter. esProc first evaluates the expression inside ${…}, substitutes the resulting macro string for ${…}, and then interprets and executes the statement. In this example, the statement finally executed is: =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary).
A3: returns the result set to the external program.
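The macro replacement described for A2 can be pictured as plain text substitution. The sketch below is only an illustration written in Java, not esProc's implementation: the class MacroSketch and its expand method are hypothetical, and the macro body is assumed to be a bare parameter name, whereas esProc evaluates an expression inside ${…}.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrates macro expansion: ${name} is replaced by the parameter's string value. */
final class MacroSketch {

    // Matches ${name}, where name is a simple identifier (a simplifying assumption).
    private static final Pattern MACRO = Pattern.compile("\\$\\{(\\w+)\\}");

    static String expand(String cell, Map<String, String> params) {
        Matcher m = MACRO.matcher(cell);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Unknown parameter: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value from being treated specially
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String cell = "=A1.groups(${groupBy})";
        Map<String, String> params =
            Map.of("groupBy", "DEPT:dept;count(~):count,sum(SALARY):salary");
        // prints =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary)
        System.out.println(expand(cell, params));
    }
}
```

Only after this substitution is the resulting text interpreted as an expression, which is why the grouping can change without touching the script.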
The code does not need to change when the grouping fields change. You only need to change the groupBy parameter. For example, to group by the DEPT and GENDER fields and compute the number of employees and the total salary of each group, the groupBy parameter can be written as: DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary.
A simple summary over all the data can be seen as a special case of grouping and summarizing. For example, to compute the total number of employees and the total salary, set the groupBy parameter to: ;count(~):count,sum(SALARY):salary. That is, the grouping part is left empty, which is equivalent to putting all the data into a single group. The advantage is that several summary values of the same data can be computed in one traversal.
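On the Java side, the groupBy value is just a string with the grouping terms before the semicolon and the summary terms after it. A small helper such as the following could assemble it. This is a sketch only: the class GroupByParam and its build method are hypothetical names and not part of esProc.

```java
import java.util.List;

/** Assembles a groupBy parameter string of the form "grouping terms;summary terms". */
final class GroupByParam {

    static String build(List<String> groupTerms, List<String> summaryTerms) {
        // An empty grouping list yields a leading ';', i.e. one group over all the data.
        return String.join(",", groupTerms) + ";" + String.join(",", summaryTerms);
    }

    public static void main(String[] args) {
        List<String> summaries = List.of("count(~):count", "sum(SALARY):salary");
        System.out.println(build(List.of("DEPT:dept"), summaries));
        System.out.println(build(List.of("DEPT:dept", "GENDER:gender"), summaries));
        System.out.println(build(List.of(), summaries));
    }
}
```

The three calls in main produce the three parameter values discussed above: grouping by DEPT, grouping by DEPT and GENDER, and the summary over all the data.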
The code for calling this esProc program (saved as the file test.dfx) through JDBC in a Java program is as follows:
// Establish an esProc JDBC connection
Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
// Call the esProc program (as a stored procedure); test is the name of the dfx file
com.esproc.jdbc.InternalCStatement st;
st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call test(?)");
// Set the parameter: the dynamic grouping and summarizing expression
st.setObject(1, "DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary");
// Execute the esProc stored procedure
st.execute();
// Obtain the result set
ResultSet set = st.getResultSet();
For simple scripts, you can also write the script directly in the Java program that calls the esProc JDBC driver, without writing a script file (test.dfx):
st = (com.esproc.jdbc.InternalCStatement) con.createStatement();
ResultSet set = st.executeQuery("=file(\"D:/employee.txt\").cursor@t().groups(DEPT:dept,"
    + "GENDER:gender;count(~):count,sum(SALARY):salary)");
This Java code calls an esProc script directly: the script reads the data from the text file, groups and summarizes it, and returns the result set to the ResultSet object set.
If the grouped result set is itself too large to fit in memory, use the groupx function, which returns the grouped result through a file cursor. The esProc code is adjusted as follows:
The groups function keeps the grouped summary results in memory, whereas groupx writes the results to a temporary file whenever the grouped summary result exceeds the number of buffered rows, and then reuses the memory. Finally, groupx merges the temporary files it has generated. The parameter 1000000 here is the number of buffered rows. The principle for choosing the value is to make full use of the memory so that as few temporary files as possible are created. The right value depends on the size of the physical memory and the size of each record. A value on the order of hundreds of thousands to millions of rows is generally recommended.
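The spill-and-merge behavior described for groupx can be sketched in plain Java as follows. This illustrates the general technique only and is not esProc's implementation: the class SpillingGroupBy, the maxGroups limit (standing in for the number of buffered rows) and the temporary file layout are all assumptions. The keys are assumed to contain no tab characters.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Sketch of grouping with spill files and a final merge (illustration only). */
final class SpillingGroupBy {

    /** Reader over one sorted temporary file; each line is key, count, sum (tab-separated). */
    private static final class Run {
        final BufferedReader reader;
        String key;
        long count;
        long sum;

        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file);
            advance();
        }

        /** Loads the next line; at end of file, closes the reader and returns false. */
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) {
                key = null;
                reader.close();
                return false;
            }
            String[] f = line.split("\t");
            key = f[0];
            count = Long.parseLong(f[1]);
            sum = Long.parseLong(f[2]);
            return true;
        }
    }

    /** rows are {key, amount}; each merged group is passed to sink as key and {count, sum}. */
    static void group(Iterator<String[]> rows, int maxGroups, BiConsumer<String, long[]> sink) {
        List<Path> files = new ArrayList<>();
        try {
            TreeMap<String, long[]> buffer = new TreeMap<>();
            while (rows.hasNext()) {
                String[] row = rows.next();
                long[] acc = buffer.computeIfAbsent(row[0], k -> new long[2]);
                acc[0]++;
                acc[1] += Long.parseLong(row[1]);
                if (buffer.size() >= maxGroups) {   // buffer full: spill and reuse the memory
                    files.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                files.add(spill(buffer));
            }
            // Merge the sorted files: the heap always yields the smallest pending key.
            PriorityQueue<Run> heap = new PriorityQueue<>(Comparator.comparing((Run r) -> r.key));
            for (Path file : files) {
                Run run = new Run(file);
                if (run.key != null) heap.add(run);
            }
            String current = null;
            long[] total = null;
            while (!heap.isEmpty()) {
                Run run = heap.poll();
                if (!run.key.equals(current)) {
                    if (current != null) sink.accept(current, total);
                    current = run.key;
                    total = new long[2];
                }
                total[0] += run.count;
                total[1] += run.sum;
                if (run.advance()) heap.add(run);
            }
            if (current != null) sink.accept(current, total);
            for (Path file : files) Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the buffered partial groups, sorted by key, to a new temporary file. */
    private static Path spill(TreeMap<String, long[]> buffer) throws IOException {
        Path file = Files.createTempFile("groupx-sketch", ".tsv");
        try (BufferedWriter w = Files.newBufferedWriter(file)) {
            for (Map.Entry<String, long[]> e : buffer.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue()[0] + "\t" + e.getValue()[1]);
                w.newLine();
            }
        }
        return file;
    }
}
```

Because the merged groups are handed to the sink one at a time, the full result never has to sit in memory, which mirrors the idea of returning a cursor instead of a result set. A larger maxGroups means fewer temporary files to merge, which is the trade-off behind the advice on choosing the buffer size.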
Although cell A3 now returns a cursor rather than a result set to Java, the calling Java program does not need to be modified. When Java traverses the data through the ResultSet, esProc automatically reads the content that the cursor points to.
This program can be further improved to support filtering before and after grouping, similar to where and having in SQL. For example, the statistics are restricted to female employees only (GENDER = "F"), and after grouping and summarizing, only the departments with more than 10 female employees are retained. The code is as follows:
For ease of understanding, grid parameters are not used here. In practice, the expression is written the same way as in the preceding code: A2.groupx(${groupBy}). The parameters of the select function can also be written as macros and passed in from the Java program.
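The difference between the two filters can be illustrated with Java streams. This is a sketch of the semantics only, not of the esProc script: the class WhereHaving is hypothetical, the rows are reduced to {DEPT, GENDER} pairs, and the threshold is a parameter so that the small sample (taken from the first seven example rows) produces output.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Filter before grouping (like where) and after grouping (like having). */
final class WhereHaving {

    /** rows are {DEPT, GENDER}; returns the departments with more than minCount female employees. */
    static Map<String, Long> femaleHeadcount(List<String[]> rows, long minCount) {
        Map<String, Long> counts = rows.stream()
            .filter(r -> r[1].equals("F"))                  // before grouping: keep female employees
            .collect(Collectors.groupingBy(r -> r[0], TreeMap::new, Collectors.counting()));
        counts.values().removeIf(c -> c <= minCount);       // after grouping: keep the large groups
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"R&D", "F"}, new String[] {"Finance", "F"},
            new String[] {"Sales", "F"}, new String[] {"HR", "F"},
            new String[] {"R&D", "F"}, new String[] {"Sales", "M"},
            new String[] {"Sales", "F"});
        // prints {R&D=2, Sales=2}
        System.out.println(femaleHeadcount(rows, 1));
    }
}
```

The first filter changes which records are counted, while the second only decides which finished groups are kept.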