Using the Set calculator to help Java perform set operations on structured text

Source: Internet
Author: User

Java has no built-in support for set operations on structured text, so computing the intersection, union, or difference of text files usually means writing nested loops by hand. The code becomes more complex when there are many files, when a file is too large to be processed in memory, or when the set operation is based on multiple fields. The Set calculator (esProc) supports set operations directly and can help Java implement such algorithms easily. The following example shows how.

There are two small files, f1.txt and f2.txt, whose first line holds the column names. We want to compute the intersection on the Name field of the two files. Part of the data is as follows:

File f1.txt:

File f2.txt:

Code of the Set calculator:

A1 and B1: use the import function to read the files into memory. The default delimiter is tab. The function option @t indicates that the first row is read as column names, so that Name and Dept can be used directly to reference the corresponding columns in later calculations. If the first row does not hold column names, the default column names such as _1 and _2 must be used instead.
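For comparison, reading a tab-delimited file with a header row by hand in Java might look like the sketch below. It is only an illustration of what import@t saves you from writing, not the Set calculator's implementation; the class name TabTable and its two methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: parses tab-delimited text whose first row holds the column names. */
final class TabTable {

    /** Returns one map per data row, keyed by the column names taken from the header row. */
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        String[] header = lines.get(0).split("\t", -1);
        for (String line : lines.subList(1, lines.size())) {
            if (line.isEmpty()) {
                continue; // skip blank lines, for example one left by a trailing newline
            }
            String[] cells = line.split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                // Missing trailing cells are stored as empty strings
                row.put(header[i], i < cells.length ? cells[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    /** Extracts a single column by name, roughly what A1.(Name) does in the script. */
    static List<String> column(List<Map<String, String>> rows, String name) {
        List<String> values = new ArrayList<>();
        for (Map<String, String> row : rows) {
            values.add(row.get(name));
        }
        return values;
    }
}
```

The lines can come from java.nio.file.Files.readAllLines, which loads the whole file into memory and is therefore only suitable for small files such as the ones in this example.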

After calculation, the values of A1 and B1 are as follows:

The import function can also read only specified columns. In this case only Name takes part in the calculation, so it is enough to read the Name column. The corresponding code is file("E:\f1.txt").import@t(Name).

A2: =[A1.(Name),B1.(Name)].isect(). The isect function computes the intersection of sets. A1.(Name) retrieves the Name column of A1, and B1.(Name) retrieves the Name column of B1. The final result of this case is as follows:

A3: result A2. This outputs the calculation result to the JDBC interface. A3 can be combined with A2 into a single step: result [A1.(Name),B1.(Name)].isect().

The above is the intersection process. To obtain the union, you only need to change the function: [A1.(Name),B1.(Name)].union(). The calculation result is as follows:

Code for calculating the difference: [A1.(Name),B1.(Name)].diff(). The calculation result is as follows:

There is also a special set operation, concatenation, which is a union that retains duplicate elements. The code is [A1.(Name),B1.(Name)].conj(). The calculation result is as follows:

Operators can be used in place of the functions for more concise code. Intersection, union, difference, and concatenation can be rewritten, in that order, as:

A1.(Name) ^ B1.(Name)
A1.(Name) & B1.(Name)
A1.(Name) \ B1.(Name)
A1.(Name) | B1.(Name)
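To make the four operations concrete, here is a minimal Java sketch of their meaning on two in-memory lists of names. It assumes each name appears at most once per list; how the Set calculator treats duplicates inside a single list is not stated here, so the sketch should not be read as its exact semantics. The class name ListSetOps is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of intersection, union, difference, and concatenation on lists. */
final class ListSetOps {

    /** Elements of a that also occur in b, in the order of a. */
    static List<String> isect(List<String> a, List<String> b) {
        Set<String> inB = new HashSet<>(b);
        List<String> out = new ArrayList<>();
        for (String x : a) {
            if (inB.contains(x)) {
                out.add(x);
            }
        }
        return out;
    }

    /** All elements of a, followed by the elements of b that have not been seen yet. */
    static List<String> union(List<String> a, List<String> b) {
        Set<String> seen = new HashSet<>(a);
        List<String> out = new ArrayList<>(a);
        for (String x : b) {
            if (seen.add(x)) {
                out.add(x);
            }
        }
        return out;
    }

    /** Elements of a that do not occur in b. */
    static List<String> diff(List<String> a, List<String> b) {
        Set<String> inB = new HashSet<>(b);
        List<String> out = new ArrayList<>();
        for (String x : a) {
            if (!inB.contains(x)) {
                out.add(x);
            }
        }
        return out;
    }

    /** a followed by b, with duplicates kept. */
    static List<String> conj(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }
}
```

Each method here takes several lines in Java, whereas the script expresses the same operation with a single function call or operator.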

Set operations also work on more than two files, such as f1.txt, f2.txt, and f3.txt. If the corresponding variables after reading the files into memory are A1, B1, and C1, the intersection is A1.(Name) ^ B1.(Name) ^ C1.(Name) or [A1.(Name),B1.(Name),C1.(Name)].isect().
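In plain Java, intersecting any number of lists can be sketched by folding them one at a time into a running result. The example below is an assumption-labeled illustration, not the Set calculator's algorithm; the class name MultiIsect is hypothetical, and it relies on the standard List.retainAll method.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

/** Hypothetical illustration: intersection across any number of in-memory lists. */
final class MultiIsect {

    /** Keeps the elements of the first list that occur in every other list. */
    @SafeVarargs
    static List<String> isectAll(List<String>... lists) {
        if (lists.length == 0) {
            return new ArrayList<>();
        }
        List<String> result = new ArrayList<>(lists[0]);
        for (int i = 1; i < lists.length; i++) {
            // Wrapping the next list in a HashSet makes each membership test constant time on average
            result.retainAll(new HashSet<>(lists[i]));
        }
        return result;
    }
}
```

The result keeps the order of the first list, so the order of the arguments affects the order of the output but not its members.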

Large files can slow down set operations. You can sort the data in advance with the sort function and then perform the set operation with the merge function, which significantly improves performance. The function option @i is used for intersection, @u for union, and @d for difference. The corresponding code is as follows:

=[A1.(Name).sort(),B1.(Name).sort()].merge@i()
=[A1.(Name).sort(),B1.(Name).sort()].merge@u()
=[A1.(Name).sort(),B1.(Name).sort()].merge@d()
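The reason sorting helps is that two sorted sequences can be combined in a single pass with two indexes, instead of comparing every element of one with every element of the other. The Java sketch below shows this general merge technique for the three modes. It is an illustration under assumptions, not the Set calculator's implementation: both lists must be sorted ascending by String.compareTo and contain no duplicates, and the names SortedMerge and Mode are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a single-pass merge over two sorted, duplicate-free lists. */
final class SortedMerge {

    enum Mode { INTERSECT, UNION, DIFF }

    static List<String> merge(List<String> a, List<String> b, Mode mode) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int c = a.get(i).compareTo(b.get(j));
            if (c < 0) {
                // Current element exists only in a: kept by union and difference
                if (mode != Mode.INTERSECT) {
                    out.add(a.get(i));
                }
                i++;
            } else if (c > 0) {
                // Current element exists only in b: kept by union only
                if (mode == Mode.UNION) {
                    out.add(b.get(j));
                }
                j++;
            } else {
                // Element exists in both: kept by intersection and union
                if (mode != Mode.DIFF) {
                    out.add(a.get(i));
                }
                i++;
                j++;
            }
        }
        // Leftovers of a belong to union and difference; leftovers of b belong to union only
        if (mode != Mode.INTERSECT) {
            while (i < a.size()) {
                out.add(a.get(i++));
            }
        }
        if (mode == Mode.UNION) {
            while (j < b.size()) {
                out.add(b.get(j++));
            }
        }
        return out;
    }
}
```

Once the inputs are sorted, the merge itself visits each element at most once, so its cost grows with the combined length of the two lists.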

The merge function can also perform set operations on multiple fields. Suppose the same Name can appear under different Dept values, so Dept and Name must be treated as a whole when computing the intersection. The corresponding code is [A1.sort(Dept,Name),B1.sort(Dept,Name)].merge@i(Dept,Name).

The calculation result is as follows:
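The multi-field intersection described above can be sketched in Java by building a composite key from the chosen fields of each row. Note that this sketch uses hashing on the composite key rather than the sort-and-merge approach of the script, and it represents rows as maps from column name to value. Both choices, along with the class name MultiFieldIsect, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration: intersection of two row lists on several fields at once. */
final class MultiFieldIsect {

    /** Builds a composite key from the given fields of one row. */
    static List<String> key(Map<String, String> row, String... fields) {
        String[] parts = new String[fields.length];
        for (int i = 0; i < fields.length; i++) {
            parts[i] = row.get(fields[i]);
        }
        // Lists compare element by element, so equal field values give equal keys
        return Arrays.asList(parts);
    }

    /** Rows of a whose composite key also occurs in b, in the order of a. */
    static List<Map<String, String>> isect(List<Map<String, String>> a,
                                           List<Map<String, String>> b,
                                           String... fields) {
        Set<List<String>> keysInB = new HashSet<>();
        for (Map<String, String> row : b) {
            keysInB.add(key(row, fields));
        }
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : a) {
            if (keysInB.contains(key(row, fields))) {
                out.add(row);
            }
        }
        return out;
    }
}
```

Calling isect(rowsA, rowsB, "Dept", "Name") keeps only the rows of rowsA whose Dept and Name both match a row of rowsB, so two people with the same name in different departments are not confused.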

For large files that do not fit in memory, use the cursor function to read the files and the merge function to perform the set operation. The code for the intersection is as follows:
A1 = file("e:\f1.txt").cursor()
B1 = file("e:\f2.txt").cursor()
A2 = [A1.sortx(Name),B1.sortx(Name)].merge@xi(Name)

Note that the cursor function does not read all the data into memory; it opens the file as a cursor (stream). The computing engine automatically allocates a suitable buffer, reads part of the data at a time for calculation, and repeats until the computation is complete.

Unlike in-memory computing, cursor operations require cursor-specific functions; for example, sorting uses the sortx function. Here the merge function uses two options: @i indicates intersection, and @x indicates that the operands are cursors rather than in-memory data. In addition, functions such as union work only on in-memory data and cannot be used for large files.
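The streaming idea can be illustrated in Java with two readers that are advanced one line at a time, so only the current line of each input is held in memory. This is a sketch under assumptions rather than a description of how the cursor works internally: each input is assumed to contain one name per line, already sorted ascending by String.compareTo and free of duplicates (the external sorting that sortx performs is not reproduced here), and the class name StreamIsect is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.function.Consumer;

/** Hypothetical illustration: intersection of two sorted line streams without loading them fully. */
final class StreamIsect {

    /** Passes every line that occurs in both inputs to the given consumer. */
    static void intersect(BufferedReader a, BufferedReader b, Consumer<String> out)
            throws IOException {
        String x = a.readLine();
        String y = b.readLine();
        while (x != null && y != null) {
            int c = x.compareTo(y);
            if (c < 0) {
                x = a.readLine();      // x is smaller, so it cannot occur in b any more
            } else if (c > 0) {
                y = b.readLine();      // y is smaller, so it cannot occur in a any more
            } else {
                out.accept(x);         // same value in both inputs
                x = a.readLine();
                y = b.readLine();
            }
        }
    }
}
```

The readers can wrap files opened with java.nio.file.Files.newBufferedReader, and the consumer can write each match to an output file, so memory use stays small regardless of file size.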

The script above completes all the data processing. The Set calculator script is then integrated into Java through JDBC. The Java code is as follows:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

// Establish an esProc JDBC connection
Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
// Call esProc, where test is the script file name
com.esproc.jdbc.InternalCStatement st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call test()");
st.execute(); // execute the esProc stored procedure
ResultSet set = st.getResultSet(); // obtain the calculation result
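Once the ResultSet is available, it can be read like any other JDBC result. The sketch below collects the first column of each row into a list. It assumes the script returns a single column of names, which matches the intersection example but is not stated explicitly, and the class name ResultRows is hypothetical. Only standard java.sql calls are used.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: collects a single-column JDBC result into a list. */
final class ResultRows {

    /** Reads the first column of every remaining row as a string. */
    static List<String> firstColumn(ResultSet rs) throws SQLException {
        List<String> values = new ArrayList<>();
        while (rs.next()) {
            values.add(rs.getString(1)); // JDBC column indexes start at 1
        }
        return values;
    }
}
```

With the code above, the call would be ResultRows.firstColumn(set). In real use the statement and connection should also be closed once the result has been read.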
