Sometimes we need to query a large text file rather than a database. The file then has to be read as a stream, the query algorithm implemented by hand, and parallel processing added to improve performance. Java itself lacks a class library for this kind of structured file computation, so everything must be hard-coded; the resulting code is complex and hard to read, and efficient parallel processing is difficult to achieve.
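To make the hand-coded approach concrete, here is a minimal single-threaded Java sketch of a streaming filter over a delimited text file. It assumes tab-separated columns, a header row that contains columns named OrderDate and Amount, and dates written as yyyy-MM-dd. The class name OrderFilter and these layout details are illustrative assumptions, not taken from the actual file.

```java
import java.io.BufferedReader;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: a hand-coded streaming query over a delimited text file.
final class OrderFilter {

    /**
     * Reads rows one at a time and keeps those with
     * startDate <= OrderDate <= endDate and Amount > minAmount.
     * Assumed layout: tab-separated, first line holds the column names.
     */
    static List<String> query(BufferedReader in, LocalDate startDate,
                              LocalDate endDate, double minAmount) {
        Iterator<String> lines = in.lines().iterator(); // lazy: one line in memory at a time
        List<String> hits = new ArrayList<>();
        if (!lines.hasNext()) {
            return hits; // empty input
        }
        // Locate the columns by name instead of hard-coding their positions.
        List<String> columns = Arrays.asList(lines.next().split("\t"));
        int dateCol = columns.indexOf("OrderDate");
        int amountCol = columns.indexOf("Amount");
        if (dateCol < 0 || amountCol < 0) {
            throw new IllegalArgumentException("OrderDate or Amount column not found");
        }
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] fields = line.split("\t");
            LocalDate date = LocalDate.parse(fields[dateCol]);
            double amount = Double.parseDouble(fields[amountCol]);
            if (!date.isBefore(startDate) && !date.isAfter(endDate) && amount > minAmount) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```

Even this version has to do its own parsing and column lookup, and it collects the hits in memory; splitting the file into segments and running them on several threads would make it considerably longer.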
The free collector compensates for Java's lack of such a library. It encapsulates a rich set of functions for reading and writing structured files and for cursor-based computation, supports parallel computing with simple code, and provides an easy-to-use JDBC interface. A Java application can execute a collector script file as if it were a database stored procedure, passing in parameters and obtaining the result through JDBC.
The integration structure of the collector and the Java application is as follows:
The following example illustrates the basic process by which the collector helps Java query a large text file. The source data, sOrder.txt, is as follows:
To query the orders whose date falls between startDate and endDate and whose amount is greater than argAmount, use the following code:
A1: Opens the file as a cursor. @t indicates that the first row is read as the column names.
A2: Runs a structured query on the cursor; the result is still a cursor.
A3: Executes the cursor and reads the result into memory as follows:
The Java main program can call the collector script through JDBC, with the following code:
Class.forName("com.esproc.jdbc.InternalDriver");
con = DriverManager.getConnection("jdbc:esproc:local://");
// Call the collector script (similar to a stored procedure); searchBig is the DFX file name,
// with one ? placeholder per parameter
st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call searchBig(?,?,?)");
// Set the parameters: start date, end date, minimum amount
st.setObject(1, "2010-01-01");
st.setObject(2, "2010-12-31");
st.setObject(3, 2000);
// Execute the script
st.execute();
// Get the result set
ResultSet rs = st.getResultSet();
......
The return value is a ResultSet object that conforms to the JDBC standard. Calling a collector script works the same way as accessing a database, so programmers familiar with JDBC can pick it up quickly.
For simple code like the above, you can also write the script directly in the JDBC call, separating the statements with \n, much like executing a longer SQL statement. This way there is no need to save a script file.
st = (com.esproc.jdbc.InternalCStatement) con.createStatement();
ResultSet rs1 = st.executeQuery("=file(\"d:\\sOrder.txt\").cursor@t()\n" +
    "=A1.select(OrderDate>=date(\"2010-01-01\") && OrderDate<=date(\"2010-12-31\") && Amount>2000)\n" +
    "=A2.fetch()");
The collector returns the value of the last expression.
If the query result is too large to fit in memory, the collector can return the cursor directly (that is, remove the A3 code). In Java, simply set the number of records to fetch per batch and read the result set as usual. The code is as follows:
st.setFetchSize(1000);
For more detail on deploying and invoking the collector through JDBC, see the article on integrating the collector with Java applications.
The collector can also perform multithreaded parallel computation. The simplest way is to add the @m option to the cursor function in the preceding code, which makes the file be read by multiple threads.
The file can also be segmented manually, so that both the reading and the computation run on multiple threads. The code is as follows:
A1: Opens the file as 8 cursors, each reading a specified part of the file. ~ represents the loop variable, which takes the values 1, 2, ..., 8. @z means that the file is divided into roughly equal parts by bytes and only one part is read; the collector automatically skips the partial line at the head of a segment and completes the line at its tail, so that every segment consists of whole lines.
A2: Executes a query against each cursor.
A3: Executes the cursors in parallel and merges the results. @x indicates that the merged object is a cursor, and @m indicates parallel computation. Note that the function conj does not guarantee that the order of the results matches the source data.
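The byte-based segmentation described for A1 can be sketched in plain Java. This is only an illustration of the idea, not the collector's implementation: a segment owns every line that starts inside its byte range, so it skips the line cut off at its head and reads past its end to finish the line at its tail. The class and method names are hypothetical, and RandomAccessFile.readLine is assumed to be acceptable, which limits the sketch to single-byte text such as ASCII.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper: split a text file into byte ranges that hold whole lines only.
final class SegmentReader {

    /** Returns the complete lines owned by segment `part` (1-based) out of `parts`. */
    static List<String> readSegment(Path file, int part, int parts) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long size = raf.length();
            long start = size * (part - 1) / parts;
            long end = size * part / parts;
            if (start > 0) {
                // Step back one byte and consume up to the next line break. If a line
                // begins exactly at `start`, this reads an empty string and keeps that line.
                raf.seek(start - 1);
                raf.readLine();
            }
            List<String> lines = new ArrayList<>();
            while (raf.getFilePointer() < end) {
                String line = raf.readLine(); // may run past `end` to complete the tail line
                if (line == null) {
                    break;
                }
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads all segments on parallel threads; the ordered stream keeps file order. */
    static List<String> readAll(Path file, int parts) {
        return IntStream.rangeClosed(1, parts).parallel()
                .mapToObj(part -> readSegment(file, part, parts))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}
```

Unlike conj with @m, readAll keeps the source order because the segments are collected through an ordered stream. The first segment still contains the header row, which a caller would need to drop before filtering.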
The segmented-cursor script uses a parallel computation function built into the collector. If the computation is more complex, or if the results fit in memory, explicit parallel computation is a better fit. The code is as follows:
A1: Sets the number of parallel threads.
A2: Executes the indented block B2-B3 in parallel. to(A1) = [1,2,...,8] supplies the entry parameter for each thread. Inside a thread, A2 yields that thread's entry parameter; in the main thread, A2 yields the results computed by all threads.
B3: Queries the cursor, reads the result into memory, and returns it to the main thread.
A4: Merges the calculated results of each thread sequentially.
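For readers more familiar with Java's concurrency utilities, the fork-and-merge pattern of A1-A4 looks roughly like the sketch below. It is an analogy under stated assumptions rather than a description of how the collector works internally: ForkMerge and forkAndMerge are hypothetical names, and the per-thread work is passed in as a function of the entry parameter 1..n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical helper: run one task per thread, then merge the results in thread order.
final class ForkMerge {

    static <T> List<T> forkAndMerge(int threads, IntFunction<List<T>> task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Fork: submit one job per entry parameter 1..threads.
            List<Future<List<T>>> futures = new ArrayList<>();
            for (int i = 1; i <= threads; i++) {
                final int part = i;
                Callable<List<T>> job = () -> task.apply(part);
                futures.add(pool.submit(job));
            }
            // Merge: walk the futures in submission order, so the combined list
            // follows the order of the entry parameters, not of completion.
            List<T> merged = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                merged.addAll(future.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each task returns its result to the calling thread, which holds all partial results in memory at once; this matches the condition stated earlier that the results must fit in memory.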
For ordered data, binary search can be used to improve query performance. For example, if the data is sorted by client and OrderID and the matching record must be found from the parameters argClient and argOrder, you can use the following code:
begin and end are the start and end positions of the binary search, and m is the middle position.
B4: Seeks to the middle position by byte offset and opens a cursor to read one record; the collector automatically skips the partial line at the head and completes the tail, so a full record is returned. @x indicates that the cursor is closed immediately after the record is fetched.
B5-C6: If the record is located, the current record is stored in C5.
B7-C8: If the record is not located, the key values are compared to decide which half to continue in, and begin and end are reset accordingly.
A9: Explicitly returns the result of the calculation in C5 to JDBC.
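The same byte-offset binary search can be sketched in plain Java. This is a minimal illustration under assumptions that are not taken from the original file: rows are tab-separated with the client in column 0 and the order ID in column 1, there is one header line, rows are sorted by client (String.compareTo order) and then by numeric order ID, and the text is single-byte. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;

// Hypothetical helper: binary search by byte offset in a sorted text file.
final class SortedFileSearch {

    /** Returns the matching line, or null if no row has this (client, orderId) key. */
    static String find(Path file, String client, int orderId) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.readLine();                     // skip the header row
            long begin = raf.getFilePointer();  // always the start of a line
            long end = raf.length();
            // Invariant: if the row exists, it starts at an offset in [begin, end).
            while (begin < end) {
                long m = (begin + end) >>> 1;
                // m >= begin > 0 here, because the header has been consumed.
                raf.seek(m - 1);
                raf.readLine();                 // drop the partial line; now at the first line start >= m
                long lineStart = raf.getFilePointer();
                String line = lineStart < end ? raf.readLine() : null;
                if (line == null) {
                    end = m;                    // no line starts in [m, end)
                    continue;
                }
                String[] fields = line.split("\t");
                int cmp = fields[0].compareTo(client);
                if (cmp == 0) {
                    cmp = Integer.compare(Integer.parseInt(fields[1]), orderId);
                }
                if (cmp == 0) {
                    return line;                // located
                }
                if (cmp < 0) {
                    begin = raf.getFilePointer(); // target starts after this line
                } else {
                    end = m;                    // target starts before m
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each probe reads at most two lines, so the number of disk reads grows with the logarithm of the file size rather than with the size itself.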
How to deal with large text files in Java query