Use the collector to assist Java in reading text

Source: Internet
Author: User

Java provides only basic file-handling functions. Reading a small text file as unstructured data is simple, but when the data must be read in a structured way, when formats vary, when there are special requirements, or when the file is too large to fit in memory, the code becomes very complex and its readability and reusability are hard to maintain.

The free collector can make up for this shortcoming. It encapsulates a rich set of functions for reading, writing, and computing on structured files, and it provides a JDBC interface. A Java application can execute a collector script file as if it were a database stored procedure, passing in parameters and retrieving the returned result through JDBC. For details, see the article on the application structure of the collector used as a Java computing class library.

The following are common cases of reading text in Java, each with the corresponding collector solution.
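As a rough illustration of the stored-procedure style of call described above, the following Java sketch uses only the standard java.sql API. The class name ScriptCaller, the script name, and the JDBC URL are hypothetical; the real URL, driver, and exact call syntax are whatever the collector's JDBC driver documents, which the source does not state.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.Collections;

class ScriptCaller {
    // Builds a standard JDBC call escape such as "{call readOrders(?, ?)}".
    static String buildCall(String scriptName, int paramCount) {
        String marks = String.join(", ", Collections.nCopies(paramCount, "?"));
        return "{call " + scriptName + "(" + marks + ")}";
    }

    // Runs the script like a stored procedure and prints every returned row.
    // jdbcUrl is an assumption: use the URL documented for the collector's driver.
    static void run(String jdbcUrl, String scriptName, Object... params) throws SQLException {
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             CallableStatement st = con.prepareCall(buildCall(scriptName, params.length))) {
            for (int i = 0; i < params.length; i++) {
                st.setObject(i + 1, params[i]); // JDBC parameters are 1-based
            }
            try (ResultSet rs = st.executeQuery()) {
                ResultSetMetaData md = rs.getMetaData();
                while (rs.next()) { // rows are fetched as the loop advances
                    StringBuilder row = new StringBuilder();
                    for (int c = 1; c <= md.getColumnCount(); c++) {
                        if (c > 1) row.append('\t');
                        row.append(rs.getObject(c));
                    }
                    System.out.println(row);
                }
            }
        }
    }
}
```

Because the result is consumed through a ResultSet, the same loop also serves the cursor-based scripts shown later for large files.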

Read specified columns
Read three columns from sOrder.txt by column name: OrderID, Client, and Amount. The source data is as follows:

The Collector code:

A
1 =file("D:\\sOrder.txt").import@t(OrderID,Client,Amount)

Results:

1. @t indicates that the first row is read as column names. When a file has no column names, you can refer to columns by ordinal. For example, to read columns 1, 2, and 4, use file("D:\\sOrder.txt").import(#1,#2,#4), with the following results:

2. To output a computed column, for example joining the year and OrderID into NewOrderID and outputting it together with Client and Amount, you can use the following code:

A
1 =file("D:\\sOrder.txt").import@t()
2 =A1.new(string(year(OrderDate))+"_"+string(OrderID):NewOrderID,Client,Amount)

The import function reads all fields by default, and the new function creates a new two-dimensional table, with the following results:

3. The default delimiter is a tab, but other characters can be used. For example, to read a comma-delimited CSV file, use file("D:\\sOrder.txt").import@t(;",").

4. To output only some rows, specify them by row number. For example, A1.to(2,100) outputs rows 2 through 100, and A1.to(3,) outputs every row from row 3 onward.

5. Occasionally the columns must be concatenated vertically, for example stacking OrderID, Client, and Amount into a single output column. After the data is read, this can be done with: create(all).record(A1.(OrderID)|A1.(Client)|A1.(Amount)).

Read large files

For large files that do not fit in memory, the collector can read the file through a cursor, and Java accesses the data as a JDBC stream.

The Collector code:

A
1 =file("D:\\sOrder.txt").cursor@t(OrderID,Client,Amount)

1. To speed up reading, you can use multithreaded parallel processing by simply adding the @m option: =file("D:\\sOrder.txt").cursor@tm(OrderID,Client,Amount). Because the threads read in parallel, however, the order in which the data is returned is not guaranteed.

2. Sometimes you need to segment the file manually and compute the segments in parallel, which requires reading only one section of the file. This can be done with: file("D:\\sOrder.txt").import@tz(;,2:24)

@z means that the file is divided by bytes into roughly 24 parts and only the 2nd part is read. The collector automatically adjusts the segment boundaries so that only whole lines are read.

If a segment still does not fit in memory, you can change the import function to cursor, so that the output is a cursor.
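To make the byte-based segmentation concrete, here is a minimal Java sketch of one common way to read whole lines from the k-th of n byte ranges. It is an illustration only, not the collector's actual algorithm, and the class name SegmentReader is hypothetical. Column-name handling (the @t option) is left out, and RandomAccessFile.readLine decodes bytes as single-byte characters, so the sketch assumes ASCII data.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

class SegmentReader {
    // Returns the complete lines that start inside part `part` (1-based) of
    // `parts` roughly equal byte ranges. A line that straddles a boundary
    // belongs to the part in which it starts, so each line is read exactly once.
    static List<String> readPart(String path, int part, int parts) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long size = f.length();
            long start = size * (part - 1) / parts;
            long end = size * part / parts;
            if (start > 0) {
                // Step back one byte and discard up to the next line break, so a
                // line beginning exactly at `start` is kept rather than skipped.
                f.seek(start - 1);
                f.readLine();
            }
            while (f.getFilePointer() < end) {
                String line = f.readLine();
                if (line == null) break; // end of file
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Reading parts 1 through n in turn and concatenating the results yields the original lines in order, which is what allows the parts to be processed by independent threads.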

Read a file by column width

The file data.txt has no delimiters, as follows:

It needs to be read as a two-dimensional table of four columns according to the specified widths and then output to Java: the id column takes the first 3 characters, the flag column takes characters 10-11, the d1 column takes characters 14-24, and the d2 column takes characters 25-33. For example, the four columns of the first row are: 001, DT, 100000000000, 3210XXXX.

The Collector code:

A
1 =file("D:\\data.txt").import@i()
2 =A1.new(mid(~,1,3):id,mid(~,10,2):flag,mid(~,14,11):d1,mid(~,25,9):d2)

A1: @i indicates that when the file has only one column, it is returned as a sequence (collection).

A2: Creates a new two-dimensional table based on A1. The mid function extracts substrings, and ~ represents each row of data.

Results:
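For comparison, the column slicing performed in A2 can be sketched in plain Java. The class FixedWidth and the sample line in the usage note are hypothetical; the helper mid simply mirrors the 1-based start and length arguments used in the script.

```java
class FixedWidth {
    // Mirrors mid(s, start, len): start is 1-based, len is a character count.
    // Throws StringIndexOutOfBoundsException if the line is too short.
    static String mid(String s, int start, int len) {
        return s.substring(start - 1, start - 1 + len);
    }

    // Splits one line into id (1-3), flag (10-11), d1 (14-24), d2 (25-33).
    static String[] parse(String line) {
        return new String[] {
            mid(line, 1, 3), mid(line, 10, 2), mid(line, 14, 11), mid(line, 25, 9)
        };
    }
}
```

For a made-up 33-character line such as 001------DT--12345678901ABCDEFGHI, parse returns the four values 001, DT, 12345678901, and ABCDEFGHI.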

Text with special characters

The file data.csv contains quotation marks, some of which interfere with normal use of the data. The quotation marks need to be removed before the data is output to Java. The source data is as follows:

The Collector code:

A
1 =file("D:\\data.csv").import(;",")
2 =A1.new(replace(_1,"\"",""):_1,replace(_2,"\"",""):_2,replace(_3,"\"",""):_3,replace(_4,"\"",""):_4)

Results:

Text with mathematical formulas

The formulas in the text need to be parsed into expressions, evaluated, and then output. The source data is as follows:

The Collector code:

A
1 =file("D:\\equations.txt").import@i()
2 =A1.new(~:equations,eval(string(~)):result)

The eval function dynamically parses a string into an expression and executes it.

Results:

Multi-line records
In the following file, every three lines represent one record. For example, the first record is: JFS 3 468.0 2009-08-13 39. The file needs to be output as a two-dimensional table.

The Collector code:

A
1 =file("D:\\data.txt").import@si()
2 =A1.group((#-1)\3)
3 =A2.new(~(1):OrderID,(line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate)

The file is read as a sequence; @s indicates that fields are not split. Every three lines form one group, where "#" represents the line number and "\" represents integer division. Finally, a new two-dimensional table is created from the groups: ~(1) represents the first member of the current group, and the array function splits a string into a sequence, with the following results:

If the file is too large to fit in memory, open it with a cursor and compute in batches. The first step is to create sub.dfx, which reads one batch of data and returns it each time there is an external request, until the end of the file. Its code is as follows:

  A                                    B
1 =file("D:\\data.txt").cursor@si()
2 for A1,3000                          =A2.group((#-1)\3)
3                                      =B2.new(~(1):OrderID,(line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate)
4                                      result B3

A2 loops over A1, reading 3,000 lines at a time and processing them with the previous algorithm.

B4 returns B3 to the main script. The code for the main script (that is, the DFX file called by Java) is as follows:

A
1 =pcursor("sub.dfx")

The pcursor function requests data from sub.dfx and converts it to cursor output.
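The batch-wise processing of three-line records can also be sketched in plain Java, which shows what each loop iteration does. The class ThreeLineRecords is hypothetical, and the sketch assumes the layout implied by the script: line 1 holds OrderID, line 2 holds Client, SellerId, and Amount separated by tabs, and line 3 holds OrderDate.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ThreeLineRecords {
    // Reads up to batchSize records (three lines each) from the reader.
    // Returns an empty list at end of input; a trailing incomplete record is dropped.
    static List<String[]> nextBatch(BufferedReader in, int batchSize) throws IOException {
        List<String[]> batch = new ArrayList<>();
        while (batch.size() < batchSize) {
            String l1 = in.readLine();
            String l2 = in.readLine();
            String l3 = in.readLine();
            if (l3 == null) break; // fewer than three lines left
            String[] mid = l2.split("\t"); // assumes exactly three tab-separated fields
            // Column order: OrderID, Client, SellerId, Amount, OrderDate
            batch.add(new String[] {l1, mid[0], mid[1], mid[2], l3});
        }
        return batch;
    }
}
```

Calling nextBatch(reader, 1000) repeatedly until it returns an empty list corresponds to reading 3,000 lines per iteration, and only one batch is held in memory at a time.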

Records with a variable number of lines

Each record in the file data.txt spans a variable number of lines, but each field has its own fixed label, namely "Object Type:", "Left:", "Top:", and "Line Color:", and this continues until the end of the text. The first record is: Symbol1, 14, 11, RGB(1 0 0). The file needs to be read as a structured two-dimensional table.

The Collector code:

A
1 =file("data.txt").read()
2 =A1.array("Object Type:").to(2,)
3 =A2.new(~.array("\r\n")(1):Otype,mid(~,s=pos(~,"Left:")+len("Left:"),pos(~,"\r\n",s)-s):L,mid(~,s=pos(~,"Top:")+len("Top:"),pos(~,"\r\n",s)-s):T,mid(~,s=pos(~,"Line Color:")+len("Line Color:"),if(r=pos(~,"\r\n",s),r,len(~))-s+1):Lcolor)

The read function reads a file as one large string. The string is then split by the delimiter, and the empty first member is removed. Finally, a new two-dimensional table is created, using the string functions array, pos, len, and mid to locate the required fields. Note that the last line may not end with a carriage return, hence the if check. Final result:

String functions are used here to find the fields, but regular expressions can be used as well.

If the file is too large to fit in memory, you can read it in batches using the pcursor function.
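As an illustration of the regular-expression alternative mentioned above, the following Java sketch extracts the labeled fields with java.util.regex. The class LabeledRecords is hypothetical, and the sketch assumes that each label starts a line and that its value runs to the end of that line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabeledRecords {
    // Returns the text that follows `label` at the start of a line in `chunk`,
    // or null when the label does not occur.
    static String field(String chunk, String label) {
        Matcher m = Pattern.compile("(?m)^" + Pattern.quote(label) + "[ \\t]*(.*)").matcher(chunk);
        return m.find() ? m.group(1).trim() : null;
    }

    // Splits the whole text on "Object Type:" and builds one row per record:
    // object type, Left, Top, Line Color.
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        String[] chunks = text.split("Object Type:");
        for (int i = 1; i < chunks.length; i++) { // chunk 0 precedes the first record
            String chunk = chunks[i];
            String oType = chunk.split("\\R", 2)[0].trim(); // rest of the tag line
            rows.add(new String[] {
                oType, field(chunk, "Left:"), field(chunk, "Top:"), field(chunk, "Line Color:")
            });
        }
        return rows;
    }
}
```

Because the dot in the pattern never matches a line terminator, no special case is needed for a final line that lacks a trailing line break.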

Records grouped by tags

The records in the file data.txt are organized in groups. Each line that begins with the list tag gives the group name (such as Aro, BDR, BSF), and the group name must be combined with the fields in the group for output. The source data is as follows:

The Collector code:

A
1 =file("Mutiline2.txt").import@si()
2 =A1.group@i(like(~,"list:*"))
3 =A2.conj(~.to(2,).new(mid(A2.~(1),6):Client,(t=~.array("\t"))(1):c1,t(2):c2,t(3):c3,t(4):c4))

The file is read as a sequence of strings and then grouped by the record-delimiting tag. @i starts a new group whenever the condition is true, and * is a wildcard. A2 is as follows:

After that, the fields are extracted by ordinal and the records are merged, with the following results:
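For comparison, the same grouping logic can be sketched in plain Java. The class TaggedGroups is hypothetical, and the sketch assumes that tag lines start with "list:" and that every other line holds four tab-separated fields, as the script implies.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedGroups {
    // Produces one row per data line: group name followed by its four fields.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        String group = null;
        for (String line : lines) {
            if (line.startsWith("list:")) {
                // A tag line starts a new group; the name follows the prefix.
                group = line.substring("list:".length());
            } else if (group != null) {
                String[] f = line.split("\t"); // assumes at least four fields
                rows.add(new String[] {group, f[0], f[1], f[2], f[3]});
            }
        }
        return rows;
    }
}
```

Lines that appear before the first tag are ignored, since they belong to no group.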
