How the Collector handles text computing challenges

Source: Internet
Author: User

Apart from databases, text files are probably the most common form of data storage, so computing on text matters a great deal. Text itself, however, has no computing ability: unlike a database, it has no SQL syntax, so computations on text must be coded in a programming language. Most programming languages are not set-oriented when it comes to text processing, which makes batch operations cumbersome to write. In Java, for example, even a very simple sum takes many lines, and operations that involve filtering or grouping can require hundreds of lines of code. Newer scripting languages such as Perl, Python, and R have improved on this in recent years, but their support for batch structured computing is still inadequate, and they integrate poorly.

Another option is to import the text data into a database and compute with SQL. But text often lacks the strong data typing that a database requires, so the import is usually accompanied by tedious data cleanup, and the extra step can seriously hurt processing efficiency.

As a set-oriented dynamic scripting language, the collector makes up for this shortcoming to some extent. The common text computations below illustrate what it offers for this kind of work.

Unstructured operations

Text parsing

The data items in each line of the text T.txt are separated by a variable number of spaces:

2010-8-13 991003 3166.63 3332.57 3166.63 3295.11

2010-8-10 991003 3116.31 3182.66 3084.2 3140.2

......

Suppose you want the average of the last four items in each line, as a list. In the collector this takes a single statement:

     A
1    =file("T.txt").read@n().(~.array@p("").to(-4).avg())

read@n() reads the text as a collection of strings, one per line. array@p("") splits a string into substrings on runs of whitespace of any length, and the @p option automatically parses each substring into the appropriate data type for further computation (here, averaging).
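For comparison, the same calculation can be sketched in Python. This is not collector code: the function name `last_four_averages` is made up for illustration, and the sketch assumes that the last four items of every non-blank line are numeric.

```python
def last_four_averages(lines):
    """Return the average of the last four items of each non-blank line."""
    averages = []
    for line in lines:
        items = line.split()              # split on any run of whitespace
        if not items:
            continue                      # skip blank lines
        last_four = [float(x) for x in items[-4:]]
        averages.append(sum(last_four) / len(last_four))
    return averages
```

It could be called on an open file, for example `last_four_averages(open("T.txt"))`, since a file object iterates over its lines.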

For the comma-delimited text T.csv, take the lines that have at least 8 items and write the first 8 items of each to another text file, R.txt, with the separator replaced by | (the separator used by some banking systems):

     A
1    =file("T.csv").read@n().(~.array(",")).select(~.len()>=8)
2    >file("R.txt").write(A1.(~.to(8).string("|")))

The string() function joins the members of a collection into one string using the specified delimiter.
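A rough Python equivalent of the filter-and-rejoin step is sketched below. The function name `first_eight_piped` and its `min_items` parameter are illustrative only, and the sketch does a plain split on commas, so it assumes no item contains a quoted comma.

```python
def first_eight_piped(lines, min_items=8):
    """Keep lines with at least min_items comma-separated items and
    return the first min_items of each, joined with '|'."""
    result = []
    for line in lines:
        items = line.rstrip("\r\n").split(",")   # plain split, no CSV quoting
        if len(items) >= min_items:
            result.append("|".join(items[:min_items]))
    return result
```

Writing the result to R.txt is then a matter of joining the returned strings with newlines.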

Each line of the text T.txt is a string in the form below. The lines need to be split into multiple files according to the state code (LA here) that precedes the characters US.

COOP:166657,'NEW IBERIA AIRPORT ACADIANA REGIONAL LA US',200001,177,553

......

     A
1    =file("T.txt").read@n()
2    =A1.group(mid(~,pos(~," US")-2,2):state;~:data)
3    >A2.run(file(state+".txt").export(data))
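The grouping logic can be sketched in Python as follows. The helper names `group_by_state` and `write_groups` are hypothetical, and the sketch assumes the state code is the two characters immediately before the first occurrence of " US" (a space followed by US) in each line.

```python
from collections import defaultdict
from pathlib import Path


def group_by_state(lines, marker=" US"):
    """Group lines by the two characters that precede the marker."""
    groups = defaultdict(list)
    for line in lines:
        pos = line.find(marker)
        if pos < 2:
            continue                  # marker missing, or no room for a state code
        groups[line[pos - 2:pos]].append(line.rstrip("\r\n"))
    return dict(groups)


def write_groups(groups, directory="."):
    """Write each group to <state>.txt inside the given directory."""
    for state, members in groups.items():
        path = Path(directory) / f"{state}.txt"
        path.write_text("\n".join(members) + "\n", encoding="utf-8")
```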

The collector also supports regular expressions for more complex splitting requirements. However, because regular expressions are hard to use and perform poorly, the conventional methods above are still the general recommendation.

Structuring

Every 3 lines in the log S.log form one complete record, which needs to be parsed into structured data and then written to T.txt:

       A                              B
1      =file("S.log").read@n()
2      =create(...)                                       Set up the target result set
3      for A1.group((#-1)\3)          ...                 Group by line number, 3 lines per group
...                                   ...                 Parse the field values from A3 (3 lines)
...                                   >A2.insert(...)     Insert into the target result set
...    >file("T.txt").export(A2)                          Write out the results

Grouping by line number lets a loop process one group of lines at a time, which simplifies the task.

The simpler case of one record per line is obviously just a special case of this.
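The idea of grouping by line number can be illustrated with a short Python sketch. The field layout of S.log is not given in the text, so `parse_record` below is a hypothetical placeholder that only strips the lines; both function names are made up for this example.

```python
def records_of_three(lines, size=3):
    """Split a list of lines into consecutive groups of `size` lines.

    Line i (counting from 0) lands in group i // size, which mirrors
    grouping by (line number - 1) integer-divided by 3.
    """
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parse_record(group):
    """Placeholder parser: the real field extraction depends on the log format."""
    return tuple(line.strip() for line in group)
```

With `size=1` the same code handles the one-record-per-line case.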

If S.log is too large to fit in memory, a cursor can be used to read it in and write the results out as a stream:

       A                              B
1      =file("S.log").cursor@s()                          Create a cursor that streams the file
2      =file("T.txt")                                     Result file
3      for A1,3                       ...                 Loop once for every 3 lines read
...                                   ...                 Parse the field values from A3 (3 lines)
...                                   >A2.export@a(...)   Append to the result file

Experienced users can optimize this code further by parsing several records before writing them out in one batch, which performs better.
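A streaming version with batched writes might look like the Python sketch below. The names `iter_records` and `convert`, the `batch` parameter, and the tab-joined output format are all assumptions made for illustration, since the text leaves the actual parsing step elided.

```python
from itertools import islice


def iter_records(line_iter, size=3):
    """Yield lists of `size` lines from any line iterator without loading it all."""
    it = iter(line_iter)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def convert(src, dst, batch=1000):
    """Read 3-line records from src and write one line per record to dst,
    flushing the buffer every `batch` records."""
    buffer = []
    for record in iter_records(src):
        # Stand-in for real parsing: join the record's lines with tabs.
        buffer.append("\t".join(line.rstrip("\r\n") for line in record))
        if len(buffer) >= batch:
            dst.write("\n".join(buffer) + "\n")
            buffer.clear()
    if buffer:
        dst.write("\n".join(buffer) + "\n")
```

Both `src` and `dst` can be ordinary file objects, so only one batch of records is held in memory at a time.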

Each complete record in the log S.log begins with "---start---" and spans a variable number of lines. Only A3 in the earlier code needs to change:

3      for A1.group@i(~=="---start---")     Start a new group whenever ---start--- appears

Large text can likewise be handled with a cursor, in which case cell A3 above is changed to:

3      for A1;~=="---start---":0            Start another loop iteration whenever ---start--- appears
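The marker-based grouping can be sketched in Python as a generator, which works equally well on an in-memory list and on a file read line by line. The function name `split_on_marker` is illustrative; lines that appear before the first marker are yielded as a group of their own.

```python
def split_on_marker(lines, marker="---start---"):
    """Yield lists of lines; a new list starts at each line equal to the marker."""
    current = []
    for line in lines:
        text = line.rstrip("\r\n")
        if text == marker and current:
            yield current             # close the previous record
            current = []
        current.append(text)
    if current:
        yield current                 # last record in the input
```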

In another case, all lines of the same record share the same prefix (such as the number of the user the log entry belongs to), and a change of prefix marks the start of another record. Again, only A3 needs to change, for the in-memory version and the cursor version respectively:

3      for A1.group@o(left(~,6))            Start a new group when the first 6 characters change

3      for A1;left(~,6)                     Start another loop iteration when the first 6 characters change

The operations in the previous section can also be adapted to large text by using cursors.
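Grouping on a changing prefix corresponds closely to `itertools.groupby` in Python, which starts a new group whenever the key of adjacent items changes. The sketch below is an analogy rather than collector code; the function name `split_on_prefix_change` and the default width of 6 are chosen to match the example above.

```python
from itertools import groupby


def split_on_prefix_change(lines, width=6):
    """Yield (prefix, lines) for each run of adjacent lines that share
    the same first `width` characters."""
    for prefix, run in groupby(lines, key=lambda line: line[:width]):
        yield prefix, list(run)
```

Because `groupby` consumes its input lazily, passing an open file instead of a list gives the streaming behaviour without further changes. Note that a prefix that reappears later forms a new group, since only adjacent lines are compared.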

Search and statistics

Find the files in a directory whose text contains a specified word, and list the matching lines with their line numbers:

     A
1    =directory@p("*.txt")
2    =A1.conj(file(~).read@n().(if(pos(~,"xxx"),[A1.~,#,~].string())).select(~))

grep is a common Unix command, but some operating systems do not provide it, and it is not easy to implement in a program. The collector can traverse the file system and compute over text, so two lines of code are enough.
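For readers more familiar with Python, a comparable search can be sketched as below. The function names are hypothetical, the search is a plain substring test rather than a word-boundary match, and files are assumed to be readable as UTF-8 (undecodable bytes are replaced).

```python
from pathlib import Path


def grep_lines(name, lines, word):
    """Return (name, line_number, text) for every line containing word."""
    return [(name, number, line.rstrip("\r\n"))
            for number, line in enumerate(lines, start=1)
            if word in line]


def grep_directory(directory, word, pattern="*.txt"):
    """Search every file in `directory` whose name matches `pattern`."""
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as f:
            hits.extend(grep_lines(str(path), f, word))
    return hits
```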

List every word that occurs in the text T.txt together with its number of occurrences, ignoring case:

     A
1    =lower(file("T.txt").read()).words().groups(~:word;count(1):count)

Word count is a well-known exercise. The collector provides the words() function to split a string into words, so a single statement does the job.
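The same count can be sketched in Python with `collections.Counter`. The sketch assumes that a word is a run of ASCII letters, which may differ from how the collector's words() function tokenizes text; the function name `word_counts` is illustrative.

```python
import re
from collections import Counter


def word_counts(text):
    """Count case-insensitive occurrences of each word in text."""
    words = re.findall(r"[a-z]+", text.lower())   # assumed definition of a word
    return Counter(words)
```

For a file, the text would be read first, for example with `open("T.txt").read()`.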

List all the words in the text T.txt that contain the letters a, b, and c, ignoring case:

     A
1    =lower(file("T.txt").read()).words().select(~.array("").pos(["a","b","c"]))

Because the letters may appear in any order, a substring search cannot decide whether a word contains them. Instead, array("") splits the string into a collection of single characters, and set membership is then used for the test. With the collector's support for set operations, this is again a single statement.
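The set-membership test can be illustrated in Python with a subset check on the characters of each word. The sketch assumes that a matching word must contain all of the letters, and that a word is a run of ASCII letters; `words_with_letters` is a made-up name.

```python
import re


def words_with_letters(text, letters="abc"):
    """Return the words of text (lower-cased, in order) that contain
    every one of the given letters, in any order."""
    required = set(letters)
    words = re.findall(r"[a-z]+", text.lower())
    # A word qualifies when the required letters are a subset of its characters.
    return [w for w in words if required <= set(w)]
```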

All of these operations can easily be adapted to large text by processing it in segments or with cursors.
