The pits I stepped in over years of using awk: awk precautions

Source: Internet
Author: User
Tags: integer division

Because of the projects I work on, I often use awk to process text data. I even downloaded a Windows build of awk, gawk.exe, so that I can enjoy the same convenience on Windows.

As the saying goes, "walk by the river often enough and your shoes will get wet": I have hit a lot of pits while using awk. Here is a short summary that I hope will help.

1 FS Issues
Take a look at these two awk scripts:

    cat demo_1.txt demo_2.txt
    1|2|3|4|
    1|@|2|@|3|@|4|@|
    awk -F'|' '{print $2}' demo_1.txt;   # script 1
    awk -F'|@|' '{print $2}' demo_2.txt; # script 2

The scripts are meant to split on '|' and on '|@|' respectively and print the second column of each file. In practice, the first script gives the right result and the second one does not.

Why is it?

Because the vertical bar is a special character in regular expressions: it denotes alternation, matching either the expression on its left or the one on its right. To match a literal vertical bar, you have to escape it.

But why does the first command also use a vertical bar without a problem?

This involves awk in a provision:

If FS is longer than one character, it is interpreted as a regular expression; otherwise the line is split literally on that single character.

So the first command, with a single vertical bar as the delimiter, works, while the second one has its '|@|' treated as a regular expression and goes wrong.
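The rule can be checked with a small sketch. The sample strings below are made up for illustration; only the `-F` behavior is the point.

```shell
# One-character FS: taken literally, even though | is a regex metacharacter.
literal=$(printf '1|2|3|4|\n' | awk -F'|' '{print $2}')

# Multi-character FS: interpreted as a regular expression.
# Here every run of digits acts as one separator.
regex=$(printf 'a1b22c\n' | awk -F'[0-9]+' '{print $2}')

echo "literal: $literal"   # literal: 2
echo "regex:   $regex"     # regex:   b
```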

2 Regular expressions and the number of backslashes
Continuing the discussion above: if demo.txt uses "|@|" as the delimiter, how should the command be written so that it prints the second column correctly?
The answer is:

    awk -F'\\|@\\|' '{print $2}' demo.txt;

Note that the value given to -F here is '\\|@\\|', not simply '\|@\|'. The latter produces a warning, awk: warning: escape sequence "\|" treated as plain "|". Why does it have to be written this way?

Let's start with an experiment:

    echo | awk -F'|@|' '{print FS}'     # script 1
    echo | awk -F'\|@\|' '{print FS}'   # script 2
    echo | awk -F'\\|@\\|' '{print FS}' # script 3

As you can see, the first and second scripts end up with the same FS value. The reason is that awk first parses the string supplied by the user and assigns the result to FS, and later passes FS as an argument to its internal split routine.

That split routine in turn has to parse FS once more, compiling it into a regular expression. When awk parses the string to assign FS, '\|' is reduced to a plain '|', so what the split routine receives is '|@|'.

Therefore, for awk to split the record correctly, FS must be written as '\\|@\\|'. awk then resolves '\\' to a single literal backslash, FS becomes '\|@\|', and the regular expression treats each vertical bar as an ordinary character.
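A minimal sketch of the two layers of parsing, using a made-up sample line. The bracket-expression form is not from the original text; it is shown as an alternative that avoids backslashes altogether.

```shell
line='1|@|2|@|3|@|4|@|'

# The -F argument is unescaped once as a string (\\ becomes \),
# and the result \|@\| is then compiled as a regular expression.
escaped=$(printf '%s\n' "$line" | awk -F'\\|@\\|' '{print $2}')

# Inside a bracket expression | is an ordinary character,
# so there is no backslash to lose in the first parsing pass.
bracket=$(printf '%s\n' "$line" | awk -F'[|]@[|]' '{print $2}')

echo "escaped: $escaped"
echo "bracket: $bracket"
```

Both forms should print 2 for this line. The bracket form may be easier to read when the separator contains several metacharacters.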


3 Associative array access issues

I once had this scenario: file a.txt holds the balances of a small number of users (userid|amt), about 100 records, and file b.txt holds the balances of all users (userid|amt), about 1 million records.

The task was to join a.txt and b.txt on userid, find the userids present in both files, and print the records where b.amt is greater than a.amt.

I wrote the following script first:

    awk -F'|' 'BEGIN{while (getline < "a.txt") {v_user_map[$1] = $2;}} {
        v_amt_a = v_user_map[$1];
        if ((v_amt_a != "") && v_amt_a < $2) print $0;
    }' b.txt

The logic looked fine, so I ran it. But it turned out to be far slower than expected, and the process kept consuming more and more memory.

Something was obviously wrong: in theory only the BEGIN block should consume memory, and nothing after it should need more.

Having written C++ code, which has data structures similar to associative arrays, I quickly guessed the cause and confirmed it by experiment: the statement v_amt_a = v_user_map[$1];.

Although nothing is assigned to v_user_map[$1] here, merely referencing a nonexistent element makes awk create it with an empty value. The v_user_map array therefore keeps growing, memory usage rises, and lookups get slower.

Once the cause is known, the fix is easy. The awk manual shows that you can write:

    awk -F'|' 'BEGIN{while (getline < "a.txt") {v_user_map[$1] = $2;}} {
        if ($1 in v_user_map) {if (v_user_map[$1] < $2) print $0;}
    }' b.txt

Using the in operator to test whether an element exists in an associative array does not create the element, so the array stops growing.
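The difference can be observed directly by counting array elements. This is a small sketch with a hypothetical array m, not the original join script.

```shell
# Referencing a missing key silently creates it.
count_after_lookup=$(awk 'BEGIN {
    m["a"] = 1
    x = m["b"]                  # plain reference: m["b"] now exists
    n = 0; for (k in m) n++
    print n
}')

# Testing membership with "in" leaves the array untouched.
count_after_in=$(awk 'BEGIN {
    m["a"] = 1
    if ("b" in m) x = m["b"]    # "b" is absent, nothing is created
    n = 0; for (k in m) n++
    print n
}')

echo "after plain lookup: $count_after_lookup"   # 2
echo "after in test:      $count_after_in"       # 1
```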

4 Memory Limit Issues

If awk is a 32-bit program (you can check with the file command), script 1 above will probably dump core, because by default a 32-bit awk can use at most 256 MB of memory.

If the memory requested exceeds this limit, awk exits unexpectedly.

The workaround is to use a 64-bit build, or to set the environment variable with "export LDR_CNTRL=MAXDATA=0x80000000" (valid on AIX 4.3 and later).

5 Getline return value problem

Note the getline usage above: while (getline < "a.txt") loops through the file until the end. This form is actually not quite proper and hides a danger.

I used to think that when getline reached the end of the file it would leave the record empty, but practice showed otherwise: getline returns 0 at end of file and leaves the last record intact. So I switched to the form above.

However, this form can also run into trouble, because getline has three possible return values: 1 when a record is read normally, 0 when the end of the file is reached, and -1 when the file does not exist or another error occurs.

If a.txt does not exist, getline returns -1, which counts as true, so the loop never ends. I once had a program hang because of this, so I am pointing it out for everyone's attention.

It is recommended that you take a look at the function descriptions in the Help document before using the function.
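One common way to guard against the -1 case is to compare the return value with 0 explicitly. The sketch below uses a temporary file and a helper function count_lines, both made up for this illustration.

```shell
tmp=$(mktemp)
printf 'u1|10\nu2|20\n' > "$tmp"

# Count the records of a file from a BEGIN block.
# "> 0" stops the loop on end of file (0) and on errors (-1).
count_lines() {
    awk -v f="$1" 'BEGIN {
        n = 0
        while ((getline line < f) > 0) n++
        close(f)
        print n
    }'
}

existing=$(count_lines "$tmp")
missing=$(count_lines "$tmp.does-not-exist")
rm -f "$tmp"

echo "existing file: $existing"   # 2
echo "missing file:  $missing"    # 0, and the loop terminates
```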

6 Piping problems

First look at this script:

    ls -1rt
    demo.txt
    list.txt
    echo "\n" | awk '{while ("ls -1rt" | getline) {print NR ":" $0 > "list.txt";}}'

Guess: what does list.txt contain after the script finishes?

Answer:

    cat list.txt
    1:demo.txt
    1:list.txt

I believe many readers will be surprised:

Some expect list.txt to contain only one line, the last line of the ls -1rt output.

Some expect 6 lines, because ls -1rt is executed three times.

People who think this way mostly do not know one awk rule: by default, the same file or pipe is opened only once; to open it again, you must close it first.

Because the script above never explicitly closes the file or the pipe, list.txt is opened only once and ls -1rt is executed only once, hence the output above.

Guess again: what does list.txt contain after this script finishes?

    echo "\n" | awk '{while ("ls -1rt" | getline) {print NR ":" $0 > "list.txt";} close("list.txt"); close("ls -1rt");}'
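The open-once rule can be seen with a deterministic command in place of ls. In this sketch the command "echo hello" and the three input lines are made up; the counter n records how many lines were read from the pipe in total.

```shell
# Without close(): the command runs once; later reads hit end of pipe.
without_close=$(printf 'x\ny\nz\n' | awk '{
    while (("echo hello" | getline line) > 0) n++
} END { print n }')

# With close(): the command is re-run for every input record.
with_close=$(printf 'x\ny\nz\n' | awk '{
    while (("echo hello" | getline line) > 0) n++
    close("echo hello")
} END { print n }')

echo "without close: $without_close"   # 1
echo "with close:    $with_close"      # 3
```

Reading into a variable (getline line) is a deliberate choice here: it leaves $0 and NR alone, so the counts are not affected by the pipe reads.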

7 Output Single Quote problem

As you know, awk scripts are usually enclosed in single quotes, for example: awk '{print "Do something";}'.

That makes using a single quote inside awk rather awkward. Searching the Internet for how to print a single quote from awk usually turns up this method:

    echo | awk '{print "'\''";}'

Many people misunderstand this: they think that because single quotes mark the start and end of an awk script, a single quote cannot be used directly inside one.

That is a misconception, as the following script shows.

    cat demo.awk
    {print "'";}
    echo | awk -f demo.awk
    '


As you can see, an awk script can use single quotes directly, and the script itself does not need to be enclosed in single quotes. The command line is awkward only because of the shell: inside single quotes, the shell does not treat any character as special.

An awk script often needs $n to get the n-th field, but $ has a special meaning in the shell, where it marks the start of a variable. Without the single quotes, the shell would expand it and the script would break.

'{print "'\''";}' can be understood as follows. The script is split into three segments:

    1: '{print "'    2: \'    3: '";}'

After the shell parses each segment, they become:

    1: {print "    2: '    3: ";}

Joining the three segments gives the script that is passed to awk: {print "'";}. Once you understand this, you will know how to solve the following problem with awk on Windows:

    C:\Users\hch>awk '{print "'";}'
    awk: '{print
    awk: ^ invalid char ''' in expression
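For a POSIX shell, here is a short sketch of three ways to get a single quote into awk output. Only the first comes from the text above; the octal escape and the -v variable (named q here for illustration) are alternatives that avoid the quote juggling.

```shell
# 1. Close the quoted script, add an escaped quote, reopen the script.
q1=$(awk 'BEGIN { print "'\''" }')

# 2. Use the octal escape for the single quote character (ASCII 39).
q2=$(awk 'BEGIN { print "\047" }')

# 3. Pass the quote in from the shell as an awk variable.
q3=$(awk -v q="'" 'BEGIN { print q }')

echo "$q1 $q2 $q3"   # ' ' '
```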

8 Auto-implicit conversion problem

In C, we are used to integer division producing an integer, so 5/2 is 2, not 2.5.

In awk, however, variables have no declared type and arithmetic is done in floating point, so dividing two integers may yield a fraction.
Example:

    echo | awk '{v_result = 5/2; print v_result}'
    2.5

What if we want C-style integer division? Use the int function, as follows:

    echo | awk '{v_result = int(5/2); print v_result}'
    2
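A small sketch of how int behaves, including a negative operand (the negative case is an added illustration, not from the text above). int truncates toward zero, which matches integer division in modern C.

```shell
quotient=$(awk 'BEGIN { print 5 / 2 }')         # floating-point division
truncated=$(awk 'BEGIN { print int(5 / 2) }')   # fraction dropped
negative=$(awk 'BEGIN { print int(-5 / 2) }')   # truncates toward zero

echo "5/2       = $quotient"    # 2.5
echo "int(5/2)  = $truncated"   # 2
echo "int(-5/2) = $negative"    # -2
```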

9 Chinese vertical line problem

In real work, files often use a vertical bar '|' as the delimiter in each record, for example "A|b|c|d". If the file contains no Chinese, there is no problem.

But if it contains Chinese, especially GBK-encoded Chinese, things can easily go wrong.

In GBK, a Chinese character consists of two bytes: the first byte lies in [0x81, 0xFE] and the second byte in [0x40, 0xFE] (excluding 0x7F).

If the second byte happens to be 124 (0x7C), the ASCII code of '|', awk mistakes that byte for a delimiter and splits the record in the wrong places.

Which Chinese characters are affected? The following script prints the GBK characters whose second byte is a vertical bar (other encodings can be checked similarly):

    echo | awk '{for (i = 128; i < 256; i++) {printf("%c|", i);}}' # the terminal encoding must be GBK
    €| 倈 厊 噟 坾 媩 寍 恷 憒 抾 攟 晐 榺 檤 殀 泑 渱 潀 焲 爘 ...

I have not found a good way to handle this case. I suggest using a longer delimiter, such as '|@|', to reduce the chance of running into the problem.

If the delimiter cannot be changed, consider converting the encoding with iconv and converting it back after processing.
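The effect can be reproduced without a GBK terminal by writing the bytes directly. In this sketch the record is made up: it has three logical fields, and the middle one is the two-byte sequence 0x81 0x7C, whose second byte equals '|'.

```shell
# \201\174 is octal for the bytes 0x81 0x7C.
# LC_ALL=C makes awk work byte by byte, as it would on GBK data.
nf=$(printf 'a|\201\174|c\n' | LC_ALL=C awk -F'|' '{print NF}')

echo "fields seen by awk: $nf"   # 4 instead of the intended 3

# Sketch of the iconv round trip (file names are hypothetical):
#   iconv -f GBK -t UTF-8 in.txt | awk -F'|' '{print $2}' | iconv -f UTF-8 -t GBK > out.txt
```

The round trip works because in UTF-8 every byte of a multi-byte character is 0x80 or above, so a byte equal to '|' is always a real vertical bar.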

10 function name conflicts with variable names

awk has many built-in functions, and if you accidentally give a variable the same name as one of them, the program fails with an error. The message is unhelpful: it only says there is a syntax error without giving the reason, which makes this a particularly nasty pit.
For example, the following error:

    awk '{if (NR == FNR) {sub[$1] = $2;} else {print sub[$1];}}' subsid_amt.txt subsid.txt
    awk: {if (NR == FNR) {sub[$1] = $2;} else {print sub[$1];}}
    awk: ^ syntax error

I wrote this script while working overtime late at night, when my head was not clear, and I stared at the error in confusion for a long time: the syntax looked right however I read it, yet awk kept reporting a syntax error. The cause is that sub is the name of a built-in awk function, so it cannot be used as an array name.
So now, when I write more complex awk scripts, I habitually prefix variable names with v_, which reduces the chance of name collisions.
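A minimal sketch of the collision and of the v_ prefix avoiding it. The sample data and the -F'|' separator are assumptions for illustration; the original command does not show its input.

```shell
tmp=$(mktemp -d)
printf 'u1|10\nu2|20\n' > "$tmp/subsid_amt.txt"
printf 'u2\nu1\n' > "$tmp/subsid.txt"

# Using the built-in function name sub as an array is rejected.
if awk 'BEGIN { sub["u1"] = 10 }' 2>/dev/null; then
    builtin_name="accepted"
else
    builtin_name="rejected"
fi

# The same lookup with a v_ prefixed array name runs normally.
joined=$(awk -F'|' 'NR == FNR { v_sub[$1] = $2; next }
                    { print v_sub[$1] }' "$tmp/subsid_amt.txt" "$tmp/subsid.txt")
rm -rf "$tmp"

echo "sub as array name: $builtin_name"
echo "$joined"
```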

That is the summary for now. If you have run into other awk problems, feel free to share them for discussion :)
