Awk basic knowledge summary page 1/2

Source: Internet
Author: User
Tags: text processing, string-to-number conversion

1. Use Rules

Awk is well suited to text processing and report generation, and it also has many well-designed features that make serious programming possible. Awk's syntax is familiar: it borrows essential parts of several languages, such as C, python, and bash.

First awk
Let's jump in and start using awk to understand how it works. At the command line, enter:
$ awk '{ print }' /etc/passwd
You will see the contents of your /etc/passwd file appear before your eyes. Now, an explanation of what awk did. When we called awk, we specified /etc/passwd as the input file. When awk executed, it evaluated the print command for each line in /etc/passwd, in order. All output was sent to stdout, and the result is identical to running cat /etc/passwd.

Now, let's explain the { print } code block. In awk, curly braces are used to group blocks of code together, much like in C. There is only one print command in our code block. In awk, a lone print command prints the entire current line.

Here is another awk example, which serves exactly the same purpose as the previous example:
$ awk '{ print $0 }' /etc/passwd

In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing.

Now for an awk program that outputs data that has nothing to do with the input data.
Example 1:
$ awk '{ print "" }' /etc/passwd
Passing the "" string to the print command tells it to print a blank line. Testing this script, awk outputs one blank line for every line in the /etc/passwd file. Again, we see that awk executes the script for every line in the input file.

Example 2:
$ awk '{ print "hiya" }' /etc/passwd
Running this script will fill your screen with hiya.

2. Process multiple fields
Awk is exceptionally good at handling text that has been broken into multiple logical fields, and it lets you reference each individual field from inside your awk script.
Print the list of all user accounts on the system:
$ awk -F ":" '{ print $1 }' /etc/passwd
Above, when calling awk, we use the -F option to specify ":" as the field separator. When awk processes the print $1 command, it prints the first field that appears on each line of the input file.
The following is another example:
$ awk -F ":" '{ print $1 $3 }' /etc/passwd

The following is an excerpt from the Script output:
halt7
operator11
root0
shutdown6
sync5
bin1
...etc.

As you can see, awk prints the first and third fields of the /etc/passwd file, which are the username and uid fields respectively. Now, while the script runs, it's not ideal: there is no space between the two output fields! If you are used to programming in bash or python, you may have expected the print $1 $3 command to insert a space between the two fields. However, when two strings appear next to each other in an awk program, awk concatenates them without adding a space. The following command will insert a space between the two fields:
$ awk -F ":" '{ print $1 " " $3 }' /etc/passwd

When print is called this way, it concatenates $1, " ", and $3, creating readable output.
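To see the concatenation behavior without touching a real /etc/passwd, here is a quick sketch using a couple of invented passwd-style lines piped in on stdin:

```shell
# Sample data stands in for /etc/passwd so the output is predictable.
# Adjacent strings in print are concatenated; the " " literal adds the space.
printf 'root:x:0\nbin:x:1\n' | awk -F ":" '{ print $1 " " $3 }'
# prints:
# root 0
# bin 1
```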
You can also insert some text labels:
$ awk -F ":" '{ print "username: " $1 "\t\tuid: " $3 }' /etc/passwd

This will generate the following output:
username: halt          uid: 7
username: operator      uid: 11
username: root          uid: 0
username: shutdown      uid: 6
username: sync          uid: 5
username: bin           uid: 1
...etc.

3. Call external scripts
Passing scripts to awk as command line arguments is very handy for small one-liners.
For multi-line programs, you can compose your script in an external file, then pass the -f option to awk so it reads the script from that file:
$ awk -f myscript.awk myfile.in

Putting your scripts in their own text files also lets you take advantage of additional awk features. For example, this multi-line script:
BEGIN {
    FS = ":"
}
{ print $1 }
prints the first field of every line in /etc/passwd.

In this script, the field separator is specified within the code itself (by setting the FS variable). Setting the field separator in the script itself means we have one less command line argument to type.
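As a concrete sketch of the whole round trip (the file names myscript.awk and sample.in, and the sample data, are invented for illustration):

```shell
# Write the multi-line awk program to its own file...
cat > myscript.awk <<'EOF'
BEGIN {
    FS = ":"
}
{ print $1 }
EOF

# ...create some sample colon-delimited input...
printf 'alice:x:1000\nbob:x:1001\n' > sample.in

# ...and let awk read the program with -f.
awk -f myscript.awk sample.in
# prints:
# alice
# bob
```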

4. BEGIN and END blocks

Normally, awk executes each block of your script's code once for each input line. However, there are situations where you may need to execute initialization code before awk starts processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because awk executes the BEGIN block before processing the input file, it is an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.

Awk also offers another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summary information that should appear at the end of the output stream.
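A minimal sketch of the execution order, with invented sample input: the BEGIN block runs before any line is read, the main block runs once per line, and the END block runs after the last line:

```shell
printf 'one\ntwo\nthree\n' | awk '
BEGIN { print "starting"; count = 0 }       # runs once, before any input
{ count++ }                                 # runs once per input line
END { print "processed " count " lines" }'  # runs once, after all input
# prints:
# starting
# processed 3 lines
```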

5. Regular expressions

Awk allows the use of regular expressions to select whether to execute an individual block of code, depending on whether the regular expression matches the current line.
To output only those lines containing the character sequence foo:
/foo/ { print }

A more complex example, printing only lines containing a floating point number:
/[0-9]+\.[0-9]*/ { print }

You can place any boolean expression before a code block to control when the block is executed. Awk executes the code block only if the preceding boolean expression evaluates to true. The following example script outputs the third field of every line whose first field equals fred. If the first field of the current line is not equal to fred, awk continues processing the file without executing the print statement for that line:
$1 == "fred" { print $3 }

Awk offers a full selection of comparison operators, including "==", "<", ">", "<=", ">=", and "!=". In addition, awk provides the "~" and "!~" operators, which mean "matches" and "does not match" respectively.
They are used by specifying a variable on the left side of the operator and a regular expression on the right side. If the fifth field of a line contains the character sequence root, the following example prints only the third field of that line:
$5 ~ /root/ { print $3 }
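Sketched with invented whitespace-separated input (using the default FS), where only the first line's fifth field matches /root/:

```shell
# Only line one has "root" in field five, so only its third field prints.
printf 'a b c d root\ne f g h user\n' | awk '$5 ~ /root/ { print $3 }'
# prints:
# c
```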

6. Conditional statements

Awk also offers very nice C-like if statements. An if statement example:
{
    if ( $5 ~ /root/ ) {
        print $3
    }
}
Here, the block is executed for every input line, and the if statement inside it decides whether the print command runs.
Here is a more complex example of an awk if statement:
{
    if ( $1 == "foo" ) {
        if ( $2 == "foo" ) {
            print "uno"
        } else {
            print "one"
        }
    } else if ( $1 == "bar" ) {
        print "two"
    } else {
        print "three"
    }
}

Using if statements, this code:
!/matchme/ { print $1 $3 $4 }
can be converted to:
{
    if ( $0 !~ /matchme/ ) {
        print $1 $3 $4
    }
}
Both scripts output only those lines that do not contain the character sequence matchme.

Awk also allows the use of the boolean operators "||" (logical OR) and "&&" (logical AND) to build more complex boolean expressions:
( $1 == "foo" ) && ( $2 == "bar" ) { print }
This example prints only those lines where the first field equals foo and the second field equals bar.
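A quick sketch with invented sample lines; only the line satisfying both conditions is printed:

```shell
printf 'foo bar\nfoo baz\nqux bar\n' | awk '( $1 == "foo" ) && ( $2 == "bar" ) { print }'
# prints:
# foo bar
```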

7. Variables
Awk has two kinds of variables: numeric variables and string variables.

Numeric variable
So far, we have either printed strings, the entire line, or specific fields. However, awk can also perform both integer and floating point math. Using mathematical expressions, it's very easy to write a script that counts the number of blank lines in a file:
BEGIN { x = 0 }
/^$/  { x = x + 1 }
END   { print "I found " x " blank lines. :}" }
In the BEGIN block, we initialize our integer variable x to zero. Then, each time awk encounters a blank line, it executes the statement x = x + 1, incrementing x.
After all the lines have been processed, the END block executes, and awk prints a final summary stating the number of blank lines it found.
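Run against a small invented input (smiley dropped for easy checking), the blank-line counter behaves like this:

```shell
# Three of the six input lines are blank, so x ends up at 3.
printf 'a\n\nb\n\n\nc\n' | awk '
BEGIN { x = 0 }
/^$/  { x = x + 1 }
END   { print "I found " x " blank lines." }'
# prints:
# I found 3 blank lines.
```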

String variable
One of the strengths of awk is that its variables are "simple and stringy". I consider awk variables "stringy" because all awk variables are stored internally as strings. At the same time, awk variables are "simple" because you can perform mathematical operations on them, and as long as a variable contains a valid numeric string, awk handles the string-to-number conversion automatically. To see what I mean, consider the following example:
x = "1.01"
# We just set x to contain the *string* "1.01"
x = x + 1
# We just added one to a *string*
print x
# Incidentally, these are comments
Awk will output:
2.01

Although we assigned the string value 1.01 to the variable x, we could still add one to it, something we couldn't do in either bash or python.
Bash doesn't support floating point arithmetic at all, and while bash variables are "stringy", they aren't "simple": to perform any math, bash requires that we enclose it in an ugly $(( )) construct.
In python, we'd have to convert the 1.01 string to a floating point value before performing any math on it. While that isn't difficult, it's still an extra step.
With awk, it's all automatic, which keeps our code nice and clean. If we wanted to square the first field of every input line and add one, we would use this script:
{ print ( $1 ^ 2 ) + 1 }
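Piping a couple of sample numbers through the one-liner shows the automatic string-to-number conversion at work:

```shell
# 3^2 + 1 = 10; 1.01^2 + 1 = 2.0201 -- no explicit conversion needed.
printf '3\n1.01\n' | awk '{ print ( $1 ^ 2 ) + 1 }'
# prints:
# 10
# 2.0201
```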

A little experimenting reveals that if a particular variable doesn't contain a valid number, awk treats the variable as the number zero when evaluating a mathematical expression.
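A tiny sketch of that zero-coercion rule, with invented input:

```shell
# "hello" has no numeric prefix, so it evaluates as 0; "3" converts normally.
printf 'hello 3\n' | awk '{ print $1 + 5, $2 + 5 }'
# prints:
# 5 8
```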

8. Operators
Awk has a complete set of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk gives us the exponent operator "^" mentioned earlier, the modulo (remainder) operator "%", and many other handy assignment operators borrowed from C.

These include the increment and decrement operators (i++, --foo), the add/subtract/multiply/divide assignment operators (a += 3, b *= 2, c /= 2.2, d -= 6.2), and, beyond that, handy modulo and exponent assignment operators (a ^= 2, b %= 4).
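A quick sketch exercising a few of these assignment operators (the values are arbitrary; echo supplies one blank input line so the main block runs exactly once):

```shell
echo | awk '{
    a = 10
    a += 3      # a is now 13
    a %= 4      # 13 mod 4 leaves 1
    i = 5
    i++         # post-increment: i becomes 6
    print a, i
}'
# prints:
# 1 6
```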

Field Separator
Awk has its own set of special variables. Some of them allow you to adjust how awk runs, while others can be read to gather useful information about the input. We have already touched one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk looks for between fields. When we used /etc/passwd as input, we set FS to ":". While that was sufficient, FS gives us even more flexibility.

The FS value is not limited to a single character; it can also be set to a regular expression, specifying a character pattern of any length. If you are processing fields separated by one or more tabs, you will want to set FS like this:
FS = "\t+"

Above, we use the special "+" regular expression character, which means "one or more of the previous character".

If your fields are separated by whitespace (one or more spaces or tabs), you may be tempted to set FS to the following regular expression:
FS = "[[:space:]]+"

While this assignment will work, it isn't necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean "one or more spaces or tabs". In this particular example, the default FS setting is exactly what you want!

Complex regular expressions are no problem. Even if your records are separated by the word "foo" followed by three digits, the following regular expression will still allow your data to be parsed correctly:

FS = "foo[0-9][0-9][0-9]"
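Sketched with an invented record string, so the regular-expression split is easy to see:

```shell
# awk splits on the pattern "foo" followed by three digits.
printf 'recordAfoo123recordB\n' | awk 'BEGIN { FS = "foo[0-9][0-9][0-9]" } { print $2 }'
# prints:
# recordB
```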

Number of fields

The next two variables we're going to cover are not normally intended to be assigned to, but are read to gain useful information about the input. The first is the NF variable, also called the "number of fields" variable. Awk automatically sets this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:
NF == 3 { print "this particular record has three fields: " $0 }
Of course, you can also use the NF variable in conditional statements, as follows:
{
    if ( NF > 2 ) {
        print $1 " " $2 ":" $3
    }
}
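With invented input of varying widths, only the three-field line triggers the NF pattern:

```shell
printf 'a b\na b c\na b c d\n' | awk 'NF == 3 { print "three fields: " $0 }'
# prints:
# three fields: a b c
```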

9. Processing records

Record number

The record number (NR) is another handy variable. It always contains the number of the current record (awk counts the first record as record number 1). Until now, we have been dealing with input files that contain one record per line; in those situations, NR also tells you the current line number. However, that will not be the case when we start processing multi-line records later in this series! Like NF, NR can be used to print only certain input lines:
( NR < 10 ) || ( NR > 100 ) { print "We are on record number 1-9 or 101+" }
Another example:
{
    # skip header
    if ( NR > 10 ) {
        print "OK, now for the real information!"
    }
}
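A runnable sketch of the header-skipping idea, shrunk to a two-line header and invented data for brevity:

```shell
# NR > 2 skips the first two (header) lines; NR itself is printable too.
printf 'header1\nheader2\ndata1\ndata2\n' | awk 'NR > 2 { print NR ": " $0 }'
# prints:
# 3: data1
# 4: data2
```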

Awk provides additional variables suitable for various purposes. We will discuss these variables in future articles.

Multi-line records

Awk is an excellent tool for reading and processing structured data, such as the system's /etc/passwd file. /etc/passwd is the UNIX user database: a colon-delimited text file containing a lot of important information, including all existing user accounts and user IDs. Earlier, I showed how awk can easily parse this file; all we had to do was set the FS (field separator) variable to ":".

With the FS variable set correctly, awk can be configured to parse almost any kind of structured data, as long as there is one record per line. However, just setting FS isn't enough when we want to parse records that span multiple lines. In these situations, we also need to modify the RS (record separator) variable. The RS variable tells awk when the current record ends and the next record begins.

As an example, let's look at how we'd handle the task of processing an address list of Federal Witness Protection Program participants:
Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345
Big Tony
200 Incognito Ave.
Suburbia, WA 67890

Ideally, we'd like awk to treat each three-line address as an individual record, rather than as three separate records. The code would become very simple if awk saw the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and zip code as the third field ($3). Here is the code:
BEGIN {
    FS = "\n"
    RS = ""
}

Above, setting FS to "\n" tells awk that each field occupies its own line. By setting RS to "", we also tell awk that each address record is separated by a blank line. Once awk knows how the input is formatted, it can do all the parsing work for us; the rest of the script is simple. Let's look at a complete script that parses the address list, prints each address record on a single line, and separates the fields with commas.
address.awk
BEGIN {
    FS = "\n"
    RS = ""
}
{
    print $1 ", " $2 ", " $3
}

If this script is saved as address.awk, and the address data is stored in a file called address.txt, you can run the script by typing "awk -f address.awk address.txt". The output is as follows:
Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890
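The same pipeline can be reproduced without creating separate files by feeding the address data in on stdin:

```shell
# Blank line between the two addresses marks the record boundary (RS = "").
printf 'Jimmy the Weasel\n100 Pleasant Drive\nSan Francisco, CA 12345\n\nBig Tony\n200 Incognito Ave.\nSuburbia, WA 67890\n' |
awk '
BEGIN { FS = "\n"; RS = "" }
{ print $1 ", " $2 ", " $3 }'
# prints:
# Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
# Big Tony, 200 Incognito Ave., Suburbia, WA 67890
```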

OFS and ORS
In address.awk's print statement, you can see that awk concatenates (joins) strings that are placed next to one another on a line. We used this feature to insert a comma and a space (", ") between the three fields on each line. While this method works, it's a bit ugly looking. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Consider this print statement:
print "Hello", "there", "Jim!"

The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string.
By default, awk generates the following output:
Hello there Jim!

This shows that by default, OFS is set to " ", a single space. However, we can easily redefine OFS so that awk inserts our favorite field separator. Here is a revised version of the original address.awk program that uses OFS to output those intermediate ", " strings:

address.awk, revised
BEGIN {
    FS = "\n"
    RS = ""
    OFS = ", "
}
{
    print $1, $2, $3
}
Awk also has a special variable called ORS, the "output record separator". By setting ORS, which defaults to a newline ("\n"), we can control the character that is automatically printed at the end of each print statement. The default ORS value causes awk to output each print statement on a new line. If we wanted the output to be double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (with no newlines), we would set ORS to " ".
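A compact sketch of both variables together, with arbitrary separators chosen for illustration:

```shell
# OFS goes between comma-separated print arguments; ORS ends each print.
printf 'a b\nc d\n' | awk 'BEGIN { OFS = "-"; ORS = "|" } { print $1, $2 }'
# prints:
# a-b|c-d|
```

Note that because ORS replaced the newline, the output ends with "|" and no final newline.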

Converting multi-line records to a tab-delimited format

Let's say we wrote a script that converts the address list to a one-record-per-line, tab-delimited format for import into a spreadsheet. After using a slightly modified version of address.awk, it becomes clear that our program only works for three-line addresses. If awk encountered the following address, the fourth line would be thrown away and not printed:
Cousin Vinnie
Vinnie's Auto Shop
City Alley 300
Sosueme, OR 76543

To handle this situation, the code should take the number of fields in each record into account, printing each one in order. Right now, the code prints only the first three fields of each address. Here is some code that does what we want:

A version of address.awk suitable for addresses with any number of fields
BEGIN {
    FS = "\n"
    RS = ""
    ORS = ""
}
{
    x = 1
    while ( x < NF ) {
        print $x "\t"
        x++
    }
    print $NF "\n"
}

First, the field separator FS is set to "\n" and the record separator RS to "", so that awk correctly parses the multi-line addresses, as before. Then, the output record separator ORS is set to "", which causes the print statement not to output a newline at the end of each call. This means that if we want any text to start on a new line, we need to explicitly write print "\n".

In the main code block, a variable called x is created to hold the number of the current field being processed; it starts at 1. Then, we use a while loop (an awk looping construct equivalent to the while loop in C) to repeatedly print each field followed by a tab character, for every field except the last. Finally, the last field is printed followed by a newline; because ORS is set to "", print won't add one of its own. The program output is as follows, which is exactly what we wanted (not beautiful, but tab-delimited for easy import into a spreadsheet):
Jimmy the Weasel    100 Pleasant Drive    San Francisco, CA 12345
Big Tony    200 Incognito Ave.    Suburbia, WA 67890
Cousin Vinnie    Vinnie's Auto Shop    City Alley 300    Sosueme, OR 76543
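To check the loop against the four-line address without creating any files, the record can be piped in directly:

```shell
# NF is 4 here; the while loop emits fields 1-3 with tabs, then the last
# field with an explicit newline (ORS = "" suppresses automatic newlines).
printf "Cousin Vinnie\nVinnie's Auto Shop\nCity Alley 300\nSosueme, OR 76543\n" |
awk '
BEGIN { FS = "\n"; RS = ""; ORS = "" }
{
    x = 1
    while ( x < NF ) {
        print $x "\t"
        x++
    }
    print $NF "\n"
}'
# prints the whole four-field record on one tab-separated line
```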
