awk Basic Knowledge Summary (Linux Shell)


1. An introduction to awk

Awk is well suited to text processing and report generation, and it has a number of well-designed features that let you write powerful programs with minimal effort.
awk's syntax should also look familiar: it borrows some of the best parts of other languages, such as C, Python, and bash.

Your first awk
Let's go ahead and start using awk to see how it works. At the command line, enter the following command:
$ awk '{ print }' /etc/passwd
You will see the contents of the /etc/passwd file appear before you. Now, let's explain what awk did. When invoking awk, we specified /etc/passwd as the input file. When awk ran, it executed the print command for each line in /etc/passwd, in order. All output was sent to stdout, and the result is identical to running cat /etc/passwd.

Now, let's interpret the { print } code block. In awk, curly braces are used to group blocks of code together, similar to C. There is only one command in our block of code: print. In awk, when a print command appears by itself, the entire contents of the current line are printed.

Here is another awk example that does exactly the same thing:
$ awk '{ print $0 }' /etc/passwd

In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing.

We can also create an awk program that outputs data entirely unrelated to the input data.
Example 1:
$ awk '{ print "" }' /etc/passwd
Whenever you pass the "" string to the print command, it prints a blank line. If you test this script, you'll find that awk outputs one blank line for every line in your /etc/passwd file. Again, this is because awk executes your script for every line in the input file.

Example 2:
$ awk '{ print "hiya" }' /etc/passwd
Running this script will fill your screen with hiya.

2. Processing multiple fields
Awk is really good at handling text that has been broken into multiple logical fields, and it allows you to refer to each individual field from inside your awk script.
The following will print a list of all user accounts on your system:
$ awk -F":" '{ print $1 }' /etc/passwd
In the example above, when we invoked awk, we used the -F option to specify ":" as the field separator. When awk processes the print $1 command, it prints out the first field that appears on each line in the input file.
Here's another example:
$ awk -F":" '{ print $1 $3 }' /etc/passwd

The following is an excerpt from the script output:
halt7
operator11
root0
shutdown6
sync5
bin1
...etc.

As you can see, awk prints out the first and third fields of the /etc/passwd file, which happen to be the username and uid fields. Now, while the script works, it's not ideal — there are no spaces between the two output fields! If you're used to programming in bash or Python, you may have expected the print $1 $3 command to insert a space between the two fields. However, when two strings appear next to each other in an awk program, awk concatenates them without adding a space between them. The following command will insert a space between the two fields:
$ awk -F":" '{ print $1 " " $3 }' /etc/passwd

When print is called this way, it concatenates $1, " ", and $3, creating readable output.
We can also insert some text labels:
$ awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd

This produces the following output:
username: halt		uid:7
username: operator		uid:11
username: root		uid:0
username: shutdown		uid:6
username: sync		uid:5
username: bin		uid:1
...etc.

3. Calling external scripts
Passing your script to awk as a command-line argument is handy for small one-liners.
For a multi-line program, you can compose your script in an external file, then pass the -f option to awk, providing it with the external script file to call:
$ awk -f myscript.awk myfile.in

Putting your script in its own text file also lets you take advantage of additional awk features. For example:
BEGIN {
	FS=":"
}
{ print $1 }
This script prints the first field of each line of /etc/passwd.

In this script, the field delimiter is specified in the code itself (by setting the FS variable).
By setting the field separator inside the script itself, you have one less command-line argument to type.

4. BEGIN and END blocks
In general, awk executes each block of your script's code once for each input line. However, there are many situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because awk executes the BEGIN block before it starts processing the input file, it is an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.

Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
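The BEGIN/END pattern described above can be sketched as a simple line counter. This is a minimal, hedged demo; the three-line input is made up here purely for illustration:

```shell
# Count input lines: initialize in BEGIN, accumulate per line,
# summarize in END. The inline input is demo data, not from the text.
result=$(printf 'one\ntwo\nthree\n' |
         awk 'BEGIN { count = 0 }            # runs once, before any input
              { count++ }                    # runs once per input line
              END { print "lines: " count }' )
echo "$result"
```

Note how the per-line block never prints anything; only the END block emits output, once, after the whole input has been consumed.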

5. Regular expressions
Awk allows the use of regular expressions to selectively execute an individual block of code, depending on whether or not the regular expression matches the current line.
Here's a script that outputs only those lines that contain the character sequence foo:
/foo/ { print }

And here's something a bit more complicated — a script that prints only lines that contain a floating-point number:
/[0-9]+\.[0-9]*/ { print }

You can also place any kind of boolean expression before a block of code to control when that particular block is executed. Awk will execute the code block only if the preceding boolean expression evaluates to true. The following example script outputs the third field of all lines whose first field equals fred. If the first field of the current line is not equal to fred, awk will continue processing the file without executing the print statement for that line:
$1 == "fred" { print $3 }

Awk offers a full selection of comparison operators, including "==", "<", ">", "<=", ">=", and "!=". In addition, awk provides the "~" and "!~" operators, which mean "matches" and "does not match," respectively.
You use them by specifying a variable on the left side of the operator, and a regular expression on the right side. If the fifth field on a line contains the character sequence root, this example prints the third field of that line:
$5 ~ /root/ { print $3 }

6. Conditional statements
Awk also offers very nice, C-like if statements. Here's an example:
{
	if ( $5 ~ /root/ ) {
		print $3
	}
}
In this script, the block is executed for every input line, and the if statement inside it decides whether the print command runs.
Here's a more complicated example of an awk if statement:
{
	if ( $1 == "foo" ) {
		if ( $2 == "foo" ) {
			print "uno"
		} else {
			print "one"
		}
	} else if ( $1 == "bar" ) {
		print "two"
	} else {
		print "three"
	}
}

You can also use if statements to rewrite certain code. This script:
! /matchme/ { print $1 $3 $4 }
converts into:
{
	if ( $0 !~ /matchme/ ) {
		print $1 $3 $4
	}
}
Both scripts output only those lines that don't contain the character sequence matchme.

Awk also allows the use of the boolean operators "||" (logical OR) and "&&" (logical AND), so you can create more complex boolean expressions:
( $1 == "foo" ) && ( $2 == "bar" ) { print }
This example prints only lines where the first field equals foo and the second field equals bar.
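A quick, hedged check of the compound pattern above; the three-line input is invented here just to show which rows match:

```shell
# Only rows where field 1 is "foo" AND field 2 is "bar" are printed.
# The inline input is demo data, not from the text.
matched=$(printf 'foo bar\nfoo baz\nqux bar\n' |
          awk '($1 == "foo") && ($2 == "bar") { print }')
echo "$matched"
```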

7. Variables
Awk has two kinds of variables to get acquainted with: numeric variables and string variables.

Numeric variables
So far, we've either printed strings, the entire line, or specific fields. However, awk also allows us to perform both integer and floating-point math. By using mathematical expressions, it's very easy to write a script that counts the number of blank lines in a file:
BEGIN { x=0 }
/^$/  { x=x+1 }
END   { print "I found " x " blank lines. :)" }
In the BEGIN block, we initialize our integer variable x to zero. Then, each time awk encounters a blank line, it executes the x=x+1 statement, incrementing x.
After all the lines have been processed, the END block executes, and awk prints out a final summary specifying the number of blank lines it found.
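Here's a runnable version of the blank-line counter, fed inline data instead of a file. The five-line input (with two blank lines) is an assumption made just for this demo:

```shell
# Two of the five input lines are blank, so the END summary reports 2.
blanks=$(printf 'a\n\nb\n\nc\n' |
         awk 'BEGIN { x = 0 }          # initialize the counter
              /^$/  { x = x + 1 }      # bump it on every empty line
              END   { print "I found " x " blank lines. :)" }')
echo "$blanks"
```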

Stringy variables
One of the neat things about awk variables is that they are "simple and stringy." I consider awk variables "stringy" because all awk variables are stored internally as strings. At the same time, awk variables are "simple" because you can perform mathematical operations on them, and as long as a variable contains a valid numeric string, awk automatically handles the string-to-number conversion. To see what I mean, check out this example:
x="1.01"
# We just set x to contain the *string* "1.01"
x=x+1
# We just added one to a *string*
print x
# Incidentally, these are comments :)
awk will output:
2.01

Even though we assigned the string value "1.01" to the variable x, we were still able to add one to it. We wouldn't be able to do that in bash or Python.
First of all, bash doesn't support floating-point arithmetic. And while bash does have "stringy" variables, they aren't "simple": to perform any mathematical operation, bash requires us to enclose our math in the ugly $(( )) construct.
If we were using Python, we would have to explicitly convert our "1.01" string to a floating-point value before performing any math on it. While this isn't difficult, it's still an additional step.
With awk, it's all automatic, and that makes our code nice and clean. If we wanted to square the first field of every input line and add one to it, we could use this script:
{ print ($1^2)+1 }

If you do a small experiment, you can see that if a particular variable does not contain a valid number, AWK treats the variable as number 0 when it is evaluated on a mathematical expression.
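The two coercion rules above — a valid numeric string participates in math, anything else evaluates as 0 — can be sketched in one short demo. The inline values are assumptions for illustration:

```shell
# String-to-number coercion: "1.01" + 1 works; "hello" is treated as 0.
num=$(echo x | awk '{ x = "1.01"; x = x + 1; print x }')
zero=$(echo x | awk '{ y = "hello"; print y + 1 }')
echo "$num $zero"
```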

8. Operators
Awk has a complete set of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk also allows the exponentiation operator "^", the modulo (remainder) operator "%", and a bunch of other handy assignment operators borrowed from C, as seen earlier.

These include pre- and post-increment/decrement (i++, --foo), the add/subtract/multiply/divide assign operators (a+=3, b*=2, c/=2.2, d-=6.2). But that's not all — we also get the handy modulo/exponentiation assign operators as well (a^=2, b%=4).
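A quick tour of a few of the operators named above, with made-up values chosen so each result is easy to check:

```shell
# Exercise ^=, %=, +=, and post-decrement inside a BEGIN block.
ops=$(awk 'BEGIN {
    a = 2;  a ^= 3        # exponentiation-assign: 2^3 = 8
    b = 10; b %= 4        # modulo-assign: 10 % 4 = 2
    c = 5;  c += 1; c--   # add one, then post-decrement: back to 5
    print a, b, c
}')
echo "$ops"
```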

Field separators
Awk has its own collection of special variables. Some of them allow you to adjust how awk works, while others can be read to glean valuable information about the input. We've already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk looks for between fields. When we used /etc/passwd as input, FS was set to ":". But FS can be used even more flexibly than that.

The FS value is not limited to a single character; it can be set to a regular expression specifying a character pattern of any length. If you're working with fields separated by one or more tabs, you'll want to set FS like so:
FS="\t+"

In the example above, we use the special "+" regular expression character, which means "one or more of the previous character."
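A hedged sketch of the "\t+" separator in action, using an invented line where the first two fields are separated by a double tab:

```shell
# FS="\t+" collapses runs of tabs into a single field boundary,
# so $2 is "beta" even though two tabs precede it.
second=$(printf 'alpha\t\tbeta\tgamma\n' |
         awk 'BEGIN { FS = "\t+" } { print $2 }')
echo "$second"
```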

If your fields are separated by whitespace (one or more spaces or tabs), you may want to set FS to this regular expression:
FS="[[:space:]]+"

While this assignment will do the trick, it's not actually necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean "one or more spaces or tabs." In this particular example, the default FS setting is exactly what you want!

Complex regular expressions are no problem. Even if your records are separated by the word "foo" followed by three digits, the following regular expression will still allow your data to be parsed properly:

FS="foo[0-9][0-9][0-9]"
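To see this multi-character separator at work, here's a small demo with an invented record where "foo123" sits between two fields:

```shell
# The literal word "foo" plus three digits acts as the field boundary,
# leaving "left" and "right" as fields 1 and 2.
fields=$(echo 'leftfoo123right' |
         awk 'BEGIN { FS = "foo[0-9][0-9][0-9]" } { print $1 "-" $2 }')
echo "$fields"
```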

Number of fields
The next two variables we're going to cover are not normally intended to be written to, but are instead read to glean useful information about the input. The first is the NF variable, also called the "number of fields" variable. Awk automatically sets this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:
NF == 3 { print "this particular record has three fields: " $0 }
Of course, you can also use the NF variable in conditional statements, as follows:
{
	if ( NF > 2 ) {
		print $1 " " $2 ":" $3
	}
}
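Here's a runnable sketch of the NF pattern, using three invented rows with two, three, and four fields:

```shell
# Only the row with exactly three fields triggers the pattern.
out=$(printf 'a b\na b c\na b c d\n' |
      awk 'NF == 3 { print "three: " $0 }')
echo "$out"
```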

9. Processing records
Record number
The record number (NR) is another handy variable. It always contains the number of the current record (awk counts the first record as record number 1). Up until now, we've been dealing with input files that contain one record per line. For these situations, NR will also tell you the current line number. However, when we start to process multi-line records later in the series, this will no longer be the case, so be careful! NR can be used like the NF variable to print only certain lines of the input:
(NR < 10) || (NR > 100) { print "We are on record number 1-9 or 101+" }
Another example:
{
	# skip header
	if ( NR > 10 ) {
		print "ok, now for the real information!"
	}
}
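A hedged demo of the header-skipping idea: the two-line header and the header size used here are assumptions for this example only:

```shell
# NR > 2 skips the first two (header) lines and prints the rest.
body=$(printf 'HEADER1\nHEADER2\ndata1\ndata2\n' |
       awk 'NR > 2 { print }')
echo "$body"
```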

AWK provides additional variables for a variety of purposes. We'll discuss these variables in a future article.

Multi-Line Records
Awk is an excellent tool for reading in and processing structured data, such as the system's /etc/passwd file. /etc/passwd is the UNIX user database, and it is a colon-delimited text file containing a lot of important information, including all existing user accounts and user IDs, among other things. In my previous article, I showed you how awk could easily parse this file. All we had to do was set the FS (field separator) variable to ":".

After the FS variable is set correctly, awk can be configured to parse almost any type of structured data, as long as the data is one record per line. However, if you want to analyze records that occupy multiple rows, it is not enough to rely solely on setting up FS. In these cases, we also need to modify the RS record separator variable. The RS variable tells awk when the current record is over and when the new record begins.

For example, let's say we have the task of processing the following address list of federal witness protection program participants:
Jimmy the Weasel
Pleasant Drive
San Francisco, CA 12345

Big Tony
Incognito Ave.
Suburbia, WA 67890

Theoretically, we'd like awk to see each three-line address as an individual record, rather than as three separate records. The code would become very simple if awk saw the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and ZIP code as field $3. The following code does just what we want:
BEGIN {
	FS="\n"
	RS=""
}

Above, setting FS to "\n" tells awk that each field appears on its own line. By setting RS to "", we also tell awk that each address record is separated by a blank line. Once awk knows how the input is formatted, it can do all the parsing work for us, and the rest of the script is simple. Let's look at a complete script that will parse this address list and print each address record on a single line, with each field separated by a comma.
address.awk:
BEGIN {
	FS="\n"
	RS=""
}
{
	print $1 ", " $2 ", " $3
}

If this script is saved as address.awk, and the address data is stored in a file called address.txt, you can execute the script by typing "awk -f address.awk address.txt". The output looks like this:
Jimmy the Weasel, Pleasant Drive, San Francisco, CA 12345
Big Tony, Incognito Ave., Suburbia, WA 67890
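Here's a runnable check of the address.awk logic, feeding the two sample records inline rather than from an address.txt file:

```shell
# FS="\n" makes each line a field; RS="" makes blank lines the
# record separator, so each 3-line address becomes one record.
lines=$(printf 'Jimmy the Weasel\nPleasant Drive\nSan Francisco, CA 12345\n\nBig Tony\nIncognito Ave.\nSuburbia, WA 67890\n' |
        awk 'BEGIN { FS = "\n"; RS = "" }
             { print $1 ", " $2 ", " $3 }')
echo "$lines"
```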

OFS and ORS
In address.awk's print statement, you can see that awk concatenates (joins) strings that are placed next to each other on a line. We used this feature to insert a comma and a space (", ") between the three fields on each line. While this trick works, it's a bit ugly looking. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Consider this print statement:
print "Hello", "there", "Jim!"

The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string.
By default, awk produces the following output:
Hello there Jim!

This is the default output: OFS is set to " ", a single space. However, we can easily redefine OFS so that awk will insert our favorite field separator. Here's a revised version of our original address.awk program that uses OFS to output those intermediate ", " strings:

Revised address.awk:
BEGIN {
	FS="\n"
	RS=""
	OFS=", "
}
{
	print $1, $2, $3
}
Awk also has a special variable called ORS, the "output record separator." By setting ORS, which defaults to a newline ("\n"), we can control the character that's automatically printed at the end of a print statement. The default ORS value causes awk to output each new print statement on a new line. If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (and no newline), we would set ORS to " ".
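A small sketch of the single-space ORS setting described above; the three-line input is invented for the demo:

```shell
# With ORS=" ", each print ends in a space instead of a newline,
# so the three records run together on one line (with a trailing space).
joined=$(printf 'a\nb\nc\n' |
         awk 'BEGIN { ORS = " " } { print }')
echo "$joined"
```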

Converting multi-line records to a tab-delimited format
Let's say we write a script that converts the address list to a one-record-per-line, tab-delimited format for importing into a spreadsheet. After a slight modification of address.awk, it becomes clear that the program only works for three-line addresses. If awk encountered the following address, the fourth line would be discarded and never printed:
Cousin Vinnie
Vinnie's Auto Shop
City Alley
Sosueme, OR 76543

To handle situations like this, it would be good if our code took the number of fields in each record into account, printing each one in order. Right now, the code only prints the first three fields of each address. Here's the code we want:

A version of address.awk that handles addresses with any number of fields:
BEGIN {
	FS="\n"
	RS=""
	ORS=""
}
{
	x=1
	while ( x < NF ) {
		print $x "\t"
		x++
	}
	print $NF "\n"
}

First, the field separator FS is set to "\n" and the record separator RS is set to "" so that awk parses the multi-line addresses correctly, as before. Then, the output record separator ORS is set to "", which causes print to not output a newline at the end of each call. This means that if we want any text to start on a new line, we need to explicitly write print "\n".

In the main code block, a variable called x is created to hold the number of the current field being processed. Initially, it's set to 1. Then, we use a while loop (an awk looping construct identical to the one found in C) to iterate through all but the last field in the record, printing each field followed by a tab character. Finally, we print the last field followed by a newline; again, print doesn't output its own newline, because ORS is set to "". The program output looks like this — exactly what we wanted (not beautiful, but tab-delimited for easy importing into a spreadsheet):
Jimmy the Weasel	Pleasant Drive	San Francisco, CA 12345
Big Tony	Incognito Ave.	Suburbia, WA 67890
Cousin Vinnie	Vinnie's Auto Shop	City Alley	Sosueme, OR 76543
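As a runnable check of the while-loop script above, we can feed it a single four-line record and count the tab-separated fields in the output (a four-line record should yield four fields on one line):

```shell
# One 4-line record in, one tab-delimited line out; then count its fields.
fieldcount=$(printf 'Cousin Vinnie\nAuto Shop\nCity Alley\nSosueme, OR 76543\n' |
  awk 'BEGIN { FS = "\n"; RS = ""; ORS = "" }
       {
           x = 1
           while (x < NF) {       # every field but the last, tab-terminated
               print $x "\t"
               x++
           }
           print $NF "\n"         # last field ends the output line
       }' |
  awk -F'\t' '{ print NF }')
echo "$fieldcount"
```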

