Awk examples, Part 1

Source: Internet
Author: User
Tags: ibm developerworks

Background: Recently, while searching for awk tutorials, I came across a beginner awk series on IBM developerWorks written years ago by Daniel Robbins, the founder of Gentoo Linux. I think it is quite good, but the Chinese version on developerWorks is missing pieces: whole passages of the original are gone, which makes it confusing to read. I don't know whether this was a machine-translation problem or a layout problem. So I worked through the English version (my English is poor, so this took some effort) and translated it again, keeping a copy for myself and for any other friends who need it. Although it is aimed at beginners, it is still worth reading. I did not expect the founder of Gentoo to have written something this basic.

Address: http://www.ibm.com/developerworks/linux/library/l-awk1/index.html

Common threads: Awk by example, Part 1

A powerful language with a strange name

Summary: awk is a powerful language with a strange name. This is the first article in a series of three. In it, Daniel shows you how to quickly get up to speed with awk programming. As the series progresses, it will cover more advanced topics and end by demonstrating a real-world, advanced awk application.

In defense of awk

In this series of articles, I will turn you into a proficient awk programmer. I admit that awk does not have a pretty or fashionable name, and the GNU version of awk, called gawk, sounds downright strange. Those unfamiliar with the language hear "awk" and picture a mess of antiquated, crufty code, code that has driven even the most knowledgeable UNIX experts to the brink of insanity (causing them to type "kill -9!" over and over while keeping the coffee maker busy).

Indeed, awk does not have a good name, but it is a powerful language. Awk is well suited to text processing and report generation, and it has many carefully designed features that allow for serious programming. Unlike some other languages, awk's syntax will feel familiar: it borrows some of the best parts of languages such as C, python, and bash (although, technically, awk predates both python and bash). Awk is one of those languages that, once learned, becomes a key part of your programming toolkit.

The first awk Program

Let's start playing with awk and see how it works. At the command line, enter the following command:

$ awk '{ print }' /etc/passwd 

You will see the contents of your /etc/passwd file appear before your eyes. Now let's explain what awk did. When we called awk, we specified /etc/passwd as the input file. As awk ran, it executed the print command on each line of /etc/passwd in turn. All output was sent to standard output, and the result is identical to running cat /etc/passwd. Now let's explain the { print } code block. Like C, awk uses curly braces to group blocks of statements. Our code block contains only a single print command. When print appears by itself in awk, it prints the entire contents of the current line.

Another awk example is as follows:

$ awk '{ print $0 }' /etc/passwd 

In awk, $0 represents the entire current line, so print and print $0 do exactly the same thing. If you like, you can also create an awk program that outputs data completely unrelated to the input. Here is an example:

$ awk '{ print "" }' /etc/passwd 

Every time the "" string is passed to the print command, it prints a blank line. If you test this script, you will find that awk outputs one blank line for every line in /etc/passwd. Again, this is because awk executes your script for every line of the input file. Here is another example:

$ awk '{ print "hiya" }' /etc/passwd 

Run this script and your screen will be filled with hiya. :)

Multiple Fields

Awk is really good at handling text that has been broken into multiple logical fields, and it allows you to effortlessly reference each individual field from your script. The following script prints a list of all the user accounts on your system:

$ awk -F":" '{ print $1 }' /etc/passwd 

In the example above, we used the -F option when calling awk to specify ":" as the field separator. When awk processes the print $1 command, it prints the first field of each line of the input file. Here is another example:

$ awk -F":" '{ print $1 $3 }' /etc/passwd 

Here is an excerpt from the output of this script:

halt7 
operator11
root0
shutdown6
sync5
bin1
....etc.

As you can see, awk prints the first and third fields of the /etc/passwd file, which happen to be the username and uid fields. Now, while the script works, it is not perfect: there is no space between the two output fields! If you are used to programming in bash or python, you may have expected the print $1 $3 command to insert a space between the two fields. However, when two strings appear next to each other in an awk program, awk concatenates them without adding anything in between. The following command inserts a space between the two fields:

$ awk -F":" '{ print $1 " " $3 }' /etc/passwd 

When you call print this way, it concatenates $1, " ", and $3, creating readable output. Of course, we can also insert some text labels if needed:

$ awk -F":" '{ print "username: " $1 "\t\tuid:" $3" }' /etc/passwd 

The output of this command is as follows:

username: halt     uid:7 
username: operator uid:11
username: root uid:0
username: shutdown uid:6
username: sync uid:5
username: bin uid:1
....etc.

External scripts

For small one-liners, passing the script to awk as a command-line argument is very convenient, but when a script grows into a complex, multi-line program, you will definitely want to put it in an external file and pass that file to awk with the -f option:

$ awk -f myscript.awk myfile.in 

Putting your scripts in their own text files also lets you take advantage of additional awk features. For example, the following multi-line script does the same thing as one of our earlier one-liners: it prints the first field of every line in /etc/passwd:

BEGIN {
        FS=":"
}
{ print $1 }

The difference between these two methods is in how we set the field separator. In this script, the separator is set inside the code (by assigning the FS variable), while in the earlier example it was set by passing the -F":" option on the awk command line. Setting the separator in the script itself is usually best, if only because it means one less command-line argument to remember. We will cover the FS variable in more detail later in this article.

BEGIN and END blocks

Normally, awk executes each block of your script's code once for every line of the input file. However, in many programming situations you need to perform initialization before awk starts processing the input. For such cases, awk lets you define a BEGIN block. We used a BEGIN block in the previous example. Because awk executes the BEGIN block before it starts processing the input file, it is an excellent place to initialize the FS variable, print a heading, or initialize other global variables that you will reference later in the program.

Awk also provides another special block called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summary information that belongs at the end of the output stream.
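
To make this concrete, here is a minimal sketch of my own (the messages and the count variable are not from the original article) that uses a BEGIN block to print a heading and an END block to print a total:

BEGIN { print "Scanning input..." }       # runs once, before any input line is read
{ count = count + 1 }                     # runs once for every input line
END { print "Done: " count " lines." }    # runs once, after the last line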

Regular Expressions and code blocks

Awk allows you to use regular expressions to selectively execute a code block, based on whether the regular expression matches the current line. Here is an example script that outputs only the lines containing the character sequence "foo":

/foo/ { print } 

Of course, you can also use more complex regular expressions. The script below prints only the lines containing floating-point numbers:

/[0-9]+\.[0-9]*/ { print } 

Expressions and code blocks

There are many other ways to selectively execute a block of code. We can place any kind of boolean expression before a code block to control when the block is executed. Awk executes the block only if the preceding boolean expression evaluates to true. The following example script outputs the third field of every line whose first field equals "fred". If the first field of the current line is not "fred", awk skips the print statement and moves on to the next line:

$1 == "fred" { print $3 }

Awk offers a full selection of comparison operators, including "==", "<", ">", "<=", ">=", and "!=". In addition, awk provides the "~" and "!~" operators, which mean "matches" and "does not match" respectively. They are used by placing a variable on the left side of the operator and a regular expression on the right. In this example, the third field is printed only for lines whose fifth field contains the character sequence "root":

$5 ~ /root/ { print $3 } 

Condition Statement

Awk also offers an if statement very similar to C's. If you like, you can use an if statement to rewrite the previous script:

{
    if ( $5 ~ /root/ ) {
        print $3
    }
}

Both scripts function identically. In the first example, the boolean expression is placed outside the code block; in the second, the if statement is used inside the block to decide whether to execute the print command for each input line. Both methods are available; choose the one that best fits the rest of your script.

Here is an example of a more complicated awk if statement. As you can see, even complex, nested conditionals look just like they do in C:

{
    if ( $1 == "foo" ) {
        if ( $2 == "foo" ) {
            print "uno"
        } else {
            print "one"
        }
    } else if ( $1 == "bar" ) {
        print "two"
    } else {
        print "three"
    }
}

 

Using if statements, we can also take this code:

! /matchme/ { print $1 $3 $4 }

and convert it into this:

{
    if ( $0 !~ /matchme/ ) {
        print $1 $3 $4
    }
}

Both scripts output only the lines that do not contain the string "matchme". Again, choose whichever approach suits you best; they do the same thing.

Awk can also build complex boolean expressions with the "||" (logical or) and "&&" (logical and) operators:

( $1 == "foo" ) && ( $2 == "bar" ) { print } 

This example prints only the lines where the first field equals "foo" and the second field equals "bar".

Numeric variable

So far, we have used awk only to print strings, entire lines, and specific fields. But awk can also perform arithmetic on both integers and floating-point numbers. Using arithmetic expressions, it is easy to write a script that counts the number of blank lines in a file. Here is one that does just that:

BEGIN { x=0 }
/^$/  { x=x+1 }
END   { print "I found " x " blank lines. :)" }

In the BEGIN block, we initialize the integer variable x to 0. Then, every time awk encounters a blank line, it executes the x=x+1 statement, incrementing x by 1. After all the lines have been processed, awk executes the END block and prints a final summary reporting the number of blank lines it found.

String variables

One of the nice things about awk variables is that they are "simple and string-like." I call awk variables string-like because all awk variables are stored internally as strings. At the same time, awk variables are simple because, as long as a variable contains a numeric string, you can perform arithmetic on it and awk will automatically convert the string to a number. Look at the example below to see what I mean:

x="1.01" # We just set x to contain the *string* "1.01" x=x+1 # We just added one to a *string* print x # Incidentally, these are comments :) 

Awk will output:

2.01

Fun! Even though we assigned the string value "1.01" to x, we could still add 1 to it. We couldn't do that in bash or python. First, bash doesn't support floating-point arithmetic. And while bash does have "string" variables, they aren't "simple": to perform any arithmetic, bash forces us to wrap the expression in the clunky $(( )) construct. With python, we would have to explicitly convert the string "1.01" to a floating-point value before performing any arithmetic. With awk, it's all automatic, which keeps our code clean and clear. If we wanted to square the first field of every input line and add one, we could use this script:

{ print ($1^2)+1 } 

Play around a little and you will find that, when performing arithmetic, awk treats a variable that does not contain a valid number as if it were 0.
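
A minimal sketch of that behavior (the string "abc" is an arbitrary non-numeric value of my choosing, not from the original article):

$ awk 'BEGIN { x="abc"; print x+1 }'   # "abc" has no numeric value, so it is treated as 0
1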

A large number of operators

Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk gives us the exponentiation operator "^" (used in the previous example), the modulo operator "%", and a bunch of other handy assignment operators borrowed from C.

These include pre- and post-increment/decrement (i++, --foo) and the add/subtract/multiply/divide assignment operators (a+=3, b*=2, c/=2.2, d-=6.2). And that's not all: we also get handy modulo and exponentiation assignment operators (a^=2, b%=4).
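
Here is a small, made-up sketch showing a few of these operators in action (the starting values are arbitrary):

$ awk 'BEGIN { a=3; a+=2; a^=2; a%=7; print a }'
4

(3 plus 2 is 5, 5 squared is 25, and 25 modulo 7 is 4.)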

Field Separator

Awk has a number of special variables of its own. Some of them let you fine-tune how awk behaves, while others can be read to glean valuable information about the input. We have already touched one of these special variables, FS. As mentioned earlier, this variable lets you set the character sequence that awk expects to find between fields. When we used /etc/passwd as input, FS was set to ":". While that worked fine, FS gives us even more flexibility.

The value of FS is not limited to a single character; it can also be set to a regular expression specifying a character pattern of any length. If you are processing fields separated by one or more tabs, you would want to set FS like this:

FS="\t+" 

Above, we use the special "+" regular expression character, which means "one or more of the previous character".

If your field is separated by spaces (one or more spaces or tabs), you can use the following regular expression to set FS:

FS="[[:space:]+]" 

While this assignment would do the trick, it isn't necessary. Why? Because by default, FS is set to a single space character, which awk interprets as meaning "one or more spaces or tabs." In this particular example, the default FS setting is exactly what you want!
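
A quick sketch of that default behavior (the input string is arbitrary), showing awk splitting on runs of whitespace without FS being set at all:

$ echo "one   two      three" | awk '{ print $2 }'
two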

Complex regular expressions are no problem. For example, if your fields are separated by the word "foo" followed by three digits, the following regular expression will let you parse them correctly:

FS="foo[0-9][0-9][0-9]" 

Number of fields

The next two variables we will cover are not normally meant to be written to, but are read to gain useful information about the input. The first is the NF variable, also called the "number of fields" variable. Awk automatically sets this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:

NF == 3 { print "this particular record has three fields: " $0 } 

Of course, you can also use the NF variable in the Condition Statement as follows:

{
    if ( NF > 2 ) {
        print $1 " " $2 ":" $3
    }
}

Record Number

The record number, NR, is another handy variable. It always contains the number of the current record (awk counts the first record as record number 1). Up until now, we have been dealing with input files that contain one record per line, so in these cases NR also tells you the current line number. However, when we start processing multi-line records later in the series, this will no longer be the case, so be careful! Like NF, NR can be used to print only certain input lines:

(NR < 10 ) || (NR > 100) { print "We are on record number 1-9 or 101+" } 

Another example:

{
    # skip header
    if ( NR > 10 ) {
        print "ok, now for the real information!"
    }
}

Awk provides additional variables that can be used for all kinds of purposes; we will cover more of them in later articles. This brings us to the end of our initial exploration of awk. As the series continues, I will demonstrate more advanced awk functionality, and we will end the series with a real-world awk application. In the meantime, if you are eager to learn more, check out the resources listed below.

Resources

    • If you'd like a good old-fashioned book, O'Reilly's sed & awk, 2nd Edition is a wonderful choice.

    • Be sure to check out the comp.lang.awk FAQ. It also contains lots of additional awk links.

    • Patrick Hartigan's awk tutorial is packed with handy awk scripts.

    • Thompson's TAWK Compiler compiles awk scripts into fast binary executables. Versions are available for Windows and DOS.

    • The GNU Awk User's Guide is available for online reference.

End of Part 1!
