Awk is a programming language that is well suited to processing text data. The name awk comes from the first letters of the surnames of its three original designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan.
Awk was first completed in 1977. A new version was released in 1985, and it is much more powerful than the original. With a very short awk program you can modify, compare, extract, print, and otherwise process text. Writing the same program in a language such as C or Pascal is inconvenient and time-consuming, and the resulting program is much larger.
Awk is not just a programming language; it is also an indispensable tool for Linux system administrators and programmers.
The awk language is easy to learn, easy to master, and particularly flexible.
Gawk is the GNU Project's version of awk. It was first completed in 1986 and has been continually improved and updated since.
Gawk contains all of awk's features.
1. Main functions of gawk
The main function of gawk is to search each line of a file (each line is a record) for a specified pattern. When a line matches the pattern, gawk performs the specified action on that line. Gawk processes every line of the input file in this way until the input ends.
Gawk is often used to:
· Select certain lines, columns, or fields of a file for output.
· Analyze how often, and where, a word appears in a document.
· Prepare formatted output based on the information in a document.
· Filter the output of other programs in a very powerful way.
· Perform calculations on the values in a document.
2. How to execute a gawk program
There are basically two ways to execute a gawk program.
If the gawk program is short, you can write it directly on the command line as follows:
gawk 'program' input-file1 input-file2 ...
Here program consists of one or more patterns and actions.
If the gawk program is longer, it is more convenient to put it in a file. In that case, run gawk as follows:
gawk -f program-file input-file1 input-file2 ...
When the gawk program is spread over more than one file, run gawk as follows:
gawk -f program-file1 -f program-file2 ... input-file1 input-file2 ...
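The two ways of running a program can be compared side by side. The following is a minimal sketch: the file names rates.txt and names.awk and the sample data are made up for illustration, and the commands call awk (gawk accepts the same program and options).

```shell
# Hypothetical sample data and file names, created in a scratch directory.
dir=$(mktemp -d)
printf 'Joe 12\nTim 15\n' > "$dir/rates.txt"

# Way 1: a short program written directly on the command line.
inline=$(awk '{ print $1 }' "$dir/rates.txt")

# Way 2: the same program stored in a file and run with -f.
printf '{ print $1 }\n' > "$dir/names.awk"
fromfile=$(awk -f "$dir/names.awk" "$dir/rates.txt")

echo "$inline"
echo "$fromfile"
rm -r "$dir"
```

Both commands should print the same two names, since only the place where the program text lives has changed.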
3. Files, records, and fields
In general, gawk processes numeric data in files, but it can also process string data. If the data is not stored in a file, you can supply input to gawk through a pipe or other redirection. Gawk can only process text files (ASCII files).
A phone book is a simple example of a file that gawk can handle. A phone book consists of many entries, each with the same format: last name, first name, address, phone number. The entries are sorted alphabetically.
In gawk, each of these entries is called a record. A record is a complete set of data. For example, the entry for Smith John in the phone book, including his address and phone number, is one record.
Each item in a record is called a field. In gawk, the field is the most basic unit. A collection of records makes up a file.
In most cases, fields are separated by a special character, such as a space, a tab, or a semicolon. These characters are called field separators. Take a look at this /etc/passwd file:
tparker:t36s62hs:501:101:Tim Parker:/home/tparker:/bin/bash
etreijs:2ys639dj3:502:101:Ed Treijs:/home/etreijs:/bin/tcsh
ychow:1h27sj:503:101:Yvonne Chow:/home/ychow:/bin/bash
You can see that the /etc/passwd file uses a colon as the field separator. Each line in /etc/passwd has seven fields: user name, password, user ID, group ID, comment, home directory, and startup shell. To find the sixth field, just count five colons.
The following phone book example, however, raises a problem:
Smith John 13 Wilson St. 555-1283
Smith John 2736 Artside Dr Apt 123 555-2736
Smith John 125 Westmount Cr 555-1726
Although we can tell that each record has four fields, gawk cannot. The phone book uses a space as the separator, so gawk takes Smith as the first field, John as the second field, 13 as the third field, and so on. As far as gawk is concerned, with a space as the field separator the first record has six fields and the second record has eight.
So we have to find a better field separator. For example, use a slash as the field separator:
Smith/John/13 Wilson St./555-1283
Smith/John/2736 Artside Dr Apt 123/555-2736
Smith/John/125 Westmount Cr/555-1726
If you do not specify another character as the field separator, gawk uses spaces and tabs as the field separator by default.
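The effect of the separator on the field count can be checked directly. This is a small sketch using one of the phone book entries; it relies on the built-in variable NF (section 14) and the -F option (section 9), and it calls awk, for which gawk can be substituted.

```shell
# With the default separator, every space starts a new field.
by_space=$(echo 'Smith John 13 Wilson St. 555-1283' | awk '{ print NF }')

# With a slash as the separator, the address stays together in field 3.
by_slash=$(echo 'Smith/John/13 Wilson St./555-1283' | awk -F/ '{ print NF ": " $3 }')

echo "$by_space"   # 6
echo "$by_slash"   # 4: 13 Wilson St.
```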
4. Patterns and actions
Each statement in the gawk language consists of two parts: a pattern and a corresponding action. Whenever the pattern matches, gawk performs the corresponding action. The pattern part is enclosed in two slashes, and the action part is enclosed in a pair of curly braces. For example:
/pattern1/{action1}
/pattern2/{action2}
/pattern3/{action3}
All gawk programs are made up of such pattern and action pairs. Either the pattern or the action can be omitted, but not both at the same time. If the pattern is omitted, the action is executed for every line of the input. If the action is omitted, the default action is executed: every input line that matches the pattern is displayed without any changes.
The following is a simple example. Because the gawk program is very short, it is written directly on the shell command line:
gawk '/tparker/' /etc/passwd
This program finds the records in the /etc/passwd file shown above that match the pattern tparker and displays them (there is no action in this case, so the default action is executed).
Let's look at one more example:
gawk '/unix/{print $2}' file2.data
This command searches the file file2.data line by line for records that contain unix and prints the second field of those records.
You can also use several pattern and action pairs in one command, for example:
gawk '/scandal/{print $1} /rumor/{print $2}' gossip_file
This command reads the file gossip_file once. For each record that contains scandal it prints the first field, and for each record that contains rumor it prints the second field.
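A quick way to see how several pattern and action pairs behave is to feed a few lines through a pipe. The sketch below uses made-up input in place of gossip_file and calls awk (gawk behaves the same way here).

```shell
# Three hypothetical records; the third matches neither pattern.
gossip=$(printf 'scandal at city hall\nold rumor mill\nquiet day\n' |
  awk '/scandal/ { print $1 } /rumor/ { print $2 }')

echo "$gossip"
```

Each record is tested against every pattern in turn, so the output follows the order of the input lines rather than the order of the patterns.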
5. Comparison and arithmetic operations
Gawk has a number of comparison operators. Here are some of the important ones:
==   Equal to
!=   Not equal to
>    Greater than
<    Less than
>=   Greater than or equal to
<=   Less than or equal to
For example:
gawk '$4 > 100' testfile
This displays the records in the file testfile whose fourth field is greater than 100.
The following table lists the basic arithmetic operators in gawk.
Operator   Description      Example
+          Addition         2+6
-          Subtraction      6-3
*          Multiplication   2*5
/          Division         8/4
^          Exponentiation   3^2 (=9)
%          Remainder        9%4 (=1)
For example:
{print $3/2}
Displays the result of dividing the third field by 2.
In gawk, operator precedence is the same as in ordinary arithmetic. For example:
{print $1+$2*$3}
Displays the result of multiplying the second field by the third field and then adding the first field.
You can also change the order of evaluation with parentheses. For example:
{print ($1+$2)*$3}
Displays the result of adding the first field to the second field and then multiplying the sum by the third field.
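The precedence rules are easy to verify on a single record. This is a small sketch with the made-up input 2 3 4, run through awk (gawk gives the same results).

```shell
# Fields: $1=2, $2=3, $3=4
out=$(echo '2 3 4' | awk '{ print $1+$2*$3, ($1+$2)*$3, $3/2, 3^2, 9%4 }')

echo "$out"   # 14 20 2 9 1
```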
6. Built-in functions
Gawk has a variety of built-in functions, which are described below.
6.1 Random numbers and mathematical functions
sqrt(x)       Returns the square root of x
sin(x)        Returns the sine of x
cos(x)        Returns the cosine of x
atan2(x, y)   Returns the arctangent of x/y
log(x)        Returns the natural logarithm of x
exp(x)        Returns e raised to the power x
int(x)        Returns the integer part of x
rand()        Returns a random number between 0 and 1
srand(x)      Sets x as the seed for rand()
6.2 String functions
· index(in, find) Returns the position in the string in at which the string find first occurs. If find does not occur in in, the return value is 0.
For example:
print index("peanut", "an")
Displays 3.
· length(string) Returns the number of characters in string.
For example:
length("abcde")
Returns 5.
· match(string, regexp) Searches string for the leftmost, longest substring that matches regexp. The return value is the position in string at which that substring starts (its index). The match function sets the system variable RSTART to that index and the system variable RLENGTH to the number of characters matched. If there is no match, it sets RSTART to 0 and RLENGTH to -1.
· sprintf(format, expression1, ...) Works like printf, but instead of displaying the result it returns it as a string.
For example:
sprintf("pi = %.2f (approx.)", 22/7)
The returned string is pi = 3.14 (approx.)
· sub(regexp, replacement, target) Finds the leftmost, longest match of regexp in the string target and replaces it with the string replacement.
For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)
The string str becomes wither, water, everywhere
· gsub(regexp, replacement, target) Works like sub, but finds every place in the string target that matches regexp and replaces each match with the string replacement.
For example:
str = "water, water, everywhere"
gsub(/at/, "ith", str)
The string str becomes wither, wither, everywhere
· substr(string, start, length) Returns the substring of string that is length characters long, starting at position start.
For example:
substr("washington", 5, 3)
The return value is ing.
If length is omitted, the returned substring runs from position start to the end of the string.
For example:
substr("washington", 5)
The return value is ington.
· tolower(string) Converts the uppercase letters in string to lowercase.
For example:
tolower("MiXeD case 123")
The return value is mixed case 123.
· toupper(string) Converts the lowercase letters in string to uppercase.
For example:
toupper("MiXeD case 123")
The return value is MIXED CASE 123.
6.3 Input and output functions
· close(filename) Closes the input or output file filename.
· system(command) Lets the user execute an operating system command; when the command finishes, control returns to the gawk program.
For example:
BEGIN {system("ls")}
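Several of the string functions can be tried together in a BEGIN block, which needs no input file. This is a minimal sketch; the test string foobar is made up for illustration, and the program is run with awk (gawk accepts it unchanged).

```shell
out=$(awk 'BEGIN {
  str = "water, water, everywhere"
  n = gsub(/at/, "ith", str)          # gsub returns the number of replacements
  print n, str
  print index("peanut", "an"), length("abcde"), substr("washington", 5, 3)
  print toupper("MiXeD case 123")
  if (match("foobar", /o+/))          # match sets RSTART and RLENGTH
    print RSTART, RLENGTH
}')

echo "$out"
```

Note that sub and gsub change the target variable in place and return a count, so the modified text is read back from str rather than from the return value.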
7. Strings and numbers
A string is a sequence of characters that gawk interprets literally. Strings are enclosed in double quotation marks. Numbers must not be enclosed in double quotes; gawk treats them as numeric values. For example:
gawk '$1 != "Tim" {print}' testfile
This command displays all records whose first field is not Tim. Without the double quotes around Tim, gawk would not work correctly.
Another example:
gawk '$1 == "50" {print}' testfile
This command displays all records whose first field is the same as the string 50. Gawk ignores the numeric value of the first field and simply compares the characters literally, so a first field of 50.0 does not match even though it is numerically equal to 50. In this sense the string 50 and the number 50 are not the same thing.
8. Formatted output
An action can produce more elaborate output. For example:
gawk '$1 != "Tim" {print $1, $5, $6, $2}' testfile
This displays the first, fifth, sixth, and second fields of every record in the file testfile whose first field is not Tim.
You can also add strings to the print action, for example:
gawk '$1 != "Tim" {print "The entry for", $1, "was not Tim.", $2}' testfile
The parts of the print action are separated by commas.
Gawk output can be made more varied by borrowing the formatted output statement of the C language. To do so, use printf instead of print. For example:
{printf "%5s likes this language\n", $2}
The %5s part of the printf tells gawk how to format the output string: it is displayed 5 characters wide. Its value is given by the last part of the printf, here the second field. \n is a newline. If the second field holds a person's name, the output is roughly as follows:
  Tim likes this language
Geoff likes this language
 Mike likes this language
  Joe likes this language
The other format control characters supported by the gawk language are as follows:
· c For a string, displays the first character; for an integer, displays the number as an ASCII character.
For example:
printf "%c", 65
Displays the letter A.
· d Displays a decimal integer.
· i Displays a decimal integer.
· e Displays a floating-point number in scientific notation.
For example:
printf "%4.3e", 1950
Displays 1.950e+03.
· f Displays the number in floating-point form.
· g Displays the number in either scientific notation or floating-point form. If the absolute value of the number is greater than or equal to 0.0001, it is displayed in floating-point form; otherwise it is displayed in scientific notation.
· o Displays an unsigned octal integer.
· s Displays a string.
· x Displays an unsigned hexadecimal integer. 10 to 15 are shown as a to f.
· X Displays an unsigned hexadecimal integer. 10 to 15 are shown as A to F.
· % Not really a format control character; %% displays a literal %.
When you use these format control characters, you can put a number in the control sequence to say how many digits or characters to use. For example, %6d means that an integer is displayed 6 characters wide. Consider the following example:
{printf "%5s works for %5s and earns %2d an hour\n", $1, $2, $3}
It produces output similar to the following:
  Joe works for  Mike and earns 12 an hour
When working with decimal data, you can specify the exact number of decimal places:
{printf "%5s earns $%.2f an hour\n", $3, $6}
Its output resembles the following:
  Joe earns $12.17 an hour
You can also format the output of a whole line using escape sequences. They are called escape sequences because gawk gives these symbols a special interpretation. The following is a list of commonly used escape sequences:
\a   Alert or bell character.
\b   Backspace.
\f   Form feed.
\n   Newline.
\r   Carriage return.
\t   Tab.
\v   Vertical tab.
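The width, precision, and format control characters above can be tried out on one line of input. This is a minimal sketch: the record Joe Mike 12.171 is made-up data, and the program is run with awk (gawk gives the same output).

```shell
out=$(echo 'Joe Mike 12.171' | awk '{
  # %5s pads each name to 5 characters; %.2f keeps two decimal places
  printf "%5s works for %5s and earns $%.2f an hour\n", $1, $2, $3
  # character, decimal, hexadecimal (lower and upper case), octal, literal %
  printf "%c %d %x %X %o %%\n", 65, 255, 255, 255, 8
}')

echo "$out"
```

Unlike print, printf does not add a line ending by itself, which is why each format string ends with \n.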
9. Changing the field separator
In gawk, the default field separator is a space or a tab. You can use the -F option on the command line to change the field separator: follow -F with the separator you want to use.
gawk -F ":" '/tparker/{print}' /etc/passwd
In this example, the field separator is set to a colon, which is what /etc/passwd uses. Note: -F must be uppercase and must come before the first quotation mark of the program.
10. Metacharacters
The gawk language has its own rules for pattern matching. For example, cat matches those three characters anywhere in a record. Sometimes you need a more specific match. If you want cat to match only the word cat and not part of a word such as concatenate, add a space at each end of the pattern:
/ cat /{print}
If you want to match both cat and CAT, you can use or (|):
/cat|CAT/{print}
In gawk, several characters have special meanings. The characters that can be used in gawk patterns are listed below:
· ^ Indicates the beginning of the field.
For example:
$3 ~ /^b/
Matches if the third field begins with the character b.
· $ Indicates the end of the field.
For example:
$3 ~ /b$/
Matches if the third field ends with the character b.
· . Matches any single character.
For example:
$3 ~ /i.m/
Matches if the third field contains an i and an m with any single character between them.
· | Represents "or".
For example:
/cat|CAT/
Matches cat or CAT.
· * Represents zero or more repetitions of a character.
For example:
/uni*x/
Matches unx, unix, uniix, uniiix, and so on.
· + Represents one or more repetitions of a character.
For example:
/uni+x/
Matches unix, uniix, and so on.
· {a,b} Represents a to b repetitions of a character (older versions of gawk need the --re-interval option for this).
For example:
/uni{1,3}x/
Matches unix, uniix, and uniiix.
· ? Represents zero or one repetition of a character.
For example:
/uni?x/
Matches unx and unix.
· [] Matches any one of the characters listed inside the brackets.
For example:
/i[bdg]m/
Matches ibm, idm, and igm.
· [^] Matches any one character that is not listed inside the brackets.
For example:
/i[^de]m/
Matches every three-character string that begins with i and ends with m, except idm and iem.
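A few of these metacharacters can be checked against a short word list. The sketch below uses made-up input words, anchors the second pattern with ^ and $ so that it tests the whole line, and runs the program with awk (gawk behaves the same way here).

```shell
words=$(printf 'unx\nunix\nuniix\nibm\niem\n' |
  awk '/uni+x/ { print "plus:", $0 }
       /^i[^de]m$/ { print "class:", $0 }')

echo "$words"
```

The word unx is rejected by + because at least one i is required, and iem is rejected by the negated bracket expression.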
11. Calling a gawk program
When you need many patterns and actions, you can write a gawk program (also called a gawk script). In a gawk program file you can omit the quotes around the patterns and actions, because in the file it is clear where each pattern and action begins and ends.
You can call a gawk program with the following command:
gawk -f script filename
This command makes gawk run the gawk program named script on the file filename.
If you do not want to use the default field separator, you can specify a new one with the -F option (you can, of course, also set it inside the gawk program). For example, to use a semicolon as the field separator:
gawk -f script -F ";" filename
If you want the gawk program to process several files, list the file names one after another:
gawk -f script filename1 filename2 filename3 ...
By default, gawk output goes to the screen, but you can use Linux redirection to send the output to a file:
gawk -f script filename > save_file
12. BEGIN and END
There are two special patterns that are very useful in gawk. The BEGIN pattern tells gawk to perform some action before it starts processing the file. BEGIN is often used to initialize values, set parameters, and so on. The END pattern is used to execute some instructions after the file has been processed, generally to print a summary or a closing comment.
All instructions to be executed in BEGIN and END must be enclosed in curly braces. BEGIN and END must be written in uppercase.
Take a look at the following example:
BEGIN {print "Starting the process the file"}
$1 == "UNIX" {print}
$2 > 10 {printf "This line has a value of %d\n", $2}
END {print "Finished processing the file. Bye!"}
This program first displays the message Starting the process the file. Then, for each record, it displays the whole record if the first field is equal to UNIX and displays a message if the second field is greater than 10. Finally, it displays the message Finished processing the file. Bye!
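The order in which BEGIN, the ordinary patterns, and END run can be observed on two lines of input. This is a minimal sketch with made-up records and shortened messages, run with awk (gawk produces the same output).

```shell
report=$(printf 'UNIX 5\nLinux 20\n' | awk '
BEGIN { print "start" }                                  # runs before any input
$1 == "UNIX" { print }                                   # tested on every record
$2 > 10 { printf "This line has a value of %d\n", $2 }
END { print "done after", NR, "records" }                # runs after the last record
')

echo "$report"
```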
13. Variables
In gawk, you assign a value to a variable with an equal sign (=):
var1 = 10
In gawk, you do not have to declare the type of a variable beforehand.
Take a look at the following example:
$1 == "Plastic" {count = count + 1}
If the first field is Plastic, the value of count is increased by 1. Before this, count should be given an initial value, usually in the BEGIN section.
The following is a more complete example:
BEGIN {count = 0}
$1 == "UNIX" {count = count + 1}
END {printf "%d occurrences of UNIX were found\n", count}
Variables can be combined with fields and numbers, so the following expressions are all valid:
count = count + $6
count = $5 - 8
count = $5 + var1
Variables can also be part of a pattern, for example:
$2 > max_value {print "Max value exceeded by", $2 - max_value}
$4 - var1 < min_value {print "Illegal value of", $4}
14. Built-in variables
The gawk language has several useful built-in variables, which are listed below:
NR         The number of records read so far.
FNR        The number of records read so far from the current file.
FILENAME   The name of the current input file.
FS         The field separator (the default is a space).
RS         The record separator (the default is a newline).
OFMT       The output format for numbers (the default is %.6g).
OFS        The output field separator.
ORS        The output record separator.
NF         The number of fields in the current record.
If you work with only one file, the values of NR and FNR are the same. With several files, however, NR counts across all files, while FNR counts only within the current file.
For example:
NF <= 5 {print "Not enough fields in the record"}
This checks whether the record has 5 or fewer fields and, if so, displays an error message.
FS is useful because it controls the field separator of the input file. For example, in the BEGIN pattern, use the following statement:
FS = ":"
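The difference between NR and FNR only shows up with more than one input file. The sketch below creates two small made-up files (one.txt and two.txt are hypothetical names), sets FS in a BEGIN pattern, and runs the program with awk (gawk gives the same result).

```shell
dir=$(mktemp -d)
printf 'a:1\nb:2\n' > "$dir/one.txt"
printf 'c:3\n' > "$dir/two.txt"

# Print the overall record number, the per-file record number,
# the field count, and the first field of every record.
counts=$(awk 'BEGIN { FS = ":" } { print NR, FNR, NF, $1 }' "$dir/one.txt" "$dir/two.txt")

echo "$counts"
rm -r "$dir"
```

On the third record NR keeps counting while FNR starts again at 1, because that record is the first line of the second file.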
15. Control structures
15.1 The if statement
The syntax of the if statement is as follows:
if (expression) {
    commands
}
else {
    commands
}
For example:
# a simple if statement
{ if ($1 == 0) {
    print "This cell has a value of zero"
  }
  else {
    printf "The value is %d\n", $1
  }
}
Let's look at another example:
# a nicely formatted if statement
{ if ($1 > $2) {
    print "The first column is larger"
  }
  else {
    print "The second column is larger"
  }
}
15.2 The while loop
The syntax of the while loop is as follows:
while (expression) {
    commands
}
For example:
# Interest calculation: computes compound interest
# inputs from a file are the amount, interest_rate and years
{ var = 1
  while (var <= $3) {
    printf("%f\n", $1 * (1 + $2) ^ var)
    var++
  }
}
15.3 The for loop
The syntax of the for loop is as follows:
for (initialization; expression; increment) {
    commands
}
For example:
# Interest calculation: computes compound interest
# inputs from a file are the amount, interest_rate and years
{ for (var = 1; var <= $3; var++) {
    printf("%f\n", $1 * (1 + $2) ^ var)
  }
}
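The compound interest loop can be run on a single record to see what it prints. This is a minimal sketch: the input 1000 0.1 3 (amount, rate, years) is made up, the output is rounded to two decimal places and prefixed with the year for readability, and the program is run with awk (gawk works the same way).

```shell
growth=$(echo '1000 0.1 3' | awk '{
  # $1 = amount, $2 = interest rate, $3 = number of years
  for (year = 1; year <= $3; year++)
    printf "%d %.2f\n", year, $1 * (1 + $2) ^ year
}')

echo "$growth"
```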
15.4 next and exit
The next statement tells gawk to move on to the next record in the file, regardless of what it is doing now. The syntax is as follows:
{ command1
  command2
  command3
  next
  command4
}
As soon as the program reaches the next statement, it jumps to the next record and starts executing the commands from the beginning. In this example, therefore, command4 is never executed.
When the program encounters the exit statement, it jumps to the end of the program and executes the END pattern, if there is one.
16. Arrays
The gawk language supports arrays. Arrays do not have to be initialized beforehand. An array element is assigned as follows:
arrayname[num] = value
Take a look at the following example:
# reverse lines in a file
{line[NR] = $0}       # remember each line
END {var = NR         # output lines in reverse order
  while (var > 0) {
    print line[var]
    var--
  }
}
This program reads every line of a file and displays the lines in reverse order. It uses NR as the array subscript to store each record of the file, and then displays the records one at a time, starting with the last one.
17. User-defined functions
Complex gawk programs can often be simplified by defining your own functions. A user-defined function is called in the same way as a built-in function. A function definition can be placed anywhere in the gawk program.
The format of a user-defined function is as follows:
function name(parameter-list) {
    body-of-function
}
Here name is the name of the function being defined. A valid function name can consist of letters, digits, and underscores, but cannot begin with a digit. parameter-list is the list of the function's parameters, separated by commas. body-of-function consists of gawk statements; it is the most important part of the definition, because it determines what the function actually does.
The following example adds the square of the first field of each record to the square of the second field.
{print "sum =", SquareSum($1, $2)}
function SquareSum(x, y) {
    sum = x * x + y * y
    return sum
}
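The SquareSum example can be run as follows. This sketch makes one change that is not in the example above: sum is listed as an extra parameter, which is the usual awk way of making a variable local to a function (otherwise it is global). The input values are made up, and the program is run with awk (gawk accepts it as is).

```shell
sums=$(printf '3 4\n1 2\n' | awk '
# The extra parameter "sum" acts as a local variable.
function SquareSum(x, y,    sum) {
  sum = x * x + y * y
  return sum
}
{ print "sum =", SquareSum($1, $2) }
')

echo "$sums"
```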
At this point we know the basic usage of gawk. The gawk language is very easy to learn. For example, you can write a small gawk program to calculate the number and total size of all the files in a directory. In another language, such as C, this would be quite troublesome, whereas gawk needs only a few lines to do the job.
18. Several examples
Finally, here are some examples of gawk programs:
gawk '{if (NF > max) max = NF}
END {print max}'
This program displays the largest number of fields found in any input line.
gawk 'length($0) > 80'
This program displays every line that is longer than 80 characters. Only the pattern is given here, so the default action, displaying the whole record, is used.
gawk 'NF > 0'
Displays every line that has at least one field. This is an easy way to delete all the blank lines from a file.
gawk 'BEGIN {for (i = 1; i <= 7; i++)
print int(101 * rand())}'
This program displays 7 random numbers in the range 0 to 100.
ls -l files | gawk '{x += $5}; END {print "total bytes: " x}'
This program displays the total number of bytes in all the specified files (the size is the fifth column of typical ls -l output).
expand file | gawk '{if (x < length()) x = length()}
END {print "maximum line length is " x}'
This program displays the length of the longest line in the specified file. expand converts tabs to spaces, so the comparison uses the actual right-hand column.
gawk 'BEGIN {FS = ":"}
{print $1 | "sort"}' /etc/passwd
This program displays the login names of all users in alphabetical order.
gawk '{nlines++}
END {print nlines}'
This program displays the total number of lines in a file.
gawk 'END {print NR}'
This program also displays the total number of lines in a file, but here the counting is done by gawk itself.
gawk '{print NR, $0}'
This program displays the contents of the file with a line number in front of each line; its function is similar to 'cat -n'.