The structure of an AWK program is as follows:
awk 'BEGIN{print "START"} pattern {commands} END{print "END"}' file
An awk program consists of three parts: a BEGIN statement block, an END statement block, and generic statements guarded by a pattern (a regular expression or relational expression). Any of the three parts can be omitted. The BEGIN statement block executes at the very beginning of the program and usually does initialization work; the END statement block executes at the end of the program and usually formats the final results. The pattern part works as follows: awk reads a line, checks whether the line matches the given pattern, and if it matches, executes the statements in the corresponding {} block.
Simply put, the BEGIN statement block (if any) executes first; then the data in the file is checked line by line against the pattern, and for each line that matches, the statements in {} are executed (nothing is done for lines that don't match); finally, the END statement block (if any) executes at the end of the program.
You can also feed awk data through a pipe, as follows:
cat file | awk 'BEGIN{print "START"} pattern {commands} END{print "END"}'
2.2 Hello, World
Let's take a look at how to print Hello, World in awk. As mentioned above, an AWK program consists of three parts (BEGIN statement block, pattern, END statement block), any of which can be omitted. The simplest ways to print Hello, World are as follows:
Only the BEGIN statement block:
echo | awk 'BEGIN{print "Hello, World"}'
Only the END statement block:
echo | awk 'END{print "Hello, World"}'
Only the pattern part:
Here the pattern is empty, so there is only a {} statement block; awk therefore considers every input line to match the pattern and executes the contents of the {} block.
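To make the pattern-only case concrete, here is a minimal sketch: with an empty pattern, the {} block runs for every input line, including the single empty line that echo produces.

```shell
# Empty pattern: every input line matches, so the action block runs
# once for the one (empty) line that echo emits.
echo | awk '{print "Hello, World"}'
```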
2.3 In-depth understanding of three statement blocks
Again, awk executes the BEGIN statement block before processing any input, then reads a line of data, checks whether the line satisfies the pattern, and if so executes the contents of the {} block corresponding to that pattern; it then reads the next line and performs the pattern check again, until all the data has been read. At the end of the program, it executes the statements in the END block. A standard awk program looks like this:
seq 5 | awk 'BEGIN{print "BEGIN"} $1 > 2 {print} END{print "END"}'
Output result:
BEGIN
3
4
5
END
The program first executes the BEGIN statement block and outputs BEGIN. For each line, it then checks whether the pattern $1 > 2 (explained later) is satisfied; if so, it executes the statement in the {} block, that is, print. In awk, print with nothing after it prints the contents of the current line. The END statement block executes at the end of the program and prints END.
2.4 Special variables in awk
AWK handles text very conveniently because it provides a number of built-in features and variables that allow us to easily manipulate the data.
NR: the number of records read so far (Number of Records); during execution it equals the current line number
NF: the number of fields in the current line (Number of Fields)
$0: a variable containing the textual content of the current line during execution
$1: the contents of the first field of the current line
$2: the contents of the second field of the current line
FS: the input field separator, similar to the -d option of the cut command and the -t option of the sort command
OFS: the output field separator; the default is a single space
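A minimal sketch tying these variables together (the two-line input here is made up for illustration):

```shell
# NR is the current line number, NF the field count, and $1 the first
# field of each line.
printf 'a b c\nd e\n' | awk '{print NR, NF, $1}'
# prints: 1 3 a
#         2 2 d

# Setting OFS changes the separator that a comma in print produces.
echo 'x y' | awk 'BEGIN{OFS="-"} {print $1, $2}'
# prints: x-y
```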
2.5 Examples
Here are a few simple examples, assuming you have a file emp.data with the following data:
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
Here, the first column is the employee's name, the second column is the hourly rate, and the third column is the hours worked. To get the pay of each employee, you only need the following statement:
awk '$3 > 0 {print $1, $2 * $3}' emp.data
Since awk has already split the line into fields for us, we just reference them directly: the variable $1 means the first field of the current line, that is, the name, and the variable $3 refers to the third field, that is, the hours worked.
If you want to see which employees are not working, that is, employees who worked 0 hours, just do this:
awk '$3 == 0 {print}' emp.data
or this:
awk '$3 == 0 {print $0}' emp.data
The two statements above are for demonstration: print without any arguments prints the entire contents of the current line, and $0 also represents the entire contents of the current line, so the two statements are equivalent. If you want to see only the names of the employees who worked 0 hours:
awk '$3 == 0 {print $1}' emp.data
You can also print the line number, to make it easy to see how many employees are not working:
awk '$3 == 0 {print NR, ":\t", $1}' emp.data
If you just want to see Kathy's pay:
awk '$1 == "Kathy" {print $2 * $3}' emp.data
or use a regular expression:
awk '$1 ~ /Kathy/ {print $1, $2 * $3}' emp.data
Output can also be formatted with printf in awk; printf is used in the same way as printf in the C language, for example:
awk '$3 > 0 {printf("Total pay for %s is %.2f\n", $1, $2 * $3)}' emp.data
Patterns can also be combined, for example:
awk '$2 >= 4 || $3 >= 20 {print $0}' emp.data
The output is as follows:
Beth  4.00  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
2.6 Data validation
Patterns are typically used to select the data that needs to be processed; for example, above we used $3 > 0 to select employees whose hours worked are not 0. Patterns can also be used like this:
awk 'NR < 5 {}'            # process only the first 4 lines (line number less than 5)
awk 'NR == 1, NR == 4 {}'  # lines numbered between 1 and 4
awk '/linux/ {}'           # lines that contain the pattern linux
awk '!/linux/ {}'          # lines that do not contain the pattern linux
In addition, patterns can be used to validate the data, as follows:
NF != 3   { print $0, "number of fields is not equal to 3" }
$2 < 3.35 { print $0, "rate is below minimum wage" }
$2 > 10   { print $0, "rate exceeds $10 per hour" }
$3 < 0    { print $0, "negative hours worked" }
$3 > 60   { print $0, "too many hours worked" }
2.7 The function of the BEGIN statement block
The BEGIN statement block is typically used to output header information, or to do some processing beforehand. For example, we can print a header like this:
awk 'BEGIN{print "NAME RATE HOURS"; print ""} {print}' emp.data
The output is as follows:
NAME RATE HOURS

Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
A more common usage is to set FS in the BEGIN statement block. FS is the field separator; as mentioned earlier, it is similar to the -d option of the cut command and the -t option of the sort command. For example, to view each user and their home directory using cut:
cut -d: -f1,6 /etc/passwd
It's also easy with awk:
awk 'BEGIN{FS=":"} {print $1, $6}' /etc/passwd
We can also put the awk program into a file:
# print.awk - print user and home dir
BEGIN { FS = ":" }
{ print $1, $6 }
Then to execute it, you only need to enter the following command:
awk -f print.awk /etc/passwd
or:
cat /etc/passwd | awk -f print.awk
2.8 End Statement Block
The END statement block is primarily used to output aggregated information. For example, to find out exactly how many employees there are:
awk 'END {print NR, "employees"}' emp.data
We can also easily get the total pay and the average pay per employee:
# count.awk - compute the average pay
{ pay = pay + $2 * $3 }
END { print NR, "employees"
      print "total pay is", pay
      print "average pay is", pay/NR }
awk -f count.awk emp.data
The output is as follows:
6 employees
total pay is 337.5
average pay is 56.25
2.9 Control Statements in awk
Control statements in awk have the same syntax as in C, except that there is no switch statement.
# reverse - print input in reverse order by line
{ line[NR] = $0 }   # remember each input line
END { i = NR
      while (i > 0) {
          print line[i]
          i--
      }
}

# reverse - print input in reverse order by line
{ line[NR] = $0 }   # remember each input line
END { for (i = NR; i > 0; i--)
          print line[i]
}
In addition, the for loop is slightly different: in awk, it comes in two forms:
for (i = 1; i <= NF; i++) { print $i }   # C-style counting loop
for (i in array) { print array[i] }      # loop over the subscripts of an array
2.10 Associative arrays
The so-called associative array is an array whose subscripts can be either numbers or strings. To demonstrate the use of associative arrays, consider the following examples. Suppose there is a file named file with the following data:
item1,200
item2,500
item3,900
item2,800
item1,600
In file, some items appear several times. Suppose you want to sum the values of identical items:
awk -F, '{a[$1] += $2} END{for (i in a) print i "," a[i]}' file
The output is as follows:
item1,800
item2,1300
item3,900
There are two new points of knowledge here. First, when using associative arrays, we don't need to initialize the array, and we don't have to know how many elements it holds, because we can output the contents of the array with a second for loop (the for-in form). Second, in the print statement the items are not separated by commas; instead the fields are connected directly, and adjacent strings are concatenated in awk, as follows:
str = "Hello" "World" "!"
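As a quick runnable sketch of this concatenation-by-juxtaposition rule (the strings are arbitrary):

```shell
# Adjacent string constants are concatenated with no operator between them.
awk 'BEGIN { str = "Hello" "World" "!"; print str }'
# prints: HelloWorld!
```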
If we just want to print the largest value for each item, instead of adding them up, we can do the following:
awk -F, '{if (a[$1] < $2) a[$1] = $2} END{for (i in a) print i, a[i]}' OFS=, file
The output is as follows:
item1,600
item2,800
item3,900
To count the number of occurrences of each item:
awk -F, '{a[$1]++} END{for (i in a) print i, a[i]}' file
The output is as follows:
item1 2
item2 2
item3 1
To print only the first occurrence of each item:
awk -F, '!a[$1]++' file
The output is as follows:
item1,200
item2,500
item3,900
Here there is only the pattern, without a {} statement block; the default action is to print the line.
2.11 Built-in functions in AWK
AWK provides a number of built-in functions to help users work more efficiently.
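Since no examples are given here, below is a small sketch of a few common built-ins (length, substr, toupper, and sqrt, all in POSIX awk); the input line is made up for illustration:

```shell
# length($0): length of the line; substr(s, m, n): n characters of s
# starting at position m; toupper(s): uppercase copy; sqrt(x): square root.
echo 'hello world' | awk '{print length($0), substr($1, 1, 4), toupper($2), sqrt(16)}'
# prints: 11 hell WORLD 4
```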
3. A handful of useful one-liners
Print the total number of input lines:
END {print NR}
Print the tenth input line:
NR == 10
Print the last field of every input line
{print $NF}
Print the last field of the last input line
{field = $NF} END {print field}
Print every input line with more than four fields:
NF > 4
Print every input line in which the last field is more than 4:
$NF > 4
Print the total number of fields in all input lines:
{nf = nf + NF} END {print nf}
Print the total number of lines that contain Beth
/Beth/ {nlines = nlines + 1} END {print nlines}
Print the largest first field and the line that contains it (assumes some $1 is positive):
$1 > max {max = $1; maxline = $0} END {print max, maxline}
Print every line that has at least one field:
NF > 0
Print every line longer than 80 characters:
length($0) > 80
Print the number of fields in every line followed by the line itself:
{print NF, $0}
Print the first two fields, in opposite order, of every line:
{print $2, $1}
Exchange the first two fields of every line and then print the line:
{temp = $1; $1 = $2; $2 = temp; print}
Print every line with the first field replaced by the line number
{$1 = NR; print}
Print every line after erasing the second field:
{$2 = ""; print}
Print in reverse order the fields of every line
{ for (i = NF; i > 0; i--) printf("%s ", $i)
  print "" }
Print the sum of the fields of every line:
{ sum = 0
  for (i = 1; i <= NF; i++) sum = sum + $i
  print sum }
Add up all the fields in all lines and print the sum:
{ for (i = 1; i <= NF; i++) sum = sum + $i }
END { print sum }
Print every line after replacing each field by its absolute value:
{ for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i
  print }
4. Application of the AWK program
4.1 Count word Occurrences
The first program we analyze counts the number of occurrences of each word. Because awk provides associative arrays and initializes variables to 0 by default, awk solves this problem very conveniently. I discussed this problem in a previous article, where I tried to solve it using C, C++, and shell.
This problem is simplest to solve with awk, which also makes it easy to deal with punctuation. The program is as follows:
# wordfreq - print number of occurrences of each word
# input:  text
# output: number-word pairs sorted by number
{
    gsub(/[.,:;!?(){}]/, "")    # remove punctuation
    for (i = 1; i <= NF; i++)
        count[$i]++
}
END {
    for (w in count)
        print count[w], w | "sort -rn"
}
4.2 Data processing
Although awk is not all-powerful, it solves a lot of problems; what it is best at is data processing. Suppose we have the following data:
USSR     8649   275   Asia
Canada   3852    25   North America
China    3705  1032   Asia
USA      3615   237   North America
Brazil   3286   134   South America
India    1267   746   Asia
Mexico    762    78   North America
France    211    55   Europe
Japan     144   120   Asia
Germany    96    61   Europe
England    94    56   Europe
The first column is the country, the second is the country's area, the third is its population, and the last is the continent. If we want to sort in ascending order by continent name, and within each continent in descending order of each country's population density, what should we do?
I don't know how this would be done in Excel. A benefit of learning awk is that you can solve many of the problems you would otherwise use Excel for, without having to study Excel, and awk is clearly more flexible.
To solve this kind of problem, the basic idea in awk is: prepare the data, sort it (or perform other processing), and then format the output.
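The prepare-sort-format idea can be sketched as a pipeline. This is only an illustrative sketch, not the article's own solution: it assumes tab-separated fields, uses a small made-up subset of the data, and scales density as 1000*$3/$2 (an assumption, reasonable if areas are in thousands of square miles and populations in millions):

```shell
# Prepare: emit "continent <TAB> density <TAB> country" with awk,
# then sort ascending by continent and descending (numeric) by density.
printf 'USSR\t8649\t275\tAsia\nCanada\t3852\t25\tNorth America\nJapan\t144\t120\tAsia\n' |
awk 'BEGIN { FS = "\t" } { printf("%s\t%.1f\t%s\n", $4, 1000*$3/$2, $1) }' |
sort -t "$(printf '\t')" -k1,1 -k2,2rn
```

A final formatting stage (another awk that pretty-prints the sorted stream) would complete the pipeline.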
4.3 Markov chain algorithm
For the definition of the Markov chain algorithm, see this article, in which I discussed implementations of the algorithm in various languages. Here it simply serves to demonstrate the merits of awk; the program is as follows and is not discussed further.
# Copyright (C) 1999 Lucent Technologies
# Excerpted from 'The Practice of Programming'
# by Brian W. Kernighan and Rob Pike

# markov.awk: markov chain algorithm for 2-word prefixes
BEGIN { MAXGEN = 10000; NONWORD = "\n"; w1 = w2 = NONWORD }

{   for (i = 1; i <= NF; i++) {                 # read all words
        statetab[w1,w2,++nsuffix[w1,w2]] = $i
        w1 = w2
        w2 = $i
    }
}

END {
    statetab[w1,w2,++nsuffix[w1,w2]] = NONWORD  # add tail
    w1 = w2 = NONWORD
    for (i = 0; i < MAXGEN; i++) {              # generate
        r = int(rand()*nsuffix[w1,w2]) + 1      # nsuffix >= 1
        p = statetab[w1,w2,r]
        if (p == NONWORD)
            exit
        print p
        w1 = w2                                 # advance chain
        w2 = p
    }
}
Transferred from: http://www.lvtao.net/tool/awk.html