Use awk to format output text

Last Update:2016-04-17 Source: Internet

Author: User

Note: This is not an introductory awk tutorial; it is a collection of usage notes and examples.

awk borrows from C syntax, so traces of the C language remain in many places, such as the printf statement and the for and if constructs.

Introduction

In short, awk is a programming language and tool for processing text. It works by executing a series of commands whenever a pattern matches the input data. The awk command format is:

awk 'pattern { action }' filenames

awk can read the files named on the command line or the standard input piped from a previous command. It scans the input line by line and checks whether each line matches the patterns given in the program. If a pattern matches, the associated action is executed. If no pattern matches, or once the actions for the line have finished, awk moves on to the next line, until the input ends.
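As a minimal sketch of the pattern-and-action model described above, the following reads standard input from a previous command. The sample data and the variable names are made up for illustration.

```shell
# Hypothetical sample input: three lines of "name score".
input='alice 90
bob 72
carol 85'

# The pattern /o/ selects lines that contain "o";
# the action prints the first field of each selected line.
matched=$(printf '%s\n' "$input" | awk '/o/ { print $1 }')
echo "$matched"
```

Lines that do not match the pattern ("alice 90" here) are read and silently skipped.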

Compared with sed, awk tends to split each line into several fields for processing. awk treats the input as a text database and, like a database, has the concepts of records and fields. By default the record separator is the newline and the field separator is whitespace (spaces and \t). Each input line is therefore one record, and the content of each line is split on whitespace into multiple fields. With records and fields, awk can process files flexibly.
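To make records and fields concrete, here is a small sketch with made-up input in which the fields are separated by a mix of spaces and a tab. It relies only on the default separators.

```shell
# Two records (lines); runs of spaces and tabs all count as one field separator.
fields=$(printf 'one  two\tthree\nfour five\n' |
    awk '{ print NR ": " NF " fields, last = " $NF }')
echo "$fields"
```

NR is the record number, NF the number of fields in the record, and $NF the last field.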

1 Syntax

A typical awk program is laid out as follows:

awk 'BEGIN {stat1} BEGIN {stat2}
pattern1 {action1} pattern2 {action2} ... patternn {actionn} {default action, unconditional, always executed} END {stat1} END {stat2}'
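A concrete instance of this layout might look like the sketch below, with one BEGIN block, one pattern rule, one unconditional rule, and one END block. The input data and the variable names (sum, big, report) are invented for the example.

```shell
report=$(printf 'apple 3\nbanana 5\ncherry 7\n' | awk '
BEGIN   { print "start" }      # runs once, before any input is read
$2 > 4  { big++ }              # pattern rule: count rows whose second field exceeds 4
        { sum += $2 }          # no pattern: runs for every row
END     { print "rows=" NR, "sum=" sum, "big=" big }   # runs once, after the last row
')
echo "$report"
```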

BEGIN runs before any text is processed and is generally used to set FS, OFS, RS, ORS, and so on. After the BEGIN part completes, awk reads the first line of input, fills variables such as $0, $1, $2, NR, and NF from it, and enters the main processing stage. After all rows have been processed, awk runs the END part, which is generally used to summarize and print reports.

The main processing stage is a built-in loop. Each cycle reads one row of data, and each row is tested against every pattern in turn: if the line satisfies pattern1, action1 is executed; if it satisfies pattern2, action2 is executed; and so on. There can also be a default action, that is, an action in {} with no pattern, which is always executed. The BEGIN and END sections are both optional. Patterns can be written as follows:

/reg/: match reg anywhere in the line; the action runs when the line matches.

!/reg/: the action runs only when the line does not match reg.

$1 ~ /reg/: match reg in the first field only.

$1 !~ /reg/: the first field does not match reg.

NR >= 2: process from the second line onward.
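The pattern forms listed above can be compared side by side with a small sketch. The input and the helper function run are hypothetical, and each variable holds the first field of the lines that a given pattern selects.

```shell
input='id note
reg1 ok
x2 reg-late
x3 fine'

# Hypothetical helper: run one awk program against the sample input.
run() { printf '%s\n' "$input" | awk "$1"; }

any_match=$(run '/reg/ { print $1 }')         # "reg" anywhere in the line
no_match=$(run '!/reg/ { print $1 }')         # lines without "reg"
field_match=$(run '$1 ~ /reg/ { print $1 }')  # "reg" in the first field only
field_miss=$(run '$1 !~ /reg/ { print $1 }')  # first field lacks "reg"
from_second=$(run 'NR >= 2 { print $1 }')     # skip the header line

printf '%s\n' "$any_match" "$field_match"
```

Note that "x2 reg-late" matches /reg/ but not $1 ~ /reg/, because the text sits in the second field.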

The pattern part, and the if and for parts described later, can use the comparison, matching, and logical operators listed in section 4 (items 14 to 16).

2 Built-in variables

Variable	Description
$0	The current record (this variable stores the content of the entire row).
$1 ~ $N	The nth field of the current record. Fields are separated by FS.
FS	The input field separator. The default is space or \t.
NF	The number of fields in the current record, that is, the number of columns.
NR	The number of records read so far, that is, the row number, starting from 1. With multiple input files, this value keeps increasing.
FNR	The current record number. Unlike NR, this is the row number within each file.
RS	The input record separator. The default is a newline.
OFS	The output field separator. The default is a space.
ORS	The output record separator. The default is a newline.
FILENAME	The name of the current input file.
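The sketch below exercises several of these variables. The colon-separated input, the temporary files f1 and f2, and the shell variable names are assumptions made for the example.

```shell
# FS and OFS set in BEGIN; NR, NF, $1 and $NF read for each record.
vars=$(printf 'a:b:c\nd:e\n' |
    awk 'BEGIN { FS = ":"; OFS = "-" } { print NR, NF, $1, $NF }')
echo "$vars"

# NR keeps counting across files, while FNR restarts for each file.
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/f1"
printf 'z\n'    > "$tmp/f2"
counts=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR }' f1 f2)
rm -r "$tmp"
echo "$counts"
```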

3 if and for statements

# A {} block can always hold several parallel actions, separated by ";". Below, {action1} and {action1; action2; ...} both stand for the body of a {} block, and there is no difference between the two; the second form simply makes it visible that several actions can be performed.

# for statement

for (i = 1; i <= NF; i++) {action1; action2; ...}   # use semicolons to separate multiple actions inside {}
for (i = 1; i <= NF; i++) if (...) ...; else if (...) ...; else ...   # for followed by an if structure
for (i = 1; i <= NF; i++) printf "for add"   # simple loop that prints

# if statement

if ($1 ~ /reg/) {action1} else if ($1 ~ /reg2/) {action2} else {action3}   # the else if part is optional
if ($1 ~ /reg/ && $2 ~ /reg2/) {action}   # combine multiple conditions with "&&" and "||"
if ($1 ~ /reg/ || NR >= 5) {action}

# Mixing if and for

{for (i = 1; i <= NF; i++) if (...) printf "test"; else if (...) printf "test2"; else printf "test3"; print "not_for"}
# print "not_for" is a separate action parallel to the for loop; it sits outside the loop and runs only once per record
{for (i = 1; i <= NF; i++) {if (...) printf "test"; else if (...) printf "test2"; else printf "test3"; print "in_for"}; print "not_in_for"}
{for (i = 1; i <= NF; i++) {if (...) {s1; s2} else if (...) {s3; s4} else {s5; s6}; print "test"}}   # no semicolon before else if or else when braces are used
{for (i = 1; i <= NF; i++) printf "for_add"; if (...) ...; else if (...) ...; else ...}   # the if is not inside the for loop

The scope of a for loop is one of the following:

The if/else statement that immediately follows it
The multiple actions inside the {} that follow it
The first ordinary action that follows it
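The effect of these scope rules can be seen in a short sketch. The input 'a b c' and the variable names are made up for illustration.

```shell
# Without braces the loop body is only the first statement,
# so print "end" runs once per record, after the loop.
no_braces=$(echo 'a b c' |
    awk '{ for (i = 1; i <= NF; i++) printf "%s,", $i; print "end" }')

# With braces both statements belong to the loop and run on every iteration.
braces=$(echo 'a b c' |
    awk '{ for (i = 1; i <= NF; i++) { printf "%s,", $i; print "in" } }')

echo "$no_braces"
echo "$braces"
```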

The scope of an if statement is one of the following:

The first action that follows the if
The multiple actions inside the {} that follow the if
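A minimal sketch of both if scopes, using invented score data: the first two branches each control a single statement, and the last branch uses {} to group two statements. The second command combines conditions with ||.

```shell
scores='ann 95
bob 72
cy 40'

grades=$(printf '%s\n' "$scores" | awk '{
    if ($2 >= 90)      grade = "A"                  # single statement in scope
    else if ($2 >= 60) grade = "B"
    else               { grade = "F"; failed++ }    # braces group two statements
    print $1, grade                                 # outside the if, runs for every row
}
END { print "failed=" failed }')

# Either condition is enough to print the name.
flagged=$(printf '%s\n' "$scores" | awk '{ if ($1 ~ /^a/ || $2 < 50) print $1 }')

echo "$grades"
echo "$flagged"
```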

4 awk tips

1: The regular expressions used by awk are EREs (extended regular expressions).

2: If OFS is set in BEGIN, it affects print $0 only after $0 has been rebuilt, for example by assigning to a field ($1 = $1). print with comma-separated arguments always uses OFS.

3: The difference between printf and print: printf does not append a newline automatically, while print does.

4: The return value of gsub is not the string after replacement but the number of replacements made.

5: String constants must be enclosed in "", otherwise they are treated as variables, for example $1 = "ipaddress".

6: The awk for loop is C-style, that is, for (init; cond; step), unlike the shell's for i in ...

7: Multiple separators can be used in awk. Put them in square brackets and wrap them in '' so that the shell does not interpret them, for example awk -F '[ :\t]', which uses space, colon, and tab as separators.

8: next statement: reads the next input line and restarts from the top of the awk program's list of rules. It is generally used to skip certain lines.

9: awk can match multiple conditions: awk '/kobe/ && /james/' matches rows that contain both kobe and james.

10: The default value of FS is a single space, which awk treats as [ \t\n]+. The default value of OFS is a space, and the default value of both RS and ORS is a newline.

11: There are two ways to select a row: 1: NR == row number; 2: a regular expression, such as /Love$/.

12: exit statement: ends the awk program, but does not skip the END block.

13: $1 ... $n indicate the columns (fields); $0 indicates the entire row.

14: Comparison operators available in awk: ==, !=, >, <, >=, <=

15: String matching: ~ means match, !~ means no match.

16: && combines multiple conditions with AND, || combines multiple conditions with OR.

17: {s1; s2; s3; ...}: multiple statements are separated by semicolons, including inside if and else bodies.

18: print without any arguments is equivalent to print $0 and prints the entire record.
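Several of these tips are easy to check from the command line. The sketch below covers tips 2, 4, 7, 8, 9, and 12; all input strings and variable names are made up for illustration.

```shell
# Tip 2: $0 is rebuilt with OFS only after a field has been assigned.
ofs_plain=$(echo 'a b c'   | awk 'BEGIN { OFS = "-" } { print $0 }')
ofs_rebuilt=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $1 = $1; print $0 }')

# Tip 4: gsub returns the number of replacements, not the new string.
gsub_count=$(echo 'banana' | awk '{ n = gsub(/a/, "A"); print n, $0 }')

# Tip 7: space, colon and tab all act as field separators.
multi_sep=$(printf 'a:b c\td\n' | awk -F '[ :\t]' '{ print NF, $4 }')

# Tip 8: next skips the remaining rules for comment lines.
skipped=$(printf '# comment\ndata 1\n# more\ndata 2\n' |
    awk '/^#/ { next } { print $2 }')

# Tip 9: both patterns must match the same line.
both=$(printf 'kobe and james\nkobe only\n' | awk '/kobe/ && /james/')

# Tip 12: exit stops reading input, but the END block still runs.
exit_end=$(printf '1\n2\n3\n' |
    awk 'NR == 2 { exit } { print } END { print "END ran" }')

printf '%s\n' "$ofs_plain" "$ofs_rebuilt" "$gsub_count" "$multi_sep"
```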

awk string functions

Function	Description
gsub(Ere, Repl, [In])	Works exactly like the sub function, except that every occurrence of the regular expression is replaced.
sub(Ere, Repl, [In])	Replaces the first occurrence of the extended regular expression Ere in the string In with the string Repl, and returns the number of substitutions made. An & (ampersand) in Repl is replaced by the part of In that matched Ere. If In is not given, the entire record ($0) is used.
index(String1, String2)	Returns the position, counting from 1, at which String2 first occurs in String1. Returns 0 (zero) if String2 does not occur in String1.
length[(String)]	Returns the length, in characters, of String. If String is not given, returns the length of the entire record ($0).
blength[(String)]	Returns the length, in bytes, of String. If String is not given, returns the length of the entire record ($0). This function comes from AIX awk and may be missing from other implementations.
substr(String, M, [N])	Returns the substring of String that starts at position M and is at most N characters long. The first character of String is position 1. If N is not given, the substring runs from position M to the end of String.
match(String, Ere)	Returns the position, in characters and counting from 1, at which Ere first matches in String, or 0 (zero) if Ere does not match. The special variable RSTART is set to the return value. The special variable RLENGTH is set to the length of the matched string, or to -1 (negative one) if no match is found.
split(String, A, [Ere])	Splits String into the array elements A[1], A[2], ..., A[n] and returns n. The separator is the extended regular expression Ere if it is given, otherwise the current field separator (the FS special variable). The elements of A are created as strings unless the context indicates that a particular element should also have a numeric value.
tolower(String)	Returns String with each uppercase character changed to lowercase. The mapping between uppercase and lowercase is defined by the LC_CTYPE category of the current locale.
toupper(String)	Returns String with each lowercase character changed to uppercase. The mapping between uppercase and lowercase is defined by the LC_CTYPE category of the current locale.
sprintf(Format, Expr, Expr, ...)	Formats the Expr arguments according to the printf format string Format and returns the resulting string.
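The following sketch calls most of the functions in the table on one made-up string. blength is left out because it is not available in every awk. The awk variable names (s, t, u, parts) are arbitrary.

```shell
funcs=$(echo 'Hello, World' | awk '{
    s = $0
    print length(s)                      # character count of the whole string
    print index(s, "World")              # position of the first occurrence
    print substr(s, 8, 5)                # 5 characters starting at position 8
    print toupper(s)
    print tolower(s)
    n = split(s, parts, ", ")            # split on comma plus space
    print n, parts[1], parts[2]
    if (match(s, /o+/)) print RSTART, RLENGTH
    t = s; c = sub(/o/, "[&]", t)        # first match only; & is the matched text
    print c, t
    u = s; g = gsub(/l/, "L", u)         # every match; returns the count
    print g, u
    print sprintf("%03d|%-5s|", 7, "ab") # zero-padded number, left-aligned string
}')
echo "$funcs"
```

sub and gsub modify their third argument in place, which is why the sketch copies s into t and u first.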
