A 20-Minute Introduction to Awk

Last Update: 2015-10-23 Source: Internet

Author: User

What is awk

Awk is a small programming language and command-line tool. (Its name comes from the surname initials of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.) It is well suited to log processing on servers, primarily because awk operates on files, usually structured as lines of human-readable text.

I say it applies to servers because log files, dump files, or whatever text format servers end up dumping to disk can become very large, and you will have a large number of them on each server. If you have ever had to analyze a few gigabytes of files on 50 different servers without a tool like Splunk or an equivalent, you know how painful it is to fetch and download all of those files just to analyze them.

I have experienced this situation firsthand: when some Erlang nodes die and leave behind a 700MB to 4GB crash dump, or when I need to quickly skim the logs on a small personal server (a VPS) to look for a recurring pattern.

In any case, awk does more than just find data (otherwise grep or ack would be enough): it also lets you process and transform that data.

Code Structure

The code structure of an awk script is simple: it is a series of patterns and actions:

    # comment
    Pattern1 { ACTIONS; }

    # comment
    Pattern2 { ACTIONS; }

    # comment
    Pattern3 { ACTIONS; }

    # comment
    Pattern4 { ACTIONS; }

Each line of the scanned input is tested against each pattern in turn, one pattern at a time. So, if I give you a file that contains the following:

    This was line 1
    This was line 2

The line "This was line 1" is first tested against Pattern1. If the match succeeds, the actions are executed. Then "This was line 1" is tested against Pattern2. If that match fails, awk moves on to Pattern3, and so on.

Once all of the patterns have been tried, "This was line 2" goes through the same steps. The remaining lines are handled the same way until the entire file has been read.

In short, this is awk's execution model.
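The execution model can be tried out with a small sketch. It assumes a POSIX shell and awk are available; the three rules and their labels are made up for illustration. Note that a line can trigger more than one rule, because every rule is tried against every line.

```shell
out=$(printf 'This was line 1\nThis was line 2\n' | awk '
# Every rule is tried against every line, top to bottom.
/line 1/ { print "rule 1 matched: " $0 }
/line/   { print "rule 2 matched: " $0 }
/line 2/ { print "rule 3 matched: " $0 }
')
printf '%s\n' "$out"
```

The first input line triggers rules 1 and 2, and the second triggers rules 2 and 3.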

Data Types

Awk has only two main data types: strings and numbers. Even then, awk converts freely between them. A string can be interpreted as a number, with its value converted to a numeric one. If the string does not start with something that looks like a number, it is converted to 0.

Both can be assigned to variables with the = operator in the ACTIONS part of your code. Variables can be declared and used anywhere, at any time, and uninitialized variables can be used as well: their default value is the empty string, "".

Finally, awk has arrays, which are dynamic, one-dimensional associative arrays. Their syntax is var[key] = value. Awk can simulate multidimensional arrays, but it is a big hack anyway.
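The conversions and arrays described above can be checked with a short sketch. It assumes a POSIX awk; the variable names (s, never_set, count, grid) are invented for this example.

```shell
out=$(awk 'BEGIN {
    s = "23"                      # a string that looks like a number
    print s + 1                   # used as a number
    print "abc" + 0               # no leading number, so it becomes 0
    print "[" never_set "]"       # uninitialized variable used as a string
    print never_set + 0           # and used as a number
    count["GET"] = 3              # associative array keyed by strings
    count["POST"] = 1
    print count["GET"] + count["POST"]
    grid[1, 2] = "x"              # simulated 2-D key, joined with SUBSEP
    print grid[1, 2]
}')
printf '%s\n' "$out"
```

An uninitialized variable behaves as the empty string in string context and as 0 in numeric context, which is why both of the never_set lines work without any declaration.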

Patterns

Patterns fall into three main categories: regular expressions, Boolean expressions, and special patterns.

Regular Expressions and Boolean Expressions

Awk's regular expressions are fairly lightweight. They are not PCRE under awk (though gawk supports fancier features; it depends on the implementation, which you can check with awk --version), but they are sufficient for most uses:

    /admin/ { ... }     # any line that contains 'admin'
    /^admin/ { ... }    # lines that begin with 'admin'
    /admin$/ { ... }    # lines that end with 'admin'
    /^[0-9.]+ / { ... } # lines beginning with series of numbers and periods
    /(POST|PUT|DELETE)/ # lines that contain specific HTTP verbs

Note that patterns cannot capture groups for use in the ACTIONS part of the code. Patterns are strictly for matching content.

Boolean expressions are similar to those in PHP or JavaScript. In particular, you can use the && ("and"), || ("or"), and ! ("not") operators. You will find them in pretty much all C-like languages. They can operate on any regular data.

Another feature shared with PHP and JavaScript is the comparison operator ==, which does fuzzy matching. So the string "23" equals the number 23, and the expression "23" == 23 evaluates to true. The != operator is also available, and do not forget the other common ones: >, <, >=, and <=.

You can also mix them: Boolean expressions can be combined with regular expressions. The pattern /admin/ || debug == true is valid and will match whenever a line contains the word "admin" or the debug variable is equal to true.

Note that if you have a specific string or variable to match against a regular expression, ~ and !~ are the operators you want. Use them as string ~ /regex/ and string !~ /regex/.
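Here is a hedged sketch of mixed patterns on three made-up log lines, assuming a POSIX awk. Awk has no built-in true constant (an unquoted true is just an unset variable), so this sketch compares a hypothetical debug flag against 1 and sets it with -v.

```shell
out=$(printf '%s\n' 'GET /index.html 200' 'POST /admin 403' 'GET /admin 200' |
awk -v debug=0 '
/admin/ && $3 == 403        { print "denied: " $0 }      # regex AND comparison
$1 ~ /^(POST|PUT|DELETE)$/  { print "write verb: " $1 }  # regex on one field
$2 !~ /admin/ || debug == 1 { print "not admin: " $2 }   # negated match OR flag
')
printf '%s\n' "$out"
```

Only the second line is a denied write to /admin, and only the first line has a path that does not mention admin.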

It is also important to note that patterns are optional. An awk script that contains only the following:

    { ACTIONS }

simply executes ACTIONS for every line of input.

Special Patterns

There are a few special patterns in awk, but not many.

The first is BEGIN, which matches only before any line has been read from the input. This is the main place to initialize your script's variables and any other state.

The other one is END. As you might have guessed, it matches after all the input has been processed. This lets you do cleanup work and produce some final output before exiting.

The last kind of pattern is difficult to classify. It sits halfway between variables and special values, and is usually called a field. Fields deserve a section of their own.
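As a small sketch of how the two special patterns frame a run (assuming a POSIX awk; the total variable is made up for this example), the following sums a column of numbers:

```shell
out=$(printf '%s\n' 10 20 30 | awk '
BEGIN { total = 0; print "start" }              # runs before any input is read
      { total += $1 }                           # runs once per input line
END   { print "lines:", NR, "total:", total }   # runs after the last line
')
printf '%s\n' "$out"
```

BEGIN prints its message before the first number is read, and END still sees the final values of NR and total.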

Fields

An intuitive example explains fields best:

    # According to the following line
    #
    # $1         $2    $3
    # 00:34:23   GET   /foo/bar.html
    # \_____________  _____________/
    #               $0

    # Hack attempt?
    /admin.html$/ && $2 == "DELETE" {
      print "Hacker Alert!";
    }

Fields are (by default) separated by whitespace. The $0 field represents the whole line as a string. The $1 field is the first piece (before any whitespace), $2 is the next one, and so on.

A fun fact (and something to avoid in most cases): you can modify a line by assigning to its fields. For example, if you run $0 = "HAHA THE LINE IS GONE" in one block, the following patterns will operate on the modified line instead of the original one. The same goes for the other field variables.
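Field access and field assignment can be seen side by side in a minimal sketch, assuming a POSIX awk and reusing the sample request line from this section:

```shell
out=$(printf '00:34:23 GET /foo/bar.html\n' | awk '
{ print "time=" $1, "verb=" $2, "path=" $3 }  # read individual fields
{ $2 = "DELETE" }                             # assigning to a field rebuilds $0
$2 == "DELETE" { print $0 }                   # later rules see the modified line
')
printf '%s\n' "$out"
```

The third rule matches only because the second rule rewrote the line, which is exactly why this trick is best avoided in larger scripts.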

Actions

There are a bunch of possible actions, but the most common and useful ones (in my experience) are:

    { print $0; }  # prints $0. In this case, equivalent to 'print' alone
    { exit; }      # ends the program
    { next; }      # skips to the next line of input
    { a=$1; b=$0 } # variable assignment
    { c[$1] = $2 } # variable assignment (array)

    { if (BOOLEAN) { ACTION }
      else if (BOOLEAN) { ACTION }
      else { ACTION }
    }
    { for (i=1; i<x; i++) { ACTION } }
    { for (item in c) { ACTION } }

These will be the main tools in your awk toolkit, and you can use them whenever you work with files such as logs.

Variables in awk are all global. Whatever variable you define in a given block is visible to the other blocks, for every line. This severely limits how large your awk scripts can get before they become unmaintainable horrors. Keep your scripts as small as possible.
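Several of these actions can be combined in one short sketch, assuming a POSIX awk. The input lines and the hits array are made up for this example, and the global array is what lets the END block see the counts gathered by the per-line rule.

```shell
out=$(printf '%s\n' '# comment' 'GET /a' 'POST /b' 'GET /c' | awk '
/^#/ { next }                 # skip comment lines entirely
     { hits[$1]++ }           # count requests per verb in a global array
END  {
    for (verb in hits)        # iteration order is unspecified
        if (hits[verb] > 1)
            print verb, hits[verb]
}
')
printf '%s\n' "$out"
```

Because the order of a for (item in array) loop is unspecified, the sketch prints only the one verb seen more than once so that the output stays predictable.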

Functions

You can call a function with the following syntax:

    { somecall($2) }

The set of built-in functions is somewhat limited, so I simply point to the regular documentation for them.

User-defined functions are also simple:

    # function arguments are call-by-value
    function name(parameter-list) {
         ACTIONS; # same actions as usual
    }

    # return is a valid keyword
    function add1(val) {
         return val+1;
    }
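A quick sketch, assuming a POSIX awk, calls the add1 function from this section next to two standard built-ins, length and toupper:

```shell
out=$(printf '%s\n' 1 41 | awk '
function add1(val) { return val + 1 }           # user-defined function
{ print add1($1), length($0), toupper("ok") }   # mixed with built-ins
')
printf '%s\n' "$out"
```

User-defined functions can be declared anywhere at the top level of the script, alongside the pattern and action rules.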

Special Variables

In addition to regular variables (global, usable anywhere), there is a set of special variables that act a bit like configuration entries:

    BEGIN { # Can be modified by the user
      FS = ",";   # Field Separator
      RS = "\n";  # Record Separator (lines)
      OFS = " ";  # Output Field Separator
      ORS = "\n"; # Output Record Separator (lines)
    }
    { # Can't be modified by the user
      NF          # Number of Fields in the current Record (line)
      NR          # Number of Records seen so far
      ARGV / ARGC # Script Arguments
    }

I put the modifiable variables in BEGIN because that is where I prefer to override them. They can, however, be overridden anywhere in the script and then take effect on the following lines.
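The effect of the separators can be sketched on made-up comma-separated input, assuming a POSIX awk. The $1 = $1 assignment is a common idiom that forces awk to rebuild $0 with the output field separator.

```shell
out=$(printf 'alice,30,admin\nbob,25\n' | awk '
BEGIN { FS = ","; OFS = " | " }   # split on commas, join with " | "
{ $1 = $1; print NR, NF, $0 }     # record number, field count, rebuilt line
')
printf '%s\n' "$out"
```

OFS is used both between the arguments of print and between the fields of the rebuilt $0, so every column in the output is separated the same way.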

Example

These are the core elements of the awk language. I don't have many examples here because I tend to use awk for quick one-off tasks.

But I still have a few script files that I carry around to handle things and for testing. One of my favorites handles Erlang crash dumps, which look like this:

    =erl_crash_dump:0.3
    Tue Nov 18 02:52:44 2014
    Slogan: init terminating in do_boot ()
    System version: Erlang/OTP 17 [erts-6.2] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
    Compiled: Fri Sep 19 03:23:19 2014
    Taints:
    Atoms: 12167
    =memory
    total: 19012936
    processes: 4327912
    processes_used: 4319928
    system: 14685024
    atom: 339441
    atom_used: 331087
    binary: 1367680
    code: 8384804
    ets: 382552
    =hash_table:atom_tab
    size: 9643
    used: 6949
    ...
    =allocator:instr
    option m: false
    option s: false
    option t: false
    =proc:<0.0.0>
    State: Running
    Name: init
    Spawned as: otp_ring0:start/2
    Run queue: 0
    Spawned by: []
    Started: Tue Nov 18 02:52:35 2014
    Message queue length: 0
    Number of heap fragments: 0
    Heap fragment data: 0
    Link list: [<0.3.0>, <0.7.0>, <0.6.0>]
    Reductions: 29265
    Stack+heap: 1598
    OldHeap: 610
    Heap unused: 656
    OldHeap unused: 468
    Memory: 18584
    Program counter: 0x00007f42f9566200 (init:boot_loop/2 + 64)
    CP: 0x0000000000000000 (invalid)
    =proc:<0.3.0>
    State: Waiting
    ...
    =port:#Port<0.0>
    Slot: 0
    Connected: <0.3.0>
    Links: <0.3.0>
    Port controls linked-in driver: efile
    =port:#Port<0.14>
    Slot: 112
    Connected: <0.3.0>
    ...

The script produces output like the following:

    $ awk -f queue_fun.awk $PATH_TO_DUMP
    MESSAGE QUEUE LENGTH: CURRENT FUNCTION
    ======================================
    10641: io:wait_io_mon_reply/2
    12646: io:wait_io_mon_reply/2
    32991: io:wait_io_mon_reply/2
    2183837: io:wait_io_mon_reply/2
    730790: io:wait_io_mon_reply/2
    80194: io:wait_io_mon_reply/2
    ...

This is a list of the functions running in Erlang processes whose mailboxes had grown very large. Here is the script:

    # Parse Erlang Crash Dumps and correlate mailbox size to the currently running
    # function.
    #
    # Once in the procs section of the dump, all processes are displayed with
    # =proc:<0.M.N> followed by a list of their attributes, which include the
    # message queue length and the program counter (what code is currently
    # executing).
    #
    # Run as:
    #
    #    $ awk -v threshold=$THRESHOLD -f queue_fun.awk $CRASHDUMP
    #
    # Where $THRESHOLD is the smallest mailbox you want inspected. Default value
    # is 1000.
    BEGIN {
        if (threshold == "") {
            threshold = 1000 # default mailbox size
        }
        procs = 0 # are we in the =procs entries?
        print "MESSAGE QUEUE LENGTH: CURRENT FUNCTION"
        print "======================================"
    }

    # Only bother with the =proc: entries. Anything else is useless.
    procs == 0 && /^=proc/ { procs = 1 } # entering the =procs entries
    procs == 1 && /^=/ && !/^=proc/ { exit 0 } # we're done

    # Message queue length: 1210
    # 1       2     3       4
    /^Message queue length: / && $4 >= threshold { flag=1; ct=$4 }
    /^Message queue length: / && $4 < threshold  { flag=0 }

    # Program counter: 0x00007f5fb8cb2238 (io:wait_io_mon_reply/2 + 56)
    # 1       2        3                  4                        5 6
    flag == 1 && /^Program counter: / { print ct ":", substr($4,2) }
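To watch the matching rules at work without a real dump, here is a sketch that feeds them a tiny sample. The sample text is made up for illustration (it is not taken from an actual crash dump), the header lines printed by BEGIN are left out, and a POSIX shell and awk are assumed.

```shell
# Hypothetical miniature dump: two processes, then a port section.
sample='=proc:<0.1.0>
Message queue length: 5
Program counter: 0x0000000000000001 (init:boot_loop/2 + 64)
=proc:<0.2.0>
Message queue length: 1210
Program counter: 0x0000000000000002 (io:wait_io_mon_reply/2 + 56)
=port:#Port<0.0>
Message queue length: 9999'

out=$(printf '%s\n' "$sample" | awk -v threshold=1000 '
procs == 0 && /^=proc/ { procs = 1 }             # entering the proc entries
procs == 1 && /^=/ && !/^=proc/ { exit 0 }       # first non-proc section: stop
/^Message queue length: / && $4 >= threshold { flag = 1; ct = $4 }
/^Message queue length: / && $4 < threshold  { flag = 0 }
flag == 1 && /^Program counter: / { print ct ":", substr($4, 2) }
')
printf '%s\n' "$out"
```

Only the second process exceeds the threshold, and the oversized queue length listed after the =port line is never reported because the script exits as soon as it leaves the process entries.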

Did you follow along? If so, you have learned awk. Congratulations!
