Linux awk command usage details

Source: Internet
Author: User


Awk is a powerful text-analysis tool. Compared with grep (searching) and sed (editing), awk is particularly strong at data analysis and report generation. Put simply, awk reads a file line by line, splits each line into fields using whitespace as the default separator, and then analyzes and processes those fields.

Awk has three different versions: awk, nawk, and gawk. On most Linux systems, awk is gawk, the GNU implementation of AWK.

Awk takes its name from the surname initials of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is in fact a language in its own right: the AWK programming language, which its three creators formally described as a "pattern scanning and processing language". It lets you write short programs that read input files, sort data, process data, perform calculations on the input, generate reports, and much more.


Awk's report-generation capability is often used to extract data elements from large text files and format them into readable reports. A classic example is formatting log files.

Awk usage
awk 'BEGIN{ commands } pattern{ commands } END{ commands }'

Step 1: Run the statements in the BEGIN { commands } block.

Step 2: Read a line from a file or from standard input (stdin) and run the pattern { commands } block. Awk scans the input line by line, repeating this step from the first line to the last until all input has been read.

Step 3: When the end of the input stream is reached, run the END { commands } block.

The BEGIN block runs before awk starts reading lines from the input stream. It is optional; statements such as variable initialization and printing a table header typically go here.

The END block runs after awk has read every line from the input stream. It is also optional and is typically used to print summaries, such as aggregate results computed over all lines.

The pattern block is the heart of an awk program, and it too is optional: if no pattern block is provided, { print } is run by default, printing each line as it is read. This block runs once for every line awk reads.

All three parts are optional; any of them may be omitted.
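A minimal sketch showing all three blocks together (the input values here are made up for illustration): BEGIN prints a header, the middle block processes each line, and END prints a total.

```shell
printf 'a 1\nb 2\nc 3\n' | awk '
BEGIN { print "item value"; total = 0 }  # runs once, before any input
      { print $1, $2; total += $2 }      # runs for every input line
END   { print "total:", total }          # runs once, after all input
'
# prints:
# item value
# a 1
# b 2
# c 3
# total: 6
```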

Built-in Variables

List objects in a directory:

[root@localhost profile.d]# ls -lh
total 136K
-rwxr-xr-x 1 root root  766 Jul 22  2011 colorls.csh
-rwxr-xr-x 1 root root  727 Jul 22  2011 colorls.sh
-rw-r--r-- 1 root root   92 Feb 23  2012 cvs.csh
-rwxr-xr-x 1 root root   78 Feb 23  2012 cvs.sh
-rwxr-xr-x 1 root root  192 Mar 25  2009 glib2.csh
-rwxr-xr-x 1 root root  192 Mar 25  2009 glib2.sh
-rw-r--r-- 1 root root  218 Jun  6  2013 krb5-devel.csh
-rw-r--r-- 1 root root  229 Jun  6  2013 krb5-devel.sh
-rw-r--r-- 1 root root  218 Jun  6  2013 krb5-workstation.csh
-rw-r--r-- 1 root root  229 Jun  6  2013 krb5-workstation.sh
-rwxr-xr-x 1 root root 3.0K Feb 22  2012 lang.csh
-rwxr-xr-x 1 root root 3.4K Feb 22  2012 lang.sh
-rwxr-xr-x 1 root root  122 Feb 23  2012 less.csh
-rwxr-xr-x 1 root root  108 Feb 23  2012 less.sh
-rwxr-xr-x 1 root root   97 Mar  6  2011 vim.csh
-rwxr-xr-x 1 root root  293 Mar  6  2011 vim.sh
-rwxr-xr-x 1 root root  170 Jan  7  2007 which-2.sh

Try awk

ls -lh | awk '{print $1}'

Here there is no BEGIN or END, only a pattern-action block, so every line passes through this command. In awk, $n refers to the nth column; this example prints the first column of every line.

  • $0: the current record (this variable holds the content of the entire line)
  • $1 ~ $n: the nth field of the current record; fields are separated by FS
  • FS: input field separator, a space or tab by default
  • NF: number of fields in the current record, i.e. the number of columns
  • NR: number of records read so far, i.e. the line number, starting from 1; with multiple input files this value keeps increasing across files
  • FNR: record number within the current file; unlike NR, it restarts for each file
  • RS: input record separator, a newline by default
  • OFS: output field separator, also a space by default
  • ORS: output record separator, a newline by default
  • FILENAME: name of the current input file
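A small sketch of FS and OFS in action (the colon-separated input lines are made up for illustration): setting them in the BEGIN block changes how fields are split on input and joined on output.

```shell
# Read colon-separated records, re-emit selected fields comma-separated.
printf 'root:x:0\nbin:x:1\n' | awk 'BEGIN { FS = ":"; OFS = "," } { print $1, $3 }'
# prints:
# root,0
# bin,1
```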

For example, prefix each line with its line number:

[root@localhost profile.d]# ls -lh | awk '{print NR " " $1}'
1 total
2 -rwxr-xr-x
3 -rwxr-xr-x
4 -rw-r--r--
5 -rwxr-xr-x
6 -rwxr-xr-x
7 -rwxr-xr-x
8 -rw-r--r--
9 -rw-r--r--
10 -rw-r--r--
11 -rw-r--r--
12 -rwxr-xr-x
13 -rwxr-xr-x
14 -rwxr-xr-x
15 -rwxr-xr-x
16 -rwxr-xr-x
17 -rwxr-xr-x
18 -rwxr-xr-x

With these variables, the following statement is easy to understand:

root@ubuntu:~# awk -F ':' '{printf("filename:%10s,linenumber:%s,columns:%s,linecontent:%s\n",FILENAME,NR,NF,$0)}' /etc/passwd
filename:/etc/passwd,linenumber:1,columns:7,linecontent:root:x:0:0:root:/root:/bin/bash
filename:/etc/passwd,linenumber:2,columns:7,linecontent:daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
filename:/etc/passwd,linenumber:3,columns:7,linecontent:bin:x:2:2:bin:/bin:/usr/sbin/nologin
filename:/etc/passwd,linenumber:4,columns:7,linecontent:sys:x:3:3:sys:/dev:/usr/sbin/nologin
filename:/etc/passwd,linenumber:5,columns:7,linecontent:sync:x:4:65534:sync:/bin:/bin/sync
filename:/etc/passwd,linenumber:6,columns:7,linecontent:games:x:5:60:games:/usr/games:/usr/sbin/nologin
Variable

In addition to its built-in variables, awk also supports user-defined variables.

The following introduces a custom variable sum to calculate the total size of the .py files:

root@ubuntu:~# ls -l *.py | awk '{sum+=$5} END {print sum}'
574
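A shell variable can also be passed into awk with the -v option, which initializes an awk variable before any input is read. A small sketch (the variable name and input values are made up for illustration):

```shell
threshold=2
# limit is set from the shell; count is an awk variable created on first use.
printf '1\n2\n3\n4\n' | awk -v limit="$threshold" '
$1 > limit { count++ }
END        { print count " lines above " limit }
'
# prints: 2 lines above 2
```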
Statement

The conditional statements in awk are borrowed from C. They are declared as follows:

If statement

if (expression) {
    statement;
    statement;
    ... ...
}

if (expression) {
    statement;
} else {
    statement2;
}

if (expression) {
    statement1;
} else if (expression1) {
    statement2;
} else {
    statement3;
}
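A runnable sketch of the if / else if / else form above, classifying each input number (the thresholds and labels are made up for illustration):

```shell
printf '45\n72\n90\n' | awk '{
    if ($1 >= 85) {
        print $1, "high"
    } else if ($1 >= 60) {
        print $1, "medium"
    } else {
        print $1, "low"
    }
}'
# prints:
# 45 low
# 72 medium
# 90 high
```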

Loop statement

The loop statements in awk are also borrowed from C: while, do/while, for, break, and continue are all supported, with the same semantics as in C.
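A small sketch using a C-style for loop together with the NF, OFS, and ORS variables introduced earlier: print each line's fields in reverse order.

```shell
echo 'one two three' | awk '{
    # walk the fields from last to first; separate with OFS,
    # terminate the record with ORS
    for (i = NF; i >= 1; i--)
        printf "%s%s", $i, (i > 1 ? OFS : ORS)
}'
# prints: three two one
```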

Array

Because array subscripts in awk can be numbers or strings, a subscript is usually called a key. Keys and values are stored in an internal hash table. Because a hash table does not store entries in order, you may find that array contents are not displayed in the order you expect. Arrays and variables are created automatically on first use, and awk decides automatically whether a value is stored as a number or a string. Arrays in awk are typically used to collect information from records: they can compute sums, count words, or track how many times a pattern has matched.

Use an array to count the number of repeated occurrences:

[root@localhost cc]# cat test.txt
a 00
b 01
c 00
d 02
[root@localhost cc]# awk '{sum[$2]+=1}END{for(i in sum)print i"\t"sum[i]}' test.txt
00  2
01  1
02  1
Site log analysis

The following uses awk on Linux to analyze Tomcat log files, mainly counting PV and UV.

Log File Name: access_2013_05_30.log, 57.7 MB in size.

This analysis is only a simple demonstration, so the data processing is not especially precise.

Log address: http://download.csdn.net/detail/u011204847/9496357

Sample log data:

Total number of log lines:

The seventh column of printed data is the log URL:

Some knowledge used in analysis:

  • Pipes in the shell: |
    command1 | command2  # passes the output of command1 as the input of command2

  • wc -l  # count the number of lines

  • uniq -c  # prefix each output line with the number of times it occurred in the input

  • uniq -u  # only show lines that are not repeated

  • sort -nr
    -n: sort numerically
    -r: sort in reverse order
    -k: sort by the given column

  • head -3  # show the first three lines
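As a quick sanity check of these building blocks, the fragment below (with made-up input values) counts how often each value occurs and keeps the two most frequent ones, exactly the pattern used for the top-10 statistics later:

```shell
# sort groups duplicates, uniq -c counts them, sort -nr -k 1 orders
# by count descending, head keeps the top entries.
printf 'a\nb\na\nc\na\nb\n' | sort | uniq -c | sort -nr -k 1 | head -2
```

Note that uniq -c pads its counts with leading spaces, which is why sorting on column 1 with -k 1 is used to order by the count.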

Data cleansing:

1. First cleansing: remove URLs starting with /static/.

awk '($7 !~ /^\/static\//){print $0}' access_2013_05_30.log > clean_2013_05_30.log

Before removal:

After removal:

2. Second cleansing: remove image, CSS, and JS requests.

awk '($7 !~ /\.jpg|\.png|\.jpeg|\.gif|\.css|\.js/) {print $0}' clean_2013_05_30.log > clean2_2013_05_30.log

PV

PV (page views) refers to the number of page requests.

Method: count the total number of lines in the cleansed data.

Data cleansing: filter interference data out of the raw data.

awk 'BEGIN{pv=0}{pv++}END{print "pv:"pv}' clean2_2013_05_30.log > pv_2013_05_30

UV

UV (unique visitors) refers to the number of distinct visitors, i.e. the number of unique IP addresses.

Deduplicate the IP addresses, then count the remaining lines:

awk '{print $1}' clean2_2013_05_30.log | sort -n | uniq | wc -l > uv_2013_05_30

Top 10 most frequent visitor IP addresses

Count the occurrences of each IP address, then take the 10 most frequent:

awk '{print $1}' clean2_2013_05_30.log | sort -n | uniq -c |sort -nr -k 1|head -10 > top10_2013_05_30

Top 10 most requested URLs (i.e. the most popular modules of the site)

awk '{print $7}' clean2_2013_05_30.log | sort | uniq -c |sort -nr -k 1|head -10 > top10URL_2013_05_30
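A hypothetical end-to-end run on three synthetic log lines (the IP is field 1, mimicking the layout assumed above; the real log file from the article is not reproduced here):

```shell
log='1.1.1.1 - - [30/May/2013] "GET /a HTTP/1.1" 200 10
2.2.2.2 - - [30/May/2013] "GET /b HTTP/1.1" 200 20
1.1.1.1 - - [30/May/2013] "GET /a HTTP/1.1" 200 10'

# PV: total number of requests (lines)
echo "$log" | awk 'END { print "pv:" NR }'              # pv:3
# UV: number of distinct IP addresses (here, 2)
echo "$log" | awk '{ print $1 }' | sort | uniq | wc -l
```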
