Regular Expressions and File-Formatting Commands in Linux (awk/grep/sed)

Source: Internet
Author: User
I. Regular expressions

1.1 POSIX character classes for pattern matching

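[:alnum:]: 0-9, A-Z, a-z
[:alpha:]: A-Z, a-z
[:upper:]: A-Z
[:lower:]: a-z
[:digit:]: 0-9
[:space:]: whitespace, including the space and tab characters
Note: a class name is only valid inside a bracket expression, so it is written with doubled brackets, e.g. [[:digit:]].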
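
A minimal sketch of these classes in use with grep. The sample lines are made up for illustration and are not from the original article.

```shell
# Hypothetical sample lines, one per line.
sample=$(printf '%s\n' 'abc' 'ABC' '123' 'a1 b2')

# Lines made up only of digits.
digits_only=$(printf '%s\n' "$sample" | grep '^[[:digit:]]*$')

# Lines containing at least one uppercase letter.
has_upper=$(printf '%s\n' "$sample" | grep '[[:upper:]]')

# Lines containing whitespace (a space or a tab).
has_space=$(printf '%s\n' "$sample" | grep '[[:space:]]')

echo "$digits_only"   # 123
echo "$has_upper"     # ABC
echo "$has_space"     # a1 b2
```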

1.2 Basic Regular Expressions

^word: matches word at the beginning of a line.
word$: matches word at the end of a line.
.: matches exactly one character, whatever it is (a character must be present).
\: escape character; removes the special meaning of the character that follows.
*: repeats the preceding character zero or more times.
[list]: matches any one character from the listed set;
for example, [afg] matches a, f, or g.
[n1-n2]: matches any one character in the given range;
for example, [0-9] matches a digit from 0 to 9, and [A-Z] matches an uppercase letter from A to Z.
[^list]: matches any one character that is not in the list or range;
e.g. [^A-Z] matches a character that is not an uppercase letter, and [^t] matches a character that is not t.
\{n,m\}: the basic-syntax spelling of {n,m}; { and } must be escaped as \{ and \}. It matches n to m consecutive occurrences of the preceding character.
\{n\}: the basic-syntax spelling of {n}; matches exactly n occurrences of the preceding character.
\{n,\}: the basic-syntax spelling of {n,}; matches n or more consecutive occurrences of the preceding character.
Note: the special characters of a regular expression are not the same as the shell wildcards used on the command line. As a wildcard, * stands for zero or more arbitrary characters. In a regular expression, * repeats the preceding character zero or more times.
For example, there are two ways to list the file names that start with a:
Wildcard: ls -l a*
Regular expression: ls | grep '^a.*'
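
The repetition operators above can be tried on a small word list. This is only a sketch: the words and the helper name bre are hypothetical.

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' gd god good goood apple banana)

# Helper: print the words matching a basic regular expression, space-separated.
bre() { printf '%s\n' "$words" | grep -e "$1" | paste -sd' ' -; }

starts_with_a=$(bre '^a.*')      # lines that start with a
star=$(bre 'go*d')               # g, zero or more o, then d
interval=$(bre 'go\{2,3\}d')     # g, two to three o, then d
exactly_one=$(bre 'go\{1\}d')    # g, exactly one o, then d

echo "$starts_with_a"   # apple
echo "$star"            # gd god good goood
echo "$interval"        # good goood
echo "$exactly_one"     # god
```

Note that go*d also matches gd, because * allows zero repetitions; a shell wildcard g*d would instead mean "g, then anything, then d".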

1.3 Extended Regular Expressions

grep supports only basic regular expressions by default. To use extended expressions, run grep -E. egrep is the traditional shorthand (egrep = grep -E), although recent GNU grep releases warn that egrep is obsolescent in favor of grep -E.
+: one or more of the preceding RE character.
e.g. egrep -n 'go+d' a.txt
matches god, good, goood, ....
?: zero or one of the preceding RE character.
e.g. egrep -n 'go?d' a.txt
matches gd and god only.
|: alternation; matches any one of several strings.
e.g. egrep -n 'gd|good' a.txt
finds lines containing gd or good.
(): groups a sub-pattern.
e.g. egrep -n 'g(la|oo)d' a.txt
finds lines containing glad or good.
()+: one or more repetitions of a group.
e.g. egrep 'A(xyz)+C' a.txt
matches a string that starts with A, ends with C, and has one or more xyz in between.

Note: RE stands for regular expression.
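
The extended operators can be checked the same way. The sketch below uses grep -E (equivalent to egrep) with -x so that a word counts only if the whole line matches; the word list and the helper name ere are hypothetical.

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' gd god good goood glad AxyzC AxyzxyzC AC)

# Helper: print the words that match an extended regular expression as a whole line.
ere() { printf '%s\n' "$words" | grep -E -x -e "$1" | paste -sd' ' -; }

plus=$(ere 'go+d')            # one or more o
optional=$(ere 'go?d')        # zero or one o
either=$(ere 'gd|good')       # alternation
group=$(ere 'g(la|oo)d')      # grouped alternation
repeated=$(ere 'A(xyz)+C')    # one or more repetitions of the group

echo "$plus"       # god good goood
echo "$optional"   # gd god
echo "$either"     # gd good
echo "$group"      # good glad
echo "$repeated"   # AxyzC AxyzxyzC
```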

II. awk command

2.1 awk overview

Much log analysis is done in the Linux shell, and awk does most of the work there because of its powerful field-processing capabilities. This section starts with the basics and then works through examples.

A good collection of awk tutorials: http://www.cnblogs.com/chengmo/tag/awk.

awk splits each line into fields and processes them. Its general form is:

awk 'condition1 {action1} condition2 {action2} ...' filename

Note: awk works on the fields of each line, and by default fields are separated by spaces or tabs.
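
As a small illustration of the condition {action} form, the sketch below runs two condition/action pairs over made-up input (the names and scores are hypothetical). Every line is tested against both conditions.

```shell
# Hypothetical input: a name and a score separated by a space.
scores=$(printf '%s\n' 'alice 82' 'bob 47' 'carol 91')

# Two condition/action pairs in one awk program.
report=$(printf '%s\n' "$scores" |
  awk '$2 >= 60 {print $1, "pass"} $2 < 60 {print $1, "fail"}')

echo "$report"
# alice pass
# bob fail
# carol pass
```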

2.2 awk basics

(1) e.g. last -n 5 | awk '{print $1 "\t" $3}'

This takes the five most recent login records printed by last and prints only the first and third fields of each. $1 is the first field of the current line; $3 is the third.

(2) How does awk know how many lines and fields the data has? The following built-in variables help:

NF: the number of fields in the current line ($0); NR: the number of the line awk is currently processing; FS: the current field separator, whitespace by default.

e.g. in /etc/passwd the colon is the field separator. The first field is the account name and the third field is the UID. To find the entries whose third field is less than 10 and print only the first and third fields:

cat /etc/passwd | awk 'BEGIN {FS = ":"} $3 < 10 {print $1 "\t" $3}'

The first lines of the output look like this:

root    0
bin     1

Note: BEGIN specifies the action performed before the first input record is processed; it is the place to set global variables such as FS. END specifies the action performed after the last input record has been read.
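
The built-in variables can be observed on a few passwd-style records. This sketch uses made-up records instead of the real /etc/passwd so that the result does not depend on the machine.

```shell
# Hypothetical passwd-style records (name:password:UID:...).
records=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bin:x:1:1:bin:/bin:/sbin/nologin' \
  'alice:x:1000:1000::/home/alice:/bin/bash')

# BEGIN sets the separator before the first line is read.
# NR is the current line number, NF the number of fields on that line.
low_uid=$(printf '%s\n' "$records" |
  awk 'BEGIN {FS = ":"} $3 < 10 {print NR, NF, $1, $3}')

# END runs after the last line, when NR holds the total line count.
count=$(printf '%s\n' "$records" | awk 'END {print NR}')

echo "$low_uid"
# 1 7 root 0
# 2 7 bin 1
echo "$count"   # 3
```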

(3) Suppose pay.txt contains the following salary table:

Name     1st      2nd      3rd
Sean     23000    24000    25000
Zhao     21000    20000    23000
Bird2    43000    42000    41000

How can we format the output and add a total for each person?

The first line is a header, so it gets a "Total" label instead of a sum (NR == 1). From the second line onward, each line gets a total (NR >= 2).

The command is as follows:

cat pay.txt | awk '{
    if (NR == 1)
        printf "%10s %10s %10s %10s %10s\n", $1, $2, $3, $4, "Total"
}
NR >= 2 {
    total = $2 + $3 + $4
    printf "%10s %10d %10d %10d %10.2f\n", $1, $2, $3, $4, total
}'

The header test can also be written as a condition instead of an if statement:

cat pay.txt | awk 'NR == 1 {
    printf "%10s %10s %10s %10s %10s\n", $1, $2, $3, $4, "Total"
}
NR >= 2 {
    total = $2 + $3 + $4
    printf "%10s %10d %10d %10d %10.2f\n", $1, $2, $3, $4, total
}'

(4) View the last five lines of a file: tail -n 5 a.txt

View the first five lines of a file: head -n 5 a.txt

Print the number of lines in a file: awk 'END {print NR}' data.txt

(5) Count the entries in a directory

ls -l | awk 'END {print NR}'

Note: in the END block, NR holds the number of the last line read, that is, the total number of input lines. Because ls -l prints a leading "total" line, this count is one more than the number of entries; ls | awk 'END {print NR}' gives the exact number.

(6) Join every four lines into one line

awk '{if (NR % 4 == 0) {print $0} else {printf "%s", $0}}' a.xml
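
A quick check of the line-counting and line-joining one-liners on made-up input. As an assumption for readability, this sketch writes "%s " with a trailing space so that the joined fields stay separated.

```shell
# Hypothetical eight-line input.
lines=$(printf '%s\n' a b c d e f g h)

# Count lines: in END, NR is the number of the last line read.
n=$(printf '%s\n' "$lines" | awk 'END {print NR}')

# Join every four lines: printf stays on the same output line,
# print ends the line on every fourth record.
joined=$(printf '%s\n' "$lines" |
  awk '{if (NR % 4 == 0) {print $0} else {printf "%s ", $0}}')

echo "$n"        # 8
echo "$joined"
# a b c d
# e f g h
```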

2.3 awk applications

Only a few lines of the tab-separated log data are shown here:

23:59:59 BJ other other1 detail 1 0 8178526912798594496654
23:59:59 ly fang fang1 list 34 2 7564641773641447038883
23:59:59 BJ fang fang1 detail 1 0 4062479590911058005479
23:59:59 BJ fang fang1 detail 1 0 7311067232020513225874

(1) Counting PV and UV from the log with awk

PV (page views) is the number of page accesses. UV (unique visitors) is the number of distinct users who accessed the page.

The following script counts the overall PV and UV in the log as well as the PV and UV of each category (the third column):

#!/bin/sh
if [ $# -lt 1 ]; then
    echo "Usage:"
    echo "      $0 [filename]"
    exit 1
fi
if [ "$1" = "-h" ]; then
    echo "Usage: $0 [filename]"
    exit 1
fi
awk -F"\t" '{
    total[$8] ++;               # one entry per distinct user ($8)
    uvlist[$3"#"$8] ++;         # one entry per category/user pair
    if($5 == "detail")
    {
        detailpv ++;            # overall PV, counted on detail rows
        pvlist[$3] ++;          # PV per category
    }
}
END{
    print "total\t"detailpv"\t"length(total)
    for(k in uvlist)
    {
        split(k, c, "#");
        cate[c[1]] ++;          # UV per category
    }
    for(k in pvlist)
    {
        print k"\t"pvlist[k]"\t"cate[k];
    }
}' "$1"
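
To see the counting logic at work, here is a condensed sketch run on four lines modelled on the log format above. It assumes the fields are tab-separated and that the third column is spelled in lowercase; the variable names are hypothetical. Because the order of for (k in array) is unspecified in awk, the output is piped through sort, which places the total line last.

```shell
# Four sample lines in the log format, assumed to be tab-separated.
log=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  23:59:59 BJ other other1 detail 1 0 8178526912798594496654 \
  23:59:59 ly fang fang1 list 34 2 7564641773641447038883 \
  23:59:59 BJ fang fang1 detail 1 0 4062479590911058005479 \
  23:59:59 BJ fang fang1 detail 1 0 7311067232020513225874)

summary=$(printf '%s\n' "$log" | awk -F'\t' '
    { users[$8] = 1; seen[$3 "#" $8] = 1 }   # distinct users, overall and per category
    $5 == "detail" { pv++; catpv[$3]++ }     # page views are counted on detail rows only
    END {
        for (u in users) uv++
        for (k in seen) { split(k, part, "#"); catuv[part[1]]++ }
        print "total\t" pv "\t" uv
        for (c in catpv) print c "\t" catpv[c] "\t" catuv[c]
    }' | sort)

echo "$summary"
# fang   2  3
# other  1  1
# total  3  4
```

The columns are name, PV, and UV, separated by tabs in the actual output.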


(2) Computing the online rate from the log with awk

The statistics here are grouped by the second column (city) and the third column (category). The script:

#!/bin/sh
if [ $# -lt 1 ]; then
    echo "Usage:"
    echo "      $0 [filename]"
    exit 1
fi
if [ "$1" = "-h" ]; then
    echo "Usage: $0 [filename]"
    exit 1
fi
cat "$1" | awk -F"\t" '
$5 == "list" {
    if ($2 != "BJ" && $2 != "SH" && $2 != "GZ" && $2 != "SZ") next;
    citycategory = $2 "_" $3;
    citymajor = $2 "_" $4;
    if ($3 == "fang") {
        ctotal[citymajor] += $6;
        conline[citymajor] += $7;
    }
    ctotal[citycategory] += $6;
    conline[citycategory] += $7;
    listall += $6;
    listonlineall += $7;
}
$5 == "detail" {
    if ($2 != "BJ" && $2 != "SH" && $2 != "GZ" && $2 != "SZ") next;
    citycategory = $2 "_" $3;
    citymajor = $2 "_" $4;
    if ($3 == "fang") {
        dctotal[citymajor] += $6;
        dconline[citymajor] += $7;
    }
    dctotal[citycategory] += $6;
    dconline[citycategory] += $7;
    detailall += $6;
    detailonlineall += $7;
}
END {
    # Report by city and category together; asorti() is a gawk extension
    cslen = asorti(ctotal, cia);
    print "<table border=\"1\"><caption align=\"top\">Customer online rate statistics</caption><tr><td>City</td><td>Category</td><td>Total number of people on the list page</td><td>List page online rate</td><td>Total number of people on the details page</td><td>Details page online rate</td></tr>";
    for (i = 1; i <= cslen; i++) {
        split(cia[i], arr, "_");
        dc = dct = "-";
        if (cia[i] in dctotal) {
            dct = dctotal[cia[i]];
            dc = dconline[cia[i]] / dct;
        }
        printf("<tr><td>%s</td><td>%s</td><td>%d</td><td>%.2f%%</td><td>%d</td><td>%.2f%%</td></tr>", arr[1], arr[2], ctotal[cia[i]], conline[cia[i]] * 100 / ctotal[cia[i]], dct, dc * 100);
    }
    print "</table><br/>";
}'
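
The core of the script is a per-group sum followed by a ratio. A minimal sketch of just that step, in portable awk with sort standing in for gawk's asorti(), might look like this. The rows are made up, and reading column 6 as the number of visitors and column 7 as the number online is an assumption based on the table headings.

```shell
# Hypothetical tab-separated rows:
# time, city, category, subcategory, page type, visitors, online.
rows=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  00:00:01 BJ fang fang1 list 40 10 \
  00:00:02 BJ fang fang1 list 10 5 \
  00:00:03 SH fang fang2 list 20 5)

rates=$(printf '%s\n' "$rows" | awk -F'\t' '
    $5 == "list" { key = $2 "_" $3; total[key] += $6; online[key] += $7 }
    END {
        for (k in total)
            if (total[k] > 0)   # skip empty groups to avoid dividing by zero
                printf "%s %d %.2f%%\n", k, total[k], online[k] * 100 / total[k]
    }' | sort)

echo "$rates"
# BJ_fang 50 30.00%
# SH_fang 20 25.00%
```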

III. grep command

3.1 grep overview

grep compares text against a pattern and prints what matches: it searches the text with a regular expression and prints the matching lines. grep works line by line.

Format: grep [option] 'regular expression' filename

Options: -n prints the line number in front of each matching line;

-v inverts the selection, that is, prints the lines that do not match the regular expression;

-A: after; followed by a number n, it also prints the n lines after each matching line, e.g. -A3;

-B: before; followed by a number n, it also prints the n lines before each matching line.
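
A short sketch of these four options on made-up input (the five words are hypothetical):

```shell
# Hypothetical five-line text.
text=$(printf '%s\n' alpha beta gamma delta epsilon)

numbered=$(printf '%s\n' "$text" | grep -n 'gamma')    # matching line with its number
inverted=$(printf '%s\n' "$text" | grep -vn 'l')       # lines that do not contain l
after=$(printf '%s\n' "$text" | grep -A1 'gamma')      # match plus one line after
before=$(printf '%s\n' "$text" | grep -B2 'gamma')     # match plus two lines before

echo "$numbered"
# 3:gamma
echo "$inverted"
# 2:beta
# 3:gamma
echo "$after"
# gamma
# delta
echo "$before"
# alpha
# beta
# gamma
```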

3.2 grep basics

(1) Search for a specific string

grep -n 'the' a.txt: finds the lines containing "the" and shows their line numbers.

grep -vn 'the' a.txt: finds the lines that do not contain "the" and shows their line numbers.

grep -in 'the' a.txt: ignores case and finds the lines containing "the" in any capitalization (the, The, THE, ...).

(2) Search with character sets using []

grep -n 't[ae]st' a.txt: finds lines containing tast or test.

grep -n '[^g]oo' a.txt: finds oo that is not preceded by g.

grep -n '[^a-z]oo' a.txt: finds oo that is not preceded by a lowercase letter; equivalent to grep -n '[^[:lower:]]oo' a.txt
(3) Start and end of a line: ^ and $

(a) grep -n '^[^a-zA-Z]' a.txt: finds lines that do not start with a letter.

Equivalent to grep -n '^[^[:alpha:]]' a.txt

Note: ^ outside [] anchors the match at the beginning of the line; ^ as the first character inside [] negates the set.

(b) grep -n '\.$' a.txt: finds lines that end with a period. Because . normally matches any character, it is a special symbol and must be escaped.

(c) grep -n '^$' a.txt: finds the empty lines.

(d) grep -v '^$' a.txt | grep -v '^#': drops empty lines and lines starting with #.
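
The bracket and anchor patterns above can be tried on a few made-up lines (the text is hypothetical and includes one empty line and one comment line):

```shell
# Hypothetical file contents; line 3 is empty, line 4 is a comment.
text=$(printf '%s\n' 'test one.' 'tast two' '' '# a comment' 'Foo bar' '9 lives')

sets=$(printf '%s\n' "$text" | grep -n 't[ae]st')               # tast or test
not_letter=$(printf '%s\n' "$text" | grep -n '^[^[:alpha:]]')   # first character is not a letter
ends_with_dot=$(printf '%s\n' "$text" | grep -n '\.$')          # literal period at end of line
kept=$(printf '%s\n' "$text" | grep -v '^$' | grep -v '^#')     # no empty lines, no comments

echo "$sets"
# 1:test one.
# 2:tast two
echo "$not_letter"
# 4:# a comment
# 6:9 lives
echo "$ends_with_dot"
# 1:test one.
echo "$kept"
# test one.
# tast two
# Foo bar
# 9 lives
```

The empty line is not reported by '^[^[:alpha:]]' because the bracket expression still has to match one character.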

(4) Any character . and repetition *

grep -n 'g..g' a.txt: finds g, any two characters, then g.

grep -n 'g*g' a.txt: g* is zero or more g, followed by a g, so this matches every line that contains at least one g.

grep -n 'g.*g' a.txt: finds a string that starts and ends with g; .* is zero or more of any character.

(5) Limit the number of consecutive RE characters

In basic regular expressions the braces of an interval must be escaped with \ and written as \{ and \}; left unescaped, grep treats { and } as literal characters.

grep -n 'go\{2,\}g' a.txt: matches g, two or more o's, then g, that is, goog, gooog, ....
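
A sketch comparing these four patterns on one made-up word list (the words and the helper name hits are hypothetical):

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' gag goog gooog gong dog cat)

# Helper: print the words matching a basic regular expression, space-separated.
hits() { printf '%s\n' "$words" | grep -e "$1" | paste -sd' ' -; }

two_between=$(hits 'g..g')        # g, exactly two characters, g
any_between=$(hits 'g.*g')        # g, anything (or nothing), g
at_least_one_g=$(hits 'g*g')      # zero or more g followed by g
two_or_more_o=$(hits 'go\{2,\}g') # g, two or more o, g

echo "$two_between"      # goog gong
echo "$any_between"      # gag goog gooog gong
echo "$at_least_one_g"   # gag goog gooog gong dog
echo "$two_or_more_o"    # goog gooog
```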

IV. sed command

4.1 sed overview

sed usually works on whole lines; it is mainly used to operate on the lines of a file.

The format is: sed [-nefr] [action]

Action: [address]function

The addressing forms are as follows:
(1) n1[,n2]: the action applies to lines n1 through n2, written as n1,n2[action]
(2) /pattern/: the lines matching pattern (a regular expression)
(3) /pattern/,/pattern/: the lines from a match of the first pattern through a match of the second
(4) /pattern/,x: the lines from a match of pattern through line number x
(5) x,/pattern/: the lines from line number x through a match of pattern
(6) x,y!: all lines except lines x through y
(7) $: the last line
Note: a range can be given by line numbers, by regular expressions, or by a combination of the two.
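
Each addressing form can be checked with sed -n and the p function, which prints only the addressed lines. The input and the helper name addr below are hypothetical.

```shell
# Hypothetical seven-line input.
text=$(printf '%s\n' one two begin three end four five)

# Helper: print the lines selected by an address, space-separated.
addr() { printf '%s\n' "$text" | sed -n "$1" | paste -sd' ' -; }

by_number=$(addr '2,3p')               # lines 2 through 3
by_pattern=$(addr '/begin/,/end/p')    # from a pattern to a pattern
pattern_to_line=$(addr '/three/,6p')   # from a pattern to a line number
line_to_pattern=$(addr '2,/three/p')   # from a line number to a pattern
negated=$(addr '2,6!p')                # everything except lines 2 through 6
last=$(addr '$p')                      # the last line

echo "$by_number"         # two begin
echo "$by_pattern"        # begin three end
echo "$pattern_to_line"   # three end four
echo "$line_to_pattern"   # two begin three
echo "$negated"           # one five
echo "$last"              # five
```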

The functions (actions) are as follows:
(1) a: append; adds the given text as a new line after the addressed line
(2) c: change; replaces the addressed line(s) with the given text
(3) d: delete; deletes the addressed line(s)
(4) i: insert; adds the given text as a new line before the addressed line
(5) p: print; prints the addressed line(s), usually combined with -n
(6) s: substitute; in the form 's/text to be replaced/new text/g'

Note: after the a, c, and i functions, the text can be separated from the function letter by a space or by \ (the one-line forms are accepted by GNU sed).

4.2 sed basics

(1) Adding lines
Note: sed writes the result to standard output and leaves the source file unchanged. To keep the result, save it to another file.

Append a line after each matching line:
sed '/dreamb/a\appended line' a.txt: appends the line "appended line" after every line containing "dreamb"
or sed '/dreamb/a appended line' a.txt
cat /etc/passwd | sed '2a drink tea': adds the line "drink tea" after the second line
To save the changes, redirect the output to another file:

sed '/five/i four' a.txt > result.txt: inserts the line "four" before every line containing "five"
To modify the source file directly, use -i:
sed -i '/five/i four' a.txt
(2) Modify text
sed '3c\changed line' a.txt: replaces the third line with "changed line"
It is equivalent to sed '3c changed line' a.txt
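
A sketch of a, i, c, and -i on made-up input. It assumes GNU sed, whose one-line forms of a, i, and c are used here; the temporary file stands in for a.txt.

```shell
# Hypothetical three-line input.
text=$(printf '%s\n' one two five)

# Helper: run sed on the input and print the result space-separated.
edit() { printf '%s\n' "$text" | sed "$1" | paste -sd' ' -; }

appended=$(edit '/two/a three')    # new line after the match
inserted=$(edit '/five/i four')    # new line before the match
changed=$(edit '3c three')         # line 3 replaced

# -i edits the file itself instead of printing to standard output.
tmp=$(mktemp)
printf '%s\n' "$text" > "$tmp"
sed -i '/five/i four' "$tmp"
in_place=$(paste -sd' ' "$tmp")
rm -f "$tmp"

echo "$appended"   # one two three five
echo "$inserted"   # one two four five
echo "$changed"    # one two three
echo "$in_place"   # one two four five
```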
(3) Delete text
sed '1,3d' a.txt: deletes lines 1 to 3
sed '/dreamd/d' a.txt: deletes the lines containing "dreamd"
sed '/begin/,/end/d' a.txt | more: deletes the lines from a line matching "begin" through a line matching "end"
sed '$d' a.txt: deletes the last line
sed '/^th/d' a.txt: deletes the lines starting with th
(4) Replacement
Format: sed '[address[,address]]s/pattern_find/replacement_pattern/[g,p,w,n]' filename

Here g replaces every occurrence in the pattern space. By default only the first occurrence on each line is replaced.
sed '1,2s/d$/&dd/' a.txt: appends dd to each of lines 1 and 2 that ends with d (& stands for the matched text)
sed '/first/s/st/ST/' a.txt: replaces st with uppercase ST on the lines containing "first"
cat /etc/man.config | grep 'man' | sed 's/#.*$//g': deletes the comment that follows "#"
(5) Convert characters
This is usually used for case conversion.
Syntax: y/abc/xyz/
sed 'y/five/six1/' a.txt: maps the characters f, i, v, e to s, i, x, 1 respectively, so five becomes six1; every other f, v, and e in the text is converted as well
(6) Display matching lines
sed -n '5,7p' a.txt: displays lines 5 to 7 of a.txt.
Equivalent to head -n 7 a.txt | tail -n 3: the last three of the first seven lines.
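
The delete, substitute, convert, and print functions can be compared side by side on made-up input. The text and the helper name run are hypothetical, and results are joined with | so that spaces inside lines stay visible.

```shell
# Hypothetical six-line input.
text=$(printf '%s\n' 'first road' 'second' 'begin' 'secret' 'end' 'five # note')

# Helper: run sed with the given arguments and join the output lines with |.
run() { printf '%s\n' "$text" | sed "$@" | paste -sd'|' -; }

range_deleted=$(run '/begin/,/end/d')         # drop begin through end
last_deleted=$(run '$d')                      # drop the last line
suffixed=$(run '1,2s/d$/&dd/')                # & re-inserts the matched d
upper=$(run -n '/first/s/st/ST/p')            # p flag prints only changed lines
uncommented=$(run -n '/five/s/ *#.*$//p')     # strip a trailing comment
mapped=$(printf '%s\n' five first | sed 'y/five/six1/' | paste -sd'|' -)
printed=$(run -n '2,3p')
head_tail=$(printf '%s\n' "$text" | head -n 3 | tail -n 2 | paste -sd'|' -)

echo "$range_deleted"   # first road|second|five # note
echo "$last_deleted"    # first road|second|begin|secret|end
echo "$suffixed"        # first roaddd|seconddd|begin|secret|end|five # note
echo "$upper"           # firST road
echo "$uncommented"     # five
echo "$mapped"          # six1|sirst
echo "$printed"         # second|begin
echo "$head_tail"       # second|begin
```

The mapped result shows that y works character by character: the f in "first" is converted even though the word "five" does not occur in it.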
