Linux Shell Programming Notes: Regular Expressions

Source: Internet
Author: User
Tags: character classes, control characters, POSIX, printable characters, uppercase letter, egrep

Chapter 4 What is a regular expression

When writing a program or web page that processes strings, we often need to find strings that satisfy some complex rule. Regular expressions are the tool used to describe such rules. In other words, a regular expression is code that records a text rule.

Widespread use of regular expressions

Regular expressions are widely used in UNIX/Linux systems and greatly enhance the tools that support them. Common UNIX tools that support regular expressions include:

The grep family of tools, used to match lines of text;

The sed stream editor, used to modify an input stream;

String-processing languages such as awk, Python, Perl, and Tcl;

File viewers, also called pagers, such as more, page, and less;

Text editors such as ed, vi, emacs, and vim.

How to Learn Regular Expressions

Example 1

Match a line that contains the word file and, later on the same line, the word file again:

\bfile\b.*\bfile\b

To borrow a passage:

If you are looking for a hi that is followed, not far behind, by a Lucy, you should use \bhi\b.*\bLucy\b.

Here, . is another metacharacter; it matches any character except a line break. * is also a metacharacter, but it represents neither a character nor a position. It represents a quantity: it specifies that whatever precedes the * may be repeated any number of times in a row for the whole expression to match. Together, .* therefore means any number of characters that are not line breaks. The meaning of \bhi\b.*\bLucy\b is now clear: first the word hi, then any number of arbitrary characters (but no line break), and finally the word Lucy.

The line break is '\n', the character whose ASCII code is 10 (hexadecimal 0x0A).

If other metacharacters are used at the same time, we can construct a more powerful regular expression. For example:

0\d\d-\d\d\d\d\d\d\d\d matches a string that starts with 0, continues with two digits, then a hyphen "-", and ends with eight digits (that is, a Chinese phone number; of course, this example only matches numbers with a three-digit area code).

Here \d is a new metacharacter that matches one digit (0, or 1, or 2, and so on). The - is not a metacharacter; it matches only itself, the hyphen.

To avoid so many annoying repetitions, we can also write this expression as 0\d{2}-\d{8}. Here the {2} ({8}) after \d means that the preceding \d must be repeated exactly 2 times (8 times).

To be more precise, \b matches a position where the character before it and the character after it are not both \w (one is and the other is not, or one of them does not exist).
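The two expressions above can be tried from the shell. The following is a minimal sketch, assuming GNU grep: \b is a GNU extension, and because grep -E does not define \d, each digit is written as [0-9]. The function names and sample strings are invented for illustration.

```shell
# \bhi\b.*\bLucy\b : the word "hi", later followed by the word "Lucy".
find_hi_lucy() {
    grep -E '\bhi\b.*\bLucy\b'
}

# 0, two digits, a hyphen, eight digits; ^ and $ make it match the whole line.
is_phone() {
    printf '%s\n' "$1" | grep -Eq '^0[0-9]{2}-[0-9]{8}$'
}

# Only the first line contains both whole words.
printf '%s\n' 'hi there, Lucy' 'him and Lucy' 'hi, Lucylle' | find_hi_lucy

is_phone '010-12345678' && echo 'phone ok'
is_phone '010-1234567'  || echo 'too short'
```

The second and third sample lines are rejected because "him" and "Lucylle" merely contain hi and Lucy; \b requires a word boundary on both sides.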

How to practice Regular Expressions

We are not machines, and we make mistakes. A regular expression is rarely written correctly on the first attempt, so we need to keep modifying and testing it, getting closer each time, until it does what we want.

NOTE

Historically there have been three grep programs for matching text:

grep: the earliest text-matching program. It uses the Basic Regular Expressions (BRE) defined by POSIX.

egrep: extended grep. It uses Extended Regular Expressions (ERE).

fgrep: fast grep. This version matches fixed strings instead of regular expressions.

The POSIX standard released in 1992 merged the three grep versions into one. POSIX grep selects the regular expression flavor, BRE or ERE, through command-line options. fgrep and egrep can still be used on UNIX/Linux systems, but they are marked as deprecated (not recommended).
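As a small illustration of the three modes, the same pattern text can be handed to grep as a BRE (the default), as an ERE (-E), or as a fixed string (-F). The sample input below is invented for this sketch.

```shell
# Three sample lines (made up for this example).
sample() {
    printf '%s\n' 'a+b' 'aab' 'ab'
}

sample | grep    'a+b'   # BRE: + is an ordinary character, prints: a+b
sample | grep -E 'a+b'   # ERE: + means "one or more a", prints: aab and ab
sample | grep -F 'a+b'   # fixed string, no metacharacters, prints: a+b
```

grep -E and grep -F are the standard replacements for egrep and fgrep.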

Regular Expression basic metacharacters

A regular expression is a tool for describing a matching rule.

A regular expression contains two kinds of characters: ordinary characters (which carry no special meaning) and metacharacters (which have a special meaning for matching).

Metacharacters supported by both POSIX BRE and ERE

^

BRE, ERE

Anchors the match at the start of a line or string. For example, "^grep" matches all lines that start with grep.
BRE: it is special only at the beginning of a regular expression;
ERE: it is special anywhere in the regular expression.

$

BRE, ERE

Anchors the match at the end of a line or string. For example, "grep$" matches all lines that end with grep.
BRE: it is special only at the end of a regular expression;
ERE: it is special anywhere in the regular expression.

.

BRE, ERE

Matches any single character other than a newline (NUL is also excluded).
For example, 'gr.p' matches gr, followed by any one character, followed by p.

*

BRE, ERE

Matches zero or more occurrences of the single character (or expression) that precedes it.
For example, . means any character, so .* matches any string of any length.

[]

BRE, ERE

Matches any one of the characters listed inside the square brackets; a hyphen (-) denotes a range of consecutive characters.
If ^ is the first character inside the brackets, the expression matches any character that is not in the list.

Metacharacters supported only in POSIX BRE:

\{n,m\}

Interval expression: specifies how many times the single character before it is repeated.
\{n\} means exactly n times; \{n,m\} means n to m times.

\(\)

Hold space: up to 9 independent subpatterns can be stored in a single pattern.
For example, \(ab\).*\1 means that ab occurs twice, with any number of characters in between.

\n

Repeats, at this point in the pattern, the text matched by the nth subpattern enclosed in \( and \).
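The BRE-only constructs can be sketched with plain grep, which uses BRE by default. The sample lines are invented for illustration.

```shell
# Sample lines (made up for this example).
lines() {
    printf '%s\n' 'ab12' 'ab123' 'abcab' 'abcba' 'aa' 'aaaa'
}

lines | grep '[0-9]\{3\}'    # three consecutive digits, prints: ab123
lines | grep '^a\{2,3\}$'    # a line made of two or three a's, prints: aa
lines | grep '\(ab\).*\1'    # ab, anything, then ab again, prints: abcab
```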

Metacharacters supported only in POSIX ERE:

{n,m}

Same function as \{n,m\} in BRE.

+

Matches one or more occurrences of the preceding regular expression.

?

Matches zero or one occurrence of the preceding regular expression.

|

Matches either the regular expression before the | symbol or the one after it.

()

Groups the regular expression enclosed in the parentheses.
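A short sketch of the ERE-only operators, using grep -E on an invented word list:

```shell
# Sample words (made up for this example).
words() {
    printf '%s\n' 'color' 'colour' 'colouur' 'gogo' 'go' 'cat' 'dog'
}

words | grep -E '^colou?r$'     # ? : zero or one u, prints: color, colour
words | grep -E '^colou+r$'     # + : one or more u, prints: colour, colouur
words | grep -E '^(go){2}$'     # () with {n} : go exactly twice, prints: gogo
words | grep -E '^(cat|dog)$'   # | : either alternative, prints: cat, dog
```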

Additional metacharacters supported by the grep program

\<

Anchors the match at the start of a word. For example, "\<grep" matches lines that contain a word starting with grep.

\>

Anchors the match at the end of a word.
For example, "grep\>" matches lines that contain a word ending with grep.

\w

Matches a word character, that is, a letter, a digit, or an underscore: [A-Za-z0-9_].
For example, 'G\w*p' matches G, followed by zero or more word characters, followed by p.

\W

The inverse of \w: matches a non-word character, such as a period.

\b

Word boundary. For example, \bgrep\b matches only the whole word grep.
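The word anchors can be compared side by side. This sketch assumes GNU grep, since \<, \> and \b are extensions rather than part of POSIX; the input lines are invented.

```shell
# Sample lines (made up for this example).
text() {
    printf '%s\n' 'grep' 'egrep' 'grepping' 'use grep here'
}

text | grep '\<grep'     # a word starts with grep: grep, grepping, use grep here
text | grep 'grep\>'     # a word ends with grep:   grep, egrep, use grep here
text | grep '\bgrep\b'   # the whole word grep:     grep, use grep here
```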

Examples

// Show only the files whose names start with a

[houchangren@ebsdi-23260-oozie shell]$ ls | grep '^a'
add.sh
a.sh

// Show the lines containing hadoop in the files whose names start with a

[houchangren@ebsdi-23260-oozie shell]$ grep 'hadoop' a*
a.sh:HADOOP_IN_PATH="/usr/lib/hadoop/bin/hadoop"
a.sh:HADOOP=$HADOOP_HOME/bin/hadoop

// Show the lines containing hadoop in the file a.sh

[houchangren@ebsdi-23260-oozie shell]$ grep 'hadoop' a.sh
HADOOP_IN_PATH="/usr/lib/hadoop/bin/hadoop"
HADOOP=$HADOOP_HOME/bin/hadoop

// Show the lines that contain five consecutive lowercase letters (a-z)

[houchangren@ebsdi-23260-oozie shell]$ grep '[a-z]\{5\}' a.sh
HADOOP_IN_PATH="/usr/lib/hadoop/bin/hadoop"
HADOOP_DIR=`dirname "$HADOOP_IN_PATH"`/..
HADOOP=$HADOOP_HOME/bin/hadoop
HADOOP_VERSION=$($HADOOP version | awk '{if(NR == 1) {print $2;}}');

// If test is matched, es is stored and labeled 1. grep then searches for any characters (.*) followed by another es (\1). If that is found, the line is displayed. \1 refers to the text captured by the first \( \) pair.

[houchangren@ebsdi-23260-oozie shell]$ cat a.txt
I'm test; test;
so you don't worry!
[houchangren@ebsdi-23260-oozie shell]$ grep 't\(es\)t.*\1' a.txt
I'm test; test;

POSIX (the Portable Operating System Interface) adds special character classes. For example, [:alnum:] is another way of writing A-Za-z0-9. A class must be placed inside [] to form a regular expression, as in [A-Za-z0-9] or [[:alnum:]]. On Linux, all grep variants except fgrep support POSIX character classes. In addition to the characters described above, bracket expressions support the following forms:

1. POSIX character classes

[:alnum:]

Alphanumeric characters

[:alpha:]

Alphabetic characters

[:blank:]

Space and tab characters

[:cntrl:]

Control characters

[:digit:]

Numeric characters

[:graph:]

Visible characters (printable characters other than space)

[:lower:]

Lowercase letters

[:print:]

Printable characters

[:punct:]

Punctuation characters

[:space:]

Whitespace characters

[:upper:]

Uppercase letters

[:xdigit:]

Hexadecimal digits

2. Collating symbols
A multi-character sequence is treated as a single element. For example, [.cn.] means that cn is treated as one element.

3. Equivalence classes
A set of characters that are considered equivalent. For example, [=e=] can match the various characters similar to e in a French locale.

Regular expressions allow POSIX character classes to be mixed with other characters inside a bracket expression. For example, [[:alpha:]!] matches any letter or an exclamation mark (!).

// Match digits or !

[houchangren@ebsdi-23260-oozie shell]$ grep -E '[[:digit:]!]+' a.txt

so you don't worry!
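A few more character classes in action. This is a minimal sketch on invented input; note that each class name sits inside an outer pair of square brackets.

```shell
# Sample lines (made up for this example); the last one is three spaces.
data() {
    printf '%s\n' 'abc' 'ABC' '123' 'a1!' '   '
}

data | grep -E '^[[:digit:]]+$'      # only digits, prints: 123
data | grep -E '^[[:alnum:]]+$'      # letters and digits only: abc, ABC, 123
data | grep    '[[:upper:]]'         # contains an uppercase letter: ABC
data | grep    '[[:punct:]]'         # contains punctuation: a1!
data | grep -Ec '^[[:space:]]+$'     # counts the whitespace-only lines
```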

Matching a single character

There are four ways to match a single character: ordinary characters, escaped metacharacters, the . (dot) metacharacter, and bracket expressions.

1. Ordinary characters
For example, abc matches abc.

2. Escaped metacharacters
For example, \* matches * and \[\] matches [].

3. The . (dot) metacharacter
For example, .bc matches abc and dbc.

4. Bracket expressions
For example, [cC]hina matches only china and China.
[^abc]d matches a d that is preceded by any character other than a, b, or c.
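The four ways of matching a single character can be checked against a small invented list:

```shell
# Sample lines (made up for this example).
items() {
    printf '%s\n' 'abc' 'dbc' 'bc' 'a*c' 'china' 'China' 'chnia' 'ad' 'xd'
}

items | grep 'a\*c'        # escaped metacharacter: prints a*c
items | grep '^.bc$'       # dot: prints abc and dbc, but not bc
items | grep '[cC]hina'    # bracket expression: prints china and China
items | grep '^[^abc]d$'   # negated bracket expression: prints xd, not ad
```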

Matching multiple characters with a single expression: anchors
^ anchors the match at the beginning and $ anchors it at the end.

Operator precedence

BRE operator precedence (from highest to lowest)

Operator

Description

[. .] [= =] [: :]

Bracket symbols for character collation

\metacharacter

Escaped metacharacters

[]

Bracket expressions

\( \) \n

Subexpressions and back-references

* \{ \}

Repetition of the preceding expression (star and interval expressions)

(no symbol)

Concatenation

^ $

Anchors

ERE operator precedence (from highest to lowest)

Operator

Description

[. .] [= =] [: :]

Bracket symbols for character collation

\metacharacter

Escaped metacharacters

[]

Bracket expressions

()

Grouping

* + ? {}

Repetition of the preceding regular expression

(no symbol)

Concatenation

^ $

Anchors

|

Alternation
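Two consequences of this ordering are easy to observe with grep -E: repetition binds more tightly than concatenation, and alternation binds more loosely than anchors. The sample lines are invented for this sketch.

```shell
# Sample lines (made up for this example).
seqs() {
    printf '%s\n' 'ab' 'abb' 'abab' 'abcd' 'xyz'
}

# + applies only to the b, because repetition outranks concatenation.
seqs | grep -E '^ab+$'      # prints: ab, abb
# Grouping makes + apply to the whole string ab.
seqs | grep -E '^(ab)+$'    # prints: ab, abab
# | ranks lowest, so this reads as (^abc) or (xyz$).
seqs | grep -E '^abc|xyz$'  # prints: abcd, xyz
```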

More differences

1. Back-references
BRE provides a mechanism called back-references, which means matching again the text selected by an earlier part of the regular expression. We use \1 through \9 to refer to the previously selected patterns, and we enclose in \( and \) the part that we want to refer to later.

For example:

\(go\).*\1 matches a line that contains go twice.

2. Alternation
Alternation is a feature of ERE. A bracket expression can say "match this character or that character", but it cannot say "match this character sequence or that character sequence". When we need the latter, we use alternation in ERE: the different sequences are separated by the pipe symbol. For example, you|me matches you or me.

Like the pipe symbol in the shell, the alternation operator can appear several times in one regular expression to offer several choices. Because alternation has the lowest precedence, each alternative extends up to the next alternation operator or to the end of the regular expression.

3. Grouping

In BRE, certain metacharacters modify what precedes them in order to match repetitions, but they apply only to a single character. In ERE, grouping lets a metacharacter apply to a preceding string. The grouping symbols ( and ) enclose the subexpression. For example, (go)+ matches one or more consecutive go.

Grouping is also very useful together with alternation. For example: (Lily|Lucy) will visit my house today. In this regular expression, grouping limits the alternation to the visitor, who is either Lily or Lucy.
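The effect of the parentheses in the Lily/Lucy example can be seen by counting matching lines with and without them. The three sentences are invented test data.

```shell
# Sample sentences (made up for this example).
visits() {
    printf '%s\n' 'Lily will visit my house today' \
                  'Lucy will visit my house today' \
                  'Lily went home'
}

# Without grouping the alternatives are "Lily" and
# "Lucy will visit my house today", so all three lines match.
visits | grep -Ec 'Lily|Lucy will visit my house today'

# With grouping only the name alternates, so two lines match.
visits | grep -Ec '(Lily|Lucy) will visit my house today'
```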

Application of Regular Expressions

One reason regular expressions are so important is that many programs (on UNIX/Linux and on Windows alike) use them to extend themselves and offer more powerful functionality.

There are two flavors of regular expressions, BRE and ERE, and the split is a product of history. Regular expressions in the egrep style had already appeared early in UNIX development, but Ken Thompson, one of the creators of UNIX, felt that the ed editor did not need such comprehensive regular expression support, and the ed flavor later evolved into BRE.

ed became the foundation of grep and sed, so the regular expressions supported by grep and sed are BRE. egrep, which uses ERE-style regular expressions, was invented in the pre-V7 period. When egrep, grep, and fgrep were merged into a single grep program, grep gained support for the different flavors (ERE through the -E option). Although egrep has been dropped from the POSIX standard, many UNIX/Linux distributions still provide the egrep command and many legacy scripts still use it, even though the standard practice is grep -E.

Programs that use ERE-style expressions include egrep, awk, and lex. lex is a lexical analyzer generator and, except in special cases, is rarely used in shell programming.

Extensions

In addition to supporting standard BRE and ERE, many programs extend the regular expression syntax as needed. The most common extensions are \< and \>, which match the start and the end of a word respectively.

The start of a word is either at the beginning of a line or immediately after a non-word character; the end of a word is either at the end of a line or immediately before a non-word character.

Additional regular expression operators supported by GNU

Operator

Description

\w

Matches any word-constituent character; equivalent to [[:alnum:]_]

\W

Matches any non-word-constituent character; equivalent to [^[:alnum:]_]

\< \>

Match the start and the end of a word

\b

Matches the empty string found at the beginning or end of a word. \bword\b is equivalent to \<word\>

\B

Matches the empty string between two word-constituent characters.

\` \'

Match the beginning and end of an emacs buffer respectively. GNU programs generally treat them as synonyms for ^ and $
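The GNU operators can be explored with grep -o, which prints each match on its own line. This is a sketch that assumes GNU grep; the input strings are invented.

```shell
# \w+ : each run of word characters (letters, digits, underscore).
echo 'foo_bar baz-42' | grep -oE '\w+'       # prints: foo_bar, baz, 42

# \W : each single non-word character (here a space and a hyphen).
echo 'foo_bar baz-42' | grep -o '\W' | wc -l

# \Bcat\B : cat with word characters on both sides, so only scatter matches.
printf '%s\n' 'cat' 'concat' 'catalog' 'scatter' | grep '\Bcat\B'
```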

Studying Roman numerals

This one makes my head spin. I will study it some day when I am in the mood.

Example: parsing phone numbers

echo '80055512121234' | grep -E "^[[:digit:]]{3}[^[:digit:]]*[[:digit:]]{3}[^[:digit:]]*[[:digit:]]{4}[^[:digit:]]*[[:digit:]]*$"
echo '800-555-1212' | grep -E "^[[:digit:]]{3}[^[:digit:]]*[[:digit:]]{3}[^[:digit:]]*[[:digit:]]{4}[^[:digit:]]*[[:digit:]]*$"
echo '800-555-1212-1234' | grep -E "^[[:digit:]]{3}[^[:digit:]]*[[:digit:]]{3}[^[:digit:]]*[[:digit:]]{4}[^[:digit:]]*[[:digit:]]*$"
echo 'work 1-(800) 555.1212 #1234' | grep -E "[[:digit:]]{3}[^[:digit:]]*[[:digit:]]{3}[^[:digit:]]*[[:digit:]]{4}[^[:digit:]]*[[:digit:]]*$"

Breaking down the expression:

[[:digit:]]{3}  three digits

[^[:digit:]]*  any number of non-digit characters (possibly none)

[[:digit:]]{3}  three digits

[^[:digit:]]*  any number of non-digit characters (possibly none)

[[:digit:]]{4}  four digits

[^[:digit:]]*  any number of non-digit characters (possibly none)

[[:digit:]]*  any number of digits (possibly none)
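Wrapping each digit group in parentheses lets the pieces be pulled apart rather than merely matched. The sketch below uses sed -E, which GNU and BSD sed both accept; the function name split_phone and the | separator in the output are choices made for this illustration, not part of the original notes.

```shell
# Print "area|prefix|line|extension" for input that matches the
# anchored pattern above; print nothing when it does not match.
split_phone() {
    printf '%s\n' "$1" | sed -nE \
        's/^([[:digit:]]{3})[^[:digit:]]*([[:digit:]]{3})[^[:digit:]]*([[:digit:]]{4})[^[:digit:]]*([[:digit:]]*)$/\1|\2|\3|\4/p'
}

split_phone '80055512121234'      # prints: 800|555|1212|1234
split_phone '800-555-1212'        # prints: 800|555|1212|
split_phone '800-555-1212-1234'   # prints: 800|555|1212|1234
```

Input with leading text, such as 'work 1-(800) 555.1212 #1234', produces no output here because this version keeps the ^ anchor.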
