Basic Analysis of the Linux Operating System (V): The grep Command Family and an Introduction to Regular Expressions

Last Update:2016-05-23 Source: Internet

Author: User

Tags posix expression engine egrep


grep is known as one of the "Three Musketeers" of text processing, together with sed and awk. It is the simplest and least feature-rich of the three, but it should not be underestimated.
The full name of grep is "Global search REgular expression and Print out the line": it searches for the lines that match a regular expression and prints them.

This introduces a new concept: the regular expression. So what is a regular expression?

A regular expression (often abbreviated in code as regex, regexp, or RE) is a concept from computer science. It uses a single string to describe, and match, a set of strings that conform to a certain syntactic rule. In many text editors, regular expressions are used to find and replace text that matches a pattern.

A regular expression is a logical formula for operating on strings: a "rule string", built from predefined special characters and combinations of them, that expresses filtering logic for strings.
Given a regular expression and another string, we can:
1. Determine whether the given string conforms to the filtering logic of the regular expression (this is called a "match");
2. Extract the specific part we want from the string.
Regular expressions have the following characteristics:
1. They are flexible, logical, and very powerful;
2. They make complex string handling possible quickly and with very little code;
3. For newcomers, they can be obscure and hard to read.

Because regular expressions operate mainly on text, they are used in all kinds of text editing situations. Many programming languages support regular expressions for string manipulation. For example, Perl has a powerful regular expression engine built in (the PCRE library offers a Perl-compatible syntax to other programs), and the Java language ships with its own regular expression support as well.

Mainstream regular expression engines fall into three categories: DFA, traditional NFA, and POSIX NFA.
A DFA engine runs in linear time because it never needs to backtrack (and therefore never tests the same character twice). A DFA engine also guarantees that the longest possible string is matched. However, because a DFA engine contains only finite state, it cannot match a pattern that contains backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.
A traditional NFA engine runs a so-called "greedy" backtracking match algorithm: it tries all possible expansions of a regular expression in a specific order and accepts the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and match backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times if the state is reached by different paths. As a result, it can run very slowly in the worst case. And because a traditional NFA accepts the first match it finds, it can leave other (possibly longer) matches undiscovered.
A POSIX NFA engine is like a traditional NFA engine, except that it continues to backtrack until it can guarantee that it has found the longest possible match. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and with a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.
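The difference between "first match wins" and "longest match wins" can be observed directly. The following is a minimal sketch, assuming GNU grep: -E follows the POSIX leftmost-longest rule, while -P (available only when grep was built with PCRE support, which the script probes for) uses a backtracking engine that accepts the first alternative that succeeds. The sample word and the variable names are made up for illustration.

```shell
# Sample input (hypothetical); both alternatives can match at position 0.
text='tonight'

# POSIX rule: among matches starting at the same position, the longest wins.
posix_match=$(printf '%s\n' "$text" | grep -oE 'to|tonight')
echo "grep -E matched: $posix_match"

# Backtracking rule: the first alternative that works wins.
# -P is optional in GNU grep builds, so probe for it before using it.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    pcre_match=$(printf '%s\n' "$text" | grep -oP 'to|tonight')
    echo "grep -P matched: $pcre_match"
fi
```

Swapping the alternatives to 'tonight|to' changes the -P result but not the -E result, which is the ordering effect described above.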

Programs that use a DFA engine include: awk, egrep, flex, lex, MySQL, procmail, etc.
Programs that use a traditional NFA engine include: GNU Emacs, Java, grep, less, more, .NET languages, the PCRE library, Perl, PHP, Python, Ruby, sed, vi, vim;
Programs that use a POSIX NFA engine include: mawk, Mortice Kern Systems' utilities, GNU Emacs (when explicitly requested);
Some programs use a hybrid DFA/NFA engine: GNU awk, GNU grep/egrep, Tcl.

With this background, the rest of the discussion of grep is easier to follow. To meet different needs, the grep family has three members: grep, egrep, and fgrep. grep is the basic implementation and uses basic regular expressions; egrep (equivalent to grep -E) makes extended regular expressions more convenient to use; fgrep (equivalent to grep -F) treats the pattern as a fixed string and is the one to choose for a fast lookup that does not need regular expression matching.
Whichever command we use, the goal is the same: to correctly find the lines we are interested in from the specified files.
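The three family members can be compared on the same input. This is a small sketch with made-up sample lines; it uses the grep -E and grep -F spellings, since recent GNU grep releases warn that the egrep and fgrep names are obsolescent. The -c option (print a count of matching lines) is used only to keep the output short.

```shell
# Three sample lines (made up for illustration).
sample=$(printf '%s\n' 'a.c' 'abc' 'a+c')

# grep: basic regular expression, "." matches any single character.
basic_count=$(printf '%s\n' "$sample" | grep -c 'a.c')

# grep -E (egrep): extended syntax, grouping and alternation need no backslashes.
extended_count=$(printf '%s\n' "$sample" | grep -cE 'a(b|\+)c')

# grep -F (fgrep): the pattern is a fixed string, so "." is just a dot.
fixed_count=$(printf '%s\n' "$sample" | grep -cF 'a.c')

echo "basic=$basic_count extended=$extended_count fixed=$fixed_count"
```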

Here is how to use the grep command:
grep: external command
Function: displays the lines from the source data that are matched by the pattern. By default grep matches greedily, that is, each match consumes as many characters as the pattern allows.
Format:
grep [OPTIONS] PATTERN [FILE ...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE ...]
Common options:
-A NUM, --after-context=NUM: displays each matching line and the NUM lines after it (fewer if the file ends first); has no effect when combined with the -o option
-B NUM, --before-context=NUM: displays each matching line and the NUM lines before it (fewer if the file starts first); has no effect when combined with the -o option
-C NUM, -NUM, --context=NUM: displays each matching line and the NUM lines before and after it (fewer if they do not exist); has no effect when combined with the -o option
--color=auto: highlights the matched text in the output
-e PATTERN, --regexp=PATTERN: specifies a pattern; repeating -e matches several patterns with a logical OR relationship, and each -e option takes exactly one pattern as its argument
-E, --extended-regexp: extended regular expression switch; makes grep interpret the pattern as an extended regular expression
-i, --ignore-case: ignores character case
-n, --line-number: displays the line number of each matching line in the original file
-o, --only-matching: displays only the matched string
-q, --quiet, --silent: silent mode; does not output anything
-v, --invert-match: displays the lines that are not matched by the pattern
-w, --word-regexp: matches whole words only
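Several of these options can be tried on a throwaway file. This is a minimal sketch; the file contents and variable names are made up for illustration, and -c (print a count of matching lines instead of the lines themselves) is an additional grep option that is not in the list above.

```shell
# Create a small temporary file (contents are made up for illustration).
demo_file=$(mktemp)
printf '%s\n' 'Alpha one' 'beta two' 'alpha three' 'gamma four' > "$demo_file"

# -i ignores case, -n prefixes each matching line with its line number.
numbered=$(grep -in 'alpha' "$demo_file")

# -v inverts the match; -c counts the selected lines.
not_beta=$(grep -vc 'beta' "$demo_file")

# -o prints only the matched text, not the whole line.
only_match=$(grep -o 'gam[a-z]*' "$demo_file")

# -w matches whole words only: "alph" is not a complete word on any line,
# so the count is 0 (grep then exits with status 1, hence the "|| true").
whole_word=$(grep -cw 'alph' "$demo_file" || true)

# -A 1 prints each matching line plus one line of trailing context.
with_context=$(grep -A 1 'beta' "$demo_file")

rm -f "$demo_file"
printf '%s\n' "$numbered" "$not_beta" "$only_match" "$whole_word" "$with_context"
```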
Exit status:
If at least one line is selected, the exit status is 0; if no line is selected, it is 1.
If an error occurs, the exit status is 2, unless the -q, --quiet, or --silent option is in effect and a matching line was found, in which case it is 0.
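The three exit statuses are easy to check from a script. The following sketch captures each status in a variable; the file path used to provoke an error is a hypothetical path that is assumed not to exist.

```shell
# 0: at least one line matched (-q suppresses the output).
found=0
printf 'root\n' | grep -q 'root' || found=$?

# 1: no line matched.
missing=0
printf 'root\n' | grep -q 'nobody' || missing=$?

# 2: an error occurred (here: the input file does not exist).
error=0
grep 'root' /nonexistent/path/for/demo 2>/dev/null || error=$?

echo "found=$found missing=$missing error=$error"
```

This is why grep -q is convenient in conditions such as: if grep -q PATTERN FILE; then ...; fi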

The PATTERN can be composed in several ways:
1. Ordinary characters or strings
2. Regular expressions (metacharacters)

For example:
# grep 'root' /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
In this example, the pattern given to grep is a plain string. A plain string can only match itself literally, which is rigid and inflexible, so it does not show what grep can really do.

The following is a brief summary of the commonly used basic regular expression metacharacters, which we hope will be helpful:

1. Wildcard characters:
. (dot): matches any single character
[]: matches any single character within the specified set or range
[^]: a ^ as the first character inside the brackets negates the set, meaning "any character not listed"
In addition, the character classes that glob patterns provide can also be used in regular expressions. Here are a few common ones:
[:digit:]: denotes all decimal digits, equivalent to 0-9
[:upper:]: denotes all uppercase English letters, equivalent to A, B, C, D, ..., X, Y, Z
[:lower:]: denotes all lowercase English letters, equivalent to a, b, c, d, ..., x, y, z
[:alpha:]: denotes all English letters, both uppercase and lowercase, that is, the union of the previous two sets
[:alnum:]: denotes all English letters and all decimal digits, that is, the union of the first three sets
[:space:]: denotes all whitespace characters
[:punct:]: denotes all punctuation and special symbols, including , . / ! ? \ ; ' " - = + _ and so on
These class names are only valid inside a bracket expression, so a usable pattern looks like [[:digit:]]. For example:

[^[:punct:]]: matches any single character that is not a punctuation or special symbol.
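The wildcard characters can be tried on a single line of text. This is a small sketch with a made-up sample line; grep -o prints each match on its own line, and tr -d '\n' is used only to join the single-character matches for display.

```shell
# Sample line (made up for illustration).
line='user42: Hello, World!'

# [[:digit:]] matches one decimal digit at a time.
digits=$(printf '%s\n' "$line" | grep -o '[[:digit:]]' | tr -d '\n')

# [[:upper:]] matches one uppercase letter at a time.
uppers=$(printf '%s\n' "$line" | grep -o '[[:upper:]]' | tr -d '\n')

# [[:punct:]] matches one punctuation character at a time.
puncts=$(printf '%s\n' "$line" | grep -o '[[:punct:]]' | tr -d '\n')

# [^[:punct:][:space:]] matches anything that is neither punctuation nor whitespace.
plain=$(printf '%s\n' "$line" | grep -o '[^[:punct:][:space:]]' | tr -d '\n')

# "." matches any single character: here the "o" between "W" and "r".
any_char=$(printf '%s\n' "$line" | grep -o 'W.r')

printf '%s\n' "$digits" "$uppers" "$puncts" "$plain" "$any_char"
```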

2. Number of occurrences of the preceding character:
*: the preceding character may appear 0 or more times, equivalent to \{0,\}
\?: the preceding character may appear 0 or 1 times, equivalent to \{0,1\} (a GNU extension in basic regular expressions)
\+: the preceding character must appear at least 1 time, equivalent to \{1,\} (a GNU extension in basic regular expressions)
\{\}: the preceding character must appear the specified number of times, in one of the following forms:
\{m\}: the preceding character must appear exactly m times
\{0,m\}: the preceding character may appear 0 to m times
\{m,\}: the preceding character must appear at least m times
\{m,n\}: the preceding character must appear at least m times and at most n times
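The quantifiers can be compared by counting how many of a few sample words each pattern accepts. This is a minimal sketch assuming GNU grep; the word list and the helper function count_matches are made up for illustration. The -x option (the whole line must match) and -c (print a count) keep the comparison simple.

```shell
# Sample words: "g", then zero to three "o", then "d".
words=$(printf '%s\n' 'gd' 'god' 'good' 'goood')

# Hypothetical helper: count the words that match the pattern as a whole line.
count_matches() {
    printf '%s\n' "$words" | grep -cx "$1"
}

star=$(count_matches 'go*d')           # 0 or more "o"
optional=$(count_matches 'go\?d')      # 0 or 1 "o"
plus=$(count_matches 'go\+d')          # 1 or more "o"
exact=$(count_matches 'go\{2\}d')      # exactly 2 "o"
at_most=$(count_matches 'go\{0,1\}d')  # 0 to 1 "o"
at_least=$(count_matches 'go\{2,\}d')  # 2 or more "o"
between=$(count_matches 'go\{1,2\}d')  # 1 to 2 "o"

echo "$star $optional $plus $exact $at_most $at_least $between"
```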

3. Anchor characters:
^: anchors the match at the beginning of the line
$: anchors the match at the end of the line
\<: anchors the match at the beginning of a word
\>: anchors the match at the end of a word
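The effect of each anchor shows up when the same word appears in different positions. This is a small sketch assuming GNU grep (the word anchors are a GNU feature); the sample lines and the helper function count_lines are made up for illustration.

```shell
# Sample lines (made up): "root" at line start, inside a word, as a prefix, at line end.
text=$(printf '%s\n' 'root is here' 'the chroot jail' 'rooted tree' 'login as root')

# Hypothetical helper: count the lines that contain a match for the pattern.
count_lines() {
    printf '%s\n' "$text" | grep -c "$1"
}

line_start=$(count_lines '^root')     # lines beginning with "root"
line_end=$(count_lines 'root$')       # lines ending with "root"
word_start=$(count_lines '\<root')    # "root" at the start of a word
word_end=$(count_lines 'root\>')      # "root" at the end of a word
whole_word=$(count_lines '\<root\>')  # "root" as a complete word

echo "$line_start $line_end $word_start $word_end $whole_word"
```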

4. Grouping characters:
\(\): treats the characters enclosed in the parentheses as a single group
When the regular expression engine encounters groups, it records the text matched by each group, in the order in which the groups appear, in special variables. Referring to the values stored in these variables is called a "back reference", written as \1, \2, \3, and so on.
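A back reference matches the same text that its group matched, not merely the same pattern. The following sketch, assuming GNU grep and made-up sample lines, finds a repeated word and a line whose two halves are identical.

```shell
# Sample lines (made up for illustration).
lines=$(printf '%s\n' 'he said hello hello' 'no repeats here' 'abc-abc')

# \([a-z]\+\) captures a word; \1 requires the very same word again after a space.
doubled=$(printf '%s\n' "$lines" | grep -o '\<\([a-z]\+\) \1\>')

# \(.*\) captures everything before the "-"; \1 requires the same text after it.
mirrored=$(printf '%s\n' "$lines" | grep '^\(.*\)-\1$')

printf '%s\n' "$doubled" "$mirrored"
```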

5. Alternation character:
\|: "or"; the entire expressions on the left and right of the symbol are the alternatives
For example, b\|root matches either b or root.
\(b\|r\)oot matches either boot or root, because the \(\) grouping limits the scope of the alternation.
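The scope of the alternation can be checked on a few words. This is a minimal sketch assuming GNU grep (\| is a GNU extension in basic regular expressions); the word list is made up, and -x requires the whole line to match so that the two patterns are easy to tell apart.

```shell
# Sample words (made up for illustration).
words=$(printf '%s\n' 'boot' 'root' 'foot' 'b')

# Without grouping, the alternatives are the whole strings "b" and "root".
ungrouped=$(printf '%s\n' "$words" | grep -x 'b\|root')

# With grouping, only the first letter varies: "boot" or "root".
grouped=$(printf '%s\n' "$words" | grep -x '\(b\|r\)oot')

printf '%s\n' "$ungrouped" "$grouped"
```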

Note: when using regular expression metacharacters, keep the following points in mind:
1. A regular expression is built from metacharacters and ordinary characters, and there is normally no whitespace between them; a space in a pattern is itself a character to be matched.
2. It is best to put patterns that contain metacharacters in quotation marks; otherwise the shell may interpret them before the regular expression engine ever sees them.
3. Both single and double quotes work, but pay attention to the $ character: inside double quotes the shell treats $ as the start of a variable reference and tries to expand the text after it as a variable name, which interferes with the pattern. When $ is used as the end-of-line anchor, single quotes are the safe choice.
4. To match a metacharacter literally, precede it with "\". For example, "." matches any single character, while "\." matches only the dot itself. This is required when matching things such as domain names or IP addresses.
5. When writing a regular expression, think the problem through and consider all the cases that can occur, so that the expression matches exactly what is required.
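Points 2 to 4 can be illustrated with a pattern for dotted addresses. This is a rough sketch assuming GNU grep and made-up sample strings; it only checks the overall shape (four groups of digits separated by dots) and does not validate that each number is in the range 0 to 255.

```shell
# Sample strings (made up for illustration).
candidates=$(printf '%s\n' '192.168.1.10' '192x168y1z10' 'example.com')

# Unescaped "." matches any character, so the second string is accepted too.
loose=$(printf '%s\n' "$candidates" | grep -c '^[0-9]\+.[0-9]\+.[0-9]\+.[0-9]\+$')

# "\." matches only a literal dot. The single quotes keep the shell from
# touching the backslashes and the trailing "$".
strict=$(printf '%s\n' "$candidates" | grep '^[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+$')

printf '%s\n' "$loose" "$strict"
```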

Metacharacters of extended regular expressions:
Extended regular expressions include all the metacharacters of basic regular expressions, with the following changes in notation:
\? becomes ?
\{\} becomes {}
\(\) becomes ()
\+ becomes +
\| becomes |
The "\" in the word-start anchor \< and the word-end anchor \> is still required.
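The correspondence between the two notations can be verified by running equivalent patterns side by side. This is a small sketch assuming GNU grep, with a made-up word list.

```shell
# Sample words (made up for illustration).
sample=$(printf '%s\n' 'boot' 'root' 'rooot' 'foot')

# Basic syntax: grouping, alternation and intervals need backslashes.
bre=$(printf '%s\n' "$sample" | grep '^\(b\|r\)o\{2\}t$')

# Extended syntax: the same pattern without the backslashes.
ere=$(printf '%s\n' "$sample" | grep -E '^(b|r)o{2}t$')

# \+ in basic syntax corresponds to + in extended syntax.
bre_plus=$(printf '%s\n' "$sample" | grep -c 'ro\+t')
ere_plus=$(printf '%s\n' "$sample" | grep -cE 'ro+t')

# The word anchors keep their backslash in extended syntax.
ere_word=$(printf '%s\n' "$sample" | grep -cE '\<root\>')

printf '%s\n' "$bre" "$ere" "$bre_plus" "$ere_plus" "$ere_word"
```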

Let's look at a few examples to make this easier to understand.

1. Extract the base name of a path such as /etc/sysconfig/network-scripts/ifcfg-eth0

# echo /etc/sysconfig/network-scripts/ifcfg-eth0 | grep -o '[^/]\+/\?$' | tr -d '/'
ifcfg-eth0

2. Find the places in the /etc/rc.d/init.d/functions file where a word (which may contain underscores) is followed by a pair of parentheses

# grep -o ".*_*.*()" /etc/init.d/functions
systemctl_redirect ()
checkpid()
__pids_var_run()
__pids_pidof()
daemon()
killproc()
pidfileofproc()
pidofproc()
status()
echo_success()
echo_failure()
echo_passed()
echo_warning()
update_boot_stage()
success()
failure()
passed()
warning()
action()
strstr()
is_ignored_file()
is_true()
is_false()
apply_sysctl()
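A pattern built from .* accepts any text in front of the parentheses, so it also picks up comments or commands that merely mention "()". The following sketch shows one possible tighter pattern, assuming GNU grep and assuming that a function name starts at the beginning of the line and consists of letters, digits, and underscores. The sample lines are made up and only imitate the style of such a file.

```shell
# Sample lines (made up) in the style of a shell function library.
functions_sample=$(printf '%s\n' \
    'systemctl_redirect () {' \
    'checkpid() {' \
    '    local i' \
    '# see daemon() for details' \
    '__pids_var_run() {')

# Loose: anything followed by "()", including the comment line.
loose=$(printf '%s\n' "$functions_sample" | grep -o '.*()')

# Tighter: a name made of letters, digits and underscores at the start of
# the line, optional spaces, then "()".
strict=$(printf '%s\n' "$functions_sample" | grep -o '^[[:alnum:]_]\+ *()')

printf '%s\n' '--- loose ---' "$loose" '--- strict ---' "$strict"
```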

3. Display the account entries in the /etc/passwd file whose user name is the same as the name of the user's default shell

# grep -E '(\<.+\>).*\1$' /etc/passwd
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
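The expression in this example is not tied to particular fields, so any word that reappears at the end of the line satisfies it. A stricter variant can anchor the captured group to the first field and require a "/" in front of the back reference. This is a sketch, assuming GNU grep and the usual colon-separated passwd layout with the login shell as the last field; the sample entries, including the user named demo, are made up.

```shell
# Sample entries (made up) in /etc/passwd format.
passwd_sample=$(printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'sync:x:5:0:sync:/sbin:/bin/sync' \
    'halt:x:7:0:halt:/sbin:/sbin/halt' \
    'demo:x:1001:1001:sh:/home/demo:/bin/sh')

# Unanchored: the "demo" entry matches too, because its comment field "sh"
# reappears at the end of the line.
loose_count=$(printf '%s\n' "$passwd_sample" | grep -cE '(\<.+\>).*\1$')

# Anchored: capture the first field (the user name) and require the line
# to end with "/" followed by that same name.
strict=$(printf '%s\n' "$passwd_sample" | grep -E '^([^:]+):.*/\1$')

printf '%s\n' "$loose_count" "$strict"
```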

4. Display the absolute paths of the shared library files used by the ls command, as reported by the ldd command

# ldd /bin/ls | egrep -o "/.*lib.*/[^[:space:]]+"
/lib64/libselinux.so.1
/lib64/libcap.so.2
/lib64/libacl.so.1
/lib64/libc.so.6
/lib64/libpcre.so.1
/lib64/liblzma.so.5
/lib64/libdl.so.2
/lib64/ld-linux-x86-64.so.2
/lib64/libattr.so.1
/lib64/libpthread.so.0
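In ldd output, the library paths are normally the only whitespace-delimited tokens that begin with a slash, so a shorter pattern is often sufficient. The following sketch runs on made-up sample lines that imitate ldd output (the addresses are invented) so that it does not depend on the libraries installed on a particular system.

```shell
# Sample lines (made up) imitating the output of ldd.
ldd_sample=$(printf '%s\n' \
    '    linux-vdso.so.1 =>  (0x00007ffc00000000)' \
    '    libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f0000000000)' \
    '    /lib64/ld-linux-x86-64.so.2 (0x00007f0000200000)')

# A path is a "/" followed by one or more non-whitespace characters.
paths=$(printf '%s\n' "$ldd_sample" | grep -oE '/[^[:space:]]+')

printf '%s\n' "$paths"
```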

This article is from the "home of the Ops" blog; please retain this source attribution: http://zhaotianyu.blog.51cto.com/132212/1775956


