POSIX specification and flavors of Linux regular expressions

Source: Internet
Author: User
Tags closing tag control characters locale posix egrep

The POSIX specification and regular expressions in Linux/Unix tools
Readers with a basic grasp of regular expressions will be familiar with expressions such as "\d", which matches a digit, and "[a-z]+", which matches one or more lowercase English letters. But if you have used tools such as vi, grep, awk, and sed, you may have found that although they support regular expressions, their syntax is quite different: written the usual way, "\d" or "[a-z]+" is often not recognized or matches the wrong thing. The tools also differ among themselves, so the same construct needs a backslash in one tool and must not have one in another. Why is that?
  The reason is that most tools on Unix/Linux follow the POSIX specification, and POSIX regular expressions themselves come in two flavors. It is therefore necessary to understand the POSIX specification first.


POSIX specification
The familiar regular expression notation originates from Perl. It gave rise to a prominent flavor known as PCRE (Perl Compatible Regular Expressions), whose hallmark is shorthand such as "\d", "\w", and "\s". Outside PCRE there are other flavors, such as the POSIX regular expressions described below.
POSIX stands for Portable Operating System Interface (for UNIX). It is a family of specifications that define what a UNIX operating system should support, so "POSIX regular expressions" simply means "the part of POSIX that specifies regular expressions". POSIX defines two flavors: BRE (Basic Regular Expression) and ERE (Extended Regular Expression). On POSIX-compliant UNIX systems, tools such as grep and egrep follow the POSIX specification, and the regular expressions in some database systems do as well.


BRE
Among the common Linux/Unix tools, grep, vi, and sed belong to the BRE flavor. Its syntax looks odd: the metacharacters "(", ")", "{", and "}" take on their special meaning only when escaped. The regular expression "(a)b" therefore matches only the string (a)b, not the string ab, which needs "\(a\)b"; the regular expression "a{1,2}" matches only the string a{1,2}, while "a\{1,2\}" matches the string a or aa.
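The escaping rules above can be tried out with a short sketch. It assumes a POSIX grep on the PATH (grep without options uses BRE); the sample lines and variable names are made up for illustration.

```shell
# Sample lines to search; made up for this illustration.
lines=$(printf '%s\n' 'ab' '(a)b' 'a' 'aa' 'a{1,2}')

# Unescaped parentheses are literal in BRE: only the line "(a)b" matches.
literal_parens=$(printf '%s\n' "$lines" | grep -x '(a)b')

# Escaped parentheses group: the pattern is equivalent to plain "ab".
grouped=$(printf '%s\n' "$lines" | grep -x '\(a\)b')

# Unescaped braces are literal: only the line "a{1,2}" matches.
literal_braces=$(printf '%s\n' "$lines" | grep -x 'a{1,2}')

# Escaped braces form an interval: the lines "a" and "aa" match.
interval=$(printf '%s\n' "$lines" | grep -x 'a\{1,2\}')

printf '%s\n' "$literal_parens" "$grouped" "$literal_braces" "$interval"
```

The -x option makes grep match whole lines only, which keeps each result easy to predict.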
It is this awkward because these tools were created early, while many regular expression features evolved later. Characters that are metacharacters today may have had no special meaning then, so for backward compatibility the special meaning could only be added through escaping. Some features are missing altogether: POSIX BRE has no "+" or "?" quantifier and no alternation "(...|...)". Back-references "\1", "\2", ... are the exception, since POSIX does define them for BRE.
Today, however, pure BRE is rarely seen; people have come to take alternation and the "+" and "?" quantifiers for granted, and doing without them is inconvenient. So although vi belongs to the BRE flavor, it provides these features. GNU also extended BRE to support "+", "?", and "|", which must be written "\+", "\?", and "\|", and GNU tools support the back-references "\1", "\2", and so on. GNU grep and similar tools are thus nominally BRE, but are more precisely called GNU BRE.
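A small sketch of the GNU BRE extensions, assuming GNU grep (other grep implementations may treat "\+", "\?", and "\|" differently); the word list and variable names are invented for the example.

```shell
# These escapes are GNU extensions to BRE, not part of POSIX.
words=$(printf '%s\n' 'color' 'colour' 'colouur' 'cat' 'dog' 'bird')

# \? makes the preceding item optional: "color" and "colour".
optional=$(printf '%s\n' "$words" | grep -x 'colou\?r')

# \+ requires one or more repetitions: "colour" and "colouur".
one_or_more=$(printf '%s\n' "$words" | grep -x 'colou\+r')

# \| separates alternatives: "cat" and "dog".
either=$(printf '%s\n' "$words" | grep -x 'cat\|dog')

# \1 refers back to the first group, so this finds a doubled character: "colouur".
doubled=$(printf '%s\n' "$words" | grep '\(.\)\1')

printf '%s\n' "$optional" "$one_or_more" "$either" "$doubled"
```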

ERE
Among the common Linux/Unix tools, egrep and awk belong to the ERE flavor. Although BRE is called "Basic" and ERE "Extended", ERE is not required to be compatible with BRE syntax; it is a self-contained flavor. Its metacharacters need no escaping (putting a backslash before a metacharacter cancels its special meaning), so "(ab|cd)" matches the string ab or cd, and the quantifiers "+", "?", and "{n,m}" can be used directly. ERE does not specify back-references, but many tools support "\1", "\2", and so on anyway.
GNU egrep and related tools belong to the ERE flavor (more precisely, GNU ERE). Because GNU has extended BRE so heavily, "GNU ERE" is really just a name: it offers the same features as GNU BRE, except that the metacharacters need no escaping.
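The ERE behavior can be sketched the same way. This assumes a grep that accepts -E (the POSIX way to select ERE, equivalent to egrep); the sample lines and variable names are invented.

```shell
# Sample lines to search; made up for this illustration.
lines=$(printf '%s\n' 'ab' 'cd' 'ef' 'a' 'aaa' '(ab|cd)')

# Unescaped parentheses group and | alternates: "ab" and "cd".
alternation=$(printf '%s\n' "$lines" | grep -E -x '(ab|cd)')

# A backslash cancels the special meaning: only the literal line "(ab|cd)".
literal=$(printf '%s\n' "$lines" | grep -E -x '\(ab\|cd\)')

# Interval and + quantifiers are written without backslashes.
interval=$(printf '%s\n' "$lines" | grep -E -x 'a{1,2}')
plus=$(printf '%s\n' "$lines" | grep -E -x 'a+')

printf '%s\n' "$alternation" "$literal" "$interval" "$plus"
```

Note that the pattern '\(ab\|cd\)' means the opposite in BRE and ERE: grouping with alternation in GNU BRE, a literal string in ERE.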
The table below summarizes the differences between the POSIX flavors (in practice, today's BRE and ERE hardly differ in features; the main difference is whether metacharacters are escaped).
  Description of the POSIX flavors (important)

Flavor   Description                                               Tools
-------  --------------------------------------------------------  -----
BRE      ( ) { } special only when escaped; no + ? |               grep, sed, vi (vi nevertheless supports alternation and back-references)
GNU BRE  ( ) { } + ? | special only when escaped                   GNU grep, GNU sed
ERE      + ? ( ) { } | used unescaped; \1 \2 support varies        egrep, awk
GNU ERE  + ? ( ) { } | used unescaped; \1 \2 supported by grep -E  grep -E, GNU awk


For quick reference, the following table lists how the basic regular expression features are written in the common tools, taking the GNU versions of the tools as the standard.
Notation in common Linux/Unix tools

PCRE       Vi/vim        grep          awk            sed
---------  ------------  ------------  -------------  ------------
*          *             *             *              *
+          \+            \+            +              \+
?          \=            \?            ?              \?
{m,n}      \{m,n}        \{m,n\}       {m,n}          \{m,n\}
\b *       \< \>         \< \>         \y \< \>       \< \>
(...|...)  \(...\|...\)  \(...\|...\)  (...|...)      \(...\|...\)
(...)      \(...\)       \(...\)       (...)          \(...\)
\1 \2      \1 \2         \1 \2         Not supported  \1 \2


Note (*): PCRE commonly uses \b for "the start or end of a word", but Linux/Unix tools usually use \< to match the start of a word and \> to match the end of a word. In GNU awk, \y matches at either position (\b means backspace there). The sed column assumes GNU sed in its default BRE mode.
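A brief sketch of the word-boundary anchors, assuming GNU grep and GNU sed (\< and \> are GNU extensions and may be missing elsewhere); the sample words and variable names are invented.

```shell
# Sample words; made up for this illustration.
words=$(printf '%s\n' 'cat' 'concat' 'catalog')

# \< anchors at the start of a word: "cat" and "catalog".
starts=$(printf '%s\n' "$words" | grep '\<cat')

# \> anchors at the end of a word: "cat" and "concat".
ends=$(printf '%s\n' "$words" | grep 'cat\>')

# Both together select the whole word only: "cat".
whole=$(printf '%s\n' "$words" | grep '\<cat\>')

# The same anchors work in a GNU sed substitution.
replaced=$(printf '%s\n' 'cat concat' | sed 's/\<cat\>/dog/g')

printf '%s\n' "$starts" "$ends" "$whole" "$replaced"
```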


POSIX character classes
In some documents you will also come across notation such as "[:digit:]" and "[:lower:]". It is not hard to guess what they mean (digit is "digits", lower is "lowercase letters"), but they look odd. These are POSIX character classes. They appear not only in common Linux/Unix tools but also in some programming languages, so a brief introduction is in order to avoid confusion.
In the POSIX specification, notation such as "[a-z]" and "[aeiou]" is still valid and means the same as a character class in PCRE, but its proper name is a POSIX bracket expression, a term used mainly on Unix/Linux systems. The main difference between a POSIX bracket expression and a PCRE character class is that inside a POSIX bracket expression the backslash \ is not an escape character. The POSIX bracket expression "[\d]" therefore matches only the two characters \ and d, not the digits that "[0-9]" matches.
To deal with characters that have a special meaning inside brackets, POSIX stipulates that to include a literal ] in a bracket expression (rather than have it close the expression), you place it immediately after the opening bracket, so the POSIX regular expression "[]a]" matches the characters ] and a. To include a literal - (rather than have it denote a range), you place it immediately before the closing bracket ], so "[a-]" matches the characters a and -.
The POSIX specification also defines POSIX character classes, which are roughly the counterpart of PCRE's character class shorthands: each names a set of characters with a self-explanatory word, such as digit for "digits" and alpha for "letters".
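The bracket-expression rules for ], -, and \ can be checked with a minimal sketch, assuming a POSIX grep; each sample line holds a single character, and the variable names are invented.

```shell
# One character per line; made up for this illustration.
chars=$(printf '%s\n' ']' 'a' '-' '\' 'd' '7')

# ] placed right after the opening bracket is literal: matches "]" and "a".
closing=$(printf '%s\n' "$chars" | grep -x '[]a]')

# - placed right before the closing bracket is literal: matches "a" and "-".
hyphen=$(printf '%s\n' "$chars" | grep -x '[a-]')

# A backslash is an ordinary character here: matches "\" and "d", but not "7".
backslash_d=$(printf '%s\n' "$chars" | grep -x '[\d]')

printf '%s\n' "$closing" "$hyphen" "$backslash_d"
```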
One POSIX concept deserves attention here: the locale. It is a set of language- and culture-related settings, including date format, currency format, character encoding, and so on. The meaning of a POSIX character class changes with the locale. The following table describes the common POSIX character classes in an ASCII locale and in a Unicode environment, for reference.

Class       Description                             ASCII locale                          Unicode
----------  --------------------------------------  ------------------------------------  -------
[:alnum:]   Letters and digits                      [A-Za-z0-9]                           [\p{L&}\p{Nd}]
[:alpha:]   Letters                                 [A-Za-z]                              \p{L&}
[:ascii:]*  ASCII characters                        [\x00-\x7F]                           \p{InBasicLatin}
[:blank:]   Space and tab                           [ \t]                                 [\p{Zs}\t]
[:cntrl:]   Control characters                      [\x00-\x1F\x7F]                       \p{Cc}
[:digit:]   Digits                                  [0-9]                                 \p{Nd}
[:graph:]   Visible (non-whitespace) characters     [\x21-\x7E]                           [^\p{Z}\p{C}]
[:lower:]   Lowercase letters                       [a-z]                                 \p{Ll}
[:print:]   Like [:graph:], but includes the space  [\x20-\x7E]                           \P{C}
[:punct:]   Punctuation                             [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]  [\p{P}\p{S}]
[:space:]   Whitespace characters                   [ \t\r\n\v\f]                         [\p{Z}\t\r\n\v\f]
[:upper:]   Uppercase letters                       [A-Z]                                 \p{Lu}
[:word:]*   Letters, digits, and underscore         [A-Za-z0-9_]                          [\p{L}\p{N}\p{Pc}]
[:xdigit:]  Hexadecimal digits                      [A-Fa-f0-9]                           [A-Fa-f0-9]


Note 1: Character classes marked with * are not part of the POSIX specification, but they are available in many languages and appear in documentation.
Note 2: For the corresponding Unicode properties, see the Unicode installment of this article series.
POSIX character classes are also used differently from PCRE shorthands. The main difference is that a PCRE shorthand can appear outside square brackets, whereas a POSIX character class must appear inside a bracket expression. To match a digit on its own, you can write "\d" in PCRE, but with POSIX character classes you must write "[[:digit:]]".
Linux/Unix tools can generally use POSIX character classes directly, whereas PCRE shorthands such as "\w" and "\d" are mostly unsupported, so do not be surprised to see "[[:space:]]" instead of "\s".
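A short sketch of POSIX character classes in use, assuming a POSIX grep; LC_ALL=C is set only to pin the ASCII ("C") locale so the results do not depend on the environment, and the sample lines and variable names are invented.

```shell
# Sample lines; made up for this illustration.
lines=$(printf '%s\n' 'abc' '42' 'a 1' 'x7')

# A line made up only of digits: "42".
digits=$(printf '%s\n' "$lines" | LC_ALL=C grep -x '[[:digit:]]\{1,\}')

# Any line containing whitespace: "a 1".
spaced=$(printf '%s\n' "$lines" | LC_ALL=C grep '[[:space:]]')

# Several classes can share one bracket expression: "abc", "42", and "x7".
lower_or_digit=$(printf '%s\n' "$lines" | LC_ALL=C grep -x '[[:lower:][:digit:]]*')

printf '%s\n' "$digits" "$spaced" "$lower_or_digit"
```

The outer brackets belong to the bracket expression and the inner "[:name:]" is the class itself, which is why the doubled brackets are required.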
Among common programming languages, Java, PHP, and Ruby also support POSIX character classes. In Java and PHP they match according to the ASCII locale. Ruby is a little more complicated: Ruby 1.8 matches according to the ASCII locale and does not support "[:word:]" and "[:alnum:]", while Ruby 1.9 matches according to Unicode and supports both "[:word:]" and "[:alnum:]".
