Linux regular expressions: POSIX character classes
POSIX standardizes the meaning of regular expression characters and operators. The standard defines two flavors of regular expressions: basic regular expressions (BRE), which grep and sed use, and extended regular expressions (ERE), which egrep and awk use.
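The difference between the two flavors can be seen with a single metacharacter. The sketch below is only an illustration: the sample input and the variable names are made up for this example, and `grep -E` is used as the equivalent of egrep.

```shell
# Sample input: a line of repeated letters and a line with a literal plus sign.
sample=$(printf 'aaa\na+\n')

# BRE (plain grep): an unescaped + is an ordinary character,
# so only the line containing the text "a+" matches.
bre_matches=$(printf '%s\n' "$sample" | grep 'a+')

# ERE (grep -E): + means "one or more of the preceding item",
# so both lines match.
ere_matches=$(printf '%s\n' "$sample" | grep -E 'a+')

printf 'BRE matches:\n%s\n' "$bre_matches"
printf 'ERE matches:\n%s\n' "$ere_matches"
```

The same pattern text selects one line under BRE and two under ERE, which is why it matters which flavor a given tool uses.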
To accommodate non-English environments, the POSIX standard enhances the ability of character classes to match characters outside the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] does not match it. The standard also provides for sequences of characters that should be treated as a single unit when matching and sorting (collating) string data.
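A small experiment makes the limitation of [a-z] visible. This is a sketch under assumptions: the input is UTF-8 text, the C locale is forced so the result does not depend on the local setup, and the word and variable names are chosen only for illustration.

```shell
# Sample text: the French word "très" (UTF-8 input assumed).
word='très'

# Delete every character that [a-z] matches; whatever survives
# was not matched by the range.
leftover=$(printf '%s\n' "$word" | LC_ALL=C sed 's/[a-z]//g')

printf 'not matched by [a-z]: %s\n' "$leftover"
```

The accented letter is left behind. On a system where a suitable French locale is installed (which locale names exist is system-specific), running sed under that locale with [[:alpha:]] in place of [a-z] would be expected to remove the è as well.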
POSIX also changes some commonly used terminology. What we call a "character class" is known as a "bracket expression" in the POSIX standard. Within a bracket expression, besides literal characters (such as a, !, and so on), additional components may appear. They are as follows:
• Character classes. A POSIX character class consists of a keyword enclosed between [: and :]. The keyword describes a class of characters, such as alphabetic characters or control characters.
• Collating symbols. A collating symbol is a multi-character sequence that should be treated as a single unit. It consists of the characters enclosed between [. and .].
• Equivalence classes. An equivalence class lists a set of characters that should be considered equivalent, such as e and è. It consists of a named element from the locale, enclosed between [= and =].
All three constructs must appear inside the square brackets of a bracket expression. For example, [[:alpha:]!] matches any single alphabetic character or an exclamation point. [[.ch.]] matches the collating element ch, but does not match the letter c or the letter h alone. In a French locale, [[=e=]] might match any of e, è, or é. The following table lists the classes and the characters they match.
Class      | Description
[:alnum:]  | Alphanumeric characters
[:alpha:]  | Alphabetic characters
[:cntrl:]  | Control characters
[:digit:]  | Numeric characters
[:graph:]  | Printable, visible characters (excludes spaces and control characters)
[:lower:]  | Lowercase letters
[:print:]  | Like [:graph:], but also includes the space character
[:punct:]  | Punctuation characters
[:space:]  | All whitespace characters (newline, space, tab, and so on)
[:upper:]  | Uppercase letters
[:xdigit:] | Hexadecimal digits (0-9, a-f, A-F)
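The following sketch shows a few of these classes in use with grep, sed, and tr. The sample strings and variable names are invented for illustration, and the C locale is forced so that the classes cover only ASCII characters. Note that a class name is written lowercase and sits inside an outer pair of brackets in a regular expression, while tr takes the class name with a single pair.

```shell
# [[:alpha:]!] : one letter or one exclamation point. The ^ and $ anchors
# restrict the match to lines consisting of exactly one such character.
single=$(printf 'a\n!\n7\nZ\n' | LC_ALL=C grep '^[[:alpha:]!]$')

line='Call   0x1F,  then exit!'

# [:space:] : squeeze each run of whitespace into one space
# (\{1,\} is the BRE way of writing "one or more").
squeezed=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[[:space:]]\{1,\}/ /g')

# [:upper:] and [:lower:] : tr accepts the same class names for case conversion.
lowered=$(printf '%s\n' 'Call' | LC_ALL=C tr '[:upper:]' '[:lower:]')

# [:xdigit:] : count the lines made up entirely of hexadecimal digits.
hex_count=$(printf '1F\nbeef\nxyz\n' | LC_ALL=C grep -c '^[[:xdigit:]]\{1,\}$')

printf 'single-character matches:\n%s\n' "$single"
printf 'squeezed: %s\n' "$squeezed"
printf 'lowered:  %s\n' "$lowered"
printf 'hex lines: %s\n' "$hex_count"
```

Here the first command keeps the lines a, ! and Z and drops 7, and the last one counts 1F and beef but not xyz.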
As vendors fully implement the POSIX standard, these features are gradually making their way into commercial versions of sed and awk. GNU awk and GNU sed support the character class notation, but not the other two bracket notations. Check your local system documentation to see whether they are available.
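Besides reading the documentation, support can be probed directly. The snippet below is one possible check for sed, not a definitive test: it assumes that an implementation without character class support would fail to match a lone digit with [[:digit:]], and the variable name is hypothetical.

```shell
# Feed a single digit to sed and print it only if [[:digit:]] matches it.
# An implementation without class support would read the pattern differently
# (or reject it), and nothing would be printed.
if [ "$(printf '7\n' | sed -n '/^[[:digit:]]$/p' 2>/dev/null)" = "7" ]; then
    sed_classes=yes
else
    sed_classes=no
fi

printf 'sed supports [[:digit:]]: %s\n' "$sed_classes"
```

The same idea can be applied to other tools by swapping in the corresponding command.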
Because these features are not yet widely available, the scripts on this site do not rely on them, and we will continue to use the term "character class" to refer to a list of characters in square brackets.