AHK Regular Expression Documentation (Part 1)

Source: Internet
Author: User

Regular Expressions (RegEx) - Quick Reference

Fundamentals


Match anywhere: By default, a regular expression matches a substring anywhere inside the string being searched. For example, the regular expression abc matches abc123, 123abc, and 123abcxyz. To require the match to occur only at the beginning or end, use an anchor.

Escaped characters: Most characters (such as abc123) can be used literally inside a regular expression. However, the characters \.*?+[{|()^$ must be preceded by a backslash to be treated as literals. For example, \. is a literal period and \\ is a literal backslash. Escaping can be avoided by using \Q...\E. For example: \QLiteral Text\E.

Case-sensitive: By default, regular expressions are case-sensitive. This can be changed with the "i" option. For example, the pattern i)abc searches for "abc" without regard to case. See below for the other options.
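
The three basics above can be tried out with a short sketch. It uses Python's re module rather than AutoHotkey, so it is only an analogue: re.search stands in for an unanchored match, and the re.IGNORECASE flag stands in for the i) prefix, which Python does not accept. The variable names and the "1+1=2" sample are illustrative assumptions, not part of the original reference.

```python
import re

# Match anywhere: an unanchored pattern finds "abc" at any position.
hits = [bool(re.search(r"abc", s)) for s in ("abc123", "123abc", "123abcxyz")]

# Escaping: an unescaped dot matches any character, an escaped one only ".".
loose = bool(re.search(r"a.c", "abc"))      # "." matches the "b"
strict = bool(re.search(r"a\.c", "abc"))    # requires a literal period
escaped = re.escape("1+1=2")                # backslash-escapes the "+"

# Case sensitivity: on by default, switched off with a flag.
default_case = bool(re.search(r"abc", "ABC"))
ignore_case = bool(re.search(r"abc", "ABC", re.IGNORECASE))

print(hits, loose, strict, default_case, ignore_case)
```

In AutoHotkey the same checks would go through RegExMatch() with patterns such as i)abc; the flag argument here is purely a Python convention.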

Options (case sensitive)


At the very beginning of a regular expression, specify zero or more of the following options followed by a close-parenthesis. For example, the pattern "im)abc" searches for abc with the case-insensitive and multiline options (the parenthesis can be omitted when there are no options). Although this syntax breaks from tradition, it requires no special delimiter (such as a forward slash), so there is no need to escape such delimiters inside the pattern. In addition, performance improves because the options are easier to parse.

i Case-insensitive matching, which treats the letters A through Z as identical to their lowercase counterparts.

m Multiline. Views Haystack as a collection of individual lines (if it contains newlines) rather than as a single continuous line. Specifically, it changes the following:
1) Circumflex (^) matches immediately after every internal newline, as well as at the start of Haystack, where it always matches (but it does not match after a newline at the very end of Haystack).
2) Dollar sign ($) matches before any newline in Haystack, as well as at the very end, where it always matches.
For example, the pattern "m)^abc$" matches the Haystack "xyz`r`nabc" only because of the "m" option.
The "D" option is ignored when the "m" option is present.

s DotAll. This causes a period (.) to match all characters including newlines (normally, it does not match newlines). However, when the newline sequence is at its default of CRLF (`r`n), two periods are required to match it (not one). Regardless of this option, a negative class such as [^a] always matches newlines.
x Ignores whitespace characters in the pattern except when they are escaped or inside a character class. The characters `n and `t are among those ignored because by the time they reach PCRE they are already raw/literal whitespace characters (by contrast, \n and \t are not ignored because they are PCRE escape sequences). The x option also ignores the characters between a non-escaped # outside a character class and the next newline character, inclusive. This makes it possible to add comments inside complicated patterns. However, this applies only to data characters; whitespace may never appear within a special character sequence such as (?(, which begins a conditional subpattern.
A Forces the pattern to be anchored; that is, it can match only at the start of Haystack (if Haystack begins with a newline, the match must start at that newline, not at the character after it). Under most conditions, this is equivalent to explicitly anchoring the pattern with "^".
D Forces the dollar sign ($) to match at the very end of Haystack, even if Haystack's last item is a newline. Without this option, $ instead matches right before the final newline (if there is one), so the match does not include that newline. Note: This option is ignored when the "m" option is present.
J Allows duplicate named subpatterns. This is useful for patterns in which only one of a set of identically named subpatterns can match. Note: If more than one instance of a particular name matches something, only the leftmost one is stored. Also, variable names are not case-sensitive.
U Ungreedy. Makes the quantifiers *+?{} consume only the characters strictly necessary to form a match, leaving the rest available for the next part of the pattern. When the "U" option is not in effect, an individual quantifier can be made non-greedy by following it with a question mark. Conversely, when "U" is in effect, the question mark makes an individual quantifier greedy.
X PCRE_EXTRA. Enables PCRE features that are incompatible with Perl. Currently, the only such feature is that any backslash in a pattern followed by a letter that has no special meaning causes the match to fail and ErrorLevel to be set accordingly. This option helps reserve unused backslash sequences for future use. Without this option, a backslash followed by a letter with no special meaning is treated as a literal (e.g. \g and g are both recognized as a literal g). Regardless of this option, non-alphabetic backslash sequences that have no special meaning are always treated as literals (e.g. \/ and / are both recognized as a forward slash).
P Position mode. This causes RegExMatch() to yield the position and length of the match and its subpatterns rather than the matching substrings. For details, see UnquotedOutputVar.
S Studies the pattern to try to improve its performance. This is useful when a particular pattern (especially a complex one) will be executed many times. If PCRE finds a way to improve performance, that discovery is stored alongside the pattern in the cache for use by later executions of the same pattern (later uses should also specify the S option, because finding a pattern in the cache requires the option letters to match exactly, including their order). (Studying here mainly means applying simpler, faster checks before matching. For example, if the pattern is known to need at least 5 characters and the source string has only 3, the engine can report "no match" immediately without attempting the match.)
C Enables auto-callout mode. See Regular Expression Callouts for more information.
`n Switches from the default newline sequence (`r`n) to a solitary linefeed (`n), which is the standard on UNIX systems. The chosen newline affects the behavior of the anchors (^ and $) and of the period pattern.
`r Switches from the default newline sequence (`r`n) to a solitary carriage return (`r).
`a In v1.0.46.06+, `a recognizes any type of newline, namely `r, `n, `r`n, `v/VT/vertical tab/chr(0xB), `f/FF/formfeed/chr(0xC), and NEL/next-line/chr(0x85). In v1.0.47.05+, newlines can be restricted to only CR, LF, and CRLF by instead specifying (*ANYCRLF) in uppercase at the beginning of the pattern (after the options); e.g. im)(*ANYCRLF)^abc$.

Note: Spaces and tabs may optionally be used to separate each option from the next.
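
As a rough illustration of the m, s and x options, here is a minimal sketch using Python's re module, where the options are passed as flags (re.MULTILINE, re.DOTALL, re.VERBOSE) instead of as a prefix such as m). One assumption to keep in mind: Python treats a lone "\n" as the line break, whereas AutoHotkey defaults to `r`n, so the sample text uses "\n" only.

```python
import re

text = "xyz\nabc"   # Python line break; AutoHotkey's default would be `r`n

# Without the multiline flag, ^ matches only at the very start of the text.
plain = re.search(r"^abc$", text)
# With it, ^ and $ also match at internal line boundaries (AHK: m)^abc$).
multi = re.search(r"^abc$", text, re.MULTILINE)

# DotAll: "." normally stops at a newline; with the flag it crosses it.
dot_plain = re.search(r"xyz.abc", text)
dot_all = re.search(r"xyz.abc", text, re.DOTALL)

# Extended mode: whitespace and "#" comments in the pattern are ignored.
verbose = re.search(r"""
    x y z      # literal "xyz"; the spaces between the letters are dropped
    \n         # an escape sequence is still honoured
    abc
""", text, re.VERBOSE)

print(plain, multi, dot_plain, dot_all, verbose)
```

The other options listed above (such as D, U, P, S and the newline switches) have no one-to-one flag in Python's re module, so they are left out of this sketch.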

Commonly Used Symbols and Syntax

. By default, a period matches any single character that is not part of a newline (`r`n) sequence, but this can be changed with the DotAll (s), linefeed (`n), carriage return (`r), `a or (*ANYCRLF) options. For example, ab. matches abc, abz, and ab_.

* An asterisk matches zero or more of the preceding character, class, or subpattern. For example, a* matches ab and aaab. It also matches at the very beginning of any string that contains no "a" at all.
Wildcard: The dot-star pattern .* is one of the most permissive because it matches zero or more occurrences of any character (except newline: `r and `n). For example, abc.*123 matches abcAnything123 as well as abc123.

? A question mark matches zero or one of the preceding character, class, or subpattern. Think of it as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.
+ A plus sign matches one or more of the preceding character, class, or subpattern. For example, a+ matches ab and aaab. But unlike a* and a?, the pattern a+ does not match at the beginning of strings that lack an "a".
{min,max} Matches between min and max occurrences of the preceding character, class, or subpattern. For example, a{1,2} matches ab but only the first two a's in aaab.
Also, {3} means exactly 3 occurrences, and {3,} means 3 or more occurrences. Note: The specified numbers must be less than 65536, and the first must be less than or equal to the second.
[...] Classes of characters: The square brackets enclose a list or range of characters (or both). For example, [abc] means "any single character that is either a, b, or c". A dash creates a range; for example, [a-z] means "any single character between lowercase a and z (inclusive)". Lists and ranges may be combined; for example, [a-zA-Z0-9_] means "any single character that is a letter, a digit, or an underscore".

A character class may be followed by *, ?, +, or {min,max}. For example, [0-9]+ matches one or more occurrences of any digit; thus it matches xyz123 but not abcxyz.

The following POSIX named sets are also supported via the form [[:xxx:]], where xxx is one of the following words: alnum, alpha, ascii (0-127), blank (space or tab), cntrl (control character), digit (0-9), xdigit (hex digit), print, graph (print excluding space), punct, lower, upper, space (whitespace), word (same as \w).

Within a character class, characters need to be escaped only when they have special meaning inside a class; e.g. [\^a], [a\-b], [a\]], and [\\a].
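
The quantifier and character-class entries above can be checked with a small sketch. It is written against Python's re module as an assumption-laden stand-in for RegExMatch(); the helper name matched and the sample strings not taken from the text ("foo_bar-baz", "a-b-c") are illustrative. POSIX named sets are omitted because Python's re module does not support them.

```python
import re

def matched(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None

optional_u = [matched(r"colou?r", w) for w in ("color", "colour")]
bounded = matched(r"a{1,2}", "aaab")          # consumes at most two a's
plus_miss = matched(r"a+", "xyz")             # needs at least one "a"
star_empty = matched(r"a*", "xyz")            # zero a's: empty match at the start
digits = matched(r"[0-9]+", "xyz123")         # one or more digits
no_digits = matched(r"[0-9]+", "abcxyz")
word_chars = matched(r"[a-zA-Z0-9_]+", "foo_bar-baz")
escaped_dash = matched(r"[a\-b]+", "a-b-c")   # the dash is a literal here

print(optional_u, bounded, plus_miss, repr(star_empty),
      digits, no_digits, word_chars, escaped_dash)
```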

[^...] Matches any single character that is not in the class. For example, [^/]* matches zero or more occurrences of any character that is not a forward slash, such as the http: in http://. Similarly, [^0-9xyz] matches any single character that is neither a digit nor the letter x, y, or z.
\d Matches any single digit (equivalent to the class [0-9]). Conversely, uppercase \D means "any non-digit". This and the two entries below can also be used inside a class; for example, [\d.-] means "any single digit, period, or minus sign".
\s Matches any single whitespace character, mainly space, tab, and newline (`r and `n). Conversely, uppercase \S means "any non-whitespace character".
\w Matches any single "word" character, namely a letter, a digit, or an underscore. This is equivalent to [a-zA-Z0-9_]. Conversely, uppercase \W means "any non-word character".
^ and $ Circumflex (^) and dollar sign ($) are called anchors because they do not consume any characters; instead, they tie the pattern to the beginning or end of the string being searched.
Use ^ at the beginning of a pattern to require the match to occur at the very beginning of a line. For example, ^abc matches abc123 but not 123abc.
Use $ at the end of a pattern to require the match to occur at the very end of a line. For example, abc$ matches 123abc but not abc123.
The two anchors can be combined. For example, ^abc$ matches only abc (that is, there must be no other characters before or after it).
If the text being searched contains multiple lines, the "m" option makes the anchors apply to each line rather than to the text as a whole. For example, m)^abc$ matches 123`r`nabc`r`n789. Without the "m" option, it would not match.
\b \b means "word boundary", which is like an anchor because it does not consume any characters. It requires the current character's status as a word character (\w) to be the opposite of the previous character's. It is typically used to avoid accidentally matching a word that appears inside another word. For example, \bcat\b does not match catfish, but it matches cat regardless of the punctuation or whitespace around it. Uppercase \B is the opposite: it requires that the current position not be a word boundary.
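
A brief sketch of the anchors and word boundaries described above, again using Python's re module as a stand-in (an assumption; the behaviour of ^, $, \b and \B shown here is common to both engines for these ASCII samples). The sentence "The cat, asleep." is an invented sample.

```python
import re

# ^ and $ tie the pattern to the start and end of the searched text.
at_start = bool(re.search(r"^abc", "abc123"))
not_at_start = bool(re.search(r"^abc", "123abc"))
at_end = bool(re.search(r"abc$", "123abc"))
whole = bool(re.search(r"^abc$", "abc"))

# \b keeps "cat" from matching inside a longer word.
inside_word = bool(re.search(r"\bcat\b", "catfish"))
standalone = bool(re.search(r"\bcat\b", "The cat, asleep."))

# \B is the opposite: here "cat" must NOT be followed by a word boundary.
no_boundary = bool(re.search(r"cat\B", "catfish"))

print(at_start, not_at_start, at_end, whole,
      inside_word, standalone, no_boundary)
```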
| A vertical bar separates two or more alternatives. A match occurs if any one of the alternatives is satisfied. For example, gray|grey matches both gray and grey. Similarly, the pattern gr(a|e)y does the same thing with the help of the parentheses described below.
(...) Items enclosed in parentheses are most commonly used to:
• Determine the order of evaluation. For example, (Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day matches the name of any day.
• Apply *, ?, +, or {min,max} to a series of characters rather than to just one. For example, (abc)+ matches one or more occurrences of the string "abc"; thus it matches abcabc123 but not ab123 or bc123.
• Capture a subpattern, such as the dot-star in abc(.*)xyz. For example, RegExMatch() stores the substring that matches each subpattern in its output array. Similarly, RegExReplace() allows the substring that matches each subpattern to be reinserted into the result via backreferences like $1. To use parentheses without capturing a subpattern, specify ?: as the first two characters inside the parentheses; for example: (?:.*)
• Change options on the fly. For example, (?im) turns on the case-insensitive and multiline options for the remainder of the pattern (or of the subpattern, if it occurs inside one). Conversely, (?-im) turns them both off. All options are supported except DPS`r`n`a.
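
The four uses of parentheses listed above can be sketched as follows. This is a Python re analogue, not AutoHotkey code: re.sub plays the role of RegExReplace(), and Python writes the backreference as \1 where AutoHotkey writes $1. The sample strings "See you on Friday", "abc-middle-xyz" and "user@host" are assumptions made for illustration.

```python
import re

# Grouping limits the alternation to the part before "day".
day = re.search(r"(Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day", "See you on Friday")

# A quantifier after a group repeats the whole group.
repeat = re.search(r"(abc)+", "abcabc123")

# Capturing: group 1 holds whatever the dot-star matched.
capture = re.search(r"abc(.*)xyz", "abc-middle-xyz")

# Backreferences in a replacement reinsert the captured substrings.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

# (?:...) groups without capturing, so no numbered group is created.
non_capturing = re.search(r"(?:abc)+", "abcabc")

# (?i) switches on an option from inside the pattern.
inline = re.search(r"(?i)abc", "xABCx")

print(day.group(), repeat.group(), capture.group(1),
      swapped, non_capturing.groups(), inline.group())
```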

\t, \r, etc. These escape sequences stand for special characters. The most common ones are \t (tab), \r (carriage return), and \n (linefeed). In AutoHotkey, an accent (`) may optionally be used in place of the backslash in these cases. Escape sequences in the form \xhh are also supported, where hh is the hex code of any ANSI character between 00 and FF.

In v1.0.46.06+, \R means "any single newline of any type", namely those listed under the `a option (however, \R inside a character class is merely the letter "R"). In v1.0.47.05+, \R can be restricted to CR, LF, and CRLF by specifying (*BSR_ANYCRLF) in uppercase at the beginning of the pattern (after the options); e.g. im)(*BSR_ANYCRLF)abc\Rxyz

\p{xx}, \P{xx}, \X [AHK_L 61+]: Unicode character properties. Not supported on ANSI builds. \p{xx} matches a character with the xx property, while \P{xx} matches any character without the xx property. For example, \pL matches any letter and \p{Lu} matches any uppercase letter. \X matches any number of characters that form an extended Unicode sequence.

For a full list of supported property names and other details, search for "\p{xx}" in www.pcre.org/pcre.txt.

(*UCP) [AHK_L 61+]: For performance, \d, \D, \s, \S, \w, \W, \b and \B recognize only ASCII characters by default, even on Unicode builds. If the pattern begins with (*UCP), Unicode properties are used to determine which characters match. For example, \w becomes equivalent to [\p{L}\p{N}_] and \d becomes equivalent to \p{Nd}.

Greed: By default, *, ?, +, and {min,max} are greedy because they consume all characters up to the last possible one that still satisfies the entire pattern. To make them stop at the first possible character instead, follow them with a question mark. For example, the pattern <.+> (which lacks a question mark) means: "search for a <, followed by one or more of any character, followed by a >". To stop this pattern from matching the entire string <em>text</em>, append a question mark to the plus sign: <.+?>. This causes the match to stop at the first '>', so it matches only the first tag, <em>.
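
The difference is easy to see by running both patterns against the same string. The sketch below uses Python's re module as a stand-in (an assumption; greedy and non-greedy quantifiers behave the same way in PCRE for this example).

```python
import re

html = "<em>text</em>"

# Greedy: .+ runs to the last ">" that still lets the pattern succeed.
greedy = re.search(r"<.+>", html).group()

# Non-greedy: .+? stops at the first ">" it can.
lazy = re.search(r"<.+?>", html).group()

print(greedy)
print(lazy)
```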

Look-ahead and look-behind assertions: The groups (?=...), (?!...), (?<=...), and (?<!...) are called assertions because they require a condition to be met without consuming any characters. For example, abc(?=.*xyz) contains a look-ahead assertion that requires the string xyz to exist somewhere to the right of the string abc (if it does not, the match fails). (?=...) is called a positive look-ahead because it requires that the specified pattern exist. Conversely, (?!...) is a negative look-ahead because it requires that the specified pattern not exist. Similarly, (?<=...) and (?<!...) are positive and negative look-behinds, respectively, because they look to the left of the current position rather than to the right. Look-behinds are more limited than look-aheads because they do not support quantifiers of varying size such as *, ?, and +. The escape sequence \K is similar to a look-behind assertion because it causes any previously matched characters to be omitted from the final matched string. For example, foo\Kbar matches "foobar" but reports that it has matched "bar".
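
A minimal sketch of the four assertion types, using Python's re module as a stand-in for PCRE (an assumption). Python's re module does not support \K, so the foo\Kbar example is approximated here with the fixed-length look-behind (?<=foo)bar, which reports the same "bar" for this input.

```python
import re

# Positive look-ahead: "abc" matches only if "xyz" appears later.
ahead = re.search(r"abc(?=.*xyz)", "abc123xyz")
ahead_fail = re.search(r"abc(?=.*xyz)", "abc123")

# Negative look-ahead: "abc" matches only if "xyz" does NOT appear later.
neg_ahead = re.search(r"abc(?!.*xyz)", "abc123")

# Positive look-behind: "bar" must be preceded by "foo", which is not
# part of the reported match.
behind = re.search(r"(?<=foo)bar", "foobar")

# Negative look-behind: "bar" must not be preceded by "foo".
neg_behind = re.search(r"(?<!foo)bar", "foobar")

print(ahead.group(), ahead_fail, neg_ahead.group(), behind.group(), neg_behind)
```

Note that the look-ahead result is just "abc": the asserted text is checked but never included in the match.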

Related: The two main AHK functions that use regular expressions are RegExMatch() and RegExReplace().
