Delphi and Regular Expressions

Last Update:2018-12-05 Source: Internet

Author: User



TRegExpr is a good implementation of regular expressions in Delphi.
It is a separate unit that can be referenced directly, and it comes with several samples.

The SelfTest example is shown below with a few explanatory comments added:
{Basic tests}

R := TRegExpr.Create;

R.Expression := '[A-Z]';
R.Exec('234578923457823659GHJK38');
Check(0, 19, 1);
// The ? after * makes the quantifier non-greedy, so the match is empty.
R.Expression := '[A-Z]*?';
R.Exec('234578923457823659ARTZU38');
Check(0, 1, 0);

R.Expression := '[A-Z]+';
R.Exec('234578923457823659ARTZU38');
Check(0, 19, 5);
// Works the same as the + form above.
R.Expression := '[A-Z][A-Z]*';
R.Exec('234578923457823659ARTZU38');
Check(0, 19, 5);
// The ? means the second [A-Z] is matched zero times or once.
R.Expression := '[A-Z][A-Z]?';
R.Exec('234578923457823659ARTZU38');
Check(0, 19, 2);
// \d is a digit and ^ inside brackets negates, so this needs one or more non-digit characters.
R.Expression := '[^\d]+';
R.Exec('234578923457823659ARTZU38');
Check(0, 19, 5);
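
As a cross-check, the same six patterns can be run through Python's re module. This is only a sketch and not TRegExpr itself: the helper name first_match is hypothetical, and it assumes that Check(0, pos, len) compares the 1-based position and the length of the whole match. The subject string uses uppercase letters so that [A-Z] can match.

```python
import re

def first_match(pattern, text):
    """Return (1-based position, length) of the first match, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return (m.start() + 1, len(m.group(0)))

text = "234578923457823659ARTZU38"  # 18 digits, then ARTZU, then 38

results = {
    "[A-Z]": first_match(r"[A-Z]", text),
    "[A-Z]*?": first_match(r"[A-Z]*?", text),      # non-greedy: empty match at the start
    "[A-Z]+": first_match(r"[A-Z]+", text),
    "[A-Z][A-Z]*": first_match(r"[A-Z][A-Z]*", text),
    "[A-Z][A-Z]?": first_match(r"[A-Z][A-Z]?", text),
    "[^\\d]+": first_match(r"[^\d]+", text),
}

for pattern, result in results.items():
    print(pattern, result)
```

Under these assumptions, the positions and lengths agree with the Check calls in the Delphi sample.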

Getting Familiar with Regular Expressions in Half an Hour
Author: Web Application Network. Source: Web Application Network

Learn regular expressions!
Regular expressions give many people a headache. Today I want to share what I have learned, based on my own understanding and on articles found online.
To begin with, ^ and $ match the start and the end of a string respectively. For example:

"^The": matches a string that starts with "The";
"of despair$": matches a string that ends with "of despair";

So,
"^abc$": matches a string that both starts and ends with abc; in fact only "abc" itself matches.
"notice": matches any string that contains "notice".

As you can see, if you use neither of these two characters (as in the last example), the pattern (regular expression) can appear anywhere in the string being tested; you have not anchored it to either end.
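
The anchors can be tried out quickly. The sketch below uses Python's re module rather than the PHP functions discussed later; the helper name found and the sample strings are made up for illustration.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

print(found(r"^The", "The end"))                   # anchored at the start
print(found(r"^The", "See The end"))               # "The" is not at the start
print(found(r"of despair$", "a cry of despair"))   # anchored at the end
print(found(r"^abc$", "abc"))                      # the whole string must be "abc"
print(found(r"^abc$", "abcabc"))
print(found(r"notice", "please take notice now"))  # unanchored: anywhere in the string
```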
Next, let's look at '*', '+', and '?'.
They specify how many times a character or group may occur. They mean:
"zero or more", equivalent to {0,},
"one or more", equivalent to {1,},
"zero or one", equivalent to {0,1}. Here are some examples:

"ab*": the same as ab{0,}. It matches a string that has an a followed by zero or more b's ("a", "ab", "abbb", etc.);
"ab+": the same as ab{1,}. As above, but there must be at least one b ("ab", "abbb", etc.);
"ab?": the same as ab{0,1}. There can be no b or exactly one b;
"a?b+$": matches a string that ends with zero or one a followed by one or more b's.
Key point: '*', '+', and '?' apply only to the character (or group) immediately before them.
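
A minimal sketch of these quantifiers in Python's re module follows. The helper name whole and the candidate strings are illustrative only; re.fullmatch is used so that each pattern has to cover the entire candidate.

```python
import re

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

candidates = ["a", "ab", "abb", "abbb", "b"]

star = [s for s in candidates if whole(r"ab*", s)]   # zero or more b
plus = [s for s in candidates if whole(r"ab+", s)]   # at least one b
opt = [s for s in candidates if whole(r"ab?", s)]    # at most one b

# a?b+$ is anchored only at the end, so search finds the trailing part.
tail = re.search(r"a?b+$", "xxabbb").group(0)

print(star, plus, opt, tail)
```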

You can also give the number of repetitions in braces, for example:

"ab{2}": requires a to be followed by exactly two b's ("abb");
"ab{2,}": requires a to be followed by two or more b's ("abb", "abbbb", etc.);
"ab{3,5}": requires a to be followed by three to five b's ("abbb", "abbbb", or "abbbbb").

We can also group several characters in parentheses, for example:
"a(bc)*": matches an a followed by zero or more "bc";
"a(bc){1,5}": matches an a followed by one to five "bc".

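The brace forms and grouping can be checked the same way. This is a sketch in Python's re module; the helper name whole and the candidate lists are illustrative, and {1,5} is used for the "one to five" case.

```python
import re

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

b_runs = ["ab", "abb", "abbb", "abbbbb", "abbbbbb"]  # 1, 2, 3, 5 and 6 b's

exactly_two = [s for s in b_runs if whole(r"ab{2}", s)]
two_or_more = [s for s in b_runs if whole(r"ab{2,}", s)]
three_to_five = [s for s in b_runs if whole(r"ab{3,5}", s)]

groups = ["a", "ab", "abc", "abcbc"]
zero_or_more_bc = [s for s in groups if whole(r"a(bc)*", s)]
one_to_five_bc = [s for s in groups if whole(r"a(bc){1,5}", s)]

print(exactly_two, two_or_more, three_to_five)
print(zero_or_more_bc, one_to_five_bc)
```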
There is also the character '|', which acts as an OR operator:

"hi|hello": matches a string containing "hi" or "hello";
"(b|cd)ef": matches a string containing "bef" or "cdef";
"(a|b)*c": matches a string containing any number (including zero) of a's or b's followed by a c;
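
The alternation examples can be sketched in Python's re module as follows; the sample strings are made up for illustration.

```python
import re

# Each search returns the first part of the text that the pattern matches.
greeting = re.search(r"hi|hello", "say hello").group(0)
suffix = re.search(r"(b|cd)ef", "xcdefx").group(0)
run = re.search(r"(a|b)*c", "zabbac").group(0)

# Zero repetitions of (a|b) are allowed, so a lone "c" also matches.
lone_c = re.fullmatch(r"(a|b)*c", "c") is not None

print(greeting, suffix, run, lone_c)
```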

A dot ('.') stands for any single character except "\n".
What if we want to match any single character including "\n"?
Use an alternation such as '(.|\n)'. Note that '[\n.]' does not do this, because a dot inside brackets is a literal period.

"a.[0-9]": matches an a, followed by any one character, followed by a digit from 0 to 9.
"^.{3}$": matches a string of exactly three characters.

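A short sketch of how the dot behaves, using Python's re module (other engines may differ in details). In Python, at least, a dot inside brackets is a literal period, so a class such as [\n.] matches only a newline or a period; an alternation or the re.DOTALL flag is needed to let the dot cross a newline.

```python
import re

digit_after_any = re.fullmatch(r"a.[0-9]", "ax7") is not None
exactly_three = re.fullmatch(r"^.{3}$", "abc") is not None
four_is_too_long = re.fullmatch(r"^.{3}$", "abcd") is None

# By default the dot does not match a newline.
dot_stops_at_newline = re.search(r"a.b", "a\nb") is None

# Two ways to include the newline.
with_alternation = re.search(r"a(.|\n)b", "a\nb") is not None
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# Inside brackets the dot is literal: "x" is neither a newline nor a period.
bracket_dot_is_literal = re.search(r"a[\n.]b", "axb") is None

print(digit_after_any, exactly_three, four_is_too_long)
print(dot_stops_at_newline, with_alternation, with_dotall, bracket_dot_is_literal)
```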
The content enclosed in brackets matches only a single character.

"[ab]": matches a or b (same as "a|b");
"[a-d]": matches a single character from 'a' to 'd' (same effect as "a|b|c|d" and "[abcd]"). We usually write [a-zA-Z] to mean one English letter of either case.
"^[a-zA-Z]": matches a string that starts with an uppercase or lowercase letter.
"[0-9]%": matches a string containing a digit followed by a percent sign.
",[a-zA-Z0-9]$": matches a string that ends with a comma followed by a digit or a letter.

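The bracket examples can be sketched in Python's re module as follows; the sample strings are invented for illustration.

```python
import re

# Strings that start with a letter of either case.
starts_with_letter = [s for s in ["Hello", "9lives", "x"]
                      if re.search(r"^[a-zA-Z]", s)]

# A digit directly followed by a percent sign.
percent = re.search(r"[0-9]%", "up 15% today").group(0)

# A comma followed by one letter or digit at the very end.
ending = re.search(r",[a-zA-Z0-9]$", "a,b,c").group(0)

# Every single character in the range a to d.
a_to_d = re.findall(r"[a-d]", "badge")

print(starts_with_letter, percent, ending, a_to_d)
```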
You can also list the characters you do not want inside brackets; just start the list with '^'. For example, "%[^a-zA-Z]%" matches a string containing a non-letter between two percent signs.
Key point: when ^ is the first character inside brackets, it excludes the characters listed in the brackets.
To match a special character literally, you must put a '\' in front of it; in PHP you may also have to escape some characters inside the string literal itself.
Do not forget that characters inside brackets are an exception to this rule: inside brackets, all special characters, including '\', lose their special meaning. "[*\+?{}.]" matches a string containing any of these characters.
Also, as the regex manual tells us: "If the list contains ']', it is best to put it first in the list (possibly following a '^'). If it contains '-', it is best to put it first or last, or as the second endpoint of a range; in [a-d-0-9] the '-' in the middle is taken literally."
After reading the examples above, you should understand {n,m}. Note that n and m cannot be negative and that n must not be greater than m. The pattern then matches at least n and at most m times. For example, "p{1,5}" matches between one and five consecutive p's; in "pvpppp" it matches the leading "p" first and then the run "pppp".
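
A sketch of negated classes, special characters inside brackets, and a bounded repetition, using Python's re module. One difference to keep in mind: Python still treats '\' as an escape character inside brackets, whereas POSIX bracket expressions take it literally, so the class below is written without the backslash. The bounds {1,5} and the sample strings are illustrative.

```python
import re

# A non-letter between two percent signs.
between = re.search(r"%[^a-zA-Z]%", "rate: %5% or %x%").group(0)

# Metacharacters listed inside brackets are matched literally.
specials = re.findall(r"[*+?{}.]", "a*b+c?{d}.")

# A '-' placed last in the list is a literal hyphen, not a range.
dash = re.findall(r"[a-d0-9-]", "x-3c")

# Bounded repetition: each match takes between one and five p's.
runs = re.findall(r"p{1,5}", "pvpppp")
long_run = re.findall(r"p{1,5}", "ppppppp")

print(between, specials, dash, runs, long_run)
```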
What about the sequences that start with '\'?
\b matches a word boundary. For example, 've\b' matches the "ve" in "love" but not the one in "very".
\B is the opposite of \b above.
For the other sequences that start with '\', see http://www.phpv.net/article.php/251
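
The two boundary assertions can be compared side by side. This sketch uses Python's re module, and the word list is made up for illustration.

```python
import re

words = ["love", "very", "have", "event"]

# "ve" directly followed by a word boundary (here: the end of the word).
ve_at_boundary = [w for w in words if re.search(r"ve\b", w)]

# "ve" directly followed by another word character.
ve_inside_word = [w for w in words if re.search(r"ve\B", w)]

print(ve_at_boundary, ve_inside_word)
```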

Now let's build a real application:
How to build a pattern that matches a currency amount
We want a pattern that checks whether the input is a number that represents money. We assume there are four ways to write an amount: "10000.00" and "10,000.00", or, without the decimal part, "10000" and "10,000". Let's start building the pattern:
^[1-9][0-9]*$
This pattern requires the string to start with a digit other than 0, which also means that a lone "0" fails the test. Here is a fix:
^(0|[1-9][0-9]*)$
"Only 0, or a number that does not start with 0, matches." We can also allow a minus sign before the number:
^(0|-?[1-9][0-9]*)$
That is: "0, or a number that does not start with 0 and may have a minus sign in front." Now let's be less strict and allow a leading 0. Let's also drop the minus sign, because we do not need it to represent money. Next we specify a pattern that matches the fractional part:
^[0-9]+(\.[0-9]+)?$
This means the string must start with at least one digit. Note that "10." does not match this pattern; only "10" and "10.2" do. (Do you know why?)
^[0-9]+(\.[0-9]{2})?$
Here we require exactly two decimal places. If you think this is too strict, you can change it to:
^[0-9]+(\.[0-9]{1,2})?$
This allows one or two decimal places. Now we add commas (every three digits) to improve readability, which can be expressed as follows:
^[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{1,2})?$
Do not forget that '+' can be replaced by '*' if you want to allow an empty string as input (why?). Also remember that the backslash '\' can cause errors inside a PHP string (a common mistake).
Once the string has been validated, we can remove all commas with str_replace(",", "", $money), treat the result as a double, and use it in calculations.
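
The validation and clean-up steps can be sketched in Python as follows. The names PLAIN, GROUPED, and parse_money are hypothetical. Because the comma pattern on its own rejects "10000", this sketch accepts a string that matches either the plain-digit pattern or the comma-grouped pattern from the text; that combination is an assumption about the intent, not something the text spells out.

```python
import re

# Plain digits with an optional one- or two-digit fraction.
PLAIN = re.compile(r"^[0-9]+(\.[0-9]{1,2})?$")
# Groups of three digits separated by commas, same optional fraction.
GROUPED = re.compile(r"^[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{1,2})?$")

def parse_money(text):
    """Return the amount as a float, or None if the format is not accepted."""
    if PLAIN.fullmatch(text) is None and GROUPED.fullmatch(text) is None:
        return None
    # Same idea as str_replace(",", "", $money) followed by a numeric cast.
    return float(text.replace(",", ""))

for sample in ["10000.00", "10,000.00", "10000", "10,000", "10.", "1,00", "10.234"]:
    print(sample, parse_money(sample))
```

Using fullmatch keeps a trailing newline from slipping past the $ anchor, which Python's re would otherwise allow.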

Next:
Constructing a regular expression for checking email addresses
A complete email address has three parts:
1. the user name (everything to the left of '@'),
2. '@',
3. the server name (the remaining part).
The user name can contain uppercase and lowercase letters, digits, periods ('.'), minus signs ('-'), and underscores ('_'). The server name follows the same rules, except that underscores are not allowed.
The user name cannot start or end with a period, and the same is true for the server name. Two periods cannot be adjacent; there must be at least one character between them. Now let's see how to write a pattern for the user name:
^[_a-zA-Z0-9-]+$
This does not allow periods yet. We add them like this:
^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*$
This means: "at least one valid character (other than '.'), followed by zero or more groups that each start with a period."
To simplify it, we can use eregi() instead of ereg(). eregi() is case-insensitive, so we do not need to specify both ranges "a-z" and "A-Z"; one is enough:
^[_a-z0-9-]+(\.[_a-z0-9-]+)*$
The server name is the same, but without the underscore:
^[a-z0-9-]+(\.[a-z0-9-]+)*$
Now we only need to join the two parts with '@':
^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$

This is the complete email validation pattern. You only need to call
eregi('^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$', $email)
to check whether a string is a valid email address.
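
For comparison, here is a sketch of the same check in Python, where re.IGNORECASE plays the role of eregi(). The names EMAIL and is_valid_email and the sample addresses are hypothetical. The pattern only enforces the rules described above and is not a complete validator; for example, it accepts a server name without any period.

```python
import re

EMAIL = re.compile(
    r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$",
    re.IGNORECASE,
)

def is_valid_email(address):
    """True if the address follows the user@server rules from the text."""
    return EMAIL.fullmatch(address) is not None

for sample in ["john.doe@example.com", "John@Example.COM", ".john@example.com",
               "john..doe@example.com", "john@example.", "john@exa_mple.com"]:
    print(sample, is_valid_email(sample))
```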
Other uses of regular expressions
Extracting strings
ereg() and eregi() can also extract part of a string through a regular expression (see the manual for the details). For example, to extract the file name from a path or URL, the following code is what you need:
ereg("([^\/]*)$", $pathOrUrl, $regs);
echo $regs[1];
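
The same extraction can be sketched in Python. The function name file_name is hypothetical; the class excludes both '/' and '\', so the capture group ends up holding whatever follows the last separator.

```python
import re

def file_name(path_or_url):
    """Return the part after the last slash or backslash (may be empty)."""
    # The pattern can always match (at worst an empty string at the end).
    return re.search(r"([^\\/]*)$", path_or_url).group(1)

print(file_name("/var/www/index.html"))
print(file_name("http://example.com/img/logo.png"))
print(file_name("C:\\temp\\notes.txt"))
```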
Advanced replacement
ereg_replace() and eregi_replace() are also very useful. Suppose we want to replace every run of separating whitespace with a comma:
ereg_replace("[ \n\r\t]+", ",", trim($str));
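
A Python sketch of this replacement, assuming the intent is to turn each run of spaces, tabs, and line breaks into a single comma after trimming both ends. The function name commas_for_whitespace is hypothetical.

```python
import re

def commas_for_whitespace(text):
    """Trim the text, then replace each run of whitespace with one comma."""
    return re.sub(r"[ \n\r\t]+", ",", text.strip())

print(commas_for_whitespace("  a  b\tc\r\nd  "))
```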
Finally, here is another regular expression for checking email addresses, left for you to analyze:
"^[-!#$%&\'*+\./0-9=?A-Z^_'a-z{|}~]+'.'@'.'[-!#$%&\'*+\/0-9=?A-Z^_'a-z{|}~]+\.'.'[-!#$%&\'*+\./0-9=?A-Z^_'a-z{|}~]+$"
If you can follow it easily, this article has achieved its purpose.

Syntax rules for regular expressions in JScript and VBScript

A regular expression is a text pattern consisting of ordinary characters (such as the letters a to z) and special characters (known as metacharacters). The pattern describes one or more strings to match when searching a body of text. A regular expression serves as a template for matching a character pattern against the string being searched.

Here are some examples of regular expressions that you may encounter:

JScript             VBScript             Matches
/^[ \t]*$/          "^[ \t]*$"           A blank line.
/\d{2}-\d{5}/       "\d{2}-\d{5}"        An ID number consisting of two digits, a hyphen, and five digits.
/<(.*)>.*<\/\1>/    "<(.*)>.*<\/\1>"     An HTML tag.

The following table shows a complete list of metacharacters and their behaviors in the context of a regular expression:

Character     Description
\             Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^             Matches the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$             Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*             Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+             Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?             Matches the preceding subexpression zero times or once. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.
{n}           n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}          n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}         Both m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
?             When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.             Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'; a dot inside brackets is a literal period.
(pattern)     Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)   Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)   Positive lookahead: matches the search string at any point where a string matching pattern begins. This is a non-capturing match. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000", but not "Windows" in "Windows 3.1". Lookaheads do not consume characters; after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)   Negative lookahead: matches the search string at any point where a string not matching pattern begins. This is a non-capturing match. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". Lookaheads do not consume characters; after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y           Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]         A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]        A negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]         A character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter from 'a' to 'z'.
[^a-z]        A negative character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b            Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B            Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx           Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d            Matches a digit character. Equivalent to [0-9].
\D            Matches a non-digit character. Equivalent to [^0-9].
\f            Matches a form-feed character. Equivalent to \x0c and \cL.
\n            Matches a newline character. Equivalent to \x0a and \cJ.
\r            Matches a carriage return character. Equivalent to \x0d and \cM.
\s            Matches any whitespace character, including space, tab, and form-feed. Equivalent to [ \f\n\r\t\v].
\S            Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t            Matches a tab character. Equivalent to \x09 and \cI.
\v            Matches a vertical tab character. Equivalent to \x0b and \cK.
\w            Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W            Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn           Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". This allows ASCII codes to be used in regular expressions.
\num          Matches num, where num is a positive integer. A backreference to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n            Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm           Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml          Matches the octal escape value nml when n is an octal digit (0-3) and both m and l are octal digits (0-7).
\un           Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
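
Several of the table entries are easier to follow with a concrete run. The sketch below uses Python's re module, which shares this syntax for the features shown (it is not JScript or VBScript, and details such as \cx differ between engines); the sample strings follow the table's own examples.

```python
import re

# Greedy versus non-greedy.
lazy = re.search(r"o+?", "oooo").group(0)
greedy = re.search(r"o+", "oooo").group(0)

# Positive lookahead: the version number is checked but not consumed.
ahead = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
ahead_text = ahead.group(0)
no_ahead = re.search(r"Windows (?=95|98|NT|2000)", "Windows 3.1")

# Negative lookahead: succeeds only when the version is NOT in the list.
negative = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1")
no_negative = re.search(r"Windows (?!95|98|NT|2000)", "Windows 2000")

# Backreference: two consecutive identical characters.
doubled = re.search(r"(.)\1", "hello").group(0)

# Non-capturing group: it matches, but no group is stored.
plural = re.fullmatch(r"industr(?:y|ies)", "industries")

# Hexadecimal and Unicode escapes.
hex_a = re.fullmatch(r"\x41", "A") is not None
copyright_sign = re.fullmatch(r"\u00A9", "\u00a9") is not None

print(lazy, greedy, repr(ahead_text), doubled, plural.groups())
```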

10:26:21 piao40993470 commented.
All Chinese characters (excluding punctuation):
([\xB0-\xF7][\xA1-\xFE])+
All GB2312-80 codes:
([\xA1-\xFE][\xA1-\xFE])+
All Chinese (full-width) spaces:
(\xA1\xA1)+

Punctuation: [\x20-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]
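
These byte ranges only make sense on GB2312-encoded bytes. The sketch below shows one way to try them in Python with bytes patterns; the names HANZI, ANY_GB2312, FULLWIDTH_SPACE, and is_all_hanzi are hypothetical, and it assumes the input can be encoded with Python's gb2312 codec.

```python
import re

# Two-byte sequences: lead byte B0-F7 covers the Chinese character rows.
HANZI = re.compile(rb"([\xB0-\xF7][\xA1-\xFE])+")
# Any two-byte GB2312-80 code, including symbols and punctuation.
ANY_GB2312 = re.compile(rb"([\xA1-\xFE][\xA1-\xFE])+")
# The full-width space is encoded as A1 A1.
FULLWIDTH_SPACE = re.compile(rb"(\xA1\xA1)+")

def is_all_hanzi(text):
    """True if every character encodes into the Chinese character rows."""
    return HANZI.fullmatch(text.encode("gb2312")) is not None

print(is_all_hanzi("中文"))    # only Chinese characters
print(is_all_hanzi("中文，"))  # the full-width comma is punctuation
print(is_all_hanzi("abc"))     # single-byte ASCII
```

Matching with fullmatch avoids the misalignment that a plain search can run into when single-byte and double-byte characters are mixed.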

