TRegExpr Regular Expressions


2006-10-24 10:55
TRegExpr in Delphi
[2006-03-29 11:33:46 am | Author:admin]
I have recommended this Pascal unit on several forums, and it is the only regular expression implementation I use in Delphi.
Regular expressions are intricate and powerful. My own knowledge is limited, so I am not going to write a regular expression tutorial; instead, this introduction to the unit includes some simple, useful examples.
The protagonist of the unit is the TRegExpr class. It has many members, so here I only walk through the general matching process. The following code extracts the value of every name="..." attribute in a text:

procedure GetName(const TextToCheck: string; AList: TStringList);
var
  MyExpr: TRegExpr;
begin
  MyExpr := TRegExpr.Create;
  try
    // Capture whatever stands between the quotes of name="..."
    MyExpr.Expression := 'name="(.*?)"';
    if MyExpr.Exec(TextToCheck) then
      repeat
        AList.Add(MyExpr.Match[1]);   // text captured by the first group
      until not MyExpr.ExecNext;      // continue with the next match
  finally
    MyExpr.Free;
  end;
end;

Here is a brief explanation of this code.
The first statement, MyExpr.Expression := 'name="(.*?)"';, sets a pattern that matches strings of the form name="xxxxx".
".*?" is a very common fragment. It is a non-greedy match for any string, that is, the shortest string that satisfies the pattern; greedy versus non-greedy matching is explained further on.
The parentheses capture the text they enclose: the part of the match that corresponds to the pattern inside them is stored in the Match array of TRegExpr.
Next comes the statement if MyExpr.Exec(TextToCheck), which starts matching TextToCheck against the regular expression above. The Exec method has three overloads:
function Exec(const AInputString: AnsiString): Boolean; overload; // matches against the AInputString parameter
function Exec: Boolean; overload; // matches against the InputString member
function Exec(AOffset: Integer): Boolean; overload; // matches against the InputString member, starting at position AOffset
The method returns a Boolean. True means that InputString contains text matching the expression; for example, passing 'name="hello.gif"' as the parameter returns True.

MyExpr.Match[1] in the following statement retrieves the captured result.

Finally, ExecNext in effect uses the third overload mentioned above: it keeps matching further occurrences in the string, and its return value has the same meaning as that of Exec.
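
The same matching loop can be tried outside Delphi. Below is a minimal sketch in Python (an assumption made for illustration; the article's own code is Delphi), where re.finditer stands in for the Exec / ExecNext pair. The function name get_names and the sample text are hypothetical.

```python
import re

def get_names(text_to_check):
    """Collect every value captured by name="..." in the text."""
    pattern = re.compile(r'name="(.*?)"')
    # finditer yields each successive non-overlapping match,
    # much like calling Exec once and then ExecNext in a loop.
    return [m.group(1) for m in pattern.finditer(text_to_check)]

# Hypothetical sample input
sample = '<img name="hello.gif"> <input name="user">'
print(get_names(sample))
```

If no match is found, the list is simply empty, which corresponds to Exec returning False.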

Now for the Match member. Match[0] holds the match for the entire expression; the following elements hold the text matched by each pair of parentheses. The element number increases from left to right in the order of the opening parentheses, so nested parentheses are numbered from the outside in. For example, a simple match for e-mail addresses:
Input string: '"[email protected]","[email protected]"'
Regular expression: '"((.*?)@(.*?))",'
The results are as follows:
0 "[email protected]",
1 [email protected]
2 dirt
3 sina.com
This shows the order in which the results are arranged in the Match array.
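
To see the numbering in action, here is a small sketch in Python (used only as a stand-in for TRegExpr; group numbering by opening parenthesis works the same way there). The addresses are hypothetical placeholders, not the ones from the original post.

```python
import re

# Hypothetical input: two quoted addresses, each followed by a comma
text = '"alice@example.com","bob@example.org",'

m = re.search(r'"((.*?)@(.*?))",', text)
# Group 0 is the whole match; groups 1..3 follow the order of the
# opening parentheses, so the outer group comes before the inner ones.
groups = [m.group(i) for i in range(4)]
for number, value in enumerate(groups):
    print(number, value)
```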


The ".*?" used above is typically used where rigor is not essential, as in the extraction examples above, whereas some people write validation expressions hundreds of characters long. Here "." stands for any single character, "*" means that the preceding character (or group) occurs zero or more times, and "?" is the non-greedy qualifier. A simple example: given the string "AAA" "BBB", matching with '"(.*?)"' yields AAA in Match[1]; if the "?" is removed, Match[1] becomes AAA" "BBB. That is the difference between greedy and non-greedy matching.
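
The greedy and non-greedy cases can be compared side by side. This is a minimal sketch in Python (a stand-in for TRegExpr, chosen for illustration), using the same input string as the paragraph above.

```python
import re

text = '"AAA" "BBB"'

# Non-greedy: stop at the first closing quote
lazy = re.search(r'"(.*?)"', text).group(1)

# Greedy: run to the last closing quote
greedy = re.search(r'"(.*)"', text).group(1)

print(repr(lazy))
print(repr(greedy))
```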

That covers a basic matching process. When I have time I will write more on related topics; criticism is welcome.

Reposted from: http://www.delphibbs.com/keylife/iblog_show.asp?xid=13902
Coolbaby

TRegExpr is a good implementation of regular expressions for Delphi.
It is a single unit that can be referenced directly, and it comes with a few samples.

Here is the self-test example with a few lines of comments added:
{Basic Tests}

r := TRegExpr.Create;

r.Expression := '[a-z]';
r.Exec('234578923457823659ghjk38');
Check(0, 19, 1);
// * in non-greedy mode: an empty match at position 1 is enough
r.Expression := '[a-z]*?';
r.Exec('234578923457823659artzu38');
Check(0, 1, 0);

r.Expression := '[a-z]+';
r.Exec('234578923457823659artzu38');
Check(0, 19, 5);
// same effect as the + form above
r.Expression := '[a-z][a-z]*';
r.Exec('234578923457823659artzu38');
Check(0, 19, 5);
// ? means the second [a-z] matches zero times or once
r.Expression := '[a-z][a-z]?';
r.Exec('234578923457823659artzu38');
Check(0, 19, 2);
// \d is a digit and ^ negates the set: one or more non-digit characters
r.Expression := '[^\d]+';
r.Exec('234578923457823659artzu38');
Check(0, 19, 5);
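
The expected values can be reproduced with a short Python sketch. It assumes that the last two arguments of Check are the 1-based position and the length of the match (that reading fits all the cases above but is an assumption here), and the helper check below is hypothetical, not part of TRegExpr.

```python
import re

def check(pattern, text):
    """Return (1-based position, length) of the first match, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return (m.start() + 1, m.end() - m.start())

SUBJECT = '234578923457823659artzu38'

for pattern in ('[a-z]*?', '[a-z]+', '[a-z][a-z]*', '[a-z][a-z]?', r'[^\d]+'):
    print(pattern, check(pattern, SUBJECT))
```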

Master Regular Expressions in Half an Hour
Author: Web Application Network  Source: Web Application Network

Learn regular expressions with me!
Regular expressions give many people a headache. Today I am combining what I know with some articles found online, in the hope of explaining them in terms that ordinary people can understand, and sharing the learning experience with you.
To begin with, we have to talk about ^ and $. They match the start and the end of a string respectively, as the following examples illustrate:

"^the": the string must begin with "the";
"of despair$": the string must end with "of despair";

So:
"^abc$": a string that both starts and ends with "abc", so in fact only "abc" matches;
"notice": matches any string containing "notice".

As the last example shows, if you use neither of the two characters just mentioned, the pattern (regular expression) may occur anywhere in the string being checked; you have not anchored it to either end.
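
The anchors above can be tried out quickly. This is a minimal sketch in Python (the tutorial itself targets PHP; Python is an assumption made for illustration), with a hypothetical helper named found and made-up test strings.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

print(found(r'^the', 'the end'))                   # anchored at the start
print(found(r'of despair$', 'depths of despair'))  # anchored at the end
print(found(r'^abc$', 'abcabc'))                   # both anchors: only "abc" itself
print(found(r'notice', 'please take notice now'))  # no anchor: anywhere
```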
Next come '*', '+' and '?'.
They specify how many times a character may occur. They mean, respectively:
"zero or more", equivalent to {0,},
"one or more", equivalent to {1,},
"zero or one", equivalent to {0,1}. Here are some examples:

"ab*": same as ab{0,}; matches an a followed by zero or more b ("a", "ab", "abbb", etc.);
"ab+": same as ab{1,}; as above, but at least one b is required ("ab", "abbb", etc.);
"ab?": same as ab{0,1}; there may be no b or exactly one b;
"a?b+$": matches a string that ends with zero or one a followed by one or more b.
Key point: '*', '+' and '?' apply only to the character immediately before them.
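
A short Python sketch (an illustration only; the tutorial targets PHP) makes the key point concrete: the quantifier binds to the single character before it unless parentheses group more. The helpers full and contains are hypothetical names.

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

def contains(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

for pattern, text in [('ab*', 'a'), ('ab+', 'a'), ('ab?', 'abb'),
                      ('ab*', 'abab'), ('(ab)*', 'abab')]:
    print(pattern, repr(text), full(pattern, text))
```

The last two lines of output differ because 'ab*' repeats only the b, while '(ab)*' repeats the whole group.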

You can also limit the number of occurrences with curly braces, for example:

"ab{2}": a must be followed by exactly two b ("abb");
"ab{2,}": a must be followed by two or more b ("abb", "abbbb", etc.);
"ab{3,5}": a must be followed by three to five b ("abbb", "abbbb" or "abbbbb").

Now let's group several characters with parentheses, for example:
"a(bc)*": matches a followed by zero or more "bc";
"a(bc){1,5}": a followed by one to five "bc".

There is also the character '|', which acts as an OR operator:

"hi|hello": matches a string containing "hi" or "hello";
"(b|cd)ef": matches a string containing "bef" or "cdef";
"(a|b)*c": matches a string containing any number (including zero) of a or b, followed by a c.
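
Counted repetition, grouping and alternation can be checked with a few lines of Python (an illustrative stand-in for the PHP functions used in this tutorial; the helper names and test strings are hypothetical).

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

def contains(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# Counted repetition with braces
for text in ('abb', 'abbb', 'abbbbb', 'abbbbbb'):
    print('ab{3,5}', text, full('ab{3,5}', text))

# Grouping: the quantifier applies to the whole group "bc"
print(full('a(bc)*', 'abcbc'), full('a(bc){1,5}', 'a'))

# Alternation with |
print(contains('hi|hello', 'say hello'), contains('(b|cd)ef', 'xcdef'))
```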

A dot ('.') stands for any single character except "\n".
What if you want to match every single character, including "\n"?
Use a pattern such as '(.|\n)'. Note that '[\n.]' does not do this: inside brackets the dot loses its special meaning and matches only a literal dot.

"a.[0-9]": an a, followed by any character, followed by a digit from 0 to 9;
"^.{3}$": a string of exactly three characters.


A bracketed expression matches exactly one single character:

"[ab]": matches a single a or b (same as "a|b");
"[a-d]": matches a single character from 'a' to 'd' (same as "a|b|c|d" and "[abcd]"). In general, we use [a-zA-Z] to specify one English letter of either case;
"^[a-zA-Z]": matches a string that begins with a letter of either case;
"[0-9]%": matches a string containing a digit followed by a percent sign;
",[a-zA-Z0-9]$": matches a string that ends with a comma followed by a digit or letter.

You can also list the characters you do not want inside brackets; just put '^' first. "%[^a-zA-Z]%" matches a string containing a non-letter character between two percent signs.
Important: when ^ is the first character inside brackets, it excludes the characters listed in the brackets.
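
The dot and the bracket expressions discussed above can be exercised with a small Python sketch (Python is an assumption made for illustration; the helper names and test strings are hypothetical). It also shows that, at least in Python's re module, a dot inside brackets is a literal dot, so a pattern such as '(.|\n)' is what actually matches any character including a newline.

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

def contains(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# The dot does not match a newline by default
print(full(r'.', '\n'), full(r'(.|\n)', '\n'))

# Inside brackets the dot is literal: only "." or a newline match
print(full(r'[\n.]', '.'), full(r'[\n.]', 'x'))

# Character sets, ranges and negation
print(full(r'[a-d]', 'c'), contains(r'^[a-zA-Z]', '9lives'))
print(contains(r'%[^a-zA-Z]%', '%5%'), contains(r'%[^a-zA-Z]%', '%a%'))
```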
For PHP to treat the metacharacters (^ $ . * + ? | ( ) [ ] { } \) as literal characters, you must escape them by putting a backslash '\' in front of them.
Do not forget that characters inside brackets are an exception to this rule: inside brackets, all special characters, including '\', lose their special meaning. "[*\+?{}.]" matches a string containing any of these characters.
Also, as the regex manual tells us: "If the list contains ']', it is best to make it the first character in the list (possibly following '^'). If it contains '-', it is best to put it first or last, or as the second endpoint of a range; in [a-d-0-9] the middle '-' is valid."
From the examples above you should now understand {n,m}. Note that neither n nor m may be negative, and n must not be greater than m. The pattern then matches at least n and at most m times. For example, "p{1,5}" matches at most five consecutive p: in "pvpppppp" it first matches the single leading p, and then five of the six p after the v.
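
Both points can be tried in Python (an illustrative stand-in; escaping rules inside brackets differ slightly between engines, so the class below is written without any backslash). The findall call shows which runs of p the pattern p{1,5} picks out of "pvpppppp".

```python
import re

# Inside brackets these metacharacters need no escaping in Python
SPECIALS = re.compile(r'[*+?{}.]')

print(SPECIALS.search('a+b') is not None)
print(SPECIALS.search('abc') is not None)

# {1,5}: at least one and at most five p per match
runs = re.findall(r'p{1,5}', 'pvpppppp')
print(runs)
```

The first match is the lone leading p; the six p after the v are split into a run of five and a run of one.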
Now for some sequences that start with a backslash.
\b: the books say it matches a word boundary. For example, 've\b' matches the "ve" in "love" but not the "ve" in "very".
\B is exactly the opposite of \b above, so I will not give an example.
For the other syntax that starts with \, see http://www.phpv.net/article.php/251
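
Since no example is given for \B, here is a minimal sketch of both forms in Python (an assumption made for illustration; support for \b varies between regex engines). The helper found is a hypothetical name.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# \b: "ve" must be followed by a word boundary
print(found(r've\b', 'love'), found(r've\b', 'very'))

# \B: "ve" must NOT be followed by a word boundary
print(found(r've\B', 'love'), found(r've\B', 'very'))
```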

OK, let's build an application:
How to build a pattern that matches a currency amount
We want a pattern that checks whether the input is a number representing money. We assume there are four ways to write an amount: "10000.00" and "10,000.00", or, without the decimal part, "10000" and "10,000". Let's start building the pattern:
^[1-9][0-9]*$
This matches any number that does not start with 0. But it also means a single "0" fails the test. Here is the fix:
^(0|[1-9][0-9]*)$
"Only 0 and numbers that do not start with 0 match." We can also allow a minus sign in front of the number:
^(0|-?[1-9][0-9]*)$
This means: "0, or a number that does not start with 0 and may have a minus sign in front of it." Now let's be less strict and allow a leading 0. Let's also drop the minus sign, because we do not need it for amounts of money. We now extend the pattern to match the fractional part:
^[0-9]+(\.[0-9]+)?$
This requires the string to begin with at least one digit. Note that "10." does not match this pattern; only "10" and "10.2" do. (Do you know why?)
^[0-9]+(\.[0-9]{2})?$
Here we require exactly two digits after the decimal point. If you think this is too strict, change it to:
^[0-9]+(\.[0-9]{1,2})?$
This allows one or two digits after the decimal point. Now we add commas for readability (one every three digits):
^[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{1,2})?$
Note that this pattern makes the commas mandatory for amounts of four or more digits, so "10000.00" no longer matches it.
Do not forget that '+' can be replaced by '*' if you want to allow an empty string as input (why?). Also remember that the backslash '\' can cause errors inside PHP strings (a very common mistake).
Once the string has been validated, we remove all commas with str_replace(",", "", $money), treat the result as a double, and can then do arithmetic with it.
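
The whole validate-then-convert step can be sketched in Python (the tutorial uses PHP; Python and the names below are assumptions made for illustration). WITH_COMMAS is the final pattern from the text, which insists on thousands separators. EITHER is a hypothetical combination that is not in the article; it accepts amounts both with and without commas, so that all four forms listed at the start of this section pass.

```python
import re

# Final pattern from the text: comma-grouped digits only
WITH_COMMAS = re.compile(r'^[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{1,2})?$')

# Hypothetical combination (not from the article): plain digits OR
# comma-grouped digits, then an optional 1-2 digit fraction
EITHER = re.compile(r'^([0-9]+|[0-9]{1,3}(,[0-9]{3})*)(\.[0-9]{1,2})?$')

def parse_money(text):
    """Return the amount as a float, or None if the text is not valid."""
    if EITHER.match(text) is None:
        return None
    # Same idea as str_replace(",", "", $money) followed by a cast
    return float(text.replace(',', ''))

for sample in ('10000.00', '10,000.00', '10000', '10,000', '10.', '1,00'):
    print(sample, parse_money(sample))
```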

One more:
Building a regular expression to check e-mail addresses
A complete e-mail address has three parts:
1. the user name (everything to the left of '@'),
2. the '@',
3. the server name (the remaining part).
The user name may contain uppercase and lowercase letters, digits, periods ('.'), minus signs ('-') and underscores ('_'). The server name follows the same rule, except that underscores are not allowed.
The user name can neither start nor end with a period, and the same holds for the server name. Two consecutive periods are not allowed either: there must be at least one character between them. So let's see how to write a pattern for the user name:
^[_a-zA-Z0-9-]+$
This does not allow periods yet. Let's add them:
^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*$
This means: "starts with at least one valid character (other than '.'), followed by zero or more groups that each start with a period."
To make it simpler, we can replace ereg() with eregi(). eregi() is not case sensitive, so we do not need both ranges "a-z" and "A-Z"; one is enough:
^[_a-z0-9-]+(\.[_a-z0-9-]+)*$
The server name is the same, but without the underscore:
^[a-z0-9-]+(\.[a-z0-9-]+)*$
Now just join the two parts with "@":
^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$

This is the complete e-mail validation pattern. You only need to call
eregi('^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$', $email)
to find out whether $email is an e-mail address.
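
The finished pattern can be tried out in Python (an illustrative stand-in for eregi(); the function name and the sample addresses are hypothetical). The re.IGNORECASE flag plays the role of the case-insensitive eregi() call.

```python
import re

EMAIL = re.compile(
    r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$',
    re.IGNORECASE,
)

def looks_like_email(text):
    """True if the text fits the pattern built in this section."""
    return EMAIL.match(text) is not None

for sample in ('john.doe@example.com', '.john@example.com',
               'john..doe@example.com', 'john@exa_mple.com'):
    print(sample, looks_like_email(sample))
```

This only checks the shape described in the text (no leading, trailing or doubled periods, no underscore in the server name); it is not a full address validator.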
Other uses of regular expressions
Extracting a string
ereg() and eregi() have a feature that lets you extract parts of a string with a regular expression (see the manual for details). For example, to extract the file name from a path or URL, the following code is what you need:
ereg("([^\\/]*)$", $pathOrUrl, $regs);
echo $regs[1];
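
The same extraction can be sketched in Python (an assumption made for illustration; the function name and the sample paths are hypothetical). The pattern captures the longest run of characters at the end of the string that contains neither a backslash nor a forward slash.

```python
import re

def file_name(path_or_url):
    """Return the part after the last slash or backslash."""
    # The group may be empty, so this search always succeeds
    m = re.search(r'([^\\/]*)$', path_or_url)
    return m.group(1)

print(file_name('http://www.example.com/docs/index.html'))
print(file_name('C:\\temp\\report.txt'))
```

If the input ends with a slash, the result is an empty string.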
Advanced replacement
ereg_replace() and eregi_replace() are also useful. Suppose we want to replace all whitespace separators with commas:
ereg_replace("[ \n\r\t]+", ",", trim($str));
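
A rough Python equivalent (an illustrative stand-in for ereg_replace(); the function name is hypothetical) is shown below. It assumes the character class is meant to contain a space as well as the newline, carriage return and tab characters, so that every run of whitespace becomes one comma.

```python
import re

def join_with_commas(text):
    """Replace each run of whitespace inside the text with one comma."""
    # strip() removes leading and trailing whitespace first, like trim()
    return re.sub(r'[ \n\r\t]+', ',', text.strip())

print(join_with_commas('  one two\tthree\r\nfour \n'))
```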
Finally, here is another regular expression for checking e-mail addresses, left for you to analyze:
"^[-!#$%&\'*+\\./0-9=?A-Z^_`a-z{|}~]+"."@"."[-!#$%&\'*+\\/0-9=?A-Z^_`a-z{|}~]+\."."[-!#$%&\'*+\\./0-9=?A-Z^_`a-z{|}~]+$"
If you can read it easily, this article has achieved its purpose.

Syntax rules for JScript and VBScript regular expressions

A regular expression is a text pattern made up of ordinary characters (such as the letters a through z) and special characters (called metacharacters). The pattern describes one or more strings to match when searching a body of text. The regular expression acts as a template that matches a character pattern against the string being searched.

Here are some examples of regular expressions you might encounter:

JScript              VBScript            Matches
/^[ \t]*$/           "^[ \t]*$"          A blank line.
/\d{2}-\d{5}/        "\d{2}-\d{5}"       Verifies that an ID number consists of a 2-digit number, a hyphen and a 5-digit number.
/<(.*)>.*<\/\1>/     "<(.*)>.*<\/\1>"    An HTML tag.
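
The three examples can be tried in Python (an illustrative stand-in for JScript and VBScript). The sketch assumes the quantifiers {2} and {5} implied by the description of the ID number, a back-reference \1 to the opening tag name in the HTML example, and ^[ \t]*$ as the blank-line pattern; the constant names and test strings are hypothetical.

```python
import re

BLANK_LINE = re.compile(r'^[ \t]*$')
ID_NUMBER = re.compile(r'\d{2}-\d{5}')
# \1 must repeat whatever the first group captured (the tag name)
HTML_TAG = re.compile(r'<(.*)>.*</\1>')

print(BLANK_LINE.match(' \t ') is not None)
print(ID_NUMBER.fullmatch('12-34567') is not None)
print(HTML_TAG.search('<b>bold</b>') is not None)
print(HTML_TAG.search('<b>bold</i>') is not None)
```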

The following table is a complete list of metacharacters and their behavior in the context of regular expressions:

Character  Description
\  Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline. The sequence '\\' matches "\" and '\(' matches "(".
^  Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$  Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*  Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the "o" in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there can be no spaces between the comma and the numbers.
?  When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of it as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.  Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]' (a dot inside brackets is literal, so '[.\n]' matches only a dot or a newline).
(pattern)  Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the ... properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)  Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)  Positive lookahead. Matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000", but not "Windows" in "Windows 3.1". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)  Negative lookahead. Matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y  Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]  Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the "a" in "plain".
[^xyz]  Negative character set. Matches any character not enclosed. For example, '[^abc]' matches the "p" in "plain".
[a-z]  Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character in the range "a" to "z".
[^a-z]  Negative character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range "a" to "z".
\b  Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the "er" in "never", but not the "er" in "verb".
\B  Matches a non-word boundary. 'er\B' matches the "er" in "verb", but not the "er" in "never".
\cx  Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d  Matches a digit character. Equivalent to [0-9].
\D  Matches a non-digit character. Equivalent to [^0-9].
\f  Matches a form feed character. Equivalent to \x0c and \cL.
\n  Matches a newline character. Equivalent to \x0a and \cJ.
\r  Matches a carriage return character. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including space, tab, form feed and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab character. Equivalent to \x09 and \cI.
\v  Matches a vertical tab character. Equivalent to \x0b and \cK.
\w  Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W  Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn  Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". This allows ASCII codes to be used in regular expressions.
\num  Matches num, where num is a positive integer. A backreference to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n  Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7).
\nm  Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds, \nm matches the octal escape value nm when n and m are both octal digits (0-7).
\nml  Matches the octal escape value nml when n is an octal digit (0-3) and m and l are both octal digits (0-7).
\un  Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
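
Several rows of the table can be verified with a short Python sketch (an illustrative stand-in; Python's re module supports the same lookahead, non-capturing group, backreference and non-greedy syntax, although other details of the table are specific to JScript and VBScript).

```python
import re

# Positive and negative lookahead: the version number is checked
# but is not part of the match itself
ahead = re.search(r'Windows (?=95|98|NT|2000)', 'Windows 2000')
not_ahead = re.search(r'Windows (?!95|98|NT|2000)', 'Windows 3.1')
print(repr(ahead.group(0)), repr(not_ahead.group(0)))

# Non-capturing group: matches, but stores no submatch
plural = re.fullmatch(r'industr(?:y|ies)', 'industries')
print(plural.groups())

# Backreference: two consecutive identical characters
doubled = re.search(r'(.)\1', 'hello')
print(doubled.group(0))

# Greedy versus non-greedy quantifier
print(re.search(r'o+?', 'oooo').group(0), re.search(r'o+', 'oooo').group(0))
```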

2005-5-23 10:26:21 piao40993470 commented:
All Chinese characters (not including punctuation):
([\xb0-\xf7][\xa1-\xfe])+
All GB2312-80 codes:
([\xa1-\xfe][\xa1-\xfe])+
All Chinese (full-width) spaces:
(\xa1\xa1)+

English punctuation: [\x20-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e]
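
These byte ranges only make sense on GB2312-encoded bytes, not on Unicode text. Here is a minimal sketch in Python (an assumption made for illustration) that applies them to byte strings; the constant names and sample data are hypothetical.

```python
import re

# Patterns from the comment above, applied to raw GB2312 bytes
HANZI = re.compile(rb'([\xb0-\xf7][\xa1-\xfe])+')
FULLWIDTH_SPACES = re.compile(rb'(\xa1\xa1)+')
PUNCTUATION = re.compile(rb'[\x20-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e]+')

# Two Chinese characters (U+4E2D U+6587) encoded as GB2312
chinese = '\u4e2d\u6587'.encode('gb2312')

print(HANZI.fullmatch(chinese) is not None)
print(HANZI.fullmatch(b'abc') is not None)
print(FULLWIDTH_SPACES.fullmatch(b'\xa1\xa1\xa1\xa1') is not None)
print(PUNCTUATION.fullmatch(b'!?,.') is not None)
```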

2005-5-23 10:32:03 piao40993470 commented:
Addition:
TPerlRegEx is also a good choice under Delphi.
