GRETA regular expression template class library

Last Update:2018-12-07 Source: Internet

Author: User



This article briefly introduces several regular expression libraries: ATL CAtlRegExp, GRETA, and Boost.Regex. These libraries make the power of regular expressions easy to use and simplify everyday work.

Regular expression syntax

Metacharacter	Meaning
.	Matches any single character.
[ ]	Specifies a character class that matches any one of the characters inside the square brackets. For example, [abc] matches "a", "b", or "c".
^	At the start of a character class, ^ negates the class: a negated character class matches any character except those inside the brackets. For example, [^abc] matches any character other than "a", "b", and "c". At the start of a regular expression, ^ matches the beginning of the input. For example, ^[abc] matches only input that begins with "a", "b", or "c".
-	Specifies a range of characters inside a character class. For example, [0-9] matches any digit from "0" to "9".
?	The preceding expression is optional: it matches once or not at all. For example, [0-9][0-9]? matches "2" and "12".
+	The preceding expression matches one or more times. For example, [0-9]+ matches "1", "13", "666", and so on.
*	The preceding expression matches zero or more times.
??, +?, *?	Non-greedy versions of ?, +, and *. They match as few characters as possible, whereas ?, +, and * are greedy and match as many as possible. For example, given the input "<abc><def>", <.*?> matches "<abc>" while <.*> matches "<abc><def>".
( )	Grouping operator. For example, (\d+,)*\d+ matches a list of numbers separated by commas, such as "1" or "1,23,456".
\	Escape character: the character that follows is taken literally. For example, [0-9]+ matches one or more digits, whereas [0-9]\+ matches a digit followed by a plus sign. The backslash also introduces abbreviations; \a, for instance, stands for any letter or digit. If \ is followed by a number n, it matches the nth match group (match groups are written with { } and numbered from 0). For example, <{.*?}>.*?</\0> matches a pair of tags with the same name, such as "<head>Contents</head>".
$	Placed at the end of a regular expression, $ matches the end of the input. For example, [0-9]$ matches a digit at the end of the input.
|	Alternation operator: separates two expressions, exactly one of which must match. For example, T|the matches "The" or "the".
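
The difference between greedy and non-greedy quantifiers is easiest to see by running both against the same input. The sketch below is only an illustration: it uses the C++ standard library's std::regex (ECMAScript grammar) rather than CAtlRegExp, because the two happen to agree on ?, +, * and their non-greedy forms, and the helper name first_match is made up for this example.

```cpp
#include <regex>
#include <string>

// Returns the first substring of `input` matched by `pattern`,
// or an empty string when nothing matches.
// (Hypothetical helper; std::regex stands in for CAtlRegExp here.)
std::string first_match(const std::string& pattern, const std::string& input) {
    std::smatch m;
    if (std::regex_search(input, m, std::regex(pattern))) {
        return m.str(0);
    }
    return std::string();
}
```

Calling first_match("<.*?>", "<abc><def>") stops at the first closing bracket, while first_match("<.*>", "<abc><def>") runs on to the last one.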

Abbreviations

Abbreviation	Matches
\a	Any letter or digit: ([a-zA-Z0-9])
\b	White space (blank): ([ \t])
\c	Any letter: ([a-zA-Z])
\d	Any decimal digit: ([0-9])
\h	Any hexadecimal digit: ([0-9a-fA-F])
\n	Newline: (\r|(\r?\n))
\q	A quoted string: (\"[^\"]*\")|(\'[^\']*\')
\w	A simple word: ([a-zA-Z]+)
\z	An integer: ([0-9]+)
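
Several of these abbreviations mean something different in other engines (in std::regex, for example, \b is a word boundary and \w also accepts digits and underscores), so it can help to spell out the expansions explicitly. The following is a minimal sketch under that assumption: it checks input against the bracket expressions from the table using std::regex, and the function names are invented for illustration.

```cpp
#include <regex>
#include <string>

// Expands one CAtlRegExp abbreviation letter into the equivalent
// expression from the table. Unknown letters yield "".
// (The \b, \n and \q entries are left out to keep the sketch short.)
std::string expand_abbreviation(char letter) {
    switch (letter) {
        case 'a': return "[a-zA-Z0-9]";
        case 'c': return "[a-zA-Z]";
        case 'd': return "[0-9]";
        case 'h': return "[0-9a-fA-F]";
        case 'w': return "[a-zA-Z]+";
        case 'z': return "[0-9]+";
        default:  return "";
    }
}

// True when the whole input matches the expansion of the abbreviation.
bool matches_abbreviation(char letter, const std::string& input) {
    const std::string expansion = expand_abbreviation(letter);
    if (expansion.empty()) return false;
    return std::regex_match(input, std::regex(expansion));
}
```

For instance, matches_abbreviation('z', "2018") holds, while matches_abbreviation('w', "abc1") does not, because the CAtlRegExp \w covers letters only.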

ATL CAtlRegExp
ATL Server often needs to decode complex text fields such as addresses and commands, and regular expressions are a powerful tool for parsing text. ATL therefore provides its own regular expression classes.
Example:

#include "stdafx.h"
#include <atlrx.h>

int main(int argc, char* argv[])
{
   CAtlRegExp<> reUrl;
   // five match groups: scheme, authority, path, query, fragment
   REParseError status = reUrl.Parse(
        "({[^:/?#]+}:)?(//{[^/?#]*})?{[^?#]*}(?{[^#]*})?(#{.*})?" );

   if (REPARSE_ERROR_OK != status)
   {
      // Unexpected error.
      return 0;
   }

   CAtlREMatchContext<> mcUrl;
   if (!reUrl.Match(
   "http://search.microsoft.com/us/Search.asp?qu=atl&boolean=ALL#results",
      &mcUrl))
   {
      // Unexpected error.
      return 0;
   }

   for (UINT nGroupIndex = 0; nGroupIndex < mcUrl.m_uNumGroups;
        ++nGroupIndex)
   {
      const CAtlREMatchContext<>::RECHAR* szStart = 0;
      const CAtlREMatchContext<>::RECHAR* szEnd = 0;
      mcUrl.GetMatch(nGroupIndex, &szStart, &szEnd);

      ptrdiff_t nLength = szEnd - szStart;
      // %.*s expects an int precision, so the length is cast explicitly.
      printf("%u: \"%.*s\"\n", nGroupIndex, static_cast<int>(nLength), szStart);
   }

   return 0;
}

Output:

0: "http"
1: "search.microsoft.com"
2: "/us/Search.asp"
3: "qu=atl&boolean=ALL"
4: "results"

Match results are returned through the CAtlREMatchContext object pointed to by the second parameter, pContext. The results and related information are stored in that object, so the caller only needs its methods and members to read them. CAtlREMatchContext exposes the results through the m_uNumGroups member and the GetMatch() method: m_uNumGroups is the number of match groups, and GetMatch() takes a group index and returns the pStart and pEnd pointers that delimit the matched text. With these two pointers the caller can easily extract the match.
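
For readers without ATL at hand, the same group-by-group extraction can be sketched with std::regex. This is a stand-in and not the ATL API: std::regex captures with ( ) instead of { }, the match results live in a std::smatch instead of a CAtlREMatchContext, and the function name split_url is an assumption made for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits a URL into {scheme, authority, path, query, fragment}.
// Returns an empty vector if the pattern does not match.
std::vector<std::string> split_url(const std::string& url) {
    // ECMAScript spelling of the five-group pattern: (?: ) groups without
    // capturing, ( ) captures, and the literal question mark is escaped.
    static const std::regex re(
        R"(^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$)");
    std::smatch m;
    std::vector<std::string> parts;
    if (std::regex_match(url, m, re)) {
        // m[0] is the whole match; m[1]..m[5] are the five groups.
        for (std::size_t i = 1; i < m.size(); ++i) {
            parts.push_back(m[i].str());
        }
    }
    return parts;
}
```

Here m.size() plays the role of m_uNumGroups (plus one for the whole match), and m[i].str() replaces the pointer arithmetic on the pStart and pEnd values returned by GetMatch().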

For more information, see CAtlRegExp Class

GRETA
GRETA is a regular expression template class library from Microsoft Research. It contains C++ objects and functions that make pattern matching and string replacement easy. They are:

rpattern: the search pattern.

match_results/subst_results: containers that hold the results of a match or a substitution.

To search or replace, first explicitly initialize an rpattern object with a string that describes the matching rule, then call one of its member functions, such as match() or substitute(), with the string to be processed as an argument. If the match()/substitute() call fails, the function returns false; if it succeeds, it returns true and the match_results object holds the results. See the sample code:

#include <iostream>
#include <string>
#include "regexpr2.h"
using namespace std;
using namespace regex;

int main() {
    match_results results;
    string str( "The book cost $12.34" );
    rpattern pat( "\\$(\\d+)(\\.(\\d\\d))?" );
    // Match a dollar sign followed by one or more digits,
    // optionally followed by a period and two more digits.
    // The double-escapes are necessary to satisfy the compiler.
    match_results::backref_type br = pat.match( str, results );
    if( br.matched ) {
        cout << "match success!" << endl;
        cout << "price: " << br << endl;
    } else {
        cout << "match failed!" << endl;
    }
    return 0;
}

Program output will be:

match success!
price: $12.34
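
GRETA is no longer easy to obtain, so here is a rough equivalent of the match and substitute steps written against std::regex. It is a sketch under that substitution and not GRETA code: find_price and replace_prices are hypothetical names, and std::regex_replace replaces every occurrence by default, which may not be how a particular GRETA call is configured.

```cpp
#include <regex>
#include <string>

// The same pattern as the GRETA sample: a dollar sign, one or more digits,
// and optionally a period followed by two more digits.
static const std::regex kPricePattern(R"(\$(\d+)(\.(\d\d))?)");

// Rough analogue of rpattern::match(): returns the first price found, or "".
std::string find_price(const std::string& text) {
    std::smatch m;
    return std::regex_search(text, m, kPricePattern) ? m.str(0) : std::string();
}

// Rough analogue of rpattern::substitute(): replaces every price in `text`.
std::string replace_prices(const std::string& text, const std::string& replacement) {
    return std::regex_replace(text, kPricePattern, replacement);
}
```

A raw string literal (R"(...)") avoids the doubled backslashes that the GRETA sample needs in an ordinary string literal.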

See the GRETA documentation for details about the rpattern object and for how to customize the search strategy to improve efficiency.
Note: All declarations in the header file regexpr2.h are in the namespace regex. To use its objects and functions, either prefix them with "regex::" or add "using namespace regex;" beforehand. For simplicity, the "regex::" prefix is omitted in the sample code in this article. The author has built greta.lib; that file and regexpr2.h are the only two files needed to use GRETA for regular expression parsing.

A note on matching speed
Different regular expression engines are good at different kinds of patterns. As a benchmark, take the pattern "^([0-9]+)(\-|$)(.*)$" and the string "100-this is a line of ftp response which contains a message string": GRETA matches it about 7 times faster than the Boost (http://www.boost.org) regular expression library, and it is also faster than ATL7's CAtlRegExp! The Boost.Regex documentation includes performance results for many patterns. Comparing against those results, I found that GRETA performs about the same as Boost.Regex in most cases, and slightly better when compiled with Visual Studio .NET 2003.
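
Speed claims like these depend heavily on the pattern, the compiler, and the library version, so they are worth re-measuring locally. The harness below is a minimal sketch of how such a comparison might be timed. It measures std::regex only (as a placeholder for whichever engine is under test), and the struct and function names are assumptions for this example.

```cpp
#include <chrono>
#include <regex>
#include <string>

struct BenchmarkResult {
    int matches;             // how many of the iterations matched
    long long microseconds;  // wall-clock time for all iterations
};

// Runs the same full-string match `iterations` times and times the loop.
BenchmarkResult time_matches(const std::string& pattern,
                             const std::string& input,
                             int iterations) {
    const std::regex re(pattern);  // compile once, outside the timed loop
    std::smatch m;
    int matches = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        if (std::regex_match(input, m, re)) ++matches;
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    return BenchmarkResult{matches, static_cast<long long>(elapsed.count())};
}
```

Checking the match count alongside the elapsed time guards against accidentally timing a pattern that never matches. A fair comparison would run each engine on identical input with optimizations enabled.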

Boost.Regex

Boost provides boost::basic_regex to support regular expressions. The design of boost::basic_regex is very similar to that of std::basic_string:

namespace boost{
template <class charT, class traits = regex_traits<charT>, class Allocator = std::allocator<charT> >
class basic_regex;
typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;
}

The documentation that ships with the Boost.Regex library is extensive, and its examples are even better. For instance, two of the example programs need only a small amount of code to syntax-highlight a C++ file and generate the corresponding HTML (converts a C++ file to syntax highlighted HTML). The following example splits a string into tokens (split a string into tokens).

#include <list>
#include <string>
#include <boost/regex.hpp>

unsigned tokenise(std::list<std::string>& l, std::string& s)
{
   return boost::regex_split(std::back_inserter(l), s);
}

#include <iostream>
using namespace std;

#if defined(BOOST_MSVC) || (defined(__BORLANDC__) && (__BORLANDC__ == 0x550))
// problem with std::getline under MSVC6sp3
istream& getline(istream& is, std::string& s)
{
   s.erase();
   char c = is.get();
   while(c != '\n')
   {
      s.append(1, c);
      c = is.get();
   }
   return is;
}
#endif

int main(int argc, char* [])
{
   string s;
   list<string> l;
   do{
      if(argc == 1)
      {
         cout << "Enter text to split (or \"quit\" to exit): ";
         getline(cin, s);
         if(s == "quit") break;
      }
      else
         s = "This is a string of tokens";
      unsigned result = tokenise(l, s);
      cout << result << " tokens found" << endl;
      cout << "The remaining text is: \"" << s << "\"" << endl;
      while(l.size())
      {
         s = *(l.begin());
         l.pop_front();
         cout << s << endl;
      }
   }while(argc == 1);
   return 0;
}
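
The same tokenising idea can be sketched without Boost using std::sregex_iterator from the standard library. This is an approximation of the example above and not a drop-in replacement: unlike boost::regex_split, it leaves the input string untouched, and treating a token as a run of non-whitespace characters is an assumption of this sketch.

```cpp
#include <list>
#include <regex>
#include <string>

// Appends the whitespace-separated tokens of `s` to `l` and returns
// how many tokens were added. The input string is not modified.
unsigned tokenise(std::list<std::string>& l, const std::string& s) {
    static const std::regex token(R"(\S+)");  // a run of non-whitespace
    unsigned count = 0;
    for (std::sregex_iterator it(s.begin(), s.end(), token), end; it != end; ++it) {
        l.push_back(it->str());
        ++count;
    }
    return count;
}
```

Applied to "This is a string of tokens", the function appends six entries to the list.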

http://topic.csdn.net/t/20040818/10/3285376.html
ASP.NET fully supports regular expression processing. Regular expressions provide an advanced, though not intuitive, way to match and process strings. Anyone who has used regular expressions knows that they are very powerful but not so easy to learn.

For example:

^.+@.+\..+$

This valid but hard-to-read code is enough to give some programmers a headache (me included) or make them give up on regular expressions altogether. I believe that after reading this tutorial you will be able to understand what this code means.

Basic pattern matching

Everything starts from the basics. Patterns are the most basic element of regular expressions: a pattern is a set of characters that describes the features of a string. A pattern can be very simple, made up of ordinary characters, or very complex, often using special characters to represent a range of characters, repetition, or context. For example:

^once

This pattern contains the special character ^, which means the pattern matches only strings that begin with once. For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as ^ marks the beginning, $ matches strings that end with the given pattern.

bucket$

This pattern matches "Who kept all of his cash in a bucket" but not "buckets". Used together, ^ and $ denote an exact match (the string is identical to the pattern). For example:

^bucket$

matches only the string "bucket". If a pattern contains neither ^ nor $, it matches any string that contains the pattern. For example, the pattern

once

matches the string

There once was a man from NewYork
Who kept all of his cash in a bucket.
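
The effect of the two anchors can be checked with a few lines of code. The sketch below uses std::regex as an illustrative engine (the tutorial itself is not tied to C++), and the helper name found is invented for this example.

```cpp
#include <regex>
#include <string>

// True when `pattern` matches somewhere in `text`; ^ and $ inside the
// pattern restrict where that match may start or end.
bool found(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}
```

With this helper, found("^once", "once upon a time") is true, found("^once", "There once was a man from NewYork") is false, and found("^bucket$", "bucket") is true only because the string and the pattern are identical.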

In this pattern the letters (o-n-c-e) are literal characters, that is, they stand for the letters themselves; digits work the same way. Other, slightly more complicated characters, such as punctuation and whitespace (spaces, tabs, and so on), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for a tab is \t. So to check whether a string begins with a tab character, use this pattern:

^\t

Similarly, \n stands for a newline and \r for a carriage return. Other special symbols can be written with a backslash in front: the backslash itself is written \\, the period is written \., and so on.
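
When a pattern is embedded in source code, each backslash usually has to be escaped once more for the host language. The following sketch shows this for C++ string literals with std::regex; the three function names are hypothetical.

```cpp
#include <regex>
#include <string>

// True when the string begins with a tab character.
bool starts_with_tab(const std::string& text) {
    // The C++ literal "^\\t" reaches the regex engine as ^\t
    return std::regex_search(text, std::regex("^\\t"));
}

// True when the string contains a literal period.
bool contains_period(const std::string& text) {
    // The regex \. is written "\\." in a C++ literal.
    return std::regex_search(text, std::regex("\\."));
}

// True when the string contains a literal backslash.
bool contains_backslash(const std::string& text) {
    // The regex \\ (one escaped backslash) is written "\\\\" in a C++ literal.
    return std::regex_search(text, std::regex("\\\\"));
}
```

An unescaped period would match any character, which is why contains_period has to use \. to look for the period itself.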

Character clusters

In Internet programs, regular expressions are usually used to validate user input. When a user submits a form, checking whether the phone number, address, email address, or credit card number that was entered is valid cannot be done with ordinary literal characters alone.

So we need a more flexible way to describe the pattern we want: the character cluster. To build a character cluster that represents all vowels, put all the vowel characters inside square brackets:

[AaEeIiOoUu]

This pattern matches any vowel, but it represents only a single character. A hyphen can be used to indicate a range of characters, for example:

[a-z] // matches all lowercase letters
[A-Z] // matches all uppercase letters
[a-zA-Z] // matches all letters
[0-9] // matches all digits
[0-9\.\-] // matches all digits, periods, and minus signs
[ \f\r\t\n] // matches all whitespace characters

Again, each of these stands for a single character, which is an important point. To match a string made up of one lowercase letter followed by one digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:

^[a-z][0-9]$

Although [a-z] covers a range of 26 letters, here it can match only a string whose first character is a lowercase letter.
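
To confirm that each cluster consumes exactly one character, the letter-plus-digit pattern can be tried against the sample strings from the text. This is a small sketch using std::regex; is_letter_digit is a name chosen for the example.

```cpp
#include <regex>
#include <string>

// True for exactly one lowercase letter followed by exactly one digit.
bool is_letter_digit(const std::string& text) {
    static const std::regex re("^[a-z][0-9]$");
    return std::regex_match(text, re);
}
```

The strings "z2", "t6", and "g7" pass, while "ab2", "r2d3", and "b52" fail because they are longer than the two characters the pattern allows.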

Earlier, ^ denoted the start of a string, but it has another meaning. Inside square brackets, ^ means "not" or "exclude" and is often used to rule out a character. Continuing the previous example, suppose the first character must not be a digit:

^[^0-9][0-9]$

This pattern matches "&5", "g7", and "-2", but not "12" or "66". Here are some more examples that exclude specific characters:

[^a-z] // all characters except lowercase letters
[^\\\/\^] // all characters except (\), (/), and (^)
[^\"\'] // all characters except double quotation marks (") and single quotation marks (')
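
The two roles of ^ (anchor outside the brackets, negation inside them) can appear in the same pattern, which the sketch below exercises with std::regex. Both function names are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// True for a two-character string: any non-digit followed by a digit.
bool is_nondigit_then_digit(const std::string& text) {
    // The first ^ anchors the match; the ^ inside [ ] negates the cluster.
    static const std::regex re("^[^0-9][0-9]$");
    return std::regex_match(text, re);
}

// True when the string contains no lowercase letters at all.
bool has_no_lowercase(const std::string& text) {
    static const std::regex re("[^a-z]*");
    return std::regex_match(text, re);
}
```

The first function accepts "&5", "g7", and "-2" and rejects "12" and "66", matching the examples in the text.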


The special character "." (dot, period) stands for any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any character other than a newline. The pattern "." matches any string except the empty string and a string that contains only a newline.
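
The newline exception is easy to overlook, so a quick check may help. The sketch below assumes std::regex with its default ECMAScript grammar, in which the dot also refuses to match a line terminator; the function name is made up.

```cpp
#include <regex>
#include <string>

// True for a two-character string whose second character is '5' and
// whose first character is anything except a line terminator.
bool is_any_char_then_five(const std::string& text) {
    static const std::regex re("^.5$");
    return std::regex_match(text, re);
}
```

The inputs "a5" and "55" match, whereas "5" (too short), "ab5" (too long), and a newline followed by 5 do not.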

PHP regular expressions have some built-in general-purpose character clusters, listed below:

Character cluster	Meaning
[[:alpha:]]	Any letter
[[:digit:]]	Any digit
[[:alnum:]]	Any letter or digit
[[:space:]]	Any whitespace character
[[:upper:]]	Any uppercase letter
[[:lower:]]	Any lowercase letter
[[:punct:]]	Any punctuation mark
[[:xdigit:]]	Any hexadecimal digit, equivalent to [0-9a-fA-F]
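
These named clusters are not specific to PHP. As an illustration, std::regex also accepts them inside a bracket expression, which the following sketch relies on; the two function names are hypothetical.

```cpp
#include <regex>
#include <string>

// True when the string consists only of hexadecimal digits.
bool is_hex_number(const std::string& text) {
    static const std::regex re("[[:xdigit:]]+");
    return std::regex_match(text, re);
}

// True when the string consists only of letters and digits.
bool is_alnum_only(const std::string& text) {
    static const std::regex re("[[:alnum:]]+");
    return std::regex_match(text, re);
}
```

Note the doubled brackets: the outer pair opens the cluster and the inner [:name:] selects the named class, so other characters can be added alongside it, as in [[:alpha:]_].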

Determining repeated occurrences

So far you know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a group of digits consists of several single digits. Curly braces ({}) after a character or character cluster specify how many times the preceding content repeats.

Character cluster	Meaning
^[a-zA-Z_]$	Any single letter or underscore
^[[:alpha:]]{3}$	Any three-letter word
^a$	The letter a
^a{4}$	aaaa
^a{2,4}$	aa, aaa, or aaaa
^a{1,3}$	a, aa, or aaa
^a{2,}$	A string of two or more a's
^a{2,}	For example, aardvark and aaab, but not apple
a{2,}	For example, baad and aaa, but not Nantucket
\t{2}	Two tabs
.{2}	Any two characters

These examples show three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster occurs exactly x times". A number followed by a comma, {x,}, means "the preceding content occurs x or more times". Two comma-separated numbers, {x,y}, mean "the preceding content occurs at least x times but no more than y times". We can extend these patterns to more words or numbers:

^[a-zA-Z0-9_]{1,}$ // all strings made up of one or more letters, digits, or underscores
^[0-9]{1,}$ // all non-negative integers
^\-{0,1}[0-9]{1,}$ // all integers
^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ // all decimals

The last example is hard to follow, isn't it? Read it this way: it starts (^) with an optional minus sign (\-{0,1}), followed by zero or more digits ([0-9]{0,}), then an optional decimal point (\.{0,1}), then zero or more digits again ([0-9]{0,}), and nothing else after that ($). A simpler way to write this is described next.
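
The three forms of the curly-brace quantifier can be tried out directly. The sketch below uses std::regex, which supports {x}, {x,}, and {x,y} in the same way. The function names are invented, and the minus sign is written without a backslash because it needs no escaping outside a character cluster.

```cpp
#include <regex>
#include <string>

// {x,y}: between two and four a's, and nothing else.
bool is_two_to_four_a(const std::string& text) {
    return std::regex_match(text, std::regex("^a{2,4}$"));
}

// {x,} anchored at the start: the string begins with two or more a's.
bool starts_with_two_or_more_a(const std::string& text) {
    return std::regex_search(text, std::regex("^a{2,}"));
}

// {x,} without anchors: two or more consecutive a's anywhere.
bool contains_two_or_more_a(const std::string& text) {
    return std::regex_search(text, std::regex("a{2,}"));
}

// Optional minus sign followed by one or more digits.
bool is_integer(const std::string& text) {
    return std::regex_match(text, std::regex("^-{0,1}[0-9]{1,}$"));
}
```

For example, starts_with_two_or_more_a accepts "aardvark" but not "apple", and contains_two_or_more_a accepts "baad" but not "Nantucket".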

The special character "?" is equivalent to {0,1}: both mean "zero or one of the preceding content", or "the preceding content is optional". So the example above can be simplified to:

^\-?[0-9]{0,}\.?[0-9]{0,}$

The special character "*" is equivalent to {0,}: both mean "zero or more of the preceding content". Finally, the character "+" is equivalent to {1,}, meaning "one or more of the preceding content". So the four examples above can be written as:

^[a-zA-Z0-9_]+$ // all strings made up of one or more letters, digits, or underscores
^[0-9]+$ // all non-negative integers
^\-?[0-9]+$ // all integers
^\-?[0-9]*\.?[0-9]*$ // all decimals

Of course, this does not technically reduce the complexity of regular expressions, but it can make them easier to read.
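
One way to convince yourself that the shorthand and the brace forms are interchangeable is to run both spellings of the decimal pattern over the same inputs. The sketch below does this with std::regex; the function names are assumptions for the example.

```cpp
#include <regex>
#include <string>

// Decimal pattern written with curly-brace quantifiers only.
bool is_decimal_long(const std::string& text) {
    static const std::regex re("^-{0,1}[0-9]{0,}\\.{0,1}[0-9]{0,}$");
    return std::regex_match(text, re);
}

// The same pattern written with ?, * in place of {0,1}, {0,}.
bool is_decimal_short(const std::string& text) {
    static const std::regex re("^-?[0-9]*\\.?[0-9]*$");
    return std::regex_match(text, re);
}
```

Both versions accept "-3.14" and reject "1.2.3". Because every part of the pattern is optional, both also accept the empty string and a lone "-", so a stricter pattern that requires at least one digit may be preferable for real validation.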
 
BOOL CErrAnalyzer::IsExistErr()
{
    if (m_pStream == NULL)
    {
        cout << "File not open" << endl;
        return FALSE;
    }

    CAtlRegExp<> reUrl;
    REParseError status = reUrl.Parse(REGEX_PROJECTNAME);
    if (REPARSE_ERROR_OK != status)
    {
        cout << "Parse error." << endl;
        return FALSE;
    }
    CString bufRead;
    // Read the file line by line until the first line that matches.
    while (ReadString(bufRead) != FALSE)
    {
        CAtlREMatchContext<> mcUrl;
        mcUrl.m_uNumGroups = 0;

        if (!reUrl.Match(bufRead, &mcUrl))
        {
            continue;
        }

        // Print every match group of the matching line.
        for (UINT nGroupIndex = 0; nGroupIndex < mcUrl.m_uNumGroups; ++nGroupIndex)
        {
            const CAtlREMatchContext<>::RECHAR* szStart = 0;
            const CAtlREMatchContext<>::RECHAR* szEnd = 0;
            mcUrl.GetMatch(nGroupIndex, &szStart, &szEnd);

            ptrdiff_t nLength = szEnd - szStart;
            // %.*s expects an int precision, so the length is cast explicitly.
            printf("%u: \"%.*s\"\n", nGroupIndex, static_cast<int>(nLength), szStart);
        }
        break;
    }
    return TRUE;
}
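
The function above scans a file line by line and reports the match groups of the first line that matches. A portable sketch of the same idea, using std::regex and std::istream instead of CAtlRegExp and the class's own ReadString, might look as follows. The function name and the choice to return the groups (instead of printing them) are assumptions of this sketch.

```cpp
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Reads `in` line by line and returns the match of the first line that
// matches `re`: element 0 is the whole match, the rest are the capture
// groups. Returns an empty vector if no line matches.
std::vector<std::string> first_line_groups(std::istream& in, const std::regex& re) {
    std::string line;
    std::smatch m;
    while (std::getline(in, line)) {
        if (!std::regex_search(line, m, re)) continue;
        std::vector<std::string> groups;
        for (std::size_t i = 0; i < m.size(); ++i) {
            groups.push_back(m[i].str());
        }
        return groups;  // stop at the first matching line
    }
    return {};
}
```

Returning an empty vector when nothing matches lets the caller distinguish "no line matched" from "a line matched", which the BOOL result of the original function does not do.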
This article comes from the CSDN blog. When reproducing it, please cite the source: http://blog.csdn.net/lyl_98/archive/2006/07/04/874083.aspx

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
