C++ Boost regular expression usage (from: Wu Biyu cnblog)


First, let's look at a classic example from the web.

#include <cstdlib>
#include <iostream>
#include <string>
#include <boost/regex.hpp>

using namespace std;

// Two capture groups: the column name and the table name.
boost::regex expression("^select ([a-zA-Z]*) from ([a-zA-Z]*)");

int main(int argc, char* argv[])
{
    std::string in;
    boost::cmatch what;
    cout << "enter test string" << endl;
    getline(cin, in);
    // regex_match succeeds only if the whole input matches the expression.
    if (boost::regex_match(in.c_str(), what, expression))
    {
        // what[0] is the whole match; what[1], what[2] are the groups.
        for (std::size_t i = 0; i < what.size(); i++)
            cout << "str:" << what[i].str() << endl;
    }
    else
    {
        cout << "Error Input" << endl;
    }
    return 0;
}

Results

Input: select name from table

Output: str:select name from table

str:name

str:table

The string is matched and the parts we asked for are extracted.
This is useful when processing a large number of rules stored in text form, because it is both flexible and general.

Let's start with an introduction to the regex library in Boost.

The default regular expression syntax of boost::regex is Perl syntax
boost::regex supports Perl regular expressions, POSIX extended regular expressions, and POSIX basic regular expressions. The default is Perl syntax; to use either of the other two, you must specify it explicitly when constructing the expression.

For example:
// e1 is a case-sensitive Perl regular expression.
// Since Perl is the default option, there's no need to specify the syntax explicitly:
boost::regex e1(my_expression);
// e2 is a case-insensitive Perl regular expression:
boost::regex e2(my_expression, boost::regex::perl|boost::regex::icase);

Perl regular expression syntax:

. Any character; does not match the null character when the match_not_dot_null flag is used, and does not match newline characters when match_not_dot_newline is used

^ Match the start of a line
$ Match the end of a line
* Repeat zero or more times; for example, a*b matches b, ab, aab, aaaaaaab
+ Repeat one or more times; for example, a+b matches ab, aab, aaaaaaaab, but not b
? Zero or one time; for example, ca?b matches cb and cab but not caab
a{n} Match the character 'a' repeated exactly n times
a{n,} Match 'a' repeated n or more times
a{n,m} Match 'a' repeated n to m times (inclusive)

*? Match the previous atom zero or more times, as few times as possible
+? Match the previous atom one or more times, as few times as possible
?? Match the previous atom zero or one time, as few times as possible
{n,}? Match the previous atom n or more times, as few times as possible
{n,m}? Match the previous atom n to m times (inclusive), as few times as possible

| Alternation; for example, ab(d|ef) matches abd or abef
[] Character set; for example, [abc] matches any single one of the characters 'a', 'b', 'c'
[a-d] A range, representing a, b, c, d
^ Negation inside a set; for example, [^a-c] matches any character other than a to c


boost::regex support for Unicode
boost::regex uses ICU to support Unicode and its encodings, which requires ICU to be installed and its directory to be specified when Boost is built. Otherwise, the compiled boost::regex does not support Unicode encodings. boost::wregex supports searching Unicode text held in wide characters; use boost::u32regex if you want to search UTF-8, UTF-16, or UTF-32 encoded strings. Note that boost::wregex only handles wide-character Unicode text and cannot handle the UTF encodings.

How to ignore case when searching
To ignore case (that is, to search case-insensitively), use the expression option boost::regex::icase, for example: boost::regex e2(my_expression, boost::regex::perl|boost::regex::icase);


Template classes:
• basic_regex: a class that holds a regular expression.
• sub_match: inherits from pair<iterator, iterator> (an iterator pair) and represents one matched result.
• match_results: a container of sub_match that represents all the results of a search or match algorithm, similar to vector<sub_match>.
Algorithms:
• regex_match: matching algorithm; tests whether a whole string matches a regular expression and returns the result through match_results.
• regex_search: search algorithm; finds a substring that matches the regular expression and returns the result through match_results.
• regex_replace: replacement algorithm; finds all substrings that match the regular expression and replaces them according to a format string.
Iterators:
• regex_iterator: enumerates all matches in a string; dereferencing a regex_iterator yields a match_results.
• regex_token_iterator: enumerates all matches in a string; dereferencing a regex_token_iterator yields a sub_match.


Details

• basic_regex

template <class charT, class traits = regex_traits<charT>, class Allocator = std::allocator<charT> >
class basic_regex;

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

As the declaration shows, charT is the character type of the regular expression, and regex and wregex are two specializations of basic_regex.
Note that the character type of the regular expression must be the same as the character type of the string being matched. For example, you cannot pass a string together with a wregex to the regex_search algorithm; it must be either string and regex, or wstring and wregex.


Constructors:
basic_regex re
Creates an empty regular expression.

basic_regex re(str)
The regular expression is str; str can be a basic_string or a null-terminated char* string.

basic_regex re(re2)
Copy construction.

basic_regex re(str, flag)
The regular expression is str, with syntax option flag; flag is a combination of constants. For example, icase makes matching ignore case.

basic_regex re(beg, end)
Constructs the regular expression from an iterator range. The iterators can be basic_string iterators or const char*.

basic_regex re(beg, end, flag)
Constructs the regular expression from an iterator range; flag is a syntax option.


Common syntax options:
regex_constants::normal
The default syntax, the same as the regular expressions in ECMAScript and JavaScript.

regex_constants::icase
Ignores case when matching.

regex_constants::nosubs
Does not save matched sub-expressions in the match_results structure.

regex_constants::collate
Takes the locale into account when matching ranges such as [a-b].

Syntax options are combined with the bitwise OR operator. These options are also defined in basic_regex, so you can write regex::normal, which is a few letters shorter than regex_constants::normal.


The assign member function:
re.assign(re2)
Copies a regular expression.

re.assign(str)
The regular expression becomes str.

re.assign(str, flag)
The regular expression becomes str, with syntax option flag; flag is a combination of constants.

re.assign(beg, end)
Builds the regular expression from an iterator range.

re.assign(beg, end, flag)
Builds the regular expression from an iterator range; flag is a syntax option.

In fact, many uses of basic_regex are very similar to those of basic_string, because a regular expression is also a string.


Iterators:
regex::iterator it
A constant iterator type, i.e. const_iterator.

re.begin()
Returns a constant iterator (const_iterator).

re.end()
Returns the end iterator. There are no reverse iterators.

For example: copy(re.begin(), re.end(), ostream_iterator<char>(cout));

Other:
re.size()
The length of the regular expression, i.e. the length of str.

re.max_size()
The maximum length of a regular expression.

re.empty()
Whether the length is 0.

re.mark_count()
Returns the number of groups in the regular expression, usually the number of parenthesis pairs +1. Boost.Regex uses parentheses for grouping; see the algorithm section below for details.

re.flags()
Returns the syntax options.

cout << re
Streams the regular expression; equivalent to the copy algorithm in the example above.

swap
Available both as a member function and as a global function.

re.imbue(loc)
Sets the locale to loc and returns the previous locale.

re.getloc()
Gets the current locale.

==, !=, <, <=, >, >=
Overloaded comparison operators.


• sub_match
sub_match is an iterator pair that represents one match of a regular expression.

template <class BidirectionalIterator>
class sub_match : public std::pair<BidirectionalIterator, BidirectionalIterator>;

We rarely need to declare a sub_match variable explicitly (Boost does provide typedefs such as csub_match and ssub_match for when one is needed). sub_match is mainly used as the element type of match_results; for example, operator[] and the iterators of match_results return a specialization of sub_match.
Its only member variable of its own:
bool matched
Whether the match succeeded.

Member functions:
length()
Returns the length, that is, the distance between the two iterators.

operator basic_string<value_type>()
Implicit conversion to basic_string.

str()
Explicit conversion to basic_string.

There are also many comparison operator overloads, which are not covered here.

• match_results
match_results is effectively a container of sub_match and represents the result returned by a regular expression algorithm.

template <class BidirectionalIterator,
          class Allocator = allocator<sub_match<BidirectionalIterator> > >
class match_results;

typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<string::const_iterator> smatch;
typedef match_results<wstring::const_iterator> wsmatch;

The declaration is simple, and there are four typedefs that can be used directly. Note, however, that string and char* strings use different match_results types.


Member functions:
m.size()
The number of elements.

m.max_size()
The maximum number of elements.

m.empty()
Whether the number of elements is 0.

m[n]
The nth element, i.e. a sub_match.

m.prefix()
Returns the sub_match that represents the prefix, i.e. the range from the beginning of the string to the beginning of the match.

m.suffix()
Returns the sub_match that represents the suffix, i.e. the range from the end of the match to the end of the string.

m.length(n)
Returns the length of the nth element, i.e. m[n].length().

m.position(n)
Returns the position of the nth element.

cout << m
Stream output of the whole match, equivalent to cout << m[0], because element 0 is the entire match; see the explanation below for details.

m.format(fmtstr)
Formats the result using a format string and returns a string.

m.format(fmtstr, flags)
Formats the result using a format string and returns a string; flags are the formatting options.

m.format(out, fmtstr)
As above, but writes the result through an output iterator.

m.format(out, fmtstr, flags)
As above, but writes the result through an output iterator.


Iterators:
smatch::iterator
Iterator type; a constant iterator.

smatch::const_iterator
As above.

m.begin()
Returns a constant iterator.

m.end()

regex_match

The regex_match algorithm tests whether a string matches a regular expression exactly. Let's look at how regex_match is used:

if (regex_match(str, m, re)) {...}

str is a string; it can be a string, wstring, char* or wchar_t*.

m is a match_results, passed by reference, which receives the match result. The type of m must correspond to the type of str: smatch, wsmatch, cmatch or wcmatch for a str of type string, wstring, char* or wchar_t* respectively.

re is a regular expression, usually a regex or wregex.

The types of str, m and re correspond as follows:

str type    m type                                            re type
string      smatch (match_results<string::const_iterator>)    regex (basic_regex<char>)
wstring     wsmatch (match_results<wstring::const_iterator>)  wregex (basic_regex<wchar_t>)
char*       cmatch (match_results<const char*>)               regex (basic_regex<char>)
wchar_t*    wcmatch (match_results<const wchar_t*>)           wregex (basic_regex<wchar_t>)

The return value of the function indicates whether the string matches the regular expression exactly. When it returns true, m holds the result of the match; when it returns false, m is undefined.

Let's look at the contents of m when the function returns true.

m.size() == re.mark_count()

Recall what re.mark_count() returns. The earlier section on basic_regex said that it returns the "number of groups" in the regular expression without explaining further, so here is the detail.

In Boost's regex, such a "group" is called a sub-expression. A sub-expression is a part of the regular expression enclosed in parentheses, and the regular expression as a whole is also a sub-expression, so re.mark_count() equals the number of parenthesis pairs +1. (This is the behaviour the author describes; the current Boost documentation defines mark_count() as the number of marked sub-expressions only, in which case m.size() == re.mark_count() + 1, so check the value on your Boost version.)

m.prefix() and m.suffix()

Both return a sub_match (equivalent to an iterator pair). After regex_match, both returned sub_match objects are empty, with the following values (sub_match inherits from pair, so it has the members first and second):

m.prefix().first == str.begin()

m.prefix().second == str.begin()

m.prefix().matched == false

m.suffix().first == str.end()

m.suffix().second == str.end()

m.suffix().matched == false

Because regex_match is an exact match, that is, the entire string matches the regular expression, the prefix and suffix are both empty.

m[0]

Returns element 0, the whole match. Because regex_match matches the entire string:

m[0].first == str.begin()

m[0].second == str.end()

m[0].matched == true

m[n], n < m.size()

Returns the match of the nth sub-expression.

m[n].matched indicates whether the nth sub-expression took part in the match. The regular expression as a whole can match while a sub-expression matches nothing or only the empty string; for example, "(a*)" can match an empty string.

m[n].first and m[n].second delimit the matched range. If the sub-expression did not match, both are str.end().

According to my tests, m[1], m[2], ..., m[n] follow the order of the left parentheses in the regular expression. For example, for the regular expression "((a)bc)d(efg)", if a string matches (the only possible string is "abcdefg"), then

m[0] == "abcdefg" (sub_match overloads the == operator so that it can be compared with a string)

m[1] == "abc"

m[2] == "a"

m[3] == "efg"

Other uses of regex_match

regex_match(str, re)

Only tests whether the string matches; no match result is needed.

regex_match(beg, end, re)

The input is an iterator range.

regex_match(beg, end, m, re)

Note that the type of m is match_results<iterator>.

regex_match(str, m, re, flag)

flag is the match option; the default is regex_constants::match_default.


regex_search

regex_search is used in basically the same way as regex_match.

if (regex_search(str, m, re)) {...}

regex_search does not require str to match re exactly; it is enough for some substring of str to match re. Therefore m.prefix() and m.suffix() are not necessarily empty.

regex_search matches from left to right and, with the default greedy quantifiers, tries to match as long a string as possible.
