Boost.Regex in Detail (Repost)

Source: Internet
Author: User

To use Boost.Regex, you need to include the header "boost/regex.hpp". Regex is one of the two libraries covered in this book that must be compiled separately (the other is Boost.Signals). You'll be glad to know that once you have built Boost, which is a one-liner at the command prompt, linking is automatic (for compilers under Windows), so you don't have to work out which library file to use.

The first thing you need to do is declare a variable of type basic_regex. This is one of the core classes of the library, and it is where the regular expression is stored (boost::regex is a typedef for basic_regex<char>). Creating such a variable is simple: just pass a string containing the regular expression you want to use to the constructor.

    boost::regex reg("(A.*)");

This regular expression has three interesting features. The first is that a subexpression is enclosed in parentheses, which makes it possible to refer to it later in the same regular expression, or to extract the text that it matched. We'll discuss this in detail later, so don't worry if you don't yet see what it's for. The second is the wildcard character, the dot. It has a special meaning in regular expressions: it matches any character. The last is the repeat *, called the Kleene star, which means that the preceding expression may be matched zero or more times. The regular expression is now ready to be used with an algorithm, like this:

    bool b=boost::regex_match(
      "This expression could match from A and beyond.",
      reg);

As you can see, you pass the string to be parsed and the regular expression to the algorithm regex_match. If the regular expression matches, the call returns true; otherwise it returns false. In this case the result is false, because regex_match returns true only when the whole input is matched by the regular expression. Do you see why? Look at the regular expression again. Its first character is an uppercase A, so a match can begin only at an A in the input. Part of the input, "A and beyond.", does match the expression, but that is not the whole input. Let's try another input string.

    bool b=boost::regex_match(
      "As this string starts with A, does it match?",
      reg);

This time, regex_match returns true. When the regular expression engine has matched the A, it goes on to see what should follow. In our regex, A is followed by the wildcard and a Kleene star, meaning that any character may be matched any number of times. The engine therefore consumes the rest of the input string, and the expression matches the whole input.
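
The whole-input rule of regex_match described above can be sketched as a small helper. This is a minimal sketch, not the author's code: the helper name whole_input_matches is hypothetical, and std::regex (whose interface follows Boost.Regex) is used through the alias rx so that nothing needs to be built separately. With Boost, the assumption is that including "boost/regex.hpp" and aliasing boost instead gives the same behavior.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: true only if the WHOLE input matches "(A.*)".
bool whole_input_matches(const std::string& input) {
    static const rx::regex reg("(A.*)");
    return rx::regex_match(input, reg);
}
```

Called with "This expression could match from A and beyond." the helper returns false, because the input does not start with A; called with "As this string starts with A, does it match?" it returns true.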

Next, let's look at how to use regexes and regex_match for data validation.

Validating input

Regular expressions are often used to validate the format of input data. Applications usually require input to conform to a certain structure. Consider an application that accepts input in the following format: "3 digits, a word, any character, 2 digits or the string "N/A", a space, then the first word again." Writing code by hand to validate this input is tedious and error prone, and formats like this tend to change; before you know it, another format must be supported, and your hand-crafted parser has to be modified and debugged again. Let's write a regular expression that validates the input instead. First, we need an expression that matches 3 digits. For digits there is a special shorthand, \d. To say that it is repeated 3 times, we use a specific kind of repeat called the bounds operator, which is enclosed in curly braces. Putting the two together gives the beginning of our regular expression.

    boost::regex reg("\\d{3}");

Note that we need to escape the escape character (\): in our string literal, the shorthand \d becomes \\d. This is because the compiler consumes the first backslash as an escape character of its own, so we must escape the backslash for it to reach the regular expression.
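
The double backslash can be compared with a C++11 raw string literal, which is not mentioned in the original text but avoids the compiler-level escaping altogether. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and std::regex is used through the alias rx in place of Boost.Regex (with Boost, include "boost/regex.hpp" and alias boost instead).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: the compiler turns "\\d" into the two characters \d.
bool is_three_digits_escaped(const std::string& s) {
    static const rx::regex reg("\\d{3}");
    return rx::regex_match(s, reg);
}

// Hypothetical helper: a raw string literal passes \d through unchanged.
bool is_three_digits_raw(const std::string& s) {
    static const rx::regex reg(R"(\d{3})");
    return rx::regex_match(s, reg);
}
```

Both helpers hold the same regular expression, so they accept "123" and reject "12a" and "1234".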

Next, we need a way to define a word, that is, a sequence of letters that ends at the first character that is not a letter. There is more than one way to do this; we will use two features of regular expressions, character classes (also known as character sets) and ranges. A character class is an expression enclosed in square brackets. For example, a character class that matches any one of the characters a, b, and c is written [abc]. Using a range to say the same thing, we write [a-c]. To write a character class that contains all letters, we could go a little crazy and write [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ], but there is no need; we can use ranges instead: [a-zA-Z]. Note that a range like this depends on the current locale if the basic_regex::collate flag is turned on for the regular expression. With these tools and the repeat +, which means that the preceding expression may be repeated but must occur at least once, we can now describe a word.

    boost::regex reg("[a-zA-Z]+");

The above regular expression works, but because words are so common, there is a simpler way to write one: \w. This shorthand matches all word characters, not just the ASCII ones, so it is not only shorter but also better suited to internationalized environments. Note that \w also matches digits and the underscore, so it is broader than [a-zA-Z]. The next item is any character, which we already know is written as a dot.

    boost::regex reg(".");
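
The difference between the character class [a-zA-Z]+ and the shorthand \w+ can be sketched as follows. The helper names are hypothetical, and the sketch assumes std::regex (through the alias rx) as a stand-in for Boost.Regex; with Boost, include "boost/regex.hpp" and alias boost instead.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: one or more ASCII letters, nothing else.
bool is_letters_only(const std::string& s) {
    static const rx::regex reg("[a-zA-Z]+");
    return rx::regex_match(s, reg);
}

// Hypothetical helper: one or more word characters
// (letters, digits, and the underscore).
bool is_word_chars(const std::string& s) {
    static const rx::regex reg("\\w+");
    return rx::regex_match(s, reg);
}
```

For example, "Hello42" is rejected by is_letters_only but accepted by is_word_chars, while "two words" is rejected by both because of the space.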

Then come 2 digits or the string "N/A". To match that, we need a feature called alternatives. Alternatives match any one of two or more subexpressions, with the alternatives separated by |. Like this:

    boost::regex reg("(\\d{2}|N/A)");

Note that the expression is enclosed in parentheses to make sure that exactly the two expressions on either side of | are taken as the alternatives. Adding a space to the regular expression is easy; there is a shorthand for whitespace, \s. Putting all of this together gives the following expression:

    boost::regex reg("\\d{3}[a-zA-Z]+.(\\d{2}|N/A)\\s");
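
Why the parentheses around the alternatives matter can be sketched by comparing the expression with and without them. Without the parentheses, | splits the entire expression into two alternatives. The helper names are hypothetical, and std::regex (through the alias rx) is assumed as a stand-in for Boost.Regex; with Boost, include "boost/regex.hpp" and alias boost instead.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: the alternatives are limited to the group (\d{2}|N/A).
bool matches_grouped(const std::string& s) {
    static const rx::regex reg("\\d{3}[a-zA-Z]+.(\\d{2}|N/A)\\s");
    return rx::regex_match(s, reg);
}

// Hypothetical helper: without parentheses the alternatives become
// "\d{3}[a-zA-Z]+.\d{2}" and "N/A\s".
bool matches_ungrouped(const std::string& s) {
    static const rx::regex reg("\\d{3}[a-zA-Z]+.\\d{2}|N/A\\s");
    return rx::regex_match(s, reg);
}
```

With these definitions, "123Hello N/A " (note the trailing space) matches only the grouped expression, while the bare "N/A " matches only the ungrouped one.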

Now things get a little more complicated. We need a way to verify that the next word in the input is the same as the first word (that is, the word captured by the expression [a-zA-Z]+). The key is to use a back reference, which is a reference to an earlier subexpression. To be able to refer to the expression [a-zA-Z]+, we must first enclose it in parentheses. That makes ([a-zA-Z]+) the first subexpression in our regular expression, so we can create a back reference to it using the index 1.

That gives us the complete regular expression for "3 digits, a word, any character, 2 digits or the string "N/A", a space, then the first word again":

    boost::regex reg("\\d{3}([a-zA-Z]+).(\\d{2}|N/A)\\s\\1");

Well done. Here is a simple program that uses this expression with the algorithm regex_match to validate two input strings.

    #include <iostream>
    #include <cassert>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      // 3 digits, a word, any character, 2 digits or "N/A",
      // a space, then the first word again
      boost::regex reg("\\d{3}([a-zA-Z]+).(\\d{2}|N/A)\\s\\1");

      std::string correct="123Hello N/A Hello";
      std::string incorrect="123Hello 12 hello";

      assert(boost::regex_match(correct,reg)==true);
      assert(boost::regex_match(incorrect,reg)==false);
    }

The first string, 123Hello N/A Hello, is correct: 123 is 3 digits, Hello is a word, followed by any character (a space), then N/A and another space, and finally the word Hello repeated. The second string is incorrect because the word Hello is not repeated exactly. By default, regular expressions are case sensitive, so a back reference does not match a word that differs only in case.
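
Besides validating the input, the parenthesized subexpression also lets us take out the text that it matched. The following is a minimal sketch of that idea and not part of the original program: the helper name repeated_word is hypothetical, and std::regex (through the alias rx) is assumed as a stand-in for Boost.Regex; with Boost, include "boost/regex.hpp" and alias boost instead.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: returns the word captured by subexpression 1,
// or an empty string if the input does not have the required format.
std::string repeated_word(const std::string& input) {
    static const rx::regex reg("\\d{3}([a-zA-Z]+).(\\d{2}|N/A)\\s\\1");
    rx::smatch m;
    if (rx::regex_match(input, m, reg))
        return m[1].str();   // m[0] is the whole match, m[1] the first group
    return std::string();
}
```

For "123Hello N/A Hello" the helper returns "Hello"; for an input whose second word differs in case, it returns an empty string.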

One key to writing regular expressions is decomposing the problem successfully. The final expression you just built is hard to read for the untrained eye. Broken down into smaller parts, however, it is not very complicated.
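
One way to keep that decomposition visible in code is to build the expression from named parts. This is only a sketch of the idea: the function build_format_regex and the variable names are hypothetical, and std::regex (through the alias rx) is assumed as a stand-in for Boost.Regex; with Boost, include "boost/regex.hpp" and alias boost instead.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: assembles the validation expression piece by piece.
rx::regex build_format_regex() {
    const std::string three_digits = "\\d{3}";
    const std::string word         = "([a-zA-Z]+)";    // subexpression 1
    const std::string any_char     = ".";
    const std::string two_or_na    = "(\\d{2}|N/A)";   // subexpression 2
    const std::string space        = "\\s";
    const std::string same_word    = "\\1";            // back reference to 1
    return rx::regex(three_digits + word + any_char +
                     two_or_na + space + same_word);
}
```

The assembled expression is the same string as the one written out in full above, so it accepts and rejects the same inputs.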

Searching

Now let's take a look at another Boost.Regex algorithm, regex_search. Unlike regex_match, regex_search does not require the whole input to match, only part of it. As an illustration, consider a programmer who suspects that he has forgotten to call delete once or twice in his program. Although he knows that such a simple test may not be conclusive, he decides to count the occurrences of new and delete to see whether the numbers agree. The regular expression is simple; we have two alternatives, new and delete.

    boost::regex reg("(new)|(delete)");

There are two reasons for enclosing the subexpressions in parentheses. One is to make it clear that the two groups are the alternatives. The other is that we want to refer to these subexpressions when calling regex_search, so that we can tell which alternative was matched. We use an overload of regex_search that also accepts an argument of type match_results. When regex_search performs a match, it reports the subexpressions that matched through an object of type match_results. The class template match_results is parameterized on the iterator type used for the input sequence.

    template <class Iterator,
      class Allocator=std::allocator<sub_match<Iterator> > >
        class match_results;

    typedef match_results<const char*> cmatch;
    typedef match_results<const wchar_t*> wcmatch;
    typedef match_results<std::string::const_iterator> smatch;
    typedef match_results<std::wstring::const_iterator> wsmatch;

We will use std::string, so take note of the typedef smatch, which is short for match_results<std::string::const_iterator>. If regex_search returns true, the match_results object passed by reference to the function contains the results for the matched subexpressions. Within match_results, an indexed sub_match represents each subexpression in the regular expression. Let's see how we can help our confused programmer count the calls to new and delete.

    boost::regex reg("(new)|(delete)");
    boost::smatch m;
    std::string s=
      "Calls to new must be followed by delete. "
      "Calling simply new results in a leak!";

    if (boost::regex_search(s,m,reg)) {
      // Did new match?
      if (m[1].matched)
        std::cout << "The expression (new) matched!\n";
      // Did delete match?
      if (m[2].matched)
        std::cout << "The expression (delete) matched!\n";
    }

The above program searches the input string for new or delete and reports which one it finds first. By passing an object of type smatch to regex_search, we get access to the details of how the algorithm succeeded. Our expression has two subexpressions, so we can get at the subexpression for new through index 1 of match_results. That gives us a sub_match instance, which has a Boolean member, matched, that tells us whether the subexpression took part in the match. So, for the input in the example above, the output is "The expression (new) matched!". Now there is some more work to do. You need to keep applying the regular expression to the rest of the input, and to do that you use another overload of regex_search, which accepts two iterators delimiting the character sequence to search. Because std::string is a container, it provides iterators. After each match, you must move the iterator that marks the start of the range to the end of the previous match. Finally, add two variables to count the occurrences of new and delete. Here is the complete program:

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      // Are the numbers of "new" and "delete" occurrences the same?
      boost::regex reg("(new)|(delete)");
      boost::smatch m;
      std::string s=
        "Calls to new must be followed by delete. "
        "Calling simply new results in a leak!";
      int new_counter=0;
      int delete_counter=0;
      std::string::const_iterator it=s.begin();
      std::string::const_iterator end=s.end();

      while (boost::regex_search(it,end,m,reg)) {
        // Was it new or delete?
        m[1].matched ? ++new_counter : ++delete_counter;
        // Continue searching after the end of this match
        it=m[0].second;
      }

      if (new_counter!=delete_counter)
        std::cout << "Leak detected!\n";
      else
        std::cout << "Seems ok...\n";
    }

Note that the program always sets the iterator it to m[0].second. match_results[0] returns a reference to the submatch for the whole regular expression, so we can be sure that the end of that match is the right place to start the next run of regex_search. Running the program prints "Leak detected!", because there are two occurrences of new and only one of delete. Of course, one variable could be deleted in two places, there could be calls to new[] and delete[], and so on.
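
The search loop can also be wrapped in a function that returns both counts. The sketch below is a variation, not the author's program: it adds \b word boundaries (not used in the original expression) so that words such as "renew" or "deleted" are not counted. The function name count_new_delete is hypothetical, and std::regex (through the alias rx) is assumed as a stand-in for Boost.Regex; with Boost, include "boost/regex.hpp" and alias boost instead.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: returns (number of "new", number of "delete").
std::pair<int, int> count_new_delete(const std::string& s) {
    // \b restricts the matches to whole words (an addition to the original).
    static const rx::regex reg("\\b(new)\\b|\\b(delete)\\b");
    rx::smatch m;
    int news = 0;
    int deletes = 0;
    std::string::const_iterator it = s.begin();
    const std::string::const_iterator end = s.end();
    while (rx::regex_search(it, end, m, reg)) {
        if (m[1].matched)
            ++news;
        else
            ++deletes;
        it = m[0].second;   // resume right after the whole match
    }
    return std::make_pair(news, deletes);
}
```

For the sample text used in the program above, the function returns the pair (2, 1).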

By now, you should have a good understanding of how subexpressions are grouped and used. It's time to move on to the last Boost.Regex algorithm, which is used to perform substitutions.

Replacing

The third algorithm in the regex family is regex_replace. As the name suggests, it performs text substitution. It searches through the whole input and finds all matches of the regular expression. For each match, the algorithm calls match_results::format and writes the result to an output iterator that is passed to the function.

In the introduction to this chapter, I gave the example of replacing the British spelling colour with the American spelling color. Making this change without regular expressions is tedious and error prone. The problem is that the case may differ, and that many words are affected, such as colourize. To solve the problem properly, we split the regular expression into three subexpressions.

    boost::regex reg("(Colo)(u)(r)",
      boost::regex::icase|boost::regex::perl);

We have isolated the letter to remove, u, in a subexpression of its own so that it is easy to drop in every match. Also note that this regular expression is case insensitive, which we achieve by passing the flag boost::regex::icase to the regex constructor. You must also pass any other flags that you want set. A common mistake when setting flags is to leave out the ones that regex turns on by default; if you don't set them, they are not turned on, so you must always list every flag you want.

When calling regex_replace, we supply a format string as an argument. The format string determines how the substitution is made. In the format string you can refer to the matched subexpressions, which is exactly what we need here. We want to keep the first and third matched subexpressions and drop the second (u). The expression $N refers to a matched subexpression, where N is the index of the subexpression. Our format string should therefore be "$1$3", which means that the replacement text is the first and third subexpressions. By referring to the matched subexpressions, we preserve whatever capitalization the matched text has, which would not be possible with a fixed replacement string. Here is a complete program that solves the problem.

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg("(Colo)(u)(r)",
        boost::regex::icase|boost::regex::perl);

      std::string s="Colour, colours, color, colourize";

      // Keep subexpressions 1 and 3; drop the u
      s=boost::regex_replace(s,reg,"$1$3");
      std::cout << s;
    }

The output of the program is "Color, colors, color, colorize". regex_replace is very handy for this kind of text substitution.
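
The substitution can be sketched as a reusable function, together with a variant that replaces only the first match by passing the flag format_first_only (a flag not discussed in the text above). The function names are hypothetical, and std::regex (through the alias rx) is assumed as a stand-in for Boost.Regex; with Boost, include "boost/regex.hpp" and alias boost instead. Only icase is passed here, on the assumption that the library's default grammar is acceptable.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: drops the u from every "colour", keeping the casing.
std::string americanize(const std::string& text) {
    static const rx::regex reg("(colo)(u)(r)", rx::regex::icase);
    return rx::regex_replace(text, reg, "$1$3");
}

// Hypothetical helper: same, but only the first match is replaced.
std::string americanize_first(const std::string& text) {
    static const rx::regex reg("(colo)(u)(r)", rx::regex::icase);
    return rx::regex_replace(text, reg, "$1$3",
                             rx::regex_constants::format_first_only);
}
```

Because the replacement is built from the matched subexpressions, "COLOUR Colour" becomes "COLOR Color" rather than being forced into one capitalization.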

A common user misunderstanding

The most common question I see about Boost.Regex concerns the semantics of regex_match. It is easy to forget that regex_match requires all of the input to match the given regular expression. As a result, users often expect the following code to yield true.

    boost::regex reg("\\d*");
    bool b=boost::regex_match("17 is prime",reg);

Rest assured that this call never results in a successful match. regex_match returns true only when all of the input is consumed. Almost all users who ask why this does not work should be using regex_search rather than regex_match.

    boost::regex reg("\\d*");
    bool b=boost::regex_search("17 is prime",reg);

This time the result is definitely true. It is worth noting that you can make regex_search behave like regex_match by using special buffer operators. \A matches the start of a buffer and \z matches the end of a buffer, so if you put \A first in your regular expression and \z last, regex_search behaves like regex_match: it must consume all of the input for the match to succeed. The following regular expression always requires all of the input to be matched, whether you use regex_match or regex_search.

    boost::regex reg("\\A\\d*\\z");

Keep in mind that this does not make regex_match unnecessary; on the contrary, regex_match states clearly the semantics we have just discussed: all of the input must be matched.
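
The difference between the two algorithms can be sketched with three small helpers. Two things are assumptions here and not taken from the text: the helper names are hypothetical, and std::regex (through the alias rx) stands in for Boost.Regex. Because the default std::regex grammar does not provide \A and \z, the anchored helper uses ^ and $ instead, which is assumed to be equivalent for single-line input such as the sample string.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: regex_match must consume all of the input.
bool match_digits(const std::string& s) {
    static const rx::regex reg("\\d*");
    return rx::regex_match(s, reg);
}

// Hypothetical helper: returns the text that regex_search actually matched.
std::string searched_digits(const std::string& s) {
    static const rx::regex reg("\\d*");
    rx::smatch m;
    if (rx::regex_search(s, m, reg))
        return m[0].str();
    return std::string("<no match>");
}

// Hypothetical helper: anchoring both ends makes regex_search demand
// the whole input (^ and $ used in place of \A and \z, see above).
bool search_anchored(const std::string& s) {
    static const rx::regex reg("^\\d*$");
    return rx::regex_search(s, reg);
}
```

For "17 is prime", match_digits returns false, searched_digits returns "17", and search_anchored returns false again.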

About repetition and greed

Another common source of confusion is the greediness of repeats. Some repeats, such as + and *, are greedy: they consume as much of the input as they can. It is not uncommon to see regular expressions like the following, written with the intent of capturing a two-digit number after a greedy repeat.

    boost::regex reg("(.*)(\\d{2})");

This regular expression is valid, but it might not match the subexpressions you expect. The expression .* swallows as much as it can, leaving the subsequent subexpression only what it is forced to give up. Here is an example that shows this behavior:

    #include <iostream>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg("(.*)(\\d{2})");
      boost::cmatch m;
      const char* text = "Note that I'm 31 years old, not 32.";
      if (boost::regex_search(text,m,reg)) {
        if (m[1].matched)
          std::cout << "(.*) matched: " << m[1].str() << '\n';
        if (m[2].matched)
          std::cout << "Found the age: " << m[2] << '\n';
      }
    }

In this program, we use another specialization of match_results, the type cmatch. It is a typedef for match_results<const char*>, and we must use it rather than the smatch used earlier because we are now calling regex_search with a character array rather than a std::string. What do you expect the output of the program to be? Typically, a user who is new to regular expressions first thinks that both m[1].matched and m[2].matched are true, and that the second subexpression yields "31". Next, after learning that greedy repeats consume as much input as possible, the user concludes that only the first subexpression matches, that is, that .* swallows all of the input. Finally, the user arrives at the right conclusion: both subexpressions match, but the second one matches the last possible sequence. The first subexpression matches "Note that I'm 31 years old, not " and the second matches "32".

So, what do you do when you want a repeat but need the first occurrence of the following subexpression? Use a non-greedy repeat. Appending ? to a repeat makes it non-greedy. The expression then tries to find the shortest possible match that still lets the rest of the expression match. So, to make the previous regular expression work as intended, we change it like this.

    boost::regex reg("(.*?)(\\d{2})");

If we change the program to use this regular expression, both m[1].matched and m[2].matched are true. The expression .*? consumes as little of the input as possible, so it stops just before the first 3, because that is all the expression needs in order to match. The first subexpression therefore matches "Note that I'm " and the second matches "31".
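
The two behaviors can be compared side by side with a small sketch. The function name split_at_age and its greedy parameter are hypothetical, smatch is used instead of cmatch because the input is a std::string, and std::regex (through the alias rx) is assumed as a stand-in for Boost.Regex; with Boost, include "boost/regex.hpp" and alias boost instead.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Hypothetical helper: returns (text of subexpression 1, text of
// subexpression 2) for the greedy or the non-greedy expression.
std::pair<std::string, std::string>
split_at_age(const std::string& text, bool greedy) {
    static const rx::regex greedy_reg("(.*)(\\d{2})");
    static const rx::regex lazy_reg("(.*?)(\\d{2})");
    rx::smatch m;
    if (rx::regex_search(text, m, greedy ? greedy_reg : lazy_reg))
        return std::make_pair(m[1].str(), m[2].str());
    return std::make_pair(std::string(), std::string());
}
```

For the sample sentence, the greedy expression reports the age "32" and the non-greedy one reports "31".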

A look at regex_iterator

We have seen how to process all of the input with repeated calls to regex_search, but a more elegant way is to use regex_iterator. This iterator type enumerates all matches of a regular expression in a sequence. Dereferencing a regex_iterator yields a reference to a match_results instance. To construct a regex_iterator, you pass it the iterators delimiting the input sequence, along with the regular expression. Let's look at an example where the input is a comma-separated list of integers. The regular expression is simple.

    boost::regex reg("(\\d+),?");

Adding the repeat ? (match zero or one time) at the end of the regular expression ensures that the last number is parsed even if the input does not end with a comma. We also use another repeat, +, which matches one or more times. Now, instead of calling regex_search several times, we create a regex_iterator and call the algorithm for_each, passing it a function object that is invoked with each dereferenced iterator. Here is a function object that accepts any kind of match_results, because its function call operator is a template. All it does is add the value of the current match to a running sum (in our regular expression, the first subexpression is the one we want).

    class regex_callback {
      int sum_;
    public:
      regex_callback() : sum_(0) {}

      // Accepts any match_results type; atoi is declared in <cstdlib>
      template <typename T> void operator()(const T& what) {
        sum_+=atoi(what[1].str().c_str());
      }

      int sum() const {
        return sum_;
      }
    };

Now pass an instance of this function object to std::for_each; the function object is then invoked for every dereferenced iterator, that is, once for each match of the regular expression.

    int m
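
The original listing breaks off at this point. What follows is not the author's program but a minimal sketch of how the pieces described above might be put together. The function name sum_of_integers is hypothetical, and std::regex (through the alias rx) is assumed as a stand-in for Boost.Regex; with Boost, include "boost/regex.hpp" and alias boost instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex in this sketch.
// With Boost, include "boost/regex.hpp" and write `namespace rx = boost;`.
namespace rx = std;

// Function object from the text: adds up subexpression 1 of every match.
class regex_callback {
    int sum_;
public:
    regex_callback() : sum_(0) {}

    template <typename T> void operator()(const T& what) {
        sum_ += std::atoi(what[1].str().c_str());
    }

    int sum() const { return sum_; }
};

// Hypothetical helper: sums a comma-separated list of integers.
int sum_of_integers(const std::string& input) {
    static const rx::regex reg("(\\d+),?");
    rx::sregex_iterator it(input.begin(), input.end(), reg);
    rx::sregex_iterator end;   // default-constructed: end of the sequence
    // for_each returns the function object, which carries the final sum.
    return std::for_each(it, end, regex_callback()).sum();
}
```

For the input "1,1,2,3,5,8,13,21" the function returns 54. The sum is read from the object returned by for_each because the algorithm works on a copy of the function object passed to it.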
