Using Boost.Regex


To use Boost.Regex, you need to include the header "boost/regex.hpp". Regex is one of the two libraries covered in this book that must be compiled separately (the other is Boost.Signals). You will be glad to know that once you have built Boost, which takes only a single command at the command prompt, linking is automatic (for Windows compilers), so you do not have to work out which library file to use.

The first thing you need to do is declare a variable of type basic_regex. This is one of the core classes of the library, and it is the one that stores the regular expression. Creating such a variable is easy: just pass a string containing the regular expression you want to use to the constructor.

    boost::regex reg("(A.*)");

This regular expression has three interesting features. The first is that a subexpression is enclosed in parentheses, which makes it possible to refer to it later in the same regular expression or to extract the text that matches it. We will discuss this in detail later, so do not worry if you do not yet see how it is useful. The second feature is the wildcard character, the dot. The wildcard has a special meaning in regular expressions: it matches any character. Finally, the expression uses a repeat, *, called the Kleene star, which means that the preceding expression may match zero or more times. This regular expression can be used with an algorithm as follows:

    bool b = boost::regex_match("This expression could match from A and beyond.", reg);

As you can see, you pass the string to be analyzed and the regular expression to the algorithm regex_match. If the string matches the regular expression, the call returns true; otherwise it returns false. In this example the result is false, because regex_match returns true only when the entire input is matched by the regular expression. Do you see why? Look at the regular expression again. Its first character is an uppercase A, so a match can only begin at an uppercase A in the input. A part of the input, "A and beyond.", therefore matches the expression, but that is not the whole input. Let's try another input string.

    bool b = boost::regex_match("As this string starts with A, does it match?", reg);

This time, regex_match returns true. Once the regular expression engine has matched the A, it goes on to see what should follow. In our regex, the A is followed by the wildcard and a Kleene star, which means that any character may be matched any number of times. The parser therefore consumes the rest of the input string, that is, it matches all of the input.
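The whole-input rule can be checked with a small sketch. To compile without a Boost build it uses std::regex, the C++11 standard-library facility that was modeled on Boost.Regex; the assumption is that boost::regex and boost::regex_match behave the same way for this pattern. The helper name matches_whole_input is made up for the illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the ENTIRE input matches "(A.*)".
// std::regex stands in for boost::regex here (an assumption).
bool matches_whole_input(const std::string& input) {
    static const std::regex reg("(A.*)");
    return std::regex_match(input, reg);
}
```

Called with "This expression could match from A and beyond." the helper should return false, because the text before the A is not covered by the pattern; called with "As this string starts with A, does it match?" it should return true.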

Next, let's take a look at how to use regexes and regex_match for data verification.

Validating Input
Regular expressions are often used to validate the format of input data. Applications usually require that input conform to a certain structure. Consider an application that accepts input that must have the form "3 digits, a word, any character, 2 digits or the string "N/A," a space, then the first word again." Coding such validation by hand is both tedious and error-prone, and the format may well change; before you know it, you may need to support other formats, and your carefully crafted parser has to be changed and debugged again. Let's write a regular expression that can validate this input. First, we need an expression that matches exactly three digits. For digits there is a special shortcut, \d. To have it repeated three times, we need a specific kind of repeat called the bounds operator, which encloses the bounds in curly braces. Putting the two together gives the first part of our regular expression.

    boost::regex reg("\\d{3}");

Note that we need to escape the escape character (\), so the shortcut \d becomes \\d in our string. This is because the compiler consumes the first backslash as an escape character; we must escape the backslash so that a backslash actually appears in our regular expression.
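As a side note to the escaping rule above, C++11 raw string literals offer another way to write the same pattern without doubling the backslash. This is a sketch of a language feature, not something the original text relies on; the function names are hypothetical.

```cpp
#include <string>

// The escaped form used in the text: the compiler turns "\\" into one backslash.
std::string escaped_pattern() { return "\\d{3}"; }

// The same five characters written as a C++11 raw string literal.
std::string raw_pattern() { return R"(\d{3})"; }
```

Both functions return the same string, so either form could be passed to a regex constructor; the raw form is mostly a readability choice for patterns with many backslashes.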

Next, we need a way to define a word, that is, a sequence of characters that ends with a character that is not a letter. There is more than one way to do this. We will use the regular expression features called character classes (also known as character sets) and ranges. A character class is an expression enclosed in square brackets. For example, a character class that matches any one of the characters a, b, and c is [abc]. Using a range to express the same thing, we write [a-c]. To write a character class that contains all letters, we could go slightly crazy and write [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ], but there is no need; we can use ranges instead: [a-zA-Z]. Note that using ranges like this makes the result depend on the locale currently in use if the basic_regex::collate flag is turned on for the regular expression. Using these tools and the repeat +, which means that the preceding expression may be repeated but must occur at least once, we can now describe a word.

    boost::regex reg("[a-zA-Z]+");

The regular expression above works, but because matching a word is such a common task, there is an even simpler way: \w. This shortcut matches any word character (a letter, a digit, or an underscore), not only the ASCII letters, so it is not only shorter but also better suited to international environments. The next character is any character at all, which we already know is expressed with a dot.

    boost::regex reg(".");

Next come two digits or the string "N/A". To match that, we need a feature called alternatives. Alternatives match any one of two or more subexpressions, and the alternatives are separated from each other by |. Like this:

    boost::regex reg("(\\d{2}|N/A)");

Note that the expression is enclosed in parentheses, to make sure that the two whole expressions are treated as the alternatives. Adding a space to the regular expression is simple; there is a shortcut for it: \s. Putting together everything so far gives the following expression:

    boost::regex reg("\\d{3}[a-zA-Z]+.(\\d{2}|N/A)\\s");

Now things get a little trickier. We need a way to verify that the next word in the input matches the first word (the one we captured with the expression [a-zA-Z]+). The key is a back reference, a reference to a previous subexpression. To be able to refer to the expression [a-zA-Z]+, we must first enclose it in parentheses. That makes ([a-zA-Z]+) the first subexpression in our regular expression, so we can create a back reference to it using the index 1.

That gives us the complete regular expression for "3 digits, a word, any character, 2 digits or the string "N/A," a space, then the first word again":

    boost::regex reg("\\d{3}([a-zA-Z]+).(\\d{2}|N/A)\\s\\1");

Good work! Here is a simple program that uses this expression with the regex_match algorithm to validate two input strings.

    #include <iostream>
    #include <cassert>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      // 3 digits, a word, any character, 2 digits or "N/A",
      // a space, then the first word again
      boost::regex reg("\\d{3}([a-zA-Z]+).(\\d{2}|N/A)\\s\\1");

      std::string correct = "123Hello N/A Hello";
      std::string incorrect = "123Hello 12 hello";

      assert(boost::regex_match(correct, reg) == true);
      assert(boost::regex_match(incorrect, reg) == false);
    }

The first string, 123Hello N/A Hello, is correct: 123 is 3 digits, Hello is a word, followed by any character (a space), N/A, another space, and finally the word Hello repeated. The second string is incorrect because the word Hello is not repeated exactly. By default, regular expressions are case-sensitive, so the back reference does not match.

The key to writing regular expressions is successfully decomposing the problem. Looking at the final expression you just created, it can seem quite unintelligible to the untrained eye. However, once the expression is broken down into smaller parts, it is not very complicated.
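One way to make that decomposition visible in code is to assemble the pattern from named fragments. The following is only a sketch: the fragment names and the helpers build_pattern and is_valid are hypothetical, and std::regex (the C++11 facility modeled on Boost.Regex) is used so that it compiles without a Boost build, on the assumption that boost::regex accepts the same pattern.

```cpp
#include <regex>
#include <string>

// Builds the validation pattern piece by piece (names are illustrative).
std::string build_pattern() {
    const std::string three_digits     = "\\d{3}";
    const std::string word             = "([a-zA-Z]+)";   // subexpression 1
    const std::string any_char         = ".";
    const std::string two_digits_or_na = "(\\d{2}|N/A)";  // subexpression 2
    const std::string space            = "\\s";
    const std::string same_word_again  = "\\1";           // back reference to 1
    return three_digits + word + any_char + two_digits_or_na
         + space + same_word_again;
}

// True if the whole input has the required format.
bool is_valid(const std::string& input) {
    static const std::regex reg(build_pattern());
    return std::regex_match(input, reg);
}
```

The assembled string is identical to the one-line pattern, so the behavior does not change; only the readability of the source does.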

Search
Now let's look at another Boost.Regex algorithm, regex_search. Unlike regex_match, regex_search does not require that all of the input match, only part of it. As an example, consider a programmer who suspects that he has forgotten to call delete once or twice in his program. Although he knows that such a simple test may not mean much, he decides to count the occurrences of new and delete and see whether the numbers match. The regular expression is simple; we have two alternatives, new and delete.

    boost::regex reg("(new)|(delete)");

There are two reasons to enclose the subexpressions in parentheses. One is to make it clear that the alternatives are two groups. The other is that we want to refer to these subexpressions when calling regex_search, so that we can determine which alternative was matched. We use an overload of regex_search that also accepts an argument of type match_results. When regex_search performs a match, it reports the subexpressions that matched through an object of type match_results. The class template match_results is parameterized on the type of iterator that the input sequence uses.

    template <class Iterator,
              class Allocator = std::allocator<sub_match<Iterator> > >
    class match_results;

    typedef match_results<const char*> cmatch;
    typedef match_results<const wchar_t*> wcmatch;
    typedef match_results<std::string::const_iterator> smatch;
    typedef match_results<std::wstring::const_iterator> wsmatch;

We are going to use std::string, so make a note of the typedef smatch, which is short for match_results<std::string::const_iterator>. If regex_search returns true, the match_results object passed by reference to the function contains the results of the subexpression matches. Inside match_results, there is an indexed sub_match for each subexpression in the regular expression. Let's see how we can help our confused programmer count the calls to new and delete.

    boost::regex reg("(new)|(delete)");
    boost::smatch m;
    std::string s =
      "Calls to new must be followed by delete. "
      "Calling simply new results in a leak!";

    if (boost::regex_search(s, m, reg)) {
      // Did new match?
      if (m[1].matched)
        std::cout << "The expression (new) matched!\n";
      if (m[2].matched)
        std::cout << "The expression (delete) matched!\n";
    }

The preceding program searches the input string for new or delete and reports which one it finds first. By passing an object of type smatch to regex_search, we gain access to the details of how the algorithm succeeded. Our expression contains two subexpressions, so we can get the subexpression for new through index 1 of match_results. This gives us an instance of sub_match, which has a Boolean member, matched, that tells us whether the subexpression participated in the match. So, for the input above, running the code outputs "The expression (new) matched!". Now you have some more work to do. You need to keep applying the regular expression to the remainder of the input, and to do that you use another overload of regex_search, one that accepts two iterators denoting the character sequence to search. Because std::string is a container, it provides iterators. At each match, you must update the iterator denoting the start of the range so that it refers to the end of the previous match. Finally, add two variables to keep the counts for new and delete. Here is the complete program:

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      // Are there equally many occurrences of
      // "new" and "delete"?
      boost::regex reg("(new)|(delete)");
      boost::smatch m;
      std::string s =
        "Calls to new must be followed by delete. "
        "Calling simply new results in a leak!";
      int new_counter = 0;
      int delete_counter = 0;
      std::string::const_iterator it = s.begin();
      std::string::const_iterator end = s.end();

      while (boost::regex_search(it, end, m, reg)) {
        // New or delete?
        m[1].matched ? ++new_counter : ++delete_counter;
        // Continue searching after the end of this match
        it = m[0].second;
      }

      if (new_counter != delete_counter)
        std::cout << "Leak detected!\n";
      else
        std::cout << "Seems ok...\n";
    }

Note that the program always sets the iterator it to m[0].second. match_results[0] returns a reference to the submatch that matched the whole regular expression, so we can be sure that the end of that match is the right place to start the next run of regex_search. Running this program outputs "Leak detected!", because there are two occurrences of new and only one of delete. Of course, a variable could be deleted in two places, there could be calls to new[] and delete[], and so on.

By now you should have a good understanding of how subexpressions are grouped and used. It is time to move on to the last Boost.Regex algorithm, the one used for substitution.

Replace
The third algorithm in the Regex family is regex_replace. As the name suggests, it performs text substitution. It searches the whole input and finds all matches of the regular expression. For each match, the algorithm calls match_results::format and writes the result to an output iterator that is passed to the function.

In the introduction to this chapter, I gave the example of changing the British spelling colour to the American spelling color. Changing the spelling without regular expressions is tedious and error-prone. The problem is that the capitalization may differ, and many words may be affected, for example colourize. To attack the problem properly, we split the regular expression into three subexpressions.

    boost::regex reg("(Colo)(u)(r)",
      boost::regex::icase | boost::regex::perl);

The letter to be removed, u, is isolated in its own subexpression so that it can easily be dropped from every match. Note also that this regular expression is case-insensitive, which we achieve by passing the format flag boost::regex::icase to the regex constructor. You must also pass any other flags that you want to be in effect. A common mistake when setting flags is to leave out the ones that regex turns on by default; if you do not set them, they are not turned on. You must always apply all of the flags that should be set.

When calling regex_replace, we supply a format string as an argument. The format string determines how the substitution works. In it, you can refer to subexpression matches, which is exactly what we need here. You want to keep the first and third matched subexpressions and let the second one (u) silently disappear. The expression $N, where N is the index of a subexpression, expands to the match for that subexpression. So the format string is "$1$3", which means that the replacement text is the first subexpression followed by the third. By referring to the subexpression matches, we retain whatever capitalization was in the matched text, which would not be possible if we used a string literal as the replacement text. Here is a complete program that solves the problem.

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg("(Colo)(u)(r)",
        boost::regex::icase | boost::regex::perl);

      std::string s = "colour, colours, color, colourize";

      s = boost::regex_replace(s, reg, "$1$3");
      std::cout << s;
    }

The output of the program is "color, colors, color, colorize". regex_replace is very handy for text substitutions like this one.
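To see that the "$1$3" format string really preserves the capitalization of each match, here is a minimal sketch with mixed-case input. It uses std::regex and std::regex_replace from the C++11 standard library (modeled on Boost.Regex) so that it compiles without a Boost build; the assumption is that boost::regex_replace gives the same result for this pattern. The function name americanize is hypothetical.

```cpp
#include <regex>
#include <string>

// Drops the "u" from every case-insensitive occurrence of "colour",
// keeping the original capitalization of the surrounding letters.
std::string americanize(const std::string& text) {
    static const std::regex reg("(colo)(u)(r)",
                                std::regex::icase | std::regex::ECMAScript);
    // $1 and $3 expand to the first and third matched subexpressions.
    return std::regex_replace(text, reg, "$1$3");
}
```

For the input "Colour, colours, color, COLOURIZE" this sketch should return "Color, colors, color, COLORIZE": the word that is already spelled color is left alone, and each replaced word keeps its own case.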

A Common User Misunderstanding
The most common question I have seen about Boost.Regex concerns the semantics of regex_match. It is easy to forget that all of the input to regex_match must match the given regular expression. As a result, users often think that code like the following should yield true.

    boost::regex reg("\\d*");
    bool b = boost::regex_match("17 is prime", reg);

Rest assured that this call never results in a successful match. All of the input must be consumed for regex_match to return true! Almost all of the users who ask why this does not work should be using regex_search rather than regex_match.

    boost::regex reg("\\d*");
    bool b = boost::regex_search("17 is prime", reg);

This time the result is definitely true. It is worth noting that you can make regex_search behave like regex_match by using special buffer operators. \A matches the start of a buffer, and \Z matches the end of a buffer, so if you put \A first in your regular expression and \Z last, regex_search behaves like regex_match, that is, all of the input must be consumed for a match to succeed. The following regular expression requires that all of the input be consumed, regardless of whether you use regex_match or regex_search.

    boost::regex reg("\\A\\d*\\Z");

Bear in mind that this does not make regex_match unnecessary. On the contrary, using regex_match is a clear indication that the semantics we just discussed are in effect: all of the input must be consumed.
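The difference between the three calls can be summarized in a short sketch. It uses std::regex (the C++11 facility modeled on Boost.Regex) so that it compiles without a Boost build. One substitution is made and should be noted: the default ECMAScript grammar of std::regex has no \A and \Z, so the anchored variant uses ^ and $ instead, which plays the same role for single-line input. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// regex_match: the whole input must consist of digits.
bool digits_match(const std::string& input) {
    static const std::regex reg("\\d*");
    return std::regex_match(input, reg);
}

// regex_search: any part of the input may match (even an empty part).
bool digits_search(const std::string& input) {
    static const std::regex reg("\\d*");
    return std::regex_search(input, reg);
}

// regex_search with anchors: behaves like regex_match again.
// ^ and $ stand in for the \A and \Z buffer operators of Boost.Regex.
bool anchored_digits_search(const std::string& input) {
    static const std::regex reg("^\\d*$");
    return std::regex_search(input, reg);
}
```

For "17 is prime", digits_match and anchored_digits_search should both return false while digits_search returns true; for "17", all three should return true.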

About Repeats and Greed
Another common source of confusion is the greediness of repeats. Some repeats, such as + and *, are greedy. This means that they consume as much of the input as they possibly can. It is not uncommon to see regular expressions like the following, written with the intent of capturing two digits after a greedy repeat has been applied.

    boost::regex reg("(.*)(\\d{2})");

This regular expression is valid, but it may not match the subexpression you want! The expression .* happily swallows everything that the following subexpressions do not need in order to match. Here is an example that exhibits this behavior:

    #include <iostream>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg("(.*)(\\d{2})");
      boost::cmatch m;
      const char* text = "Note that I'm 31 years old, not 32.";
      if (boost::regex_search(text, m, reg)) {
        if (m[1].matched)
          std::cout << "(.*) matched: " << m[1].str() << '\n';
        if (m[2].matched)
          std::cout << "(\\d{2}) matched: " << m[2].str() << '\n';
      }
    }

This program uses another specialization of match_results, the type cmatch. It is a typedef for match_results<const char*>, and we must use it rather than the smatch type used earlier because we are now calling regex_search with a character array rather than a std::string. What do you expect the output of this program to be? Typically, a user who is new to regular expressions first assumes that both m[1].matched and m[2].matched are true, and that the second subexpression matches "31". Then, after realizing the effect of greedy repeats, namely that they consume as much input as possible, the user thinks that only the first subexpression is true, that is, that .* has swallowed all of the input. Finally, the new user reaches the conclusion that both subexpressions match, but that the second one matches the last possible sequence: the first subexpression matches "Note that I'm 31 years old, not " and the second matches "32".

So what do you do when you want to use a repeat but have another subexpression match at its first occurrence? Use a non-greedy repeat. Appending ? to a repeat makes it non-greedy. This means that the expression tries to find the shortest possible match that does not prevent the rest of the expression from matching. To make the previous regular expression work as intended, we change it like this.

    boost::regex reg("(.*?)(\\d{2})");

If we change the program to use this regular expression, both m[1].matched and m[2].matched are true. The expression .*? consumes as little of the input as it can, which means that it stops at the first character 3, because that is what the expression needs in order to match successfully. The first subexpression therefore matches "Note that I'm " and the second matches "31".
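The greedy and non-greedy results can be compared side by side with a small helper. This is a sketch under assumptions: it uses std::regex (the C++11 facility modeled on Boost.Regex) so that it compiles without a Boost build, and the helper name captures is made up for the illustration.

```cpp
#include <regex>
#include <string>
#include <utility>

// Searches text with the given two-group pattern and returns the text
// captured by subexpressions 1 and 2 (two empty strings if no match).
std::pair<std::string, std::string> captures(const std::string& pattern,
                                             const std::string& text) {
    const std::regex reg(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, reg))
        return std::make_pair(std::string(), std::string());
    // Copy the submatches out before m goes out of scope.
    return std::make_pair(m[1].str(), m[2].str());
}
```

With the text "Note that I'm 31 years old, not 32.", the greedy pattern "(.*)(\\d{2})" should capture "32" as the second subexpression, while the non-greedy pattern "(.*?)(\\d{2})" should capture "31".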

A Look at regex_iterator
We have seen how to call regex_search repeatedly to process all of the input, but there is another, more elegant way: using a regex_iterator. This iterator type enumerates all of the regular expression matches in a sequence. Dereferencing a regex_iterator yields a reference to an instance of match_results. When constructing a regex_iterator, you pass it the iterators denoting the input sequence and the regular expression to apply. Let's look at an example where the input data is a comma-separated list of integers. The regular expression is simple.

    boost::regex reg("(\\d+),?");

Adding the ? repeat (match zero or one times) at the end of the expression ensures that the last number is parsed successfully even if the input sequence does not end with a comma. We also use another repeat, +, which means match one or more times. Now, instead of calling regex_search several times, we create a regex_iterator, call the algorithm for_each, and give it a function object, which is called with the result of dereferencing the iterator. Here is a function object that accepts any form of match_results, thanks to its templated function call operator. All it does is add the value of the current match to a total (in our regular expression, the first subexpression is the one we use).

    class regex_callback {
      int sum_;
    public:
      regex_callback() : sum_(0) {}

      // Called once per match; what[1] is the captured number
      template <typename T> void operator()(const T& what) {
        sum_ += atoi(what[1].str().c_str());
      }

      int sum() const {
        return sum_;
      }
    };

Now pass an instance of this function object to std::for_each. The function object is invoked for every dereference of the iterator it, that is, once for each match of the regular expression.

    int main() {
      boost::regex reg("(\\d+),?");
      // Sample input: comma-separated integers
      std::string s = "1,1,2,3,5,8,13,21";

      boost::sregex_iterator it(s.begin(), s.end(), reg);
      boost::sregex_iterator end;

      regex_callback c;
      int sum = std::for_each(it, end, c).sum();
    }

As you can see, the past-the-end iterator passed to for_each is simply a default-constructed instance of regex_iterator. The type of it and end is boost::sregex_iterator, which is a typedef for regex_iterator<std::string::const_iterator>. Using regex_iterator this way is a much cleaner way of matching multiple times than what we tried earlier, where we had to advance the starting iterator by hand and call regex_search in a loop.
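The same summation can also be written as a plain loop over the iterator, without a function object. This is a sketch, not the chapter's own formulation: it uses std::sregex_iterator from the C++11 standard library (modeled on Boost.Regex) so that it compiles without a Boost build, and the function name sum_of_integers is hypothetical.

```cpp
#include <regex>
#include <string>

// Adds up all integers in a comma-separated list such as "1,2,3".
int sum_of_integers(const std::string& csv) {
    static const std::regex reg("(\\d+),?");
    int sum = 0;
    // A default-constructed iterator serves as the past-the-end marker.
    for (std::sregex_iterator it(csv.begin(), csv.end(), reg), end;
         it != end; ++it) {
        sum += std::stoi((*it)[1].str());  // subexpression 1 is the number
    }
    return sum;
}
```

For the input "1,1,2,3,5,8,13,21" the function should return 54, and for an empty string it should return 0 because the iterator starts out equal to the end iterator.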

Splitting Strings with regex_token_iterator
Another iterator type, or more accurately an iterator adaptor, is boost::regex_token_iterator. It is similar to regex_iterator, but it can also be used to enumerate each character sequence that does not match the regular expression, which is useful for splitting strings. It can also be used to select which subexpressions are of interest, so that dereferencing the regex_token_iterator returns only the subexpressions that were "subscribed" to. Consider an application that receives input in which the data items are separated by slashes. The data between the slashes makes up the items that the application must process. With regex_token_iterator, splitting such a string is easy. The regular expression is simple.

    boost::regex reg("/");

This regex matches the separator between the items. To use it for splitting the input, simply pass the special index -1 to the constructor of regex_token_iterator. Here is the complete program:

    #include <algorithm>
    #include <cassert>
    #include <string>
    #include <vector>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg("/");
      std::string s = "Split/values/separated/by/slashes,";
      std::vector<std::string> vec;
      // -1 selects the text between the matches
      boost::sregex_token_iterator it(s.begin(), s.end(), reg, -1);
      boost::sregex_token_iterator end;
      while (it != end)
        vec.push_back(*it++);

      assert(vec.size() == std::count(s.begin(), s.end(), '/') + 1);
      assert(vec[0] == "Split");
    }

Like regex_iterator, regex_token_iterator is a class template parameterized on the iterator type of the sequence it wraps. Here, we use sregex_token_iterator, which is a typedef for regex_token_iterator<std::string::const_iterator>. Each time the iterator is dereferenced, it returns the current sub_match, and when the iterator is advanced, it tries to match the regular expression again. These two iterator types, regex_iterator and regex_token_iterator, are very useful; consider them whenever you find yourself calling regex_search repeatedly.
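The splitting technique can be wrapped in a reusable function. The sketch below uses std::sregex_token_iterator from the C++11 standard library (modeled on Boost.Regex) so that it compiles without a Boost build; the assumption is that boost::sregex_token_iterator treats the index -1 the same way. The function name split_on_slash is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the pieces of input that lie between the '/' separators.
std::vector<std::string> split_on_slash(const std::string& input) {
    static const std::regex separator("/");
    std::vector<std::string> items;
    // The index -1 asks for the text that does NOT match the separator.
    std::sregex_token_iterator it(input.begin(), input.end(), separator, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it)
        items.push_back(it->str());
    return items;
}
```

For "Split/values/separated/by/slashes," the function should return five items, the first being "Split" and the last being "slashes," with the trailing comma intact, since only the slash acts as a separator.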

More Regular Expressions
You have already seen a fair amount of regular expression syntax, but there is more to know. This section briefly demonstrates some other regular expression features that are useful every day. To begin, let's look at the complete set of repeats. We have already seen *, +, and bounded repeats using {}. There is one more repeat, ?. You may have noticed that it is also used to declare non-greedy repeats, but on its own it means that the expression must occur zero or one times. It is also worth mentioning that the bounds operator is very flexible. Here are three different ways of using it:

    boost::regex reg1("\\d{5}");
    boost::regex reg2("\\d{2,4}");
    boost::regex reg3("\\d{2,}");

The first regular expression matches exactly five digits. The second matches two, three, or four digits. The third matches two or more digits, with no upper limit.
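The three bounds can be exercised with a one-line helper. This sketch uses std::regex (the C++11 facility modeled on Boost.Regex) so that it compiles without a Boost build; the helper name matches_fully is hypothetical.

```cpp
#include <regex>
#include <string>

// True if the whole input matches the given pattern.
bool matches_fully(const std::string& pattern, const std::string& input) {
    return std::regex_match(input, std::regex(pattern));
}
```

For example, matches_fully("\\d{2,4}", "123") should be true, while the same pattern should reject both "1" (too few digits) and "12345" (too many); "\\d{2,}" should accept "1234567" because it has no upper limit.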

Another important regular expression feature is the negated character class, written with the metacharacter ^. It forms a character class that matches any character that is not part of the listed set, that is, the complement of the elements you list. For example, consider this regular expression.

    boost::regex reg("[^13579]");

It contains a negated character class that matches any character that is not an odd digit. Look at the following short program and try to work out what it prints.

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg4("[^13579]");
      std::string s = "0123456789";
      boost::sregex_iterator it(s.begin(), s.end(), reg4);
      boost::sregex_iterator end;

      while (it != end)
        std::cout << *it++;
    }

Did you figure it out? The output is "02468", that is, all of the even digits. Note that this character class does not match only even digits; had the input string been "AlfaBetaGamma", all of its characters would have matched too.
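Both observations about the negated class can be verified with a helper that concatenates every match. This is a sketch using std::regex (the C++11 facility modeled on Boost.Regex) so that it compiles without a Boost build; the helper name collect_matches is hypothetical.

```cpp
#include <regex>
#include <string>

// Concatenates the text of every match of pattern found in input.
std::string collect_matches(const std::string& pattern,
                            const std::string& input) {
    const std::regex reg(pattern);
    std::string out;
    for (std::sregex_iterator it(input.begin(), input.end(), reg), end;
         it != end; ++it) {
        out += it->str();
    }
    return out;
}
```

collect_matches("[^13579]", "0123456789") should return "02468", and collect_matches("[^13579]", "AlfaBetaGamma") should return the input unchanged, because none of its characters is an odd digit.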

The metacharacter we just saw, ^, has another meaning too: outside a character class, it denotes the beginning of a line. The metacharacter $ denotes the end of a line.

Bad Regular Expressions
A bad regular expression is one that does not conform to the rules. For example, you may have forgotten a closing parenthesis, in which case the regular expression engine cannot compile the regular expression. When that happens, an exception of type bad_expression is thrown. As I mentioned earlier, the name of this exception will change in the next version of Boost.Regex and in the version going into the Library Technical Report: the exception type bad_expression will be renamed regex_error.

If all of the regular expressions in your application are hard-coded, you may not need to deal with bad expressions, but if you accept user input in the form of regular expressions, you must be prepared to handle errors. Here is a program that prompts the user for a regular expression and then for a string to be matched against it. As always when user input is involved, there is a chance that the input is invalid.

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      std::cout << "Enter a regular expression:\n";
      std::string s;
      std::getline(std::cin, s);
      try {
        boost::regex reg(s);
        std::cout << "Enter a string to be matched:\n";
        std::getline(std::cin, s);
        if (boost::regex_match(s, reg))
          std::cout << "That's right!\n";
        else
          std::cout << "No, sorry, that doesn't match.\n";
      }
      catch (const boost::bad_expression& e) {
        // Construction of reg failed: report why and give up
        std::cout << "That's not a valid regular expression! (Error: "
                  << e.what() << ")\nExiting...\n";
      }
    }

    Enter a regular expression:
    \d{5}
    Enter a string to be matched:
    12345
    That's right!

Now for some bad data: try entering an incorrect regular expression.

    Enter a regular expression:
    (\w*))
    That's not a valid regular expression! (Error: Unmatched ( or \()
    Exiting...

An exception is thrown while the regex reg is being constructed, because the regular expression cannot be compiled. The catch handler therefore runs, and the program prints an error message and exits. There are only three places where you need to be aware of possible exceptions. One is when constructing a regular expression, as you just saw; another is when assigning a regular expression to a regex with the member function assign. Finally, the regex iterators and algorithms can also throw exceptions, if memory is exhausted or if the complexity of the match grows too quickly.
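The construct-and-catch pattern can be isolated in a small function that reports whether a user-supplied pattern compiles. This is a sketch using std::regex and std::regex_error from the C++11 standard library (modeled on Boost.Regex), so that it compiles without a Boost build; with Boost, the exception type to catch would be the one named in the text above. The function name compile_error is hypothetical, and the wording of the returned message is implementation-defined.

```cpp
#include <regex>
#include <string>

// Returns an empty string if pattern compiles, otherwise the
// library's error message for the failed construction.
std::string compile_error(const std::string& pattern) {
    try {
        std::regex reg(pattern);  // throws if the pattern is malformed
        return std::string();
    } catch (const std::regex_error& e) {
        return e.what();
    }
}
```

compile_error("\\d{5}") should return an empty string, whereas compile_error("(\\w*"), which lacks a closing parenthesis, should return a non-empty message.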

Regex Summary
Regular expressions are without doubt extremely useful and important, and this library brings powerful regular expression capabilities to C++. Traditionally, users have had few choices other than the POSIX C API for regular expressions. For validating and processing text, regular expressions are far more flexible and reliable than handwritten parsing code. For searching and replacing, they elegantly solve many related problems.

Boost.Regex is a powerful library, and it is not possible to cover all of it in this chapter. Similarly, the great expressiveness and wide range of uses of regular expressions mean that this chapter can do little more than introduce them. The topic could fill a separate book. To learn more, study the online documentation for Boost.Regex and pick up a book on regular expressions (consider the suggestions in the bibliography). No matter how powerful Boost.Regex is, and how broad and deep regular expressions are, complete beginners can still use the regular expressions in this library effectively. To programmers who chose another language because C++ lacked regular expression support: welcome home.

Boost.Regex is not the only regular expression library available to C++ programmers, but it is certainly one of the best. It is easy to use and lightning fast when matching your regular expressions. You should use it whenever you can.

Boost.Regex was written by Dr. John Maddock.

From: http://blog.csdn.net/alai04/archive/2006/01/25/588400.aspx

This article comes from the CSDN blog. If you reproduce it, please indicate the source: http://blog.csdn.net/ck4918/archive/2007/09/12/1781926.aspx

 
