Examples and source code analysis of regular expressions in C++11
A regular expression (regex) is a powerful tool for describing character sequences. Many languages support regular expressions, and C++11 adds them to the standard library. It supports six regular expression grammars: ECMAScript, basic, extended, awk, grep, and egrep. ECMAScript is the default, and you can choose a different grammar when constructing a regular expression.
Note: ECMAScript is a scripting language standardized as ECMA-262 by Ecma International, formerly the European Computer Manufacturers Association. It is often called JavaScript, but JavaScript is actually an implementation and extension of the ECMA-262 standard.
Next, using the source of this blog page (http://www.cnblogs.com/ittinybird/p/4853532.html) as an example, we will demonstrate from scratch how to use regular expressions in C++ to extract all the http links from a web page's source. I have been wanting to use the new C++11 features lately; if I find the time, I will rewrite my earlier C++ crawler with them and share it.
Make sure your compiler supports Regex
If your compiler is older than GCC 4.9.0 or VS2013, upgrade it before using <regex>. The compiler I used previously was GCC 4.8.3, which does ship a <regex> header. However, GCC's implementation at that point was incomplete: the compiler fully supported the syntax, but the library had not caught up. The code therefore compiles without problems but throws an exception as soon as it runs. A perfect pitfall! The error looks like this:
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Aborted (core dumped)
If you run into this problem, do not start by doubting your own code; the library is at fault. I spent half a day tracking this down. So before using C++ regular expressions, upgrade your compiler and make sure it really supports them.
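One way to check whether the installed standard library is affected is to compile and run a tiny probe before writing anything larger. The following is only a sketch: regex_works is a hypothetical helper name and the probe pattern is an arbitrary choice, on the assumption that an incomplete <regex> implementation throws std::regex_error while compiling a bracket expression.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if <regex> can compile and run a small
// pattern containing a bracket expression, false if the library throws.
bool regex_works() {
    try {
        const std::regex r("[a-z]+\\d");   // assumed to throw on an incomplete library
        const std::string sample = "abc1";
        return std::regex_search(sample, r);
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Calling regex_works() at the start of main and printing the result is enough to tell a library problem apart from a mistake in your own pattern.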
Regex library overview
The header <regex> contains several components that we need when working with regular expressions:
basic_regex | Class template for regular expression objects, with typedef basic_regex<char> regex and typedef basic_regex<wchar_t> wregex
regex_match | Matches an entire character sequence against a regular expression
regex_search | Searches a character sequence for a subsequence that matches the regular expression; the search stops after the first match is found
regex_replace | Replaces the subsequences that match the regular expression with formatted replacement text
regex_iterator | Iterator that walks through all matching subsequences
match_results | Container class that holds the results of a regular expression match
sub_match | Container class that holds the character sequence matched by a sub-expression
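To make the difference between the three algorithms concrete, here is a small sketch. The helper names (is_number, contains_number, mask_numbers) and the pattern \d+ are illustrative assumptions, not part of the library.

```cpp
#include <regex>
#include <string>

// regex_match: the whole sequence must match the pattern.
bool is_number(const std::string& s) {
    static const std::regex digits("\\d+");
    return std::regex_match(s, digits);
}

// regex_search: it is enough that some subsequence matches.
bool contains_number(const std::string& s) {
    static const std::regex digits("\\d+");
    return std::regex_search(s, digits);
}

// regex_replace: every matching subsequence is replaced by the format text.
std::string mask_numbers(const std::string& s) {
    static const std::regex digits("\\d+");
    return std::regex_replace(s, digits, "#");
}
```

For the input "c++11", is_number returns false because of the leading "c++", while contains_number returns true because "11" matches.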
ECMAScript regular expression syntax
Regular expression syntax is much the same everywhere, so I will not spend time on it here. For the ECMAScript regular expression syntax, see W3CSCHOOL.
Construct a regular expression
To construct a regular expression, use the class template basic_regex, the general-purpose template for regular expression objects. The library provides typedefs for its char and wchar_t instantiations:
typedef basic_regex<char>    regex;
typedef basic_regex<wchar_t> wregex;
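The two typedefs are used in the same way; only the character type differs. A minimal sketch, in which has_digits is a hypothetical helper overloaded for narrow and wide strings:

```cpp
#include <regex>
#include <string>

// Narrow-character version: std::regex works with std::string.
bool has_digits(const std::string& s) {
    static const std::regex r("\\d+");
    return std::regex_search(s, r);
}

// Wide-character version: std::wregex works with std::wstring.
bool has_digits(const std::wstring& s) {
    static const std::wregex r(L"\\d+");
    return std::regex_search(s, r);
}
```

Mixing the two, for example searching a std::wstring with a std::regex, does not compile, because the character types must agree.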
There are many constructors, but they are very simple:
// Default constructor; the resulting regex does not match any character sequence
basic_regex();
// Construct from the null-terminated string s
explicit basic_regex(const CharT* s, flag_type f = std::regex_constants::ECMAScript);
// Same as above, but only the first count characters of s are used
basic_regex(const CharT* s, std::size_t count, flag_type f = std::regex_constants::ECMAScript);
// Copy constructor
basic_regex(const basic_regex& other);
// Move constructor
basic_regex(basic_regex&& other);
// Construct from str, a basic_string
template <class ST, class SA>
explicit basic_regex(const std::basic_string<CharT, ST, SA>& str, flag_type f = std::regex_constants::ECMAScript);
// Construct from the characters in the range [first, last)
template <class ForwardIt>
basic_regex(ForwardIt first, ForwardIt last, flag_type f = std::regex_constants::ECMAScript);
// Construct from an initializer_list
basic_regex(std::initializer_list<CharT> init, flag_type f = std::regex_constants::ECMAScript);
Apart from the default, copy, and move constructors, every constructor takes a flag_type parameter that selects the regular expression grammar: ECMAScript, basic, extended, awk, grep, or egrep. Several other flags can be combined with it to change the matching rules and behavior:
flag_type | Effect
icase | Ignore case when matching
nosubs | Do not store sub-expression matches
optimize | Favor matching speed over construction speed
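Flags are passed as the last constructor argument and can be combined with the | operator. A short sketch, where matches_ignoring_case and matches_extended are hypothetical helper names:

```cpp
#include <regex>
#include <string>

// ECMAScript grammar combined with icase: letter case is ignored.
bool matches_ignoring_case(const std::string& text, const std::string& pattern) {
    const std::regex r(pattern,
                       std::regex_constants::ECMAScript | std::regex_constants::icase);
    return std::regex_match(text, r);
}

// The same call shape selects a different grammar, here POSIX extended.
bool matches_extended(const std::string& text, const std::string& pattern) {
    const std::regex r(pattern, std::regex_constants::extended);
    return std::regex_match(text, r);
}
```

Without icase, the pattern "http" does not match the text "HTTP"; with it, the match succeeds.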
With the constructor, we can create a regular expression for extracting http links first:
// The pattern is simple; check it against the syntax reference if anything is unclear.
// The '-' comes last in the bracket expression so that it is taken literally.
std::string pattern("http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w ./?%&=-]*)?");
std::regex r(pattern);
Note that the '\' character must itself be escaped in a C++ string literal, so every '\' in the ECMAScript regular expression syntax has to be written as "\\". In my tests, leaving it unescaped produced a warning from gcc, while the program built with VS2013 crashed at run time.
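C++11 raw string literals offer another way to write backslashes: inside R"(...)" a backslash is not an escape character, so it is written once. The sketch below writes the host-name part of the pattern both ways; the variable names and the helper is_host_name are assumptions for illustration.

```cpp
#include <regex>
#include <string>

// The same pattern twice: with escaped backslashes and as a raw string literal.
const std::string escaped_pattern = "([\\w-]+\\.)+[\\w-]+";
const std::string raw_pattern     = R"(([\w-]+\.)+[\w-]+)";

// Hypothetical helper: true if `s` consists of dot-separated labels, e.g. "www.cnblogs.com".
bool is_host_name(const std::string& s) {
    static const std::regex r(raw_pattern);
    return std::regex_match(s, r);
}
```

Both strings hold exactly the same characters, so either one can be passed to the std::regex constructor.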
Correct Input Processing
First, a short digression. Suppose the page is not downloaded by the program itself through a network library; instead, we download it by hand and save it to a file. We then need to read the page content (the HTML source) into a std::string. We might be tempted to do it like this, which is wrong:
#include <iostream>
#include <string>

int main()
{
    std::string tmp, html;
    while (std::cin >> tmp)   // operator>> skips all whitespace
        html += tmp;
}
This way, every whitespace character in the source is discarded, which is clearly not what we want. Instead, we use getline():
#include <iostream>
#include <string>

int main()
{
    std::string tmp, html;
    while (std::getline(std::cin, tmp))
    {
        html += tmp;
        html += '\n';   // getline() removes the newline, so put it back
    }
}
This way the original text is read in correctly. I personally think small details like this deserve attention: when debugging fails, we tend to suspect the regular expression first rather than the input handling.
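The difference between the two loops is easy to see when they read from a string stream instead of std::cin. In this sketch, read_words and read_lines are hypothetical names for the two approaches:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Word-by-word extraction: operator>> skips spaces and newlines alike.
std::string read_words(std::istream& in) {
    std::string tmp, out;
    while (in >> tmp)
        out += tmp;
    return out;
}

// Line-by-line extraction: spaces are kept and the newline is restored.
std::string read_lines(std::istream& in) {
    std::string tmp, out;
    while (std::getline(in, tmp)) {
        out += tmp;
        out += '\n';
    }
    return out;
}
```

Fed the two lines `<a href="x">` and `</a>`, read_words glues `<a` and `href` together, which would change what a regular expression can match.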
regex_search() only finds the first matching subsequence
Going by the function's name, we might mistakenly pick regex_search() for this job. It has six overloads, all used in much the same way. The function returns a bool: true if a match was found, false otherwise. To save space, we look at only one of them:
template <class STraits, class SAlloc, class Alloc, class CharT, class Traits>
bool regex_search(const std::basic_string<CharT, STraits, SAlloc>& s,
                  std::match_results<typename std::basic_string<CharT, STraits, SAlloc>::const_iterator, Alloc>& m,
                  const std::basic_regex<CharT, Traits>& e,
                  std::regex_constants::match_flag_type flags = std::regex_constants::match_default);
The first parameter, s, is a std::basic_string holding the character sequence to search. The parameter m is a match_results container that receives the results, and e is the regular expression object constructed earlier. The flags parameter deserves a mention: its type is std::regex_constants::match_flag_type, and it controls how matching is performed. Just as we can pass options when constructing a regular expression object, we can pass further flags to control the matching rules during the match itself. The meanings of these flags are quoted from cppreference.com; look them up as needed:
Constant | Explanation
match_not_bol | The first character in [first, last) will be treated as if it is not at the beginning of a line (i.e. ^ will not match [first, first))
match_not_eol | The last character in [first, last) will be treated as if it is not at the end of a line (i.e. $ will not match [last, last))
match_not_bow | "\b" will not match [first, first)
match_not_eow | "\b" will not match [last, last)
match_any | If more than one match is possible, then any match is an acceptable result
match_not_null | Do not match empty sequences
match_continuous | Only match a sub-sequence that begins at first
match_prev_avail | --first is a valid iterator position. When set, causes match_not_bol and match_not_bow to be ignored
format_default | Use ECMAScript rules to construct strings in std::regex_replace
format_sed | Use POSIX sed utility rules in std::regex_replace
format_no_copy | Do not copy un-matched strings to the output in std::regex_replace
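Two of these flags are easy to try out. The sketch below uses match_not_bol with regex_search and format_no_copy with regex_replace; the helper names and patterns are assumptions chosen for illustration.

```cpp
#include <regex>
#include <string>

// "^http" normally matches at the start of the sequence.
// With match_not_bol the start no longer counts as a line beginning.
bool starts_with_http(const std::string& s, bool start_is_line_start) {
    static const std::regex r("^http");
    const std::regex_constants::match_flag_type flags =
        start_is_line_start ? std::regex_constants::match_default
                            : std::regex_constants::match_not_bol;
    return std::regex_search(s, r, flags);
}

// format_no_copy: only the formatted matches reach the output,
// the text between them is dropped. "$&" stands for the whole match.
std::string digits_only(const std::string& s) {
    static const std::regex r("\\d+");
    return std::regex_replace(s, r, "$&", std::regex_constants::format_no_copy);
}
```

The match flags are passed as the last argument in both calls, in the same position as the flags parameter of the prototype shown earlier.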
According to the parameter type, we constructed such a call:
std::smatch results;
std::regex_search(html, results, r);
However, the standard library specifies that regex_search() stops searching once the first matching subsequence has been found! In this program, results brings back only the first http link that satisfies the pattern, which clearly does not meet our need to extract every http link on the page.
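This behavior is easy to confirm with a string that contains two links. The following is a sketch only: first_link is a hypothetical helper, and it uses a deliberately simplified link pattern rather than the full one from this article.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first link found in `html`,
// or an empty string if there is none.
std::string first_link(const std::string& html) {
    static const std::regex r("https?://[\\w./-]+");  // simplified pattern (assumption)
    std::smatch results;
    if (std::regex_search(html, results, r))
        return results.str();   // results[0] is the whole first match
    return std::string();
}
```

However many links the input contains, a single call to regex_search() reports only the first of them.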
Use regex_iterator to match all substrings
Strictly speaking, regex_iterator is an iterator adapter that binds the character sequence to be searched to a regex object. Its default constructor is special: it directly constructs an end-of-sequence (past-the-end) iterator. The other constructor has this prototype:
regex_iterator(BidirIt a, BidirIt b,   // begin and end iterators of the character sequence to search
               const regex_type& re,   // the regex object
               std::regex_constants::match_flag_type m = std::regex_constants::match_default); // flags, as for regex_search() above
Like regex_search() above, the regex_iterator constructor takes a parameter of type std::regex_constants::match_flag_type, used in the same way. In fact, regex_iterator is implemented internally by calling regex_search(), and this parameter is simply passed on to it.
[Figure (animated GIF): the darker segments mark the subsequences that can be matched.]
When a regex_iterator is constructed, the constructor first calls regex_search() so that the iterator it points at the first matching subsequence. Each subsequent increment (++it) calls regex_search() again on the remaining subsequence, until the iterator reaches the end. The iterator always points at a matched subsequence.
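The principle can be imitated by hand with a loop around regex_search(). The sketch below is a simplification, not the library's implementation: all_matches is a hypothetical helper, and unlike the real iterator it does not pass match_prev_avail on later searches, so anchors such as ^ may behave differently.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every match of `r` in `text` by calling
// regex_search() repeatedly on the part of the sequence not yet consumed.
std::vector<std::string> all_matches(const std::string& text, const std::regex& r) {
    std::vector<std::string> out;
    std::string::const_iterator first = text.begin();
    const std::string::const_iterator last = text.end();
    std::smatch m;
    while (std::regex_search(first, last, m, r)) {
        out.push_back(m.str());
        first = m[0].second;          // resume right after this match
        if (m.length() == 0) {        // empty match: step forward to avoid looping forever
            if (first == last)
                break;
            ++first;
        }
    }
    return out;
}
```

In real code std::sregex_iterator is the better choice, since it takes care of these corner cases itself.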
Once we understand the principle, writing the code is much easier. Combining the previous sections, the program is essentially this:
#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string tmp, html;
    while (std::getline(std::cin, tmp))
    {
        tmp += '\n';
        html += tmp;
    }
    // The '-' comes last in the bracket expression so that it is taken literally.
    std::string pattern("http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w ./?%&=-]*)?");
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
    std::regex r(pattern);
    // end is the past-the-end iterator; sregex_iterator is the std::string version of regex_iterator
    for (std::sregex_iterator it(html.begin(), html.end(), r), end;
         it != end;
         ++it)
    {
        std::cout << it->str() << std::endl;
    }
}
The HTML source code of the downloaded page is saved as test.html. Compile the source code to test the code:
[regex] g++ regex.cpp -std=c++11 -omain
[regex] main < test.html
http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/ittinybird/rsd.xml
http://www.cnblogs.com/ittinybird/wlwmanifest.xml
http://common.cnblogs.com/script/jquery.js
http://files.cnblogs.com/files/ittinybird/mystyle.css
http://www.cnblogs.com/ittinybird/
http://www.cnblogs.com/ittinybird/
http://www.cnblogs.com/ittinybird/
http://i.cnblogs.com/EditPosts.aspx?opt=1
http://msg.cnblogs.com/send/%E6%88%91%E6%98%AF%E4%B8%80%E5%8F%AAC%2B%2B%E5%B0%8F%E5%B0%8F%E9%B8%9F
http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/images/xml.gif
http://i.cnblogs.com/
http://www.cnblogs.com/ittinybird/p/4853532.html
http://www.cnblogs.com/ittinybird/p/4853532.html
http://www.w3school.com.cn/jsref/jsref_obj_regexp.asp
http://www.cnblogs.com/ittinybird/
http://i.cnblogs.com/EditPosts.aspx?postid=4853532
http://www.cnblogs.com/
http://q.cnblogs.com/
http://news.cnblogs.com/
http://home.cnblogs.com/ing/
http://job.cnblogs.com/
http://kb.cnblogs.com/
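Several links appear more than once in this output, because the page refers to them repeatedly. If only the distinct links are wanted, one possible variation is to collect the matches in a std::set. This is a sketch under assumptions: unique_links is a hypothetical helper, and its link pattern follows the one in this article, with the '-' written last in the bracket expression.

```cpp
#include <regex>
#include <set>
#include <string>

// Hypothetical helper: returns each distinct link in `html` exactly once.
std::set<std::string> unique_links(const std::string& html) {
    static const std::regex r("http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w ./?%&=-]*)?");
    std::set<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), r), end; it != end; ++it)
        links.insert(it->str());   // a set silently ignores duplicates
    return links;
}
```

A std::set also returns the links in sorted order; if the order of first appearance matters, a std::vector combined with a lookup set would be needed instead.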
Regex and Exception Handling
If the regular expression itself is malformed, the standard library throws a regex_error exception at run time. It has a member function, code(), that reports the kind of error. The error values and their meanings are listed in the following table:
Code | Description
error_collate | Invalid collating element
error_ctype | Invalid character class
error_escape | Invalid escape character or trailing escape
error_backref | Invalid back reference
error_brack | Mismatched square brackets
error_paren | Mismatched parentheses
error_brace | Mismatched braces
error_badbrace | Invalid range inside braces
error_range | Invalid character range
error_space | Insufficient memory
error_badrepeat | No valid regular expression precedes a repetition character (*, +, ?)
error_complexity | The match is too complex for the standard library to handle
error_stack | Insufficient stack space
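In code, the error value is obtained by catching the exception and calling code(). A minimal sketch, where try_compile is a hypothetical helper name:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: tries to compile `pattern`.
// Returns true on success; on failure stores the error value in `code`.
bool try_compile(const std::string& pattern,
                 std::regex_constants::error_type& code) {
    try {
        const std::regex r(pattern);   // throws std::regex_error if the pattern is malformed
        return true;
    } catch (const std::regex_error& e) {
        code = e.code();               // one of the values in the table above
        return false;
    }
}
```

For the pattern "(a", which lacks its closing parenthesis, the stored value compares equal to std::regex_constants::error_paren.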
The basics of exception handling are beyond the scope of this article.
Summary
This article does not cover every part of the regular expression support in the C++11 standard library. I personally think that once you have mastered the material above, you will know how to use the remaining interfaces, so I will not spend more space on them here.
Thank you for reading. If you find any mistakes, please point them out. Many thanks.