Regular expressions can also be used in C++

Source: Internet
Author: User

A regular expression (regex) is a powerful tool for describing character sequences. Many languages support regular expressions, and C++11 brings them into the standard library as well. The library supports six regular expression grammars: ECMAScript, basic, extended, awk, grep, and egrep. ECMAScript is the default, and the grammar to use can be specified when a regular expression is constructed.

Note: ECMAScript is a scripting language standardized by Ecma International (formerly the European Computer Manufacturers Association) as ECMA-262. It is often referred to as JavaScript, but JavaScript is in fact an implementation and extension of the ECMA-262 standard.

Let's take the source of this blog page (http://www.cnblogs.com/ittinybird/p/4853532.html) as an example and show, starting from scratch, how to use regular expressions in C++ to extract all HTTP links from the source of a web page. If I find the time, I would like to rewrite my earlier C++ crawler with the new C++11 features and share it.

Make sure your compiler supports regex

If your compiler is older than GCC 4.9.0 or VS2013, please upgrade before going any further. I used to work with GCC 4.8.3, which ships a <regex> header but does not really implement it: the syntax is accepted, but the library behind it is incomplete, so the code compiles without complaint and then throws an exception as soon as it runs. A perfect trap! The error looks like this:

terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Aborted (core dumped)

If you run into this problem, do not start by doubting your own code; this one is GCC's fault! It took me half a day to track down. So before trying out regular expressions in C++, upgrade your compiler and make sure it really supports them.

Regex Library Overview

The header file <regex> contains a number of components that we need to use with regular expressions, roughly:

basic_regex The regular expression object. It is a class template, with typedef basic_regex<char> regex and typedef basic_regex<wchar_t> wregex
regex_match Matches an entire character sequence against a regular expression
regex_search Searches a character sequence for a subsequence that matches a regular expression, and stops after the first match is found
regex_replace Replaces the subsequences that match a regular expression with formatted replacement text
regex_iterator Iterator over all matching subsequences
match_results Container class that holds the results of a regular expression match
sub_match Container class that holds the character sequence matched by a sub-expression

ECMAScript regular expression syntax

Regular expression syntax is much the same everywhere, so I will not spend space on it here. For the ECMAScript regular expression syntax, see w3school.

Constructing regular expressions

A regular expression is constructed with the class basic_regex. basic_regex is a class template for regular expressions, with a typedef for each of the char and wchar_t character types:

typedef basic_regex<char>    regex;
typedef basic_regex<wchar_t> wregex;

There are quite a few constructors, but they are all simple:

// Default constructor: constructs an empty regular expression that matches nothing
basic_regex();
// Construct from a string s terminated by '\0'
explicit basic_regex(const CharT* s, flag_type f = std::regex_constants::ECMAScript);
// As above, but the length of the string s is given by count
basic_regex(const CharT* s, std::size_t count, flag_type f = std::regex_constants::ECMAScript);
// Copy constructor
basic_regex(const basic_regex& other);
// Move constructor
basic_regex(basic_regex&& other);
// Construct from a basic_string str
template< class ST, class SA >
explicit basic_regex(const std::basic_string<CharT,ST,SA>& str, flag_type f = std::regex_constants::ECMAScript);
// Construct from the characters in the range [first, last)
template< class ForwardIt >
basic_regex(ForwardIt first, ForwardIt last, flag_type f = std::regex_constants::ECMAScript);
// Construct from an initializer_list
basic_regex(std::initializer_list<CharT> init, flag_type f = std::regex_constants::ECMAScript);

Every constructor above except the default, copy, and move constructors takes a flag_type parameter that specifies the grammar of the regular expression; ECMAScript, basic, extended, awk, grep, and egrep are the possible values. A few other flags can be combined with the grammar to change the rules and behavior of matching:

flag_type Effect
icase Ignore case when matching
nosubs Do not store sub-expression matches
optimize Prefer fast matching over fast construction

With the constructor, we can now construct a regular expression that extracts the HTTP link:

// The matching rule is simple; if in doubt, check it against the syntax reference.
// The literal '-' is written last in each bracket expression so that it cannot be read as a range.
std::string pattern("http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w./?%&=-]*)?");
std::regex r(pattern);

Note that the '\' character must be escaped in a C++ string literal, so every '\' in the ECMAScript regular expression syntax has to be written as "\\". In my tests, when the backslashes were not escaped, GCC gave a warning at compile time, and the program built with VS2013 crashed as soon as it ran.

Handle input correctly

For now, assume the program does not download the page itself with a network library. We download the page by hand and save it to a file, so the first thing to do is read the content of the page (the HTML source) into a std::string. It is easy to do this the wrong way:

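#include <iostream>
#include <string>

int main()
{
    std::string tmp, html;
    // operator>> skips whitespace, so spaces and newlines are lost
    while (std::cin >> tmp)
        html += tmp;
}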

Read this way, every whitespace character in the source is silently discarded, which is clearly not what we want. Use the getline() function instead:

#include <iostream>
#include <string>

int main()
{
    std::string tmp, html;
    while (std::getline(std::cin, tmp))
    {
        html += tmp;
        html += '\n';   // getline() drops the line break, so put it back
    }
}

This way the original text is read in correctly. Small details like this are worth attention: when something goes wrong and we start debugging, we tend to suspect the regular expression first, not the input.

regex_search() only finds the first matching subsequence

Going by the name of the function, we may mistakenly pick regex_search() for the job. It has 6 overloads, all used in much the same way, and each returns a bool: true on success, false on failure. To save space, we only look at the one we are going to use:

template< class STraits, class SAlloc, class Alloc, class CharT, class Traits >
bool regex_search( const std::basic_string<CharT,STraits,SAlloc>& s,
                   std::match_results<typename std::basic_string<CharT,STraits,SAlloc>::const_iterator, Alloc>& m,
                   const std::basic_regex<CharT, Traits>& e,
                   std::regex_constants::match_flag_type flags = std::regex_constants::match_default );

The first parameter s is a std::basic_string, the character sequence we want to search. The parameter m is a match_results container that receives the result of the match, and the parameter e is the regular expression object we constructed earlier. The flags parameter deserves a mention: its type is std::regex_constants::match_flag_type, which, as the name suggests, holds match flags. Just as we can specify options when constructing a regular expression object, we can specify additional flags to control the matching rules during the match. The meaning of each flag is quoted from cppreference.com below; look them up when you need them:

Constant Explanation
match_not_bol The first character in [first,last) will be treated as if it is not at the beginning of a line (i.e. ^ will not match [first,first))
match_not_eol The last character in [first,last) will be treated as if it is not at the end of a line (i.e. $ will not match [last,last))
match_not_bow "\b" will not match [first,first)
match_not_eow "\b" will not match [last,last)
match_any If more than one match is possible, then any match is an acceptable result
match_not_null Do not match empty sequences
match_continuous Only match a sub-sequence that begins at first
match_prev_avail --first is a valid iterator position. When set, causes match_not_bol and match_not_bow to be ignored
format_default Use ECMAScript rules to construct strings in std::regex_replace
format_sed Use POSIX sed utility rules in std::regex_replace
format_no_copy Do not copy un-matched strings to the output in std::regex_replace

Based on the parameter type, we construct a call like this:

std::smatch results;
std::regex_search(html, results, r);

However, the standard specifies that regex_search() stops searching as soon as it finds the first matching subsequence! In this program, results only brings back the first HTTP link that satisfies the pattern. That clearly does not meet our need to extract all HTTP links in the page.

Match all substrings with regex_iterator

Strictly speaking, regex_iterator is an iterator adapter that binds the character sequence to be searched to a regex object. The default constructor of regex_iterator is special: it constructs the end-of-sequence iterator directly. The other constructor has this prototype:

regex_iterator(BidirIt a, BidirIt b,   // begin and end iterators of the character sequence to search
               const regex_type& re,   // the regex object
               std::regex_constants::match_flag_type m = std::regex_constants::match_default);   // match flags, the same as for regex_search() above

Like regex_search() above, the regex_iterator constructor takes a parameter of type std::regex_constants::match_flag_type, used in the same way. In fact, regex_iterator calls regex_search() internally, and this parameter is passed on to it. The original post illustrated the process with an animated GIF in which the highlighted parts mark the subsequences that can be matched. It works like this:

First, when the regex_iterator is constructed, the constructor calls regex_search() so that the iterator it points to the first matching subsequence. On each subsequent increment (++it), regex_search() is called again on the remainder of the sequence, until the iterator reaches the end. Throughout, it always "points" to a matching subsequence.

Once we know how it works, the code is easy to write. Combined with the earlier parts, the program is basically done:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string tmp, html;
    while (std::getline(std::cin, tmp))
    {
        tmp += '\n';
        html += tmp;
    }
    // The literal '-' is written last in each bracket expression so that it cannot be read as a range
    std::string pattern("http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w./?%&=-]*)?");
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
    std::regex r(pattern);
    // end is the end-of-sequence iterator; sregex_iterator is the std::string version of regex_iterator
    for (std::sregex_iterator it(html.begin(), html.end(), r), end; it != end; ++it)
    {
        std::cout << it->str() << std::endl;
    }
}

Download the HTML source of this page, save it as test.html, then compile and test the program. Done:

[regex]g++ regex.cpp -std=c++11 -omain
[regex]main < test.html
http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/ittinybird/rsd.xml
http://www.cnblogs.com/ittinybird/wlwmanifest.xml
http://common.cnblogs.com/script/jquery.js
http://files.cnblogs.com/files/ittinybird/mystyle.css
http://www.cnblogs.com/ittinybird/
http://www.cnblogs.com/ittinybird/
http://www.cnblogs.com/ittinybird/
http://i.cnblogs.com/EditPosts.aspx?opt=1
http://msg.cnblogs.com/send/%e6%88%91%e6%98%af%e4%b8%80%e5%8f%aac%2b%2b%e5%b0%8f%e5%b0%8f%e9%b8%9f
http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/images/xml.gif
http://i.cnblogs.com/
http://www.cnblogs.com/ittinybird/p/4853532.html
http://www.cnblogs.com/ittinybird/p/4853532.html
http://www.w3school.com.cn/jsref/jsref_obj_regexp.asp
http://www.cnblogs.com/ittinybird/
http://i.cnblogs.com/EditPosts.aspx?postid=4853532
http://www.cnblogs.com/
http://q.cnblogs.com/
http://news.cnblogs.com/
http://home.cnblogs.com/ing/
http://job.cnblogs.com/
http://kb.cnblogs.com/

Regex and exception handling

If there is an error in our regular expression, the standard library throws a regex_error exception at run time. It has a member function named code() that reports the type of the error; the error values and their meanings are listed in the following table:

code meaning
error_collate Invalid collating element
error_ctype Invalid character class
error_escape Invalid escape character or trailing escape
error_backref Invalid back reference
error_brack Mismatched square brackets
error_paren Mismatched parentheses
error_brace Mismatched curly braces
error_badbrace Invalid range inside curly braces
error_range Invalid character range
error_space Not enough memory
error_badrepeat A repeat character (* + ? {) was not preceded by a valid regular expression
error_complexity The match is too complex
error_stack Not enough stack space

The basics of exception handling are outside the scope of this article, so I will not go into them here.

Summary

This article does not cover everything in the regular expression part of the C++11 standard library. I believe that once you have mastered the content above, you can work out the rest just by looking at the interfaces, so I will not spend more space on it here.

Thank you for reading. If you find any mistakes, please point them out; I would be very grateful :).
