Example and source code analysis of Regular Expressions in C++11

Last Update:2015-10-12 Source: Internet

Author: User

Tags traits egrep


A regular expression (regex) is a powerful tool for describing character sequences. Many languages support regular expressions, and C++11 adds them to the standard library. The library supports six regular expression grammars: ECMAScript, basic, extended, awk, grep, and egrep. ECMAScript is the default; you can choose a different grammar when constructing a regular expression.

Note: ECMAScript is a scripting language standardized as ECMA-262 by Ecma International (formerly the European Computer Manufacturers Association). It is often referred to as JavaScript, although JavaScript is actually an implementation and extension of the ECMA-262 standard.

Next, using the source of this blog page (http://www.cnblogs.com/ittinybird/p/4853532.html) as an example, we will demonstrate from scratch how to use regular expressions in C++ to extract all the http links from a web page's source code. If I have time, I would like to use the new C++11 features I have been learning to rewrite my earlier C++ crawler and share it.

Make sure your compiler supports Regex

If your compiler is older than GCC 4.9.0 or VS2013, upgrade it before going any further. The C++ compiler I used earlier was GCC 4.8.3, which does ship a <regex> header. However, the implementation is incomplete: the language syntax is fully supported, but the library had not caught up yet. As a result the code compiles without any problem and then throws an exception as soon as it runs, a perfect pitfall! The error looks like this:

terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Aborted (core dumped)

If you run into this problem, do not start by doubting your own code; GCC's regex support at that version is the culprit. I spent half a day tracking this down. So before using C++ regular expressions, upgrade your compiler and make sure it really supports them.
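A quick way to find out whether a toolchain's <regex> actually works is to construct a small pattern and catch std::regex_error. The following is only a sketch: the helper names regex_library_works and require_working_regex are made up for illustration, and the choice of a bracket expression as the probe is an assumption about what trips incomplete implementations.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the standard library can compile and run a
// simple pattern. An incomplete <regex> implementation typically throws
// std::regex_error from the std::regex constructor instead.
bool regex_library_works()
{
    try {
        const std::regex r("[a-z]+[0-9]");
        const std::string sample("abc1");
        return std::regex_match(sample, r);
    } catch (const std::regex_error&) {
        return false;
    }
}

// Hypothetical startup check: fails at a known location instead of
// somewhere deep inside the program.
void require_working_regex()
{
    assert(regex_library_works());
}
```

Calling require_working_regex() at the top of main() turns the confusing run-time abort into an immediate, easy-to-locate failure.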

Regex library overview

The header file <regex> contains multiple components that we need to use when using a regular expression. The components include:

basic_regex	Regular expression object; a general class template with typedef basic_regex<char> regex and typedef basic_regex<wchar_t> wregex
regex_match	Match an entire character sequence against a regular expression
regex_search	Search a character sequence for a substring that matches the regular expression; the search stops after the first match is found
regex_replace	Replace the parts of a character sequence that match the regular expression with formatted replacement text
regex_iterator	Iterator over all matching substrings
match_results	Container class that holds the results of a regular expression match
sub_match	Container class that holds the character sequence matched by a sub-expression
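To make the difference between the three algorithms concrete, here is a small sketch. The function names and sample strings are invented for this illustration; only the std:: calls come from the library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// regex_match succeeds only if the WHOLE input matches the pattern.
bool is_number(const std::string& s)
{
    static const std::regex digits("\\d+");
    return std::regex_match(s, digits);
}

// regex_search succeeds if ANY substring matches; the match_results
// object (std::smatch for std::string) describes the first match.
std::string first_number(const std::string& s)
{
    static const std::regex digits("\\d+");
    std::smatch m;
    return std::regex_search(s, m, digits) ? m.str() : std::string();
}

// regex_replace returns a copy of the input with every match replaced.
std::string mask_numbers(const std::string& s)
{
    static const std::regex digits("\\d+");
    return std::regex_replace(s, digits, "#");
}

// The asserts document the behavior described in the comments above.
void overview_demo()
{
    assert(is_number("2015"));
    assert(!is_number("year 2015"));
    assert(first_number("year 2015, month 10") == "2015");
    assert(mask_numbers("year 2015, month 10") == "year #, month #");
}
```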

ECMAScript regular expression syntax

Regular expression syntax is largely the same everywhere, so covering it here would be a waste of space. For the ECMAScript regular expression syntax, see W3School.

Construct a regular expression

To construct a regular expression, use the class basic_regex. basic_regex is a general class template for regular expressions, with typedefs for the char and wchar_t character types:

typedef basic_regex<char>    regex;
typedef basic_regex<wchar_t> wregex;

There are many constructors, but they are very simple:

// Default constructor; the resulting regex matches nothing
basic_regex();
// Construct from the null-terminated string s
explicit basic_regex(const CharT* s, flag_type f = std::regex_constants::ECMAScript);
// As above, but the length of s is given by count
basic_regex(const CharT* s, std::size_t count, flag_type f = std::regex_constants::ECMAScript);
// Copy constructor
basic_regex(const basic_regex& other);
// Move constructor
basic_regex(basic_regex&& other);
// Construct from str, a basic_string
template <class ST, class SA>
explicit basic_regex(const std::basic_string<CharT, ST, SA>& str, flag_type f = std::regex_constants::ECMAScript);
// Construct from the characters in the range [first, last)
template <class ForwardIt>
basic_regex(ForwardIt first, ForwardIt last, flag_type f = std::regex_constants::ECMAScript);
// Construct from an initializer_list
basic_regex(std::initializer_list<CharT> init, flag_type f = std::regex_constants::ECMAScript);

Every constructor except the default, copy, and move constructors takes a flag_type parameter that selects the regular expression grammar. ECMAScript, basic, extended, awk, grep, and egrep are the possible values. Several other flags can be combined with the grammar to change the rules and behavior of matching:

flag_type	Effect
icase	Ignore case when matching
nosubs	Do not store matched sub-expressions
optimize	Favor matching speed over construction speed
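As a sketch of how such flags are passed, the hypothetical function below combines the grammar flag with icase using operator|. The function name and test strings are assumptions made for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: does s spell "http" or "https", optionally
// ignoring case? Flags are combined with operator| and passed as the
// second constructor argument.
bool matches_http(const std::string& s, bool ignore_case)
{
    std::regex_constants::syntax_option_type flags =
        std::regex_constants::ECMAScript;
    if (ignore_case)
        flags = flags | std::regex_constants::icase;
    const std::regex scheme("https?", flags);
    return std::regex_match(s, scheme);
}

void flags_demo()
{
    assert(matches_http("http", false));
    assert(!matches_http("HTTPS", false));  // case-sensitive by default
    assert(matches_http("HTTPS", true));    // icase makes it match
}
```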

With the constructor, we can create a regular expression for extracting http links first:

// The matching rule is very simple; if in doubt, check it against the syntax reference.
// '-' is written last inside the final bracket expression so that it is always read as a literal.
std::string pattern("http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w./?%&=-]*)?");
std::regex r(pattern);

Note that the '\' character must itself be escaped in a C++ string literal, so every '\' in the ECMAScript regular expression syntax has to be written as "\\". In my tests, leaving the backslashes unescaped produced a warning from gcc, and the program built with VS2013 crashed at run time.
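C++11 raw string literals offer a way around the doubled backslashes. The sketch below uses only the host-name part of the pattern; the variable and function names are assumptions introduced for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Escaped form: every regex backslash is doubled for the C++ compiler.
const std::string escaped_pattern("([\\w-]+\\.)+[\\w-]+");

// Raw string literal: R"( ... )" passes backslashes through unchanged.
const std::string raw_pattern(R"(([\w-]+\.)+[\w-]+)");

// Hypothetical helper built on the raw form.
bool looks_like_hostname(const std::string& s)
{
    static const std::regex host(raw_pattern);
    return std::regex_match(s, host);
}

void raw_string_demo()
{
    assert(escaped_pattern == raw_pattern);  // both spell the same regex
    assert(looks_like_hostname("www.cnblogs.com"));
    assert(!looks_like_hostname("not a host"));
}
```

The raw form reads exactly like the regular expression itself, which makes it easier to compare against a syntax reference.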

Correct Input Processing

Let's start with a side issue. Suppose we do not download the web page from inside the program with a network library, but download it by hand and save it to a file. We first need to read the page content (the HTML source) into a std::string. We might be tempted to do it this way, which is wrong:

#include <iostream>
#include <string>

int main()
{
    std::string tmp, html;
    while (std::cin >> tmp)  // operator>> skips whitespace, so it is lost
        html += tmp;
}

This discards all the whitespace in the source, which is clearly not what we want. Instead, we use getline():

#include <iostream>
#include <string>

int main()
{
    std::string tmp, html;
    while (getline(std::cin, tmp))
    {
        html += tmp;
        html += '\n';  // getline strips the newline, so add it back
    }
}

This way the original text is read in correctly. I personally think small details like this are worth noting, because when debugging fails we tend to suspect the regular expression first.
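The difference between the two reading strategies can be checked without a file by feeding both from a std::istringstream. This is a sketch; the function names and the example.com snippet are made up for illustration.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Wrong for our purpose: operator>> splits on whitespace and drops it.
std::string read_by_word(std::istream& in)
{
    std::string tmp, text;
    while (in >> tmp)
        text += tmp;
    return text;
}

// Line-based reading keeps the spaces inside each line; the '\n' that
// getline strips is added back by hand.
std::string read_by_line(std::istream& in)
{
    std::string tmp, text;
    while (std::getline(in, tmp)) {
        text += tmp;
        text += '\n';
    }
    return text;
}

void input_demo()
{
    const std::string html =
        "<a href=\"http://example.com/\">home page</a>\n";
    std::istringstream a(html), b(html);
    assert(read_by_word(a) == "<ahref=\"http://example.com/\">homepage</a>");
    assert(read_by_line(b) == html);
}
```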

regex_search() only finds the first matched subsequence

Going by the name of the function, we might mistakenly pick regex_search() to do the matching. It has six overloads, all used in a similar way. The return value is a bool: true on success, false on failure. To save space, we look at just one of them:

template <class STraits, class SAlloc, class Alloc, class CharT, class Traits>
bool regex_search(const std::basic_string<CharT, STraits, SAlloc>& s,
                  std::match_results<typename std::basic_string<CharT, STraits, SAlloc>::const_iterator, Alloc>& m,
                  const std::basic_regex<CharT, Traits>& e,
                  std::regex_constants::match_flag_type flags = std::regex_constants::match_default);

The first parameter, s, is a std::basic_string: the character sequence to search. The parameter m is a match_results container that receives the match results, and e is the regular expression object constructed earlier. The flags parameter deserves a mention. Its type is std::regex_constants::match_flag_type, and it controls the matching behavior. Just as we can pass options when constructing a regular expression object, we can pass flags here to control the rules used during matching. The meanings of these flags, quoted from cppreference.com, are listed below for reference:

Constant	Explanation
`match_not_bol`	The first character in [first,last) will be treated as if it is not at the beginning of a line (i.e. ^ will not match [first,first))
`match_not_eol`	The last character in [first,last) will be treated as if it is not at the end of a line (i.e. $ will not match [last,last))
`match_not_bow`	"\b" will not match [first,first)
`match_not_eow`	"\b" will not match [last,last)
`match_any`	If more than one match is possible, then any match is an acceptable result
`match_not_null`	Do not match empty sequences
`match_continuous`	Only match a sub-sequence that begins at first
`match_prev_avail`	--first is a valid iterator position. When set, causes match_not_bol and match_not_bow to be ignored
`format_default`	Use ECMAScript rules to construct strings in std::regex_replace
`format_sed`	Use POSIX sed utility rules in std::regex_replace
`format_no_copy`	Do not copy un-matched strings to the output in std::regex_replace
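Two of these flags are easy to try out in isolation. The sketch below uses match_continuous with regex_search and format_no_copy with regex_replace; the function names and inputs are assumptions made for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// match_continuous: the match must begin at the first character.
bool starts_with_digits(const std::string& s)
{
    static const std::regex digits("\\d+");
    return std::regex_search(s, digits,
                             std::regex_constants::match_continuous);
}

// format_no_copy: regex_replace emits only the formatted matches and
// drops the unmatched text between them. "$&" stands for the match.
std::string digits_only(const std::string& s)
{
    static const std::regex digits("\\d+");
    return std::regex_replace(s, digits, "$&,",
                              std::regex_constants::format_no_copy);
}

void match_flags_demo()
{
    assert(starts_with_digits("42abc"));
    assert(!starts_with_digits("abc42"));
    assert(digits_only("a1b22c") == "1,22,");
}
```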

Based on the parameter types, we construct a call like this:

std::smatch results;
std::regex_search(html, results, r);

However, the standard library requires regex_search() to stop searching as soon as it finds the first matching substring! In this program, results only brings back the first http link that satisfies the pattern. This clearly does not meet our need to extract all the HTTP links on the page.
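To see what a single regex_search() call actually returns, the following sketch inspects the match_results object. It uses a simplified stand-in for the link pattern (an assumption of this example, as are the function name and the a.example and b.example hosts).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the first link in html, or an empty string.
// Group 1 captures the optional "s", group 2 the host name. The results
// object refers into html, so html must outlive it.
std::string first_link(const std::string& html, std::smatch& results)
{
    static const std::regex link_re("http(s)?://([\\w.-]+)");
    return std::regex_search(html, results, link_re) ? results.str()
                                                     : std::string();
}

void first_match_demo()
{
    const std::string html =
        "<a href=\"http://a.example/\">A</a> "
        "<a href=\"https://b.example/\">B</a>";
    std::smatch results;
    assert(first_link(html, results) == "http://a.example");
    assert(results.size() == 3);              // whole match + 2 groups
    assert(!results[1].matched);              // "(s)?" did not take part
    assert(results[2].str() == "a.example");  // sub_match for the host
    // The second link is still sitting in the unsearched remainder.
    assert(results.suffix().str().find("https://b.example")
           != std::string::npos);
}
```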

Use regex_iterator to match all substrings

Strictly speaking, regex_iterator is an iterator adapter that binds the character sequence to be matched to a regex object. Its default constructor is special: it directly constructs an end-of-sequence (past-the-end) iterator. The other constructor has this prototype:

regex_iterator(BidirIt a, BidirIt b,  // begin and end iterators of the character sequence to match
               const regex_type& re,  // the regex object
               std::regex_constants::match_flag_type m = std::regex_constants::match_default);  // flags, as for regex_search() above

Like regex_search() above, the regex_iterator constructor also takes a std::regex_constants::match_flag_type parameter, used in the same way. In fact, regex_iterator is implemented by calling regex_search(), and this parameter is simply passed on to it. (The original post illustrates the process with an animated GIF in which the darker-shaded parts mark the subsequences that can be matched.)

When a regex_iterator is constructed, the constructor first calls regex_search() so that the iterator points to the first matching subsequence. On each subsequent increment (++it), regex_search() is called again on the remaining subsequence, until the iterator reaches the end. The iterator always points to a matched subsequence.
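The idea can be sketched by writing the loop out by hand and comparing it with std::sregex_iterator. This is a conceptual illustration only, not the library's actual implementation (which, for example, also adjusts the match flags between calls); the function names are made up.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// By hand: call regex_search repeatedly, each time starting where the
// previous match ended.
std::vector<std::string> find_all_by_hand(const std::string& text,
                                          const std::regex& re)
{
    std::vector<std::string> out;
    std::string::const_iterator first = text.begin();
    const std::string::const_iterator last = text.end();
    std::smatch m;
    while (std::regex_search(first, last, m, re)) {
        out.push_back(m.str());
        if (m[0].first == m[0].second) {  // empty match: step forward
            if (first == last)
                break;
            ++first;
        } else {
            first = m[0].second;          // continue after this match
        }
    }
    return out;
}

// The same traversal with the library's iterator adapter.
std::vector<std::string> find_all_with_iterator(const std::string& text,
                                                const std::regex& re)
{
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
        out.push_back(it->str());
    return out;
}

void iterator_demo()
{
    const std::regex digits("\\d+");
    const std::string text("a1b22c333");
    const std::vector<std::string> expected = {"1", "22", "333"};
    assert(find_all_by_hand(text, digits) == expected);
    assert(find_all_with_iterator(text, digits) == expected);
}
```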

Once the principle is clear, writing the code is much easier. Combining the previous sections, the program is essentially done:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string tmp, html;
    while (getline(std::cin, tmp))
    {
        tmp += '\n';
        html += tmp;
    }
    // '-' is written last inside the final bracket expression so that it is always read as a literal.
    std::string pattern("http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w./?%&=-]*)?");
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
    std::regex r(pattern);
    // end is the past-the-end iterator; sregex_iterator is the std::string version of regex_iterator.
    for (std::sregex_iterator it(html.begin(), html.end(), r), end;
         it != end;
         ++it)
    {
        std::cout << it->str() << std::endl;
    }
}

The HTML source code of the downloaded page is saved as test.html. Compile the source code to test the code:

[regex] g++ regex.cpp -std=c++11 -o main
[regex] ./main < test.html

http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/ittinybird/rsd.xml
http://www.cnblogs.com/ittinybird/wlwmanifest.xml
http://common.cnblogs.com/script/jquery.js
http://files.cnblogs.com/files/ittinybird/mystyle.css
http://www.cnblogs.com/ittinybird/
http://i.cnblogs.com/EditPosts.aspx?opt=1
http://msg.cnblogs.com/send/%E6%88%91%E6%98%AF%E4%B8%80%E5%8F%AAC%2B%2B%E5%B0%8F%E5%B0%8F%E9%B8%9F
http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/images/xml.gif
http://i.cnblogs.com/
http://www.cnblogs.com/ittinybird/p/4853532.html
http://www.w3school.com.cn/jsref/jsref_obj_regexp.asp
http://www.cnblogs.com/ittinybird/
http://i.cnblogs.com/EditPosts.aspx?postid=4853532
http://www.cnblogs.com/
http://q.cnblogs.com/
http://news.cnblogs.com/
http://home.cnblogs.com/ing/
http://job.cnblogs.com/
http://kb.cnblogs.com/

Regex and Exception Handling

If our regular expression contains an error, the standard library throws a regex_error exception at run time. Its member function code() identifies the type of error. The error values and their meanings are listed in the following table:

Code	Description
error_collate	Invalid collating element
error_ctype	Invalid character class
error_escape	Invalid escape character or trailing escape
error_backref	Invalid back reference
error_brack	Mismatched square brackets
error_paren	Mismatched parentheses
error_brace	Mismatched braces
error_badbrace	Invalid range inside braces
error_range	Invalid character range
error_space	Insufficient memory
error_badrepeat	A repetition character (*, ?, +, or {) is not preceded by a valid regular expression
error_complexity	The match is too complex for the standard library to handle
error_stack	Insufficient stack space

The basics of exception handling are not covered in this article.
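For completeness, here is a minimal sketch of catching regex_error around construction. The helper name try_compile and its message format are assumptions made for this example; code() and what() are the members described above.

```cpp
#include <cassert>
#include <iostream>
#include <regex>
#include <string>

// Hypothetical helper: tries to compile pattern into out. On failure it
// reports the error and returns false instead of letting the exception
// escape.
bool try_compile(const std::string& pattern, std::regex& out)
{
    try {
        out = std::regex(pattern);
        return true;
    } catch (const std::regex_error& e) {
        // code() can be compared with the constants in the table,
        // e.g. std::regex_constants::error_paren.
        std::cerr << "bad pattern \"" << pattern << "\": " << e.what()
                  << " (code " << static_cast<int>(e.code()) << ")\n";
        return false;
    }
}

void exception_demo()
{
    std::regex r;
    assert(!try_compile("(abc", r));  // unbalanced parenthesis
    assert(try_compile("(abc)", r));
    assert(std::regex_match(std::string("abc"), r));
}
```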

Summary

This article does not cover every part of the regular expression support in the C++11 standard library. I personally think that once you have mastered the material above, you will know how to use the rest of the interface, so I will not spend more space on it here.

Thank you for reading. If you find any errors, please point them out.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
