Example and source code analysis of regular expressions in C++11

Source: Internet
Author: User

A regular expression (regex) is a powerful tool for describing character sequences. Regular expressions exist in many languages, and C++11 brings them into the standard library as well. It supports six regular expression grammars: ECMAScript, basic, extended, awk, grep, and egrep. ECMAScript is the default; you can choose a different grammar when constructing a regular expression.

Note: ECMAScript is a scripting language standardized as ECMA-262 by Ecma International, formerly known as the European Computer Manufacturers Association. It is often called JavaScript, but the latter is actually an implementation and extension of the ECMA-262 standard.

Next, taking the source of this blog page (http://www.cnblogs.com/ittinybird/p/4853532.html) as an example, we will demonstrate from scratch how to use regular expressions in C++ to extract all the http links from a web page's source code. I have been wanting to use the new features of C++11 lately; if I find the time, I will rewrite my earlier C++ crawler with them and share it.

Make sure your compiler supports Regex

If your compiler is older than GCC 4.9.0 or VS2013, upgrade it before going any further. The C++ compiler I used earlier was GCC 4.8.3, which does ship a <regex> header. However, while GCC 4.8 fully supports the C++11 language syntax, its library had not caught up yet: the code compiles without any problem, but an exception is thrown as soon as it runs. A perfect pitfall! The specific error is as follows:

terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Aborted (core dumped)

If you run into this problem, do not start by doubting your own code: the regex implementation in GCC 4.8 is simply incomplete. I spent half a day tracking this down. So before using C++ regular expressions, upgrade your compiler and make sure it really supports them.

Regex library overview

The header file <regex> contains multiple components that we need to use when using a regular expression. The components include:

basic_regex: the regular expression object. It is a general class template, with typedef basic_regex<char> regex and typedef basic_regex<wchar_t> wregex.

regex_match: matches an entire character sequence against a regular expression.

regex_search: searches a character sequence for a subsequence that matches the regular expression. The search stops after the first match is found.

regex_replace: replaces the parts of a character sequence that match the regular expression with formatted replacement text.

regex_iterator: an iterator that visits every matching subsequence.

match_results: a container class that holds the results of a regular expression match.

sub_match: a container class that holds the character sequence matched by a sub-expression.

ECMAScript regular expression syntax

Regular expression syntax is much the same everywhere, so going through it here would be a waste of time. For the ECMAScript regular expression syntax, see W3School.

Construct a regular expression

To construct a regular expression, use the class basic_regex, a general class template for regular expressions. It has typedefs for both the char and wchar_t types:

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

There are many constructors, but they are very simple:

// Default constructor; the resulting regex matches no character sequence
basic_regex();
// Construct a regular expression from the null-terminated string s
explicit basic_regex(const CharT* s, flag_type f = std::regex_constants::ECMAScript);
// Same as above, but the length of s is given by count
basic_regex(const CharT* s, std::size_t count, flag_type f = std::regex_constants::ECMAScript);
// Copy constructor
basic_regex(const basic_regex& other);
// Move constructor
basic_regex(basic_regex&& other);
// Construct a regular expression from str, a basic_string
template <class ST, class SA>
explicit basic_regex(const std::basic_string<CharT, ST, SA>& str, flag_type f = std::regex_constants::ECMAScript);
// Construct a regular expression from the characters in the range [first, last)
template <class ForwardIt>
basic_regex(ForwardIt first, ForwardIt last, flag_type f = std::regex_constants::ECMAScript);
// Construct from an initializer_list
basic_regex(std::initializer_list<CharT> init, flag_type f = std::regex_constants::ECMAScript);

All the constructors except the default, copy, and move constructors take a flag_type parameter that selects the regular expression grammar. The possible grammars are ECMAScript, basic, extended, awk, grep, and egrep. A few other flags change the rules and behavior of regular expression matching:

flag_type: Effect

icase: matching is case-insensitive

nosubs: matched sub-expressions are not stored

optimize: favor matching speed over construction speed

With the constructor, we can create a regular expression for extracting http links first:

// The matching rule is simple; if anything is unclear, check it against the syntax reference.
// The '-' is written last in the final character class so that it is always taken literally.
std::string pattern("http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w ./?%&=-]*)?");
std::regex r(pattern);

It is worth mentioning that '\' has to be escaped in a C++ string literal, so every '\' in the ECMAScript regular expression syntax must be written as "\\". In my tests, when the backslashes were not escaped, GCC issued a warning at compile time, and the program built with VS2013 crashed when run.

Correct Input Processing

Let's start with a side issue. Suppose the web page is not downloaded by the program itself through a network library; instead we download it by hand, save it to a file, and feed that file to the program on standard input. First we need to get the page content (the HTML source) into a std::string. We might use this incorrect approach:

#include <iostream>
#include <string>

int main()
{
    std::string tmp, html;
    while (std::cin >> tmp)   // operator>> skips the whitespace between words
        html += tmp;
}

Read this way, all the whitespace in the source is thrown away, which is obviously not what we want. Use getline() instead:

#include <iostream>
#include <string>

int main()
{
    std::string tmp, html;
    while (getline(std::cin, tmp))
    {
        html += tmp;
        html += '\n';   // getline() strips the newline, so put it back
    }
}

In this way the original text is read in correctly. I personally think small details like this are worth noting: when a match fails during debugging, we are far more likely to suspect the regular expression than the code that reads the input.

Regex_search () only finds the first matched subsequence

Going by the literal meaning of the function name, we might mistakenly choose regex_search() for the job. It has six overloads, which are all used in much the same way. The return value is a bool: true if a match is found, false otherwise. To save space, we only look at the following one:

template <class STraits, class SAlloc, class Alloc, class CharT, class Traits>
bool regex_search(const std::basic_string<CharT, STraits, SAlloc>& s,
                  std::match_results<typename std::basic_string<CharT, STraits, SAlloc>::const_iterator, Alloc>& m,
                  const std::basic_regex<CharT, Traits>& e,
                  std::regex_constants::match_flag_type flags = std::regex_constants::match_default);

The first parameter s is a std::basic_string: the character sequence to search. The parameter m is a match_results container that receives the match results, and e is the regular expression object constructed earlier. The flags parameter deserves a mention. Its type is std::regex_constants::match_flag_type, and it controls how matching is performed. Just as we can pass options when constructing a regular expression object, we can pass further flags to adjust the rules during matching. The meanings of these flags are quoted from cppreference.com; look them up as needed:

Constant: Explanation

match_not_bol: The first character in [first, last) will be treated as if it is not at the beginning of a line (i.e. ^ will not match [first, first)).

match_not_eol: The last character in [first, last) will be treated as if it is not at the end of a line (i.e. $ will not match [last, last)).

match_not_bow: "\b" will not match [first, first).

match_not_eow: "\b" will not match [last, last).

match_any: If more than one match is possible, then any match is an acceptable result.

match_not_null: Do not match empty sequences.

match_continuous: Only match a sub-sequence that begins at first.

match_prev_avail: --first is a valid iterator position. When set, causes match_not_bol and match_not_bow to be ignored.

format_default: Use ECMAScript rules to construct strings in std::regex_replace.

format_sed: Use POSIX sed utility rules in std::regex_replace.

format_no_copy: Do not copy un-matched strings to the output in std::regex_replace.

Based on the parameter types, we write a call like this:

std::smatch results;
std::regex_search(html, results, r);

However, the standard library requires regex_search() to stop searching after the first matching substring is found! In this program, results only brings back the first http link that satisfies the pattern. This obviously does not meet our need to extract all the http links on the page.

Use regex_iterator to match all substrings

Strictly speaking, regex_iterator is an iterator adapter that binds the character sequence to be searched to a regex object. Its default constructor is special: it directly constructs an end-of-sequence (off-the-end) iterator. The other constructor has this prototype:

regex_iterator(BidirIt a, BidirIt b,   // begin and end iterators of the character sequence to search
               const regex_type& re,   // the regex object
               std::regex_constants::match_flag_type m = std::regex_constants::match_default);   // flags, as for regex_search() above

Like regex_search() above, the regex_iterator constructor also takes a parameter of type std::regex_constants::match_flag_type, used in the same way. In fact, regex_iterator is implemented by calling regex_search(), and this parameter is simply passed on to it. (Animated GIF in the original post: the darker parts of the sequence are the subsequences that can be matched.)

When a regex_iterator is constructed, the constructor first calls regex_search() so that the iterator points to the first matching subsequence. On each subsequent increment (++it), regex_search() is called again on the remaining subsequence, until the iterator reaches the end. While it is valid, the iterator always points to a matched subsequence.

Once we understand the principle, the code is much easier to write. Putting the previous sections together, the program is essentially complete:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string tmp, html;
    while (getline(std::cin, tmp))
    {
        tmp += '\n';
        html += tmp;
    }
    // The '-' is written last in the final character class so that it is always taken literally.
    std::string pattern("http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w ./?%&=-]*)?");
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
    std::regex r(pattern);
    // end is the off-the-end iterator; sregex_iterator is the std::string version of regex_iterator
    for (std::sregex_iterator it(html.begin(), html.end(), r), end;
         it != end;
         ++it)
    {
        std::cout << it->str() << std::endl;
    }
}

Save the HTML source of the downloaded page as test.html, then compile and run the program to test it:

[regex] g++ regex.cpp -std=c++11 -o main
[regex] ./main < test.html

http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/ittinybird/rsd.xml
http://www.cnblogs.com/ittinybird/wlwmanifest.xml
http://common.cnblogs.com/script/jquery.js
http://files.cnblogs.com/files/ittinybird/mystyle.css
http://www.cnblogs.com/ittinybird/
http://www.cnblogs.com/ittinybird/
http://www.cnblogs.com/ittinybird/
http://i.cnblogs.com/EditPosts.aspx?opt=1
http://msg.cnblogs.com/send/%E6%88%91%E6%98%AF%E4%B8%80%E5%8F%AAC%2B%2B%E5%B0%8F%E5%B0%8F%E9%B8%9F
http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/ittinybird/rss
http://www.cnblogs.com/images/xml.gif
http://i.cnblogs.com/
http://www.cnblogs.com/ittinybird/p/4853532.html
http://www.cnblogs.com/ittinybird/p/4853532.html
http://www.w3school.com.cn/jsref/jsref_obj_regexp.asp
http://www.cnblogs.com/ittinybird/
http://i.cnblogs.com/EditPosts.aspx?postid=4853532
http://www.cnblogs.com/
http://q.cnblogs.com/
http://news.cnblogs.com/
http://home.cnblogs.com/ing/
http://job.cnblogs.com/
http://kb.cnblogs.com/

Regex and Exception Handling

If our regular expression contains an error, the standard library throws a regex_error exception at run time. The exception has a member function named code() that identifies the kind of error. The error values and their meanings are shown in the following table:

Code: Description

error_collate: invalid collating element name

error_ctype: invalid character class name

error_escape: invalid escape character, or a trailing escape

error_backref: invalid back reference

error_brack: mismatched square brackets

error_paren: mismatched parentheses

error_brace: mismatched braces

error_badbrace: invalid range inside braces

error_range: invalid character range

error_space: not enough memory

error_badrepeat: a repetition character (*, ?, + or {) is not preceded by a valid regular expression

error_complexity: the match is too complex for the standard library to handle

error_stack: not enough stack space

The basics of exception handling are outside the scope of this article.

Summary

Some parts of the regular expression support in the C++11 standard library are not covered in this article. I believe that once you have mastered the content above, you can work out the rest from the interfaces, so I will not spend more space on them here.

Thank you for reading. If you find any mistakes, please point them out. Thank you very much.
