Open-Source Regular Expression Libraries and Their Usage

Source: Internet
Author: User
Tags expression engine

You may already use regular expressions every day through tools such as grep, Vim, sed, and awk without being familiar with the term. "Regular expression" is usually abbreviated to regex, regexp, or even re. Many articles explain regular expression syntax, and a search engine will turn up good tutorials. Material on using them from C/C++ is harder to find. Most C standard libraries ship a regex implementation; see /usr/include/regex.h or man regex. Perl, PHP, and other languages provide powerful regular expression support, and the best-known regular expression library for C is PCRE (Perl Compatible Regular Expressions). This article introduces regex and PCRE, along with a few other libraries.

1. regex
Using regex is simple; sample code 1 below shows everything you need. (The sample code is taken from the article "Getting Started with GNU C Regular Expressions".)

#include <stdio.h>
#include <string.h>
#include <regex.h>

#define SUBSLEN 10      /* number of matched substrings */
#define EBUFLEN 128     /* error message buffer length */
#define BUFLEN 1024     /* length of the matched string buffer */

int main()
{
    size_t len;
    regex_t re;                 /* stores the compiled regular expression; it must be compiled before use */
    regmatch_t subs[SUBSLEN];   /* stores the positions of the matched strings */
    char matched[BUFLEN];       /* stores a matched string */
    char errbuf[EBUFLEN];       /* stores error messages */
    int err, i;

    char src[] = "111 <title>Hello World</title> 222";   /* source string */
    char pattern[] = "<title>(.*)</title>";              /* pattern string */

    printf("string: %s\n", src);
    printf("pattern: \"%s\"\n", pattern);

    /* compile the regular expression */
    err = regcomp(&re, pattern, REG_EXTENDED);

    if (err) {
        len = regerror(err, &re, errbuf, sizeof(errbuf));
        printf("error: regcomp: %s\n", errbuf);
        return 1;
    }
    printf("Total has subexpression: %zu\n", re.re_nsub);
    /* run the pattern match */
    err = regexec(&re, src, (size_t)SUBSLEN, subs, 0);

    if (err == REG_NOMATCH) {   /* no match */
        printf("Sorry, no match ...\n");
        regfree(&re);
        return 0;
    } else if (err) {           /* other errors */
        len = regerror(err, &re, errbuf, sizeof(errbuf));
        printf("error: regexec: %s\n", errbuf);
        regfree(&re);
        return 1;
    }

    /* neither REG_NOMATCH nor any other error, so the pattern matched */
    printf("\nOK, has matched ...\n\n");
    for (i = 0; i <= (int)re.re_nsub; i++) {
        len = subs[i].rm_eo - subs[i].rm_so;
        if (i == 0) {
            /* Note 1: rm_so is cast because its width differs between platforms */
            printf("begin: %lld, len = %zu  ", (long long)subs[i].rm_so, len);
        } else {
            printf("subexpression %d begin: %lld, len = %zu  ", i, (long long)subs[i].rm_so, len);
        }
        memcpy(matched, src + subs[i].rm_so, len);
        matched[len] = '\0';
        printf("match: %s\n", matched);
    }

    regfree(&re);   /* release after use */
    return 0;
}

 

The execution result is:

string: 111 <title>Hello World</title> 222
pattern: "<title>(.*)</title>"
Total has subexpression: 1

OK, has matched ...

begin: 4, len = 26  match: <title>Hello World</title>
subexpression 1 begin: 11, len = 11  match: Hello World

 

As the sample program shows, we first compile the pattern with regcomp() and then call regexec() to perform the actual match. If you only need to know whether a match succeeded, understanding these two functions is enough. Often, though, we want the text matched by a subexpression. To get the title in the example, we enclose the subexpression in parentheses, as in "<title>(.*)</title>". The regex engine then records the text matched by each parenthesized subexpression, and that text can be retrieved separately from the match result. The sample program extracts the title of an HTML page in this way.
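When only a yes/no answer is needed, the two calls can be wrapped in a small helper. The following is a minimal sketch and not part of the original sample: the function name regex_matches is made up for illustration, and it passes the standard REG_NOSUB flag so that regexec() does not have to record match positions.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the original article):
 * returns 1 if text matches pattern, 0 if it does not,
 * and -1 if the pattern fails to compile or regexec() reports an error. */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int err;

    /* REG_NOSUB: only a yes/no answer is needed, not match positions */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    err = regexec(&re, text, 0, NULL, 0);
    regfree(&re);   /* always release the compiled pattern */

    if (err == 0)
        return 1;
    if (err == REG_NOMATCH)
        return 0;
    return -1;
}
```

A call such as regex_matches("<title>(.*)</title>", src) would then report whether the source string contains a title element at all. Compiling the pattern on every call is simple but wasteful if the same pattern is reused many times.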

regmatch_t subs[SUBSLEN] stores the match positions. subs[0] holds the position of the whole match, and subs[1] holds the position of the first subexpression, which is the title in the example. Each position is given by the rm_so (start offset) and rm_eo (end offset) members of the structure. This detail is easy to overlook.
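Turning an rm_so/rm_eo pair into a C string could be sketched as follows. The helper name copy_submatch and its return convention are assumptions made for this illustration. It also checks the buffer size and handles a group that did not take part in the match, in which case regexec() sets both offsets to -1.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper (not from the original article):
 * copies the text covered by one regmatch_t entry into out (capacity outlen).
 * Returns the number of characters copied, or -1 if the group did not
 * take part in the match or the buffer is too small. */
int copy_submatch(const char *src, const regmatch_t *m, char *out, size_t outlen)
{
    size_t len;

    if (m->rm_so < 0 || m->rm_eo < m->rm_so)
        return -1;               /* unused group: offsets are -1 */

    len = (size_t)(m->rm_eo - m->rm_so);
    if (len + 1 > outlen)
        return -1;               /* no room for the text plus '\0' */

    memcpy(out, src + m->rm_so, len);
    out[len] = '\0';
    return (int)len;
}
```

With the sample program's variables, copy_submatch(src, &subs[1], matched, sizeof(matched)) would place the title text in matched.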

Note 1: While debugging this code on FreeBSD 6.2, len always printed as 0 even though the matched string printed correctly, which was confusing. The same code behaved normally on Linux. On closer inspection, rm_so turned out to be 32 bits wide on Linux and 64 bits wide on FreeBSD. With %d, the value printed in the len position was actually the high 32 bits of rm_so rather than len. The fix is to print rm_so with a 64-bit conversion such as %lld, casting the value to long long.
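One way to avoid depending on the width of rm_so is to cast before printing. The sketch below assumes a C99 compiler. The helper name format_match is hypothetical, and it formats into a buffer with snprintf() only so that the result is easy to inspect.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not from the original article):
 * formats one regmatch_t as "begin: B, len = L" regardless of how wide
 * regoff_t is on the platform. Returns whatever snprintf() returns. */
int format_match(char *out, size_t outlen, const regmatch_t *m)
{
    /* Casting to long long makes the arguments match %lld everywhere */
    long long begin = (long long)m->rm_so;
    long long len = (long long)(m->rm_eo - m->rm_so);

    return snprintf(out, outlen, "begin: %lld, len = %lld", begin, len);
}
```

Because every argument is widened to the type the format string expects, the output no longer depends on whether the platform defines the offset type as 32 or 64 bits.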

Although regex is simple and easy to use, its regular expression support is limited, and it has problems handling Chinese text. PCRE is therefore introduced next.

2. PCRE (http://www.pcre.org)
As the name indicates, PCRE is Perl compatible, so anyone familiar with Perl or PHP regular expressions will feel at home. PCRE comes with extensive documentation and sample code (pcredemo.c shows the basic usage). The following program simply ports the regex example above to PCRE.

/* Compile thuswise:
 *   gcc -Wall pcre1.c -I/usr/local/include -L/usr/local/lib -R/usr/local/lib -lpcre
 *
 */

#include <stdio.h>
#include <string.h>
#include <pcre.h>

#define OVECCOUNT 30    /* should be a multiple of 3 */
#define EBUFLEN 128
#define BUFLEN 1024

int main()
{
    pcre *re;
    const char *error;
    int erroffset;
    int ovector[OVECCOUNT];
    int rc, i;

    char src[] = "111 <title>Hello World</title> 222";
    char pattern[] = "<title>(.*)</title>";

    printf("string: %s\n", src);
    printf("pattern: \"%s\"\n", pattern);

    re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
    if (re == NULL) {
        printf("PCRE compilation failed at offset %d: %s\n", erroffset, error);
        return 1;
    }

    rc = pcre_exec(re, NULL, src, (int)strlen(src), 0, 0, ovector, OVECCOUNT);
    if (rc < 0) {
        if (rc == PCRE_ERROR_NOMATCH) printf("Sorry, no match ...\n");
        else printf("Matching error %d\n", rc);
        pcre_free(re);
        return 1;
    }

    printf("\nOK, has matched ...\n\n");

    /* ovector holds (start, end) offset pairs: pair 0 is the whole match */
    for (i = 0; i < rc; i++) {
        char *substring_start = src + ovector[2*i];
        int substring_length = ovector[2*i+1] - ovector[2*i];
        printf("%2d: %.*s\n", i, substring_length, substring_start);
    }

    pcre_free(re);   /* memory from pcre_compile() is released with pcre_free() */
    return 0;
}

Execution result

string: 111 <title>Hello World</title> 222
pattern: "<title>(.*)</title>"

OK, has matched ...

 0: <title>Hello World</title>
 1: Hello World

 

Comparing the two examples, regex uses regcomp() and regexec(), while PCRE uses pcre_compile() and pcre_exec(). The usage is almost identical.

pcre_compile() accepts many options; see http://www.pcre.org/pcre.txt for details. For multi-line text, you can set the PCRE_DOTALL option, as in pcre_compile(pattern, PCRE_DOTALL, ...), so that '.' also matches the line-break characters "\r\n".

3. PCRE++
PCRE++ (http://www.daemon.de/PCRE) is a C++ wrapper around PCRE that is more convenient to use.

/*
 * g++ pcre2.cpp -I/usr/local/include -L/usr/local/lib -R/usr/local/lib -lpcre++ -lpcre
 */
#include <string>
#include <iostream>
#include <pcre++.h>

using namespace std;
using namespace pcrepp;

int main()
{
    string src("111 <title>Hello World</title> 222");
    string pattern("<title>(.*)</title>");

    cout << "string: " << src << endl;
    cout << "pattern: " << pattern << endl;

    Pcre reg(pattern, PCRE_DOTALL);
    if (reg.search(src) == true) {
        cout << "\nOK, has matched ...\n\n";
        for (int pos = 0; pos < reg.matches(); pos++) {
            cout << pos << ": " << reg[pos] << endl;
        }
    } else {
        cout << "Sorry, no match ...\n";
        return 1;
    }

    return 0;
}

 

Execution result

string: 111 <title>Hello World</title> 222
pattern: <title>(.*)</title>

OK, has matched ...

0: Hello World

 

4. oniguruma
Another regular expression library is oniguruma (http://www.geocities.jp/kosako3/oniguruma/), written by a Japanese developer. It has better support for East Asian text, originated in Ruby, and can also be used from C++. Most readers will not need it, so it is not covered here. If you have questions about its usage, feel free to discuss them by email.

5. DEELX

DEELX (http://www.regexlab.com/zh/deelx/) is a Perl-compatible regular expression engine for C++, developed as a research project by RegExLab. All of its code lives in a single .h file.

#include "deelx.h"
#include <stdio.h>

int find_remark(const char *string, int &start, int &end)
{
    // declare: matches a /* ... */ comment or a // comment
    static CRegexpT <char> regexp("/\\*((?!\\*/).)*(\\*/)?|//([^\\x0A-\\x0D\\\\]|\\\\.)*");

    // find and match
    MatchResult result = regexp.Match(string);

    // result
    if (result.IsMatched())
    {
        start = result.GetStart();
        end = result.GetEnd();
        return 1;
    }
    else
    {
        return 0;
    }
}

int main(int argc, char *argv[])
{
    const char *code1 = "int a; /* a */";
    const char *code2 = "int a;";

    int start, end;

    if (find_remark(code1, start, end))
        printf("In code1, found: %.*s\n", end - start, code1 + start);
    else
        printf("In code1, not found.\n");

    if (find_remark(code2, start, end))
        printf("In code2, found: %.*s\n", end - start, code2 + start);
    else
        printf("In code2, not found.\n");

    return 0;
}

Execution result

In code1, found: /* a */
In code2, not found.

 

 

 

6. Internal Implementation of Regular Expressions
Regular expression implementations draw heavily on automata theory. If you are interested, the book "Introduction to Automata Theory, Languages, and Computation" is well written, and books on compiler principles also cover the topic.
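To give a feel for the automata view, here is a hand-written deterministic finite automaton (DFA) for the single pattern ab*c, matched against a whole string. It is only an illustrative sketch: the state names and function names are invented for this example, and real engines build such automata (or use backtracking) from the pattern at run time rather than hard-coding them.

```c
/* Illustrative DFA for the pattern ab*c (whole-string match).
 * All names here are hypothetical and chosen for this sketch. */
enum { ST_START, ST_AFTER_A, ST_ACCEPT, ST_DEAD };

/* Transition function: next state for the current state and input character */
static int dfa_step(int state, char c)
{
    switch (state) {
    case ST_START:
        return c == 'a' ? ST_AFTER_A : ST_DEAD;
    case ST_AFTER_A:
        if (c == 'b') return ST_AFTER_A;   /* b* loops on the same state */
        if (c == 'c') return ST_ACCEPT;
        return ST_DEAD;
    default:
        return ST_DEAD;   /* any input after acceptance, or in the dead state */
    }
}

/* Returns 1 if s matches ab*c in full, 0 otherwise */
int dfa_match_ab_star_c(const char *s)
{
    int state = ST_START;

    for (; *s != '\0'; s++)
        state = dfa_step(state, *s);

    return state == ST_ACCEPT;
}
```

Each input character is examined exactly once, which is why automaton-based matching runs in time proportional to the length of the input.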

 

 

 

 
