Many of us use regular expressions every day in tools such as grep, Vim, sed, and awk, perhaps without being familiar with the term. "Regular expression" is usually abbreviated to regex, regexp, or even re. There are many articles about regular expressions themselves, and a search engine will turn up good tutorials, but material on how to use them from C/C++ is scarce. Most C standard libraries include a regex implementation, which you can inspect through /usr/include/regex.h or man regex. Perl, PHP, and other languages provide powerful regular expressions, and the best-known regular expression library for C is PCRE (Perl Compatible Regular Expressions). This article introduces regex and PCRE.
1. RegEx
Using regex is very simple; a look at sample code 1 is enough to understand it. (The sample code is taken from the article "Getting Started with GNU C Regular Expressions".)
Code:
#include <stdio.h>
#include <string.h>
#include <regex.h>
#define SUBSLEN 10      /* number of matched substrings */
#define EBUFLEN 128     /* buffer length of the error message */
#define BUFLEN 1024     /* length of the matched string buffer */
int main(void)
{
    size_t len, i;
    regex_t re;                 /* stores the compiled regular expression; it must be compiled before use */
    regmatch_t subs[SUBSLEN];   /* stores the positions of the matched strings */
    char matched[BUFLEN];       /* stores a matched string */
    char errbuf[EBUFLEN];       /* stores error messages */
    int err;
    char src[] = "111 <title>Hello World</title> 222";  /* source string */
    char pattern[] = "<title>(.*)</title>";             /* pattern string */
    printf("String : %s\n", src);
    printf("Pattern: \"%s\"\n", pattern);
    /* Compile the regular expression */
    err = regcomp(&re, pattern, REG_EXTENDED);
    if (err) {
        regerror(err, &re, errbuf, sizeof(errbuf));
        printf("error: regcomp: %s\n", errbuf);
        return 1;
    }
    printf("Total has subexpression: %zu\n", re.re_nsub);
    /* Execute the pattern match */
    err = regexec(&re, src, (size_t)SUBSLEN, subs, 0);
    if (err == REG_NOMATCH) {   /* no match */
        printf("Sorry, no match ...\n");
        regfree(&re);
        return 0;
    } else if (err) {           /* other errors */
        regerror(err, &re, errbuf, sizeof(errbuf));
        printf("error: regexec: %s\n", errbuf);
        regfree(&re);
        return 1;
    }
    /* Neither REG_NOMATCH nor another error, so the pattern matched */
    printf("\nOK, has matched ...\n");
    for (i = 0; i <= re.re_nsub && i < SUBSLEN; i++) {
        len = (size_t)(subs[i].rm_eo - subs[i].rm_so);
        if (i == 0) {
            /* Note 1: rm_so is cast because its width differs between platforms */
            printf("begin: %lld, len = %zu ", (long long)subs[i].rm_so, len);
        } else {
            printf("subexpression %zu begin: %lld, len = %zu ", i, (long long)subs[i].rm_so, len);
        }
        memcpy(matched, src + subs[i].rm_so, len);
        matched[len] = '\0';
        printf("match: %s\n", matched);
    }
    regfree(&re);   /* release after use */
    return 0;
}
The execution result is
Code:
String : 111 <title>Hello World</title> 222
Pattern: "<title>(.*)</title>"
Total has subexpression: 1
OK, has matched ...
begin: 4, len = 26 match: <title>Hello World</title>
subexpression 1 begin: 11, len = 11 match: Hello World
From the example program, we can see that the pattern is first compiled with regcomp() and then matched with regexec(). If you only need to know whether a match succeeded, these two functions are all you need. Sometimes we also want the text matched by a subexpression, such as the title in the example. To get it, enclose the subexpression in parentheses, as in "<title>(.*)</title>". The engine records the text matched by each parenthesized subexpression, so it can be retrieved separately from the overall match. The example program extracts the title of an HTML page.
regmatch_t subs[SUBSLEN] stores the match positions. subs[0] holds the position of the whole match and subs[1] the position of the first subexpression, so the title in the example can be located through the rm_so and rm_eo members of the structure. Many people overlook this point.
Note 1: The code was first debugged on FreeBSD 6.2, where len was always printed as 0 even though the matched string was printed correctly. This was confusing, because the same code behaved normally on Linux. A careful check showed that rm_so is 32 bits wide on Linux but 64 bits wide on FreeBSD. When rm_so is printed with %d, the following %d picks up the upper 32 bits of rm_so instead of the real len. The fix is to print rm_so with a 64-bit format such as %lld (with a cast to long long) instead of %d.
Although regex is simple and easy to use, its regular expression support is not powerful enough, and it also has problems handling Chinese text. PCRE is therefore introduced next.
2. PCRE (http://www.pcre.org)
As the name indicates, PCRE is Perl compatible, so people familiar with regular expressions in Perl or PHP will have no trouble with it. PCRE comes with extensive documentation and sample code (see pcredemo.c for the basic usage). The following program simply ports the regex example above to PCRE.
Code:
/* Compile thuswise:
* gcc -Wall pcre1.c -I/usr/local/include -L/usr/local/lib -R/usr/local/lib -lpcre
*
*/
#include <stdio.h>
#include <string.h>
#include <pcre.h>
#define OVECCOUNT 30 /* should be a multiple of 3 */
#define EBUFLEN 128
#define BUFLEN 1024
int main(void)
{
    pcre *re;
    const char *error;
    int erroffset;
    int ovector[OVECCOUNT];     /* receives pairs of start/end offsets */
    int rc, i;
    char src[] = "111 <title>Hello World</title> 222";
    char pattern[] = "<title>(.*)</title>";
    printf("String : %s\n", src);
    printf("Pattern: \"%s\"\n", pattern);
    re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
    if (re == NULL) {
        printf("PCRE compilation failed at offset %d: %s\n", erroffset, error);
        return 1;
    }
    rc = pcre_exec(re, NULL, src, (int)strlen(src), 0, 0, ovector, OVECCOUNT);
    if (rc < 0) {
        if (rc == PCRE_ERROR_NOMATCH) printf("Sorry, no match ...\n");
        else printf("Matching error %d\n", rc);
        pcre_free(re);  /* compiled patterns are released with pcre_free() */
        return 1;
    }
    printf("\nOK, has matched ...\n\n");
    for (i = 0; i < rc; i++) {
        char *substring_start = src + ovector[2*i];
        int substring_length = ovector[2*i+1] - ovector[2*i];
        printf("%2d: %.*s\n", i, substring_length, substring_start);
    }
    pcre_free(re);
    return 0;
}
The execution result is:
Code:
String : 111 <title>Hello World</title> 222
Pattern: "<title>(.*)</title>"
OK, has matched ...
0: <title>Hello World</title>
1: Hello World
Comparing the two examples, we can see that regex uses regcomp() and regexec(), while PCRE uses pcre_compile() and pcre_exec() in almost the same way.
pcre_compile() has many options; for details, see http://www.pcre.org/pcre.txt. For multi-line text, you can set the PCRE_DOTALL option, as in pcre_compile(pattern, PCRE_DOTALL, ...), so that '.' also matches newline characters.
3. PCRE++
PCRE++ (http://www.daemon.de/PCRE) is a C++ wrapper around PCRE that is more convenient to use.
Code:
/*
* g++ pcre2.cpp -I/usr/local/include -L/usr/local/lib -R/usr/local/lib -lpcre++ -lpcre
*/
#include <string>
#include <iostream>
#include <pcre++.h>
using namespace std;
using namespace pcrepp;
int main()
{
    string src("111 <title>Hello World</title> 222");
    string pattern("<title>(.*)</title>");
    cout << "String : " << src << endl;
    cout << "Pattern : " << pattern << endl;
    Pcre reg(pattern, PCRE_DOTALL);
    if (reg.search(src)) {
        cout << "\nOK, has matched ...\n\n";
        // reg[pos] returns the captured substrings, starting at index 0
        for (int pos = 0; pos < reg.matches(); pos++) {
            cout << pos << ": " << reg[pos] << endl;
        }
    } else {
        cout << "Sorry, no match ...\n";
        return 1;
    }
    return 0;
}
The execution result is:
Code:
String : 111 <title>Hello World</title> 222
Pattern : <title>(.*)</title>
OK, has matched ...
0: Hello World
4. Oniguruma
There is also a regular expression library called Oniguruma (http://www.geocities.jp/kosako3/oniguruma/), written by a Japanese developer. It has better support for East Asian text, originated in Ruby, and can also be used from C++. Since most readers are unlikely to need it, it is not covered here. If you have any questions about its usage, feel free to discuss them by email.
5. Internal Implementation of Regular Expressions
The implementation of regular expressions draws heavily on automata theory. If you are interested, the book "Introduction to Automata Theory, Languages, and Computation" is well written, and books on compiler principles also cover this topic.