Title: A simple manual for using regular expressions on Mac OS X
Author: glider, 14:49:18, 2003.8.25
A concise manual written by the author. All programs have been compiled and tested. If any mistakes remain, they are oversights that I have not yet corrected.
<H1 align='center'>Using regular expressions in C</H1>
<H2>1. What is a regular expression</H2>
Regular expressions are an important and efficient way of searching strings on UNIX systems. They search text for strings according to rules that the user specifies. They are available in many UNIX tools (sed, grep, find, etc.) and scripting languages (awk, Perl, etc.).
As the simplest example, to list all files whose names start with 'g' on UNIX, run the command 'ls g*'. Here 'g*' stands for every file name that starts with the letter g followed by a string of any length. Strictly speaking this is a shell wildcard pattern rather than a regular expression (in a regular expression, '*' repeats the preceding item), but it conveys the same idea of describing a set of strings with a rule.
When writing applications that involve string searching on UNIX, being familiar with regular expressions lets you achieve much more with much less effort.
<H2>2. Regular expression definition</H2>
A regular expression is built from a set of rule characters that can be combined into the search rule we need. These characters include:
Character    Meaning    Example
*    Zero or more repetitions of the preceding item    a* matches the empty string, a, aa, aaa, ...
?    Zero or one occurrence of the preceding item    a? matches a and the empty string
+    One or more repetitions of the preceding item    a+ matches a, aa, aaa, aaaa, ...
.    Any single character    ab. matches ab followed by any one character
{}    Number of repetitions of the preceding item    a{3} matches three a's, that is, aaa; a{1,3} matches one to three a's, that is, a, aa, or aaa; a{3,} matches three or more a's
[]    Set: any one of the characters inside the square brackets; '-' can be used as a range symbol inside the set    [a-z] matches any character from 'a' to 'z'; [abc] matches a, b, or c
()    Group: treats the enclosed expression as a single unit    (abc){2} matches abcabc; without the group, abc{2} matches abcc
A/B    The string matches rule A only if it is immediately followed by a string that matches rule B (trailing context as used in tools such as lex; the POSIX functions described later treat '/' as an ordinary character)    abc/def matches abc only when it is followed by def
A|B    Alternation: a string that matches either rule A or rule B is accepted    ab|cd matches either ab or cd
^    At the beginning of a rule, the match must start at the beginning of the string (elsewhere in a basic regular expression it stands for the '^' character itself); at the beginning of [], it negates the set (elsewhere inside [] it stands for the '^' character itself)    ^(abc) matches abc only at the start of the string; [^abc] matches any character other than 'a', 'b', and 'c'
$    At the end of a rule, the match must end at the end of the string (elsewhere in a basic regular expression it stands for the '$' character itself)    abc$ matches abc only at the end of the string
Regular expressions also define a number of built-in character classes for developers to use inside a set. These include [:alnum:] [:cntrl:] [:lower:] [:space:] [:alpha:] [:digit:] [:print:] [:upper:] [:blank:] [:graph:] [:punct:] [:xdigit:]. For details, refer to man regex.
<H2>3. How to use regular expressions in C</H2>
The following describes how to use regular expressions through the regex functions in the POSIX function library.
int regcomp(regex_t *reg, const char *pattern, int cflags);
This function compiles a regular expression. A regular expression rule must be compiled into a specific data structure before it can be used by the other functions.
Parameter 1: receives the compiled regular expression data structure.
Parameter 2: the regular expression string.
Parameter 3: the compile flags.
The compile flags control how the rule is compiled. For example, REG_EXTENDED selects the extended regular expression syntax (by default the system compiles the rule in basic mode), REG_ICASE makes the match case-insensitive, and REG_NOSUB means that we only want to know whether the string contains a substring that satisfies the rule, without needing to know its position.
int regexec(const regex_t *reg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);
This function matches the compiled regular expression (reg) against a specific string (the string parameter). The match results are stored in the pmatch array. If REG_NOSUB was used in regcomp, the nmatch and pmatch parameters are ignored (pass nmatch = 0 and pmatch = NULL). regmatch_t is a structure with only two fields. Its definition is as follows:
typedef struct {
    regoff_t rm_so;
    regoff_t rm_eo;
} regmatch_t;
rm_so: the offset in the string at which the substring that satisfies the rule starts;
rm_eo: the offset in the string of the first character after the end of that substring.
For example, if we search the string 'this is a test string for regex functions.' for the substring 'for', rm_so points to the 'f' of 'for', while rm_eo points to the space after 'for'.
To obtain the positions of the matches, we pass in an array of regmatch_t. pmatch[0] holds the position of the substring that satisfies the whole regular expression, and the following elements hold the positions of the groups (the parts enclosed in '()') in the regular expression.
For example, the regular expression '([abc]+)([de])' describes one or more of the characters 'a', 'b', or 'c', followed by the character 'd' or 'e'.
Apply it to the test string "dddaabcde". Only the substring aabcd satisfies the rule.
There are two groups in the rule, so an array of three elements (regmatch_t pmatch[3]) is required to store the results.
pmatch[0] stores the position of 'aabcd', pmatch[1] stores the position of the substring 'aabc' (which satisfies the first group, [abc]+), and pmatch[2] stores the position of the substring 'd' (which satisfies the second group, [de]).
eflags controls how regexec searches. When a text is very large, we may want to search it one piece at a time. In that case we can use eflags to tell regexec that the start of the string passed in is not the beginning of a line (REG_NOTBOL) or that its end is not the end of a line (REG_NOTEOL). This affects rules that contain the characters '^' and '$'.
size_t regerror(int errcode, const regex_t *reg, char *errbuf, size_t errbuf_size);
This function converts the return code of the two functions above into the corresponding error string, which can be printed on the screen to make the error easier to understand.
void regfree(regex_t *reg);
When we have finished with a compiled regular expression, we need to call this function to release the corresponding data structure.
<H2>4. Examples</H2>
The following two examples illustrate how to use these functions to match strings.
Example 1: check whether a URL string entered by the user meets our requirements. The URL may or may not start with 'www.', it must end with .com, .com.cn, or .org, and the part in between must consist of one or more letters, digits, underscores, or dashes.
From these requirements we get the pattern "^(www\.)?([a-zA-Z0-9_-]+)(\.com\.cn|\.com|\.org)$" (the dots are escaped so that they match a literal '.'; inside a C string literal each backslash is doubled). We do not care about the case of the URL, so the flags need to include REG_ICASE, and we do not need to know the position of the match, so we also include REG_NOSUB.
#include <stdio.h>
#include <sys/types.h>
#include <regex.h>

/* Check whether the URL meets our requirements.
 * Returns 1 if it matches, 0 if it does not match, -1 on error. */
int match(const char *url) {
    const char *pattern = "^(www\\.)?([a-zA-Z0-9_-]+)(\\.com\\.cn|\\.com|\\.org)$";
    int rtn;
    regex_t reg;

    rtn = regcomp(&reg, pattern, REG_NOSUB | REG_EXTENDED | REG_ICASE);
    if (rtn) {
        fprintf(stderr, "compile regular expression failed!\n");
        return -1;
    }
    rtn = regexec(&reg, url, 0, NULL, 0);
    if (rtn == REG_NOMATCH)
        rtn = 0;
    else if (rtn == 0)
        rtn = 1;
    else
        rtn = -1;
    regfree(&reg);
    return rtn;
}

int main(int argc, char *argv[]) {
    int rtn;

    if (argc != 2) {
        fprintf(stderr, "Usage: chkurl <URL string>\n");
        return 1;
    }
    rtn = match(argv[1]);
    if (rtn == 1)
        fprintf(stderr, "URL matched.\n");
    else if (rtn == 0)
        fprintf(stderr, "URL not matched.\n");
    else
        fprintf(stderr, "execute regular expression failed!\n");
    /* Exit status 0 only when the URL matched. */
    return rtn == 1 ? 0 : 1;
}
Save the program as chkurl.c, build it, and run chkurl to check whether the argument passed in is an acceptable URL:
> make chkurl
> lf
> chkurl* chkurl.c
>
> chkurl "www.easycon.com.cn"
> URL matched.
> chkurl "www.gnete.com"
> URL matched.
> chkurl "www.gnu.org"
> URL matched.
> chkurl "sina.com.cn"
> URL matched.
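Note: the REG_EXTENDED flag must be included when calling regcomp(). Otherwise the pattern is compiled as a basic regular expression, in which characters such as '?', '+', '|', '(' and ')' are not operators, and the URLs above will not match.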
Example 2: print all URL strings in a file.
We still use the rule above (without the '^' and '$' anchors). To show more of the regex functions, we search the text one line at a time.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

/* Print every URL found in one line of text. */
int chk_line(int lineno, regex_t *reg, char *line) {
    int rtn, len;
    regmatch_t pmatch;
    char *url, *pbuf;

    fprintf(stderr, "%4d: ", lineno);
    pbuf = line;
    rtn = regexec(reg, pbuf, 1, &pmatch, 0);
    while (rtn == 0) {
        /* Copy the matched substring into a NUL-terminated buffer. */
        len = pmatch.rm_eo - pmatch.rm_so;
        url = (char *)malloc(len + 1);
        if (url == NULL)
            return -1;
        memcpy(url, &pbuf[pmatch.rm_so], len);
        url[len] = '\0';
        fprintf(stderr, "%s ", url);
        free(url);
        /* Continue after this match; we are no longer at the start of the line. */
        pbuf += pmatch.rm_eo;
        rtn = regexec(reg, pbuf, 1, &pmatch, REG_NOTBOL);
    }
    fprintf(stderr, "\n");
    return 0;
}

/* Search every line of the file for URLs. */
int chk_file(const char *filename) {
    FILE *fp;
    const char *pattern = "(www\\.)?([a-zA-Z0-9_-]+)(\\.com\\.cn|\\.com|\\.org)";
    char line[1024];
    int rtn, lineno;
    regex_t reg;

    fp = fopen(filename, "r");
    if (fp == NULL) {
        fprintf(stderr, "Open file '%s' failed!\n", filename);
        return -1;
    }
    rtn = regcomp(&reg, pattern, REG_ICASE | REG_EXTENDED);
    if (rtn) {
        fprintf(stderr, "compile failed.\n");
        fclose(fp);
        return -1;
    }
    lineno = 1;
    memset(line, 0, sizeof(line));
    while (fgets(line, sizeof(line), fp) != NULL)
        chk_line(lineno++, &reg, line);
    fclose(fp);
    regfree(&reg);
    return 0;
}

int main(int argc, char *argv[]) {
    int rtn;

    if (argc != 2) {
        fprintf(stderr, "Usage: chkfileurl <file>\n");
        return 1;
    }
    rtn = chk_file(argv[1]);
    return rtn;
}
Save the file as chkfileurl.c.
> make chkfileurl
> lf
> chkfileurl* chkfileurl.c
> chkfileurl url.txt
> 1: www.sinomac.com
> 2: www.w.zgc.com www.pconline.com.cn
> 3:
> 4: www.gnu.org zhnx.com.cn