Design and Implementation of a Configuration File for a Keyword-Matching Text Filtering System (C/C++ Source Code)

The original link: http://blog.csdn.net/liigo/archive/2009/10/29/4744700.aspx

Author: liigo, 2009/10/29

Please credit the source when reprinting: http://blog.csdn.net/liigo

 

Suppose you have a text filtering system based on keyword matching, or something similar. It needs a configuration file that lists the keywords to filter. How should such a file be designed, and how should the code that reads it be implemented? This article offers one workable solution. It is another entry in my (liigo's) "reinventing the wheel" series.

Because this is a small application, my requirements for the configuration file are that it be simple, intuitive, and easy to implement, while remaining flexible and extensible enough.

My design for the configuration file is as follows:

The file is plain text and is processed line by line;
Lines may be separated by any combination of \r and \n, so the line-break conventions of Windows, Linux, UNIX, and Mac are all supported;
A line that begins with "text:" or "literal:" is followed by one keyword constant, or by several keyword constants separated by commas;
A line that begins with "RegEx:" is followed by the text of a regular expression;
Any number of whitespace characters may follow ":" or ","; they are skipped automatically during parsing;
Where there is no ambiguity, the "text:" or "literal:" prefix at the beginning of a line may be omitted.

Based on this design, the following is an example of a valid configuration file:

    Hello
    love, home, Java, C++
    literal: A, B, C,
    RegEx: [A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}
    by liigo

The configuration file is simple, uses no special syntax, and its rules are intuitive.

Next, let's look at how the code parses this file.

First, read the file into memory and append '\0' so that the buffer is a valid C string. Then traverse the text from beginning to end. On reaching a carriage return or line feed, pause and overwrite that character with '\0', which yields one line of text (the variable line always points to the beginning of the current line). Then continue forward, skipping any consecutive carriage returns or line feeds, to reach the next line. Repeating this loop yields each line in turn, and each line is handed to a separate function (described next) for processing. When the traversal reaches the end of the buffer, the last line is processed as well if it is not empty, whether or not it ended with a carriage return or line feed.
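The in-place line splitting described above can be sketched roughly as follows. This is only an illustration under assumptions: it uses std::vector in place of the article's bufferedmem class, the function name split_lines_in_place is hypothetical, and it collects line pointers instead of handing each line to a parsing function.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the article): splits a text buffer into
// lines in place. The '\r' or '\n' that ends a line is overwritten with
// '\0', and the returned pointers refer directly into `buf`, so no line
// text is copied. The pointers stay valid only while `buf` is not resized.
std::vector<char*> split_lines_in_place(std::vector<char>& buf)
{
    std::vector<char*> lines;
    if (buf.empty() || buf.back() != '\0')
        buf.push_back('\0');              // make the buffer a valid C string

    char* p = buf.data();
    char* line = p;                       // start of the current line
    while (*p != '\0')
    {
        if (*p == '\r' || *p == '\n')
        {
            *p++ = '\0';                  // terminate the current line
            if (*line != '\0')
                lines.push_back(line);    // skip empty lines
            while (*p == '\r' || *p == '\n')
                ++p;                      // skip the rest of the line break
            line = p;
        }
        else
            ++p;
    }
    if (p > line)
        lines.push_back(line);            // last line without a trailing break
    return lines;
}
```

Because every returned pointer refers into the original buffer, the caller has to keep that buffer alive and unmodified for as long as the lines are in use.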

Next, write another function to process each line of the file. It examines the beginning of the line: if the line starts with "RegEx:", the text that follows is treated as a regular expression and recorded; if it starts with "text:" or "literal:", or has no such prefix, it is treated as keyword constants. Keyword constants come in two forms, comma-separated and not comma-separated; the latter is a special case of the former, so both can be handled uniformly. The method resembles the line splitting above: traverse the line character by character looking for commas; each time a comma is found, overwrite it with '\0', which yields one keyword constant to record. Repeat until the end of the line.
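The per-line handling described above might look like the following sketch. The names ParsedRules, has_prefix, skip_blanks, and parse_line are hypothetical, std::vector stands in for the article's bufferedmem class, and regular expressions are only recorded as text here, not compiled. Unlike the article's stated assumption, this sketch simply ignores empty keywords.

```cpp
#include <cctype>
#include <cstring>
#include <vector>

// Hypothetical result holder (the article's code uses bufferedmem instead).
struct ParsedRules {
    std::vector<char*> names;     // keyword constants, pointing into the line
    std::vector<char*> patterns;  // regular-expression texts, not compiled here
};

static bool has_prefix(const char* s, const char* prefix)
{
    return std::strncmp(s, prefix, std::strlen(prefix)) == 0;
}

static char* skip_blanks(char* p)
{
    while (std::isspace(static_cast<unsigned char>(*p))) ++p;
    return p;
}

// Classifies one NUL-terminated line by its prefix and records its content.
// Commas are overwritten with '\0', so no keyword text is copied.
void parse_line(char* line, ParsedRules& out)
{
    if (has_prefix(line, "RegEx:")) {
        out.patterns.push_back(skip_blanks(line + 6));
        return;
    }
    if (has_prefix(line, "text:"))         line += 5;
    else if (has_prefix(line, "literal:")) line += 8;

    char* name = skip_blanks(line);
    for (char* p = name; ; ++p) {
        if (*p == ',' || *p == '\0') {
            bool last = (*p == '\0');
            *p = '\0';                        // end the current keyword
            if (*name != '\0')
                out.names.push_back(name);    // ignore empty items
            if (last) break;
            name = skip_blanks(p + 1);        // start of the next keyword
            p = name - 1;                     // the loop's ++p lands on `name`
        }
    }
}
```

A line without commas goes through the same loop and yields a single keyword, which is the "special case" handling mentioned in the text.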

When parsing is complete, there are two arrays: one of pointers to keyword constant texts, and one of pointers to regular-expression objects used for matching. To filter text later, just traverse the two arrays and check each keyword and each regular expression for a match. To improve efficiency further, the keyword constants can be placed in a hash table or another fast-lookup container before filtering.

This parsing method does not split the text into separate strings, so it avoids copying substrings and allocating and freeing memory.

The complete parsing code (C/C++) is given below:
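As a rough illustration of the filtering step, the sketch below walks both collections and reports whether a text matches anything. It rests on assumptions that are not in the article: the function name is_filtered is hypothetical, keywords are matched as substrings, and std::regex is swapped in for the regular-expression class used by the article's code.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical filter check (not from the article): returns true if `text`
// contains any keyword as a substring or matches any regular expression.
// std::regex stands in for the article's own regex class.
bool is_filtered(const std::string& text,
                 const std::vector<std::string>& keywords,
                 const std::vector<std::regex>& patterns)
{
    for (const std::string& kw : keywords)
        if (!kw.empty() && text.find(kw) != std::string::npos)
            return true;                  // a keyword occurs in the text
    for (const std::regex& re : patterns)
        if (std::regex_search(text, re))
            return true;                  // a pattern matches part of the text
    return false;
}
```

A hash table such as std::unordered_set would fit if the text were first split into tokens and each token looked up exactly; with substring matching, as assumed here, a linear scan over the keywords is the simplest approach.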

    #include <ctype.h>   // isspace
    #include <string.h>  // strstr

    // bufferedmem and cregexpt are defined elsewhere and are not shown in this article.
    static bool parselineofsymfile(char* line, bufferedmem& names, bufferedmem& regexs);

    /*
    The file is plain text and is processed line by line (lines may be separated by any combination of \r and \n).
    A line beginning with "text:" or "literal:" is followed by one name, or by several names separated by commas;
    a line beginning with "RegEx:" is followed by the text of a regular expression.
    Any number of whitespace characters may follow ":" or ","; they are skipped automatically during parsing.
    Where there is no ambiguity, the "text:" or "literal:" prefix at the beginning of a line may be omitted.
    by liigo, 2009/10/29
    */
    static bool parsesymfile(bufferedmem& filedata, bufferedmem& names, bufferedmem& regexs)
    {
        filedata.appendchar('\0');
        char* p = (char*) filedata.getdata();
        char* line = p;
        char c;

        while ((c = *p) != '\0')
        {
            if (c == '\r' || c == '\n')
            {
                *p = '\0';
                parselineofsymfile(line, names, regexs);
                p++;
                while (*p == '\r' || *p == '\n') p++;
                line = p;
            }
            else
                p++;
        }

        if (p > line)
            parselineofsymfile(line, names, regexs);

        return true;
    }

    static bool parselineofsymfile(char* line, bufferedmem& names, bufferedmem& regexs)
    {
        // printf("line: %s\r\n", line);
        char* p = line;

        if (strstr(line, "RegEx:") == line)
        {
            line += 6 /* strlen("RegEx:") */;
            while (isspace((unsigned char) *line)) line++;
            cregexpt<char>* pregex = new cregexpt<char>(line, 0);
            regexs.appendpointer(pregex);
            return true;
        }

        if (strstr(line, "text:") == line)
            line += 5 /* strlen("text:") */;
        else if (strstr(line, "literal:") == line)
            line += 8 /* strlen("literal:") */;

        while (isspace((unsigned char) *line)) line++;

        char* name = line;
        p = line;
        while (*p)
        {
            if (*p == ',')
            {
                *p = '\0';
                names.appendpointer(name);
                p++;
                // Skip any further commas and blanks. Stopping here, instead of
                // advancing once more, keeps p from stepping past the
                // terminating '\0' when the line ends with ','.
                while (*p == ',' || isspace((unsigned char) *p)) p++;
                name = p;
            }
            else
                p++;
        }
        if (p > name)
            names.appendpointer(name);

        return true;
    }

I finished the code above only today. It has passed preliminary testing, but it has not been through complete, rigorous unit testing, so it may contain bugs or defects; corrections are welcome. The bufferedmem class used in the code is not part of the core parsing algorithm, so its code is not included. Also, because this parser was written for one specific application, it is not fully general (for example, I assume that regular expressions do not begin with whitespace and that keywords are never empty).
