My own research experience on using regular expressions in Delphi

Source: Internet
Author: User

If you use regular expressions in Delphi, PerlRegEx should be your first choice. Here are some notes from my own experience.

http://www.regular-expressions.info/delphi.html
Download: http://www.regular-expressions.info/download/TPerlRegEx.zip

The installation steps below use the Chinese edition of Delphi 7 as an example:

1. Extract TPerlRegEx, open Delphi, and create a new application. (This article comes from the [enlightening forum], http://7di.Net; please credit the source when reprinting.)

2. Click "Project" => "Add to Project", browse to the extracted files, and double-click the "PerlRegEx" unit file. This adds the unit to the project so that we can use it directly.

3. Add PerlRegEx to the uses clause.

4. Press F12 to switch to the form designer. On the Events tab of the Object Inspector, find "OnCreate" and double-click it.

5. Write the form's event handler as follows:
procedure TForm1.FormCreate(Sender: TObject);
var
  Reg: TPerlRegEx; // declare a regular expression object
begin
  Reg := TPerlRegEx.Create(nil); // create it without an owner
  try
    Reg.Subject := 'sssssss'; // the source string to run the replacement on
    Reg.RegEx := 's'; // the expression; here simply the substring to replace
    Reg.Replacement := '|'; // the new string to put in its place
    Reg.ReplaceAll; // replace every match
    ShowMessage(Reg.Subject); // shows the replacement result: |||||||
  finally
    FreeAndNil(Reg); // nil was passed as the owner at creation, so we must free the object ourselves
  end;
end;

6. Press "F9" to run. OK.

Once you have learned to apply this flexibly, you will be able to develop powerful programs.

====================================================================================================
Key Points of using and writing regular expressions

Regular Expressions are usually used for verifying strings, extracting information, and processing strings.

When used for validation, the check usually applies to the text as a whole. For example, we usually want to know whether the text is a valid email address, not whether it contains one. So put the start-of-line anchor ^ at the beginning of the regular expression and the end-of-line anchor $ at the end. If the text being validated may have spaces at either end, place ' *' or '\s*' (which also matches TAB) right after the ^ and right before the $.
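
The difference between "contains" and "is" can be sketched in a small console program. This is only an illustration, assuming the same TPerlRegEx release as in the installation example (constructor taking an owner); the helper name Matches and the digit patterns are made up for the demonstration.

```pascal
program AnchorDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, PerlRegEx;

// Hypothetical helper: True if Pattern matches anywhere in Text.
function Matches(const Pattern, Text: string): Boolean;
var
  Reg: TPerlRegEx;
begin
  Reg := TPerlRegEx.Create(nil);
  try
    Reg.RegEx := Pattern;
    Reg.Subject := Text;
    Result := Reg.Match;
  finally
    Reg.Free;
  end;
end;

begin
  // Unanchored: succeeds, because the text merely contains digits.
  WriteLn(Matches('\d+', 'abc 123 def'));
  // Anchored: fails, because the text as a whole is not a number.
  WriteLn(Matches('^\s*\d+\s*$', 'abc 123 def'));
  // Anchored, with \s* absorbing the spaces at both ends: succeeds.
  WriteLn(Matches('^\s*\d+\s*$', '  123  '));
end.
```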

The key to designing a regular expression is segmentation: divide the content to be matched into segments that are easy to handle. The text usually offers several cues for where to split, such as logical units, repeating patterns, or characters (strings) that cannot occur twice in a row.

Take the design of a regular expression that validates numeric input as an example. First list some inputs that should be accepted:
1234 / 12.34 / -12.34 / 12.3e4 / 12.3e-4 / .12E-34
The logical units are evidently: sign, integer part, decimal point, fractional part, e (or E), exponent sign, exponent part.
The cues for segmentation are:

Sign: may appear once at the start and once after E.
Decimal point: may appear only once; if it appears, it must be followed by a fractional part.
E: may appear only once; if it appears, it must be followed by an exponent part.

Every "if it appears, it must be followed by ..." rule can be treated as a group. This gives a first draft: [+\-]?\d*(\.\d+)?([Ee][+\-]*\d+)?

But this design has a problem: the first half, \d*(\.\d+)?, can match an empty string. The actual requirement is that if there is an integer part, the fractional part is optional, and if there is no integer part, there must be a fractional part. The intuitive fix is to change it to (\d+(\.\d+)?|\.\d+). On closer inspection, both alternatives end in \d+, and we do not actually care whether that \d+ matches the integer part or the fractional part; the digits before the decimal point and the decimal point itself are both optional. So this part can be rewritten as \d*\.?\d+

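So the final validation expression is: ^ *[+\-]?\d*\.?\d+([Ee][+\-]*\d+)? *$ (Note that [+\-]* lets the exponent carry more than one sign; write [+\-]? there to allow at most one, as the rule for the sign requires.)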
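
As a rough check of the final expression, the sketch below wraps it in a validation function. It assumes the same TPerlRegEx release as in the installation example; the function name IsNumber is hypothetical, and the exponent sign is written as [+\-]? (at most one sign), which is slightly stricter than the expression in the text.

```pascal
program NumberCheck;

{$APPTYPE CONSOLE}

uses
  SysUtils, PerlRegEx;

const
  // Final expression from the text, with [+\-]? for the exponent sign (assumption).
  NumberPattern = '^ *[+\-]?\d*\.?\d+([Ee][+\-]?\d+)? *$';

// Hypothetical helper: True if Text as a whole is a number.
function IsNumber(const Text: string): Boolean;
var
  Reg: TPerlRegEx;
begin
  Reg := TPerlRegEx.Create(nil);
  try
    Reg.RegEx := NumberPattern;
    Reg.Subject := Text;
    Result := Reg.Match;
  finally
    Reg.Free;
  end;
end;

begin
  WriteLn(IsNumber('1234'));     // accepted: integer part only
  WriteLn(IsNumber('-12.34'));   // accepted: sign, integer and fractional part
  WriteLn(IsNumber('.12E-34'));  // accepted: no integer part, signed exponent
  WriteLn(IsNumber('12.'));      // rejected: decimal point without a fractional part
  WriteLn(IsNumber('12.3e'));    // rejected: E without an exponent part
  WriteLn(IsNumber(''));         // rejected: the empty string is no longer matched
end.
```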

Extracting information with regular expressions is harder, but it is also where their power really shows. An expression used for extraction has to satisfy several requirements: no false positives (so it needs some ability to check syntax), no missed matches, and sub-matches that land on exactly the right substrings. Texts with complex or repeating structure may need several passes, or a loop in the host programming language. I still find the specific techniques hard to pin down, so here are just a few examples:

1. Find and parse chained inequalities such as XX1>XX2<XX3<=XX4=XX5, where each XXn is any string that contains none of >, <, =, !, space, or TAB. Spaces or TABs are allowed between the operators and the operands.

Since the inequality may be embedded in surrounding text, we first need to isolate the syntactically valid inequality from the text; otherwise the loop described below would run on into part of the next inequality. In this example that is still simple: search for '\b[^<>=!\s]+((inequality operator)[^<>=!\s]+)+\b'. However, this only matches the inequality as a whole; the pieces inside it cannot be extracted separately, because when (...)+ appears, the corresponding captured substring holds only the last repetition that matched.
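
The "last repetition only" behaviour is easy to see in a tiny experiment. This is a sketch with a simplified operator class [<>] instead of the full operator list, and it uses the MatchedExpression and SubExpressions names of the older TPerlRegEx release that matches Create(nil); newer releases are believed to call them MatchedText and Groups.

```pascal
program LastRepetition;

{$APPTYPE CONSOLE}

uses
  SysUtils, PerlRegEx;

var
  Reg: TPerlRegEx;
begin
  Reg := TPerlRegEx.Create(nil);
  try
    Reg.Subject := 'a>b<c';
    // Group 1 is the repeated (operator + operand) pair, group 2 the operator.
    Reg.RegEx := '[^<>=!\s]+(([<>])[^<>=!\s]+)+';
    if Reg.Match then
    begin
      WriteLn(Reg.MatchedExpression);  // the whole inequality: a>b<c
      WriteLn(Reg.SubExpressions[1]);  // only the last repetition: <c
      WriteLn(Reg.SubExpressions[2]);  // only the last operator: <
    end;
  finally
    Reg.Free;
  end;
end.
```

The first pair (>b) is matched but not retained, which is why the individual pieces have to be pulled out in a second step.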

For each match, we therefore need a second round of extraction on the matched text, and that second round uses a loop.

Getting the first operand is easy: simply use '^\s*([^<>=!\s]+)'. Note that ^\s* strips any unwanted spaces and TABs at the beginning. The actual XX1 is in captured substring 1.

Then analyze '((inequality operator)[^<>=!\s]+)+'. The whole (...)+ structure has to be unrolled by a loop in the host programming language. First list the valid operators: >=, >, =, <=, <, <>, ==, !>, !<, !>=, !<=, !=. So the operator must be (!?>=?|!?<=?|!?==?|<>).

Therefore, to analyze the text, you need to:
1) Use expression A, '\b[^<>=!\s]+(\s*(!?>=?|!?<=?|!?==?|<>)\s*[^<>=!\s]+)+\b', to find an inequality.
2) Use expression B, '^\s*([^<>=!\s]+)', on that inequality and record captured substring 1 as a variable name.
3) Then match B with '\s*(!?>=?|!?<=?|!?==?|<>)\s*([^<>=!\s]+)' and record captured substring 1 as the operator and captured substring 2 as a variable name.
4) Repeat from 3) until B finds no further match.
5) Repeat from 1) until A finds no further match.
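
The five steps can be sketched as a console program. This is one possible arrangement rather than the author's code: it uses three separate TPerlRegEx objects (Finder for A, First and Pair for the two B expressions) instead of reusing one object, makes the outer groups non-capturing, and runs on a made-up subject string. The MatchedExpression and SubExpressions names belong to the older TPerlRegEx release that matches Create(nil).

```pascal
program ParseInequalities;

{$APPTYPE CONSOLE}

uses
  SysUtils, PerlRegEx;

const
  Operand  = '[^<>=!\s]+';
  Relation = '!?>=?|!?<=?|!?==?|<>';

var
  Finder, First, Pair: TPerlRegEx;
begin
  Finder := TPerlRegEx.Create(nil);
  First := TPerlRegEx.Create(nil);
  Pair := TPerlRegEx.Create(nil);
  try
    // A: a whole chained inequality.
    Finder.RegEx := '\b' + Operand + '(?:\s*(?:' + Relation + ')\s*' + Operand + ')+\b';
    // B, step 2: the first operand.
    First.RegEx := '^\s*(' + Operand + ')';
    // B, step 3: one operator followed by one operand.
    Pair.RegEx := '\s*(' + Relation + ')\s*(' + Operand + ')';

    Finder.Subject := 'if a1 >= b2 < c3 then x != y'; // made-up sample text
    if Finder.Match then
      repeat
        WriteLn('Inequality: ', Finder.MatchedExpression);
        First.Subject := Finder.MatchedExpression;
        Pair.Subject := Finder.MatchedExpression;
        if First.Match then
          WriteLn('  operand:  ', First.SubExpressions[1]);
        // Step 4: loop over every operator/operand pair.
        if Pair.Match then
          repeat
            WriteLn('  operator: ', Pair.SubExpressions[1]);
            WriteLn('  operand:  ', Pair.SubExpressions[2]);
          until not Pair.MatchAgain;
      // Step 5: move on to the next inequality in the text.
      until not Finder.MatchAgain;
  finally
    Pair.Free;
    First.Free;
    Finder.Free;
  end;
end.
```

For the sample text this should report two inequalities, a1 >= b2 < c3 and x != y, each broken into its operands and operators. The pair expression can safely be searched from the start of the inequality because an operand can never contain an operator character.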

2. Extracting phone numbers

This comes from a small program I wrote for a friend not long ago. His company needs to search the Internet for Australian wine companies to partner with, and his job is to enter the contact details of the wine companies he finds online into a database. The database stores the company name, address, phone number, fax number, email address, and so on in separate fields, so he has to select each item on the web page precisely with the mouse and paste it into the data table. This is not only inefficient; he also says that selecting so precisely with the mouse makes his wrist ache. So he asked me to write a small program: he copies the whole block of contact information, and the program extracts the individual items automatically. The design of the phone number extraction is described below:

I looked at typical contact information and found that it sometimes lists several phone numbers, while my friend's database records only one. So I decided to extract every phone number from the selected text and list them in a ComboBox, with the numbers labelled "Phone", "Ph", or "P" in the text listed first.

Phone numbers are written in many ways. First consider the label in front of the number. The common forms are:
Phone, Phone:, (Phone), P, P-, PH, etc.
So we can design the label-matching expression '(?-i)(?:\(?(?:Phone|Ph?)[-:]?\s*\)?)'. Note that (?-i) switches case-insensitive matching off; accepting a label such as PH needs case-insensitive matching, for example (?i). This expression has a flaw: it cannot guarantee that the parentheses are balanced (for example, it matches '(Phone:'). However, the goal here is not validation, and an unbalanced parenthesis does not affect extracting the number, so this simple form is good enough. The strict alternative would be '(?:Phone|Ph?)[-:]?|\((?:Phone|Ph?)[-:]?\s*\)', which is much more cumbersome.
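
The trade-off between the simple and the strict label expression can be tried out as follows. This is a sketch under a few assumptions: case-insensitive matching is requested through the preCaseLess option rather than an inline flag, both expressions are wrapped in ^(?:...)$ so that only whole-label matches count, and the helper name IsLabel is made up.

```pascal
program LabelDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, PerlRegEx;

const
  Loose  = '\(?(?:Phone|Ph?)[-:]?\s*\)?';
  Strict = '\((?:Phone|Ph?)[-:]?\s*\)|(?:Phone|Ph?)[-:]?';

// Hypothetical helper: True if Text as a whole matches Pattern, ignoring case.
function IsLabel(const Pattern, Text: string): Boolean;
var
  Reg: TPerlRegEx;
begin
  Reg := TPerlRegEx.Create(nil);
  try
    Reg.Options := [preCaseLess];
    Reg.RegEx := '^(?:' + Pattern + ')$';
    Reg.Subject := Text;
    Result := Reg.Match;
  finally
    Reg.Free;
  end;
end;

begin
  WriteLn(IsLabel(Loose, '(Phone)'));   // accepted
  WriteLn(IsLabel(Strict, '(Phone)'));  // accepted
  WriteLn(IsLabel(Loose, 'P-'));        // accepted
  WriteLn(IsLabel(Loose, 'PH'));        // accepted thanks to preCaseLess
  WriteLn(IsLabel(Loose, '(Phone:'));   // accepted: the unbalanced parenthesis slips through
  WriteLn(IsLabel(Strict, '(Phone:'));  // rejected: the strict form demands the closing parenthesis
end.
```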

Then consider the phone number itself. A complete landline number might look like this:
+61 2 1234 5678, (61 2) 1234 5678, (61-)2-12345678, 61 (0)2 12345678
The country code (+61) or the area code (02) may also be omitted:
02 1234 5678, 1234 5678, 2-1234-5678
The digits may also be grouped in an unusual way:
02 123 456 78 and so on

The process of constructing a regular expression is as follows:

Match the country code: (?:[+(]*61[-)]*)? Used for validation, this expression would also accept text such as '(++61'. But I am extracting information from company web pages, and a company will not put that kind of garbage into its own contact details. When extracting, as long as there are no false positives, we can assume that the input does not contain many errors.

Match the area code: (?:\(? *0?\)? *(\d) *[-)]? *)? As above, this expression may also match invalid text. It would be much simpler to merge the area code into the main number, but here I want to capture the area code as a basis for determining the state.

Match the main number: (?:\d{5,}(?:[- ]\d+)*|\d{1,4}(?:[- ]\d+)+) If phone numbers were the only digits in the text, \d+(?:[- ]\d+)* would be enough. However, Australian postcodes are exactly four digits, and the PO box number in an address may be three or four digits, which makes the expression harder to write: a run of fewer than five consecutive digits must be followed by a hyphen or space and then more digits.
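
The effect of the "fewer than five digits" rule can be checked in isolation. The sketch below assumes that the separator class contains a hyphen and a space, [- ], and that the second alternative requires at least one further digit group; the helper name FirstNumber and the sample strings are made up.

```pascal
program MainNumber;

{$APPTYPE CONSOLE}

uses
  SysUtils, PerlRegEx;

// Hypothetical helper: returns the first main-number match in Text, or ''.
function FirstNumber(const Text: string): string;
var
  Reg: TPerlRegEx;
begin
  Result := '';
  Reg := TPerlRegEx.Create(nil);
  try
    Reg.RegEx := '(?:\d{5,}(?:[- ]\d+)*|\d{1,4}(?:[- ]\d+)+)';
    Reg.Subject := Text;
    if Reg.Match then
      Result := Reg.MatchedExpression;
  finally
    Reg.Free;
  end;
end;

begin
  WriteLn('[', FirstNumber('12345678'), ']');         // long run of digits: matched as a whole
  WriteLn('[', FirstNumber('1234 5678'), ']');        // short groups joined by a space: matched
  WriteLn('[', FirstNumber('Sydney NSW 2000'), ']');  // lone four-digit postcode: no match
  WriteLn('[', FirstNumber('PO Box 123'), ']');       // lone three-digit box number: no match
end.
```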

So the regular expression for the preferred (labelled) numbers is: (?-i)(?:\(?(?:Phone|Ph?)[-:]?\s*\)?)((?:[+(]*61[-)]*)?(?:\(? *0?\)? *(\d)?[-)]?\s*)?(?:\d{5,}(?:[- ]\d+)*|\d{1,4}(?:[- ]\d+)+))
The phone number is in captured substring 1, and the area code is in captured substring 2.
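
A minimal sketch of reading the two captured substrings, assuming one plausible reading of the expression above (the label's inner alternatives grouped, [- ] as the separator class) and case-insensitive matching requested through preCaseLess. The sample text is made up, and SubExpressions is the property name of the older TPerlRegEx release that matches Create(nil).

```pascal
program LabelledNumber;

{$APPTYPE CONSOLE}

uses
  SysUtils, PerlRegEx;

const
  LabelPart  = '(?:\(?(?:Phone|Ph?)[-:]?\s*\)?)';
  NumberPart = '((?:[+(]*61[-)]*)?(?:\(? *0?\)? *(\d)?[-)]?\s*)?' +
               '(?:\d{5,}(?:[- ]\d+)*|\d{1,4}(?:[- ]\d+)+))';

var
  Reg: TPerlRegEx;
begin
  Reg := TPerlRegEx.Create(nil);
  try
    Reg.Options := [preCaseLess];            // assumption: labels match in any case
    Reg.RegEx := LabelPart + NumberPart;
    Reg.Subject := 'Phone: (02) 1234 5678';  // made-up sample
    if Reg.Match then
    begin
      // Group 1 can start with a space, so it is trimmed before use.
      WriteLn('Number:    ', Trim(Reg.SubExpressions[1]));  // (02) 1234 5678
      WriteLn('Area code: ', Reg.SubExpressions[2]);        // 2
    end;
  finally
    Reg.Free;
  end;
end.
```

When a number has no area code, the optional area-code group can still grab the first digit of the main number, so captured substring 2 is best treated as a hint rather than a verified area code.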

In the first pass, every match that is found is also replaced with '' (the empty string) in the text. Afterwards, use ((?:[+(]*61[-)]*)?(?:\(? *0?\)? *(\d)?[-)]?\s*)?(?:\d{5,}(?:[- ]\d+)*|\d{1,4}(?:[- ]\d+)+)) (the same expression with the label part removed) to find all remaining strings in phone number format that carry no phone label.
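
The two passes might be arranged as in the following sketch. It is not the author's program: the expressions are one plausible reading of the text, case-insensitive label matching is an assumption (preCaseLess), the helper Collect and the sample text are made up, and the results go to the console instead of a ComboBox.

```pascal
program TwoPassPhones;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, PerlRegEx;

const
  LabelPart  = '(?:\(?(?:Phone|Ph?)[-:]?\s*\)?)';
  NumberPart = '((?:[+(]*61[-)]*)?(?:\(? *0?\)? *(\d)?[-)]?\s*)?' +
               '(?:\d{5,}(?:[- ]\d+)*|\d{1,4}(?:[- ]\d+)+))';

// Hypothetical helper: appends captured substring 1 of every match to List.
procedure Collect(Reg: TPerlRegEx; List: TStrings);
begin
  if Reg.Match then
    repeat
      List.Add(Trim(Reg.SubExpressions[1]));
    until not Reg.MatchAgain;
end;

var
  Labelled, Plain: TPerlRegEx;
  Numbers: TStringList;
  I: Integer;
begin
  Labelled := TPerlRegEx.Create(nil);
  Plain := TPerlRegEx.Create(nil);
  Numbers := TStringList.Create;
  try
    // Pass 1: labelled numbers go to the top of the list ...
    Labelled.Options := [preCaseLess];
    Labelled.RegEx := LabelPart + NumberPart;
    Labelled.Subject :=
      'Sales (Phone) 02 1234 5678 Fax 02 1234 5679 PO Box 123 Sydney NSW 2000';
    Collect(Labelled, Numbers);
    // ... and are then removed from the text.
    Labelled.Replacement := '';
    Labelled.ReplaceAll;

    // Pass 2: whatever still looks like a phone number, label or not.
    Plain.RegEx := NumberPart;
    Plain.Subject := Labelled.Subject;  // the text after the removals
    Collect(Plain, Numbers);

    for I := 0 to Numbers.Count - 1 do
      WriteLn(Numbers[I]);
  finally
    Numbers.Free;
    Plain.Free;
    Labelled.Free;
  end;
end.
```

With the sample text, the labelled number 02 1234 5678 should come first, followed by the unlabelled fax number 02 1234 5679; the box number 123 and the postcode 2000 should be left out.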
