Regular expressions are difficult to write, read, and maintain. They often match text they should not, or miss text they should match. These problems stem from the very power and expressiveness of regular expressions: the capabilities and nuances of each metacharacter combine in ways that cannot be understood without real mental effort.
Many tools include features that make regular expressions easier to read and write, but using them has not become a habit. For many programmers, writing regular expressions is a black art. They stick to the features they know and hope for the best. If you adopt the five habits discussed in this article, your regular expressions will stand up to trial and error.
This article uses Perl, PHP, and Python for its code examples, but its advice applies to nearly any regular expression (regex) implementation.
I. Use whitespace and comments
For most programmers, using whitespace and indentation in ordinary code is not controversial. If they did not do so, they would be laughed at by their peers or even by laypeople. Almost everyone knows that code crammed onto a single line is hard to read, write, and maintain. Why should regular expressions be any different?
Most regular expression tools have an extended whitespace feature, which lets programmers spread a regular expression over multiple lines and add a comment at the end of each line. Why do so few programmers use this feature? Perl 6 regular expressions use extended whitespace mode by default. Do not wait for your language to make it the default; take the initiative and use it.
The one thing to remember about extended whitespace is that the regular expression engine ignores whitespace in the pattern. As a result, if you need to match a space, you have to specify it explicitly.
In Perl, add x to the end of the regular expression, so that "m/foo|bar/" becomes:
m/
  foo
  |
  bar
/x
In PHP, add x to the end of the regular expression, so that "/foo|bar/" becomes:
"/
  foo
  |
  bar
/x"
In Python, pass the mode modifier "re.VERBOSE" to the compile function:
import re

pattern = r'''
  foo
  |
  bar
'''
regex = re.compile(pattern, re.VERBOSE)
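The point about explicit spaces can be sketched as follows, assuming Python's re module. The names ignored and explicit are illustrative only. In verbose mode an unescaped space in the pattern is dropped, so a literal space has to be written inside a character class (or escaped).

```python
import re

# In verbose mode, unescaped whitespace in the pattern is ignored.
ignored = re.compile(r"foo bar", re.VERBOSE)

# To match a real space, put it in a character class (or escape it).
explicit = re.compile(r"foo [ ] bar", re.VERBOSE)

print(bool(ignored.fullmatch("foobar")))    # True: the space was dropped
print(bool(ignored.fullmatch("foo bar")))   # False
print(bool(explicit.fullmatch("foo bar")))  # True
```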
Whitespace and comments become more important as regular expressions become more complex. Suppose the following regular expression is used to match United States telephone numbers:
\(?\d{3}\)? ?\d{3}[-.]\d{4}
This regular expression matches phone numbers of the form "(314)555-4000". Do you think it matches "314-555-4000" or "555-4000"? The answer is no to both. Writing the pattern on a single line hides both its flaws and its design decisions. The area code is required, and the regular expression does not allow a separator between the area code and the prefix.
Splitting the pattern across several lines and adding comments exposes these flaws and makes them easier to fix.
In Perl, it looks like this:
/
    \(?      # optional parentheses
    \d{3}    # required telephone area code
    \)?      # optional parentheses
    [-\s.]?  # separator can be a dash, space, or period
    \d{3}    # three-digit prefix
    [-.]     # another separator
    \d{4}    # four-digit line number
/x
The rewritten regular expression now allows an optional separator after the area code, so it matches "314-555-4000", but the area code is still required. Another programmer who needs to make the area code optional can see at a glance that it is not optional now, and that a small change will fix that.
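One way that small change might look is sketched below in Python. This variant is an assumption, not the article's own solution: the area-code lines are wrapped in an optional non-capturing group, and the name phone is illustrative.

```python
import re

# Hypothetical variant: the whole area-code part sits in an optional
# non-capturing group, so "555-4000" is accepted as well.
phone = re.compile(r'''
    (?:             # start of the optional area-code part
        \(?         # optional parentheses
        \d{3}       # telephone area code
        \)?         # optional parentheses
        [-\s.]?     # separator can be a dash, space, or period
    )?              # the area code is now optional
    \d{3}           # three-digit prefix
    [-.]            # another separator
    \d{4}           # four-digit line number
''', re.VERBOSE)

for text in ["314-555-4000", "(314)555-4000", "555-4000"]:
    print(text, bool(phone.fullmatch(text)))
```

Because the pattern is already split into commented lines, the change touches only the two lines that open and close the group.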
II. Write tests
There are three levels of testing, and each adds a degree of reliability to your code. First, think carefully about what you need to match and whether you can tolerate false matches. Second, test the regular expression against sample data. Finally, formalize the tests into a test suite.
Deciding what to match is a trade-off between producing false matches and missing valid ones. If your regular expression is too strict, it will miss some valid matches; if it is too loose, it will produce false matches. Once a regular expression is released into real code, you may never notice either kind of error. Consider the phone number example above. It will match the text "800-555-4000 = -5355". False matches are hard to find, so it is important to plan and test in advance.
For example, if you are validating a phone number in a web form, you may be satisfied with ten digits in any format. However, if you are extracting phone numbers from a large body of text, you may need to be more careful to exclude false matches.
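The loose end of that trade-off, accepting ten digits in any format, might be sketched like this in Python. The function name has_ten_digits is hypothetical and only illustrates the idea.

```python
import re

def has_ten_digits(text):
    """Return True if text contains exactly ten digits, ignoring everything else."""
    digits = re.sub(r"\D", "", text)  # drop every non-digit character
    return len(digits) == 10

print(has_ten_digits("(314) 555-4000"))  # True
print(has_ten_digits("314.555.4000"))    # True
print(has_ten_digits("555-4000"))        # False
```

A check this loose suits form validation, where the user is trying to supply a phone number; it would be far too permissive for pulling numbers out of arbitrary text.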
As you think about the data you want to match, write down some cases. Then write some code that tests your regular expression against those cases. For any complicated regular expression, it is best to write a small program to test it. The program can take the following form.
In Perl:
#!/usr/bin/perl
use strict;
use warnings;

my @tests = ( "314-555-4000",
              "800-555-4400",
              "(314)555-4000",
              "314.555.4000",
              "555-4000",
              "aasdklfjklas",
              "1234-123-12345"
);

foreach my $test (@tests) {
    if ( $test =~ m/
                   \(?      # optional parentheses
                   \d{3}    # required telephone area code
                   \)?      # optional parentheses
                   [-\s.]?  # separator can be a dash, space, or period
                   \d{3}    # three-digit prefix
                   [-\s.]   # another separator
                   \d{4}    # four-digit line number
                 /x ) {
        print "Matched on $test\n";
    }
    else {
        print "Failed match on $test\n";
    }
}
In PHP:
<?php
$tests = array( "314-555-4000",
                "800-555-4400",
                "(314)555-4000",
                "314.555.4000",
                "555-4000",
                "aasdklfjklas",
                "1234-123-12345"
);

$regex = "/
    \(?      # optional parentheses
    \d{3}    # required telephone area code
    \)?      # optional parentheses
    [-\s.]?  # separator can be a dash, space, or period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
/x";

foreach ($tests as $test) {
    if (preg_match($regex, $test)) {
        echo "Matched on $test<br />";
    }
    else {
        echo "Failed match on $test<br />";
    }
}
?>
In Python:
import re

tests = ["314-555-4000",
         "800-555-4400",
         "(314)555-4000",
         "314.555.4000",
         "555-4000",
         "aasdklfjklas",
         "1234-123-12345"
         ]

pattern = r'''
    \(?      # optional parentheses
    \d{3}    # required telephone area code
    \)?      # optional parentheses
    [-\s.]?  # separator can be a dash, space, or period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
'''
regex = re.compile(pattern, re.VERBOSE)

for test in tests:
    # search() scans the whole string, like the Perl and PHP matches above;
    # match() would only try the pattern at the start of the string.
    if regex.search(test):
        print("Matched on", test)
    else:
        print("Failed match on", test)
Running the test code reveals another problem: the regular expression matches "1234-123-12345".
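One possible way to reject that case is sketched here in Python. The lookarounds are an addition for illustration, not part of the article's pattern: they stop a match from starting or ending in the middle of a longer run of digits. The name guarded is hypothetical.

```python
import re

guarded = re.compile(r'''
    (?<!\d)  # assumption: must not be preceded by a digit
    \(?      # optional parentheses
    \d{3}    # required telephone area code
    \)?      # optional parentheses
    [-\s.]?  # separator can be a dash, space, or period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
    (?!\d)   # assumption: must not be followed by a digit
''', re.VERBOSE)

print(bool(guarded.search("1234-123-12345")))  # False
print(bool(guarded.search("314-555-4000")))    # True
```

Whether this is the right fix depends on the data; the point is that the test list is what revealed the problem in the first place.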
Ideally, you would integrate all of these tests into a test suite for the whole program. Even if you do not have a test suite yet, your regular expression tests are a good foundation for one, and now is a good time to start. Even if it is not the right time to start a suite, you should still run your regular expression tests after every modification. A little time spent here will save you a lot of trouble.
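A minimal sketch of such a suite, assuming Python's standard unittest module, might look like the following. The class and test names are hypothetical, and the expected results cover only the cases whose outcome the article states or implies.

```python
import re
import unittest

PHONE = re.compile(r'''
    \(?      # optional parentheses
    \d{3}    # required telephone area code
    \)?      # optional parentheses
    [-\s.]?  # separator can be a dash, space, or period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
''', re.VERBOSE)

class PhoneRegexTest(unittest.TestCase):
    def test_valid_numbers_match(self):
        for text in ["314-555-4000", "800-555-4400",
                     "(314)555-4000", "314.555.4000"]:
            with self.subTest(text=text):
                self.assertIsNotNone(PHONE.search(text))

    def test_invalid_text_does_not_match(self):
        for text in ["555-4000", "aasdklfjklas"]:
            with self.subTest(text=text):
                self.assertIsNone(PHONE.search(text))

# Run the suite explicitly so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PhoneRegexTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Rerunning this after each change to the pattern gives the quick check the paragraph above recommends.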
III. Group the alternation operator
The alternation operator (|) has low precedence, which means it often alternates over more than the programmer intended. For example, a regular expression for extracting email addresses from text might look like this:
^CC:|To:(.*)
This attempt is incorrect, but the bug often goes unnoticed. The intent is to find lines starting with "CC:" or "To:" and then capture the email addresses from the rest of the line.
Unfortunately, the regular expression captures nothing from lines starting with "CC:" and may capture random text if "To:" appears in the middle of a line. In plain terms, the regular expression either matches a line beginning with "CC:" and captures nothing, or matches any line containing "To:" and captures the rest of that line. Usually this regular expression still captures plenty of email addresses, so nobody notices the bug.
If that were the real intent, you should add parentheses to make it explicit. The regular expression would look like this:
(^CC:)|(To:(.*))
If the real intent is to capture the rest of any line starting with "CC:" or "To:", the correct regular expression is:
^(CC:|To:)(.*)
This is a common incomplete-match bug. If you make a habit of grouping your alternations, you will avoid it.
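The difference between the two patterns can be seen in a short Python sketch. The sample lines and example.com addresses are made up for illustration.

```python
import re

lines = [
    "CC: alice@example.com",
    "To: bob@example.com",
    "Reply To: carol@example.com",
]

ungrouped = re.compile(r"^CC:|To:(.*)")
grouped = re.compile(r"^(CC:|To:)(.*)")

for line in lines:
    m1 = ungrouped.search(line)
    m2 = grouped.search(line)
    print(repr(line))
    print("  ungrouped captures:", m1.group(1) if m1 else None)
    print("  grouped captures:  ", m2.group(2) if m2 else None)
```

The ungrouped pattern captures nothing on the "CC:" line and captures text from the "Reply To:" line, while the grouped pattern captures the rest of the first two lines and skips the third.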
IV. Use lazy quantifiers
Many programmers avoid lazy (non-greedy) quantifiers such as "*?", "+?", and "??", even though they make expressions easier to write and understand.
Lazy quantifiers match as little text as possible while still letting the overall match succeed. If you write "foo(.*?)bar", the quantifier stops at the first "bar" it encounters, not the last. This matters if you want to capture "###" from "foo###bar+++bar". A greedy quantifier would capture "###bar+++".
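The example above can be checked with a few lines of Python, assuming the standard re module:

```python
import re

text = "foo###bar+++bar"

lazy = re.search(r"foo(.*?)bar", text).group(1)   # stops at the first "bar"
greedy = re.search(r"foo(.*)bar", text).group(1)  # runs to the last "bar"

print(lazy)    # ###
print(greedy)  # ###bar+++
```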
Suppose you want to capture all the phone numbers in an HTML file. You could use the phone number regular expression discussed above. However, if you know that all the phone numbers are in the first column of a table, you can write a much simpler regular expression with a lazy quantifier:
<tr><td>(.+?)</td>
Many beginning programmers use negated character classes instead of lazy quantifiers. They might write:
<tr><td>([^>]+)</td>
This works in this case, but it causes trouble if the text you want to capture contains a character that also appears in your delimiter (here, the ">" that ends "</td>"). With lazy quantifiers, you spend less time assembling character classes for each new regular expression.
Lazy quantifiers are most valuable when you know the structure of the context surrounding the text you want to capture.
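The trouble with the negated character class can be sketched in Python. The table row below is hypothetical sample data in which the first cell contains markup of its own.

```python
import re

# Hypothetical row: the phone cell contains a nested tag.
row = "<tr><td>314-555-4000 <b>(cell)</b></td><td>Alice</td></tr>"

lazy = re.search(r"<tr><td>(.+?)</td>", row)
negated = re.search(r"<tr><td>([^>]+)</td>", row)

print(lazy.group(1))  # 314-555-4000 <b>(cell)</b>
print(negated)        # None: the class cannot cross the ">" inside the cell
```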
V. Use convenient delimiters
Perl and PHP often use a forward slash (/) to mark the beginning and end of a regular expression. Python uses a pair of quotation marks. If you stick with the forward slash in Perl and PHP, you have to escape every slash in the expression. If you use ordinary quotation marks in Python, you have to escape every backslash (\). Choosing different delimiters or quoting lets you avoid escaping half of your regular expression. This makes the expression easier to read and reduces the bugs caused by forgetting to escape a symbol.
Perl and PHP allow any non-alphanumeric, non-whitespace character to be used as a delimiter. If you switch to a different delimiter, you no longer have to escape forward slashes when matching URLs or HTML tags (such as "http://" or "<br />").
For example, "/http:\/\/(\S)*/" can be written as "#http://(\S)*#".
Common delimiters are "#", "!", and "|". If you use square brackets, angle brackets, or curly braces, you only need to keep the opening and closing pair matched. Here are some examples of common delimiters:
#...#
!...!
{...}
s|...|...| (Perl only)
s[...][...] (Perl only)
s<...>/.../ (Perl only)
In Python, a regular expression is treated as a string first. If you use ordinary quotation marks as delimiters, you have to escape every backslash. However, you can use a raw string, r'', to avoid this problem. If you use triple quotes together with the "re.VERBOSE" option, you can also include line breaks. For example, regex = "(\\w+)(\\d+)" can be written as:
regex = r'''
    (\w+)
    (\d+)
'''
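A short Python sketch shows that the escaped and raw spellings are the same string, and that the triple-quoted form works once it is compiled with re.VERBOSE. The variable names and the input "abc123" are illustrative.

```python
import re

escaped = "(\\w+)(\\d+)"  # ordinary string: every backslash is doubled
raw = r"(\w+)(\d+)"       # raw string: backslashes are written once
verbose = r'''
    (\w+)   # one or more word characters
    (\d+)   # one or more digits
'''

print(escaped == raw)  # True

match = re.compile(verbose, re.VERBOSE).match("abc123")
print(match.groups())  # ('abc12', '3')
```

The greedy `\w+` takes as much as it can, leaving only the final digit for `\d+`.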
Conclusion: the advice in this article focuses mainly on the readability of regular expressions. As you develop these habits, you will think more clearly about the design and structure of your expressions, which helps reduce bugs and eases code maintenance. If you are the one maintaining the code, you will be glad you did.