Regular expressions are hard to write, hard to read, and hard to maintain. They also often match unexpected text or miss valid text. These problems stem from the power and expressiveness of regular expressions: each metacharacter carries so much capability and nuance that the combined code is impossible to decipher without real mental effort.
Most regex implementations include features that make regular expressions easier to read and write, but these features are rarely used. For many programmers, writing regular expressions is a black art: they stick to the features they know and hope for the best. If you adopt the five habits discussed in this article, you will take most of the trial and error out of designing regular expressions.
This article uses Perl, PHP, and Python for its code examples, but its recommendations apply to almost any regular expression (regex) implementation.
I. Use whitespace and comments
For most programmers, using whitespace and indentation in ordinary code is not in question; those who skip it are laughed at by their peers and even by laymen. Almost everyone knows that code squeezed onto one line is hard to read, write, and maintain. Why should regular expressions be any different?
Most regex tools have an extended-whitespace feature, which lets programmers spread a regular expression over multiple lines with a comment at the end of each line. Why do so few programmers take advantage of it? Perl 6 regular expressions use extended-whitespace mode by default. Don't wait for the language to make it the default for you; use it on your own initiative.
The one thing to remember about extended whitespace is that the regex engine ignores whitespace in the pattern. If you need to match whitespace, you have to specify it explicitly.
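As a minimal sketch of this point in Python (the sample strings are illustrative assumptions, not taken from the article), a literal space in the pattern is ignored under re.VERBOSE unless it is written explicitly:

```python
import re

# Under re.VERBOSE the literal space between "foo" and "bar" is ignored,
# so this pattern behaves like "foobar".
ignored_space = re.compile(r"foo bar", re.VERBOSE)

# To match real whitespace, spell it out, for example with \s.
explicit_space = re.compile(r"foo \s bar", re.VERBOSE)

print(bool(ignored_space.search("foo bar")))   # False
print(bool(ignored_space.search("foobar")))    # True
print(bool(explicit_space.search("foo bar")))  # True
```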
In Perl, add the x modifier to the end of the regular expression, so that "m/foo|bar/" becomes:
m/
    foo
    |
    bar
/x
In PHP, add the x modifier to the end of the regular expression, so that "/foo|bar/" becomes:
"/
    foo
    |
    bar
/x"
In Python, pass the mode modifier re.VERBOSE to the compile function:
pattern = r'''
    foo
    |
    bar
'''
regex = re.compile(pattern, re.VERBOSE)
Whitespace and comments matter even more with complex regular expressions. Suppose the following regular expression is used to match a U.S. phone number:
\(?\d{3}\)? ?\d{3}[-.]\d{4}
This regular expression matches phone numbers of the form "(314) 555-4000". Do you think it also matches "314-555-4000" or "555-4000"? The answer is no in both cases. Writing the pattern on one line hides both its flaws and its design decisions: the area code is required, and the pattern fails to allow for a separator between the area code and the prefix.
Splitting the pattern over several lines and commenting it exposes the flaws and makes them much easier to fix.
In Perl it would look like this:
/
    \(?      # optional parentheses
    \d{3}    # the required area code
    \)?      # optional parentheses
    [-\s.]?  # separator is a dash, a space, or a period
    \d{3}    # three-digit prefix
    [-.]     # another separator
    \d{4}    # four-digit line number
/x
The rewritten regular expression now has an optional separator after the area code, so it matches "314-555-4000", but the area code is still required. Another programmer who needs to make the area code optional can quickly see that it is not optional now, and that a small change will fix that.
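For comparison, here is a sketch of the same commented pattern in Python, checked against the three phone numbers mentioned above. The variable names are illustrative assumptions.

```python
import re

phone = re.compile(r"""
    \(?      # optional parentheses
    \d{3}    # the required area code
    \)?      # optional parentheses
    [-\s.]?  # separator is a dash, a space, or a period
    \d{3}    # three-digit prefix
    [-.]     # another separator
    \d{4}    # four-digit line number
""", re.VERBOSE)

for candidate in ["(314) 555-4000", "314-555-4000", "555-4000"]:
    # search() looks for the pattern anywhere in the string
    print(candidate, "->", bool(phone.search(candidate)))
```

The first two candidates match; "555-4000" does not, because the area code is still required.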
II. Write tests
There are three levels of testing, and each adds another layer of reliability to your code. First, think carefully about what you need to match and whether you can cope with false matches. Second, test the regular expression against sample data. Third, formalize the tests into a test suite.
Deciding what to match is really a trade-off between producing false matches and missing valid ones. If your regular expression is too strict, it will miss valid matches; if it is too loose, it will produce false matches. Once a regular expression is released into live code, you may not notice either. Consider the phone number example above: it also matches the text "800-555-4000 = -5355", which looks more like arithmetic than a phone number. False matches are hard to find, so it is important to plan ahead and test.
Staying with the phone number example: if you are validating a phone number on a web form, you may be happy to accept ten digits in any format. If you are extracting phone numbers from a large body of text, however, you may need to be much more exact to keep out false matches.
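As a rough sketch of the lenient web-form case, the check below accepts any input containing exactly ten digits. The helper name has_ten_digits is hypothetical, and a real form would likely need more rules than this.

```python
import re

def has_ten_digits(text):
    """Return True if text contains exactly ten digits, in any format."""
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    return len(digits) == 10

print(has_ten_digits("(314) 555-4000"))  # True
print(has_ten_digits("314.555.4000"))    # True
print(has_ten_digits("555-4000"))        # False
```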
When thinking about the data you want to match, write down some example cases, then write code that tests your regular expression against them. Any complex regular expression deserves a small test program, which can take the following form.
In Perl:
#!/usr/bin/perl
use strict;
use warnings;

# Sample cases taken from the examples in this article
my @tests = ("(314) 555-4000", "314-555-4000", "555-4000", "1234-123-12345");

foreach my $test (@tests) {
    if ($test =~ m/
        \(?      # optional parentheses
        \d{3}    # the required area code
        \)?      # optional parentheses
        [-\s.]?  # separator is a dash, a space, or a period
        \d{3}    # three-digit prefix
        [-\s.]   # another separator
        \d{4}    # four-digit line number
    /x) {
        print "Matched on $test\n";
    }
    else {
        print "Failed match on $test\n";
    }
}
In PHP:
<?php
// Sample cases taken from the examples in this article
$tests = array("(314) 555-4000", "314-555-4000", "555-4000", "1234-123-12345");

$regex = "/
    \(?      # optional parentheses
    \d{3}    # the required area code
    \)?      # optional parentheses
    [-\s.]?  # separator is a dash, a space, or a period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
/x";
foreach ($tests as $test) {
    if (preg_match($regex, $test)) {
        echo "Matched on $test<br />";
    }
    else {
        echo "Failed match on $test<br />";
    }
}
?>
In Python:
import re

# Sample cases taken from the examples in this article
tests = ["(314) 555-4000", "314-555-4000", "555-4000", "1234-123-12345"]

pattern = r'''
    \(?      # optional parentheses
    \d{3}    # the required area code
    \)?      # optional parentheses
    [-\s.]?  # separator is a dash, a space, or a period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
'''
regex = re.compile(pattern, re.VERBOSE)
for test in tests:
    # search() scans the whole string, like the Perl and PHP versions
    if regex.search(test):
        print("Matched on", test)
    else:
        print("Failed match on", test)
Running the test code reveals another problem: the pattern matches "1234-123-12345".
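One possible way to close this gap, offered only as a sketch and not as the article's own fix, is to refuse a match when a digit sits directly before or after the number. The lookaround assertions and the name bounded_phone are assumptions added for illustration.

```python
import re

bounded_phone = re.compile(r"""
    (?<!\d)  # assumption: no digit immediately before the number
    \(?      # optional parentheses
    \d{3}    # the required area code
    \)?      # optional parentheses
    [-\s.]?  # separator is a dash, a space, or a period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
    (?!\d)   # assumption: no digit immediately after the number
""", re.VERBOSE)

print(bool(bounded_phone.search("314-555-4000")))    # True
print(bool(bounded_phone.search("1234-123-12345")))  # False
```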
Ideally, you would fold all of these tests into a test suite. Even if you don't have a test suite yet, your regular expression tests are a good foundation for one, and now is a good time to start. Even if it is not the right time to create one, you should rerun your regular expression tests after every change. A little time spent here saves a lot of trouble later.
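A test suite for the phone pattern might be sketched with Python's unittest module as follows. The class name, the helper run_suite, and the choice of cases are assumptions; the last test simply records the known false match rather than fixing it.

```python
import re
import unittest

PHONE = re.compile(r"""
    \(?      # optional parentheses
    \d{3}    # the required area code
    \)?      # optional parentheses
    [-\s.]?  # separator is a dash, a space, or a period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
""", re.VERBOSE)

class PhoneRegexTest(unittest.TestCase):
    def test_valid_numbers_match(self):
        for text in ["(314) 555-4000", "314-555-4000", "314.555.4000"]:
            with self.subTest(text=text):
                self.assertIsNotNone(PHONE.search(text))

    def test_missing_area_code_is_rejected(self):
        self.assertIsNone(PHONE.search("555-4000"))

    def test_known_false_match(self):
        # Records current behavior: the unanchored pattern still finds
        # a phone-shaped substring inside this longer run of digits.
        self.assertIsNotNone(PHONE.search("1234-123-12345"))

def run_suite():
    """Load and run the tests above, returning the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PhoneRegexTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_suite() after each change to the pattern (or running the file with python -m unittest) makes regressions visible immediately.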
III. Group the alternation operator
The alternation operator (|) has low precedence, which means it often alternates over more than the programmer intended. For example, a regular expression that extracts email addresses from a mail file might look like this:
^CC:|To:(.*)
This attempt is incorrect, but the bug often goes unnoticed. The intent is to find lines that start with "CC:" or "To:" and then capture the email addresses from the rest of the line.
Unfortunately, the regular expression captures nothing from lines that start with "CC:", and it may capture arbitrary text if "To:" appears in the middle of a line. In plain terms, it either matches a line that begins with "CC:" and captures nothing, or matches any line that contains "To:" and captures the rest of that line. In practice this regular expression still captures plenty of email addresses, so nobody notices the bug.
If that really were the intent, you should add parentheses to say so explicitly:
(^CC:)|(To:(.*))
If the real intent is to capture the remainder of any line that starts with "CC:" or "To:", the correct regular expression is:
^(CC:|To:)(.*)
This incomplete-match bug is common, and you will avoid it if you make a habit of grouping whenever you alternate.
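The difference is easy to see by running both patterns over a few lines. This is a minimal sketch; the sample lines and addresses are invented for illustration.

```python
import re

lines = [
    "CC: alice@example.com",
    "To: bob@example.com",
    "Forwarded To: carol@example.com",  # "To:" in the middle of a line
]

ungrouped = re.compile(r"^CC:|To:(.*)")
grouped = re.compile(r"^(CC:|To:)(.*)")

for line in lines:
    loose = ungrouped.search(line)
    strict = grouped.search(line)
    print(repr(line))
    # Ungrouped: "CC:" lines match but capture nothing (None);
    # any line containing "To:" captures the rest of the line.
    print("  ungrouped captures:", loose.group(1) if loose else "no match")
    # Grouped: only lines that start with "CC:" or "To:" match.
    print("  grouped captures:  ", strict.group(2) if strict else "no match")
```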
IV. Use lazy quantifiers
Many programmers avoid lazy quantifiers such as "*?", "+?", and "??", even though they make an expression easier to write and understand.
A lazy quantifier matches as little text as possible while still letting the overall match succeed. If you write "foo(.*?)bar", the quantifier stops the first time it encounters "bar", not the last. This matters if you want to capture "###" from "foo###bar+++bar". A greedy quantifier would capture "###bar+++".
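The example above can be checked directly in Python:

```python
import re

text = "foo###bar+++bar"

# Lazy: stop at the first "bar"
lazy = re.search(r"foo(.*?)bar", text).group(1)
# Greedy: run to the last "bar"
greedy = re.search(r"foo(.*)bar", text).group(1)

print(lazy)    # ###
print(greedy)  # ###bar+++
```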
Suppose you want to capture all the phone numbers in an HTML file. You could use the phone number regular expression discussed above. However, if you know that all the phone numbers are in the first column of a table, you can write a much simpler regular expression with a lazy quantifier:
<tr><td>(.+?)<td>
Many beginning programmers avoid lazy quantifiers and use negated character classes instead. They might write:
<tr><td>([^>]+)</td>
That works in this case, but it causes trouble if the text you want to capture contains characters that also appear in your delimiter (here, </td>). With a lazy quantifier you get a cleaner regular expression without spending time assembling character classes.
Lazy quantifiers are most valuable when you know the structure surrounding the text you want to capture.
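The sketch below compares the two approaches on a small table. The HTML rows, names, and numbers are made-up sample data, and the lazy pattern here ends with a closing </td> tag, which is an assumption about the markup.

```python
import re

html = (
    "<tr><td>(314) 555-4000</td><td>Alice</td></tr>\n"
    "<tr><td>314-555-4001</td><td>Bob</td></tr>\n"
    "<tr><td><b>314-555-4002</b></td><td>Carol</td></tr>\n"
)

# Lazy quantifier: take as little as possible up to the first </td>
lazy_cells = re.findall(r"<tr><td>(.+?)</td>", html)

# Negated character class: cannot cross the ">" inside "<b>"
negated_cells = re.findall(r"<tr><td>([^>]+)</td>", html)

print(lazy_cells)     # all three first-column cells
print(negated_cells)  # misses the row whose cell contains a tag
```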
V. Use alternative delimiters
Perl and PHP usually use the forward slash (/) to mark the beginning and end of a regular expression, while Python uses a pair of quotes. If you stick with the slash in Perl and PHP, you have to escape every slash inside the expression; if you use ordinary quotes in Python, you have to escape every backslash (\). Choosing a different delimiter or quoting style lets you avoid roughly half of the escaping in a regular expression. That makes the expression easier to read and reduces bugs caused by forgotten escapes.
Perl and PHP let you use any non-alphanumeric, non-whitespace character as a delimiter. If you switch to a different delimiter, you no longer have to escape the forward slash when matching URLs or HTML tags (such as "http://" or "<br />").
For example, "/http:\/\/(\S)*/" can be written as "#http://(\S)*#".
Common delimiters are "#", "!", and "|". If you use square brackets, angle brackets, or curly braces, just keep the opening and closing characters paired. Here are some examples of common delimiters:
#...#
!...!
{...}
s|...|...| (Perl only)
s[...][...] (Perl only)
s<...>/.../ (Perl only)
In Python, a regular expression is first treated as a string. If you use ordinary quotes, you have to escape every backslash. You can avoid this by using a raw ("r") string. If you combine triple single quotes with the re.VERBOSE option, the pattern can also contain line breaks. For example, regex = "(\\w+)(\\d+)" can be written as:
regex = r'''
    (\w+)
    (\d+)
'''
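A short check, written as a sketch with an invented input string, shows that the escaped and raw spellings are the same pattern and that the multi-line form behaves the same once it is compiled with re.VERBOSE:

```python
import re

escaped = "(\\w+)(\\d+)"   # ordinary string: every backslash doubled
raw = r"(\w+)(\d+)"        # raw string: backslashes written once
print(escaped == raw)      # True

verbose = re.compile(r'''
    (\w+)   # one or more word characters
    (\d+)   # one or more digits
''', re.VERBOSE)

# \w+ is greedy and also matches digits, so it leaves only the
# final digit for the second group.
print(verbose.match("abc123").groups())  # ('abc12', '3')
```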
Summary: The recommendations in this article focus on making regular expressions readable. If you build these habits into your development work, you will think more clearly about the design and structure of your expressions, which helps reduce bugs and eases maintenance. If you are the one maintaining the code, you will be glad you did.