· A closer look inside the regular expression engine

Let's look at how the regular expression engine matches the previous example. The first token is "<", which is a literal character; it matches the first "<" in the string. The next token is ".", which matches the character "E", and the "+" keeps repeating it, matching further characters until the end of the line. There the dot fails, because "." does not match a line break. Only then does the engine move on to the next token in the regular expression, that is, it tries to match ">".

So far, "<.+" has matched "<EM>first</EM> test". The engine tries to match ">" against the line break, which fails. The engine then backtracks: it gives up one character, so that "<.+" matches "<EM>first</EM> tes", and tries to match ">" against "t". Obviously this fails as well. The process continues until "<.+" matches "<EM>first</EM" and ">" matches the following ">". The engine has thus found the match "<EM>first</EM>".

Remember that a regex-directed engine is "eager": it reports the first match it finds rather than backtracking further, even if a better match such as "<EM>" might exist. Because "+" is greedy, the engine therefore returns the leftmost longest match.

· Replacing greediness with laziness

One way to correct this problem is to make "+" lazy instead of greedy. You do this by putting a question mark "?" after the "+". The same works for the other repetition operators "*", "{}" and "?". In the preceding example we can therefore use "<.+?>". Let's look at how the engine processes it.

Again, the token "<" matches the first "<" in the string. The next token is ".", this time repeated by a lazy "+". This tells the engine to repeat the dot as few times as possible. The engine therefore matches "." with the character "E" and then tries to match ">" against "M". This fails.
The engine backtracks, but not in the same way as in the previous example. Because the repetition is lazy, backtracking expands the repetition rather than reducing it, so "<.+" is now extended to "<EM". The engine then tries the next token ">" again. This time the match succeeds, and the engine reports "<EM>" as a successful match. Otherwise the process is much the same as before.

· An alternative to laziness

There is an even better alternative: a greedy repetition combined with a negated character set, "<[^>]+>". It is better because, with the lazy repetition, the engine has to backtrack for each character before it finds a successful match, whereas with the negated character set no backtracking is needed.

Finally, remember that this tutorial only discusses regex-directed engines. Text-directed engines do not backtrack, but they also do not support lazy repetition operators.

7. Use "." to match almost any character

In regular expressions, "." is one of the most commonly used symbols. Unfortunately, it is also one of the most frequently misused. "." matches a single character, whatever that character is. The only exception is the newline character: the engines discussed in this tutorial do not match newlines with "." by default. By default, "." is therefore shorthand for the character set [^\n\r] (Windows) or [^\n] (UNIX).

This exception exists for historical reasons. The earliest tools based on regular expressions read a file line by line and applied the regular expression to each line separately. In those tools the strings never contained newline characters, so "." never matched one.

Modern tools and languages can apply regular expressions to large strings or even entire files. All regular expression implementations discussed in this tutorial provide an option to make "."
match all characters, including newline characters. In RegexBuddy, EditPad Pro, PowerGREP and similar tools, you simply select the "dot matches newline" option.

In Perl, the mode in which "." matches a newline is called "single-line mode". Unfortunately, this is a confusing term, because there is also a so-called "multi-line mode". Multi-line mode only affects the anchors at the beginning and end of a line, while single-line mode only affects ".". Other languages and regular expression libraries have adopted Perl's terminology. When using the regular expression classes of the .NET Framework, you activate single-line mode with a statement like the following: Regex.Match("string", "regex", RegexOptions.Singleline)

· Use the dot sparingly

The dot is arguably the most powerful metacharacter. It allows you to be lazy: with a dot, you can match almost any character. The problem is that it often also matches characters it should not match. Here is a simple example. Suppose we want to match a date in mm/dd/yy format, but let the user choose the separator. A solution that quickly comes to mind is <\d\d.\d\d.\d\d>. It appears to match the date "02/12/03" just fine. The trouble is that 02512703 is also considered a valid date.

<\d\d[-/.]\d\d[-/.]\d\d> looks like a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect, since it matches "99/99/99". <[0-1]\d[-/.][0-3]\d[-/.]\d\d> goes a step further, although it still matches "19/39/99".

How perfect your regular expression needs to be depends on what you want to do. If you are validating user input, it should be as strict as possible. If you are only parsing a known source that you know contains no invalid data, a reasonably good regular expression that matches the characters you are looking for is enough.

8. Start-of-string and end-of-string anchors

An anchor is different from ordinary regular expression tokens.
An anchor does not match any character. Instead, it matches a position before or after a character. "^" matches the position before the first character of the string. <^a> matches "a" in the string "abc". <^b> does not match anything in "abc". Similarly, "$" matches the position after the last character of the string, so <c$> matches "c" in "abc".

· Using anchors

When validating user input in a programming language, using anchors is very important. To verify that the input is an integer, use <^\d+$>. User input often contains superfluous leading or trailing whitespace. You can use <^\s*> and <\s*$> to match leading and trailing whitespace.

· Using "^" and "$" as start-of-line and end-of-line anchors

Suppose you have a string that contains multiple lines, for example "first line\n\rsecond line" (where \n\r represents a line break). It is often necessary to process each line separately rather than the string as a whole. Almost all regular expression engines therefore provide an option that extends the meaning of both anchors. "^" then matches at the start of the string (before the "f") and after each line break (between \n\r and "s"). Similarly, "$" matches at the end of the string (after the last "e") and before each line break (between "e" and \n\r).

In .NET, the following code makes the anchors match before and after each line break: Regex.Match("string", "regex", RegexOptions.Multiline)

Application: string str = Regex.Replace(original, "^", ">", RegexOptions.Multiline) inserts ">" at the beginning of each line.

· Absolute anchors

<\A> only matches at the start of the entire string, and <\Z> only matches at the end of the entire string. Even in "multi-line mode", <\A> and <\Z> never match at line breaks.

Although \Z and $ (outside multi-line mode) match only at the end of the string, there is one exception.
If the string ends with a newline character, \Z and $ match at the position before that newline rather than at the very end of the string. This "enhancement" was introduced by Perl and has been adopted by many regular expression flavors, including Java and .NET. If you apply <^[a-z]+$> to "joe\n", the match is "joe" rather than "joe\n".
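
The anchor behaviour described in this section can be tried out with a short sketch. Python is an assumption here (the tutorial's own snippets use .NET), and the helper names is_integer, is_integer_strict and quote_lines are hypothetical, chosen only for illustration. Python's "$" shows the same trailing-newline exception as Perl, Java and .NET. Python's own \Z, however, matches only at the very end of the string, so the sketch sticks to "$".

```python
import re


def is_integer(text):
    """Check for digits only, using the ^ and $ anchors from the tutorial."""
    return re.search(r"^\d+$", text) is not None


def is_integer_strict(text):
    """Stricter variant: fullmatch must consume the whole string, newline included."""
    return re.fullmatch(r"\d+", text) is not None


def quote_lines(text):
    """Insert '>' at the start of every line by enabling multi-line mode for ^."""
    return re.sub(r"^", ">", text, flags=re.MULTILINE)


# $ also matches just before a trailing newline, so only "joe" is matched.
match = re.search(r"^[a-z]+$", "joe\n")
print(repr(match.group()))

print(is_integer("123"), is_integer("12a"))
# The trailing newline slips past ^\d+$ but not past fullmatch.
print(is_integer("123\n"), is_integer_strict("123\n"))
print(quote_lines("first line\nsecond line"))
```

When validating user input, the trailing-newline exception means that <^\d+$> alone accepts "123\n". A whole-string check such as the hypothetical is_integer_strict above is one way to rule that out.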