Overview
Log analysis often involves thousands of log entries, and finding specific patterns in that much data frequently requires complex regular expressions. Typical tasks include listing the entries in a log file that do not contain a particular string, or identifying the entries that do not begin with a particular string.
Using negative lookahead
Regular expressions have the concepts of lookahead and lookbehind, which describe the matching behavior of the regex engine quite vividly. Note that "ahead" and "behind" in regular expressions differ from the everyday sense of front and back. For a piece of text, we usually call the direction of its beginning the "front" and the direction of its end the "back". The regex engine, however, parses from the head of the text toward the tail (the parsing direction can be controlled by a regex option). The direction of the tail is therefore "ahead", because the engine has not yet reached that part, and the direction of the head is "behind", because the engine has already passed through it.
Lookahead means that when the engine has matched up to a certain character, it peeks at the text it has not yet parsed to check whether that text conforms, or fails to conform, to a pattern. Lookbehind makes the same check against the text the engine has already matched. Requiring the text to conform to the pattern is called a positive match, and requiring it not to conform is called a negative match.
Modern regex engines generally support lookahead, while support for lookbehind is less widespread, so we use negative lookahead to meet our needs.
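To make the four kinds of assertion concrete, here is a minimal sketch in Python (the choice of Python's re module is an assumption of this example; the article itself tests with Expresso). The sample line is taken from the test data used later, and the variable names are made up for illustration.

```python
import re

line = "2009-07-07 04:38:44 127.0.0.1 GET /robots.txt"

# Lookahead inspects the text the engine has NOT consumed yet.
pos_ahead = re.search(r"GET(?= /robots)", line)    # "GET" followed by " /robots"
neg_ahead = re.search(r"GET(?! /robots)", line)    # "GET" NOT followed by " /robots"

# Lookbehind inspects the text the engine has ALREADY passed.
pos_behind = re.search(r"(?<=GET )/robots", line)  # "/robots" preceded by "GET "
neg_behind = re.search(r"(?<!GET )/robots", line)  # "/robots" NOT preceded by "GET "

# The assertions are zero-width: the positive lookahead match is just "GET".
print(pos_ahead.group(0))
# The two negative assertions fail on this line, so no match object is returned.
print(neg_ahead, neg_behind)
```

The positive forms find a match on this line and the negative forms do not, because the only GET in the line is followed by " /robots".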
Implementation
Test data:
2009-07-07 04:38:44 127.0.0.1 GET /robots.txt
2009-07-07 04:38:44 127.0.0.1 GET /posts/robotfile.txt
2009-07-08 04:38:44 127.0.0.1 GET /
These are a few simple log entries, and we want to achieve two goals:
1. Filter out the entries dated the 8th.
2. Find the entries that do not contain the string robots.txt (an entry is excluded whenever its URL contains robots.txt).
The lookahead syntax is (?=pattern) for positive lookahead and (?!pattern) for negative lookahead.
Let's start with the first goal: match entries that do not start with a particular string.
Here we want to exclude one contiguous string, so the pattern inside the lookahead is very simple: 2009-07-08. One expression that does this is:
^(?!2009-07-08).*?$
With Expresso we can see that the results do filter out the entries from the 8th.
Next, let's implement the second goal: exclude entries that contain a particular string.
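The same check can be reproduced outside Expresso. The following is a small sketch in Python, assuming the three test entries above are held in a list; the names log_lines, not_july_8 and kept are illustrative only.

```python
import re

log_lines = [
    "2009-07-07 04:38:44 127.0.0.1 GET /robots.txt",
    "2009-07-07 04:38:44 127.0.0.1 GET /posts/robotfile.txt",
    "2009-07-08 04:38:44 127.0.0.1 GET /",
]

# The negative lookahead at the start of the line rejects any entry
# that begins with 2009-07-08; everything else is matched by .*
not_july_8 = re.compile(r"^(?!2009-07-08).*$")

kept = [line for line in log_lines if not_july_8.match(line)]
for line in kept:
    print(line)
```

Only the two entries from 2009-07-07 are printed.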
Following the approach above, I adapted the expression slightly:
^.*?(?!robots\.txt).*?$
In plain English, this expression reads: the start of the string, then any characters, not followed by the contiguous string robots.txt, then any characters, then the end of the string.
Running the test shows that every entry still matches, so the expression did not achieve the effect we wanted. Why is that? Let's add two capture groups to the expression above to debug it:
^(.*?)(?!robots\.txt)(.*?)$
Test results:
We see that the first group matches nothing and the second group matches the entire string. Let's go back and analyze the expression. When the engine is still working on the first group (zone A), it already begins to perform the lookahead (zone B). It finds that the match succeeds when zone A is empty: .*? is allowed to match the empty string, and the lookahead condition is satisfied because zone A is followed by the string "2009" rather than robots.txt. So the whole matching process succeeds for every entry.
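The debugging step can be replayed with a short Python sketch (Python's re module is an assumption here, and the names group_a and group_b simply mirror zones A and B). It uses the first test entry, which does contain robots.txt and should ideally have been rejected.

```python
import re

line = "2009-07-07 04:38:44 127.0.0.1 GET /robots.txt"

# Same expression as above, with two capture groups around the .*? parts.
m = re.match(r"^(.*?)(?!robots\.txt)(.*?)$", line)

group_a = m.group(1)  # text consumed before the lookahead is evaluated
group_b = m.group(2)  # text consumed after the lookahead

print(repr(group_a))
print(repr(group_b))
```

The first group comes back empty and the second holds the whole entry: the lookahead was evaluated only at position 0, where the following text is "2009..." and not robots.txt.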
Having found the cause, we revise the expression by moving the .* into the lookahead, as follows:
^(?!.*robots\.txt).*$
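With the .* inside the lookahead, the assertion scans the whole line from its start for robots.txt before anything is consumed. A minimal Python sketch of this version, again with illustrative names and the same three test entries, could look like this. The combined pattern at the end is an addition of this example, showing that the two lookaheads can be chained to meet both goals at once.

```python
import re

log_lines = [
    "2009-07-07 04:38:44 127.0.0.1 GET /robots.txt",
    "2009-07-07 04:38:44 127.0.0.1 GET /posts/robotfile.txt",
    "2009-07-08 04:38:44 127.0.0.1 GET /",
]

# Reject the line if robots.txt occurs anywhere after the start.
no_robots = re.compile(r"^(?!.*robots\.txt).*$")
kept = [line for line in log_lines if no_robots.match(line)]

# Both goals together: not dated 2009-07-08 AND no robots.txt anywhere.
both_goals = re.compile(r"^(?!2009-07-08)(?!.*robots\.txt).*$")
kept_both = [line for line in log_lines if both_goals.match(line)]

for line in kept:
    print(line)
```

The entry for /posts/robotfile.txt is kept because it contains "robotfile.txt", not the exact string robots.txt.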