Java Regular Expressions: Matching Lines That Do Not Contain a Given String
Overview
Log analysis often means dealing with thousands of log entries. To find data that follows a specific pattern in such a large volume of data, you often need to write fairly complex regular expressions. Two common examples are finding the entries in a log file that do not contain a specific string, and finding the entries that do not start with a specific string.
Using negative lookahead
Regular expressions have the concepts of lookahead and lookbehind. These two terms describe the matching behavior of the regular expression engine. Note that "ahead" and "behind" here differ a little from everyday usage. For a piece of text, we usually call the direction toward the beginning of the text the "front" and the direction toward the end the "back". For the regular expression engine it is the other way round, because it scans from the head of the text to the tail (some engines provide an option to control the scan direction). The direction toward the tail is "ahead", because the engine has not yet reached that part; the direction toward the head is "behind", because the engine has already passed through it.
While matching, the engine can peek at the text it has not yet consumed to check whether it does or does not match a pattern (lookahead), or check the text it has already passed in the same way (lookbehind). The "does match" and "does not match" variants are called positive and negative assertions.
Modern regular expression engines generally support lookahead, whereas lookbehind is less widely supported. We therefore use negative lookahead to meet our needs.
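To make the difference between the positive and negative forms concrete, here is a minimal Java sketch using java.util.regex. The class name, the helper method, and the /index.html request line are illustrative assumptions, not part of the original article.

```java
import java.util.regex.Pattern;

class LookaheadDemo {
    // Positive lookahead: "GET" must be followed by " /robots.txt".
    // The lookahead only checks the text; it does not consume it.
    static final Pattern POSITIVE = Pattern.compile("GET(?= /robots\\.txt)");

    // Negative lookahead: "GET" must NOT be followed by " /robots.txt".
    static final Pattern NEGATIVE = Pattern.compile("GET(?! /robots\\.txt)");

    // Returns true if the pattern occurs anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String robots = "GET /robots.txt";
        String index = "GET /index.html"; // hypothetical request line

        System.out.println(found(POSITIVE, robots)); // true
        System.out.println(found(POSITIVE, index));  // false
        System.out.println(found(NEGATIVE, robots)); // false
        System.out.println(found(NEGATIVE, index));  // true
    }
}
```

In both cases the matched text is just "GET": the lookahead only decides whether the match at that position is allowed.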
Implementation
Test data:
2009-07-07 04:38:44 127.0.0.1 GET /robots.txt
2009-07-07 04:38:44 127.0.0.1 GET /posts/robotfile.txt
2009-07-08 04:38:44 127.0.0.1 GET /
For these simple log entries, we want to achieve the following two goals:
1. Filter out the entries from the 8th.
2. Find the entries that do not contain the string robots.txt (only entries whose URL contains robots.txt should be filtered out).
The syntax of lookahead is (?=pattern) for the positive form and (?!pattern) for the negative form.
Let's achieve the first goal first: match entries that do not start with a specific string.
Because we want to exclude one contiguous string, the pattern inside the lookahead is very simple: 2009-07-08. Anchored at the start of the line, the expression takes this form:
^(?!2009-07-08).*?$
Testing in Expresso shows that the result does filter out the entries from the 8th.
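The same check can be sketched in Java. This is a minimal example that assumes the pattern ^(?!2009-07-08).*$ and the three test entries listed earlier; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class DateFilterDemo {
    // Negative lookahead anchored at the start of the line:
    // the line must not begin with 2009-07-08.
    static final Pattern NOT_ON_THE_8TH = Pattern.compile("^(?!2009-07-08).*$");

    // Keeps only the lines that match the pattern as a whole.
    static List<String> keep(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (NOT_ON_THE_8TH.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "2009-07-07 04:38:44 127.0.0.1 GET /robots.txt",
            "2009-07-07 04:38:44 127.0.0.1 GET /posts/robotfile.txt",
            "2009-07-08 04:38:44 127.0.0.1 GET /");

        // Prints the two entries from 2009-07-07.
        for (String line : keep(log)) {
            System.out.println(line);
        }
    }
}
```

Each log entry is tested on its own here, so no multiline flag is needed; if the whole file were matched as a single string, Pattern.MULTILINE would be required for ^ and $ to apply per line.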
Next, let's achieve the second goal: exclude entries that contain a specific string.
Copying the shape of the expression we wrote above, we get:
^.*?(?!robots\.txt).*?$
In plain words, this regular expression says: start at the beginning of the line, match any characters, assert that the string robots.txt does not come next, and then match any characters up to the end of the line.
Running the test shows that all three entries still match.
We didn't achieve what we wanted. Why? We can add two capture groups to the above regular expression for debugging:
^(.*?)(?!robots\.txt)(.*?)$
Test results:
We can see that the first group matches nothing, while the second group matches the entire string. Let's look at the regular expression again. Call the first group area A and the lookahead area B. As soon as the engine enters area A, the lookahead in area B is already evaluated. The match succeeds with area A empty, because .*? is allowed to match the empty string and the lookahead condition holds: area A is followed by the string "2009", not by "robots.txt". As a result, every entry matches successfully.
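The group behavior described above can be reproduced in Java. This is a minimal sketch using the debugging expression from the article; the class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDebugDemo {
    // The debugging expression: two lazy capture groups around the lookahead.
    static final Pattern DEBUG = Pattern.compile("^(.*?)(?!robots\\.txt)(.*?)$");

    // Returns {group 1, group 2}, or an empty array if the line does not match.
    static String[] groups(String line) {
        Matcher m = DEBUG.matcher(line);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String line = "2009-07-07 04:38:44 127.0.0.1 GET /robots.txt";
        String[] g = groups(line);

        // Group 1 is empty: the lazy .*? matches nothing, and the lookahead
        // at position 0 sees "2009-07-07", not "robots.txt", so it succeeds.
        System.out.println("group 1: [" + g[0] + "]");
        // Group 2 then has to expand to the end of the line for $ to match.
        System.out.println("group 2: [" + g[1] + "]");
    }
}
```

Even the entry that does contain robots.txt matches, because the lookahead is only checked at the very start of the line.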
Having found the cause, we can fix the regular expression by moving .*? into the lookahead expression: