JS regex study notes: optimizing string-literal matching
Yesterday, in "JS regex study notes: matching string literals", I said that /"(?:\\.|[^"])*"/ is a decent expression because it meets our requirements. So it is usable, but it is not necessarily the best.
In terms of performance it is actually quite bad. Why? Because a traditional NFA engine tries the branches of an alternation from left to right,
so at every character it first tries \\. and only falls back to [^"] when that fails.
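The left-to-right rule is easy to see in isolation. A minimal sketch, using two throwaway patterns that are not from the article: the engine takes the first branch that matches, not the longest one.

```javascript
// Ordered alternation: branches are tried from left to right,
// and the first one that matches wins.
const firstWins = /a|ab/.exec("ab")[0];
const longerFirst = /ab|a/.exec("ab")[0];

console.log(firstWins);   // a
console.log(longerFirst); // ab
```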
Take a string such as: "123456\'78\"90"
It has 16 characters in total. Apart from the opening ", which matches directly, 15 remain. Among them there are only 2 escapes (taking up 4 characters), so \\. fails 10 times, once for each ordinary character, and succeeds only 2 times.
Each of those 10 failures has to backtrack and let [^"] make the match. The closing " of course matches directly.
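The count can be checked with a rough emulation. This is only a sketch, assuming the sample string is "123456\'78\"90": countFallbacks is a hypothetical helper that walks the string with sticky copies of the two branches, and it does not instrument the real engine.

```javascript
// Hypothetical helper: emulate "(?:first|second)*" after the opening quote
// and count how often the first branch fails but the second one succeeds.
function countFallbacks(input, first, second) {
  // Sticky (y) copies only match exactly at lastIndex.
  const a = new RegExp(first.source, "y");
  const b = new RegExp(second.source, "y");
  let pos = 1; // skip the opening quote
  let fallbacks = 0;
  while (pos < input.length) {
    a.lastIndex = pos;
    if (a.test(input)) {
      pos = a.lastIndex;
      continue;
    }
    b.lastIndex = pos;
    if (!b.test(input)) break; // neither branch matches: the group ends here
    fallbacks++;
    pos = b.lastIndex;
  }
  return fallbacks;
}

const sample = String.raw`"123456\'78\"90"`;
const failures = countFallbacks(sample, /\\./, /[^"]/);
console.log(sample.length, failures); // 16 10
```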
Obviously, ordinary strings are not escaped throughout. Mostly plain strings are the norm, although we can't rule out someone deliberately escaping every character.
So this regex has to backtrack for almost every character it matches. What happens when the string grows to 1 KB or 1 MB?
So let's change the regex. Just swap the two branches?
Is it /"(?:[^"]|\\.)*"/? Hmm, that's not quite right: [^"] now swallows the backslash by itself, so escapes are never matched as escapes.
So we have to modify it to /"(?:[^"\\]|\\.)*"/. That does it: when a \ is encountered, [^"\\] fails and \\. is tried instead.
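To see the difference between the two attempts, here is a small sketch, assuming the sample string "123456\'78\"90". The variable names are only for illustration.

```javascript
const sample = String.raw`"123456\'78\"90"`;

const swapped = /"(?:[^"]|\\.)*"/;  // branches merely swapped
const fixed = /"(?:[^"\\]|\\.)*"/;  // backslash excluded from the first branch

// The swapped version treats the escaped quote as the end of the string.
const swappedMatch = swapped.exec(sample)[0];
const fixedMatch = fixed.exec(sample)[0];

console.log(swappedMatch); // "123456\'78\"
console.log(fixedMatch);   // "123456\'78\"90"
```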
However, there is still a problem: because we have now excluded \ from the first branch, multi-line strings can no longer be matched.
JS allows a string to be continued on the next line with a trailing \, but the modified regex does not match such a string, so we still have to fix it.
Because . cannot match a newline, we need another way.
. matches any character except a line break. So how about [.\n]?
That's wrong: inside a character class [], the . no longer means "any character except a line break"; it is just a literal . character.
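A quick check of that behavior, as a minimal sketch with anchored single-character tests (dotClass is just an illustrative name):

```javascript
const dotClass = /^[.\n]$/;

const results = {
  dot: dotClass.test("."),          // a literal dot is accepted
  newline: dotClass.test("\n"),     // so is a newline
  letter: dotClass.test("a"),       // but any other character is rejected
  bareDotLetter: /^.$/.test("a"),   // outside a class, . matches "a"
  bareDotNewline: /^.$/.test("\n"), // but not a line break
};

console.log(results);
```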
So what do we do?
Actually, just approach it from a different angle:
\d means [0-9]
\D means [^0-9]
So [\d\D] covers every character, doesn't it? (I don't know whether newcomers can digest this.)
Similarly, [\s\S] and [\w\W] work just as well.
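A small sketch to confirm that a class and its complement together accept anything, including a line break. The probe characters are arbitrary picks for illustration.

```javascript
const everything = [/^[\d\D]$/, /^[\s\S]$/, /^[\w\W]$/];
const probes = ["7", "x", " ", "\n", "\\", '"'];

// Every pattern should accept every probe character.
const allMatch = everything.every((re) => probes.every((ch) => re.test(ch)));

console.log(allMatch); // true
```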
So /"(?:[^"\\]|\\[\d\D])*"/ satisfies our requirements.
It works well.
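Here is a sketch of the multi-line case. The test string is made up for illustration: a literal continued onto a second line with a trailing backslash.

```javascript
// A string literal continued onto a second line with a trailing backslash.
const multiLine = '"abc\\\ndef"';

const dotVersion = /"(?:[^"\\]|\\.)*"/;
const finalVersion = /"(?:[^"\\]|\\[\d\D])*"/;

const dotMatches = dotVersion.test(multiLine);
const finalMatch = finalVersion.exec(multiLine);

console.log(dotMatches);                  // false
console.log(finalMatch[0] === multiLine); // true
```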
Now let's come back and analyze its performance.
Take the same string, "123456\'78\"90", with the regex /"(?:[^"\\]|\\[\d\D])*"/
It has 16 characters in total. Apart from the opening ", which matches directly, 15 remain, including 2 escapes (taking up 4 characters). [^"\\] successfully matches 10 characters and fails only 2 times.
Why not 4 failures, when the escapes clearly take up 4 characters? Because although each escape is 2 characters, the match fails when [^"\\] reads the first one, the \, and then \\[\d\D] matches successfully,
consuming both characters of the escape, so the next attempt starts at the character after it. That is why there are only 2 backtracks.
Only 2 times is backtracking needed before \\[\d\D] makes the match. And of course the closing " still matches directly.
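The walk through the string can be sketched as a trace. This assumes the sample string "123456\'78\"90", and traceBranches is a hypothetical helper that emulates the group with sticky regexes rather than hooking into the engine. It records which branch consumed each token, which shows that each escape is taken as one two-character unit.

```javascript
// Hypothetical helper: record which branch of "(?:b1|b2)*" consumes each token.
function traceBranches(input, branches) {
  const sticky = branches.map((re) => new RegExp(re.source, "y"));
  const steps = [];
  let pos = 1; // skip the opening quote
  while (pos < input.length) {
    let matched = false;
    for (let i = 0; i < sticky.length; i++) {
      sticky[i].lastIndex = pos; // try this branch exactly at pos
      const m = sticky[i].exec(input);
      if (m) {
        steps.push({ token: m[0], branch: i + 1 });
        pos = sticky[i].lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) break; // closing quote: no branch matches, the group ends
  }
  return steps;
}

const sample = String.raw`"123456\'78\"90"`;
const steps = traceBranches(sample, [/[^"\\]/, /\\[\d\D]/]);
const viaSecond = steps.filter((s) => s.branch === 2);

console.log(steps.length, viaSecond.length); // 12 2
console.log(viaSecond.map((s) => s.token).join(" ")); // \' \"
```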
So we went from 10 backtracks down to 2. The regex is somewhat more bloated than yesterday's, but performance has improved by more than a notch.
OK, that's it for today's share. See you tomorrow.