Notes on RegEx Backreferences and the Problems They Spawn

Today I thought about this problem again and felt it was worth writing down. On the one hand, it cost me more than half a year, during which I even claimed that the popular regex engines should be swept into the dustbin of history, yet in the end I did not really achieve that goal. On the other hand, this is a time-consuming undertaking, and I cannot sink any more time into it for now.

So the best course is to review the old work and sort out my ideas first, so that if I get another chance to tackle the problem in the future, I will not have forgotten everything.

First, this is a genuine NP problem (matching with backreferences is in fact NP-complete), which means it is also a model for studying computational problems. Whether this model is easier to get started with than other models is hard to say, since I have not really studied other NP problems. Since it is an NP problem, research on it involves two aspects:

1. Since it is an NP problem, finding a polynomial-time solution for it would amount to proving NP = P; conversely, a proof that it cannot be solved in polynomial time would amount to proving NP != P. There may be a third possibility: proving that the question cannot be settled either way, from which we would conclude that whether NP equals P is unverifiable.

Anyone who makes a breakthrough on this kind of problem will become a master surpassing Knuth, so it is fine to think about it as a hobby, but one should not entertain unrealistic fantasies.

2. On the practical side, find a reasonable solution to the problem, which is what "sweeping the currently popular regex engines into the dustbin of history" means. My experiments suggest this is quite possible. What counts as a reasonable solution? In my version, the algorithm is adaptive with respect to the (pattern, input) pair: in the extreme case, as long as an O(x) solution exists for a given (pattern, input) pair, the general algorithm should also run in O(x) on that input, where x denotes an order of magnitude.
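As a rough formalization of that criterion (my own rendering, not from the original note; the symbols T, G, p, s are all assumptions), with T_A(p, s) the running time of algorithm A on pattern p and input s, and G the general algorithm:

```latex
% Informal adaptivity criterion (my formalization): whenever some
% algorithm A handles a particular (pattern, input) pair within
% order-of-magnitude x, the general algorithm G should stay within
% the same order of magnitude on that pair.
\[
  \forall (p, s):\quad
  \bigl(\exists A :\; T_A(p, s) = O(x)\bigr)
  \;\Longrightarrow\;
  T_G(p, s) = O(x)
\]
```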

The biggest problem with the current regex engines is that, in order to support backreferences and other features, they spend exponential time even on inputs that could be handled in linear time, or that involve no backreference at all. When I paused last year, I had arrived at an algorithm of my own; however, it had serious bugs on certain inputs, and it was also too complicated to keep optimizing.
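A minimal demonstration of that blowup (my example, using Python's backtracking re module): the pattern below contains no backreference, and a linear-time automaton would reject these inputs instantly, yet the running time roughly doubles with each extra character.

```python
import re
import time

# (a+)+b can never match a string consisting only of a's, but a
# backtracking engine still tries every way of partitioning the a's
# between the two quantifiers before giving up.
pattern = re.compile(r"(a+)+b")

for n in (18, 21, 24):
    s = "a" * n                  # no trailing "b": every attempt fails
    t0 = time.perf_counter()
    pattern.fullmatch(s)
    print(n, f"{time.perf_counter() - t0:.3f}s")
```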

At present I have implemented an Earley-style parser elsewhere, and I feel it can be applied to RegEx as well. Why take Earley's path? Because his algorithm has the same flavor as the very complicated one I had arrived at. Although the original Earley algorithm only covers what a nondeterministic pushdown automaton can recognize, it can easily be extended along my earlier ideas to support context-sensitive languages; and its overall framework seems clearer than my own algorithm's.
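For reference, here is a minimal Earley recognizer sketch (the standard textbook construction in my own code, not the parser mentioned above; the grammar encoding is my assumption):

```python
# Grammar: nonterminal -> set of right-hand sides, written as strings of
# single-character symbols; a symbol is a nonterminal iff it is a grammar
# key. No epsilon rules, which keeps this naive agenda loop correct.
def earley_recognize(grammar, start, text):
    n = len(text)
    chart = [set() for _ in range(n + 1)]    # chart[i]: items ending at i
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot == len(rhs):                          # Complete
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
            elif rhs[dot] in grammar:                    # Predict
                for r in grammar[rhs[dot]]:
                    new = (rhs[dot], r, 0, i)
                    if new not in chart[i]:
                        chart[i].add(new)
                        agenda.append(new)
            elif i < n and text[i] == rhs[dot]:          # Scan
                chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[n])

grammar = {"S": {"ab", "aSb"}}                   # a^n b^n, n >= 1
print(earley_recognize(grammar, "S", "aaabbb"))  # True
print(earley_recognize(grammar, "S", "aabbb"))   # False
```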

The current idea is whether productions can be generated dynamically during the Predict or Complete stage, which corresponds to the dynamic generation of automata in my earlier algorithm. One thing to note: when I originally generated automata, I actually kept quite a bit of bookkeeping information that I called "clues". Obviously, even after moving to the Earley algorithm this part remains essential, so the questions are: what does Earley already provide, how easy is it to add what it does not provide, and are the problems my original algorithm ran into any easier to solve here?
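A toy sketch of the dynamic-production idea (my illustration; the names Item, complete_group, GROUP1, BACKREF1 are hypothetical, and the "clue" here is reduced to just the captured substring):

```python
from dataclasses import dataclass

@dataclass
class Item:
    lhs: str    # the nonterminal just recognized, e.g. a capture group
    start: int  # input position where the item began
    end: int    # input position where it completed

def complete_group(item: Item, text: str, grammar: dict) -> None:
    """Hook for the Complete step: when a capture-group item finishes,
    record the captured substring (the "clue") and install a production
    on the fly, so a later backreference nonterminal can expand only to
    that exact literal."""
    captured = text[item.start:item.end]
    grammar.setdefault("BACKREF1", set()).add(captured)

grammar = {"GROUP1": {"abc"}}                  # static part of the grammar
complete_group(Item("GROUP1", 0, 3), "abcdeabc", grammar)
print(grammar["BACKREF1"])                     # {'abc'}: \1 must match "abc"
```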

Finally, let me record the difficulties encountered in the earlier study. In fact, the backreference itself is not the real obstacle in this kind of problem; the obstacles come from the combination of backreferences with other elements. For example, the RegEx (abc)de\1 can be parsed easily: although it carries a backreference, it is essentially the same as abcdeabc.
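A quick check of that equivalence (my example, using Python's re):

```python
import re

# (abc)de\1 accepts exactly the same single string as the literal
# pattern abcdeabc, so the backreference here adds no real difficulty.
print(bool(re.fullmatch(r"(abc)de\1", "abcdeabc")))  # True
print(bool(re.fullmatch(r"abcdeabc", "abcdeabc")))   # True
```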

However, a pattern like (.*)(.*)\1 is immediately much more complicated, largely because of the ambiguity caused by insufficient information while the input is only half parsed. If every input character introduces an ambiguity (and then each resulting possibility introduces another ambiguity at the next character), you can imagine what the final combinatorial explosion looks like. (One point worth recording clearly: the number of ambiguities is jointly determined by the pattern and the input.)
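A brute-force illustration of that joint dependence (my example; the helper decompositions is hypothetical): for (.*)(.*)\1, every way of writing the input as A + B + A stays a live possibility until the backreference is checked, and the number of such ways grows with the input.

```python
# Enumerate every decomposition s == A + B + A that (.*)(.*)\1 could use.
def decompositions(s):
    n = len(s)
    out = []
    for i in range(n + 1):                # candidate length of group 1
        a = s[:i]
        if n - i >= i and s.endswith(a):  # room for B plus a trailing copy of A
            out.append((a, s[i:n - i]))
    return out

print(decompositions("aaaa"))  # [('', 'aaaa'), ('a', 'aa'), ('aa', '')]
# For "a" * n there are n // 2 + 1 live decompositions: the ambiguity
# depends on the input as well as on the pattern.
```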

(UPDATE: an earlier idea was to preprocess the pattern, for example by merging similar items. That idea still has a lot going for it. The key is to be able to recognize the patterns that cannot be handled this way, or that are useless to preprocess; some patterns are better simply thrown to the backtracking algorithm.)

The earlier thought was: classify the situations and then optimize each class separately. (UPDATE: this framing is too misleading.) For algorithms that extend (RegEx) special features, some patterns and situations are treated as designed features rather than as mere syntactic-sugar expansions (such as the counted-repetition feature); I think this idea is still worth keeping.

This is because that approach is more effective than the mechanical "if there is a backreference, do A; otherwise do B" (and the worst part is that most popular engines do not even have that much). The optimization (it really should not be called optimization) is to describe and handle those backreference cases that can be solved in polynomial or even linear time. Before doing that, however, we should carefully analyze the costs and benefits of such a scheme. In past practice, the combination of these features with algorithms built on the traditional automaton viewpoint turned out somewhat complicated; can they integrate well with Earley's algorithmic framework?
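One concrete instance of such a case (my example): (a*)\1 accepts exactly the even-length strings of a's, so a direct O(n) check decides it without any backtracking.

```python
import re

# (a*)\1 matches s iff s consists only of a's and has even length,
# so this linear scan gives the same answer as the regex.
def matches_even_as(s: str) -> bool:
    return set(s) <= {"a"} and len(s) % 2 == 0

for s in ("", "aa", "aaa", "aaaa", "ab"):
    assert bool(re.fullmatch(r"(a*)\1", s)) == matches_even_as(s)
print("all cases agree")
```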

In addition, if ambiguity exists at some point (and not because of improper handling), it certainly cannot be eliminated before further information is obtained. So the key is how to deal with these ambiguities so that there is a good functional relationship between the (objective, not human-caused) ambiguity and the algorithm's running time. If such a model can be established, that would in fact be at least the first step toward the practical goal.

I used to hope that my solution could be applied to a stream, discarding content already read; but when it comes to handling further information and historical information (and perhaps other decisions), that is far too restrictive. The eventual fact is that, at least for historical information and for complex RegExes, no matter how we handle things, there must be a corresponding storage and indexing scheme; with only a handful of fixed methods, the limits I hoped for cannot be reached.

This also covers the handling of the current input. In fact, we do not need to respond immediately at every position; instead, a dynamic decision engine could decide whether to respond. The obvious advantage is that the more information has been gathered, the easier the processing becomes, which means faster speed and less space consumption. If this approach is taken, though, the Earley algorithm may be inappropriate as the framework, because it is fundamentally input-driven.

(UPDATE: I thought about it tonight; there seems to be no good way to make such dynamic decisions, but extracting some information through preprocessing may be a good path.)

Having written this far, I have put down much more than I had in mind when I started this note. In the end these ideas lack careful consideration and need to be deepened. On another note, throughout this study I have deliberately avoided heuristic algorithms, because there is always a feeling that matching is strict and a heuristic cannot guarantee the result that best matches the rules. But in fact the existing RegEx engines basically do not comply fully with the specifications either; am I demanding too much?

(UPDATE: the key question is whether the heuristic algorithm, or at least the ideas behind heuristic algorithms, can be useful here.)

In any case, this problem cannot be abandoned and should be treated as a long-term research hobby. Some may ask why this question attracts me so much: is it that I cannot admit failure, or cannot bring myself to give up?

That may be part of the reason, but not the main one. The key is that this problem is a basic, difficult issue linked to, or at least resembling, the nature of human thinking. The research on it, and the solution I finally adopt, will in turn affect the way I view the world. That is all.
