This article was originally titled "How to Construct a Complex Regular Expression". The title felt ambiguous to me, though: it sounds as if regular expressions were simple and I were teaching people how to make them complicated. I mean the opposite: even a complex regular expression is nothing to fear once you find a suitable method for constructing it.
Taking the easy way out
Snopo provided the following text and asked how to extract the correct SQL query condition from it: or and name='hangsan' and id=001 or age>20 or area='%renmin%' and like
A brief analysis shows that the middle part is well-formed, but there are several stray like, or, and and keywords at both ends. Constructing a regular expression that parses any query conforming to SQL syntax would be complicated, but a specific problem can be handled more simply. The SQL above was generated automatically by a program, and only the text at both ends is out of place. It is enough to remove that text.
So I wrote a regular expression: s/^(?:(?:or|and|like)\s*)+|\s*(?:(?:or|and|like)\s*)+$//gmi;. With the g, m, and i flags, this removes the like, or, and and keywords, together with any spaces around them, from the beginning and end of each line of a multi-line string. What remains is the desired condition.
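For readers who prefer Python, the same "strip both ends" idea might be sketched as follows. This is only an illustration under stated assumptions: the names STRAY and strip_stray_keywords are hypothetical, and the \b word boundaries are an addition of this sketch (not part of the substitution above) so that a column name such as orders is not mistaken for the keyword or.

```python
import re

# Runs of stray keywords (or, and, like) plus spaces at either end of a line.
# The \b anchors are an assumption of this sketch, added so that identifiers
# that merely start with a keyword (e.g. "orders") are left untouched.
STRAY = re.compile(
    r"^(?:(?:or|and|like)\b\s*)+|\s*(?:\b(?:or|and|like)\s*)+$",
    re.IGNORECASE | re.MULTILINE,
)

def strip_stray_keywords(text):
    """Remove dangling or/and/like keywords from both ends of each line."""
    # re.sub replaces every match, so the leading and trailing runs both go.
    return STRAY.sub("", text)

raw = "or and name='hangsan' and id=001 or age>20 or area='%renmin%' and like"
print(strip_stray_keywords(raw))
```

Because re.sub replaces all matches by default, no separate "global" flag is needed here.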
Divide and conquer
After I sent this answer, Snopo was clearly not satisfied with such a "lazy" approach. He went on to ask whether a regular expression could be written that matches any conditional query allowed by SQL syntax. (Only the where part is considered; there is no need to handle the complete select statement.)
Indeed, if the goal is to solve a problem quickly, anything that works will do. If the goal is to learn, however, the right path is to tackle the problem at its root. So let's look at how to handle SQL query conditions with regular expressions.
The simplest query condition is a bare truth value, as in where 1, where True, or where false. The regular expression for such a statement is straightforward: /(?:-?\d+|True|False)/i.
A slightly more complex single statement is a comparison between a left side and a right side, for example name like 'zhang%', or age>25, or work in ('it', 'hr', 'R&D'). Simplified, the structure is A OP B, where A is a variable, OP is a comparison operator, and B is a value.
• A: The simplest A is \w+. In practice a variable may contain periods or backtick delimiters, as in `table`.`salary`, so it can be written as /[\w.`]+/. This is a rough approximation. If stricter matching is required, you can also insist that the backticks appear on both the left and right sides together (using a conditional).
• OP: The relations commonly used in a where clause are =, <>, >, <, >=, <=, between, like, and in. A simple regular expression describes them: /(?:[<>=]{1,2}|Between|Like|In)/i.
• B: B comes in four kinds: variables, numbers, strings, and lists. For simplicity, arithmetic expressions are not considered here.
Variables: the definition of A can be reused directly, so there is nothing more to say.
Numbers: defined as /\d+/. Decimals and negative numbers are not considered.
Strings: delimited by single or double quotes, and possibly containing escaped quotes. I wrote a regular expression for this requirement: /(['"])(?:\\['"]|[^\1])*?\1/. However, since it is only one part of a much larger machine, writing it this way is very risky. First, it uses a backreference. Second, the backreference relies on a global group number, which shifts as soon as the fragment is embedded in a larger pattern. (A backreference is also not interpreted inside a character class, so [^\1] does not mean "anything except the opening quote"; the Python version below uses [^'"] instead.) To solve the numbering problem I wrote a function that automatically generates a globally unique identifier for the group. Let's not go too deep here, though: the framework comes first and the details second, so that we do not drown in an ocean of details at the very start.
Lists: a list looks like (1, 3, 4) or ("it", "hr", "r&d"), that is, simple items separated by commas and enclosed in parentheses. Let I stand for a single list item, which is a number or a string. The list then becomes /\(I(?:,I)*?\)/: a left parenthesis, one I, zero or more further items each preceded by a comma, and a right parenthesis. For simplicity, whitespace is not considered.
• At this point we can summarize the overall pattern for a single statement: S =~ /A OP B/i, where S stands for a single statement.
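The risk of numbered backreferences in the string fragment can be made concrete with a small sketch. It assumes Python's re module; the names numbered, named_quote, pair_numbered, and pair_named are hypothetical and exist only for this illustration.

```python
import re

# A quoted-string fragment written with a numbered backreference.
numbered = r"""(['"])(?:\\['"]|[^'"])*?\1"""

# On its own, \1 refers to the opening quote, so the fragment works.
assert re.match(numbered + r"$", '"hr"')

mixed = "\"it\", 'hr'"   # two strings that use different quote styles

# Used twice in one larger pattern, the second \1 still points at the FIRST
# group, so the second string is forced to close with the first string's quote.
pair_numbered = numbered + r"\s*,\s*" + numbered + r"$"
print(bool(re.match(pair_numbered, mixed)))   # False

def named_quote(name):
    """Same fragment, but with a uniquely named group (hypothetical helper)."""
    return r"""(?P<%s>['"])(?:\\['"]|[^'"])*?(?P=%s)""" % (name, name)

# With one name per fragment, each backreference finds its own opening quote.
pair_named = named_quote("q1") + r"\s*,\s*" + named_quote("q2") + r"$"
print(bool(re.match(pair_named, mixed)))      # True
```

Giving every fragment its own group name keeps the pieces independent, which is what makes them safe to assemble into larger patterns.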
More complex still are compound statements, which consist of single statements joined by and or or. Construct the single statement properly, combine it reliably into a compound statement, and the task is done.
Continuing the example, if S stands for a single statement, the compound statement C is C =~ /S(?:(?:or|and)S)*?/. With that, a first working parser for conditional statements is born. The following implements it step by step in Python.
Python implementation
To reiterate: although an implementation is provided, please focus on the ideas rather than the code.
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Author: rex
# Blog: http://iregex.org
# Filename: test.py
# Created:

# Generate a quoted-string pattern,
# covering both ' and " strings,
# and allowing \' and \" inside.
index = 0
def gen_quote_str():
    global index
    index += 1
    # Each call gets its own group name: quote_a, quote_b, ...
    char = chr(96 + index)
    return r"""(?P<quote_%s>['"])(?:\\['"]|[^'"])*?(?P=quote_%s)""" % (char, char)

# Simple variable
def a():
    return r'[\w.`]+'

# Operators
def op():
    return r'(?:[<>=]{1,2}|Between|Like|In)'

# List item
# e.g. 'a', 23, a.b, "asdfasdf\"aasdf"
def item():
    return r"(?:%s|%s)" % (a(), gen_quote_str())

# A complete list
# e.g. (23, 24, 44), ("regex", "is", "good")
def items():
    return r"""\(\s*
    %s
    (?:,\s*%s)*\s*
    \)""" % (item(), item())

# Simple comparison
# e.g. a = 15, b > 23
def s():
    return r"%s\s*%s\s*(?:\w+|%s|%s)" % (a(), op(), gen_quote_str(), items())

# Complex comparison
# e.g. name like 'zhang%' and age > 23 and work in ("hr", "it", 'r&d')
def c():
    # Note: Python 3.11+ requires inline flags such as (?ix) at the very start
    # of the pattern, so call .strip() on the result before compiling it there.
    return r"""
    (?ix) %s
    (?:\s*
        (?:and|or)\s*
        %s\s*
    )*
    """ % (s(), s())

print("A:\t" + a())
print("OP:\t" + op())
print("ITEM:\t" + item())
print("ITEMS:\t" + items())
print("S:\t" + s())
print("C:\t" + c())
The result of running this code on my machine (Ubuntu 10.04, Python 2.6.5) is:
A:      [\w.`]+
OP:     (?:[<>=]{1,2}|Between|Like|In)
ITEM:   (?:[\w.`]+|(?P<quote_a>['"])(?:\\['"]|[^'"])*?(?P=quote_a))
ITEMS:  \(\s*
    (?:[\w.`]+|(?P<quote_b>['"])(?:\\['"]|[^'"])*?(?P=quote_b))
    (?:,\s*(?:[\w.`]+|(?P<quote_c>['"])(?:\\['"]|[^'"])*?(?P=quote_c)))*\s*
    \)
S:      [\w.`]+\s*(?:[<>=]{1,2}|Between|Like|In)\s*(?:\w+|(?P<quote_d>['"])(?:\\['"]|[^'"])*?(?P=quote_d)|\(\s*
    (?:[\w.`]+|(?P<quote_e>['"])(?:\\['"]|[^'"])*?(?P=quote_e))
    (?:,\s*(?:[\w.`]+|(?P<quote_f>['"])(?:\\['"]|[^'"])*?(?P=quote_f)))*\s*
    \))
C:
    (?ix) [\w.`]+\s*(?:[<>=]{1,2}|Between|Like|In)\s*(?:\w+|(?P<quote_g>['"])(?:\\['"]|[^'"])*?(?P=quote_g)|\(\s*
    (?:[\w.`]+|(?P<quote_h>['"])(?:\\['"]|[^'"])*?(?P=quote_h))
    (?:,\s*(?:[\w.`]+|(?P<quote_i>['"])(?:\\['"]|[^'"])*?(?P=quote_i)))*\s*
    \))
    (?:\s*
        (?:and|or)\s*
        [\w.`]+\s*(?:[<>=]{1,2}|Between|Like|In)\s*(?:\w+|(?P<quote_j>['"])(?:\\['"]|[^'"])*?(?P=quote_j)|\(\s*
    (?:[\w.`]+|(?P<quote_k>['"])(?:\\['"]|[^'"])*?(?P=quote_k))
    (?:,\s*(?:[\w.`]+|(?P<quote_l>['"])(?:\\['"]|[^'"])*?(?P=quote_l)))*\s*
    \))\s*
    )*
See matching:
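As a rough way to try the matching yourself, here is a compact sketch of the same building blocks applied to a sample condition. It is not the listing above verbatim: the helper names (quoted, single, where_clause, WHERE) are hypothetical, whitespace is allowed around list commas, and the flags are passed to re.compile rather than written inline, since Python 3.11 and later reject inline global flags that are not at the very start of a pattern.

```python
import re
from itertools import count

_ids = count(1)  # source of unique suffixes for group names

def quoted():
    """Quoted-string fragment with a unique group name per call."""
    name = "q%d" % next(_ids)
    return r"""(?P<%s>['"])(?:\\['"]|[^'"])*?(?P=%s)""" % (name, name)

VAR = r"[\w.`]+"                          # A: variable
OP = r"(?:[<>=]{1,2}|between|like|in)"    # OP: comparison operator

def item():
    return r"(?:%s|%s)" % (VAR, quoted())

def items():
    return r"\(\s*%s(?:\s*,\s*%s)*\s*\)" % (item(), item())

def single():
    # S: A OP B, where B is a word/number, a string, or a list
    return r"%s\s*%s\s*(?:\w+|%s|%s)" % (VAR, OP, quoted(), items())

def where_clause():
    # C: single statements joined by and/or
    body = r"%s(?:\s*(?:and|or)\s*%s)*" % (single(), single())
    return re.compile(body, re.IGNORECASE)

WHERE = where_clause()
sample = """name like 'zhang%' and age>23 and work in ("hr", "it", 'r&d')"""
print(bool(WHERE.fullmatch(sample)))                                # True
print(bool(WHERE.fullmatch("name like 'zhang%' and and age>23")))   # False
```

Using fullmatch makes the check strict: the whole condition has to be consumed, so a doubled and is rejected instead of being silently skipped.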
Arithmetic expression
Earlier I said that "for simplicity, arithmetic expressions are not considered here". Parsing arithmetic expressions is a very interesting topic, however. Practically every algorithms book covers it (converting infix expressions to prefix expressions, and so on). It can, of course, also be described with a regular expression.
The main idea is the following grammar:
expr -> expr + term | expr - term | term
term -> term * factor | term / factor | factor
factor -> digit | (expr)
And the code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Author: rex
# Blog: http://jb51.net
# Filename: math.py
# Created: 2010-08-07

# Note: the parenthesised form (expr) from the grammar is not covered here;
# a factor is a plain integer or decimal.
integer = r"\d+"
factor = r"%s(?:\.%s)?" % (integer, integer)
term = r"%s(?:\s*[*/]\s*%s)*" % (factor, factor)
expr = r"(?x)%s(?:\s*[+-]\s*%s)*" % (term, term)
print(expr)
Let's take a look at its output and matching:
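A quick way to check the matching is sketched below. It rebuilds the same four patterns and tests a few inputs; the EXPR name and the sample inputs are assumptions made for this illustration. Since the pattern is not recursive, the parenthesised factor from the grammar is expected to be rejected.

```python
import re

integer = r"\d+"
factor = r"%s(?:\.%s)?" % (integer, integer)
term = r"%s(?:\s*[*/]\s*%s)*" % (factor, factor)
expr = r"(?x)%s(?:\s*[+-]\s*%s)*" % (term, term)

EXPR = re.compile(expr)

# fullmatch requires the whole input to be a valid expression.
for text in ["1 + 2 * 3", "3.14*2 - 10/4", "2 +", "(1 + 2) * 3"]:
    print("%-16r %s" % (text, bool(EXPR.fullmatch(text))))
```

The first two inputs match, the dangling operator in "2 +" does not, and neither does the parenthesised input.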
Tips
• If a problem can be solved without a complex regular expression, do not use one.
• If you must write a complex regular expression, follow the principles below.
• Start from the whole: look at the overall structure of the text to be parsed and divide it into small components.
• Then work on the details: implement each small component, make each one as complete and robust as possible, and make sure it does not conflict with the whole.
• Assemble the components with care.
• The benefit of divide and conquer: when one module has an error and the other parts are correct, you can locate the error quickly and eliminate the bug.
• Use capturing parentheses with caution unless you know what you are doing, what side effects they have, and whether a workable alternative exists. In a short regular expression, one or two redundant pairs of parentheses are harmless. In a complex one, a single redundant pair can be a fatal error.
• Use free-spacing mode whenever possible. It lets you add comments and whitespace freely, which improves the readability of the regular expression.
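The last two tips can be illustrated with a short sketch, assuming Python's re module. The COMPARISON pattern and its group names (left, op, right) are hypothetical and deliberately simpler than the patterns built earlier.

```python
import re

# Free-spacing (verbose) mode: whitespace and # comments inside the pattern
# are ignored, so each component can be documented in place.
COMPARISON = re.compile(
    r"""
    (?P<left>  [\w.]+ )                # A: column name, optionally qualified
    \s*
    (?P<op>    [<>=]{1,2} | like )     # OP: comparison operator
    \s*
    (?P<right> \d+ | '[^']*' )         # B: number or simple quoted string
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = COMPARISON.fullmatch("t.age >= 25")
print(m.group("left"), m.group("op"), m.group("right"))   # t.age >= 25

# A stray capturing group changes what re.findall returns.
text = "age>20 or age<5"
print(re.findall(r"\w+(?:[<>=]{1,2})\d+", text))   # ['age>20', 'age<5']
print(re.findall(r"\w+([<>=]{1,2})\d+", text))     # ['>', '<']
```

The only difference between the last two calls is one pair of capturing parentheses, yet the result changes from whole matches to just the operators.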