First, run the Python interpreter, import the re module, and compile a RE:
#!python
Python 2.2.2 (#1, Feb 2003, 12:57:01)
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 80c3c28>
Now you can try matching various strings against the RE [a-z]+. An empty string should not match at all, since + means "one or more repetitions". In this case match() returns None, which causes the interpreter to print no output. You can explicitly print the result of match() to make this clear.
#!python
>>> p.match("")
>>> print p.match("")
None
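The same check can be sketched as a small script. This is only an illustration: it is written to run on a current Python 3 interpreter rather than the 2.2 session shown above, and the variable names are invented for the example.

```python
import re

p = re.compile('[a-z]+')

# '+' needs at least one character from [a-z], so the empty string fails.
empty_result = p.match('')

# match() signals failure by returning None; the interactive prompt
# prints nothing when an expression evaluates to None.
no_match = empty_result is None

# A string that starts with lowercase letters does match.
has_match = p.match('tempo') is not None
```

Testing with `is None` makes the failure case explicit instead of relying on the prompt's silence.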
Now, let's try it on a string that it should match, such as "tempo". In this case, match() returns a MatchObject, so you should store the result in a variable for later use.
#!python
>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 80c4f68>
Now you can query the MatchObject for information about the matching string. MatchObject instances also have several methods and attributes; the most important ones are group(), start(), end(), and span().
Trying these methods will soon clarify their meaning:
#!python
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both the start and end indexes in a single tuple. Since the match method only checks whether the RE matches at the start of a string, start() will always be zero. However, the search method of RegexObject instances scans through the string, so the match may not start at zero in that case.
#!python
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<re.MatchObject instance at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)
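The difference between match() and search() can be sketched as a script. This is a minimal illustration, assuming a current Python 3 interpreter; the `summary` dictionary is just a convenient way to collect the values and is not part of the re module.

```python
import re

p = re.compile('[a-z]+')
text = '::: message'

# match() only tries position 0, where ':' is not in [a-z].
at_start = p.match(text)

# search() scans forward until the RE matches somewhere in the string.
found = p.search(text)

summary = {
    'match': at_start,                           # None
    'group': found.group(),                      # 'message'
    'start_end': (found.start(), found.end()),   # (4, 11)
    'span': found.span(),                        # (4, 11)
}
```

Here span() is the same pair that start() and end() return separately, and start() is nonzero because the match was found by scanning.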
In actual programs, the most common style is to store the MatchObject in a variable and then check whether it is None. This usually looks like:
#!python
p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'
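One way to package this idiom is a small helper function. The sketch below assumes Python 3; `describe_match` is a hypothetical name chosen for illustration, and the pattern [a-z]+ is reused from the earlier examples rather than taken from the elided `re.compile( ... )` call.

```python
import re


def describe_match(pattern, text):
    """Hypothetical helper: report whether pattern matches at the start of text."""
    m = pattern.match(text)
    if m is None:
        return 'No match'
    return 'Match found: ' + m.group()


p = re.compile('[a-z]+')
print(describe_match(p, 'tempo'))        # Match found: tempo
print(describe_match(p, '::: message'))  # No match
```

Comparing against None explicitly behaves the same as `if m:` here, because a match object is always true.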
Two RegexObject methods return all of the matches for a pattern. findall() returns a list of matching strings:
#!python
>>> p = re.compile('\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
findall() has to create the entire list before it can return it as the result. In Python 2.2, the finditer() method is also available; it returns the MatchObject instances one at a time through an iterator.
#!python
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)
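The contrast between the two methods can be sketched side by side. This is an illustrative script for a current Python 3 interpreter, where the built-in next() is used to pull a single item from the iterator; the variable names are made up for the example.

```python
import re

p = re.compile(r'\d+')
text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# findall() builds the complete list of matched strings up front.
all_numbers = p.findall(text)            # ['12', '11', '10']

# finditer() hands back match objects one at a time.
iterator = p.finditer(text)
first = next(iterator)                   # only the first match is consumed
first_span = first.span()                # (0, 2)

# The remaining matches are produced as the loop asks for them.
remaining_spans = [m.span() for m in iterator]   # [(22, 24), (40, 42)]
```

Because finditer() yields match objects rather than strings, each result also carries its position, which findall() does not report.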
Module-level functions
You do not have to create a RegexObject and call its methods; the re module also provides top-level functions called match(), search(), sub(), and so forth. These functions take the same arguments as the corresponding RegexObject method, with the RE string added as the first argument, and still return either None or a MatchObject instance.
#!python
>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<re.MatchObject instance at 80c5978>
Under the hood, these functions simply create a RegexObject for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE are faster.
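The equivalence between the two calling styles can be sketched as follows. This assumes a current Python 3 interpreter; `from_re` and the other variable names are illustrative, and the script only compares results, not the internal cache.

```python
import re

line = 'From amk Thu May 14 19:12:10 1998'

# Module-level call: the RE string is the first argument.
m1 = re.match(r'From\s+', line)

# Two-step form: compile once, then call the method on the compiled object.
from_re = re.compile(r'From\s+')
m2 = from_re.match(line)

# Both styles report the same matched text and position.
same_result = (m1.group() == m2.group()) and (m1.span() == m2.span())

# Both styles also return None when the RE does not match.
miss = re.match(r'From\s+', 'Fromage amk')
```

The compiled form is handy when the same RE is applied to many lines, since the object can be passed around and reused by name.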
Should you use these module-level functions, or should you get the RegexObject and call its methods yourself? That choice depends on how frequently the RE will be used, and on your personal coding style. If a RE is used at only one point in the code, then the module-level functions are probably more convenient. If a program contains many regular expressions, or reuses the same ones in several locations, then it may be worthwhile to collect all the definitions in one place, in a section of code that compiles all the REs ahead of time. To take an example from the standard library, here is an extract from xmllib.py:
#!python
ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )
I usually prefer to use a compiled object, even if it is used only once, but few people would be as much of a purist about this as I am.
Compile flags
Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE and a short, one-letter form such as I. (If you are familiar with Perl's pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
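Combining flags can be sketched with a short script. It assumes a current Python 3 interpreter; the pattern ^spam$ and the sample text are invented for illustration and do not come from the original text.

```python
import re

# The one-letter names are aliases for the long names.
assert re.I == re.IGNORECASE
assert re.M == re.MULTILINE
assert re.X == re.VERBOSE

text = 'eggs\nSPAM\nham'   # hypothetical three-line sample

# Both flags together: ignore case, and let ^ and $ match at line boundaries.
both = re.compile('^spam$', re.I | re.M).search(text)
found = both.group() if both else None               # 'SPAM'

# With only one of the two flags, the same search fails.
only_i = re.compile('^spam$', re.I).search(text)     # None: ^ is tied to the string start
only_m = re.compile('^spam$', re.M).search(text)     # None: case differs
```

The match succeeds only when both behaviours are switched on, which is what OR-ing the two flag values requests.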
Here is a list of the available flags, followed by a more detailed explanation of each one.
I
IGNORECASE
Performs case-insensitive matching; character classes and literal strings match letters regardless of case. For example, [A-Z] will also match lowercase letters, and Spam will match "Spam", "spam", or "spAM". This lowercasing does not take the current locale into account.
L
LOCALE
Makes \w, \W, \b, and \B dependent on the current locale.
Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you are processing French text, you would want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z]; it will not match "é" or "ç". If your system is configured properly and a French locale is selected, certain C functions will tell the program that "é" should also be considered a letter. Setting the LOCALE flag when compiling a regular expression will cause the resulting compiled object to use these C functions for \w; this is slower, but also enables \w+ to match French words as you would expect.
M
MULTILINE
(^ and $ have not been explained yet; they will be introduced in section 4.1.)
Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches either at the end of the string or at the end of each line (immediately preceding each newline).
S
DOTALL
Makes the "." special character match any character at all, including a newline; without this flag, "." will match anything except a newline.
X
VERBOSE
This flag lets you write regular expressions that are more readable by giving you more flexibility in how you format them. When this flag is specified, whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly. It also lets you put comments within a RE that will be ignored by the engine; comments are marked by a "#" that is neither in a character class nor preceded by an unescaped backslash.
For example, here is a RE that uses re.VERBOSE; see how much easier it is to read?
#!python
charref = re.compile(r"""
 &[#]                          # Start of a numeric entity reference
 (
     [0-9]+[^0-9]              # Decimal form
   | 0[0-7]+[^0-7]             # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)
Without the verbose setting, the RE would look like this:
#!python
charref = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")
In the example above, Python's automatic concatenation of string literals is used to break the RE into smaller pieces, but it is still harder to understand than the version using re.VERBOSE.
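The flag descriptions in this section can be checked with a short script. This is a sketch under stated assumptions: it targets a current Python 3 interpreter, the sample strings are invented, and the two charref patterns are written out here so the block is self-contained. LOCALE is left out because its effect depends on how the system is configured.

```python
import re

# IGNORECASE: letters match regardless of case.
ci = re.compile('spam', re.IGNORECASE)
ci_hits = [bool(ci.match(s)) for s in ('Spam', 'spam', 'spAM')]

# MULTILINE: ^ also matches at the start of each line.
text = 'first line\nsecond line'
default_starts = re.findall(r'^\w+', text)                  # ['first']
multiline_starts = re.findall(r'^\w+', text, re.MULTILINE)  # ['first', 'second']

# DOTALL: '.' also matches a newline.
no_dotall = re.match('a.b', 'a\nb')                 # None
with_dotall = re.match('a.b', 'a\nb', re.DOTALL)    # matches 'a\nb'

# VERBOSE: whitespace and comments in the pattern are ignored.
verbose = re.compile(r"""
 &[#]                          # Start of a numeric entity reference
 (
     [0-9]+[^0-9]              # Decimal form
   | 0[0-7]+[^0-7]             # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)
compact = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")


def matched_text(pattern, s):
    """Hypothetical helper: matched substring, or None if nothing matches."""
    m = pattern.search(s)
    return m.group() if m else None


samples = ['&#65;', '&#x41;', '&#0101;', '&amp;', 'no entity here']
agree = all(matched_text(verbose, s) == matched_text(compact, s) for s in samples)
```

On these samples the verbose and compact patterns find the same text, so the extra layout and comments change only how the RE reads, not what it matches.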