Python Regular Expressions (2)


This article was translated from official documentation: Regular Expression HOWTO

Previous article: Python Regular Expressions (1)


======================================================================================

3. Using Regular Expressions

Now that we've looked at some simple regular expressions, how do we actually use them in Python? The re module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.

----------------------------------------------------------------------------------------- -------------------------------------------------------------

3.1. Compiling Regular Expressions

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

>>> import re
>>> p = re.compile('ab*')
>>> p
re.compile('ab*')
The re.compile() function also accepts an optional flags argument, used to enable various special features and syntax variations. We'll go over the available settings later; for now, a single example will do:

>>> p = re.compile('ab*', re.IGNORECASE)
The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions aren't part of the core Python language, and no special syntax was created for expressing them. (There are applications that don't need REs at all, so there's no need to put them into the core of the language.) Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.

Putting REs in strings keeps the Python language simpler, but it has one disadvantage, which is the topic of the next section.
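As a minimal sketch of what a compiled pattern object offers, the snippet below reuses one object for both a search and a substitution. The sample strings and the variable names has_match and replaced are made up for illustration.

```python
import re

# Compile the pattern once; the resulting object carries the matching methods.
p = re.compile('ab*')

# search() looks for the pattern anywhere in the string.
has_match = p.search('xxabbbyy') is not None

# sub() replaces every non-overlapping match with the given replacement.
replaced = p.sub('-', 'abb cab a')

print(has_match)   # True
print(replaced)    # - c- -
```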
------------------------- ------------------------- ------------------------- ------------------------- ------------------------- -------------------------

3.2. The Backslash Plague
As stated earlier, regular expressions use the backslash character \ to indicate special forms (such as \s) or to allow special characters to be used without invoking their special meaning (for example, \* stands for a literal asterisk). This conflicts with Python's use of the same character for the same purpose in string literals.


Suppose you want to write an RE that matches the string \section, which might be found in a LaTeX file. To work out what to write in the program code, start with the string to be matched.

Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, which gives the string \\section. This is the string that must be passed to re.compile(). However, to express it as a Python string literal, both backslashes must be escaped again, which gives "\\\\section".

Characters         Stage
\section           Text string to be matched
\\section          Backslash escaped for re.compile()
"\\\\section"      Backslashes escaped again for a Python string literal
In short, to match a literal backslash, you have to write '\\\\' as the RE string, because the regular expression must be \\, and each backslash must in turn be written as \\ inside a regular Python string literal.

In REs that use backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.


The solution is to use Python's raw string notation for regular expressions. A raw string is a string literal prefixed with the letter r, in which backslashes are not handled in any special way. For example, r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions are usually written in Python code using raw string notation.
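The following sketch checks that the doubled-up literal and the raw literal describe the same pattern, and that both find \section in a line of text. The sample text and variable names are illustrative only.

```python
import re

# A line such as might appear in a LaTeX file (raw literal keeps the backslash).
text = r'see \section{Intro} for details'

plain = re.compile("\\\\section")   # four backslashes in an ordinary literal
raw = re.compile(r"\\section")      # two backslashes in a raw literal

# Both literals produce exactly the same pattern string.
same_pattern = plain.pattern == raw.pattern

m = raw.search(text)

# "\n" is one character (a newline); r"\n" is two (backslash and 'n').
lengths = (len("\n"), len(r"\n"))

print(same_pattern)   # True
print(m.span())       # (4, 12)
print(lengths)        # (1, 2)
```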

Regular string     Raw string
"ab*"              r"ab*"
"\\\\section"      r"\\section"
"\\w+\\s+\\1"      r"\w+\s+\1"
------------------------------------------------------------------------------------------------------------------------------------------------------

3.3. Performing Matches
Once you have an object representing a compiled regular expression, what do you do with it? Pattern objects have several methods and attributes. Only the most frequently used ones are covered here; consult the re module documentation for a complete listing.

Method/Attribute   Purpose
match()            Determine if the RE matches at the beginning of the string
search()           Scan through a string, looking for the first location where the RE matches
findall()          Find all substrings where the RE matches, and return them as a list
finditer()         Find all substrings where the RE matches, and return them as an iterator
match() and search() return None if no match can be found.

If they're successful, a match object is returned, containing information about the match: where it starts and ends, the substring it matched, and more.


You can learn about this by experimenting interactively with the re module. If you have Tkinter available, you may also want to look at Tools/demo/redemo.py, a demonstration program included with the Python distribution. It lets you enter REs and strings, and displays whether the RE matches or fails. redemo.py can be quite useful when trying to debug a complicated RE. Phil Schwartz's Kodos is also a useful interactive tool for developing and testing RE patterns.


This HOWTO uses the standard Python interpreter for its examples.
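To make the differences between the four methods concrete, here is a small sketch that runs all of them against the same string. The pattern and the sample text are chosen only for illustration.

```python
import re

p = re.compile(r'[a-z]+')
text = '123 abc def'

at_start = p.match(text)      # None: the string begins with digits
first = p.search(text)        # first match anywhere in the string
all_words = p.findall(text)   # list of matched substrings
spans = [m.span() for m in p.finditer(text)]   # match objects, one at a time

print(at_start)                      # None
print(first.group(), first.span())   # abc (4, 7)
print(all_words)                     # ['abc', 'def']
print(spans)                         # [(4, 7), (8, 11)]
```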

First, run the Python interpreter, import the re module, and compile an RE:

>>> import re
>>> p = re.compile('[a-z]+')
>>> p
re.compile('[a-z]+')
Now you can try matching various strings against the RE [a-z]+. An empty string shouldn't match at all, since + means "one or more repetitions". match() should return None in this case.

None causes the interpreter to print no output, so you can explicitly call print() on the result of match() to make this clear.

>>> p.match("")
>>> print(p.match(""))
None
Next, let's try it on a string that it should match, such as "tempo".

In this case, match() returns a match object, so you should store the result in a variable for later use.

>>> m = p.match('tempo')
>>> m
<_sre.SRE_Match object; span=(0, 5), match='tempo'>

Now you can query the match object for information about the matching string.

Match object instances also have several methods and attributes. The most important ones are:

Method/Attribute   Purpose
group()            Return the string matched by the RE
start()            Return the starting position of the match
end()              Return the ending position of the match
span()             Return a tuple containing the (start, end) positions of the match

Trying these methods will soon clarify their meaning:

>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match, and span() returns both indexes in a single tuple. Since the match() method only checks whether the RE matches at the start of a string, start() will always be zero.

The search() method, however, scans through the whole string, so the match may not start at zero in that case.

>>> print(p.match(':::message'))
None
>>> m = p.search(':::message')
>>> print(m)
<_sre.SRE_Match object; span=(3, 10), match='message'>
>>> m.group()
'message'
>>> m.span()
(3, 10)
In actual programs, the most common style is to store the match object in a variable and then check whether it is None.

This usually looks like:

p = re.compile( ... )
m = p.match('string goes here')
if m:
    print('Match found: ', m.group())
else:
    print('No match')
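A runnable variant of this idiom might look like the sketch below. The pattern [a-z]+ and the helper name describe are stand-ins chosen for illustration, since the snippet above leaves the pattern unspecified.

```python
import re

# Hypothetical pattern standing in for the elided one above.
WORD = re.compile(r'[a-z]+')

def describe(text):
    """Return a short message saying whether text starts with a match."""
    m = WORD.match(text)
    if m:                      # a match object is always truthy
        return 'Match found: ' + m.group()
    return 'No match'          # match() returned None

print(describe('tempo'))        # Match found: tempo
print(describe(':::message'))   # No match
```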
Two pattern methods return all of the matches for a pattern. findall() returns a list of matching strings:
>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
findall() has to create the entire list before it can be returned as the result.

The finditer() method instead returns a sequence of match objects as an iterator (translator's note: an iterator is more memory-efficient).

>>> iterator = p.finditer('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
>>> iterator
<callable_iterator object at 0x036A2110>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(40, 42)
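The difference can be sketched as follows: findall() hands back plain strings all at once, while finditer() produces match objects one at a time, so positions remain available. The variable names are illustrative.

```python
import re

p = re.compile(r'\d+')
text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

numbers = p.findall(text)   # the whole list is built before it is returned

it = p.finditer(text)
first = next(it)            # only the first match has been produced so far
rest = [(m.group(), m.start()) for m in it]   # consume the remaining matches

print(numbers)        # ['12', '11', '10']
print(first.span())   # (0, 2)
print(rest)           # [('11', 22), ('10', 40)]
```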
------------------------------------------------------------------------------------------------------------------------------------------------------
3.4. Module-Level Functions
You don't have to create a pattern object and call its methods; the re module also provides top-level functions called match(), search(), findall(), sub(), and so forth. These functions take the same arguments as the corresponding pattern method, with the RE string added as the first argument, and still return either None or a match object.


>>> print(re.match(r'From\s+', 'Fromage amk'))
None
>>> print(re.match(r'From\S+', 'Fromage amk'))
<_sre.SRE_Match object; span=(0, 7), match='Fromage'>
>>> print(re.match(r'From\S+', 'Fromage amk').group())
Fromage
>>> re.match(r'From\s+', 'From amk Thu 19:12:10 1998')
<_sre.SRE_Match object; span=(0, 5), match='From '>
Under the hood, these functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE won't need to parse the pattern again and will run quickly.

Should you use these module-level functions, or should you compile the pattern and call its methods yourself? If you're using a regex within a loop, pre-compiling it will save a few function calls.

Outside of loops, there's not much difference between the two, thanks to the internal cache.
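As a rough sketch of the two styles, the loop below runs the same pattern through the module-level function and through a pre-compiled object. The sample lines are invented, and the results are identical either way; only the per-call overhead differs.

```python
import re

lines = ['From amk', 'Fromage amk', 'From  guido']
pattern = r'From\s+'

# Module-level function: the pattern string is looked up in the cache on each call.
via_module = [bool(re.match(pattern, line)) for line in lines]

# Pre-compiled object: the pattern is compiled once, outside the loop.
compiled = re.compile(pattern)
via_object = [bool(compiled.match(line)) for line in lines]

print(via_module)   # [True, False, True]
print(via_object)   # [True, False, True]
```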
------------------------- ------------------------- ------------------------- ------------------------- ------------------------- -------------------------
3.5. Compilation Flags
Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names: a long name such as IGNORECASE and a short, one-letter form such as I. (If you're familiar with Perl's pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.)

Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
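A minimal sketch of combining flags, using an invented three-line string: the pattern only finds SPAM when both case-insensitivity and per-line anchoring are switched on. It also checks that the short and long flag names are interchangeable.

```python
import re

text = 'eggs\nSPAM\nham'

# Short names, OR-ed together.
p = re.compile(r'^spam$', re.I | re.M)
found = p.findall(text)

# The long names refer to the same flags.
long_names = re.compile(r'^spam$', re.IGNORECASE | re.MULTILINE)
same_flags = p.flags == long_names.flags

# With only one of the two flags, nothing matches.
only_i = re.findall(r'^spam$', text, re.I)
only_m = re.findall(r'^spam$', text, re.M)

print(found)        # ['SPAM']
print(same_flags)   # True
print(only_i)       # []
print(only_m)       # []
```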



Here's a table of the available flags, followed by a more detailed explanation of each one:

Flag                          Meaning
ASCII, A                      Makes escapes such as \w, \b and \s match only ASCII characters
DOTALL, S                     Makes . match any character, including newlines
IGNORECASE, I                 Performs case-insensitive matches
LOCALE, L                     Performs locale-aware matches
MULTILINE, M                  Multi-line matching, affecting ^ and $
VERBOSE, X (for 'extended')   Enables verbose REs, which can be organized more cleanly and understandably

I
IGNORECASE

Perform case-insensitive matching: character classes and literal strings match letters regardless of case. For example, [A-Z] will match lowercase letters too, and Spam will match Spam, spam, or spAM. This lowercasing doesn't take the current locale into account unless you also set the LOCALE flag.
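A short sketch of this behaviour, with an invented word list: the literal spam matches every casing, and the class [A-Z] accepts lowercase input once the flag is set.

```python
import re

p = re.compile(r'spam', re.IGNORECASE)

# fullmatch() requires the whole word to match the pattern.
words = ['Spam', 'spam', 'spAM', 'span']
hits = [w for w in words if p.fullmatch(w)]

# An uppercase-only character class also matches lowercase letters.
upper_class = re.compile(r'[A-Z]+', re.I).match('lower').group()

print(hits)          # ['Spam', 'spam', 'spAM']
print(upper_class)   # lower
```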



L
LOCALE

Make \w, \W, \b, and \B dependent on the current locale instead of the Unicode database.

Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you're processing French text, you'd want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z]; it won't match 'é' or 'ç'. If your system is configured properly and a French locale is selected, certain C functions will tell the program that 'é' should also be considered a letter. Setting the LOCALE flag when compiling a regular expression makes the resulting compiled object use these C functions for \w. This is slower, but it lets \w+ match French words as you'd expect.

M
MULTILINE

(^ and $ haven't been explained yet; they are introduced later.)

Normally ^ matches only at the beginning of the string, and $ matches only at the end of the string. When this flag is set, ^ also matches at the beginning of each line within the string and, similarly, $ also matches at the end of each line.
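The effect can be seen in a small sketch with an invented three-line string: without the flag, ^ anchors only at the very start; with it, every line start (and line end, for $) qualifies.

```python
import re

text = 'first line\nsecond line\nthird line'

# Default: ^ matches only at the start of the whole string.
without_flag = re.findall(r'^\w+', text)

# MULTILINE: ^ also matches right after each newline.
with_flag = re.findall(r'^\w+', text, re.MULTILINE)

# Likewise, $ matches at the end of every line.
line_ends = re.findall(r'\w+$', text, re.M)

print(without_flag)   # ['first']
print(with_flag)      # ['first', 'second', 'third']
print(line_ends)      # ['line', 'line', 'line']
```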

S
DOTALL

Makes the '.' special character match any character at all, including a newline.

Without this flag, '.' matches anything except a newline.
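A two-line string is enough to sketch the difference; the sample text is made up.

```python
import re

text = 'one\ntwo'

# Default: '.' stops at the newline.
default = re.match(r'.+', text).group()

# DOTALL: '.' also consumes the newline, so the match runs to the end.
dotall = re.match(r'.+', text, re.DOTALL).group()

print(repr(default))   # 'one'
print(repr(dotall))    # 'one\ntwo'
```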

A
ASCII

Makes \w, \W, \b, \B, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns.
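As a sketch of the difference on a str pattern, the accented letter below counts as a word character by default but not under re.ASCII. The sample text is invented, and the accent is written as an escape so the snippet does not depend on the file's encoding.

```python
import re

text = 'caf\u00e9 42'   # 'café 42'

# Default for str patterns: \w uses the Unicode definition of a word character.
unicode_words = re.findall(r'\w+', text)

# ASCII flag: \w is limited to [a-zA-Z0-9_], so the word stops before the accent.
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)   # ['café', '42']
print(ascii_words)     # ['caf', '42']
```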

X
VERBOSE

This flag lets you write regular expressions that are more readable by giving you more flexibility in how you format them. When this flag is set, whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly.

This flag also lets you put comments within an RE: a # and everything after it on the line are ignored by the engine, unless the # is in a character class or preceded by an unescaped backslash.

Here's an RE that uses re.VERBOSE:

>>> charref = re.compile(r"""
...  &[#]                # Start of a numeric entity reference
...  (
...      0[0-7]+         # Octal form
...    | [0-9]+          # Decimal form
...    | x[0-9a-fA-F]+   # Hexadecimal form
...  )
...  ;                   # Trailing semicolon
... """, re.VERBOSE)

Without the verbose setting, the RE would look like this:

>>> charref = re.compile("&#(0[0-7]+"
...                      "|[0-9]+"
...                      "|x[0-9a-fA-F]+);")
In the example above, Python's automatic concatenation of string literals has been used to break the RE up into smaller pieces, but it's still more difficult to understand than the version using re.VERBOSE.
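One way to convince yourself that the two spellings are equivalent is to run both against the same inputs, as in this sketch. The sample strings and the names verbose, compact and results are chosen for illustration.

```python
import re

# Verbose form: whitespace and comments are ignored by the engine.
verbose = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

# Compact form of the same pattern.
compact = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

samples = ['&#65;', '&#x41;', '&#0101;', '&#;', '&amp;']
results = [(s, bool(verbose.fullmatch(s)), bool(compact.fullmatch(s)))
           for s in samples]

for sample, v, c in results:
    print(sample, v, c)   # the two booleans agree for every sample
```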
