Python re module: regular expression operations

Source: Internet
Author: User
Tags: character classes


This module provides regular expression matching operations similar to those found in Perl. It works on Unicode strings as well.

 

Regular expressions use the backslash "\" to introduce special forms and to escape special characters, which collides with Python's own use of the backslash in string literals. To match a literal backslash, the regular expression itself must be "\\", and because every backslash in an ordinary Python string literal must also be escaped, the pattern has to be written as "\\\\".

 

Writing that is awkward. To make regular expressions more readable, Python provides the raw string. A raw string uses 'r' as the string prefix. For example, r"\n" is the two characters "\" and "n" rather than a line break. This form is recommended when writing regular expressions in Python. Note, however, that raw strings are best avoided for file paths, where there is a trap: a raw string literal cannot end with a single backslash.
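As a small illustration of the escaping rules described above, the following sketch matches a literal backslash both ways. The sample string is made up for the example.

```python
import re

text = "C:\\temp"  # the 7-character string C:\temp

# In an ordinary string literal, a literal backslash needs four backslashes.
plain = re.search("\\\\", text)

# In a raw string literal, two are enough.
raw = re.search(r"\\", text)

print(plain.group() == raw.group())  # True: both matched the single "\"
print(len(r"\n"), len("\n"))         # 2 1: raw "\n" is two characters
```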

 

Most regular expression operations are available both as module-level functions and as RegexObject methods. The module-level functions are shortcuts that do not require you to compile a regular expression object first, but they lack some useful fine-tuning parameters.

 

1. Regular expression syntax

To save space, I will not describe it here.

 

2. Differences between match and search

Python provides two different primitive operations: match and search. match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (which is what Perl does by default).

 

Note: even when the regular expression starts with '^', match may differ from search. In MULTILINE mode '^' also matches immediately after each line break, whereas match succeeds only if the pattern matches at the very beginning of the string, or at the position given by the optional pos parameter. For example:

>>> import re
>>> re.match("c", "abcdef")
>>> re.search("c", "abcdef")
<_sre.SRE_Match object at 0x00A9A988>

>>> re.match("c", "cabcdef")
<_sre.SRE_Match object at 0x00A9AB80>

>>> re.search("c", "cabcdef")
<_sre.SRE_Match object at 0x00AF1720>

>>> pattern = re.compile("c")
>>> pattern.match("abcdef")
>>> pattern.match("abcdef", 1)
>>> pattern.match("abcdef", 2)
<_sre.SRE_Match object at 0x00A9AB80>
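The difference is easiest to see with a multi-line string. The sketch below uses an invented two-line sample text and shows that search with '^' in MULTILINE mode can find something that match never will, and that the pos argument of a compiled pattern moves the starting point.

```python
import re

text = "first\nsecond"

# In MULTILINE mode '^' also matches right after the line break,
# so search finds "second" at index 6.
found = re.search("^second", text, re.MULTILINE)

# match only ever tries the very beginning of the string.
not_found = re.match("^second", text, re.MULTILINE)

# The match() method of a compiled pattern accepts pos,
# the index at which matching starts.
pattern = re.compile("c")
at_pos = pattern.match("abcdef", 2)

print(found.start())   # 6
print(not_found)       # None
print(at_pos.span())   # (2, 3)
```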

3. Module Content

re.compile(pattern, flags=0)

 

Compiles a regular expression pattern and returns a RegexObject, whose match() and search() methods can then be used for matching.

 

prog = re.compile(pattern)

result = prog.match(string)

and

result = re.match(pattern, string)

are equivalent.

 

The first form lets the compiled regular expression be reused, which is more efficient when the same expression is used several times.
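A minimal sketch of that reuse, assuming a made-up list of input lines: the pattern is compiled once and then applied to each line.

```python
import re

# Compile once...
word = re.compile(r"\w+")

# ...and reuse the compiled object for every string.
lines = ["alpha beta", "  gamma", ""]
first_words = []
for line in lines:
    m = word.search(line)
    first_words.append(m.group() if m else None)

print(first_words)  # ['alpha', 'gamma', None]

# The module-level shortcut gives the same result for a single call.
same = re.match(r"\w+", "alpha beta").group() == word.match("alpha beta").group()
```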

 

re.search(pattern, string, flags=0)

Scans the string for the first location where the regular expression matches and returns a match object (_sre.SRE_Match). If no position in the string matches, None is returned.

 

re.match(pattern, string, flags=0)

Checks whether the beginning of the string matches the regular expression and returns a match object (_sre.SRE_Match). If it does not match, None is returned.
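Because both functions return None on failure, the result is normally tested before it is used. A short sketch with an invented sample string:

```python
import re

text = "order 66 shipped"

m = re.search(r"\d+", text)
number = m.group() if m is not None else None

# There are no digits at the start of the string, so match returns None.
missing = re.match(r"\d+", text)

print(number)   # 66
print(missing)  # None
```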

 

re.split(pattern, string, maxsplit=0)

 

Splits the string at each occurrence of the pattern. If the pattern is enclosed in capturing parentheses, the matched separators are included in the returned list as well. maxsplit is the maximum number of splits: with maxsplit=1 the string is split once and the remainder is returned as the last element. The default value is 0, which means no limit.

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3b9', flags=re.IGNORECASE)

 

Note: I am using Python 2.6, and its source code shows that split() has no flags parameter there; it was only added in 2.7. I have run into this kind of problem more than once: the official documentation does not always match the source code of the installed version. If you hit an unexpected exception, look for the cause in the source code.

 

If the pattern matches at the start or end of the string, the returned list starts or ends with an empty string.

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

 

If the pattern does not match anywhere, a list containing the whole string is returned.

>>> re.split("a", "bbb")
['bbb']
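In practice the empty strings at the ends are often unwanted. The following sketch shows one possible way to drop them and to split on several separator characters at once. The sample inputs are assumptions made for the example.

```python
import re

# Split on runs of commas, semicolons, or whitespace.
fields = re.split(r"[,;\s]+", "a, b;c  d")
print(fields)  # ['a', 'b', 'c', 'd']

# Drop the empty string produced by the trailing "." separator.
words = [w for w in re.split(r"\W+", "Words, words, words.") if w]
print(words)   # ['Words', 'words', 'words']
```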

 

re.findall(pattern, string, flags=0)

 

Finds all non-overlapping substrings that match the RE and returns them as a list. The string is scanned from left to right and the matches are returned in the order found. If there is no match, an empty list is returned.

>>> re.findall("a", "bcdef")
[]

>>> re.findall(r"\d+", "12a32bc43jf3")
['12', '32', '43', '3']
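One detail worth knowing when using findall is that capturing groups change what the list contains: with one group the list holds the group text, and with several groups it holds tuples. A brief sketch with invented sample strings:

```python
import re

# No groups: the whole match is returned.
whole = re.findall(r"\d+", "12a32bc43jf3")

# One group: only the text of the group is returned.
sizes = re.findall(r"(\d+)px", "10px 20px")

# Several groups: each item is a tuple of the groups.
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22")

print(whole)  # ['12', '32', '43', '3']
print(sizes)  # ['10', '20']
print(pairs)  # [('a', '1'), ('b', '22')]
```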

 

re.finditer(pattern, string, flags=0)

 

Finds all the substrings that match the RE and returns them as an iterator of match objects. The matches are returned in order from left to right. If there is no match, the iterator is empty.

>>> it = re.finditer(r"\d+", "12a32bc43jf3")
>>> for match in it:
...     print(match.group())
...
12
32
43
3
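Since finditer yields match objects rather than strings, it is convenient when the position of each match is needed too. A small sketch using the same sample string as above:

```python
import re

# Collect (start index, matched text) for every run of digits.
positions = [(m.start(), m.group())
             for m in re.finditer(r"\d+", "12a32bc43jf3")]

print(positions)  # [(0, '12'), (3, '32'), (7, '43'), (11, '3')]

# With no match the iterator simply yields nothing.
nothing = list(re.finditer("z", "abc"))
```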

 

re.sub(pattern, repl, string, count=0, flags=0)

Finds all substrings that match the RE and replaces them with repl. The optional parameter count is the maximum number of pattern occurrences to replace. It must be a non-negative integer. The default value is 0, which means that all matches are replaced. If there is no match, the string is returned unchanged.
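The effect of count, and the unchanged result when nothing matches, can be sketched as follows. The sample string is reused from the findall example and the replacement "#" is arbitrary.

```python
import re

text = "12a32bc43jf3"

all_replaced = re.sub(r"\d+", "#", text)           # every run of digits
first_two = re.sub(r"\d+", "#", text, count=2)     # only the first two
unchanged = re.sub("z", "#", text)                 # no match at all

print(all_replaced)  # #a#bc#jf#
print(first_two)     # #a#bc43jf3
print(unchanged)     # 12a32bc43jf3
```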

 

re.subn(pattern, repl, string, count=0, flags=0)

Works the same as re.sub, but returns a two-element tuple containing the new string and the number of replacements made.
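A short sketch of unpacking that tuple, again with an arbitrary sample string:

```python
import re

new_text, replaced = re.subn(r"\d+", "#", "12a32bc43jf3")
print(new_text, replaced)  # #a#bc#jf# 4

# With no match, the string is unchanged and the count is 0.
same_text, none_replaced = re.subn("z", "#", "abc")
```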

 

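re.escape(string)

Returns the string with its non-alphanumeric characters backslash-escaped, so that it can be matched literally even if it contains regular expression metacharacters.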
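As an illustration, the sketch below searches for a literal string that happens to contain metacharacters. The literal and the text being searched are invented. The exact escaped form differs between Python versions, so it is not shown.

```python
import re

literal = "1+1=2?"
haystack = "is 1+1=2? yes"

# Escaped: "+" and "?" are treated as ordinary characters.
hit = re.search(re.escape(literal), haystack)

# Unescaped: "1+" means "one or more 1s" and "2?" means "an optional 2",
# so the pattern does not match this text.
miss = re.search(literal, haystack)

print(hit.group())  # 1+1=2?
print(miss)         # None
```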

 

re.purge()

Clears the regular expression cache.

 

4. Regular Expression object

 

re.RegexObject

re.compile() returns a RegexObject.

re.MatchObject

 

group() returns the string matched by the RE.

start() returns the position where the match starts.

end() returns the position where the match ends.

span() returns a tuple (start, end) containing the positions of the match.
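The four methods can be tried on a single match object. The sample string is made up for this sketch.

```python
import re

text = "abc123def"
m = re.search(r"\d+", text)

print(m.group())  # 123
print(m.start())  # 3
print(m.end())    # 6
print(m.span())   # (3, 6)

# start() and end() can be used directly as slice indices.
sliced = text[m.start():m.end()]
```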

 

5. Compilation flags

Compilation flags let you modify some aspects of how regular expressions work. Each flag is available in the re module under two names: a long name such as IGNORECASE and a short, one-letter form such as I. (If you are familiar with Perl's pattern modifiers, the one-letter forms use the same letters; for example, the short form of re.VERBOSE is re.X.) Multiple flags can be specified by bitwise OR-ing them. For example, re.I | re.M sets both the I and M flags:

I 
IGNORECASE

Performs case-insensitive matching. Character classes and literal strings match letters regardless of case. For example, [A-Z] also matches lowercase letters, and Spam matches "Spam", "spam", or "spAM". This lowercasing does not take the current locale into account.

L 
LOCALE

Makes \w, \W, \b, and \B depend on the current locale.

Locales are a feature of the C library intended to help with writing programs that deal with different languages. For example, if you are processing French text, you would want \w+ to match words, but \w only matches the character class [A-Za-z]; it does not match "é" or "ç". If your system is configured appropriately and a French locale is selected, certain C functions will tell the program that "é" should also be considered a letter. When the LOCALE flag is used to compile a regular expression, the resulting compiled object uses these C functions for \w. This is slower, but it lets \w+ match French words as expected.

M 
MULTILINE

(^ and $ have not been explained yet; they will be introduced in Section 4.1.)

Usually "^" matches only at the start of the string, and $ matches only at the end of the string and immediately before the line break (if any) at the end of the string. When this flag is specified, "^" matches at the start of the string and at the start of each line within the string, immediately after each line break. Similarly, the $ metacharacter matches at the end of the string and at the end of each line (immediately before each line break).

S 
DOTALL

Makes the "." special character match any character at all, including a line break; without this flag, "." matches any character except a line break.

X 
VERBOSE

This flag gives you a more flexible format so that you can write regular expressions that are easier to read. When the flag is specified, whitespace in the RE string is ignored unless it is in a character class or preceded by a backslash; this lets you organize and indent the RE more clearly. It also lets you put comments in the RE, which the engine ignores. Comments are marked by a "#" that is neither in a character class nor preceded by a backslash.
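The flags described above can be compared side by side. The following is a minimal sketch; the sample text and the patterns are assumptions chosen only to make each flag's effect visible. LOCALE is left out because its result depends on the system configuration.

```python
import re

# IGNORECASE: literal strings and character classes ignore case.
spam = re.findall("spam", "Spam spam spAM", re.I)
print(spam)  # ['Spam', 'spam', 'spAM']

text = "one\ntwo\nthree"

# MULTILINE: "^" also matches at the start of each line.
first_only = re.findall(r"^\w+", text)
every_line = re.findall(r"^\w+", text, re.M)
print(first_only)  # ['one']
print(every_line)  # ['one', 'two', 'three']

# DOTALL: "." also matches the line break.
without_s = re.search("one.two", text)
with_s = re.search("one.two", text, re.S)
print(without_s)             # None
print(with_s.group() == "one\ntwo")  # True

# VERBOSE: whitespace and "#" comments in the pattern are ignored.
number = re.compile(r"""
    \d+          # integer part
    (\.\d+)?     # optional fractional part
""", re.X)
print(number.search("pi is 3.14").group())  # 3.14

# Flags are combined with the bitwise OR operator.
combined = re.findall(r"^t\w+", text.upper(), re.I | re.M)
print(combined)  # ['TWO', 'THREE']
```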
