Python Regular Expressions (the Escaping Problem)
Let me start with something a little embarrassing. While writing a tool for grabbing Xiami music previews, I ran into a problem: the saved files are named after the track titles, so whenever a title contains characters that are illegal in file names, such as "logging handler/out border" (yes, I'm looking at you →_→ Windows), saving fails. So I borrowed Thunder's solution: replace every invalid character with an underscore.
That is where regular expressions came in. After some searching, I wrote the following function:
    import re

    def sanitize_filename(filename):
        return re.sub('[\/:*?<>|]', '_', filename)
Only recently did I notice several problems with this function:
- Python is not the shell: inside both single and double quotes the backslash is an escape character. Escape sequences that mean nothing to Python, such as \/, are left unchanged.
- Even so, sanitize_filename('\\/:*?<>|') still returns \_______ rather than all underscores.
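To see why the backslash survives, here is a minimal sketch of what the regex engine actually receives. Newer Python 3 releases warn about the unknown escape \/ in a string literal, so the sketch spells the backslash out as '\\/', which yields exactly the same characters as the original literal; the variable names are only for illustration.

```python
import re

# These are the characters the original literal '[\/:*?<>|]' ends up
# containing, because Python keeps the unknown escape \/ as it is.
pattern = '[\\/:*?<>|]'
print(pattern)   # [\/:*?<>|]

# The regex engine reads \/ inside the class as an escaped slash,
# so the class contains no backslash at all.
result = re.sub(pattern, '_', '\\/:*?<>|')
print(result)    # \_______
```

The slash and everything after it are replaced, but the leading backslash is not part of the character class and passes through untouched.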
So I figured it was time to actually read the documentation.
Raw strings
After reading the documentation, I realized that the re module does its own escaping, independently of Python's string escaping. For example, to match a single backslash character with an ordinary string literal, the pattern argument has to be written as '\\\\':
- Python processes the string literal '\\\\' and produces \\
- The re module receives \\ and interprets it as a regular expression; under regular-expression escaping rules it becomes a literal \
This is where raw strings come in handy. As the name suggests, a raw string is a string literal in which backslashes are not treated as escapes (the one exception being that it cannot end in a single backslash). So to match one backslash character you can write r'\\' instead of '\\\\'.
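A small sketch of the two layers of escaping, comparing an ordinary literal with a raw one. The variable names and the sample string are made up for illustration.

```python
import re

normal = '\\\\'   # ordinary literal: Python reduces this to two backslashes
raw = r'\\'       # raw literal: the two backslashes are kept as typed

same_pattern = (normal == raw)   # both hold the same two characters
pattern_length = len(raw)        # 2

# The regex engine reads the two backslashes as "one literal backslash".
subject = 'a\\b\\c'              # the actual text is a\b\c
found = re.findall(raw, subject)
print(len(found))                # 2
```

Both spellings hand the same pattern to the re module; the raw form just saves one round of doubling.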
So sanitize_filename above becomes:
    import re

    def sanitize_filename(filename):
        return re.sub(r'[\\/:*?<>|]', '_', filename)
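As a quick check of the raw-string version, the sketch below runs it on the earlier problem input and on a made-up track title (the title is hypothetical, not one from my library).

```python
import re

def sanitize_filename(filename):
    # r'[\\/:*?<>|]' is a character class that now includes the backslash.
    return re.sub(r'[\\/:*?<>|]', '_', filename)

all_invalid = sanitize_filename('\\/:*?<>|')     # the earlier problem input
title = sanitize_filename('Track 1: A/B?.mp3')   # hypothetical title

print(all_invalid)   # ________
print(title)         # Track 1_ A_B_.mp3
```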
Regex and Match
While we're here, let's take a look at the re module. What follows is a quick rundown for the impatient.
The main objects in Python's re module really come down to two:
- RegexObject
- MatchObject
RegexObject is a compiled regular expression object; operations such as match and sub are its methods. It is created by re.compile(pattern, flags).
    >>> email_pattern = re.compile(r'\w+@\w+\.\w+')
    >>> email_pattern.findall('My e-mail is abc@def.com and his is user@example.com')
    ['abc@def.com', 'user@example.com']
Its methods are:
- search: looks for a match anywhere in the string; returns a MatchObject or None
- match: matches only at the start of the string; returns a MatchObject or None
- split: returns the list of pieces obtained by splitting the string at each match
- findall: returns a list of all matches
- finditer: returns an iterator of MatchObjects
- sub: returns the string after replacement
- subn: returns a tuple (new string, number of replacements)
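The methods above can be tried side by side on one compiled pattern. This is only a sketch; the pattern \d+ and the sample text are made up for the demonstration.

```python
import re

digits = re.compile(r'\d+')
text = 'a1b22c333'

first = digits.search(text)        # finds '1' at index 1
at_start = digits.match(text)      # None: the text does not start with a digit
parts = digits.split(text)         # ['a', 'b', 'c', '']
all_hits = digits.findall(text)    # ['1', '22', '333']
spans = [m.span() for m in digits.finditer(text)]   # [(1, 2), (3, 5), (6, 9)]
replaced = digits.sub('#', text)   # 'a#b#c#'
counted = digits.subn('#', text)   # ('a#b#c#', 3)
```

Note the trailing empty string from split: the text ends with a match, so the piece after it is empty.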
The functions the re module provides directly, such as re.sub, re.match and re.findall, can be thought of as shortcuts that save you from creating a regular expression object yourself. A RegexObject, on the other hand, can be reused again and again, which is its advantage over the shortcut functions.
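A minimal sketch of the two styles, reusing the e-mail pattern from earlier; the sample lines are made up. Both give the same results, the compiled object is simply built once and then reused.

```python
import re

lines = ['contact: abc@def.com', 'no address here', 'cc user@example.com']

# Compile once, then reuse the same object for every line.
email_pattern = re.compile(r'\w+@\w+\.\w+')
compiled_hits = [email_pattern.findall(line) for line in lines]

# Shortcut function: the pattern string is passed on every call.
shortcut_hits = [re.findall(r'\w+@\w+\.\w+', line) for line in lines]

print(compiled_hits == shortcut_hits)   # True
```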
MatchObject is a match object representing the result of a single regular expression match. It is returned by some of the RegexObject methods. A match object always has a boolean value of True, and it offers a number of methods for retrieving information about the groups in the regular expression.
    >>> for m in re.finditer(r'(\w+)@\w+\.\w+', 'My email is abc@def.com and his is user@example.com'):
    ...     print('%d-%d %s %s' % (m.start(0), m.end(0), m.group(1), m.group(0)))
    ...
    12-23 abc abc@def.com
    35-51 user user@example.com
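Because a match object is always true and a failed search returns None, the usual idiom is to test the result directly. A short sketch, with an extra pair of groups added to the e-mail pattern purely for illustration:

```python
import re

m = re.search(r'(\w+)@(\w+)\.(\w+)', 'My email is abc@def.com')
if m:   # truthy on success; a failed search would give None instead
    whole = m.group(0)        # 'abc@def.com'
    user = m.group(1)         # 'abc'
    all_groups = m.groups()   # ('abc', 'def', 'com')
    span = m.span(0)          # (12, 23)

no_match = re.search(r'\d+', 'no digits here')   # None
```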
Reference
- The Python Standard Library: http://docs.python.org/2/library/re.html