First, something a little embarrassing: while writing a preview downloader for Xiami Music I ran into a problem. The saved files are named after the song titles, so whenever a title contained an illegal character such as "/" (hmph, I'm looking at you →_→ Windows), saving failed. Then I remembered how Xunlei (Thunder) handles this: it replaces every illegal character with an underscore.
That is where regular expressions came in. After some hasty searching that I only half digested, I wrote the following function:
import re

def sanitize_filename(filename):
    return re.sub('[\/:*?<>|]', '_', filename)
I recently noticed several problems with this function:
- Unlike in a shell, a backslash is an escape character in a Python string literal whether you use single or double quotes. By sheer dumb luck, Python leaves a meaningless escape such as \/ intact, so the pattern still reaches the re module as written.
- Even so, sanitize_filename('\\/:*?<>|') returns \_______ , which is still not all underscores: the backslash itself is never replaced.
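The two problems above can be reproduced with a small sketch. The name sanitize_filename_buggy is a hypothetical label for the first version of the function, and the pattern is spelled with an explicit '\\' (the same characters that '\/' produces) only so that newer Python versions do not warn about the unknown escape.

```python
import re

# Same characters as the literal '[\/:*?<>|]' used in the first version,
# written with '\\' to avoid an invalid-escape warning on newer Pythons.
BUGGY_PATTERN = '[\\/:*?<>|]'

def sanitize_filename_buggy(filename):
    # Hypothetical name for the first, flawed version of the function.
    return re.sub(BUGGY_PATTERN, '_', filename)

# What the re module actually receives: [\/:*?<>|]
received = BUGGY_PATTERN

# Inside the character class, re reads "\/" as an escaped "/", so the set of
# characters to replace contains no backslash at all.
result = sanitize_filename_buggy('\\/:*?<>|')
print(result)  # \_______
```

The leading backslash survives because the pattern never asks for it to be replaced; only the seven characters after it become underscores.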
So, having learned my lesson the hard way, I went and read the documentation.
Raw strings
After reading the documentation, I realized that the re module has its own layer of escaping, independent of Python's string escaping. For example, to match a single backslash character, the pattern argument has to be written as '\\\\':
- Python unescapes the string literal: \\\\ becomes \\
- The re module receives \\ and interprets it as a regular expression, where by the regex escaping rules it means a literal \
With that much hassle, raw strings are a big help. As the name implies, a raw string is a string that is not escaped (a trailing backslash being the exception). So to match one backslash character you can write r'\\'.
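The two layers of escaping can be checked directly. This is only a small sketch; the sample path 'C:\music' is a made-up value for illustration.

```python
import re

backslash = '\\'     # a string holding exactly one backslash character

# Layer 1: Python's string-literal escaping.
plain = '\\\\'       # four typed backslashes -> 2 characters in the string
raw = r'\\'          # two typed backslashes, kept as typed -> 2 characters
same_pattern = (plain == raw)

# Layer 2: the re module reads those 2 characters as "one literal backslash".
match = re.search(raw, 'C:' + backslash + 'music')   # hypothetical sample path
matched_text = match.group(0)
matched_at = match.start()

# The exception mentioned above: a raw string literal cannot end in an odd
# number of backslashes, so r'\' is a syntax error; write '\\' instead.
```

Both spellings give the re module the same two-character pattern, so the raw form is simply the one that is easier to read.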
So the sanitize_filename above becomes:
def sanitize_filename(filename):
    return re.sub(r'[\\/:*?<>|]', '_', filename)
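A quick check of the corrected version, as a sketch. The song title below is invented purely for illustration.

```python
import re

def sanitize_filename(filename):
    # Inside the raw string, \\ is the regex for one literal backslash.
    return re.sub(r'[\\/:*?<>|]', '_', filename)

# The same input that exposed the bug in the first version.
all_illegal = sanitize_filename('\\/:*?<>|')

# Hypothetical title, used only as an example.
title = sanitize_filename('Side A/Side B: Live?.mp3')
```

With the raw-string pattern all eight characters of the first input are replaced, and only the illegal characters of the title are touched.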
Regex and Match
So let's take a serious look at the re module. What follows is a plain running account, so skim it if you are impatient.
The two main objects in Python's regular expression module re are:
- the regular expression object, RegexObject
- the match object, MatchObject
RegexObject is the regular expression object; all operations such as match and sub belong to it. It is created by re.compile(pattern, flags).
>>> import re
>>> email_pattern = re.compile(r'\w+@\w+\.\w+')
>>> email_pattern.findall('My email is abc@def.com and his is user@example.com')
['abc@def.com', 'user@example.com']
Its methods include:
- search: scans from any position in the string; returns a MatchObject or None
- match: matches only from the first character; returns a MatchObject or None
- split: returns the list of pieces obtained by splitting the string at each match
- findall: returns a list of all matches
- finditer: returns an iterator over MatchObject instances
- sub: returns the string with the matches replaced
- subn: returns a tuple (new string, number of replacements)
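The methods listed above can be tried side by side. This is a minimal sketch; the sample sentence and the '<hidden>' replacement text are chosen only for illustration.

```python
import re

text = 'My email is abc@def.com and his is user@example.com'
email_pattern = re.compile(r'\w+@\w+\.\w+')

first = email_pattern.search(text)      # first match anywhere in the text
at_start = email_pattern.match(text)    # None: the text does not begin with an email
parts = email_pattern.split(text)       # the text between (and around) the matches
found = email_pattern.findall(text)     # the matched strings themselves
spans = [m.span() for m in email_pattern.finditer(text)]  # one match object per hit
hidden = email_pattern.sub('<hidden>', text)
hidden_n = email_pattern.subn('<hidden>', text)           # (new string, count)
```

Note that split keeps a trailing empty string here because the text ends with a match.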
The functions provided by the re module, such as re.sub, re.match and re.findall, can be thought of as shortcuts that save you from creating the regular expression object yourself. Because a RegexObject can be reused, that is its advantage over these shortcut functions.
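As a sketch of the difference, the same replacement can be written both ways. The list of titles is hypothetical.

```python
import re

titles = ['a/b', 'c:d', 'plain']   # hypothetical file titles

# Shortcut form: the pattern string is passed again on every call.
via_shortcut = [re.sub(r'[\\/:*?<>|]', '_', t) for t in titles]

# Explicit form: compile once, then reuse the object's own methods.
illegal_chars = re.compile(r'[\\/:*?<>|]')
via_object = [illegal_chars.sub('_', t) for t in titles]
```

Both forms give the same result. The re module also caches recently used patterns internally, so the gain from compiling yourself is mostly clarity and a little less per-call overhead.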
MatchObject is the match object, which represents the result of one regular expression match. It is returned by some of RegexObject's methods. A match object always evaluates to True, and it has a whole set of methods for getting information about the groups in the regular expression.
>>> for m in re.finditer(r'(\w+)@\w+\.\w+', 'My email is abc@def.com and his is user@example.com'):
...     print('%d-%d %s %s' % (m.start(0), m.end(0), m.group(1), m.group(0)))
...
12-23 abc abc@def.com
35-51 user user@example.com
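Because a match object is always true and a failed search returns None, the usual idiom is to test the result before reading its groups. A minimal sketch, with a second capturing group added to the pattern purely for illustration:

```python
import re

# Two groups: the user name and the domain (the second group is an addition
# made for this example only).
pattern = re.compile(r'(\w+)@(\w+\.\w+)')

m = pattern.search('My email is abc@def.com')
if m:                      # a match object always evaluates to True
    whole = m.group(0)     # the entire match
    user = m.group(1)      # first parenthesised group
    domain = m.group(2)    # second parenthesised group
    both = m.groups()      # all groups as a tuple
    position = m.span()    # (start, end) of the entire match

miss = pattern.search('no address here')   # None when nothing matches
```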
Reference
- The Python Standard Library, re module: http://docs.python.org/2/library/re.html