I previously noted down a snippet found online that uses Python's HTMLParser module to process text containing HTML escape characters. However, it raises an error when the content contains Chinese characters. The code is as follows:
    # cat html.py
    #!/usr/bin/python
    #coding=utf-8
    import HTMLParser
    html_parser = HTMLParser.HTMLParser()
    # title is a UTF-8 byte string; in the original script it contained Chinese characters
    title = 'Eclipse Functionality &lt;template&gt; learning. E.g: quickly insert timestamp in code-361way.com'
    newtitle = html_parser.unescape(title)
    print newtitle
The error output is as follows:
    Traceback (most recent call last):
      File "html.py", line 7, in <module>
        newtitle = html_parser.unescape(title)
      File "/usr/lib64/python2.6/HTMLParser.py", line 390, in unescape
        return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
      File "/usr/lib64/python2.6/re.py", line 151, in sub
        return _compile(pattern, 0).sub(repl, string, count)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 7: ordinal not in range(128)
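The error most likely arises because Python 2 has to combine the UTF-8 byte string with the unicode characters that unescape produces for each entity, which forces an implicit decode with the default ASCII codec. The following is a minimal sketch of that decode step in isolation. The sample byte string and the helper name `try_decode` are hypothetical and are not taken from the original script.

```python
# Hypothetical sample: "Eclipse" followed by two UTF-8 encoded Chinese characters.
raw = b'Eclipse\xe5\x8a\x9f\xe8\x83\xbd'


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


# Decoding as ASCII fails because byte 0xe5 is outside the 0-127 range.
ascii_text, ascii_error = try_decode(raw, 'ascii')

# Decoding with the encoding the bytes were actually written in succeeds.
utf8_text, utf8_error = try_decode(raw, 'utf-8')
```

Decoding as ASCII stops at the first byte above 0x7f, while decoding as UTF-8 succeeds. This suggests why changing the default encoding to UTF-8 makes the error disappear.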
The workaround is as follows:
    #!/usr/bin/python
    #coding=utf-8
    import HTMLParser
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    html_parser = HTMLParser.HTMLParser()
    title = 'Eclipse Functionality &lt;template&gt; learning. E.g: quickly insert timestamp in code-Segmentfault'
    newtitle = html_parser.unescape(title)
    print newtitle
After importing the sys module and resetting the default encoding to UTF-8, the error no longer occurs. However, the content to be processed is only the title of an article, and the HTML escapes that commonly appear there are as follows:
Character             Decimal    Escape character
"                     &#34;      &quot;
&                     &#38;      &amp;
<                     &#60;      &lt;
>                     &#62;      &gt;
non-breaking space    &#160;     &nbsp;
Note: For less common escapes, refer to the HTML escape character table in the Open Source China (OSChina) online tools.
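The rows of the table can be cross-checked against the standard library. The sketch below assumes Python 3.4+ (`html.unescape`) with a fallback to the Python 2 `HTMLParser` module. The names `ENTITY_TABLE` and `table_is_consistent` are hypothetical helpers introduced only for illustration.

```python
# -*- coding: utf-8 -*-
try:
    from html import unescape            # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 fallback
    unescape = HTMLParser().unescape

# (character, decimal reference, named entity), one tuple per table row
ENTITY_TABLE = [
    (u'"', u'&#34;', u'&quot;'),
    (u'&', u'&#38;', u'&amp;'),
    (u'<', u'&#60;', u'&lt;'),
    (u'>', u'&#62;', u'&gt;'),
    (u'\xa0', u'&#160;', u'&nbsp;'),     # U+00A0 is the non-breaking space
]


def table_is_consistent(rows):
    """Return True if both escaped forms in every row decode to the character."""
    return all(unescape(dec) == ch and unescape(named) == ch
               for ch, dec, named in rows)
```

Note that the standard library maps `&nbsp;` to U+00A0 rather than to an ordinary space, so a replace-based function that substitutes a plain space behaves slightly differently.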
Since only these few escapes matter here, I decided to use Python's replace method to implement a simple unescape function, as follows:
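    #!/usr/bin/python
    #coding=utf-8
    def replace_html(s):
        # Replace the common HTML escapes with the characters they stand for
        s = s.replace('&quot;', '"')
        s = s.replace('&lt;', '<')
        s = s.replace('&gt;', '>')
        s = s.replace('&nbsp;', ' ')
        # Handle &amp; last so that text such as &amp;lt; is unescaped only once
        s = s.replace('&amp;', '&')
        # Strip the site-name suffix from the title
        s = s.replace('-361way.com', '')
        print s
    # title is the string defined in the earlier script
    replace_html(title)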
The advantage is that it is quick and concise, does not depend on any extra module, and does not require reloading the sys module to set the default encoding.
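As a possible variant of the replace-based approach, the sketch below keeps the replacements in a table and returns the result instead of printing it. The names `REPLACEMENTS` and `simple_unescape` are hypothetical, and the set of escapes is limited to the five listed in the table above.

```python
# Hypothetical table-driven variant of the replace-based unescape function.
# &amp; comes last so that input such as "&amp;lt;" is unescaped only once.
REPLACEMENTS = [
    ('&quot;', '"'),
    ('&lt;', '<'),
    ('&gt;', '>'),
    ('&nbsp;', ' '),
    ('&amp;', '&'),
]


def simple_unescape(s):
    """Replace the escapes listed in REPLACEMENTS and return the new string."""
    for entity, char in REPLACEMENTS:
        s = s.replace(entity, char)
    return s
```

Returning the string rather than printing it makes the function easier to reuse, for example when the cleaned title has to be stored or compared later.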