Example of HTML escape conversion using Python

Source: Internet
Author: User

I previously noted down a method, found online, that uses Python's HTMLParser module to process HTML escape characters. However, it raises an error when the content contains Chinese characters. The code is as follows:


# cat html.py
#!/usr/bin/python
# coding=utf-8
import HTMLParser
html_parser = HTMLParser.HTMLParser()
# Note: the original title contained Chinese characters, which trigger the error.
title = 'Eclipse Functionality &lt;template&gt; learning. E.g: quickly insert timestamp in code-361way.com'
newtitle = html_parser.unescape(title)
print newtitle

The error output is as follows:


Traceback (most recent call last):
  File "html.py", line 7, in <module>
    newtitle = html_parser.unescape(title)
  File "/usr/lib64/python2.6/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib64/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 7: ordinal not in range(128)

The workaround is as follows:


#!/usr/bin/python
# coding=utf-8
import HTMLParser
import sys
# Reset the default encoding so implicit str/unicode conversions use UTF-8.
reload(sys)
sys.setdefaultencoding('utf-8')
html_parser = HTMLParser.HTMLParser()
title = 'Eclipse Functionality &lt;template&gt; learning. E.g: quickly insert timestamp in code-361way.com'
newtitle = html_parser.unescape(title)
print newtitle

Loading the sys module and resetting the default encoding to UTF-8 makes the error go away (reload(sys) is needed because Python 2 removes setdefaultencoding from sys at startup). However, the content to be processed here is only the title of an article, and the HTML escapes that commonly appear in it are few:

Character             Decimal   Named escape
"                     &#34;     &quot;
&                     &#38;     &amp;
<                     &#60;     &lt;
>                     &#62;     &gt;
Non-breaking space    &#160;    &nbsp;

Note: for less common escapes, refer to the HTML escape character table in the Open Source China (OSChina) online tools.
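
The decimal and named forms in the table above can be cross-checked against Python's standard library. This is a small sketch assuming Python 3.4 or later (not the Python 2.6 used in the article), where html.unescape and html.entities.name2codepoint are available.

```python
import html
from html.entities import name2codepoint

# Entity names from the table above.
names = ["quot", "amp", "lt", "gt", "nbsp"]

for name in names:
    code = name2codepoint[name]                 # decimal code point
    by_name = html.unescape("&%s;" % name)      # named form
    by_number = html.unescape("&#%d;" % code)   # decimal form
    # Both forms should decode to the same single character.
    print(name, code, repr(by_name), by_name == by_number)
```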

I therefore decided to use Python's str.replace method to implement a simple unescape function, as follows:

#!/usr/bin/python
# coding=utf-8
def replace_html(s):
    s = s.replace('&quot;', '"')
    s = s.replace('&lt;', '<')
    s = s.replace('&gt;', '>')
    s = s.replace('&nbsp;', ' ')
    # Replace &amp; last so that '&amp;lt;' is not unescaped twice.
    s = s.replace('&amp;', '&')
    # Strip the site suffix from the title.
    s = s.replace('-361way.com', '')
    print s
# title is the string defined in the earlier example.
replace_html(title)

The advantage is that it is quick and concise, does not depend on any module, and does not require reloading the sys module to set the default encoding.
