Python converts character entity references to Unicode characters




An HTML entity reference has the form &lt;, and a numeric character reference (NCR) has the form &#60; or &#x3c;. All three represent the "<" character.
Character entity references are specified in HTML, and the correspondence between entity names and NCRs is listed in section 24.2.1 of the HTML 4 specification, for example:



<!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space, U+00A0 ISOnum -->
<!ENTITY iexcl  CDATA "&#161;" -- inverted exclamation mark, U+00A1 ISOnum -->
<!ENTITY yen    CDATA "&#165;" -- yen sign = yuan sign, U+00A5 ISOnum -->
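
Declarations of this kind can be read mechanically. The following is only a minimal sketch, assuming Python 3; the names DECLARATIONS, ENTITY_RE and parse_entities are made up for illustration and are not part of any library. It extracts the entity name and the decimal code point from each declaration with a regular expression:

```python
import re

# Sample declarations in the DTD style shown above (illustrative input).
DECLARATIONS = '''
<!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space, U+00A0 ISOnum -->
<!ENTITY iexcl  CDATA "&#161;" -- inverted exclamation mark, U+00A1 ISOnum -->
<!ENTITY yen    CDATA "&#165;" -- yen sign = yuan sign, U+00A5 ISOnum -->
'''

# Group 1 captures the entity name, group 2 the decimal digits of the NCR.
ENTITY_RE = re.compile(r'<!ENTITY\s+(\w+)\s+CDATA\s+"&#(\d+);"')

def parse_entities(dtd_text):
    """Return a dict mapping entity names to integer code points."""
    return {name: int(code) for name, code in ENTITY_RE.findall(dtd_text)}

table = parse_entities(DECLARATIONS)
yen_sign = chr(table['yen'])  # the character the "yen" entity stands for
```

The resulting dictionary has the same shape as a name-to-code-point table: each entity name maps to the integer that appears in its NCR.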



So how do we convert HTML entities and NCR to ordinary characters in Python?



Before answering this question, let's briefly review a few building blocks:



group method



group([group1, ...])
group is a method of the match object that returns one or more subgroups of the match. With a single argument the result is a string; with more than one argument the result is a tuple. The default value of group1 is 0, which returns the whole match. If a groupN argument is in the range [1..99], the string matched by the corresponding parenthesized group is returned. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
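
The rules above can be checked with a few short calls. This is a small sketch, assuming Python 3; the patterns and sample strings are chosen only for illustration:

```python
import re

# Two parenthesized groups: a first name and a last name.
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")

whole = m.group(0)     # no group number needed: 0 is the entire match
first = m.group(1)     # one argument gives a string
both = m.group(1, 2)   # several arguments give a tuple

# A group in a part of the pattern that did not match yields None.
unmatched = re.match(r"(a)|(b)", "b").group(1)

# A group that matched several times reports only its last match.
repeated = re.match(r"(..)+", "a1b2c3").group(1)

# A group number larger than the number of groups raises IndexError.
try:
    m.group(3)
    out_of_range = False
except IndexError:
    out_of_range = True
```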

re.sub method
re.sub(pattern, repl, string[, count])
sub is the string substitution function of the re module. It looks for matches of the pattern in the target string and replaces them with the specified replacement.
pattern argument -- the regular expression to match
repl argument -- the replacement string or function. If repl is a function, it is called for every match; it takes a single match object as its argument and returns the replacement string.
string argument -- the target string
count argument -- the maximum number of substitutions; if unspecified, all matches are replaced
An example of re.sub():


The code is as follows

import re

def dashrepl(matchobj):
    # a single dash becomes a space, a double dash becomes a single dash
    if matchobj.group(0) == '-':
        return ' '
    else:
        return '-'

re.sub('-{1,2}', dashrepl, 'pro----gram-files')

# result: 'pro--gram files'
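
To make the effect of the count argument and of a plain string replacement concrete, here is a short sketch, assuming Python 3. It reuses the dashrepl function from the example; the variable names are illustrative only:

```python
import re

def dashrepl(matchobj):
    # A single dash becomes a space; a double dash becomes a single dash.
    if matchobj.group(0) == '-':
        return ' '
    return '-'

# Function replacement applied to every match.
replaced_all = re.sub('-{1,2}', dashrepl, 'pro----gram-files')

# count=1 stops after the first substitution.
replaced_once = re.sub('-{1,2}', dashrepl, 'pro----gram-files', count=1)

# A plain string replacement: collapse every run of dashes into one dash.
collapsed = re.sub('-+', '-', 'pro----gram-files')
```

With count=1 only the first "--" is rewritten, so the rest of the string keeps its dashes.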

htmlentitydefs
htmlentitydefs has three attributes, described as follows:
entitydefs: a dictionary mapping XHTML 1.0 entity definitions to their replacement text in ISO Latin-1.
name2codepoint: a dictionary that maps HTML entity names to the Unicode code points. New in version 2.3.
codepoint2name: a dictionary that maps Unicode code points to HTML entity names.


Their actual contents look roughly like this:


The code is as follows

entitydefs = {'AElig': '\xc6', 'Aacute': '\xc1', 'Acirc': '\xc2', ...}
name2codepoint = {'AElig': 198, 'Aacute': 193, 'Acirc': 194, ...}
codepoint2name = {34: 'quot', 38: 'amp', 60: 'lt', 62: 'gt', ...}


For our purposes, the most useful attribute is name2codepoint. For example, the name of "<" is lt, so we can get its code point, 60, with name2codepoint['lt'].
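
For readers on Python 3, the same three dictionaries are available from the html.entities module, which replaced htmlentitydefs. The lookups below are a brief sketch under that assumption; the variable names are illustrative:

```python
# Python 3 location of the tables described above (htmlentitydefs in Python 2).
from html.entities import codepoint2name, entitydefs, name2codepoint

lt_code = name2codepoint['lt']       # entity name -> code point
lt_name = codepoint2name[60]         # code point -> entity name
aelig_code = name2codepoint['AElig'] # names are case-sensitive
lt_text = entitydefs['lt']           # entity name -> replacement text
```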



unichr function



unichr is a built-in function (unichr(i)) that converts an integer to the corresponding Unicode character, for example: unichr(60) -> u'\u003c', that is, u'<'
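
In Python 3 there is no unichr; every str is Unicode and the built-in chr plays the same role. A small sketch under that assumption, turning the digits of an NCR into a character:

```python
# chr() in Python 3 covers what unichr() did in Python 2.
from_decimal = chr(int("60"))      # digits taken from "&#60;"
from_hex = chr(int("3c", 16))      # digits taken from "&#x3c;"
round_trip = ord(from_decimal)     # ord() is the inverse of chr()
```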


Putting these pieces together, the code is as follows


import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def convert(matchobj):
        text = matchobj.group(0)
        if text[:2] == "&#":
            # numeric character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # character entity reference
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text  # the Unicode character, or the original text if conversion failed
    return re.sub(r"&#?\w+;", convert, text)
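
The function above targets Python 2. As a hedged sketch of how the same approach might look on Python 3, the version below swaps htmlentitydefs for html.entities and unichr for chr; the sample string is made up for illustration. Python 3 also ships html.unescape in the standard library, which is used here only as a cross-check on well-formed input:

```python
import html
import re
from html.entities import name2codepoint

def unescape(text):
    """Replace character references and entities with Unicode characters."""
    def convert(matchobj):
        ref = matchobj.group(0)
        if ref[:2] == "&#":
            # numeric character reference, decimal or hexadecimal
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # named character entity reference
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass
        return ref  # leave unknown or malformed references untouched
    return re.sub(r"&#?\w+;", convert, text)

sample = "&lt;p&gt; &#60; &#x3c; &amp; &bogus;"
converted = unescape(sample)
builtin = html.unescape("&lt;&#60;&#x3c;")
```

Unknown names such as &bogus; fall through both try blocks and are returned unchanged, which mirrors the behaviour of the original function.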


Converting Unicode strings (for example, Chinese text) to ordinary strings, as provided by a user



Unicode strings can be encoded into ordinary strings in many ways, depending on which encoding you choose:


The code is as follows

unicodestring = u"Hello world"
# Convert Unicode to a plain Python string: "encode"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("iso-8859-1")
utf16string = unicodestring.encode("utf-16")
# Convert a plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "iso-8859-1")
plainstring4 = unicode(utf16string, "utf-16")
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
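
On Python 3 the same round trip is written with str.encode and bytes.decode, since the unicode type no longer exists. The following sketch assumes Python 3; the Chinese sample text is an illustrative choice, and it also shows that an encoding which cannot represent a character raises UnicodeEncodeError:

```python
text = "Hello world"
encodings = ["utf-8", "ascii", "iso-8859-1", "utf-16"]

# str -> bytes: "encode"
encoded = {enc: text.encode(enc) for enc in encodings}

# bytes -> str: "decode", using the same encoding each time
decoded = [encoded[enc].decode(enc) for enc in encodings]

# Non-ASCII text survives a UTF-8 round trip...
chinese = "中文"
chinese_round_trip = chinese.encode("utf-8").decode("utf-8")

# ...but cannot be encoded as ASCII.
try:
    chinese.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```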