Detailed explanation of the character encoding problem when lxml is used to process XML

Source: Internet
Author: User
To simplify the problem, the XML content is reduced to the following form:

    <?xml version="1.0" encoding="gbk"?>
    <DOCUMENT><da>中文,就是任性</da></DOCUMENT>

Its declared encoding is gbk, and the node contains Chinese text.
When lxml is used to extract the node value, the following exception occurs:

lxml.etree.XMLSyntaxError: Extra content at the end of the document

The corresponding Python script is:

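    from io import BytesIO
    from lxml import etree

    tst = u'''<?xml version="1.0" encoding="gbk"?>
    <DOCUMENT><da>中文,就是任性</da></DOCUMENT>'''

    # The declaration says gbk, but the bytes handed to the parser are UTF-8.
    for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
        print("%s, %s" % (element.tag, element.text))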

Before the simplification, however, a different exception was reported:

lxml.etree.XMLSyntaxError: input conversion failed due to input error, bytes 0x8B 0x2C 0xE6 0x9D

Whichever exception appears, both point to a problem with the character encoding.
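One way to see why the encoding matters is to compare the bytes directly, without lxml. The sketch below is only an illustration: the variable names are made up for this example, and the string `事,来` is a hypothetical sample chosen solely because its UTF-8 bytes contain the sequence 0x8B 0x2C 0xE6 0x9D from the second error message. It is not the original data.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: all names below are assumptions, not from the original post.
text = u'中文,就是任性'

utf8_bytes = text.encode('utf-8')  # what the script actually passes to the parser
gbk_bytes = text.encode('gbk')     # what an encoding="gbk" declaration promises

# The same characters give different byte sequences in the two encodings.
same_bytes = (utf8_bytes == gbk_bytes)

# Decoding with the encoding that produced the bytes round-trips cleanly.
round_trip_ok = (gbk_bytes.decode('gbk') == text)

# Reading UTF-8 bytes as GBK does not give the original text back.
misread = utf8_bytes.decode('gbk', errors='replace')

# Hypothetical sample whose UTF-8 bytes include 0x8B 0x2C 0xE6 0x9D.
sample_bytes = u'事,来'.encode('utf-8')
try:
    sample_bytes.decode('gbk')
    strict_error = None
except UnicodeDecodeError as exc:
    # 0x8B is a GBK lead byte, but 0x2C (the comma) is not a valid trail byte.
    strict_error = exc
```

A parser that trusts the `encoding="gbk"` declaration attempts the same conversion on the UTF-8 input, which would explain both the conversion failure and the syntax error seen after a misread.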
After various attempts, I came across a Stack Overflow post describing a problem tied to the encoding value declared in the XML, so I tried adding one line of code:

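    from io import BytesIO
    from lxml import etree

    tst = u'''<?xml version="1.0" encoding="gbk"?>
    <DOCUMENT><da>中文,就是任性</da></DOCUMENT>'''

    # Make the declared encoding match the bytes produced by encode('utf-8').
    tst = tst.replace('encoding="gbk"', 'encoding="UTF-8"')

    for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
        print("%s, %s" % (element.tag, element.text))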

The added replace call changes encoding="gbk" in the declaration to encoding="UTF-8", so the declared encoding now matches the bytes passed to the parser.
With that change, I finally got the result:

    da, 中文,就是任性
    DOCUMENT, None
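The same mismatch can also be resolved from the other side: leave the declaration alone and encode the text as GBK, so the bytes match what the document claims. This is a minimal sketch of that alternative, not the approach taken above. The helper name `tags_and_text` and the compact one-line document are assumptions made for the example, and it presumes the underlying libxml2 build can convert GBK input.

```python
# -*- coding: utf-8 -*-
from io import BytesIO
from lxml import etree

# Assumed compact form of the sample document; the declaration still says gbk.
tst = (u'<?xml version="1.0" encoding="gbk"?>'
       u'<DOCUMENT><da>中文,就是任性</da></DOCUMENT>')


def tags_and_text(xml_bytes):
    """Hypothetical helper: collect (tag, text) for every 'end' event."""
    return [(element.tag, element.text)
            for _event, element in etree.iterparse(BytesIO(xml_bytes))]


# Encode with the declared encoding instead of rewriting the declaration.
results = tags_and_text(tst.encode('gbk'))

for tag, text in results:
    print("%s, %s" % (tag, text))
```

Which side to change depends on where the data comes from: if the XML arrives as raw bytes from a file or the network, passing those bytes to lxml unchanged avoids the problem, since no re-encoding step can drift away from the declaration.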
