To narrow the problem down, I reduced the XML content to the following:
中文,就是任性
Its declared encoding is gbk, and one of the nodes contains Chinese characters. When lxml is used to extract the node value, the following exception occurs:
lxml.etree.XMLSyntaxError: Extra content at the end of the document
The corresponding Python script is:
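from io import BytesIO
from lxml import etree

tst = u'''
中文,就是任性
'''
# Parse the UTF-8 encoded bytes and print each element's tag and text
for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))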
Before the simplification, however, a different exception was reported:
lxml.etree.XMLSyntaxError: input conversion failed due to input error, bytes 0x8B 0x2C 0xE6 0x9D
Whichever exception is raised, the likely cause is the character encoding.
After a number of attempts, I came across a Stack Overflow post describing a problem related to the encoding value declared in the XML, so I tried adding one line of code:
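Both exceptions are consistent with the parser decoding bytes under the wrong label. The sketch below is not part of the original script and uses only the standard library; the variable names are illustrative. It shows that the same text yields different byte sequences in UTF-8 and GBK, and that reading UTF-8 bytes as GBK does not give the text back.

```python
# -*- coding: utf-8 -*-
# Illustrative only: compare how the sample text is stored in UTF-8 and in GBK.
text = u'中文,就是任性'

utf8_bytes = text.encode('utf-8')  # 3 bytes per Chinese character here
gbk_bytes = text.encode('gbk')     # 2 bytes per Chinese character here

# Same characters, different byte sequences.
print(len(utf8_bytes), len(gbk_bytes))

# Decoding UTF-8 bytes under a GBK label does not recover the text.
# errors='replace' keeps the call from raising on byte pairs GBK cannot map.
misread = utf8_bytes.decode('gbk', errors='replace')
print(misread == text)

# When the label matches the bytes, the text round-trips.
print(utf8_bytes.decode('utf-8') == text)
print(gbk_bytes.decode('gbk') == text)
```

An XML parser that trusts an encoding="gbk" declaration while being fed UTF-8 bytes is in the position of the `misread` line above.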
tst = u'''
中文,就是任性
'''
# Make the declared encoding match the bytes that are actually passed in
tst = tst.replace('encoding="gbk"', 'encoding="UTF-8"')
for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))
The added statement replaces the declared encoding="gbk" with encoding="UTF-8", so the declaration matches the UTF-8 bytes actually passed to the parser. The expected result is finally obtained:
Da, 中文,就是任性
DOCUMENT, None
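For reference, here is a self-contained sketch of the fix. The markup of the sample is not shown above, so the XML string here is an assumed reconstruction inferred from the output: a gbk declaration, a DOCUMENT root, and a single Da child. `parse_pairs` is a hypothetical helper. The sketch also tries an alternative that is not from the original script: leaving the declaration alone and encoding the text as GBK so that the bytes match it. That second variant assumes the libxml2 build behind lxml can decode GBK.

```python
# -*- coding: utf-8 -*-
from io import BytesIO
from lxml import etree

# Assumed reconstruction of the sample document (the real markup may differ).
xml_text = (
    u'<?xml version="1.0" encoding="gbk"?>'
    u'<DOCUMENT><Da>中文,就是任性</Da></DOCUMENT>'
)


def parse_pairs(data):
    """Hypothetical helper: return (tag, text) for every element end event."""
    return [(el.tag, el.text) for _, el in etree.iterparse(BytesIO(data))]


# Option 1 (the fix described above): rewrite the declaration to UTF-8,
# then pass UTF-8 bytes.
fixed_text = xml_text.replace('encoding="gbk"', 'encoding="UTF-8"')
pairs_utf8 = parse_pairs(fixed_text.encode('utf-8'))

# Option 2 (an alternative, not from the original script): keep the gbk
# declaration and encode the text as GBK so the bytes match it.
pairs_gbk = parse_pairs(xml_text.encode('gbk'))

for tag, text in pairs_utf8:
    print("%s, %s" % (tag, text))
```

Either way, the point is the same: the encoding named in the XML declaration and the encoding of the bytes handed to `iterparse` have to agree.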