To simplify the problem, the XML content is reduced to the following:
<?xml version="1.0" encoding="gbk"?>
<DOCUMENT><Da>中文,就是任性</Da></DOCUMENT>
Its declared encoding is gbk, and one node contains Chinese text.
Extracting the node values with lxml raises the following exception:
lxml.etree.XMLSyntaxError: Extra content at the end of the document
The corresponding Python script is:
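from io import BytesIO
from lxml import etree

tst = u'''<?xml version="1.0" encoding="gbk"?>
<DOCUMENT><Da>中文,就是任性</Da></DOCUMENT>'''
for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))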
Before the XML was simplified, however, a different exception was reported:
lxml.etree.XMLSyntaxError: input conversion failed due to input error, bytes 0x8B 0x2C 0xE6 0x9D
Whichever exception appears, the likely cause is the character encoding.
After various attempts, I came across a post on Stack Overflow. The problem it describes is related to the encoding value declared in the XML, so I tried adding one line of code:
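One way to check that suspicion is to compare the encoding named in the XML declaration with the encoding actually used to produce the bytes. The sketch below uses only the standard library. The helpers `declared_encoding` and `decodes_as` are hypothetical names, not lxml functions, and the sample XML is an assumed stand-in for the simplified document.

```python
import re

# Hypothetical helper (not an lxml API): find the encoding named in the XML declaration.
_DECL = re.compile(br'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def declared_encoding(data):
    """Return the lower-cased encoding from the XML declaration, or None if absent."""
    match = _DECL.match(data)
    return match.group(1).decode('ascii').lower() if match else None

def decodes_as(data, encoding):
    """Return True if the bytes decode cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Assumed sample: the declaration says gbk, but the bytes are produced as UTF-8.
xml_text = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><Da>中文,就是任性</Da></DOCUMENT>'
actual = 'utf-8'
data = xml_text.encode(actual)

declared = declared_encoding(data)
mismatch = declared is not None and declared != actual

# The four bytes quoted in the second error message are not a valid GBK sequence.
reported = bytes(bytearray([0x8B, 0x2C, 0xE6, 0x9D]))
reported_ok = decodes_as(reported, 'gbk')

print("declared=%s actual=%s mismatch=%s" % (declared, actual, mismatch))
print("reported bytes valid as gbk: %s" % reported_ok)
```

A mismatch between the two values means the parser is told to decode UTF-8 bytes as gbk, which fits both error messages.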
tst = u'''<?xml version="1.0" encoding="gbk"?>
<DOCUMENT><Da>中文,就是任性</Da></DOCUMENT>'''
tst = tst.replace('encoding="gbk"', 'encoding="UTF-8"')
for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))
The added replace statement changes the original encoding="gbk" to encoding="UTF-8", so the declaration matches the bytes that are actually passed to the parser.
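A plain string replace only works when the declaration is written exactly as encoding="gbk". The following sketch shows a slightly more tolerant variant that rewrites whatever encoding is declared, then parses the result. It makes some assumptions: `to_utf8_bytes` is a hypothetical helper, the sample XML is an assumed stand-in for the simplified document, and the standard library's xml.etree.ElementTree is used in place of lxml so the example runs without extra dependencies (its iterparse is called the same way here).

```python
import re
from io import BytesIO
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

def to_utf8_bytes(xml_text):
    """Hypothetical helper: make the declaration say UTF-8, then encode as UTF-8."""
    fixed = re.sub(
        r'(<\?xml[^>]*?encoding\s*=\s*["\'])[^"\']+',  # keep everything up to the opening quote
        r'\g<1>UTF-8',                                  # swap only the encoding name
        xml_text,
        count=1,
    )
    return fixed.encode('utf-8')

# Assumed sample document with a gbk declaration.
xml_text = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><Da>中文,就是任性</Da></DOCUMENT>'
data = to_utf8_bytes(xml_text)

# Collect (tag, text) for each element as its end tag is reached.
results = [(element.tag, element.text) for event, element in ET.iterparse(BytesIO(data))]
for tag, text in results:
    print("%s, %s" % (tag, text))
```

The design choice is the same as in the post: keep the bytes as UTF-8 and make the declaration agree with them, rather than changing how the text is encoded.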
So I finally got the result:
Da, 中文,就是任性
DOCUMENT, None