Storing illegal characters in XML has always been a nuisance. Here "illegal" covers not only invisible (control) characters but also the characters that have a special meaning in XML markup, such as <, & and >.
A convenient way to handle them is to store the problematic data in hex or base64 form. The following two functions demonstrate how to use base64 encoding (an encoding, not encryption) to handle such strings, which preserves the data intact while keeping the XML readable; after all, the generated XML is read by people as well as by machines. The idea is this: a string that contains invalid characters is always base64-encoded, and a base64="true" attribute is added to the generated XML tag. A string without invalid characters is stored as plain (escaped) text, and no base64 attribute is added to the generated tag.
# -*- coding: utf-8 -*-
# Note: this is Python 2 code (str is a byte string, print is a statement).
"""
Created on
@summary: helper functions that may be used in XML processing
@author: jerrykwan
"""
try:
    import xml.sax.saxutils
except ImportError:
    raise ImportError("requires xml.sax.saxutils package, "
                      "please check if xml.sax.saxutils is installed!")
import base64
import logging

logger = logging.getLogger(__name__)

__all__ = ["escape", "unescape"]


def escape(data):
    """
    @summary:
        Escape '&', '<', and '>' in a string of data.
        If the data is not ASCII, encode it in base64 instead.
    @param data: the data to be processed
    @return: {"base64": True | False, "data": data}
    """
    is_base64 = False
    escaped_data = ""
    try:
        # check if all of the data is ASCII
        data.decode("ascii")
        is_base64 = False
        # escape the characters that cannot be stored in XML as-is
        escaped_data = xml.sax.saxutils.escape(data)
    except UnicodeDecodeError:
        logger.debug("%r is not an ASCII-encoded string, "
                     "so it will be encoded in base64", data)
        # base64 encode
        escaped_data = base64.b64encode(data)
        is_base64 = True
    return {"base64": is_base64, "data": escaped_data}


def unescape(data, is_base64=False):
    """
    @summary:
        Unescape '&', '<', and '>' in a string of data.
        If is_base64 is True, the data is base64-decoded instead.
    @param data: the data to be processed
    @param is_base64: specifies whether the data is base64-encoded
    @return: unescaped data
    """
    unescaped_data = data
    if is_base64:
        try:
            unescaped_data = base64.b64decode(data)
        except Exception as ex:
            logger.debug("Some exception occurred when invoking b64decode")
            logger.error(ex)
            print ex
    else:
        # unescape it
        unescaped_data = xml.sax.saxutils.unescape(data)
    return unescaped_data


if __name__ == "__main__":
    def test(data):
        print "original data is:", data
        t1 = escape(data)
        print "escaped result:", t1
        print "unescaped result is:", unescape(t1["data"], t1["base64"])
        print "#" * 50

    test("123456")
    test("test")
    test("<&>")
    test("'!@#$%^&*:'\"-=")
    print "just a test"
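To show what the generated tag might look like, here is a minimal Python 3 sketch of the same idea built on the standard xml.etree.ElementTree module. The helper names make_element and read_element and the tag name item are hypothetical and do not come from the code above. Two assumptions differ from the original functions: the input is a bytes object, and only printable ASCII is stored as text, so ASCII control characters are base64-encoded as well.

```python
import base64
import xml.etree.ElementTree as ET


def _is_printable_ascii(raw):
    """True if every byte is a printable ASCII character (0x20..0x7E)."""
    return all(0x20 <= byte <= 0x7E for byte in raw)


def make_element(tag, raw):
    """Build an element holding the bytes `raw` (hypothetical helper)."""
    elem = ET.Element(tag)
    if _is_printable_ascii(raw):
        # Readable text; ElementTree escapes '<', '&' and '>' itself
        # when the element is serialized.
        elem.text = raw.decode("ascii")
    else:
        # Everything else is base64-encoded and flagged with an attribute.
        elem.text = base64.b64encode(raw).decode("ascii")
        elem.set("base64", "true")
    return elem


def read_element(elem):
    """Return the original bytes stored by make_element."""
    text = elem.text or ""
    if elem.get("base64") == "true":
        return base64.b64decode(text)
    return text.encode("ascii")


if __name__ == "__main__":
    samples = (
        b"123456",                      # <item>123456</item>
        b"<&>",                         # <item>&lt;&amp;&gt;</item>
        b"\xe6\xb5\x8b\xe8\xaf\x95",    # UTF-8 bytes, stored as base64
    )
    for raw in samples:
        xml_text = ET.tostring(make_element("item", raw), encoding="unicode")
        print(xml_text)
        # Parsing the serialized element must give the original bytes back.
        assert read_element(ET.fromstring(xml_text)) == raw
```

Because ElementTree performs the escaping during serialization, this variant does not call xml.sax.saxutils.escape; calling it as well would escape the text twice.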
Note: The method above is simple, but it only handles ASCII text and the characters <, & and >; all non-ASCII data is base64-encoded indiscriminately. For better compatibility, you could use the chardet package to detect the encoding and convert the string to UTF-8 before storing it, which is more suitable.
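The approach suggested in the note could look roughly like the following Python 3 sketch. The names decode_bytes and escape_text are hypothetical, and trying UTF-8 before asking chardet is an assumption of this sketch. chardet is a third-party package, so it is treated as optional here. Text is stored readably only if it decodes cleanly and contains no character that XML 1.0 forbids; everything else still falls back to base64.

```python
import base64
import re
import xml.sax.saxutils

try:
    import chardet  # third-party and optional: pip install chardet
except ImportError:
    chardet = None

# Matches any character outside the XML 1.0 "Char" production,
# i.e. a character that may not appear in an XML document at all.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def decode_bytes(raw):
    """Decode `raw` to text, or return None if no encoding fits."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    if chardet is not None:
        # chardet.detect returns a dict whose "encoding" may be None.
        guess = chardet.detect(raw)["encoding"]
        if guess:
            try:
                return raw.decode(guess)
            except (UnicodeDecodeError, LookupError):
                pass
    return None


def escape_text(raw):
    """Return {"base64": bool, "data": str} for the bytes `raw`."""
    text = decode_bytes(raw)
    if text is None or _ILLEGAL_XML_CHARS.search(text):
        # Undecodable data or XML-illegal characters: keep the exact bytes.
        return {"base64": True,
                "data": base64.b64encode(raw).decode("ascii")}
    # Legal text: escape '&', '<' and '>' and store it readably.
    return {"base64": False, "data": xml.sax.saxutils.escape(text)}
```

One trade-off to keep in mind: when the text was decoded with an encoding guessed by chardet, the stored value is the text and not the original byte sequence, so the original bytes can only be recovered if the guessed encoding is recorded as well.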