Last time I wrote about the UTF-8 problem in Google Protocol Buffers.
According to the description by Kenton Varda, the author of Protocol Buffers:
C++ protocol buffers use UTF-8 for all text encoding, regardless of platform. If you want to use some other encoding in your code, you will have to manually convert between that and UTF-8 when interacting with protocol buffers.
In Java and Python everything is taken care of automatically, since these languages have built-in Unicode support. In Java, protocol buffers uses String objects (which are Unicode) to represent strings, and in Python you can use the "unicode" builtin type for Unicode.
In other words: in C++, UTF-8 is used for all text, whatever the platform. If you want to use another encoding, you have to convert to and from UTF-8 yourself. In Java and Python the encoding conversion is handled automatically, because both languages have built-in Unicode support. In Java, Protocol Buffers uses String objects (which are Unicode) to represent strings; in Python, you can use the built-in unicode type to represent Unicode strings.
Explanation: if a field in the .proto file is of type string, C++ uses UTF-8 for it, while Java and Python require a Unicode string. If your data is in any other form, you have to convert it manually. For example, in Python:
Suppose data is UTF-8 encoded, but the Python version of Protocol Buffers requires Unicode. What do you do?
cont = MsgContent()
cont.strcont = data.decode('utf-8')  # must be decoded from UTF-8 to unicode
buff = cont.SerializeToString()  # serialize to a string
Similarly, after parsing, every string field is Unicode:
cont = MsgContent()
cont.ParseFromString(buff)
data = cont.strcont  # this is unicode
data = data.encode('utf-8')  # encode from unicode to whatever encoding you need
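The two conversions above can be tried without any generated message class. The following is only a minimal sketch, written for Python 3 (where str plays the role of the old unicode type); the helper names to_field_text and from_field_text are hypothetical and are not part of the Protocol Buffers API.

```python
def to_field_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode stored bytes into text, the form a `string` field expects."""
    return raw.decode(encoding)


def from_field_text(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text read back from a `string` field into the bytes you need."""
    return text.encode(encoding)


# Bytes as they might arrive from a database or a socket (assumed UTF-8).
data = "协议 buffers".encode("utf-8")

text = to_field_text(data)             # what would be assigned to cont.strcont
restored = from_field_text(text)       # what you would produce after parsing
as_gbk = from_field_text(text, "gbk")  # any other target encoding works too
```

When the source and target encoding are both UTF-8, restored is byte-for-byte identical to data, so the two calls cancel each other out.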
The main point is that this is too much trouble, and I suspect many people feel the same way. For example, if the data is UTF-8 in the database, in files, and even on the wire, then every time it has to be converted to Unicode before serializing, and converted back to UTF-8 after parsing. If you declare the field as bytes instead, you can use the data as is, without any processing.
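One reason bytes works as a drop-in replacement is that string and bytes fields share the same length-delimited wire type (wire type 2), so the serialized bytes are the same either way. The sketch below encodes a single field by hand to show this; encode_varint and encode_len_field are illustrative helpers written for this post, not functions from the protobuf library, and field number 1 is an arbitrary assumption.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """Encode one length-delimited field: tag, payload length, payload."""
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# A `string` field: the text is encoded to UTF-8 before it is written.
as_string_field = encode_len_field(1, "héllo".encode("utf-8"))

# A `bytes` field: UTF-8 data that is already bytes is written untouched.
as_bytes_field = encode_len_field(1, b"h\xc3\xa9llo")
```

Both variables end up holding the same bytes, so the difference between the two field types lies in the conversion done by the language binding, not in what travels over the wire.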
A final note: I do not fully agree with Kenton Varda on this issue.
In fact, strings in Java and Python can carry data in any encoding, not only Unicode. For example, in Python (Python 2, as in the snippets above), if the source file declares # -*- coding: UTF-8 -*-, every string literal typed into the code is UTF-8. If the data read from files, read from the database, and received from the network is also UTF-8, then the whole pipeline uses one encoding, which is the most convenient arrangement, and the automatic conversion is really just an extra step.