Google Protocol Buffers character encoding (C++/Java/Python)

Source: Internet
Author: User

I wrote about the UTF-8 problem in Google Protocol Buffers last time. According to Kenton Varda, the author of Protocol Buffers:

Description:

C++ protocol buffers use UTF-8 for all text encoding, regardless of platform. If you want to use some other encoding in your code, you will have to manually convert between that and UTF-8 when interacting with protocol buffers.

In Java and Python everything is taken care of automatically, since these languages have built-in Unicode support. In Java, protocol buffers uses String objects (which are Unicode) to represent strings, and in Python you can use the "unicode" builtin type for Unicode.

Translation: C++ uses UTF-8 for all text encoding, regardless of platform. If you want to use another encoding, you have to convert to and from UTF-8 manually. In Java and Python the conversion is handled automatically, because both languages have built-in Unicode support. In Java, protocol buffers use String objects (which are Unicode) to represent strings; in Python, you can use the built-in unicode type to represent Unicode strings.
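
The manual conversion mentioned above can be sketched as follows. This is only a minimal illustration in Python 3 using the standard library. GBK stands in for "some other encoding" purely as an example, and the helper name transcode_to_utf8 is hypothetical, not part of Protocol Buffers.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes from source_encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # bytes in the source encoding -> Unicode str
    return text.encode("utf-8")         # Unicode str -> UTF-8 bytes


# Example: the same two Chinese characters, first as GBK, then as UTF-8.
gbk_bytes = "中文".encode("gbk")
utf8_bytes = transcode_to_utf8(gbk_bytes, "gbk")
```

The text is identical in both cases; only the byte representation changes, which is why the conversion has to be done explicitly whenever the program's own encoding is not UTF-8.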

Explanation: If a field in a Protocol Buffers .proto file has the string type, it holds UTF-8 in C++, while in Java and Python it must be given Unicode. If your data is in some other form, you have to convert it manually. For example, in Python:

The data is UTF-8, but the Python version of Protocol Buffers requires Unicode. What should you do?

cont = msgcontent()
cont.strcont = data.decode('utf-8')  # must be Unicode, so decode from UTF-8
buff = cont.SerializeToString()  # serialize the message to a byte string

Similarly, after parsing is complete, the string fields are Unicode:

cont = msgcontent()
cont.ParseFromString(buff)
data = cont.strcont  # this is Unicode
data = data.encode('utf-8')  # encode from Unicode into whatever encoding you need
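
The two conversions can also be wrapped in small helpers so that callers do not have to remember which direction applies. The following is a minimal sketch assuming Python 3, where str is the Unicode type and bytes holds encoded data. The names to_text and to_bytes are hypothetical, and no generated protobuf class is involved; field_value merely stands for what would be assigned to a string field.

```python
def to_text(value, encoding="utf-8"):
    """Return a Unicode str, decoding only when given bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return encoded bytes, encoding only when given a str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


data = "héllo".encode("utf-8")         # UTF-8 bytes, e.g. read from a file or socket
field_value = to_text(data)            # Unicode, suitable for a string field
round_tripped = to_bytes(field_value)  # back to UTF-8 after parsing
```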

 

The main drawback is that this is troublesome, and I suspect many people feel the same. For example, if the data in your database, in your files, and even on the wire is all UTF-8, then every message has to be converted to Unicode before serialization and converted back to UTF-8 after parsing. In that case you can declare the field as bytes instead, which needs no conversion at all.
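
One caveat with the bytes approach: a bytes field accepts arbitrary byte sequences, so nothing checks that the payload really is UTF-8. If consumers of the message will treat it as text, a check along the following lines can be run before storing the data. This is a sketch assuming Python 3, and is_valid_utf8 is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_payload = "中文".encode("utf-8")  # valid UTF-8
gbk_payload = "中文".encode("gbk")     # same text, but not valid UTF-8
```

Here is_valid_utf8(utf8_payload) is true and is_valid_utf8(gbk_payload) is false, so mis-encoded data is caught before it is written into the message.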

Final supplement: I think Kenton Varda's statement on this issue deserves one more remark. In fact, Java and Python programs can handle strings in any encoding, not only Unicode. For example, in Python 2, if the source file declares # -*- coding: utf-8 -*-, all string literals typed into the code are UTF-8. If the data read from files, read from the database, and received from the network is also UTF-8, the whole pipeline uses one encoding, which is the most convenient arrangement, and automatic conversion really just adds an extra step.

 
