Google Protocol Buffers character encoding (C++/Java/Python)

Last Update:2018-12-05 Source: Internet

Author: User

I wrote about the UTF-8 problem in Google Protocol Buffers last time. Here is the description given by Kenton Varda, the Protocol Buffers author:

C++ protocol buffers use UTF-8 for all text encoding, regardless of platform. If you want to use some other encoding in your code, you will have to manually convert between that and UTF-8 when interacting with protocol buffers.

In Java and Python everything is taken care of automatically, since these languages have built-in Unicode support. In Java, protocol buffers uses String objects (which are Unicode) to represent strings, and in Python you can use the "unicode" builtin type for Unicode.

Translation: In C++, UTF-8 is used for all text encoding, regardless of platform. If you want to use another encoding, you have to convert to and from UTF-8 manually. In Java and Python, the encoding conversion is handled automatically, because both languages have built-in Unicode support. In Java, Protocol Buffers uses String objects (which are Unicode) to represent strings; in Python, you can use the built-in unicode type for Unicode strings.
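To make "UTF-8 for all text encoding" concrete, here is a minimal sketch that hand-encodes a single string field the way the Protocol Buffers wire format lays it out: a key, a byte length, then the UTF-8 bytes. It does not use the protobuf library; the helper names `encode_varint` and `encode_string_field` and the field number 1 are made up for illustration, and the code is written for Python 3.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Hand-encode one string field: key, byte length, UTF-8 payload."""
    payload = text.encode("utf-8")   # the text always goes out as UTF-8
    key = (field_number << 3) | 2    # wire type 2 = length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


wire = encode_string_field(1, "你好")
print(wire.hex())  # 0a06e4bda0e5a5bd
```

The two characters occupy six bytes on the wire, which is their UTF-8 form, whatever encoding the surrounding program happens to use elsewhere.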

Explanation: If a field in the .proto file has the string type, its value is UTF-8 in C++ and must be Unicode in Java and Python. If your data is not already in that form, you have to convert it manually. For example, in Python:

The data is UTF-8, but the Python version of Protocol Buffers requires Unicode. What should you do?

cont = MsgContent()
cont.strcont = data.decode('utf-8')  # must be decoded from UTF-8 to Unicode
buff = cont.SerializeToString()  # serialize the message to a string

Similarly, after parsing is completed, the string fields are Unicode:

cont = MsgContent()
cont.ParseFromString(buff)
data = cont.strcont  # this is Unicode
data = data.encode('utf-8')  # encode from Unicode into whatever encoding you need
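The snippets above are written in Python 2 style, where a plain string can be decoded. As a rough Python 3 sketch of the same conversions, without the protobuf library: `bytes` holds encoded data and `str` is Unicode text, which is what a string field expects. The helper names `to_text` and `from_text` and the GBK sample are assumptions made for illustration.

```python
def to_text(raw: bytes, encoding: str) -> str:
    """Decode raw bytes in a known encoding into Unicode text."""
    return raw.decode(encoding)


def from_text(text: str, encoding: str) -> bytes:
    """Encode Unicode text into whatever encoding the consumer needs."""
    return text.encode(encoding)


# Stand-in for data that arrives from a GBK-encoded source.
gbk_data = "编码".encode("gbk")

text = to_text(gbk_data, "gbk")        # value you would assign to a string field
utf8_data = from_text(text, "utf-8")   # what ends up in the serialized message

print(len(gbk_data), len(utf8_data))   # 4 6
```

The same text takes a different byte sequence in each encoding, which is why the source encoding has to be known before the data can be handed to a string field.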

The main point is that this is tedious, and I suspect many people find it so. For example, if the data is UTF-8 in the database, in files, and even in transit, it has to be converted to Unicode every time before serialization, and converted back to UTF-8 after parsing. If you declare the field as bytes instead, no processing is needed.
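Why a bytes field avoids the round trip can be sketched without the protobuf library. On the wire, string and bytes fields are both length-delimited, so when the data is already valid UTF-8 the two paths produce identical output and the decode/encode step changes nothing. The helper names and the field number 1 below are hypothetical, and the code is written for Python 3.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_length_delimited(field_number: int, payload: bytes) -> bytes:
    """Encode a length-delimited field (used by both string and bytes)."""
    key = (field_number << 3) | 2    # wire type 2 = length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


utf8_from_db = "数据".encode("utf-8")  # stand-in for data read from storage

# string-field path: decode to text first, then the text is encoded as UTF-8
text = utf8_from_db.decode("utf-8")
as_string_field = encode_length_delimited(1, text.encode("utf-8"))

# bytes-field path: the payload is passed through untouched
as_bytes_field = encode_length_delimited(1, utf8_from_db)

print(as_string_field == as_bytes_field)  # True
```

The trade-off is that a bytes field carries no promise about its contents: readers in other languages receive raw bytes and have to know, by convention, that they are UTF-8.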

Final note: I do not think Kenton Varda's description of this issue is entirely accurate. In fact, Java and Python strings can hold data in any encoding, not only Unicode. For example, in Python, if the source file begins with # -*- coding: utf-8 -*-, all string literals typed in the code are UTF-8. If the data read from files, read from the database, and received from the network is also UTF-8, the whole pipeline is consistent, which is the most convenient arrangement, and the automatic conversion is really just an unnecessary extra step.

This article is an English version of an article originally published in Chinese on aliyun.com.
