Garbled Chinese Characters in JNI and C++ Communication

Source: Internet
Author: User

First, we need to clarify several basic concepts about encoding:

    • Java represents strings internally as 16-bit Unicode (UTF-16). English and Chinese characters alike take 2 bytes each (characters outside the Basic Multilingual Plane take 4);
    • JNI's UTF functions represent strings in UTF-8 (strictly, the JVM's modified UTF-8), a variable-length Unicode encoding. An ASCII character takes 1 byte and a Chinese character generally takes 3 bytes;
    • C/C++ works with raw bytes. An ASCII character is one byte. Chinese characters are commonly GB2312-encoded (for example on Chinese-locale Windows), which uses two bytes per Chinese character.
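To make the byte counts above concrete, the sketch below spells out the single character U+4E2D (the first character of the word for "Chinese") as raw bytes in each of the three encodings. The array names are illustrative only; the byte values are the standard encodings of that code point.

```cpp
// The single character U+4E2D written as raw bytes in three encodings.
// The array names are illustrative and not part of any API.
constexpr unsigned char kUtf16le[] = {0x2D, 0x4E};        // UTF-16, little-endian: one 16-bit unit
constexpr unsigned char kUtf8[]    = {0xE4, 0xB8, 0xAD};  // UTF-8: three bytes
constexpr unsigned char kGb2312[]  = {0xD6, 0xD0};        // GB2312: two bytes

// The sizes match the figures quoted in the list above.
static_assert(sizeof(kUtf16le) == 2, "UTF-16 uses 2 bytes for this character");
static_assert(sizeof(kUtf8) == 3, "UTF-8 uses 3 bytes for this character");
static_assert(sizeof(kGb2312) == 2, "GB2312 uses 2 bytes for this character");
```

The same text therefore has three different byte representations, which is why bytes produced under one encoding appear garbled when read under another.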

JNI Chinese string processing

The two directions, Java --> C++ and C++ --> Java, are analyzed separately.

    • Java --> C++

In this direction, Java passes a UTF-16 string and the JVM hands it to the native code as a jstring. JNI provides two functions for reading it: GetStringUTFChars, which returns the string in UTF-8, and GetStringChars, which returns it in UTF-16. Whichever function is used, if the string contains Chinese characters and the C/C++ code expects GB2312, the result must be converted further to GB2312.

                 String
                (UTF-16)
                   |
[Java]             |
-------------------+------- JNI call
[C++]              |
                   v
                jstring
                (UTF-16)
                   |
         +---------+---------+
         |                   |
   GetStringChars     GetStringUTFChars
         |                   |
         v                   v
      jchar *              char *
      (UTF-16)             (UTF-8)

    • C/C++ --> Java

In this direction, the C/C++ code is responsible for first converting the string to UTF-8 or UTF-16, then wrapping it in a jstring with NewStringUTF or NewString, and returning it to Java.

 
                 String
                (UTF-16)
                   ^
                   |
[Java]             |
-------------------+------- JNI return
[C++]              |
                jstring
                (UTF-16)
                   ^
                   |
         +---------+---------+
         |                   |
     NewString          NewStringUTF
         ^                   ^
         |                   |
      jchar *              char *
      (UTF-16)             (UTF-8)

 

If the string contains only standard ASCII characters and no Chinese characters, GetStringUTFChars/NewStringUTF are sufficient on their own: for such strings the UTF-8 and ASCII encodings are identical, so no conversion is required.
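A minimal sketch of how native code might decide whether conversion can be skipped. The helper name isPureAscii is hypothetical and not part of JNI; it only checks that every byte falls in the 7-bit ASCII range.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns true when every byte is 7-bit ASCII (0x00-0x7F).
// For such strings the ASCII bytes and the UTF-8 bytes are identical.
bool isPureAscii(const std::string &s) {
    for (unsigned char ch : s) {
        if (ch > 0x7F)
            return false;
    }
    return true;
}

// Usage: the first string could be handed to NewStringUTF unchanged;
// the second holds GB2312 bytes and would need converting to UTF-8 first.
void isPureAsciiExample() {
    assert(isPureAscii("hello"));
    assert(!isPureAscii("\xD6\xD0\xCE\xC4"));
}
```

This check says nothing about which encoding a non-ASCII string uses. It only identifies the case where no conversion is needed.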

However, if a string contains Chinese characters, the C/C++ side must convert between encodings. Two conversion functions are needed: one from UTF-8/UTF-16 to GB2312, and one from GB2312 to UTF-8/UTF-16.

 

Note that both Linux and Win32 support wchar_t. On Win32, wchar_t is 16 bits wide and holds UTF-16; on Linux with glibc it is 32 bits wide and holds UTF-32, so only on Win32 does it match the JVM's UTF-16 directly. In theory, a C/C++ program that used wide characters throughout would not need this conversion at all. In practice, char cannot be completely replaced by wchar_t, so most applications still need the conversion.
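Because the width of wchar_t differs between platforms, portable native code can hold UTF-16 data in char16_t instead. The sketch below is an illustration with made-up constant names; it assumes a C++11 compiler on a mainstream platform where char16_t is 2 bytes, the same width as JNI's jchar.

```cpp
// True on Windows (2-byte wchar_t); false with glibc on Linux (4-byte wchar_t).
constexpr bool kWcharMatchesUtf16 = (sizeof(wchar_t) == sizeof(char16_t));

// char16_t holds UTF-16 code units regardless of how wide wchar_t is.
// Two Chinese characters from the Basic Multilingual Plane -> two code units,
// plus the terminating zero that the literal adds.
constexpr char16_t kSample[] = u"\u4E2D\u6587";

static_assert(sizeof(char16_t) == 2, "assumed: char16_t is 16 bits on this platform");
static_assert(sizeof(kSample) / sizeof(kSample[0]) == 3, "two code units plus terminator");
```

Code that casts a jchar buffer to wchar_t * is therefore only correct where kWcharMatchesUtf16 is true.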

Both Linux and Win32 provide their own conversion functions. For example, glibc's mbstowcs can convert GB2312 to wide characters. However, such support is platform-dependent (which encodings are available and how wchar_t is represented are not fixed by the C/C++ standards), is not comprehensive (mbstowcs only converts between the current locale's multibyte encoding and wchar_t, so it cannot convert directly between two arbitrary encodings such as GB2312 and UTF-8), and is not independent of the environment (on Linux, the behavior of mbstowcs depends on the locale setting). The iconv library is therefore recommended for the conversion.

iconv is a free, standalone encoding conversion library that supports many platforms and many encodings (it handles almost all character encodings in common use), and its conversions do not depend on the locale. iconv is available by default on most *nix platforms; on Win32 it must be installed separately.

The following example converts a GB2312-encoded string to UTF-8.

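#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

// Converts a GB2312-encoded string to UTF-8.
// dst is a caller-supplied buffer. On entry *nout is its capacity in bytes;
// on success *nout is the number of bytes written (excluding the final NUL).
// Returns dst on success, NULL on failure.
char *bytesToUtf8(const std::string &src, char *dst, int *nout)
{
    if (*nout <= 0)
        return NULL;
    size_t capacity = *nout;
    size_t n_in = src.length();
    size_t n_out = capacity - 1;          // reserve one byte for the terminating NUL

    // iconv_open takes the target encoding first, then the source encoding.
    iconv_t c = iconv_open("UTF-8", "GB2312");
    if (c == (iconv_t)-1) {
        std::cerr << strerror(errno) << std::endl;
        return NULL;
    }

    // iconv takes a non-const char **, so work on a writable copy of the input.
    char *inbuf = new char[n_in + 1];
    strcpy(inbuf, src.c_str());
    memset(dst, 0, capacity);

    char *in = inbuf;
    char *out = dst;
    if (iconv(c, &in, &n_in, &out, &n_out) == (size_t)-1) {
        std::cerr << strerror(errno) << std::endl;
        out = NULL;
    } else {
        *nout = (int)(out - dst);         // bytes actually written
        out = dst;
    }

    iconv_close(c);
    delete[] inbuf;
    return out;
}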
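The example above converts in one direction only and relies on the caller to size the output buffer. As a hedged sketch of the second conversion function mentioned earlier (UTF-8 back to GB2312), the helper below takes the encoding names as parameters and grows its output as needed. The names convertEncoding and roundTripExample are hypothetical; the sketch assumes an iconv implementation that recognizes the names "GB2312" and "UTF-8" (glibc does) and that both encodings are stateless, so no final flush call is made.

```cpp
#include <iconv.h>
#include <cassert>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
// Throws std::runtime_error if the conversion cannot be opened or fails.
std::string convertEncoding(const std::string &input, const char *from, const char *to)
{
    iconv_t cd = iconv_open(to, from);    // target encoding comes first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string output;
    char buffer[256];
    // iconv does not modify the input bytes, but its signature is non-const.
    char *in = const_cast<char *>(input.data());
    size_t inLeft = input.size();

    while (inLeft > 0) {
        char *out = buffer;
        size_t outLeft = sizeof(buffer);
        size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        int err = errno;                  // save errno before any other call
        output.append(buffer, sizeof(buffer) - outLeft);
        // E2BIG only means the chunk buffer filled up; loop and continue.
        if (rc == (size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return output;
}

// Usage: two Chinese characters, GB2312 -> UTF-8 and back again.
void roundTripExample()
{
    const std::string gb = "\xD6\xD0\xCE\xC4";
    const std::string utf8 = convertEncoding(gb, "GB2312", "UTF-8");
    assert(convertEncoding(utf8, "UTF-8", "GB2312") == gb);
}
```

Returning a std::string avoids the caller-managed buffer, at the cost of throwing on failure instead of returning NULL.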

 

Additional Notes:

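1. JNI exposes two families of string functions, one for UTF-16 and one for UTF-8. The UTF-8 family (NewStringUTF/GetStringUTFChars/ReleaseStringUTFChars) hands native code char * data directly, so it is usually the more convenient choice for C/C++ code. Note that these functions use the JVM's modified UTF-8, which matches standard UTF-8 for ASCII (except U+0000) and for Chinese characters in the Basic Multilingual Plane, but not for supplementary characters.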

2. If the iconv library is used, the settings of the runtime environment do not affect the conversion itself. However, how the surrounding Java program interprets strings depends on the locale of the runtime environment. Setting the correct locale therefore matters little to the JNI code, but it is still necessary for the system as a whole.

 

The section above describes how to solve the encoding problem with a third-party library. On Windows only, the conversion functions provided by the Windows API can be used instead.

The following function converts a jstring to a char *. It is intended for C++ code that receives a parameter from Java containing Chinese characters. The function converts the encoding from UTF-16 to the system ANSI code page (CP_ACP), so the Chinese characters come through correctly.

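#include <jni.h>
#include <windows.h>
#include <cstdlib>

// Converts a jstring (UTF-16) to a malloc'd string in the ANSI code page.
// Returns NULL on failure. The caller must free() the result.
char *jstringToWindows(JNIEnv *pJNIEnv, jstring jstr)
{
    jsize len = pJNIEnv->GetStringLength(jstr);
    const jchar *jcstr = pJNIEnv->GetStringChars(jstr, NULL);
    if (jcstr == NULL)
        return NULL;
    int size = 0;
    // Two bytes per UTF-16 unit is enough for double-byte code pages such as GBK.
    char *str = (char *)malloc(len * 2 + 1);
    if (str != NULL)
        size = WideCharToMultiByte(CP_ACP, 0, (LPCWSTR)jcstr, len, str, len * 2 + 1, NULL, NULL);
    pJNIEnv->ReleaseStringChars(jstr, jcstr);   // release on every path
    if (size == 0) {                            // allocation or conversion failed
        free(str);
        return NULL;
    }
    str[size] = 0;
    return str;
}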

Note: The char * returned by the function above must be released with free() after use, because the memory is allocated with malloc during the conversion. If it is not released, memory will leak.
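One way to make that release automatic in C++ is to wrap the returned pointer in a std::unique_ptr whose deleter calls free(). The sketch below is only an illustration: mallocCopy is a hypothetical stand-in for the jstring-to-char * converter above (so that the example does not depend on JNI or the Windows API), and the type name MallocedString is likewise made up.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <memory>

// Deleter that calls free(), matching memory obtained from malloc().
struct FreeDeleter {
    void operator()(void *p) const { std::free(p); }
};
typedef std::unique_ptr<char, FreeDeleter> MallocedString;

// Hypothetical stand-in for a converter that returns a malloc'd string.
char *mallocCopy(const char *s)
{
    std::size_t n = std::strlen(s) + 1;
    char *p = static_cast<char *>(std::malloc(n));
    if (p != NULL)
        std::memcpy(p, s, n);
    return p;
}

// Usage: the buffer is freed when `owned` goes out of scope, on every path.
std::size_t lengthOfConverted(const char *s)
{
    assert(s != NULL);
    MallocedString owned(mallocCopy(s));
    return owned ? std::strlen(owned.get()) : 0;
}
```

Using delete on such a pointer would be incorrect, since the memory comes from malloc; the custom deleter keeps the allocation and release functions paired.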

To return Chinese text from C++ to Java, use the following function to convert a char * to a jstring.

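// Converts a string in the ANSI code page to a jstring (UTF-16).
// Requires <jni.h>, <windows.h>, <cstdlib> and <cstring>.
jstring windowsToJstring(JNIEnv *env, const char *str)
{
    jstring rtn = 0;
    int slen = (int)strlen(str);
    unsigned short *buffer = 0;
    if (slen == 0)
        rtn = env->NewStringUTF(str);
    else
    {
        // First call: ask how many UTF-16 code units the result needs.
        int length = MultiByteToWideChar(CP_ACP, 0, (LPCSTR)str, slen, NULL, 0);
        buffer = (unsigned short *)malloc(length * 2 + 1);
        // Second call: perform the conversion into the buffer.
        if (buffer != NULL && MultiByteToWideChar(CP_ACP, 0, (LPCSTR)str, slen, (LPWSTR)buffer, length) > 0)
            rtn = env->NewString((jchar *)buffer, length);
    }
    if (buffer)
        free(buffer);
    return rtn;
}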

With the data-flow analysis above and the conversion functions that follow it, garbled Chinese parameters in JNI can largely be resolved.

The content above was collected from the Internet and summarized by the author.
