First, a few basic facts about encoding:
Java represents strings internally as 16-bit Unicode (UTF-16). Every character in the Basic Multilingual Plane, English or Chinese, occupies 2 bytes; supplementary characters occupy 4 bytes as a surrogate pair.
The JNI UTF functions represent strings in modified UTF-8, a variable-length Unicode encoding. An ASCII character generally takes 1 byte and a Chinese character 3 bytes.
C/C++ works with raw bytes, and an ASCII character is one byte. On Chinese-language systems, Chinese text in char strings is typically GB2312 encoded, with two bytes per Chinese character.
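To make these byte counts concrete, here is a minimal sketch that computes how many bytes a Unicode code point occupies in standard UTF-8 and in UTF-16. The function names are illustrative only and are not part of JNI. GB2312 sizes cannot be derived from the code point this way, because GB2312 is a table-based encoding; for reference, the character 中 (U+4E2D) is the two bytes D6 D0 in GB2312.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes a Unicode code point occupies in standard UTF-8.
std::size_t Utf8Length(std::uint32_t cp)
{
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;   // most Chinese characters fall here
    return 4;                     // supplementary planes
}

// Number of bytes a Unicode code point occupies in UTF-16.
std::size_t Utf16Length(std::uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;  // a surrogate pair is needed above the BMP
}
```

For example, Utf8Length(0x4E2D) is 3 and Utf16Length(0x4E2D) is 2, which matches the sizes listed above for a Chinese character.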
JNI Chinese string processing
There are two directions to analyze: Java --> C++ and C++ --> Java.
Java --> C++
In this direction, Java calls the native method with a UTF-16 encoded string, the JVM passes the argument to JNI, and the C++ side receives it as a jstring. JNI provides two functions for reading it. GetStringUTFChars returns a UTF-8 encoded string, and GetStringChars returns a UTF-16 encoded string. Whichever function is used, if the string contains Chinese characters and the native code works in GB2312, the result must be converted further to GB2312.
                 String
                (UTF-16)
                    |
[Java]              |
-------------------- JNI call
[Cpp]               |
                    V
                 jstring
                (UTF-16)
                    |
          +---------+---------+
          |                   |
   GetStringChars     GetStringUTFChars
          |                   |
          V                   V
       jchar *             char *
      (UTF-16)             (UTF-8)
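When the UTF-16 branch is taken, the code units still have to be turned into a byte encoding before they can be used as a char string. The following is a minimal sketch of a UTF-16 to standard UTF-8 conversion. It assumes std::uint16_t as a stand-in for jchar so that it compiles without jni.h, and the function name Utf16ToUtf8 is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Converts UTF-16 code units (the layout GetStringChars exposes) to standard
// UTF-8. std::uint16_t stands in for jchar in this sketch.
// Unpaired surrogates are encoded as three bytes, which is not valid UTF-8;
// production code should reject or replace them.
std::string Utf16ToUtf8(const std::uint16_t *units, std::size_t count)
{
    std::string out;
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t cp = units[i];
        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < count &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

In a real native method the pointer and count would come from GetStringChars and GetStringLength, and the buffer would be handed back with ReleaseStringChars once the conversion is done.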
C/C++ --> Java
For a string that JNI returns to Java, the C/C++ code is responsible for first converting it to UTF-8 or UTF-16, and then wrapping it in a jstring with NewStringUTF or NewString before returning it to Java.
                 String
                (UTF-16)
                    ^
                    |
[Java]              |
-------------------- JNI return
[Cpp]               |
                 jstring
                (UTF-16)
                    ^
                    |
          +---------+---------+
          ^                   ^
          |                   |
      NewString          NewStringUTF
          |                   |
       jchar *             char *
      (UTF-16)             (UTF-8)
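NewStringUTF expects modified UTF-8, which matches standard UTF-8 for ASCII and for Chinese characters in the Basic Multilingual Plane but not for supplementary characters. When the native side holds standard UTF-8 that may contain such characters, one option is to decode it to UTF-16 and use NewString instead. The following is a minimal sketch of that decoding step; std::uint16_t stands in for jchar and the function name Utf8ToUtf16 is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes standard UTF-8 into UTF-16 code units suitable for NewString.
// Returns false on truncated input or invalid lead/continuation bytes.
// Overlong forms are not rejected in this sketch.
bool Utf8ToUtf16(const std::string &in, std::vector<std::uint16_t> &out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t extra;   // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return false;                         // invalid lead byte
        if (in.size() - i <= extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF) return false;
        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            cp -= 0x10000;                         // split into a surrogate pair
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return true;
}
```

In a native method, the resulting units and their count would then be passed to NewString, with the pointer cast to const jchar *.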
If the string contains no Chinese characters, only standard ASCII, then GetStringUTFChars/NewStringUTF is all that is needed: for ASCII text the UTF-8 and ASCII encodings are identical, so no conversion is required.
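A simple way to decide whether the conversion step can be skipped is to check that every byte is 7-bit ASCII. The helper below is a small sketch; the name IsPureAscii is hypothetical.

```cpp
#include <cstddef>

// Returns true when every byte is 7-bit ASCII. Such bytes are identical in
// ASCII, UTF-8 and GB2312, so no encoding conversion is needed.
bool IsPureAscii(const char *s, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i) {
        if (static_cast<unsigned char>(s[i]) >= 0x80) return false;
    }
    return true;
}
```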
However, if a string contains Chinese characters, the C/C++ side must convert the encoding. Two conversion functions are needed: one from UTF-8/UTF-16 to GB2312, and one from GB2312 to UTF-8/UTF-16.
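The pair of conversion functions can be sketched on top of a single iconv-based helper. This is a minimal sketch under a few assumptions: the iconv implementation accepts the encoding names "UTF-8" and "GB2312", its second argument is declared as char ** (as in glibc), and the names ConvertEncoding, Utf8ToGb2312 and Gb2312ToUtf8 are hypothetical.

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>
#include <vector>

// Converts `input` from encoding `from` to encoding `to` using iconv.
// Throws std::runtime_error if the conversion cannot be performed.
// Both encodings used here are stateless, so no final flush call is made.
std::string ConvertEncoding(const char *from, const char *to, const std::string &input)
{
    iconv_t cd = iconv_open(to, from);   // note: the target encoding comes first
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::vector<char> in(input.begin(), input.end());
    in.push_back('\0');                  // keeps &in[0] valid for empty input
    char *src = &in[0];
    size_t src_left = input.size();

    std::string result;
    char buf[256];
    while (src_left > 0) {
        char *dst = buf;
        size_t dst_left = sizeof(buf);
        size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        int err = errno;                 // capture before any other call
        result.append(buf, sizeof(buf) - dst_left);
        if (rc == (size_t)-1 && err != E2BIG) {  // E2BIG only means "buffer full"
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return result;
}

std::string Utf8ToGb2312(const std::string &s) { return ConvertEncoding("UTF-8", "GB2312", s); }
std::string Gb2312ToUtf8(const std::string &s) { return ConvertEncoding("GB2312", "UTF-8", s); }
```

Returning std::string and converting in fixed-size chunks avoids having to guess the output buffer size in advance. On platforms where iconv is a separate library, the program also has to be linked against it.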
Note that both Linux and Win32 support wchar_t, but its width differs: on Win32 it is a 16-bit UTF-16 code unit, while on Linux with glibc it is 32 bits wide. If a Win32 C/C++ program used wchar_t throughout, this kind of conversion would in theory be unnecessary. In practice char cannot be completely replaced with wchar_t, so most applications still need the conversion.
Linux and Win32 each provide their own conversion functions. For example, glibc's mbstowcs can convert GB2312 to wide characters. Such support is, however, platform dependent (the C/C++ standard does not fix the encoding of wchar_t), not comprehensive (for example, mbstowcs offers no direct GB2312 to UTF-8 conversion), and not independent of the environment (on Linux the behavior of mbstowcs depends on the locale settings). We therefore recommend the iconv library for the conversion.
iconv is a free, standalone encoding conversion library that supports many platforms and many encodings (it handles almost every character encoding in common use). Because the source and target encodings are named explicitly, its behavior does not depend on the locale. iconv is available by default on *nix platforms; on Win32 it must be installed separately.
The following example converts a GB2312 encoded string to UTF-8.
#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Converts GB2312 to UTF-8. dst must hold *nout bytes; on success *nout
// receives the number of bytes written, excluding the terminating NUL.
char *BytesToUtf8(std::string src, char *dst, int *nout)
{
    if (*nout <= 0)
        return NULL;
    iconv_t c = iconv_open("UTF-8", "GB2312");
    if (c == (iconv_t)-1) {
        std::cerr << strerror(errno) << std::endl;
        return NULL;
    }
    // iconv needs a writable input buffer.
    std::vector<char> inbuf(src.begin(), src.end());
    inbuf.push_back('\0');
    size_t n_in = src.length();
    size_t n_out = *nout - 1;   // keep one byte for the terminating NUL
    memset(dst, 0, *nout);
    char *in = &inbuf[0];
    char *out = dst;
    char *result = dst;
    if (iconv(c, &in, &n_in, &out, &n_out) == (size_t)-1) {
        std::cerr << strerror(errno) << std::endl;
        result = NULL;
    } else {
        *nout = (int)(out - dst);
    }
    iconv_close(c);
    return result;
}
Additional notes:
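1. At the interface level, JNI provides two families of string functions, UTF-16 and UTF-8. The UTF-8 family (NewStringUTF/GetStringUTFChars/ReleaseStringUTFChars) works directly with char *, so it is usually the more convenient choice in C/C++. Note that it uses modified UTF-8, which matches standard UTF-8 for ASCII and for Chinese characters in the Basic Multilingual Plane but differs for U+0000 and for supplementary characters.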
2. When the iconv library is used, the runtime environment settings do not affect the encoding conversion. However, how the outer Java program decodes and encodes strings depends on the locale of the runtime environment. Setting the correct locale therefore matters little to the JNI code, but it is still necessary for the system as a whole.
The sections above describe how to solve the encoding problem with a third-party library. On Windows only, the conversion functions that Windows itself provides can be used instead.
The following function converts a jstring to char *. It is mainly used when C++ receives a parameter from Java that contains Chinese characters. The encoding is converted along the way, so the Chinese characters come through correctly.
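#include <windows.h>
#include <jni.h>
#include <stdlib.h>

// Converts a jstring to a malloc'd string in the system ANSI code page.
char *JStringToWindows(JNIEnv *pJNIEnv, jstring jstr)
{
    jsize len = pJNIEnv->GetStringLength(jstr);
    const jchar *jcstr = pJNIEnv->GetStringChars(jstr, NULL);
    if (jcstr == NULL)
        return NULL;
    int size = 0;
    char *str = (char *)malloc(len * 2 + 1);
    if (str != NULL && len > 0)
        size = WideCharToMultiByte(CP_ACP, 0, (LPCWSTR)jcstr, len, str, len * 2, NULL, NULL);
    // Release the JNI buffer on every path, including failure.
    pJNIEnv->ReleaseStringChars(jstr, jcstr);
    if (str == NULL || (len > 0 && size == 0)) {
        free(str);
        return NULL;
    }
    str[size] = 0;
    return str;
}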
Note: the char * returned by the function above must be released with free() after use, because the memory is allocated with malloc during the conversion. Do not use delete on it. If it is not released, memory will leak.
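One way to make sure the buffer is always released is to hand it to a smart pointer whose deleter calls free(). The following is a minimal C++11 sketch. DuplicateWithMalloc is a hypothetical stand-in for any function that returns a malloc'd string, so that the example compiles without the Windows and JNI headers; the other names are illustrative as well.

```cpp
#include <cstdlib>
#include <cstring>
#include <memory>

// Deleter that releases malloc'd memory with free(), never delete.
struct FreeDeleter {
    void operator()(char *p) const { std::free(p); }
};
typedef std::unique_ptr<char, FreeDeleter> MallocedString;

// Hypothetical stand-in for a function that returns a malloc'd,
// NUL-terminated buffer owned by the caller.
char *DuplicateWithMalloc(const char *s)
{
    std::size_t n = std::strlen(s) + 1;
    char *copy = static_cast<char *>(std::malloc(n));
    if (copy != NULL) std::memcpy(copy, s, n);
    return copy;
}

// The buffer is freed automatically when `owned` goes out of scope,
// on every return path.
std::size_t LengthOfOwnedCopy(const char *s)
{
    MallocedString owned(DuplicateWithMalloc(s));
    return owned ? std::strlen(owned.get()) : 0;
}
```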
To return Chinese text from C++ to Java, use the following function to convert char * to jstring.
#include <windows.h>
#include <jni.h>
#include <stdlib.h>
#include <string.h>

// Converts a string in the system ANSI code page to a jstring.
jstring WindowsTojstring(JNIEnv *env, char *str)
{
    jstring rtn = 0;
    int slen = (int)strlen(str);
    unsigned short *buffer = 0;
    if (slen == 0)
        rtn = env->NewStringUTF(str);
    else {
        // The first call only asks for the required length in UTF-16 units.
        int length = MultiByteToWideChar(CP_ACP, 0, (LPCSTR)str, slen, NULL, 0);
        buffer = (unsigned short *)malloc(length * sizeof(unsigned short));
        if (buffer != NULL &&
            MultiByteToWideChar(CP_ACP, 0, (LPCSTR)str, slen, (LPWSTR)buffer, length) > 0)
            rtn = env->NewString((const jchar *)buffer, length);
    }
    if (buffer)
        free(buffer);
    return rtn;
}
With the data flow analysis above and the conversion functions that follow it, garbled Chinese parameters in JNI can largely be resolved.