To be clear up front: this post is about String in Java. I have decided to move to C++, but I ran into a problem today, so let's take a look. String is defined as follows:
    public final class String
    {
        private final char value[];  // the stored characters
        private final int offset;    // where this string starts in value
        private final int count;     // number of characters
        private int hash;            // cached hash value
        ......
    }
You can inspect the stored values in the debugger.
A few notes: if hashCode() has not been called yet, hash is 0. It is easy to see that value is the char array that actually holds the string (here, "string test"). And what does each char hold? That is easy to verify: a Unicode value (one UTF-16 code unit).
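The claim about what each char holds can be checked without a debugger. The following is a minimal sketch; the class name CharUnits and the helper codeUnits are made up for this illustration, and the Chinese character is written as a Unicode escape so that the result does not depend on the encoding of the source file.

```java
// Hypothetical demo class; not part of the JDK or of the original post.
class CharUnits {
    // Returns the UTF-16 code units of s, formatted as U+ followed by four hex digits.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The second character is 字, given as an escape sequence.
        System.out.println(codeUnits("A\u5B57")); // U+0041 U+5B57
    }
}
```

Each char is a 16-bit value, so a character outside the Basic Multilingual Plane would show up here as two code units (a surrogate pair).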
From this we can also guess how the familiar substring is implemented. If we were writing it ourselves, we would let the new string use the same value (char array) and change only offset and count. That saves both space and time (no copying needed), and in the source examined here that is exactly what happens:
    public String substring(int beginIndex) {
        return substring(beginIndex, count);
    }
    public String substring(int beginIndex, int endIndex) {
        ......
        return ((beginIndex == 0) && (endIndex == count)) ? this :
            new String(offset + beginIndex, endIndex - beginIndex, value);
    }
    // Package-private constructor: shares the array instead of copying it.
    String(int offset, int count, char value[]) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }
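The sharing idea can be reproduced outside the JDK. The sketch below uses a made-up class, SharedString, with the same three fields; it is an illustration of the technique under that assumption, not the JDK's own code, and sharesArrayWith exists only so the demo can show that no copy was made.

```java
// Hypothetical class for illustration; not the JDK's String.
final class SharedString {
    private final char[] value; // backing array, shared between instances
    private final int offset;   // index of the first character in value
    private final int count;    // number of characters

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // No copying: the result views the same array through a new window.
    SharedString substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        if (beginIndex == 0 && endIndex == count) {
            return this;
        }
        return new SharedString(value, offset + beginIndex, endIndex - beginIndex);
    }

    // Demo helper: true when both objects use the very same array.
    boolean sharesArrayWith(SharedString other) {
        return value == other.value;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        SharedString s = new SharedString("string test");
        SharedString sub = s.substring(7, 11);
        System.out.println(sub);                    // test
        System.out.println(sub.sharesArrayWith(s)); // true
    }
}
```

The trade-off of this design is that a short substring keeps the whole original array reachable, which is one reason newer JDK releases copy the characters instead.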
Since we are discussing strings, what encoding does the JVM use by default? Stepping through in the debugger shows:
    public static Charset defaultCharset() {
        if (defaultCharset == null) {
            synchronized (Charset.class) {
                // Read the file.encoding system property
                java.security.PrivilegedAction pa = new GetPropertyAction("file.encoding");
                String csn = (String) AccessController.doPrivileged(pa);
                Charset cs = lookup(csn);
                if (cs != null)
                    defaultCharset = cs;
                else
                    defaultCharset = forName("UTF-8"); // unknown name: fall back
            }
        }
        return defaultCharset;
    }
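The fallback in the method above can be imitated with public API only. This is a rough sketch: the class DefaultCharsetDemo and the helper resolve are hypothetical names, and resolve merely mirrors the "use the named charset if it exists, otherwise UTF-8" behaviour rather than reproducing the JDK internals.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; resolve() only imitates the fallback shown above.
class DefaultCharsetDemo {
    // Returns the named charset, or UTF-8 when the name is missing or unknown.
    static Charset resolve(String csn) {
        try {
            if (csn != null && Charset.isSupported(csn)) {
                return Charset.forName(csn);
            }
        } catch (IllegalArgumentException e) {
            // Syntactically illegal charset name: fall through to UTF-8.
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // What the running JVM actually picked as its default.
        System.out.println(Charset.defaultCharset());
        // An unknown name falls back to UTF-8.
        System.out.println(resolve("abc"));
    }
}
```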
The value of defaultCharset can be set with the JVM option:
    -Dfile.encoding=utf-8
You can set it to "abc" if you like, but since no such charset exists, the default then falls back to UTF-8. The actual value can be checked with System.getProperty("file.encoding"). Why look at defaultCharset at all? Because what goes over the network is a byte array, and different encodings produce different byte arrays from the same string. So we need to know how the encoding is chosen. The method that actually produces the byte array is getBytes, which we look at next; it ultimately calls the encode method of CharsetEncoder, as follows:
    public final CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput) {
        int newState = endOfInput ? ST_END : ST_CODING;
        if ((state != ST_RESET) && (state != ST_CODING) && !(endOfInput && (state == ST_END)))
            throwIllegalStateException(state, newState);
        state = newState;
        for (;;) {
            CoderResult cr;
            try {
                cr = encodeLoop(in, out);
            } catch (BufferUnderflowException x) {
                throw new CoderMalfunctionError(x);
            } catch (BufferOverflowException x) {
                throw new CoderMalfunctionError(x);
            }
            if (cr.isOverflow())
                return cr;
            if (cr.isUnderflow()) {
                if (endOfInput && in.hasRemaining()) {
                    // Leftover input at the end is treated as malformed
                    cr = CoderResult.malformedForLength(in.remaining());
                } else {
                    return cr;
                }
            }
            CodingErrorAction action = null;
            if (cr.isMalformed())
                action = malformedInputAction;
            else if (cr.isUnmappable())
                action = unmappableCharacterAction;
            else
                assert false : cr.toString();
            if (action == CodingErrorAction.REPORT)
                return cr;
            if (action == CodingErrorAction.REPLACE) {
                if (out.remaining() < replacement.length)
                    return CoderResult.OVERFLOW;
                out.put(replacement);
            }
            if ((action == CodingErrorAction.IGNORE) || (action == CodingErrorAction.REPLACE)) {
                // Skip the bad input and keep going
                in.position(in.position() + cr.length());
                continue;
            }
            assert false;
        }
    }
Of course, the CharsetEncoder matching the requested encoding is selected first, and the key point is that different CharsetEncoders implement different encodeLoop methods. You may wonder why there is a for (;;) here. Looking at the package CharsetEncoder lives in (NIO) and at its parameters makes it fairly clear: this function is able to process streams (although our use here never loops).
The encodeLoop method converts as many chars as possible to bytes. Building a new String from bytes is almost exactly the inverse of the process above.
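To see why encode is written to be called repeatedly, it helps to drive a CharsetEncoder by hand with an output buffer that is too small. The following is a sketch under assumptions of my own: the class EncodeLoopDemo, the helper names, and the 4-byte buffer are all invented for the demo, and UTF-8 is used because every JVM ships it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the 4-byte buffer exists only to force overflow.
class EncodeLoopDemo {
    // Encodes text as UTF-8 through a tiny output buffer, draining the
    // buffer every time encode() reports that it is full.
    static byte[] encodeInChunks(String text) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(4); // room for one UTF-8 character
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            CoderResult cr;
            do {
                cr = encoder.encode(in, out, true); // true: this is all the input
                drain(out, sink);
            } while (cr.isOverflow());
            if (cr.isError()) {
                cr.throwException(); // malformed input such as a lone surrogate
            }
            do {
                cr = encoder.flush(out);
                drain(out, sink);
            } while (cr.isOverflow());
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    // Moves whatever the encoder wrote into the sink and empties the buffer.
    private static void drain(ByteBuffer out, ByteArrayOutputStream sink) {
        out.flip();
        while (out.hasRemaining()) {
            sink.write(out.get());
        }
        out.clear();
    }

    public static void main(String[] args) {
        // "A" followed by three Chinese characters: 1 + 3 * 3 bytes in UTF-8.
        byte[] bytes = encodeInChunks("A\u5B57\u7B26\u4E32");
        System.out.println(bytes.length); // 10
    }
}
```

With this input, encode returns OVERFLOW twice before it finishes, which is the streaming case that String.getBytes never runs into because it sizes the output buffer up front.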
Garbled text (mojibake) comes up often in real development, for example:
the file name taken from an uploaded file;
a string passed from JS to the back end.
First, run the following code and look at the results:
    public static void main(String[] args) throws Exception {
        String str = "字符串";
        // -41 -42 -73 -5 -76 -82
        printArray(str.getBytes());
        // -27 -83 -105 -25 -84 -90 -28 -72 -78
        printArray(str.getBytes("utf-8"));
        // ???
        System.out.println(new String(str.getBytes(), "utf-8"));
        // 瀛楃涓?
        System.out.println(new String(str.getBytes("utf-8"), "gbk"));
        // 字符??
        System.out.println(new String("瀛楃涓?".getBytes("gbk"), "utf-8"));
        // -41 -42 -73 -5 63 63
        printArray(new String("瀛楃涓?".getBytes("gbk"), "utf-8").getBytes());
    }
    public static void printArray(byte[] bs) {
        for (int i = 0; i < bs.length; i++) {
            System.out.print(bs[i] + " ");
        }
        System.out.println();
    }
The output is given in the comments in the program (the platform default charset here is GBK, which is what the no-argument getBytes() uses):
Because GBK uses 2 bytes per Chinese character, there are 6 bytes;
Because UTF-8 uses 3 bytes per Chinese character, there are 9 bytes;
Because a byte array produced by GBK cannot be turned back into the string under UTF-8's rules, ??? is displayed;
This is the usual cause of mojibake: GBK can build a string from the bytes UTF-8 produced, just not the one we meant;
Although the string generated above is garbled to us, the computer does not see it that way: getBytes still recovers a byte array from it, and that array is one UTF-8 can recognize;
The last two 63s (?) are presumably filled in by encode (or padded directly when there are not enough bytes; I did not look into this).
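The byte counts in the list above can be reproduced without relying on the platform default by naming both charsets explicitly. This is a small sketch: the class ByteCounts is a made-up name, the test string is the same three Chinese characters written as Unicode escapes, and it assumes the running JVM provides the GBK charset (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; assumes the GBK charset is available in this JVM.
class ByteCounts {
    static final Charset GBK = Charset.forName("GBK");
    static final String STR = "\u5B57\u7B26\u4E32"; // 字符串

    // Encodes with one charset and decodes with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        System.out.println(STR.getBytes(GBK).length);                    // 6
        System.out.println(STR.getBytes(StandardCharsets.UTF_8).length); // 9
        // Matching charsets round-trip cleanly; mismatched ones do not.
        System.out.println(recode(STR, GBK, GBK).equals(STR));                    // true
        System.out.println(recode(STR, GBK, StandardCharsets.UTF_8).equals(STR)); // false
    }
}
```

Passing the charset explicitly is also the simplest way to keep such experiments from changing behaviour between machines with different defaults.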
GBK and UTF-8 encode letters and digits identically, so those characters never come out garbled. Their encodings of Chinese characters, however, really are different, and that is the origin of many problems. Look at the following code:
    new String(new String("我们".getBytes("UTF-8"), "GBK").getBytes("GBK"), "UTF-8");
Obviously the result of this code is "我们", but what use is that to us? First, note that:
    new String("我们".getBytes("UTF-8"), "GBK");
The result of this code is garbled, and a lot of mojibake looks exactly like this. But remember: it is garbled only to us. To the computer there is no such thing as "garbled" or "not garbled". Just when we are about to give up, it can still recover the "backbone" from the mojibake via getBytes("GBK"), and we can then use that backbone to restore the original string.
This piece of code seems to solve the mojibake problem between GBK and UTF-8, but the fix works only in one special case: every run of consecutive Chinese characters must have even length! The reason was given above: UTF-8 uses 3 bytes per character and GBK reads them 2 at a time, so an odd count leaves one byte that cannot be paired and is replaced with ?.
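The even/odd restriction is easy to confirm. The sketch below wraps the double conversion in a helper; the class RoundTrip and the method roundTrip are hypothetical names, the strings are written as Unicode escapes, and the GBK charset is assumed to be available in the running JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; assumes the GBK charset is available in this JVM.
class RoundTrip {
    static final Charset GBK = Charset.forName("GBK");

    // Decodes UTF-8 bytes as GBK by mistake, then tries to undo the mistake.
    static String roundTrip(String s) {
        String garbled = new String(s.getBytes(StandardCharsets.UTF_8), GBK);
        return new String(garbled.getBytes(GBK), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String two = "\u6211\u4EEC"; // 我们: 6 UTF-8 bytes, three GBK pairs
        String one = "\u6211";       // 我: 3 UTF-8 bytes, one byte left over
        System.out.println(roundTrip(two).equals(two)); // true
        System.out.println(roundTrip(one).equals(one)); // false
    }
}
```

In the odd case the unpaired byte is lost when GBK decodes it, so no later conversion can bring the original character back.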
So how do we solve this problem?
The first solution: encodeURI
Why use this method? The reason is simple: GBK and UTF-8 encode %, digits, and letters identically, so an encoded string is guaranteed to come out the same under both encodings after transmission, and decoding it then recovers the original string. Judging by the format of the encoded string, encode and decode are presumably very cheap, so this is a good solution.
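The same property can be checked on the Java side. encodeURI is a JavaScript function; the sketch below uses java.net.URLEncoder and URLDecoder as a stand-in, which differ from encodeURI in some details (for example, spaces become +) but produce the same UTF-8 percent escapes for Chinese characters. The class PercentEncoding and its helpers are hypothetical names, and GBK is assumed to be available in the running JVM.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; URLEncoder stands in for JavaScript's encodeURI.
class PercentEncoding {
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // True when the text turns into the same bytes under GBK and UTF-8.
    static boolean sameBytesInBoth(String s) {
        return Arrays.equals(s.getBytes(Charset.forName("GBK")),
                             s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String original = "\u6211\u4EEC"; // 我们
        String wire = encode(original);
        System.out.println(wire);                          // %E6%88%91%E4%BB%AC
        System.out.println(sameBytesInBoth(wire));         // true
        System.out.println(sameBytesInBoth(original));     // false
        System.out.println(decode(wire).equals(original)); // true
    }
}
```

Because the encoded form is pure ASCII, it no longer matters which of the two charsets an intermediate layer uses to turn it into bytes and back.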
The second solution: use one encoding everywhere
Our application here is built on Webx, so it is enough to set defaultCharset="UTF-8" in webx.xml.