Java string and java string

Source: Internet
Author: User

First, the String class in Java. Although I have decided to switch to C/C++, a problem I ran into today makes it worth a look. String is defined as follows:
public final class String
{
    private final char value[]; // the characters of the string
    private final int offset;   // the start position
    private final int count;    // number of characters
    private int hash;           // the cached hash value
    ......
}

The original post showed a debugger view of these fields for a sample string ("string test") at this point; the image is not preserved.
Note that hash stays 0 until hashCode() has been called. value is the char array that actually holds the characters of the string (here, those of "string test"). And what does each char hold? It is easy to verify: a Unicode value, or more precisely one UTF-16 code unit.
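The claim that each char holds a Unicode (UTF-16) value can be checked with a few lines. This is only a sketch: the class name CharValuesDemo and the helper codeUnits are made up for illustration and are not part of the JDK.

```java
import java.util.StringJoiner;

final class CharValuesDemo {
    /** Returns every char of s as a four-digit hex UTF-16 code unit, space-separated. */
    static String codeUnits(String s) {
        StringJoiner joiner = new StringJoiner(" ");
        for (char c : s.toCharArray()) {
            joiner.add(String.format("%04X", (int) c));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // U+5B57 is a CJK character; it is written as an escape so that the
        // encoding of the source file does not matter.
        System.out.println(codeUnits("\u5B57A")); // 5B57 0041
        // A character outside the Basic Multilingual Plane takes two chars.
        System.out.println(codeUnits(new String(Character.toChars(0x1F600)))); // D83D DE00
    }
}
```

The second line of output shows why "one char" and "one character" are not always the same thing: code points above U+FFFF are stored as a surrogate pair.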
From this we can guess how the commonly used substring is implemented: the new String could share the same value (char array) and change only offset and count. That saves both space and time, because nothing is copied. This is in fact how it works in the JDK source examined here (newer JDKs, from Java 7 update 6 on, copy the characters instead):
public String substring(int beginIndex) {
    return substring(beginIndex, count);
}
public String substring(int beginIndex, int endIndex) {
    ......
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}
// Package-private constructor: shares the char array instead of copying it.
String(int offset, int count, char value[]) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}
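To make the sharing idea concrete outside the JDK, here is a minimal sketch of a slice type that reuses one backing array. SharedSlice and sharesArrayWith are hypothetical names for illustration only; this is not the JDK's implementation.

```java
final class SharedSlice {
    private final char[] value; // backing array, shared between slices
    private final int offset;   // index of the first character in value
    private final int count;    // number of characters in this slice

    SharedSlice(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedSlice(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    /** Returns a view of [beginIndex, endIndex) without copying any characters. */
    SharedSlice substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        return (beginIndex == 0 && endIndex == count)
                ? this
                : new SharedSlice(value, offset + beginIndex, endIndex - beginIndex);
    }

    /** True when both slices are views over the very same array. */
    boolean sharesArrayWith(SharedSlice other) {
        return value == other.value;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        SharedSlice whole = new SharedSlice("hello world");
        SharedSlice part = whole.substring(6, 11);
        System.out.println(part);                        // world
        System.out.println(part.sharesArrayWith(whole)); // true
    }
}
```

The trade-off of this design is that a small slice keeps the whole backing array reachable, so a short substring of a very large string can pin a lot of memory.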

Since we are discussing strings, which encoding does the JVM use by default? Stepping through with a debugger leads to:
public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            // Read the file.encoding system property.
            java.security.PrivilegedAction pa = new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                // Unknown or missing name: fall back to UTF-8.
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

The value of defaultCharset can be set with the JVM option:
-Dfile.encoding=UTF-8
You can also set it to an unrecognized name such as "abc", in which case the lookup fails and UTF-8 is used instead. The current value can be read through System.getProperty("file.encoding"). Why does defaultCharset matter? Because what travels over the network is a byte array, and different encodings can produce different byte arrays for the same string, so we need to know where the encoding comes from. The method that actually produces the byte array is getBytes, which is the focus below. It ends up calling the encode method of CharsetEncoder:
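A short sketch of how to inspect these values and see that the byte array depends on the charset. DefaultCharsetDemo and encodedLength are illustrative names, and the GBK branch assumes the runtime ships that charset, which is why it is guarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetDemo {
    /** Number of bytes s occupies when encoded with the given charset. */
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Both values depend on the JVM version, the platform, and -Dfile.encoding.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        String s = "\u5B57\u7B26\u4E32"; // three Chinese characters, written as escapes
        System.out.println("UTF-8 bytes: " + encodedLength(s, StandardCharsets.UTF_8)); // 9
        if (Charset.isSupported("GBK")) {
            System.out.println("GBK bytes:   " + encodedLength(s, Charset.forName("GBK")));
        }
    }
}
```

Passing the charset explicitly, as encodedLength does, avoids depending on the default at all, which is usually the safer choice for data that leaves the process.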
public final CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput) {
    int newState = endOfInput ? ST_END : ST_CODING;
    if ((state != ST_RESET) && (state != ST_CODING) && !(endOfInput && (state == ST_END)))
        throwIllegalStateException(state, newState);
    state = newState;
    for (;;) {
        CoderResult cr;
        try {
            // The charset-specific work happens here.
            cr = encodeLoop(in, out);
        } catch (BufferUnderflowException x) {
            throw new CoderMalfunctionError(x);
        } catch (BufferOverflowException x) {
            throw new CoderMalfunctionError(x);
        }
        // Output buffer full: let the caller drain it and call again.
        if (cr.isOverflow())
            return cr;
        if (cr.isUnderflow()) {
            if (endOfInput && in.hasRemaining()) {
                // Leftover input at the end is treated as malformed.
                cr = CoderResult.malformedForLength(in.remaining());
            } else {
                return cr;
            }
        }
        CodingErrorAction action = null;
        if (cr.isMalformed())
            action = malformedInputAction;
        else if (cr.isUnmappable())
            action = unmappableCharacterAction;
        else
            assert false : cr.toString();
        if (action == CodingErrorAction.REPORT)
            return cr;
        if (action == CodingErrorAction.REPLACE) {
            if (out.remaining() < replacement.length)
                return CoderResult.OVERFLOW;
            out.put(replacement);
        }
        if ((action == CodingErrorAction.IGNORE) || (action == CodingErrorAction.REPLACE)) {
            // Skip the bad input and keep encoding.
            in.position(in.position() + cr.length());
            continue;
        }
        assert false;
    }
}

The CharsetEncoder is chosen according to the requested encoding, and the key point is that each CharsetEncoder implements its own encodeLoop method. Why the for (;;) loop? The package that CharsetEncoder lives in (nio) and the parameters of the method give the answer: the method is designed to process streams, although that is not needed here.
encodeLoop converts as many chars to bytes as it can, and new String(bytes, charset) is essentially the reverse process.
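The stream-oriented design shows up when encode is called directly with an output buffer that is too small for the whole result. The following is a sketch under stated assumptions: ChunkedEncodeDemo, encodeInChunks, and drain are illustrative names, and only the public CharsetEncoder API is used.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class ChunkedEncodeDemo {
    /** Encodes text as UTF-8 through a deliberately small output buffer. */
    static byte[] encodeInChunks(String text, int bufferSize) {
        if (bufferSize < 4) {
            // A single code point can need up to 4 bytes in UTF-8.
            throw new IllegalArgumentException("bufferSize must be at least 4");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(bufferSize);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        while (true) {
            // endOfInput is true because the whole text is already in the buffer.
            CoderResult result = encoder.encode(in, out, true);
            drain(out, sink);
            if (result.isUnderflow()) {
                break; // all input has been consumed
            }
            if (result.isError()) {
                throw new IllegalArgumentException("cannot encode input: " + result);
            }
            // Otherwise the result is OVERFLOW: the buffer was full, so go around again.
        }
        // flush lets the encoder emit anything it is still holding back.
        while (encoder.flush(out).isOverflow()) {
            drain(out, sink);
        }
        drain(out, sink);
        return sink.toByteArray();
    }

    /** Moves the bytes written so far into the sink and empties the buffer. */
    private static void drain(ByteBuffer out, ByteArrayOutputStream sink) {
        out.flip();
        sink.write(out.array(), 0, out.limit());
        out.clear();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeInChunks("\u5B57\u7B26\u4E32 abc", 4);
        System.out.println(bytes.length); // 13
    }
}
```

With a 4-byte buffer the encoder returns OVERFLOW several times for this input, and each call resumes where the previous one stopped, which is the behaviour a stream needs.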
Garbled characters come up often in real development, for example:
the file name received when a file is uploaded;
a string sent from JavaScript to the backend.
First, look at what the following code prints:
public class EncodingTest {
    public static void main(String[] args) throws Exception {
        // The comments show the output when the platform default charset is GBK.
        String str = "字符串";
        // -41 -42 -73 -5 -76 -82
        printArray(str.getBytes());
        // -27 -83 -105 -25 -84 -90 -28 -72 -78
        printArray(str.getBytes("UTF-8"));
        // ???
        System.out.println(new String(str.getBytes(), "UTF-8"));
        // garbled text
        String garbled = new String(str.getBytes("UTF-8"), "gbk");
        System.out.println(garbled);
        // 字符??
        System.out.println(new String(garbled.getBytes("gbk"), "UTF-8"));
        // -41 -42 -73 -5 63 63
        printArray(new String(garbled.getBytes("gbk"), "UTF-8").getBytes());
    }

    public static void printArray(byte[] bs) {
        for (int i = 0; i < bs.length; i++) {
            System.out.print(bs[i] + " ");
        }
        System.out.println();
    }
}

The comments in the program show the output, which is explained here in order:
GBK uses two bytes for each Chinese character, so there are 6 bytes;
UTF-8 uses three bytes for each of these Chinese characters, so there are 9 bytes;
the bytes produced by GBK do not form valid UTF-8, so decoding them as UTF-8 displays ???;
this is the usual cause of garbled text: GBK can build a string from bytes produced by UTF-8, just not the right one;
although that string looks garbled to us, the computer does not care, so getBytes still returns a byte array, most of which UTF-8 can decode;
the final two 63s (?) were presumably filled in by encode (or padded in because there were not enough bytes; I did not look into the details). Nine bytes do not divide into GBK's two-byte pairs, which is the likely reason the last character is lost.
GBK and UTF-8 encode letters and digits identically, so such characters never come out garbled, but their encodings of Chinese characters differ, and that is the source of many problems. Consider the following code:
new String(new String("我们".getBytes("UTF-8"), "GBK").getBytes("GBK"), "UTF-8");
Obviously the result of this code is "我们", but what use is that to us? First, note that:
new String("我们".getBytes("UTF-8"), "GBK");
produces garbled text, and a lot of garbled text arises in exactly this way. But remember: it is garbled only to us; to the computer it makes no difference. Just when we are about to give up, getBytes("GBK") recovers the underlying bytes (the "backbone") from the garbled text, and from those bytes we can restore the original string.
This seems to solve the garbling problem between GBK and UTF-8, but it works only in a special case: every run of consecutive Chinese characters must have even length! The reason was covered above (an odd run leaves one byte that GBK cannot pair up), so I will not repeat it here.
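The even/odd behaviour can be reproduced with a small round-trip helper. This is a sketch: RoundTripDemo and roundTrip are illustrative names, and the code assumes the runtime provides the GBK charset (Charset.forName throws if it does not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    private static final Charset GBK = Charset.forName("GBK");

    /** Decodes UTF-8 bytes as GBK by mistake, then tries to undo the damage. */
    static String roundTrip(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8, GBK);
        return new String(garbled.getBytes(GBK), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String even = "\u6211\u4EEC";      // two characters, 6 UTF-8 bytes
        String odd = "\u5B57\u7B26\u4E32"; // three characters, 9 UTF-8 bytes
        System.out.println(roundTrip(even).equals(even)); // true
        System.out.println(roundTrip(odd).equals(odd));   // false
    }
}
```

An even count is not a guarantee either: the trick also needs every byte pair to map to a real GBK character, so it should be treated as a diagnostic aid rather than a fix.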
So how can we solve this problem?
Solution 1: encodeURI
Why this method? The reason is simple: GBK and UTF-8 encode %, digits, and letters identically, so a string that has been through encode is guaranteed to yield the same bytes under both encodings, and decode then recovers the original string. Judging by the format of the encoded string, encode and decode should also be cheap, so this is a good solution.
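On the Java side the same idea can be sketched with URLEncoder and URLDecoder, standing in here for the browser's encodeURI (the two do not escape exactly the same set of characters). UrlEncodeDemo and isAscii are illustrative names, and the Charset overloads used below need Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class UrlEncodeDemo {
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    /** True when every char is below 128, so GBK and UTF-8 yield identical bytes. */
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 128) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String fileName = "\u6211\u4EEC.txt"; // two Chinese characters plus an extension
        String encoded = encode(fileName);
        System.out.println(encoded);                           // %E6%88%91%E4%BB%AC.txt
        System.out.println(isAscii(encoded));                  // true
        System.out.println(decode(encoded).equals(fileName));  // true
    }
}
```

Because the encoded form contains only ASCII characters, it no longer matters which of the two charsets the transport layer assumes; both sides only have to agree on the charset used inside encode and decode.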
Solution 2: a uniform encoding
We use Webx here, so it is enough to set defaultCharset="UTF-8" in webx.xml.
