Writing high-quality code: 151 suggestions for improving Java programs (Chapter 1: Strings, Suggestions 56 to 59)
Suggestion 56: Choose the string concatenation method freely
There are three ways to concatenate strings: the plus sign, the concat method, and the append method of StringBuilder (or StringBuffer; its methods mirror StringBuilder's, so it is not described separately). The plus sign is the most common; the other two appear occasionally in some open-source projects. Is there any difference between them? Consider the following example:
public class Client56 {
    public static void main(String[] args) {
        // Concatenation with the plus sign
        String str = "";
        long start1 = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            str += "c";
        }
        long end1 = System.currentTimeMillis();
        System.out.println("Plus sign concatenation time: " + (end1 - start1) + " ms");

        // Concatenation with concat
        str = "";
        long start2 = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            str = str.concat("c");
        }
        long end2 = System.currentTimeMillis();
        System.out.println("concat concatenation time: " + (end2 - start2) + " ms");

        // Concatenation with StringBuilder
        str = "";
        StringBuilder buffer = new StringBuilder("");
        long start3 = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            buffer.append("c");
        }
        long end3 = System.currentTimeMillis();
        System.out.println("StringBuilder concatenation time: " + (end3 - start3) + " ms");

        // Concatenation with StringBuffer
        str = "";
        StringBuffer sb = new StringBuffer("");
        long start4 = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            sb.append("c");
        }
        long end4 = System.currentTimeMillis();
        System.out.println("StringBuffer concatenation time: " + (end4 - start4) + " ms");
    }
}
The code above times each of the four concatenation approaches over 100,000 iterations.
In the results, the append method of StringBuilder is the fastest, followed by the append method of StringBuffer (it is thread-safe, and the synchronized method is a little slower), then the concat method, with the plus sign the slowest. Why?
(1) Concatenation with "+": the compiler optimizes string concatenation with the plus sign by using StringBuilder's append method. In principle the execution time should then also be about 1 ms, but the result is converted back to a String with toString on every pass. The "+" concatenation in this example is equivalent to the following code:
str= new StringBuilder(str).append("c").toString();
Note how this differs from using StringBuilder's append directly: first, a new StringBuilder object is created on every iteration; second, toString is called on every iteration to convert the result back into a String. That is where the execution time goes.
(2) Concatenation with concat: look at how the JDK source implements the concat method. The code is as follows:
public String concat(String str) {
    int otherLen = str.length();
    // If the appended string is empty, return this string itself
    if (otherLen == 0) {
        return this;
    }
    int len = value.length;
    char buf[] = Arrays.copyOf(value, len + otherLen);
    str.getChars(buf, len);
    // Create a new String
    return new String(buf, true);
}
At its core this is an array copy, which is a fast in-memory operation. Note the final return statement, however: every concat call creates a new String object. That is the real reason concat is slow here: the loop creates 100,000 String objects.
(3) Concatenation with append: StringBuilder's append method delegates directly to its parent class, AbstractStringBuilder. The code is as follows:
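The two branches of concat can be observed directly. The following is a minimal sketch (the class and method names are made up for illustration): an empty argument returns the receiver itself, while any non-empty argument yields a new String object.

```java
final class ConcatIdentityDemo {
    /** Returns true when concat handed back the very same object as base. */
    static boolean returnsSameObject(String base, String suffix) {
        return base.concat(suffix) == base;
    }

    public static void main(String[] args) {
        String base = "abc";
        // Empty suffix: concat takes the early-return branch, no new object.
        System.out.println(returnsSameObject(base, ""));
        // Non-empty suffix: concat builds and returns a new String.
        System.out.println(returnsSameObject(base, "c"));
    }
}
```

In a loop of 100,000 non-empty appends, the second branch runs every time, which matches the object count described above.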
public StringBuilder append(String str) { super.append(str); return this; }
public AbstractStringBuilder append(String str) {
    // A null argument is treated as the string "null"
    if (str == null) str = "null";
    int len = str.length();
    ensureCapacityInternal(count + len);
    // Copy the string's characters into the target array
    str.getChars(0, len, value, count);
    count += len;
    return this;
}
As you can see, the whole append method only works on character arrays: growing the array and copying into it. These are basic data operations and no new object is created, so it is the fastest. Note: to obtain the final result as a String you call StringBuilder's toString method, which means a single String object is created after the 100,000 iterations finish. StringBuffer works in a similar way, except that its methods are synchronized.
The four approaches differ in implementation and in performance, but that does not mean we must always use StringBuilder. The plus sign matches our coding habits and reads naturally: joining two strings with "+" is normal and friendly. In most cases the plus sign is fine; consider concat or append only when the system is at its performance limit (for example, when even a small amount of extra time is too much). Moreover, in many systems 80% of the running time is spent in 20% of the code, so our energy is better invested in algorithms and structure.
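As a small illustration of the append pattern, here is a sketch that builds the result in one StringBuilder and converts it to a String exactly once. The class name BuilderDemo and the helper repeat are hypothetical, and pre-sizing the builder is an optional refinement that the original example does not use.

```java
final class BuilderDemo {
    /** Builds a string made of count copies of piece using one StringBuilder. */
    static String repeat(String piece, int count) {
        // Pre-size the buffer so it does not need to grow during the loop.
        StringBuilder builder = new StringBuilder(piece.length() * count);
        for (int i = 0; i < count; i++) {
            builder.append(piece); // copies characters into the buffer
        }
        return builder.toString(); // the only String this method creates
    }

    public static void main(String[] args) {
        System.out.println(repeat("c", 100000).length());
    }
}
```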
Note: use the appropriate String concatenation method in appropriate scenarios.
Suggestion 57: Use regular expressions for complex string operations
String operations such as appending, merging, replacing, reversing, and splitting come up constantly when coding, and Java provides append, replace, reverse, split, and other methods for them. They are convenient, but more often you need regular expressions to handle complex processing. Take an example: counting the number of English words in an article. The code is as follows:
import java.util.Scanner;

public class Client57 {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            String str = input.nextLine();
            // Split on spaces and count the pieces
            int wordsCount = str.split(" ").length;
            System.out.println(str + " Number of words: " + wordsCount);
        }
    }
}
Is it reliable to use the split method to separate words by spaces and then take the length of the resulting array? Trying it on a few inputs shows that it is not.
Only the first input, "Today is Monday", is counted correctly. The second input has two consecutive spaces before the word "Monday", the third has no spaces before or after the word "No", and the last contains an apostrophe ("'") that is not taken into account. The word counts for these inputs are therefore wrong. How can this be solved?
Handling such "exceptions" with loops would make the program less stable, and there are too many cases to consider, which greatly increases the program's complexity. What should we do? Use a regular expression. The code is as follows:
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Client57 {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            String str = input.nextLine();
            // Regular expression object
            Pattern p = Pattern.compile("\\b\\w+\\b");
            // Create the matcher
            Matcher matcher = p.matcher(str);
            int wordsCount = 0;
            while (matcher.find()) {
                wordsCount++;
            }
            System.out.println(str + " Number of words: " + wordsCount);
        }
    }
}
Is it accurate? Feed it the same input as before.
Every result is accurate, and the program is not complicated: compile a Pattern, create a Matcher for the input, and count the matches in a while loop. In a Java regular expression, "\b" denotes a word boundary. It is a position marker: on one side is a word character (a letter, digit, or underscore) and on the other side is a non-word character. For example, the input "A" has two boundaries, one immediately to the left of "A" and one immediately to the right, which explains why "\w" (a word character) is needed between the two "\b" markers.
Regular expressions do an extraordinary job of searching, replacing, cutting, copying, and deleting strings, especially when processing large amounts of text (such as reading large log files), where they can greatly improve development efficiency and system performance. However, regular expressions are also a devil that makes programs hard to understand. Write a regular expression full of ^, $, \A, \s, \Q, +, ?, (), {}, [] and the like, then tell your colleague that it searches for "such-and-such" a string: will they not go crazy? Such code is really hard to read, so it pays to spend more time learning regular expressions.
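The difference between the two counting strategies can be checked with a short sketch. The class name WordCounter and the sample sentences other than "Today is Monday" are assumptions chosen to match the cases described above (double space, missing spaces, apostrophe); they are not the original test inputs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordCounter {
    // \b marks a word boundary; \w+ matches one or more word characters.
    private static final Pattern WORD = Pattern.compile("\\b\\w+\\b");

    /** Counts words by finding every run of word characters. */
    static int countWords(String text) {
        Matcher matcher = WORD.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /** Counts words the naive way, by splitting on a single space. */
    static int countBySplit(String text) {
        return text.split(" ").length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Today is Monday", "Today is  Monday", "Today is Monday?No!", "I'm Ok."
        };
        for (String sample : samples) {
            System.out.println(sample + " | regex: " + countWords(sample)
                    + " | split: " + countBySplit(sample));
        }
    }
}
```

Note that with this pattern an apostrophe acts as a boundary, so "I'm" is counted as two words, which is the behavior the pattern in this suggestion produces.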
Note: Regular Expressions are evil and powerful, but difficult to control.
Suggestion 58: UTF Encoding is strongly recommended.
Garbled text has plagued Java for a long time, and every experienced developer has run into it. Sometimes the garbled text arrives from the Web, sometimes it is read from a database, and sometimes it is in a file received through an external interface. It is confusing and often painful. Consider the following code:
import java.io.UnsupportedEncodingException;

public class Client58 {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String str = "汉字"; // "Chinese characters"
        // Read the bytes
        byte[] b = str.getBytes("UTF-8");
        // Build a new string from the bytes
        System.out.println(new String(b));
    }
}
The Java file was created with the IDE's default settings, so its encoding is GBK. What will the code above print? Garbled text, probably, since the two encodings differ. Before looking at the result, let's go over the encoding rules in Java. A Java program involves two kinds of encoding:
(1) Java source file encoding: if a .java file is created with Notepad, its encoding is the operating system's default. If it is created with an IDE such as Eclipse, it depends on the IDE's settings; Eclipse uses the operating system's encoding by default (generally GBK on Chinese-language Windows);
(2) Class file encoding: the .class file produced by the javac command stores its text as Unicode encoded in UTF-8 (strictly speaking, a modified UTF-8). This is the same on every operating system: any .class file uses this Unicode format. Note that UTF is a storage and transmission format for Unicode, created to avoid the space wasted by the high-order bytes of fixed-width Unicode encodings. Using a UTF encoding implies that the character set is Unicode.
Back to the example. The getBytes method encodes the string's Unicode characters into a byte array using the specified character set, and the program then rebuilds a String with new String(byte[] bytes). According to its documentation, this constructor decodes the byte array using the platform's default character set to construct a new String. The result is now clear: if the default encoding is UTF-8, the output is correct; otherwise it is garbled. Because the default here is GBK, the output is garbled. Let's break the process down step by step:
Step 1: Create the Client58.java file. The default encoding of this file is GBK (in Eclipse you can check it in the file's properties).
Step 2: Write the code (as above).
Step 3: Save and compile with javac. Note that we did not use "javac -encoding GBK Client58.java" to declare the source encoding explicitly, so javac reads Client58.java using the operating system's encoding (GBK) and compiles it into a .class file.
Step 4: Generate the .class file. When compilation finishes, the .class file is written to disk. It uses the Unicode character set encoded as UTF-8, which can be inspected with the javap command; the "汉字" ("Chinese characters") constant has been converted from GBK to Unicode.
Step 5: Run the main method and extract the byte array of "汉字". The string is already stored as Unicode, so extracting it as UTF-8 bytes is no problem.
Step 6: Rebuild the string. The constructor reads the operating system's default encoding (GBK) and decodes all the bytes in b with it. This is where the problem arises. Unicode (in its UCS-2 form) represents a character with two bytes, and GBK also represents a Chinese character with two bytes, but there is no computable relationship between the two code sets. Conversion is possible only through a mapping table, never automatically, and here the JVM simply decodes the UTF-8 bytes as if they were GBK.
Step 7: Output the garbled text. The program ends.
With the problem clear, the solution follows. There are two options:
(1) Modify the code and specify the encoding explicitly:
System.out.println(new String(b, "UTF-8"));
(2) Change the default encoding of the operating system. How to do this differs from one operating system to another.
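The encode and decode steps can be reproduced without depending on the platform default by naming both character sets explicitly. This is a minimal sketch: the class name CharsetDemo and the helper transcode are hypothetical, the string is written with Unicode escapes so that the source file's own encoding does not matter, and the GBK check is guarded because a stripped-down runtime might not ship that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDemo {
    /** Encodes text with one charset, then decodes the bytes with another. */
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u6C49\u5B57"; // the two characters of the example string

        // Same charset on both sides: the round trip is lossless.
        String ok = transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 -> UTF-8 intact: " + ok.equals(original));

        // Mismatched charsets: the UTF-8 bytes are misread as GBK.
        if (Charset.isSupported("GBK")) {
            String garbled = transcode(original, StandardCharsets.UTF_8, Charset.forName("GBK"));
            System.out.println("UTF-8 -> GBK intact: " + garbled.equals(original));
        }
    }
}
```

Passing the charset to both getBytes and the String constructor is the code-level form of option (1).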
Think of converting a string to bytes as what data transmission needs (network, storage) and of rebuilding the string as what the business logic needs, and the garbled text is easy to reproduce: the byte array read through JDBC is GBK, the business logic decodes it as UTF-8, and the result is garbled. The best solution to this kind of problem is a unified encoding: either GBK everywhere or UTF-8 everywhere. Every component, interface, and logic layer should use UTF-8, with no exceptions.
Now that the problem is clear, look at the following code:
import java.io.UnsupportedEncodingException;

public class Client58 {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String str = "汉字"; // "Chinese characters"
        // Read the bytes
        byte[] b = str.getBytes("GB2312");
        // Build a new string from the bytes
        System.out.println(new String(b));
    }
}
Only the encoding used to read the bytes has changed (to GB2312). What happens? And what if it is changed to GB18030? In both cases the result is "汉字" ("Chinese characters"), not garbled text. This is because GB2312 is version 1.0 of the Chinese character set, GBK is version 2.0, and GB18030 is version 3.0, and each version is backward compatible with the previous one; they differ only in the number of Chinese characters they contain. Note that Unicode is not part of this sequence.
Note: Use one uniform encoding throughout a system.
Suggestion 59: Be tolerant about string sorting
Java has many problems related to handling Chinese, and sorting is one of the headaches. Consider the following code:
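The compatibility claim can be checked by comparing the bytes that each encoding produces for the same text. This is a sketch under the assumption that the runtime ships the GB2312, GBK, and GB18030 charsets (a full JDK normally does); the class name GbFamilyDemo and the helper sameBytes are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class GbFamilyDemo {
    /** Returns true when both charsets encode the text to identical bytes. */
    static boolean sameBytes(String text, Charset first, Charset second) {
        return Arrays.equals(text.getBytes(first), text.getBytes(second));
    }

    public static void main(String[] args) {
        String text = "\u6C49\u5B57"; // the two characters of the example string
        Charset gb2312 = Charset.forName("GB2312");
        Charset gbk = Charset.forName("GBK");
        Charset gb18030 = Charset.forName("GB18030");

        System.out.println("GB2312 vs GBK:   " + sameBytes(text, gb2312, gbk));
        System.out.println("GBK vs GB18030:  " + sameBytes(text, gbk, gb18030));
        System.out.println("GB2312 vs UTF-8: " + sameBytes(text, gb2312, StandardCharsets.UTF_8));
    }
}
```

For characters that exist in GB2312, the later standards keep the same byte values, which is why decoding GB2312 bytes as GBK still yields the original text.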
import java.util.Arrays;

public class Client59 {
    public static void main(String[] args) {
        // Zhang San, Li Si, Wang Wu
        String[] strs = {"张三(Z)", "李四(L)", "王五(W)"};
        Arrays.sort(strs);
        int i = 0;
        for (String str : strs) {
            System.out.println((++i) + ". " + str);
        }
    }
}
The code above defines an array and sorts it in ascending order. We expect the names in ascending Pinyin order, that is, Li Si (李四), Wang Wu (王五), Zhang San (张三), but that is not what the program prints.
What kind of order is this? It looks random! The default sort in the Arrays utility class compares elements with their compareTo method, so let's look at the main implementation of compareTo in the String class:
public int compareTo(String anotherString) {
    int len1 = value.length;
    int len2 = anotherString.value.length;
    int lim = Math.min(len1, len2);
    char v1[] = value;
    char v2[] = anotherString.value;

    int k = 0;
    while (k < lim) {
        char c1 = v1[k];
        char c2 = v2[k];
        if (c1 != c2) {
            return c1 - c2;
        }
        k++;
    }
    return len1 - len2;
}
The code above first obtains the character arrays of the two strings and then compares them one character at a time. Note that this is a character comparison (the minus operator), that is, a comparison of Unicode code values. In the Unicode code table, the value of 张 (Zhang) is 5F20 and the value of 李 (Li) is 674E, so 张 sorts before 李, which clearly conflicts with our intention. The JDK documentation also explains this: sorting non-English strings this way may be inaccurate. How can this be solved? Java recommends the Collator class for sorting. Let's modify the code:
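The code values mentioned above can be verified with a short sketch. The class name CodeUnitOrderDemo is hypothetical, and the names are written as Unicode escapes (with the characters shown in comments) so the example does not depend on the source file's encoding.

```java
import java.util.Arrays;

final class CodeUnitOrderDemo {
    static final String ZHANG = "\u5F20\u4E09"; // 张三 (Zhang San)
    static final String LI = "\u674E\u56DB";    // 李四 (Li Si)
    static final String WANG = "\u738B\u4E94";  // 王五 (Wang Wu)

    /** Returns a copy sorted with String.compareTo, i.e. by char values. */
    static String[] sortedNaturally(String... names) {
        String[] copy = names.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        for (String name : sortedNaturally(LI, WANG, ZHANG)) {
            // Print the hexadecimal value of the first character of each name.
            System.out.println(Integer.toHexString(name.charAt(0)));
        }
    }
}
```

Because 5F20 is smaller than 674E, which is smaller than 738B, the natural order puts Zhang first regardless of Pinyin.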
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

public class Client59 {
    public static void main(String[] args) {
        // Zhang San, Li Si, Wang Wu
        String[] strs = {"张三(Z)", "李四(L)", "王五(W)"};
        // Define a comparator for Chinese
        Comparator<Object> c = Collator.getInstance(Locale.CHINA);
        Arrays.sort(strs, c);
        int i = 0;
        for (String str : strs) {
            System.out.println((++i) + ". " + str);
        }
    }
}
Output result:
1. 李四(L)
2. 王五(W)
3. 张三(Z)
This is indeed the expected result, so it should be right! But not so fast. Chinese characters are vast and profound; can Java really sort them precisely? Above all, Chinese characters are logographic, and their pronunciation is separate from their form. Can every Chinese character be sorted in Pinyin order? Let's try some more complex characters:
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

public class Client59 {
    public static void main(String[] args) {
        // Ben, Xin, Miao
        String[] strs = {"犇(B)", "鑫(X)", "淼(M)"};
        // Define a comparator for Chinese
        Comparator<Object> c = Collator.getInstance(Locale.CHINA);
        Arrays.sort(strs, c);
        int i = 0;
        for (String str : strs) {
            System.out.println((++i) + ". " + str);
        }
    }
}
This time the output is out of order. Do not blame Java; it has done as much as it can for us. Our Chinese character culture is simply too broad and deep for this kind of sorting to be done well. The deeper reason is that Java uses Unicode, the Chinese part of the Unicode character set comes from GB18030, and GB18030 grew out of GB2312. GB2312 is a character set of more than 7,000 characters in which the commonly used characters are arranged contiguously in Pinyin order; GBK and GB18030 were later extensions built on top of it, so a complete Pinyin ordering is hard to achieve.
If the strings being sorted use common Chinese characters, sorting with the Collator class fully meets our needs; after all, GB2312 already contains most of the Chinese characters in everyday use. If strict ordering is required, we have to implement it ourselves with the help of open-source projects. For example, pinyin4j can convert Chinese characters to Pinyin, and we can then implement the sorting algorithm on top of it. You will find, however, that there are many issues to consider, such as the algorithm itself, homophones, and characters with more than one pronunciation.
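The "convert to Pinyin, then sort" idea can be sketched without any external library by using a small hand-written lookup table. Everything here is an assumption for illustration: the class name PinyinSortSketch, the three-entry table (犇 ben, 鑫 xin, 淼 miao), and the simplification of one reading per character. A real implementation would obtain the readings from a library such as pinyin4j and would have to deal with characters that have several pronunciations.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

final class PinyinSortSketch {
    // Hypothetical lookup table; one toneless reading per character.
    private static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('\u7287', "ben");  // 犇
        PINYIN.put('\u946B', "xin");  // 鑫
        PINYIN.put('\u6DFC', "miao"); // 淼
    }

    /** Replaces each known character with its reading; others are kept as-is. */
    static String toPinyin(String text) {
        StringBuilder result = new StringBuilder();
        for (char ch : text.toCharArray()) {
            result.append(PINYIN.getOrDefault(ch, String.valueOf(ch)));
        }
        return result.toString();
    }

    /** Returns a copy of the words sorted by their Pinyin transcription. */
    static String[] sortByPinyin(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Comparator.comparing(PinyinSortSketch::toPinyin));
        return copy;
    }

    public static void main(String[] args) {
        for (String word : sortByPinyin("\u946B", "\u7287", "\u6DFC")) {
            System.out.println(toPinyin(word));
        }
    }
}
```

The sort key is computed on every comparison here; for large arrays, caching the transcription per element would be a reasonable refinement.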
NOTE: If sorting is not a key algorithm, use the Collator class.