Truncating a Java string that contains Chinese characters by byte count (recommended)

Source: Internet
Author: User
Tags: truncation, StringBuffer

The interface writes into an Oracle column with a fixed number of bytes, and the incoming string may be longer than that column can hold, so the string has to be truncated until its byte length fits within the database field.

I adapted an example from the Internet; a simple recursive call is all it takes. The byte length of the truncated string must not exceed the length of the database column, so if the last character is a Chinese character that does not fit completely, it is dropped as a whole and the cut moves back by one character.

import java.io.UnsupportedEncodingException;

/**
 * Checks whether the string occupies more than the specified number of bytes.
 * If it does, drops the last character and calls itself recursively until the
 * string fits. Always name the character encoding explicitly: the default
 * encoding differs between systems, and so does the resulting byte count.
 *
 * @param s   the original string
 * @param num the maximum number of bytes
 * @return the truncated string
 * @throws UnsupportedEncodingException if the named encoding is not supported
 */
public static String idgui(String s, int num) throws UnsupportedEncodingException {
    int changdu = s.getBytes("UTF-8").length; // byte length of s in UTF-8
    if (changdu > num) {
        s = s.substring(0, s.length() - 1);   // drop the last character
        s = idgui(s, num);
    }
    return s;
}

Java Interview questions:

Write a function that intercepts the string, input as a string and byte number, and output as a byte-truncated string. But to ensure that the Chinese character is not truncated half, such as "I abc" 4, should be truncated to "I ab", input "my ABC-def", 6, should be exported as "my ABC" instead of "I abc+ Han half."

Many popular languages, such as C# and Java, use Unicode internally (UCS-2, or more precisely UTF-16), where every char occupies two bytes regardless of whether it is a Chinese character, an English letter or a digit. Character-based operations therefore give the wrong result when a string that mixes Chinese, English and digits has to be cut by bytes, as with the following string:

String s = "a plus b equals C, if a is equal to 1, B equals 2, then C et 3";

The above string has both Chinese characters and English characters and numbers. If you want to intercept the first 6 bytes of characters, should be "a plus B", but if the substring method to intercept the first 6 characters is "a plus B equals C". This problem arises because the substring method treats double-byte characters as a byte character (UCS2 character).

English letters and Chinese characters occupy different numbers of bytes under different encodings. The following example prints how many bytes an English letter and a Chinese character occupy under several common encodings.

import java.io.UnsupportedEncodingException;

public class EncodeTest {

    /**
     * Prints the number of bytes a string occupies under the given encoding,
     * followed by the encoding name, to the console.
     *
     * @param s            the string
     * @param encodingName the encoding name
     */
    public static void printByteLength(String s, String encodingName) {
        System.out.print("Bytes: ");
        try {
            System.out.print(s.getBytes(encodingName).length);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        System.out.println("; Encoding: " + encodingName);
    }

    public static void main(String[] args) {
        String en = "A";
        String ch = "人";

        // Number of bytes of an English letter under various encodings
        System.out.println("English letter: " + en);
        EncodeTest.printByteLength(en, "GB2312");
        EncodeTest.printByteLength(en, "GBK");
        EncodeTest.printByteLength(en, "GB18030");
        EncodeTest.printByteLength(en, "ISO-8859-1");
        EncodeTest.printByteLength(en, "UTF-8");
        EncodeTest.printByteLength(en, "UTF-16");
        EncodeTest.printByteLength(en, "UTF-16BE");
        EncodeTest.printByteLength(en, "UTF-16LE");

        System.out.println();

        // Number of bytes of a Chinese character under various encodings
        System.out.println("Chinese character: " + ch);
        EncodeTest.printByteLength(ch, "GB2312");
        EncodeTest.printByteLength(ch, "GBK");
        EncodeTest.printByteLength(ch, "GB18030");
        EncodeTest.printByteLength(ch, "ISO-8859-1");
        EncodeTest.printByteLength(ch, "UTF-8");
        EncodeTest.printByteLength(ch, "UTF-16");
        EncodeTest.printByteLength(ch, "UTF-16BE");
        EncodeTest.printByteLength(ch, "UTF-16LE");
    }
}

The results of the operation are as follows:

English letter: A
Bytes: 1; Encoding: GB2312
Bytes: 1; Encoding: GBK
Bytes: 1; Encoding: GB18030
Bytes: 1; Encoding: ISO-8859-1
Bytes: 1; Encoding: UTF-8
Bytes: 4; Encoding: UTF-16
Bytes: 2; Encoding: UTF-16BE
Bytes: 2; Encoding: UTF-16LE

Chinese character: 人
Bytes: 2; Encoding: GB2312
Bytes: 2; Encoding: GBK
Bytes: 2; Encoding: GB18030
Bytes: 1; Encoding: ISO-8859-1
Bytes: 3; Encoding: UTF-8
Bytes: 4; Encoding: UTF-16
Bytes: 2; Encoding: UTF-16BE
Bytes: 2; Encoding: UTF-16LE

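UTF-16BE and UTF-16LE are two members of the Unicode encoding family. The Unicode standard defines three encoding forms, UTF-8, UTF-16 and UTF-32, and seven encoding schemes in total: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE. Java strings are sequences of UTF-16 code units, and Java's "UTF-16" charset encodes them big-endian, as UTF-16BE does. It also prepends a 2-byte byte order mark, which is why "UTF-16" reports 4 bytes above. The single byte reported for the Chinese character under ISO-8859-1 is not a real encoding: the character cannot be represented there and is replaced by "?". The results show that under GB2312, GBK and GB18030 an English letter occupies one byte and a Chinese character two, which is what the question requires. The answer below uses GBK as the example.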
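To see where the byte counts for UTF-16 and ISO-8859-1 come from, it helps to look at the bytes themselves. The following is a small sketch for that purpose; the class name Utf16Bom and the helper hex are hypothetical names used only here.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Bom {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "UTF-16" writes a byte order mark (FE FF) before the data
        System.out.println("A  UTF-16:     " + hex("A".getBytes(StandardCharsets.UTF_16)));
        System.out.println("A  UTF-16BE:   " + hex("A".getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("A  UTF-16LE:   " + hex("A".getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("人 UTF-8:      " + hex("人".getBytes(StandardCharsets.UTF_8)));
        // 人 cannot be encoded in ISO-8859-1 and becomes '?' (3F)
        System.out.println("人 ISO-8859-1: " + hex("人".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

"A" under UTF-16 comes out as FE FF 00 41: two bytes of byte order mark plus the two bytes that UTF-16BE produces on its own.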

We cannot use the substring(int beginIndex, int endIndex) method of the String class directly, because it cuts by character: both "我" and "Z" count as one character of length 1. Once we can tell Chinese characters from English letters, the problem is solved; the difference is that a Chinese character occupies two bytes and an English letter one.

Package com.newyulong.iptv.billing.ftpupload;

Import java.io.UnsupportedEncodingException; public class Cutstring {/** * Determines whether it is a Chinese character * * @param c * Character * @return True is the Chinese character, false is the English word Mother * @throws Unsupportedencodingexception * uses a coded format that is not supported by Java * * public static Boolean Ischinesechar (Cha 
    R C) throws Unsupportedencodingexception {//If the number of bytes is greater than 1, is kanji//In this way the distinction between English and Chinese characters is not very rigorous, but in this topic, so the judgment is enough 
  return string.valueof (c). GetBytes ("UTF-8"). length > 1; /** * by Byte intercept String * * @param orignal * Raw String * @param count * intercept number of digits * @return intercepted The string * @throws Unsupportedencodingexception * uses a coded format that is not supported in Java/public static string substring Strin G orignal, int count) throws Unsupportedencodingexception {//The original character is not NULL, nor is it an empty string if (orignal!= null &am p;&! "". Equals (orignal)) {//Convert the original string to GBK encoded format orignal = new string (Orignal.getbytes (), UTF-8 ");///System.out.println (orignal);
      System.out.println (Orignal.getbytes (). length); The number of bytes to intercept is greater than 0 and less than the number of bytes of the original string if (Count > 0 && Count < orignal.getbytes ("UTF-8"). Length) {Stri 
        Ngbuffer buff = new StringBuffer (); 
        char c;
          for (int i = 0; i < count; i++) {System.out.println (count); 
          c = Orignal.charat (i); 
          Buff.append (c); 
          if (Cutstring.ischinesechar (c)) {//encounters Chinese characters, the total number of intercepted bytes is reduced by 1--count;
        }//System.out.println (New String (Buff.tostring (). GetBytes ("GBK"), "UTF-8")); 
      return new String (Buff.tostring (). GetBytes (), "UTF-8"); 
  } return orignal; /** * by Byte intercept String * * @param orignal * Raw String * @param count * intercept bits * @return intercept The string * @throws unsupportedencodingexception * uses Java unsupported encoding format */public static string gsubstring (Str ing orignal, int countThrows Unsupportedencodingexception {//original character is not NULL, nor is it an empty string if (orignal!= null &&! "".  
      Equals (orignal)) {//Convert the original string to GBK encoded format orignal = new string (Orignal.getbytes (), "GBK"); The number of bytes to intercept is greater than 0 and less than the number of bytes of the original string if (Count > 0 && Count < orignal.getbytes ("GBK"). Length) {Stri  
        Ngbuffer buff = new StringBuffer ();  
        char c;  
          for (int i = 0; i < count; i++) {c = Orignal.charat (i);  
          Buff.append (c);  
          if (Cutstring.ischinesechar (c)) {//encounters Chinese characters, the total number of intercepted bytes is reduced by 1--count;  
      } return buff.tostring ();  
  } return orignal; 
   /** * Determines whether the incoming string is * greater than the specified byte if it is greater than the recursive call * until it is less than the specified number of bytes * @param s * Original string * @param num * Pass in the specified number of bytes * @return string after the truncated strings */public static string Idgui (String S,int num) {int Changdu = S.G
    Etbytes (). length; if (Changdu &GT
      num) {s = s.substring (0, S.length ()-1);
    s = Idgui (S,num);
  return s; 
    public static void Main (string[] args) throws exception{//raw string string s = "I zwr Love you Java"; 

System.out.println ("original string: + S +": The number of bytes is: "+ s.getbytes (). length); 
      /* SYSTEM.OUT.PRINTLN ("Intercept the top 1:" + cutstring.substring (S, 1)); 
      System.out.println ("intercept the first 2 digits:" + cutstring.substring (S, 2)); System.out.println ("intercept the first 4 digits:" + cutstring.substring (S, 4)); 
      * *//SYSTEM.OUT.PRINTLN ("intercept the top 12:" + cutstring.substring (S, 12)); 
    
  System.out.println ("Intercept first 12 bytes:" + Cutstring.idgui (S, 11));
 }  
}

That is everything I have to share about truncating a Java string containing Chinese characters by bytes. I hope it gives you a useful reference.
