Java: Detecting Chinese and English symbols and punctuation

Source: Internet
Author: User

Method One: using UnicodeBlock and UnicodeScript

In Java, character handling is done mainly through the Character class. In JDK 1.7, Character implements Unicode 6.0, so it is worth reviewing a few Unicode basics first.

The UnicodeBlock and UnicodeScript classes (both nested in Character) help determine what kind of character we are looking at. A UnicodeBlock is a basic unit of code-point allocation defined by the Unicode Consortium: each block is a contiguous range of code points, and blocks do not overlap. For example, the usual test for whether a character is Chinese checks whether it lies in 0x4E00–0x9FCC, because one block, CJK Unified Ideographs, is dedicated to Chinese characters. (The often-quoted total of 74,617 characters counts the CJK unified ideographs across all CJK ideograph blocks, extensions included, not this block alone.)
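As a minimal sketch of the two checks just described, the following compares the hard-coded range test with a lookup through Character.UnicodeBlock.of. The class and method names are illustrative assumptions, not part of the original post.

```java
class ChineseCharCheck {
    // Range test quoted in the text: U+4E00 through U+9FCC.
    static boolean inCjkRange(char c) {
        return c >= 0x4E00 && c <= 0x9FCC;
    }

    // Block lookup: lets the JDK's Unicode tables decide which block c is in.
    static boolean inCjkBlock(char c) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static void main(String[] args) {
        // U+4E2D is a Chinese character, 'a' is Latin, U+FF0C is a fullwidth comma.
        for (char c : new char[] {'\u4E2D', 'a', '\uFF0C'}) {
            System.out.println(c + " range=" + inCjkRange(c) + " block=" + inCjkBlock(c));
        }
    }
}
```

The block lookup avoids hard-coding the upper bound, which may matter because the set of assigned code points in a block can grow between Unicode versions.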

How UnicodeBlock and UnicodeScript relate:

UnicodeScript classifies Unicode characters by the writing system they belong to, that is, by how they are used, whereas UnicodeBlock divides them purely by code-point range.

1. A UnicodeBlock is a simple numeric range (it may contain gaps, i.e. code points that have not been assigned a character).

2. The characters of one UnicodeScript may be spread across several UnicodeBlocks.

3. The characters of one UnicodeBlock may belong to several UnicodeScripts.
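Points 2 and 3 can be observed directly with Character.UnicodeBlock.of and Character.UnicodeScript.of. The sketch below is only an illustration; the class name, the helper methods, and the sample code points are assumptions chosen for the example.

```java
class BlockVsScript {
    static boolean sameBlock(int a, int b) {
        return Character.UnicodeBlock.of(a) == Character.UnicodeBlock.of(b);
    }

    static boolean sameScript(int a, int b) {
        return Character.UnicodeScript.of(a) == Character.UnicodeScript.of(b);
    }

    public static void main(String[] args) {
        // U+4E2D (main CJK block) and U+3400 (Extension A):
        // one script (HAN) spread over two blocks.
        System.out.println(sameScript(0x4E2D, 0x3400) + " " + sameBlock(0x4E2D, 0x3400));

        // U+FF0C (fullwidth comma) and U+FF71 (halfwidth katakana A):
        // one block (Halfwidth and Fullwidth Forms) holding two scripts.
        System.out.println(sameBlock(0xFF0C, 0xFF71) + " " + sameScript(0xFF0C, 0xFF71));
    }
}
```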

Detecting Chinese punctuation marks

Chinese punctuation marks are found mainly in the following five UnicodeBlocks:

U+2000 General Punctuation (per mille sign ‰, single and double quotation marks, etc.)

U+3000 CJK Symbols and Punctuation (ideographic comma, ideographic full stop, title marks, 〸, 〹, 〺, etc. PS: you can probably guess what the last three characters mean :) )

U+FF00 Halfwidth and Fullwidth Forms (fullwidth greater-than, less-than, equals, parentheses, exclamation mark, plus, minus, colon, semicolon, etc.)

U+FE30 CJK Compatibility Forms (mainly brackets used in vertical writing, plus the dashed line ﹉, the wavy line ﹌, etc.)

U+FE10 Vertical Forms (mainly punctuation marks for vertical writing)

Using the UnicodeBlock approach, a check for Chinese punctuation looks like this:

    public boolean isChinesePunctuation(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        // True when the character falls in one of the five blocks listed above.
        if (ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                || ub == Character.UnicodeBlock.VERTICAL_FORMS) {
            return true;
        } else {
            return false;
        }
    }
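One caveat: these blocks also contain characters that are not punctuation, such as the fullwidth letters and digits in Halfwidth and Fullwidth Forms, so the block test alone accepts them. A possible refinement, which is not part of the original post, is to also require a punctuation general category from Character.getType. The class and method names below are hypothetical.

```java
class StrictChinesePunctuation {
    // Same five-block test as the block-based check.
    static boolean inPunctuationBlocks(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                || ub == Character.UnicodeBlock.VERTICAL_FORMS;
    }

    // True when the general category is one of the Unicode punctuation categories.
    static boolean isPunctuationCategory(char c) {
        switch (Character.getType(c)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Assumption: "Chinese punctuation" means both conditions hold.
    static boolean isChinesePunctuationStrict(char c) {
        return inPunctuationBlocks(c) && isPunctuationCategory(c);
    }
}
```

Characters such as the fullwidth plus sign are categorized as symbols rather than punctuation, so this stricter check rejects them; whether that is desirable depends on the use case.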

Method Two: checking by character range

    // Symbols: CJK symbols, ASCII symbols, plus selected General Punctuation
    // and fullwidth ranges.
    static boolean isSymbol(char ch) {
        if (isCnSymbol(ch)) return true;
        if (isEnSymbol(ch)) return true;
        if (0x2010 <= ch && ch <= 0x2017) return true;
        if (0x2020 <= ch && ch <= 0x2027) return true;
        if (0x2B00 <= ch && ch <= 0x2BFF) return true;
        if (0xFF03 <= ch && ch <= 0xFF06) return true;
        if (0xFF08 <= ch && ch <= 0xFF0B) return true;
        if (ch == 0xFF0D || ch == 0xFF0F) return true;
        if (0xFF1C <= ch && ch <= 0xFF1E) return true;
        if (ch == 0xFF20 || ch == 0xFF65) return true;
        if (0xFF3B <= ch && ch <= 0xFF40) return true;
        if (0xFF5B <= ch && ch <= 0xFF60) return true;
        if (ch == 0xFF62 || ch == 0xFF63) return true;
        if (ch == 0x0020 || ch == 0x3000) return true; // ASCII and ideographic space
        return false;
    }

    // Symbols in the CJK Symbols and Punctuation block.
    static boolean isCnSymbol(char ch) {
        if (0x3004 <= ch && ch <= 0x301C) return true;
        if (0x3020 <= ch && ch <= 0x303F) return true;
        return false;
    }

    // ASCII symbols.
    static boolean isEnSymbol(char ch) {
        if (ch == 0x40) return true;
        if (ch == 0x2D || ch == 0x2F) return true;
        if (0x23 <= ch && ch <= 0x26) return true;
        if (0x28 <= ch && ch <= 0x2B) return true;
        if (0x3C <= ch && ch <= 0x3E) return true;
        if (0x5B <= ch && ch <= 0x60) return true;
        if (0x7B <= ch && ch <= 0x7E) return true;
        return false;
    }

    // Punctuation: CJK and ASCII punctuation, curly quotes, and fullwidth marks.
    static boolean isPunctuation(char ch) {
        if (isCjkPunc(ch)) return true;
        if (isEnPunc(ch)) return true;
        if (0x2018 <= ch && ch <= 0x201F) return true;
        if (ch == 0xFF01 || ch == 0xFF02) return true;
        if (ch == 0xFF07 || ch == 0xFF0C) return true;
        if (ch == 0xFF1A || ch == 0xFF1B) return true;
        if (ch == 0xFF1F || ch == 0xFF61) return true;
        if (ch == 0xFF0E) return true;
        if (ch == 0xFF65) return true;
        return false;
    }

    // ASCII punctuation.
    static boolean isEnPunc(char ch) {
        if (0x21 <= ch && ch <= 0x22) return true;
        if (ch == 0x27 || ch == 0x2C) return true;
        if (ch == 0x2E || ch == 0x3A) return true;
        if (ch == 0x3B || ch == 0x3F) return true;
        return false;
    }

    // Punctuation in the CJK Symbols and Punctuation block.
    static boolean isCjkPunc(char ch) {
        if (0x3001 <= ch && ch <= 0x3003) return true;
        if (0x301D <= ch && ch <= 0x301F) return true;
        return false;
    }

Method Three: checking against a custom set of punctuation marks (omitted)
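The original post leaves this method out. Purely as a hypothetical sketch of the idea, a lookup against a hand-picked set might look like the following. The class name and the list of marks are assumptions and are deliberately incomplete.

```java
import java.util.HashSet;
import java.util.Set;

class CustomPunctuationSet {
    // Hypothetical sample list: a few CJK and fullwidth marks (written as
    // Unicode escapes) followed by some ASCII ones. Extend it for your own data.
    private static final String MARKS =
            "\u3001\u3002\uFF0C\uFF1A\uFF1B\uFF1F\uFF01"
            + "\u201C\u201D\u2018\u2019\u300A\u300B\uFF08\uFF09"
            + ",.:;?!\"'()";

    private static final Set<Character> PUNCTUATION = new HashSet<>();

    static {
        for (char c : MARKS.toCharArray()) {
            PUNCTUATION.add(c);
        }
    }

    // True only for characters explicitly listed in MARKS.
    static boolean isPunctuation(char c) {
        return PUNCTUATION.contains(c);
    }
}
```

The trade-off is explicit control over exactly which characters count, at the cost of maintaining the list by hand.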

Reference: http://bbs.csdn.net/topics/390812840
