Method One: judging with UnicodeBlock and UnicodeScript
In Java, character-related functionality is handled mainly by the Character class. In JDK 1.7, Character implements Unicode 6.0, so it helps to first learn a little about how Unicode is organized.
The UnicodeBlock and UnicodeScript classes (nested in Character as Character.UnicodeBlock and Character.UnicodeScript) help us judge what kind of character we have. A block is the basic unit by which the Unicode standard partitions the code space: each UnicodeBlock represents one contiguous range of code points, and blocks do not overlap. For example, we usually decide whether a character is Chinese by testing whether its code point lies in 0x4E00–0x9FCC, because that range belongs to a block dedicated to Chinese characters, named CJK Unified Ideographs. The range 0x4E00–0x9FCC covers 20,941 code points; the often-quoted figure of 74,617 appears to count CJK unified ideographs across all blocks, including the extension blocks, rather than this one block.
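To make the two ways of testing for a Chinese character concrete, here is a minimal sketch that puts the raw range check next to the block lookup. The class and method names (CjkBlockDemo, inCjkRange, inCjkBlock) are made up for this illustration; only Character.UnicodeBlock.of and the CJK_UNIFIED_IDEOGRAPHS constant come from the JDK.

```java
class CjkBlockDemo {
    // Range check quoted in the text: U+4E00 through U+9FCC.
    static boolean inCjkRange(char c) {
        return c >= 0x4E00 && c <= 0x9FCC;
    }

    // Block check: ask the JDK which Unicode block the character belongs to.
    static boolean inCjkBlock(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static void main(String[] args) {
        char zhong = '\u4E2D'; // the Chinese character for "middle"
        System.out.println(inCjkRange(zhong)); // true
        System.out.println(inCjkBlock(zhong)); // true
        System.out.println(inCjkBlock('A'));   // false
    }
}
```

Both checks look only at this one block, so ideographs that live in the extension blocks are not matched by either of them.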
The relationship between UnicodeBlock and UnicodeScript:
UnicodeScript classifies Unicode characters by writing system, that is, from the point of view of usage, whereas UnicodeBlock divides them by code point range.
1. A UnicodeBlock is a simple range of values (some of which may be gaps, that is, code points that have not been assigned a character).
2. The characters of one UnicodeScript may be spread across multiple UnicodeBlocks.
3. The characters in one UnicodeBlock may belong to different UnicodeScripts.
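Points 2 and 3 can be checked directly against the JDK. The sketch below is only an illustration: the class and helper names (BlockVersusScriptDemo, sameBlock, sameScript) are invented here, and the four sample code points are my own choice.

```java
class BlockVersusScriptDemo {
    // True if both code points fall in the same Unicode block.
    static boolean sameBlock(int a, int b) {
        return Character.UnicodeBlock.of(a) == Character.UnicodeBlock.of(b);
    }

    // True if both code points are assigned to the same Unicode script.
    static boolean sameScript(int a, int b) {
        return Character.UnicodeScript.of(a) == Character.UnicodeScript.of(b);
    }

    public static void main(String[] args) {
        // U+4E2D (CJK Unified Ideographs) and U+3400 (Extension A):
        // one script (HAN) spread over two blocks.
        System.out.println(sameScript(0x4E2D, 0x3400)); // true
        System.out.println(sameBlock(0x4E2D, 0x3400));  // false

        // U+3002 (ideographic full stop) and U+3007 (ideographic number zero):
        // one block (CJK Symbols and Punctuation), two scripts (COMMON and HAN).
        System.out.println(sameBlock(0x3002, 0x3007));  // true
        System.out.println(sameScript(0x3002, 0x3007)); // false
    }
}
```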
Distinguishing Chinese punctuation marks
Chinese punctuation marks are found mainly in the following five UnicodeBlocks:
U+2000 - General Punctuation (per mille sign ‰, per ten thousand sign ‱, single and double quotation marks, etc.)
U+3000 - CJK Symbols and Punctuation (ideographic comma 、, full stop 。, title marks 《》, 〸, 〹, 〺, etc. PS: you can probably guess what the last three mean. :) )
U+FF00 - Halfwidth and Fullwidth Forms (fullwidth greater-than, less-than, and equals signs, parentheses, exclamation mark, plus, minus, colon, semicolon, etc.)
U+FE30 - CJK Compatibility Forms (mainly brackets used in vertical writing, as well as the dashed overline ﹉, the double wavy overline ﹌, etc.)
U+FE10 - Vertical Forms (mainly punctuation marks used in vertical writing)
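As a quick check of the list above, this small sketch looks up the block of one sample character from each of the five ranges. The class name, the blockOf helper, and the choice of sample characters are assumptions made for illustration; the lookup itself is the standard Character.UnicodeBlock.of call.

```java
class PunctuationBlockDemo {
    // Thin wrapper around the JDK lookup, kept only so the demo reads clearly.
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        char[] samples = {
            '\u201C', // left double quotation mark (General Punctuation)
            '\u3002', // ideographic full stop (CJK Symbols and Punctuation)
            '\uFF0C', // fullwidth comma (Halfwidth and Fullwidth Forms)
            '\uFE41', // vertical left corner bracket (CJK Compatibility Forms)
            '\uFE10'  // vertical comma (Vertical Forms)
        };
        for (char c : samples) {
            // Prints the code point in hex followed by the block it belongs to.
            System.out.println(Integer.toHexString(c) + " -> " + blockOf(c));
        }
    }
}
```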
Using UnicodeBlock, a check for Chinese punctuation marks looks like this:
    // Returns true if c lies in one of the five blocks that hold Chinese punctuation.
    public boolean isChinesePunctuation(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                || ub == Character.UnicodeBlock.VERTICAL_FORMS;
    }
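One thing to keep in mind is that a block is only a range, so the block test also accepts non-punctuation characters that happen to live there, such as the fullwidth letter U+FF21 or the ideographic space U+3000. A possible refinement, which is not part of the original method, is to also require a punctuation general category via Character.getType. The sketch below assumes that stricter behaviour is what you want; the class and method names are hypothetical.

```java
class StrictChinesePunctuation {
    // The same five-block test as in the method above.
    static boolean inPunctuationBlocks(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                || ub == Character.UnicodeBlock.VERTICAL_FORMS;
    }

    // True if the general category is one of the Unicode punctuation categories.
    static boolean hasPunctuationCategory(char c) {
        switch (Character.getType(c)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Assumed refinement: block membership AND a punctuation category.
    static boolean isChinesePunctuationStrict(char c) {
        return inPunctuationBlocks(c) && hasPunctuationCategory(c);
    }

    public static void main(String[] args) {
        System.out.println(inPunctuationBlocks('\uFF21'));        // true: fullwidth letter A is in the block
        System.out.println(isChinesePunctuationStrict('\uFF21')); // false: it is a letter
        System.out.println(isChinesePunctuationStrict('\uFF0C')); // true: fullwidth comma
        System.out.println(isChinesePunctuationStrict('\u3000')); // false: ideographic space
    }
}
```

The trade-off is that characters with a symbol category, such as the fullwidth plus sign U+FF0B, are rejected as well. If those should count, the corresponding symbol categories (for example Character.MATH_SYMBOL) would have to be added to the switch.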
Method Two: judging by character range
    // Symbols: CJK symbols, ASCII symbols, and selected ranges from General
    // Punctuation, Miscellaneous Symbols and Arrows, and the fullwidth forms.
    static boolean isSymbol(char ch) {
        if (isCnSymbol(ch)) return true;
        if (isEnSymbol(ch)) return true;
        if (0x2010 <= ch && ch <= 0x2017) return true;
        if (0x2020 <= ch && ch <= 0x2027) return true;
        if (0x2b00 <= ch && ch <= 0x2bff) return true;
        if (0xff03 <= ch && ch <= 0xff06) return true;
        if (0xff08 <= ch && ch <= 0xff0b) return true;
        if (ch == 0xff0d || ch == 0xff0f) return true;
        if (0xff1c <= ch && ch <= 0xff1e) return true;
        if (ch == 0xff20 || ch == 0xff65) return true;
        if (0xff3b <= ch && ch <= 0xff40) return true;
        if (0xff5b <= ch && ch <= 0xff60) return true;
        if (ch == 0xff62 || ch == 0xff63) return true;
        if (ch == 0x0020 || ch == 0x3000) return true; // ASCII space, ideographic space
        return false;
    }
    // Symbols in the CJK Symbols and Punctuation block.
    static boolean isCnSymbol(char ch) {
        if (0x3004 <= ch && ch <= 0x301c) return true;
        if (0x3020 <= ch && ch <= 0x303f) return true;
        return false;
    }
    // ASCII symbols.
    static boolean isEnSymbol(char ch) {
        if (ch == 0x40) return true;
        if (ch == 0x2d || ch == 0x2f) return true;
        if (0x23 <= ch && ch <= 0x26) return true;
        if (0x28 <= ch && ch <= 0x2b) return true;
        if (0x3c <= ch && ch <= 0x3e) return true;
        if (0x5b <= ch && ch <= 0x60) return true;
        if (0x7b <= ch && ch <= 0x7e) return true;
        return false;
    }
    // Punctuation: CJK and ASCII punctuation, curly quotation marks,
    // and the fullwidth and halfwidth punctuation forms.
    static boolean isPunctuation(char ch) {
        if (isCjkPunc(ch)) return true;
        if (isEnPunc(ch)) return true;
        if (0x2018 <= ch && ch <= 0x201f) return true;
        if (ch == 0xff01 || ch == 0xff02) return true;
        if (ch == 0xff07 || ch == 0xff0c) return true;
        if (ch == 0xff1a || ch == 0xff1b) return true;
        if (ch == 0xff1f || ch == 0xff61) return true;
        if (ch == 0xff0e) return true;
        if (ch == 0xff65) return true;
        return false;
    }
    // ASCII punctuation: ! " ' , . : ; ?
    static boolean isEnPunc(char ch) {
        if (0x21 <= ch && ch <= 0x22) return true;
        if (ch == 0x27 || ch == 0x2c) return true;
        if (ch == 0x2e || ch == 0x3a) return true;
        if (ch == 0x3b || ch == 0x3f) return true;
        return false;
    }
    // Punctuation in the CJK Symbols and Punctuation block.
    static boolean isCjkPunc(char ch) {
        if (0x3001 <= ch && ch <= 0x3003) return true;
        if (0x301d <= ch && ch <= 0x301f) return true;
        return false;
    }
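A typical use of range predicates like these is filtering a string. The following is a minimal sketch, not the original author's code: it copies only the two smallest helpers from above so that it stays self-contained, and stripPunctuation is a hypothetical name introduced for the example.

```java
class StripPunctuationDemo {
    // Copied from the text: ASCII punctuation ! " ' , . : ; ?
    static boolean isEnPunc(char ch) {
        if (0x21 <= ch && ch <= 0x22) return true;
        if (ch == 0x27 || ch == 0x2c) return true;
        if (ch == 0x2e || ch == 0x3a) return true;
        if (ch == 0x3b || ch == 0x3f) return true;
        return false;
    }

    // Copied from the text: punctuation in the CJK Symbols and Punctuation block.
    static boolean isCjkPunc(char ch) {
        if (0x3001 <= ch && ch <= 0x3003) return true;
        if (0x301d <= ch && ch <= 0x301f) return true;
        return false;
    }

    // Hypothetical helper: keeps every char that neither predicate flags.
    static String stripPunctuation(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (!isEnPunc(ch) && !isCjkPunc(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "ni hao", ideographic comma, "shi jie", ideographic full stop, then ASCII text.
        String input = "\u4F60\u597D\u3001\u4E16\u754C\u3002Hello, world!";
        System.out.println(stripPunctuation(input));
    }
}
```

Because the predicates take a char, this only sees characters in the Basic Multilingual Plane, which is enough for the ranges listed here. Fullwidth marks such as U+FF0C are handled by isPunctuation in the full version, not by these two helpers.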
Method Three: checking against a custom collection of punctuation marks (omitted)
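Since the details of this method are left out, here is one way it might look: a minimal sketch that keeps the marks in a Set and tests membership. Everything in it is an assumption for illustration, including the class name and, above all, the contents of the set, which you would replace with whatever marks your application cares about.

```java
import java.util.HashSet;
import java.util.Set;

class CustomPunctuationSet {
    // Illustrative selection only: fullwidth comma, ideographic full stop,
    // fullwidth exclamation and question marks, ideographic comma, fullwidth
    // semicolon and colon, curly quotes, title marks, fullwidth parentheses,
    // and the horizontal ellipsis.
    private static final String MARKS =
            "\uFF0C\u3002\uFF01\uFF1F\u3001\uFF1B\uFF1A"
            + "\u201C\u201D\u2018\u2019\u300A\u300B\uFF08\uFF09\u2026";

    private static final Set<Character> PUNCTUATION = new HashSet<>();

    static {
        for (char c : MARKS.toCharArray()) {
            PUNCTUATION.add(c);
        }
    }

    // True only for characters explicitly listed in MARKS.
    static boolean isPunctuation(char c) {
        return PUNCTUATION.contains(c);
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation('\uFF0C')); // true: fullwidth comma is listed
        System.out.println(isPunctuation('\u4E2D')); // false: a Chinese character
        System.out.println(isPunctuation(','));      // false: ASCII comma is not listed
    }
}
```

The appeal of this approach is that it is exact: nothing counts as punctuation unless you listed it. The cost is that the list has to be maintained by hand.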
Reference: http://bbs.csdn.net/topics/390812840