When it comes to matching Chinese characters with regular expressions, the pattern you will find most easily is [\u4e00-\u9fa5], but it is not comprehensive and leaves out some rare characters.
This article sorts through the problem.
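To see concretely what the narrow range misses, here is a minimal sketch in Python. The name `NARROW` and the sample characters are chosen purely for illustration; the only assumption is Python 3.3 or later, where the `re` module understands `\uXXXX` escapes inside patterns.

```python
import re

# The commonly quoted pattern: it stops at U+9FA5 and ignores every extension block.
NARROW = re.compile(r'[\u4e00-\u9fa5]')

# Sample characters (illustrative picks, not from the article).
samples = {
    '\u4e2d': 'U+4E2D, a common character',
    '\u9fa6': 'U+9FA6, just past the end of the narrow range',
    '\u3400': 'U+3400, first code point of Extension A',
    '\U00020000': 'U+20000, first code point of Extension B',
}

for ch, note in samples.items():
    matched = NARROW.fullmatch(ch) is not None
    print(f'{note}: matched={matched}')
# Only the first sample matches; the other three fall outside [\u4e00-\u9fa5].
```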
The following is a more complete list of the Unicode blocks that hold Chinese characters, as of the Unicode 8.0 standard (June 2015):
| Block | Range | Actual number of characters | Regex |
|---|---|---|---|
| CJK Unified Ideographs | 4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF | 20,950 | [\u4e00-\u9fff] |
| CJK Unified Ideographs Extension A | 3400-4DBF | 6,582 | [\u3400-\u4dbf] |
| CJK Compatibility Ideographs | F900-FAFF | 472 | [\uf900-\ufaff] |
| CJK Unified Ideographs Extension B | 20000-215FF, 21600-230FF, 23100-245FF, 24600-260FF, 26100-275FF, 27600-290FF, 29100-2A6DF | 42,711 | [\U00020000-\U0002a6d6] |
| CJK Unified Ideographs Extension C | 2A700-2B73F | 4,149 | [\U0002a700-\U0002b73f] |
| CJK Unified Ideographs Extension D | 2B740-2B81F | 222 | [\U0002b740-\U0002b81f] |
| CJK Unified Ideographs Extension E | 2B820-2CEAF | 5,762 | [\U0002b820-\U0002ceaf] |
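The table can also be kept as data, with the character class generated from it. The sketch below is one possible way to do that, not the article's own method: `BLOCKS`, `escape_code_point`, and `build_class` are hypothetical names, and the numbers are copied from the table above. Note that Python needs the eight-digit `\UXXXXXXXX` escape (capital U) for code points above U+FFFF, while the four-digit `\uXXXX` form only reaches U+FFFF.

```python
import re

# (block name, first code point, last code point, characters in Unicode 8.0)
# Values copied from the table above; the structure itself is illustrative.
BLOCKS = [
    ('CJK Unified Ideographs', 0x4E00, 0x9FFF, 20950),
    ('CJK Unified Ideographs Extension A', 0x3400, 0x4DBF, 6582),
    ('CJK Compatibility Ideographs', 0xF900, 0xFAFF, 472),
    ('CJK Unified Ideographs Extension B', 0x20000, 0x2A6D6, 42711),
    ('CJK Unified Ideographs Extension C', 0x2A700, 0x2B73F, 4149),
    ('CJK Unified Ideographs Extension D', 0x2B740, 0x2B81F, 222),
    ('CJK Unified Ideographs Extension E', 0x2B820, 0x2CEAF, 5762),
]

def escape_code_point(cp):
    """Return a regex escape: 4 hex digits within the BMP, 8 hex digits above it."""
    return '\\u%04x' % cp if cp <= 0xFFFF else '\\U%08x' % cp

def build_class(blocks):
    """Join the ranges of the given rows into a single character class."""
    ranges = ''.join(
        escape_code_point(lo) + '-' + escape_code_point(hi)
        for _name, lo, hi, _count in blocks
    )
    return '[' + ranges + ']'

total = sum(count for _name, _lo, _hi, count in BLOCKS)
pattern = re.compile(build_class(BLOCKS))

print(total)                    # 80848
print(build_class(BLOCKS[:1]))  # [\u4e00-\u9fff]
```

Keeping the ranges in one list makes it easy to add a block later without hand-editing a long pattern.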
To match the most common Chinese characters, use:
[\u4e00-\u9fff]
To match Chinese characters within the BMP (Basic Multilingual Plane), that is, code points <= 0xFFFF, use:
[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]
This covers, but is not limited to, the Chinese characters defined in GBK.
To match as many Chinese characters as possible, use:
[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\U00020000-\U0002ceaf]
This covers all of the more than 80,000 characters in the table above.
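The three choices above can be compared side by side in Python. This is a small sketch under the assumption of Python 3.3 or later; the names `COMMON`, `BMP`, `ALL`, and `coverage` are made up for the example, and the supplementary-plane range is written with the eight-digit `\U` escape that Python requires.

```python
import re

# Most common characters only.
COMMON = re.compile(r'[\u4e00-\u9fff]')
# Everything inside the BMP (code points <= 0xFFFF).
BMP = re.compile(r'[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]')
# BMP ranges plus Extensions B through E.
ALL = re.compile(r'[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\U00020000-\U0002ceaf]')

def coverage(ch):
    """Return the names of the patterns that match the single character ch."""
    patterns = (('COMMON', COMMON), ('BMP', BMP), ('ALL', ALL))
    return [name for name, pat in patterns if pat.fullmatch(ch)]

print(coverage('\u6c49'))      # ['COMMON', 'BMP', 'ALL']  (U+6C49, a common character)
print(coverage('\u3400'))      # ['BMP', 'ALL']            (Extension A)
print(coverage('\U00020000'))  # ['ALL']                   (Extension B)
print(coverage('A'))           # []
```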
Notes
1. The regular expressions above do not match punctuation, whether English or Chinese.
2. The ranges include some unassigned code points that hold no Chinese characters; this does no harm.
3. Tested on Python 3.5.
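The first note is easy to check. The sketch below pulls runs of Chinese characters out of mixed text and shows that both full-width and ASCII punctuation are left behind. `HAN_RUN` and the sample string are illustrative, not from the article, and the ranges are written in the form Python accepts (`\U` with eight hex digits above U+FFFF).

```python
import re

# One or more consecutive Chinese characters from any of the ranges in the table.
HAN_RUN = re.compile(
    r'[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\U00020000-\U0002ceaf]+'
)

# Mixed text: Chinese, full-width punctuation, ASCII letters and punctuation.
text = '你好,世界!Hello, 世界。'

print(HAN_RUN.findall(text))  # ['你好', '世界', '世界']

# Punctuation alone yields no match at all.
print(HAN_RUN.search(',。!,.!'))  # None
```

If punctuation should be matched as well, it has to be added to the character class separately.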