In natural language processing, text that mixes full-width and half-width characters leads to inconsistent information extraction, so the characters need to be normalized to a single form.
Conversion instructions
Full-width and half-width conversion rules
General rule (excluding the space):
Full-width characters: Unicode code points 65281 to 65374 (hex 0xFF01 to 0xFF5E)
Half-width characters: Unicode code points 33 to 126 (hex 0x21 to 0x7E)
Exception:
The space is special: the full-width space is 12288 (0x3000) and the half-width space is 32 (0x20).
Apart from the space, full-width and half-width characters appear in the same order in Unicode (half-width + 0xFEE0 = full-width, where 0xFEE0 is 65248 in decimal), so non-space characters can be converted directly by adding or subtracting this offset.
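The offset can be checked with a few lines of Python. This is a minimal sketch that assumes Python 3, where chr() accepts any Unicode code point; the names OFFSET, to_fullwidth_char, and to_halfwidth_char are hypothetical and chosen only for illustration.

```python
# Distance between a half-width character and its full-width counterpart.
OFFSET = 0xFF01 - 0x21  # 0xFEE0, i.e. 65248 in decimal


def to_fullwidth_char(ch):
    """Hypothetical helper: map one half-width character to full-width."""
    code = ord(ch)
    if code == 0x20:            # the space is the exception
        return chr(0x3000)
    if 0x21 <= code <= 0x7E:    # every other printable ASCII character
        return chr(code + OFFSET)
    return ch                   # anything else is left unchanged


def to_halfwidth_char(ch):
    """Hypothetical helper: map one full-width character to half-width."""
    code = ord(ch)
    if code == 0x3000:          # the full-width space is the exception
        return chr(0x20)
    if 0xFF01 <= code <= 0xFF5E:
        return chr(code - OFFSET)
    return ch
```

Characters outside both ranges, such as Chinese characters, pass through untouched, which matches the rule described above.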
Note:
1. Chinese characters are always full-width. Only English letters, digits, and symbols come in both full-width and half-width forms: a letter or digit that occupies the width of one Chinese character is full-width, and one that occupies half that width is half-width.
2. Quotation marks differ between Chinese and English, and between their full-width and half-width forms.
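The second note matters in practice because the curly quotation marks common in Chinese text (U+201C, U+201D, U+2018, U+2019) lie outside the 0xFF01 to 0xFF5E range, so the offset rule does not touch them. The sketch below, assuming Python 3, shows one possible way to handle them with an explicit table; QUOTE_MAP, normalize_quotes, and in_fullwidth_block are hypothetical names, and mapping curly quotes to straight ASCII quotes is an assumption about what the application wants.

```python
# Assumed mapping for illustration only; adjust to the application's needs.
QUOTE_MAP = {
    0x201C: '"',  # left double quotation mark
    0x201D: '"',  # right double quotation mark
    0x2018: "'",  # left single quotation mark
    0x2019: "'",  # right single quotation mark
}


def in_fullwidth_block(ch):
    """Return True if ch is covered by the offset rule (0xFF01 to 0xFF5E)."""
    return 0xFF01 <= ord(ch) <= 0xFF5E


def normalize_quotes(text):
    """Replace curly quotation marks with straight ASCII quotes."""
    return text.translate(QUOTE_MAP)
```

The full-width quotation mark U+FF02, by contrast, is inside the range and is converted by the offset rule like any other full-width symbol.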
Reference code (Python 2)
# -*- coding: cp936 -*-

def strQ2B(ustring):
    """Convert full-width characters to half-width."""
    rstring = u""
    for uchar in ustring:
        inside_code = ord(uchar)
        if inside_code == 12288:
            # Full-width space: convert directly
            inside_code = 32
        elif 65281 <= inside_code <= 65374:
            # Other full-width characters: subtract the offset
            inside_code -= 65248
        rstring += unichr(inside_code)
    return rstring

def strB2Q(ustring):
    """Convert half-width characters to full-width."""
    rstring = u""
    for uchar in ustring:
        inside_code = ord(uchar)
        if inside_code == 32:
            # Half-width space: convert directly
            inside_code = 12288
        elif 33 <= inside_code <= 126:
            # Other half-width characters: add the offset
            inside_code += 65248
        rstring += unichr(inside_code)
    return rstring

b = strQ2B("MN123ABC Blog Park".decode('cp936'))
print b
c = strB2Q("MN123ABC Blog Park".decode('cp936'))
print c
Execution results
Library function description
In Python 2, chr() takes an integer in range(256) (that is, 0 to 255) and returns the corresponding character.
unichr() works the same way, except that it returns a Unicode character.
ord() is the inverse of chr() (for 8-bit ASCII strings) and of unichr() (for Unicode objects): it takes a character (a string of length 1) and returns its ASCII or Unicode value. In Python 3, unichr() no longer exists and chr() accepts any Unicode code point.
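For readers on Python 3, the same conversion can be sketched with chr(), ord(), and str.translate(). This is an illustrative port and not the original author's code: the names Q2B_TABLE, B2Q_TABLE, str_q2b, and str_b2q are hypothetical, and using precomputed translation tables instead of a character loop is a swapped-in design choice.

```python
# Python 3 sketch: chr() covers all Unicode code points, so unichr() is gone.
OFFSET = 65248

# Translation tables map code points to code points.
Q2B_TABLE = {code: code - OFFSET for code in range(65281, 65375)}
Q2B_TABLE[12288] = 32            # full-width space -> half-width space

B2Q_TABLE = {code: code + OFFSET for code in range(33, 127)}
B2Q_TABLE[32] = 12288            # half-width space -> full-width space


def str_q2b(text):
    """Convert full-width characters in text to half-width."""
    return text.translate(Q2B_TABLE)


def str_b2q(text):
    """Convert half-width characters in text to full-width."""
    return text.translate(B2Q_TABLE)


# ord() and chr() are inverses of each other:
code = ord("\uff21")             # full-width letter A
half = chr(code - OFFSET)        # the ASCII letter "A"
```

Building the tables once keeps the per-string work inside str.translate(), which avoids repeated string concatenation in a Python-level loop.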
Case
Converting between full-width and half-width characters in Python