This library primarily provides access to the UCD. UCD is the abbreviation of Unicode Character Database. The UCD consists of a number of plain-text files that describe Unicode character properties and internal relationships, plus several HTML files. Most of the UCD text files contain Unicode-related data and are suitable for parsing by programs. The HTML files explain the organization of the database and the format and meaning of the data. The largest file in the UCD is undoubtedly Unihan.txt, the file that describes the properties of Chinese characters. In UCD 5.0.0, Unihan.txt is 28,221 KB in size. Unihan.txt contains many useful indexes, such as Chinese radicals, stroke counts, pinyin, usage frequency, and four-corner codes. These indexes are based on fairly authoritative dictionaries, but most of them cover only a subset of the Chinese characters.
unicodedata.lookup(name)
Looks up a character by name. If a character with the given name exists, the corresponding character is returned; otherwise KeyError is raised.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.lookup('LEFT CURLY BRACKET'))
The resulting output is as follows:
>>>
{
>>>
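Because lookup() raises KeyError for an unknown name, callers often wrap it. The following is a minimal sketch; safe_lookup and its fallback parameter are hypothetical names used only for illustration and are not part of unicodedata.

```python
import unicodedata

def safe_lookup(name, fallback=None):
    """Hypothetical helper: return the character called `name`, or `fallback`."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() signals an unknown character name with KeyError
        return fallback

print(safe_lookup('LEFT CURLY BRACKET'))      # {
print(safe_lookup('NO SUCH CHARACTER NAME'))  # None
```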
unicodedata.name(chr[, default])
Looks up the name of a character. On success the name is returned; if the character has no name, default is returned if it was given, otherwise ValueError is raised.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.name('{'))
The resulting output is as follows:
>>>
LEFT CURLY BRACKET
>>>
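name() and lookup() are inverses of each other for named characters. The short sketch below shows the round trip and the effect of the default argument; the string 'UNNAMED' is just an arbitrary placeholder chosen for this example.

```python
import unicodedata

ch = '{'
name = unicodedata.name(ch)
# Looking the name up again gives back the original character
round_trip = unicodedata.lookup(name)
print(name, round_trip == ch)

# Control characters such as U+0000 have no name, so without a
# default this call would raise ValueError
print(unicodedata.name('\x00', 'UNNAMED'))  # UNNAMED
```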
unicodedata.decimal(chr[, default])
Returns the decimal value assigned to the character as an integer. If the character has no decimal value, default is returned if it was given, otherwise ValueError is raised.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.decimal('7'))
The resulting output is as follows:
>>>
7
>>>
unicodedata.digit(chr[, default])
Returns the digit value assigned to the character as an integer, for example mapping the characters '0' to '9' to the corresponding numbers. If the character has no digit value, default is returned if it was given, otherwise ValueError is raised.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.digit('9', None))
The resulting output is as follows:
>>>
9
>>>
unicodedata.numeric(chr[, default])
Returns the numeric value assigned to the character as a floating-point number. For example, both '8' and '四' can be converted. Unlike digit(), it accepts any character that represents a numeric value, not just the characters 0 to 9. If the character has no numeric value, default is returned if it was given, otherwise ValueError is raised.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.numeric('四', None))
print(unicodedata.numeric('8', None))
The resulting output is as follows:
>>>
4.0
8.0
>>>
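The difference between decimal(), digit() and numeric() is easiest to see side by side. The sketch below passes None as the default so that no exception is raised; describe is a hypothetical helper name used only for this illustration.

```python
import unicodedata

def describe(ch):
    """Hypothetical helper: (decimal, digit, numeric) values, None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

print(describe('7'))       # (7, 7, 7.0)
print(describe('\u00b2'))  # SUPERSCRIPT TWO: (None, 2, 2.0)
print(describe('\u00bd'))  # VULGAR FRACTION ONE HALF: (None, None, 0.5)
print(describe('a'))       # (None, None, None)
```

Each function accepts a progressively wider set of characters: every character with a decimal value also has a digit value, and every character with a digit value also has a numeric value.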
unicodedata.category(chr)
Returns the general category assigned to the character in Unicode. The categories are as follows:
Code Description
[Cc] Other, Control
[Cf] Other, Format
[Cn] Other, Not Assigned (no characters in the file have this property)
[Co] Other, Private Use
[Cs] Other, Surrogate
[LC] Letter, Cased
[Ll] Letter, Lowercase
[Lm] Letter, Modifier
[Lo] Letter, Other
[Lt] Letter, Titlecase
[Lu] Letter, Uppercase
[Mc] Mark, Spacing Combining
[Me] Mark, Enclosing
[Mn] Mark, Nonspacing
[Nd] Number, Decimal Digit
[Nl] Number, Letter
[No] Number, Other
[Pc] Punctuation, Connector
[Pd] Punctuation, Dash
[Pe] Punctuation, Close
[Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)
[Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
[Po] Punctuation, Other
[Ps] Punctuation, Open
[Sc] Symbol, Currency
[Sk] Symbol, Modifier
[Sm] Symbol, Math
[So] Symbol, Other
[Zl] Separator, Line
[Zp] Separator, Paragraph
[Zs] Separator, Space
Example:
#python 3.4.3
import unicodedata
print(unicodedata.category('四'))
print(unicodedata.category('8'))
print(unicodedata.category('a'))
The resulting output is as follows:
>>>
Lo
Nd
Ll
>>>
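Since the first letter of each code identifies the major class (L for letters, P for punctuation, and so on), category() is handy for simple text filtering. The following is a small sketch of that idea; strip_punctuation is a hypothetical helper name, not something provided by unicodedata.

```python
import unicodedata
from collections import Counter

text = 'Hello, World 42!'

# Count how many characters fall into each general category
counts = Counter(unicodedata.category(ch) for ch in text)
print(sorted(counts.items()))
# [('Ll', 8), ('Lu', 2), ('Nd', 2), ('Po', 2), ('Zs', 2)]

def strip_punctuation(s):
    """Hypothetical helper: drop every character whose category starts with 'P'."""
    return ''.join(ch for ch in s if not unicodedata.category(ch).startswith('P'))

print(strip_punctuation(text))  # Hello World 42
```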
unicodedata.bidirectional(chr)
Returns the bidirectional class assigned to the character, which determines whether it is laid out from left to right or from right to left. If no such value is defined, an empty string is returned.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.bidirectional('9'))
print(unicodedata.bidirectional(u'\u0660'))
print(unicodedata.bidirectional('中'))
print(unicodedata.bidirectional('a'))
print(unicodedata.category(u'\u0660'))
The resulting output is as follows:
>>>
EN
AN
L
L
Nd
>>>
Here EN stands for European Number, AN for Arabic Number, and L for Left-to-Right; Nd is the general category Number, Decimal Digit.
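As a further illustration, the sketch below uses the bidirectional class to detect right-to-left letters. The helper is_rtl is a hypothetical name, and treating only the classes R and AL as right-to-left is a simplification made for this example.

```python
import unicodedata

def is_rtl(ch):
    """Hypothetical helper: True for strongly right-to-left characters."""
    # 'R' is Right-to-Left (e.g. Hebrew), 'AL' is Right-to-Left Arabic
    return unicodedata.bidirectional(ch) in ('R', 'AL')

print(unicodedata.bidirectional('\u05d0'), is_rtl('\u05d0'))  # HEBREW LETTER ALEF: R True
print(unicodedata.bidirectional('\u0627'), is_rtl('\u0627'))  # ARABIC LETTER ALEF: AL True
print(unicodedata.bidirectional('a'), is_rtl('a'))            # L False
print(unicodedata.bidirectional(' '))                         # WS (whitespace)
```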
unicodedata.combining(chr)
Returns the canonical combining class of the character as an integer, or 0 if no combining class is defined. During normalization, combining marks are sorted by this value, with larger values placed after smaller ones.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.combining('9'))
print(unicodedata.combining('A'))
The resulting output is as follows:
>>>
0
0
>>>
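Ordinary characters such as '9' and 'A' all return 0, so the following sketch uses two combining marks to show non-zero values and how normalization orders them. The variable names are arbitrary choices for this example.

```python
import unicodedata

acute = '\u0301'      # COMBINING ACUTE ACCENT
dot_below = '\u0323'  # COMBINING DOT BELOW

print(unicodedata.combining(acute))      # 230
print(unicodedata.combining(dot_below))  # 220

# The marks are written here in "wrong" order (230 before 220);
# normalization sorts them by ascending combining class.
text = 'a' + acute + dot_below
reordered = unicodedata.normalize('NFD', text)
print([unicodedata.combining(ch) for ch in reordered])  # [0, 220, 230]
```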
unicodedata.east_asian_width(chr)
Returns the East Asian width assigned to the character, which describes how wide it is displayed. The possible values are:
'F' (Fullwidth), 'H' (Halfwidth), 'W' (Wide), 'Na' (Narrow), 'A' (Ambiguous) or 'N' (Neutral).
Example:
#python 3.4.3
import unicodedata
print(unicodedata.east_asian_width('9'))
print(unicodedata.east_asian_width('A'))
print(unicodedata.east_asian_width('蔡'))
The resulting output is as follows:
>>>
Na
Na
W
>>>
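A common use of this property is estimating how many terminal columns a string occupies. The sketch below is a rough approximation only: display_width is a hypothetical helper, it counts 'F' and 'W' characters as two columns and everything else as one, and it ignores combining marks and the ambiguous class.

```python
import unicodedata

def display_width(text):
    """Hypothetical helper: approximate column count of `text` in a monospace terminal."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ('F', 'W') else 1
        for ch in text
    )

print(display_width('abc'))     # 3
print(display_width('蔡abc'))   # 5
print(display_width('\uff21'))  # FULLWIDTH LATIN CAPITAL LETTER A: 2
```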
unicodedata.mirrored(chr)
Returns the mirrored property of the character: 1 if the character is mirrored in bidirectional text, 0 otherwise.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.mirrored('9'))
print(unicodedata.mirrored('A'))
print(unicodedata.mirrored('蔡'))
The resulting output is as follows:
>>>
0
0
0
>>>
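All three characters above return 0. For contrast, the short sketch below checks a few paired punctuation characters, which are the typical case where the property is 1.

```python
import unicodedata

# Brackets and comparison signs are drawn mirrored in right-to-left text
for ch in '([<a':
    print(repr(ch), unicodedata.mirrored(ch))
# '(' 1
# '[' 1
# '<' 1
# 'a' 0
```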
unicodedata.decomposition(chr)
Returns the decomposition mapping of the character as a string of hexadecimal code points, or an empty string if the character has no decomposition.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.decomposition('9'))
print(unicodedata.decomposition('-'))
print(unicodedata.decomposition('蔡'))
print(unicodedata.decomposition('ガ'))
The resulting output is as follows (the first three calls return an empty string, so they print blank lines):
>>>



30AB 3099
>>>
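The returned string can be turned back into characters by parsing the hexadecimal fields. The following is a minimal sketch; parse_decomposition is a hypothetical helper, and it assumes that an optional leading field in angle brackets is a compatibility tag such as <fraction>.

```python
import unicodedata

def parse_decomposition(ch):
    """Hypothetical helper: return (tag, decomposed_string) for `ch`."""
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        # Compatibility decompositions start with a tag like '<fraction>'
        tag = fields.pop(0)
    return tag, ''.join(chr(int(field, 16)) for field in fields)

print(unicodedata.decomposition('\u00e9'))  # 0065 0301
print(unicodedata.decomposition('\u00bd'))  # <fraction> 0031 2044 0032
print(parse_decomposition('\u00e9'))        # (None, 'e' + U+0301)
print(parse_decomposition('a'))             # (None, '')
```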
unicodedata.normalize(form, unistr)
Converts the Unicode string unistr into a normal form. The supported values of form are 'NFC', 'NFKC', 'NFD' and 'NFKD'. Some text elements can be represented either in a static, precomposed form or as a dynamically composed sequence. The different representation sequences of a Unicode character are considered equivalent. When two or more sequences are equivalent, the Unicode Standard does not prescribe one particular sequence as the correct one; each sequence is simply equivalent to the others.
If you need a single, unique representation, you can use a normalized form of the Unicode text to remove unwanted distinctions. The Unicode Standard defines four normalization forms: Normalization Form D (NFD), Normalization Form KD (NFKD), Normalization Form C (NFC), and Normalization Form KC (NFKC). Roughly speaking, NFD and NFKD decompose characters where possible, while NFC and NFKC compose characters where possible.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore'))
The resulting output is as follows:
>>>
b'aa'
>>>
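To make the notion of equivalent sequences concrete, the sketch below compares a precomposed character with its decomposed form and shows how the K (compatibility) forms differ from the canonical ones. The variable names are arbitrary choices for this example.

```python
import unicodedata

composed = '\u00e9'     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = 'e\u0301'  # 'e' followed by COMBINING ACUTE ACCENT, two code points

# The two strings look the same on screen but compare unequal
print(composed == decomposed)  # False

# After normalizing both to the same form they compare equal
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
print(unicodedata.normalize('NFD', composed) == decomposed)   # True

# Compatibility forms also replace characters such as ligatures
ligature = '\ufb01'  # LATIN SMALL LIGATURE FI
print(unicodedata.normalize('NFC', ligature) == ligature)  # True, unchanged
print(unicodedata.normalize('NFKC', ligature))             # fi
```

When comparing user-supplied strings, normalizing both sides to the same form first avoids mismatches between visually identical text.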
unicodedata.unidata_version
The version of the Unicode database used by this module.
unicodedata.ucd_3_2_0
Provides object-style access to version 3.2 of the UCD, for compatibility with legacy applications such as IDNA.
Example:
#python 3.4.3
import unicodedata
print(unicodedata.unidata_version)
print(unicodedata.ucd_3_2_0)
The resulting output is as follows:
>>>
6.3.0
<unicodedata.UCD object at 0x029b77e8>
>>>
Here's a closer look at the Unicode data for one character:
U+0062 is the Unicode hex value of the character LATIN SMALL LETTER B, which is categorized as "Lowercase Letter" in the Unicode 6.0 character table.
Unicode Character Information
Unicode hex: U+0062
Character name: LATIN SMALL LETTER B
General category: Lowercase Letter [Code: Ll]
Canonical combining class: 0
Bidirectional category: L
Mirrored: N
Uppercase version: U+0042
Titlecase version: U+0042
Unicode Character Encodings
Latin Small Letter B HTML entity: &#98; (decimal entity), &#x62; (hex entity)
Windows key code: Alt 0098 or Alt +0062
Programming source code encodings: Python hex: u"\u0062", hex for C++ and Java: "\u0062"
UTF-8 hexadecimal encoding: 0x62
Most of the functions above are queries against the data and return the corresponding values.
Cai Junsheng qq:9073204 Shenzhen
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
3.5 unicodedata -- Unicode Database