3.5 unicodedata -- Unicode Database

Last Update:2015-10-24 Source: Internet

Author: User

This library primarily provides access to the UCD. UCD is the abbreviation of Unicode Character Database. The UCD consists of a number of plain-text files that describe Unicode character properties and the relationships between characters, plus some HTML files. Most of the UCD text files hold Unicode data in a form suitable for parsing by programs. One of the HTML files explains the organization of the database and the format and meaning of the data.

The largest file in the UCD is undoubtedly Unihan.txt, which describes the properties of Chinese characters. In UCD 5.0.0, Unihan.txt is 28,221 KB in size. Unihan.txt contains many indexes of reference value, such as radicals, stroke counts, pinyin, usage frequency, and four-corner codes. These indexes are based on fairly authoritative dictionaries, but most of them cover only part of the Chinese characters.

unicodedata.lookup(name)

Looks up a character by name. If a character with the given name exists, the corresponding character is returned; otherwise KeyError is raised.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.lookup('LEFT CURLY BRACKET'))

The resulting output is as follows:

>>>
{
>>>

unicodedata.name(chr[, default])

Returns the name assigned to the character. If no name is defined, default is returned, or, if default is not given, ValueError is raised.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.name('{'))

The resulting output is as follows:

>>>
LEFT CURLY BRACKET
>>>
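
lookup() and name() are inverses of each other, and both can fail. The following is a minimal sketch for Python 3 that shows the KeyError path and the default argument. The helper names safe_lookup and safe_name are hypothetical, chosen for illustration; they are not part of unicodedata.

```python
import unicodedata

def safe_lookup(name):
    """Return the character called `name`, or None if no such name exists."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None

def safe_name(ch):
    """Return the name of `ch`, or a placeholder for characters without one."""
    return unicodedata.name(ch, '<unnamed>')

brace = safe_lookup('LEFT CURLY BRACKET')
print(brace, safe_name(brace))
print(safe_lookup('NO SUCH CHARACTER NAME'))
print(safe_name('\n'))  # control characters have no name, so the default is used
```

Passing a default to name() avoids a try/except block when unnamed characters such as control codes are expected in the input.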

unicodedata.decimal(chr[, default])

Returns the decimal value assigned to the character as an integer. If no such value is defined, default is returned, or, if default is not given, ValueError is raised.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.decimal('7'))

The resulting output is as follows:

>>>
7
>>>

unicodedata.digit(chr[, default])

Returns the digit value assigned to the character as an integer; for example, the characters 0 to 9 map to the corresponding integers. If no such value is defined, default is returned, or, if default is not given, ValueError is raised.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.digit('9', None))

The resulting output is as follows:

>>>
9
>>>

unicodedata.numeric(chr[, default])

Returns the numeric value assigned to the character as a float. For example, both '8' and '四' can be converted to a value. Unlike digit(), it accepts any character that represents a numeric value, not just the characters 0 to 9. If no such value is defined, default is returned, or, if default is not given, ValueError is raised.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.numeric('四', None))
print(unicodedata.numeric('8', None))

The resulting output is as follows:

>>>
4.0
8.0
>>>
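
The difference between decimal(), digit(), and numeric() is easiest to see side by side. The sketch below is an illustration for Python 3; numeric_views is a hypothetical helper name, and the sample characters (superscript two, one half) are chosen here and do not come from the original article.

```python
import unicodedata

def numeric_views(ch):
    """Return (decimal, digit, numeric) for `ch`, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

# '7' has all three values, SUPERSCRIPT TWO has no decimal value,
# VULGAR FRACTION ONE HALF has only a numeric value, and 'x' has none.
for ch in ['7', '\u00b2', '\u00bd', 'x']:
    print(repr(ch), numeric_views(ch))
```

Passing None as the default keeps the loop free of exception handling; a character without the property simply shows up as None.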

unicodedata.category(chr)

Returns the general category assigned to the character in Unicode as a string. The categories are as follows:

Code Description
[Cc] Other, Control
[Cf] Other, Format
[Cn] Other, Not Assigned (no characters in the file have this property)
[Co] Other, Private Use
[Cs] Other, Surrogate
[LC] Letter, Cased
[Ll] Letter, Lowercase
[Lm] Letter, Modifier
[Lo] Letter, Other
[Lt] Letter, Titlecase
[Lu] Letter, Uppercase
[Mc] Mark, Spacing Combining
[Me] Mark, Enclosing
[Mn] Mark, Nonspacing
[Nd] Number, Decimal Digit
[Nl] Number, Letter
[No] Number, Other
[Pc] Punctuation, Connector
[Pd] Punctuation, Dash
[Pe] Punctuation, Close
[Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)
[Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
[Po] Punctuation, Other
[Ps] Punctuation, Open
[Sc] Symbol, Currency
[Sk] Symbol, Modifier
[Sm] Symbol, Math
[So] Symbol, Other
[Zl] Separator, Line
[Zp] Separator, Paragraph
[Zs] Separator, Space

Example:

#python 3.4.3
import unicodedata
print(unicodedata.category('四'))
print(unicodedata.category('8'))
print(unicodedata.category('a'))

The resulting output is as follows:

>>>
Lo
Nd
Ll
>>>
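
Because the first letter of each code names the major class (L for letters, P for punctuation, and so on), category() is handy for filtering text. The following is a small sketch for Python 3; category_counts and strip_punctuation are hypothetical helper names used only for illustration.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count the characters of `text` by two-letter general category."""
    return Counter(unicodedata.category(ch) for ch in text)

def strip_punctuation(text):
    """Drop every character whose category starts with 'P' (punctuation)."""
    return ''.join(
        ch for ch in text if not unicodedata.category(ch).startswith('P')
    )

sample = 'Hello, World! 42'
print(category_counts(sample))
print(strip_punctuation(sample))
```

Testing the first letter rather than listing Pc, Pd, Pe, and the rest keeps the filter short and covers every punctuation subclass.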

unicodedata.bidirectional(chr)

Returns the bidirectional class assigned to the character as a string. This class determines whether the character is laid out from left to right or from right to left. If no such value is defined, an empty string is returned.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.bidirectional('9'))
print(unicodedata.bidirectional('\u0660'))
print(unicodedata.bidirectional('中'))
print(unicodedata.bidirectional('a'))
print(unicodedata.category('\u0660'))

The resulting output is as follows:

>>>
EN
AN
L
L
Nd
>>>

Here EN stands for European Number, AN for Arabic Number, and L for Left-to-Right; Nd is the general category Number, Decimal Digit.
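
A common use of the bidirectional class is to detect whether a string contains right-to-left text. The sketch below is an illustration for Python 3; has_rtl is a hypothetical helper name, and the class names 'R' (Right-to-Left) and 'AL' (Right-to-Left Arabic) come from the Unicode bidirectional algorithm rather than from the example above.

```python
import unicodedata

def has_rtl(text):
    """Return True if any character has a right-to-left class ('R' or 'AL')."""
    return any(unicodedata.bidirectional(ch) in ('R', 'AL') for ch in text)

print(has_rtl('abc'))
print(has_rtl('abc \u05d0'))  # U+05D0 HEBREW LETTER ALEF has class 'R'
print(has_rtl('\u0627'))      # U+0627 ARABIC LETTER ALEF has class 'AL'
```

This only inspects per-character classes; it does not run the full bidirectional algorithm.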

unicodedata.combining(chr)

Returns the canonical combining class assigned to the character as an integer; if no combining class is defined, 0 is returned. During normalization, runs of combining marks are sorted by this value, with smaller values placed before larger ones.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.combining('9'))
print(unicodedata.combining('A'))

The resulting output is as follows:

>>>
0
0
>>>
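
Ordinary letters and digits always have combining class 0; the value becomes interesting for combining marks such as accents. The following is a minimal sketch for Python 3 that uses combining() together with normalize() to strip accents. strip_marks is a hypothetical helper name, and treating "non-zero combining class" as "accent" is a simplifying assumption, since a few marks have class 0.

```python
import unicodedata

def strip_marks(text):
    """Decompose `text` (NFD) and drop characters with a non-zero combining class."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

print(unicodedata.combining('\u0301'))  # U+0301 COMBINING ACUTE ACCENT
print(strip_marks('caf\u00e9'))
```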

unicodedata.east_asian_width(chr)

Returns the East Asian width assigned to the character as a string. The possible values are as follows:

'F' (Fullwidth), 'H' (Halfwidth), 'W' (Wide), 'Na' (Narrow), 'A' (Ambiguous), or 'N' (Neutral).

Example:

#python 3.4.3
import unicodedata
print(unicodedata.east_asian_width('9'))
print(unicodedata.east_asian_width('A'))
print(unicodedata.east_asian_width('蔡'))

The resulting output is as follows:

>>>
Na
Na
W
>>>
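
The width property is typically used to estimate how many terminal columns a string occupies. The sketch below is a rough illustration for Python 3; display_width is a hypothetical helper name, and counting 'W' and 'F' as two columns and everything else as one is an assumption that ignores combining marks and the ambiguous class.

```python
import unicodedata

def display_width(text):
    """Estimate terminal columns: 2 for 'W' and 'F' characters, 1 otherwise."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        for ch in text
    )

print(display_width('abc'))
print(display_width('\u4e2d\u6587'))  # two wide CJK ideographs
print(display_width('\uff21'))        # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
```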

unicodedata.mirrored(chr)

Returns the mirrored property assigned to the character as an integer: 1 if the character is identified as a mirrored character in bidirectional text, 0 otherwise.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.mirrored('9'))
print(unicodedata.mirrored('A'))
print(unicodedata.mirrored('蔡'))

The resulting output is as follows:

>>>
0
0
0
>>>
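
Digits, letters, and ideographs are never mirrored, so it helps to try a few characters that are, such as brackets and comparison signs. This is a small sketch for Python 3; mirrored_chars is a hypothetical helper name and the sample string is chosen here for illustration.

```python
import unicodedata

def mirrored_chars(text):
    """Return the characters of `text` whose mirrored property is 1."""
    return [ch for ch in text if unicodedata.mirrored(ch) == 1]

print(unicodedata.mirrored('('), unicodedata.mirrored('A'))
print(mirrored_chars('f(x) = [a < b]'))
```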

unicodedata.decomposition(chr)

Returns the decomposition mapping assigned to the character as a string of hexadecimal code points, or an empty string if the character has no decomposition.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.decomposition('9'))
print(unicodedata.decomposition('-'))
print(unicodedata.decomposition('蔡'))
print(unicodedata.decomposition('ガ'))

The resulting output is as follows:

>>>



30AB 3099
>>>
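
The returned string is a space-separated list of hexadecimal code points, optionally preceded by a tag in angle brackets for compatibility decompositions. The following sketch for Python 3 turns it back into characters; decompose is a hypothetical helper name, not part of unicodedata.

```python
import unicodedata

def decompose(ch):
    """Return (tag, parts): the optional <tag> and the decomposed characters."""
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    return tag, [chr(int(code, 16)) for code in fields]

print(decompose('\u30ac'))  # U+30AC KATAKANA LETTER GA
print(decompose('\u00bd'))  # U+00BD VULGAR FRACTION ONE HALF
print(decompose('9'))       # no decomposition
```

Only one level of decomposition is returned; normalize() applies the mappings recursively.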

unicodedata.normalize(form, unistr)

Converts the Unicode string unistr to a normal form. The supported values of form are 'NFC', 'NFKC', 'NFD', and 'NFKD'. Some text elements can be represented either in a static, precomposed form or as a dynamic combining sequence. These different representations of the same character are considered equivalent. When two or more sequences are considered equivalent, the Unicode Standard does not designate any particular one as the correct one; each sequence is merely equivalent to the others.

If you need a single, unique representation, you can use a normalized form of the Unicode text so that equivalent sequences no longer need to be told apart. The Unicode Standard defines four normalization forms: Normalization Form D (NFD), Normalization Form KD (NFKD), Normalization Form C (NFC), and Normalization Form KC (NFKC). Roughly speaking, NFD and NFKD decompose characters where possible, while NFC and NFKC compose characters where possible.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.normalize('NFKD', 'aあä').encode('ascii', 'ignore'))

The resulting output is as follows:

>>>
b'aa'
>>>
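
Equivalent sequences compare unequal as plain strings, which is the practical reason to normalize before comparing. The sketch below is an illustration for Python 3; same_text is a hypothetical helper name and the sample strings are chosen here.

```python
import unicodedata

composed = '\u00e9'     # e with acute accent as a single code point
decomposed = 'e\u0301'  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(composed == decomposed)                      # raw comparison
print(same_text(composed, decomposed))             # comparison after NFC
print(len(unicodedata.normalize('NFD', composed))) # code points after NFD
print(unicodedata.normalize('NFKC', '\ufb01'))     # U+FB01 LATIN SMALL LIGATURE FI
```

The K forms additionally replace compatibility characters such as ligatures, so they are suited to searching and matching rather than to preserving the original text.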

unicodedata.unidata_version

The version of the Unicode database used by this module.

unicodedata.ucd_3_2_0

An object that provides the same methods as the module but uses version 3.2 of the Unicode database, for compatibility with applications that require that specific version, such as IDNA.

Example:

#python 3.4.3
import unicodedata
print(unicodedata.unidata_version)
print(unicodedata.ucd_3_2_0)

The resulting output is as follows:

>>>
6.3.0
<unicodedata.UCD object at 0x029b77e8>
>>>
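
The ucd_3_2_0 object can be queried just like the module itself. The following is a small sketch for Python 3 that contrasts the two databases; the sample characters are chosen here for illustration, and the version printed for the module depends on the Python release in use.

```python
import unicodedata

old = unicodedata.ucd_3_2_0
print(unicodedata.unidata_version)  # depends on the Python release
print(old.unidata_version)

# U+20AC EURO SIGN predates Unicode 3.2, so the old database knows it.
print(old.name('\u20ac'))
# U+1F600 was added long after 3.2, so the old database falls back to the default.
print(old.name('\U0001F600', 'unassigned in 3.2'))
```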

Here's a closer look at the Unicode data for one character:

U+0062 is the Unicode hex value of the character LATIN SMALL LETTER B, which is categorized as "lowercase letter" in the Unicode 6.0 character table.

Unicode Character Information

Unicode hex: U+0062
Character name: LATIN SMALL LETTER B
General category: Lowercase Letter [Code: Ll]
Canonical combining class: 0
Bidirectional category: L
Mirrored: N
Uppercase version: U+0042
Titlecase version: U+0042

Unicode Character Encodings

Latin Small Letter B HTML entity: &#98; (decimal entity), &#x0062; (hex entity)
Windows key code: Alt 0098 or Alt +0062
Programming source code encodings: Python hex: u"\u0062"; hex for C++ and Java: "\u0062"
UTF-8 hexadecimal encoding: 0x62
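
Most of the fields listed for U+0062 can be reproduced with the functions described in this section. The sketch below is an illustration for Python 3; the dictionary keys are names chosen here, and the uppercase mapping comes from str.upper() rather than from unicodedata, which has no case-mapping function.

```python
import unicodedata

ch = '\u0062'
info = {
    'hex': 'U+%04X' % ord(ch),
    'name': unicodedata.name(ch),
    'category': unicodedata.category(ch),
    'combining': unicodedata.combining(ch),
    'bidirectional': unicodedata.bidirectional(ch),
    'mirrored': unicodedata.mirrored(ch),
    'uppercase': 'U+%04X' % ord(ch.upper()),
    # Bytes of the UTF-8 encoding, written as hexadecimal pairs.
    'utf8': ' '.join('%02X' % byte for byte in ch.encode('utf-8')),
}
for key, value in info.items():
    print(key, value)
```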

Most of the functions above are queries against the data and return the corresponding values.

Cai Junsheng qq:9073204 Shenzhen
