3.5 unicodedata -- Unicode Database

This module provides access to the Unicode Character Database (UCD). The UCD consists of plain-text and HTML files that describe Unicode character properties and the relationships between characters. Most of the UCD text files contain Unicode-related data in a form suitable for processing by programs. One of the HTML files explains the organization of the database and the format and meaning of the data. The largest file in the UCD is undoubtedly Unihan.txt, which describes the properties of Chinese characters. In UCD 5.0.0, Unihan.txt is 28,221 KB in size. Unihan.txt contains many useful indexes, such as Chinese radicals, stroke counts, pinyin, usage frequency, and four-corner codes. These indexes are based on fairly authoritative dictionaries, but most of them cover only a subset of the Chinese characters.

unicodedata.lookup(name)

Finds a character by name. If a character with that name exists, the corresponding character is returned; otherwise KeyError is raised.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.lookup('LEFT CURLY BRACKET'))

The resulting output is as follows:

>>>

{

>>>
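Because lookup() raises KeyError for an unknown name, callers often wrap the call. The following is a minimal sketch of that pattern; safe_lookup is a hypothetical helper written for illustration and is not part of unicodedata.

```python
import unicodedata

def safe_lookup(name, fallback=None):
    """Return the character called `name`, or `fallback` if no such name exists.

    safe_lookup is a hypothetical helper, not a unicodedata function.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() signals an unknown name with KeyError
        return fallback

print(safe_lookup('LEFT CURLY BRACKET'))      # {
print(safe_lookup('NO SUCH CHARACTER NAME'))  # None
```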

unicodedata.name(chr[, default])

Looks up the name of a character. If the character has a name, the name is returned. If it has no name, default is returned when given; otherwise ValueError is raised.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.name('{'))

The resulting output is as follows:

>>>

LEFT CURLY BRACKET

>>>
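Not every character has a name; control characters such as '\n' do not. The sketch below shows how the optional default argument avoids the ValueError. The describe function and the '<no name>' placeholder are assumptions made for this example.

```python
import unicodedata

def describe(ch):
    """Return the Unicode name of ch, or a placeholder if it has none.

    describe and the placeholder text are illustrative choices.
    """
    return unicodedata.name(ch, '<no name>')

for ch in ['{', 'b', '\n']:
    print(repr(ch), describe(ch))
# '{' LEFT CURLY BRACKET
# 'b' LATIN SMALL LETTER B
# '\n' <no name>
```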

unicodedata.decimal(chr[, default])

Returns the decimal value assigned to a decimal digit character, as an integer. If the character has no decimal value, default is returned when given; otherwise ValueError is raised.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.decimal('7'))

The resulting output is as follows:

>>>

7

>>>

unicodedata.digit(chr[, default])

Returns the digit value assigned to a character, as an integer; for example, the characters '0' to '9' map to the corresponding integers. If the character has no digit value, default is returned when given; otherwise ValueError is raised.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.digit('9', None))

The resulting output is as follows:

>>>

9

>>>

unicodedata.numeric(chr[, default])

Returns the numeric value assigned to a character, as a floating-point number. For example, both '8' and '四' can be converted to a value. Unlike digit(), it accepts any character that represents a numeric value, not just the characters 0 to 9. If the character has no numeric value, default is returned when given; otherwise ValueError is raised.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.numeric('四', None))

print(unicodedata.numeric('8', None))

The resulting output is as follows:

>>>

4.0

8.0

>>>
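The three functions decimal(), digit(), and numeric() accept progressively wider sets of characters. The sketch below puts them side by side; numeric_views is a hypothetical helper, and None is used as the default so that undefined values show up instead of raising ValueError.

```python
import unicodedata

def numeric_views(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined.

    numeric_views is an illustrative helper, not part of unicodedata.
    """
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

# '\u00b2' is SUPERSCRIPT TWO, '\u00bd' is VULGAR FRACTION ONE HALF
for ch in ['7', '\u00b2', '\u00bd', '四', 'a']:
    print(repr(ch), numeric_views(ch))
# '7' (7, 7, 7.0)
# '²' (None, 2, 2.0)
# '½' (None, None, 0.5)
# '四' (None, None, 4.0)
# 'a' (None, None, None)
```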

unicodedata.category(chr)

Returns the general category assigned to a character in Unicode. The categories are as follows:

Code Description

[Cc] Other, Control

[Cf] Other, Format

[Cn] Other, Not Assigned (no characters in the file have this property)

[Co] Other, Private Use

[Cs] Other, Surrogate

[LC] Letter, Cased

[Ll] Letter, Lowercase

[Lm] Letter, Modifier

[Lo] Letter, Other

[Lt] Letter, Titlecase

[Lu] Letter, Uppercase

[Mc] Mark, Spacing Combining

[Me] Mark, Enclosing

[Mn] Mark, Nonspacing

[Nd] Number, Decimal Digit

[Nl] Number, Letter

[No] Number, Other

[Pc] Punctuation, Connector

[Pd] Punctuation, Dash

[Pe] Punctuation, Close

[Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)

[Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)

[Po] Punctuation, Other

[Ps] Punctuation, Open

[Sc] Symbol, Currency

[Sk] Symbol, Modifier

[Sm] Symbol, Math

[So] Symbol, Other

[Zl] Separator, Line

[Zp] Separator, Paragraph

[Zs] Separator, Space

Example:

#python 3.4.3

import unicodedata

print(unicodedata.category('四'))

print(unicodedata.category('8'))

print(unicodedata.category('a'))

The resulting output is as follows:

>>>

Lo

Nd

Ll

>>>
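Since the first letter of each code names the major class (L for letters, P for punctuation, and so on), category() is handy for simple text filtering. The following sketch shows two such uses; category_counts and strip_punctuation are hypothetical helper names chosen for this example.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count how many characters of text fall into each general category."""
    return Counter(unicodedata.category(ch) for ch in text)

def strip_punctuation(text):
    """Drop every character whose category starts with 'P' (punctuation)."""
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')
    )

sample = 'Hello, world! 42'
print(category_counts(sample)['Ll'])  # 9
print(strip_punctuation(sample))      # Hello world 42
```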

unicodedata.bidirectional(chr)

Returns the bidirectional class assigned to a character, which determines whether it is laid out from left to right or from right to left. If no such value is defined, an empty string is returned.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.bidirectional('9'))

print(unicodedata.bidirectional('\u0660'))

print(unicodedata.bidirectional('中'))

print(unicodedata.bidirectional('a'))

print(unicodedata.category('\u0660'))

The resulting output is as follows:

>>>

EN

AN

L

L

Nd

>>>

Here EN stands for European Number, AN for Arabic Number, and L for Left-to-Right; Nd is the general category Number, Decimal Digit.
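One practical use of the bidirectional class is a rough check for right-to-left text. The sketch below treats the classes 'R' (Right-to-Left) and 'AL' (Right-to-Left Arabic) as strong right-to-left characters; has_rtl is a hypothetical helper, and this is a simplification rather than an implementation of the full Unicode bidirectional algorithm.

```python
import unicodedata

def has_rtl(text):
    """Return True if text contains a strong right-to-left character.

    has_rtl is an illustrative helper; it only checks classes 'R' and 'AL'.
    """
    return any(unicodedata.bidirectional(ch) in ('R', 'AL') for ch in text)

print(has_rtl('abc 123'))     # False
print(has_rtl('abc \u05d0'))  # True, U+05D0 HEBREW LETTER ALEF is 'R'
print(has_rtl('\u0627'))      # True, U+0627 ARABIC LETTER ALEF is 'AL'
```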

unicodedata.combining(chr)

Returns the canonical combining class assigned to a character, as an integer; 0 is returned if no combining class is defined. During normalization, consecutive combining marks are sorted by this value, with smaller values placed before larger ones.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.combining('9'))

print(unicodedata.combining('A'))

The resulting output is as follows:

>>>

0

0

>>>
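Ordinary characters have combining class 0, while combining marks such as accents have a nonzero class. A common trick that relies on this is removing accents: decompose the text, then drop every character with a nonzero combining class. The sketch below assumes this approach is acceptable for the text at hand; strip_accents is a hypothetical helper name.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD).

    strip_accents is an illustrative helper, not part of unicodedata.
    """
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

print(unicodedata.combining('\u0301'))  # 230, COMBINING ACUTE ACCENT
print(strip_accents('caf\u00e9'))       # cafe
```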

unicodedata.east_asian_width(chr)

Returns the East Asian width assigned to a character, which indicates how wide the character is displayed. The possible values are:

'F' (Fullwidth), 'H' (Halfwidth), 'W' (Wide), 'Na' (Narrow), 'A' (Ambiguous), or 'N' (Neutral).

Example:

#python 3.4.3

import unicodedata

print(unicodedata.east_asian_width('9'))

print(unicodedata.east_asian_width('A'))

print(unicodedata.east_asian_width('蔡'))

The resulting output is as follows:

>>>

Na

Na

W

>>>
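The width property is often used to estimate how many terminal columns a string occupies. The sketch below counts 'W' and 'F' characters as two columns and everything else as one. That rule is an assumption made for this example: it ignores combining marks and the context-dependent 'A' (Ambiguous) class, and display_width is a hypothetical helper.

```python
import unicodedata

def display_width(text):
    """Estimate the column width of text in a monospaced terminal.

    Simplified rule: Wide and Fullwidth characters take two columns,
    all other characters take one.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        for ch in text
    )

print(display_width('abc'))   # 3
print(display_width('中文'))  # 4
print(display_width('a中'))   # 3
```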

unicodedata.mirrored(chr)

Returns the mirrored property assigned to a character: 1 if the character is mirrored in bidirectional text, 0 otherwise.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.mirrored('9'))

print(unicodedata.mirrored('A'))

print(unicodedata.mirrored('蔡'))

The resulting output is as follows:

>>>

0

0

0

>>>
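The example above only shows characters that are not mirrored. Paired punctuation such as parentheses and brackets does have the property, as the following sketch shows; mirrored_chars is a hypothetical helper that collects such characters from a string.

```python
import unicodedata

def mirrored_chars(text):
    """Return the characters of text that have the mirrored property set."""
    return [ch for ch in text if unicodedata.mirrored(ch) == 1]

print(unicodedata.mirrored('('))     # 1
print(mirrored_chars('f(x) < [9]'))  # ['(', ')', '<', '[', ']']
```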

unicodedata.decomposition(chr)

Returns the decomposition mapping assigned to a character, as a string of hexadecimal code points separated by spaces. If the character has no decomposition, an empty string is returned.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.decomposition('9'))

print(unicodedata.decomposition('-'))

print(unicodedata.decomposition('蔡'))

print(unicodedata.decomposition('ガ'))

The resulting output is as follows:

>>>

30AB 3099

>>>
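The returned string is easy to turn back into characters. Compatibility decompositions additionally start with a tag in angle brackets, such as <fraction>. The sketch below parses both cases; parse_decomposition is a hypothetical helper written for illustration.

```python
import unicodedata

def parse_decomposition(ch):
    """Split a decomposition mapping into (tag, characters).

    tag is None for a canonical decomposition or when there is no mapping.
    parse_decomposition is an illustrative helper, not part of unicodedata.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    # The remaining fields are hexadecimal code points
    return tag, ''.join(chr(int(code, 16)) for code in fields)

print(parse_decomposition('\u30ac'))  # (None, 'ガ') as U+30AB followed by U+3099
print(parse_decomposition('\u00bd'))  # ('<fraction>', '1⁄2')
print(parse_decomposition('9'))       # (None, '')
```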

unicodedata.normalize(form, unistr)

Converts the Unicode string unistr to a normal form. The supported forms are NFC, NFKC, NFD, and NFKD. Some text elements can be represented either in a static, precomposed form or as a dynamically composed sequence, and the Unicode standard treats these different representations of the same character as equivalent. When two or more sequences are equivalent, the Unicode standard does not designate any one of them as the correct one; each sequence is simply equivalent to the others.

If a single, unique representation is needed, normalized Unicode text removes the need to distinguish between equivalent sequences. The Unicode standard defines four normalization forms: Normalization Form D (NFD), Normalization Form KD (NFKD), Normalization Form C (NFC), and Normalization Form KC (NFKC). Roughly speaking, NFD and NFKD decompose characters wherever possible, while NFC and NFKC compose characters wherever possible.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.normalize('NFKD', 'aあä').encode('ascii', 'ignore'))

The resulting output is as follows:

>>>

b'aa'

>>>
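To make the notion of equivalent sequences concrete, the sketch below compares a precomposed character with its decomposed spelling. The two strings differ code point by code point, but become equal once both are normalized to the same form. canonically_equal is a hypothetical helper; using NFC for the comparison is a choice made for this example, and NFD would work as well.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

composed = '\u00e9'     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = 'e\u0301'  # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                   # False
print(canonically_equal(composed, decomposed))  # True
print(unicodedata.normalize('NFD', composed) == decomposed)  # True
# The K forms also apply compatibility mappings, such as splitting ligatures
print(unicodedata.normalize('NFKC', '\ufb01'))  # fi
```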

unicodedata.unidata_version

The version of the Unicode database used by this module.

unicodedata.ucd_3_2_0

An object that provides the same functions as the module itself but uses version 3.2 of the Unicode database, for compatibility with applications that require that version, such as IDNA.

Example:

#python 3.4.3

import unicodedata

print(unicodedata.unidata_version)

print(unicodedata.ucd_3_2_0)

The resulting output is as follows:

>>>

6.3.0

<unicodedata.UCD object at 0x029b77e8>

>>>
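The ucd_3_2_0 object can be queried just like the module. The sketch below compares the two databases for a character that was added after Unicode 3.2; the choice of U+20B9 INDIAN RUPEE SIGN (added in Unicode 6.0) is made for this example, and the result for the current database assumes a Python build with Unicode 6.0 or later.

```python
import unicodedata

old = unicodedata.ucd_3_2_0
ch = '\u20b9'  # INDIAN RUPEE SIGN, added in Unicode 6.0

print(old.unidata_version)       # 3.2.0
print(unicodedata.category(ch))  # Sc, a currency symbol in the current database
print(old.category(ch))          # Cn, not assigned in Unicode 3.2
```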

Here is a closer look at the Unicode data for one character:

U+0062 is the Unicode hex value of the character LATIN SMALL LETTER B, which is categorized as "Lowercase Letter" in the Unicode 6.0 character table.

Unicode Character Information

Unicode hex: U+0062

Character name: LATIN SMALL LETTER B

General category: Lowercase Letter [Code: Ll]

Canonical combining class: 0

Bidirectional category: L

Mirrored: N

Uppercase version: U+0042

Titlecase version: U+0042

Unicode Character Encodings

Latin Small Letter B HTML entity: &#98; (decimal entity), &#x62; (hex entity)

Windows key code: Alt 0098 or Alt +0062

Programming source code encodings: Python hex: u"\u0062"; hex for C++ and Java: "\u0062"

UTF-8 hexadecimal encoding: 0x62

Most of the functions above are queries against the data and return the corresponding values.
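Much of the character information listed above for U+0062 can be reproduced with the functions covered in this section. The sketch below gathers the values into a dictionary; describe_char and its key names are hypothetical, and the uppercase mapping comes from str.upper() rather than from unicodedata.

```python
import unicodedata

def describe_char(ch):
    """Collect a few Unicode properties of a single character.

    describe_char and the dictionary keys are illustrative choices.
    """
    return {
        'hex': 'U+%04X' % ord(ch),
        'name': unicodedata.name(ch, ''),
        'category': unicodedata.category(ch),
        'combining': unicodedata.combining(ch),
        'bidirectional': unicodedata.bidirectional(ch),
        'mirrored': unicodedata.mirrored(ch),
        # upper() may yield several characters, so format each one
        'uppercase': ' '.join('U+%04X' % ord(c) for c in ch.upper()),
        'utf8': ' '.join('0x%02X' % b for b in ch.encode('utf-8')),
    }

info = describe_char('b')
for key in ['hex', 'name', 'category', 'combining',
            'bidirectional', 'mirrored', 'uppercase', 'utf8']:
    print(key, info[key])
```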



Cai Junsheng, QQ: 9073204, Shenzhen

Copyright notice: This is an original article by the blog author and may not be reproduced without the author's permission.
