Python method for calculating the width of a character

Source: Internet
Author: User
This article describes how to calculate the display width of a character in Python. It is shared here for reference.

I recently wrote a small CLI tool in Python that needs to calculate the display width of characters, so that a long string can be cut cleanly into fragments of equal width.

For a Unicode string, Python's len function accurately counts the number of characters it contains, but that count does not represent the display width. For example:

>>> len(u'你好a')
3

The string contains three characters, yet in a terminal each of the two Chinese characters occupies two columns, so len cannot be used directly to calculate the width.

GBK Decode

The first thing I thought of was GBK encoding: characters in the 00–7F range are encoded as one byte and all others as two bytes, which happens to match the display width fairly closely. That suggests the following opportunistic approach (assuming a target width of 8):

>>> a = U ' Hello Hi ' >>> b=a.encode (' GBK ') >>> try:  ... Print B[:8].decode (' GBK ') ... except:  ... Print B[:7].decode (' GBK ') ... hello you

As the code shows, the Unicode string is first encoded as GBK, the first 8 bytes are taken, and the result is decoded as GBK. If decoding fails, the cut fell in the middle of a double-byte character, so one byte fewer is taken and the first 7 bytes are decoded instead.

Although this solved the problem for the moment, its drawbacks are obvious. First, the code is inelegant because it relies on trial and error. Second, GBK can represent only a limited set of characters, so the many characters outside GBK are not supported.

East_asian_width

After searching for a long time, I stumbled upon the East_Asian_Width property in the Unicode Character Database, which has the following possible values:

# East_Asian_Width (ea)
ea ; A  ; Ambiguous
ea ; F  ; Fullwidth
ea ; H  ; Halfwidth
ea ; N  ; Neutral
ea ; Na ; Narrow
ea ; W  ; Wide

Apart from A (ambiguous), the values F, H, N, Na, and W each indicate the width unambiguously. Taking the conservative choice of treating A as width 2, it is easy to compute the width of a single character:

>>> import unicodedata
>>> def chr_width(c):
...     if unicodedata.east_asian_width(c) in ('F', 'W', 'A'):
...         return 2
...     else:
...         return 1
...
>>> chr_width(u'你')
2
>>> chr_width(u'a')
1

This seemed to meet the requirement, but in practice a great many characters turn out to have the A property. The most typical are the Chinese double quotation marks:

>>> chr_width(u'“')
2

In most monospaced fonts, a Chinese double quotation mark is only one column wide. If a line contains several of them, the accumulated error throws the cut noticeably off, so this is clearly not the best approach either.
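To see which characters fall into the ambiguous class, the property can be inspected directly. The following is a small Python 3 sketch. The ambiguous_width parameter is an illustrative addition, not part of the original chr_width, that makes the policy for class A an explicit choice instead of a hard-coded 2.

```python
import unicodedata

# East_Asian_Width class of a few sample characters.
samples = ["a", "你", "\uff21", "\u201c", "\u201d"]  # a, 你, Ａ, “, ”
categories = {ch: unicodedata.east_asian_width(ch) for ch in samples}
for ch, category in categories.items():
    print(repr(ch), category)


def chr_width(ch, ambiguous_width=1):
    """Width of one character; the caller decides how wide class A is."""
    category = unicodedata.east_asian_width(ch)
    if category in ("F", "W"):
        return 2
    if category == "A":
        return ambiguous_width
    return 1
```

With the default of 1 the curly quotation marks count as one column, and passing ambiguous_width=2 reproduces the conservative behavior described above. Neither setting is right for every font or terminal.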

Urwid Solutions

Urwid is a mature Python terminal UI library that provides HTML-like widgets on top of curses for displaying text. If you need to build a terminal interface, it is much more convenient than using the curses library directly. What pleasantly surprised me is how accurately it cuts Unicode text by width, so I opened its source code to see how. The core of its text width calculation is as follows:

widths = [
    (126, 1), (159, 0), (687, 1), (710, 0), (711, 1),
    (727, 0), (733, 1), (879, 0), (1154, 1), (1161, 0),
    (4347, 1), (4447, 2), (7467, 1), (7521, 0), (8369, 1),
    (8426, 0), (9000, 1), (9002, 2), (11021, 1), (12350, 2),
    (12351, 1), (12438, 2), (12442, 0), (19893, 2), (19967, 1),
    (55203, 2), (63743, 1), (64106, 2), (65039, 1), (65059, 0),
    (65131, 2), (65279, 1), (65376, 2), (65500, 1), (65510, 2),
    (120831, 1), (262141, 2), (1114109, 1),
]

def get_width(o):
    """Return the screen column width for unicode ordinal o."""
    global widths
    if o == 0xe or o == 0xf:
        return 0
    for num, wid in widths:
        if o <= num:
            return wid
    return 1

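As the code shows, the character widths are first organized into a range table based on the official Unicode EastAsianWidth document, and the width is then looked up in that table by Unicode code point. Each entry gives the upper bound of a range and the width of the code points up to that bound. Testing with the earlier examples: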

>>> get_width(ord(u'a'))
1
>>> get_width(ord(u'你'))
2
>>> get_width(ord(u'“'))
1

All three results are correct, and performance in real use is also fairly good, so this is an ideal solution. For more details, see the old_str_util.py source file in Urwid.


I hope this article is helpful to you in your Python programming.
