Full-width and half-width character conversion in Python

Source: Internet
Author: User

Preface

Anyone who processes text has probably run into inconsistent use of full-width and half-width characters, so a program needs to be able to convert quickly between the two. Because full-width and half-width characters follow a regular mapping, the conversion is not complicated.

The specific rules are as follows:

Full-width characters have Unicode code points 65281 to 65374 (hexadecimal 0xFF01 to 0xFF5E).

Half-width characters have Unicode code points 33 to 126 (hexadecimal 0x21 to 0x7E).

The space is a special case: the full-width space is 12288 (0x3000) and the half-width space is 32 (0x20).

Apart from the space, full-width and half-width characters appear in the same order in Unicode, so half-width + 65248 (0xFEE0) = full-width.

Therefore, non-space characters can be converted by adding or subtracting this offset, and the space is handled separately.
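The arithmetic behind these rules can be checked directly. The following is a minimal sketch written for Python 3 (an assumption; the snippets in this article target Python 2). The constant names are illustrative and do not come from the original.

```python
# Range boundaries taken from the rules above.
HALF_START, HALF_END = 0x21, 0x7E
FULL_START, FULL_END = 0xFF01, 0xFF5E

# The same offset separates both ends of the two ranges.
OFFSET = FULL_START - HALF_START

print(OFFSET, hex(OFFSET))             # 65248 0xfee0
print(FULL_END - HALF_END == OFFSET)   # True

# Shifting a half-width letter by the offset yields its full-width form.
wide_a = chr(ord("A") + OFFSET)
print(hex(ord(wide_a)))                # 0xff21
```

Because both ranges have the same length and the same ordering, a single constant is enough for every character except the space.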

Some functions used

chr() takes an integer in range(256), that is 0 to 255, and returns the corresponding character.

unichr() works the same way but returns a Unicode character. Both descriptions apply to Python 2, which the code in this article targets; in Python 3, chr() accepts any Unicode code point and unichr() no longer exists.

ord() is the counterpart of chr() and unichr(). It takes a character (a string of length 1) as its argument and returns the corresponding ASCII or Unicode value.
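As a small illustration of how these functions pair up, here is a sketch that runs on either Python 2 or Python 3. The try/except fallback is an assumption added for portability and is not part of the original article.

```python
# Use unichr() where it exists (Python 2); otherwise fall back to chr(),
# which already returns Unicode characters on Python 3.
try:
    unichr
except NameError:
    unichr = chr

code = ord(u"A")              # character -> integer code point
wide = unichr(code + 65248)   # integer code point -> character (full-width A)
back = ord(wide) - 65248      # and back to the half-width code point

print(back == code)
```

With this fallback in place, the calls to unichr() used elsewhere in this article would also resolve on Python 3, although the print statements and the decode() calls would still need adjusting.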

First, print the mapping:

    for i in xrange(33, 127):
        print i, chr(i), i + 65248, unichr(i + 65248)

Returned results

    33 ! 65281 !
    34 " 65282 "
    35 # 65283 #
    36 $ 65284 $
    37 % 65285 %
    38 & 65286 &
    39 ' 65287 '
    40 ( 65288 (
    41 ) 65289 )
    42 * 65290 *
    43 + 65291 +
    44 , 65292 ,
    45 - 65293 -
    46 . 65294 .
    47 / 65295 /
    48 0 65296 0
    49 1 65297 1
    50 2 65298 2
    51 3 65299 3
    52 4 65300 4
    53 5 65301 5
    54 6 65302 6
    55 7 65303 7
    56 8 65304 8
    57 9 65305 9
    58 : 65306 :
    59 ; 65307 ;
    60 < 65308 <
    61 = 65309 =
    62 > 65310 >
    63 ? 65311 ?
    64 @ 65312 @
    65 A 65313 A
    66 B 65314 B
    67 C 65315 C
    68 D 65316 D
    69 E 65317 E
    70 F 65318 F
    71 G 65319 G
    72 H 65320 H
    73 I 65321 I
    74 J 65322 J
    75 K 65323 K
    76 L 65324 L
    77 M 65325 M
    78 N 65326 N
    79 O 65327 O
    80 P 65328 P
    81 Q 65329 Q
    82 R 65330 R
    83 S 65331 S
    84 T 65332 T
    85 U 65333 U
    86 V 65334 V
    87 W 65335 W
    88 X 65336 X
    89 Y 65337 Y
    90 Z 65338 Z
    91 [ 65339 [
    92 \ 65340 \
    93 ] 65341 ]
    94 ^ 65342 ^
    95 _ 65343 _
    96 ` 65344 `
    97 a 65345 a
    98 b 65346 b
    99 c 65347 c
    100 d 65348 d
    101 e 65349 e
    102 f 65350 f
    103 g 65351 g
    104 h 65352 h
    105 i 65353 i
    106 j 65354 j
    107 k 65355 k
    108 l 65356 l
    109 m 65357 m
    110 n 65358 n
    111 o 65359 o
    112 p 65360 p
    113 q 65361 q
    114 r 65362 r
    115 s 65363 s
    116 t 65364 t
    117 u 65365 u
    118 v 65366 v
    119 w 65367 w
    120 x 65368 x
    121 y 65369 y
    122 z 65370 z
    123 { 65371 {
    124 | 65372 |
    125 } 65373 }
    126 ~ 65374 ~

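Convert full-width to half-width:

    def full2half(s):
        """Convert full-width characters in a UTF-8 byte string to half-width."""
        n = []
        s = s.decode('utf-8')
        for char in s:
            num = ord(char)
            if num == 0x3000:                # full-width space
                num = 32
            elif 0xFF01 <= num <= 0xFF5E:    # other full-width characters
                num -= 0xfee0
            num = unichr(num)
            n.append(num)
        return ''.join(n)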

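Convert half-width to full-width:

    def half2full(s):
        """Convert half-width characters in a UTF-8 byte string to full-width."""
        n = []
        s = s.decode('utf-8')
        for char in s:
            num = ord(char)
            if num == 0x20:                  # half-width space
                num = 0x3000
            elif 0x21 <= num <= 0x7E:        # other printable ASCII characters
                num += 0xfee0
            num = unichr(num)
            n.append(num)
        return ''.join(n)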

The implementation above is simple, but in practice you may not want to convert every character. For example, in a Chinese document we may want all letters and digits converted to half-width while common punctuation marks are kept uniformly full-width. The blanket conversion above does not suit that case.

The solution is to define custom mapping tables.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    # Each table is a tuple of (full-width, half-width) pairs.
    FH_SPACE = FHS = ((u" ", u" "),)
    FH_NUM = FHN = (
        (u"0", u"0"), (u"1", u"1"), (u"2", u"2"), (u"3", u"3"), (u"4", u"4"),
        (u"5", u"5"), (u"6", u"6"), (u"7", u"7"), (u"8", u"8"), (u"9", u"9"),
    )
    FH_ALPHA = FHA = (
        (u"a", u"a"), (u"b", u"b"), (u"c", u"c"), (u"d", u"d"), (u"e", u"e"),
        (u"f", u"f"), (u"g", u"g"), (u"h", u"h"), (u"i", u"i"), (u"j", u"j"),
        (u"k", u"k"), (u"l", u"l"), (u"m", u"m"), (u"n", u"n"), (u"o", u"o"),
        (u"p", u"p"), (u"q", u"q"), (u"r", u"r"), (u"s", u"s"), (u"t", u"t"),
        (u"u", u"u"), (u"v", u"v"), (u"w", u"w"), (u"x", u"x"), (u"y", u"y"),
        (u"z", u"z"),
        (u"A", u"A"), (u"B", u"B"), (u"C", u"C"), (u"D", u"D"), (u"E", u"E"),
        (u"F", u"F"), (u"G", u"G"), (u"H", u"H"), (u"I", u"I"), (u"J", u"J"),
        (u"K", u"K"), (u"L", u"L"), (u"M", u"M"), (u"N", u"N"), (u"O", u"O"),
        (u"P", u"P"), (u"Q", u"Q"), (u"R", u"R"), (u"S", u"S"), (u"T", u"T"),
        (u"U", u"U"), (u"V", u"V"), (u"W", u"W"), (u"X", u"X"), (u"Y", u"Y"),
        (u"Z", u"Z"),
    )
    FH_PUNCTUATION = FHP = (
        (u".", u"."), (u",", u","), (u"!", u"!"), (u"?", u"?"), (u"”", u'"'),
        (u"’", u"'"), (u"‘", u"'"), (u"@", u"@"), (u"_", u"_"), (u":", u":"),
        (u";", u";"), (u"#", u"#"), (u"$", u"$"), (u"%", u"%"), (u"&", u"&"),
        (u"(", u"("), (u")", u")"), (u"‐", u"-"), (u"=", u"="), (u"*", u"*"),
        (u"+", u"+"), (u"-", u"-"), (u"/", u"/"), (u"<", u"<"), (u">", u">"),
        (u"[", u"["), (u"¥", u"\\"), (u"]", u"]"), (u"^", u"^"), (u"{", u"{"),
        (u"|", u"|"), (u"}", u"}"), (u"~", u"~"),
    )
    # Callable maps are evaluated lazily by convert().
    FH_ASCII = HAC = lambda: ((fr, to) for m in (FH_ALPHA, FH_NUM, FH_PUNCTUATION) for fr, to in m)

    # Half-width to full-width tables are the same pairs reversed.
    HF_SPACE = HFS = ((u" ", u" "),)
    HF_NUM = HFN = lambda: ((h, z) for z, h in FH_NUM)
    HF_ALPHA = HFA = lambda: ((h, z) for z, h in FH_ALPHA)
    HF_PUNCTUATION = HFP = lambda: ((h, z) for z, h in FH_PUNCTUATION)
    HF_ASCII = ZAC = lambda: ((h, z) for z, h in FH_ASCII())


    def convert(text, *maps, **ops):
        """Convert between full-width and half-width characters.

        args:
            text: unicode string to convert
            maps: conversion maps (tuples of pairs, dicts, or callables)
            skip: characters to leave unconverted, as a tuple or a string
        return: converted unicode string
        """
        if "skip" in ops:
            skip = ops["skip"]
            if isinstance(skip, basestring):
                skip = tuple(skip)

            def replace(text, fr, to):
                return text if fr in skip else text.replace(fr, to)
        else:
            def replace(text, fr, to):
                return text.replace(fr, to)

        for m in maps:
            if callable(m):
                m = m()
            elif isinstance(m, dict):
                m = m.items()
            for fr, to in m:
                text = replace(text, fr, to)
        return text


    if __name__ == '__main__':
        text = u"Narita Airport-【JR Narita Express,2 stops】-Tokyo-【JR Shinkansen,6 stops】-Shin-Aomori-【JR,4 stops】-"
        print convert(
            text,
            FH_ASCII,
            {u"【": u"[", u"】": u"]", u",": u",", u".": u"。", u"?": u"?", u"!": u"!"},
            skip=u",。?!“”"
        )

Note: In English text, opening and closing quotation marks are not distinguished.
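The selective conversion described above (letters and digits to half-width, common punctuation to full-width) can also be sketched in Python 3 with a single translation table. This is a simplified variant under stated assumptions, not the article's implementation: the function name `normalize_mixed_text` and the choice of punctuation marks in `PUNCT_TO_FULL` are illustrative.

```python
import string

OFFSET = 0xFEE0  # distance between a half-width character and its full-width form

# Full-width letters and digits -> half-width.
ALNUM_TO_HALF = {
    ord(ch) + OFFSET: ch for ch in string.ascii_letters + string.digits
}

# Half-width punctuation -> full-width; this selection of marks is an assumption.
PUNCT_TO_FULL = {ord(ch): chr(ord(ch) + OFFSET) for ch in ",!?:;"}

# The two tables have disjoint keys, so they can be merged safely.
TABLE = {**ALNUM_TO_HALF, **PUNCT_TO_FULL}


def normalize_mixed_text(text):
    """Apply both conversions in one pass over the text."""
    return text.translate(TABLE)


print(normalize_mixed_text("Python3,ok!"))
```

Because `str.translate` looks up each character once, the result does not depend on the order in which individual mappings are applied, which differs from chaining `str.replace` calls.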

Summary

The above describes how to implement full-width and half-width conversion in Python. I hope the content in this article will help you in your study or work. If you have any questions, please leave a message.
