Converting between full-width and half-width characters in Python
Preface
Every programmer who processes text sooner or later runs into inconsistent use of full-width and half-width characters, so a program needs to be able to convert quickly between the two. Because full-width and half-width characters have a fixed mapping relationship, the conversion is not complicated.
The specific rules are as follows:
Full-width characters have Unicode code points 65281 to 65374 (hexadecimal 0xFF01 to 0xFF5E)
Half-width characters have Unicode code points 33 to 126 (hexadecimal 0x21 to 0x7E)
The space is a special case: the full-width space is 12288 (0x3000) and the half-width space is 32 (0x20)
Apart from the space, full-width and half-width characters appear in the same order, so half-width + 65248 (0xFEE0) = full-width
Therefore, non-space characters can be converted by adding or subtracting 65248, and the space is handled separately.
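The rules above can be sketched for a single character as follows. This is a minimal illustration written for Python 3 (where chr() accepts any code point); the names FULL_HALF_OFFSET, to_fullwidth_char and to_halfwidth_char are made up for this example and do not appear in the rest of the article.

```python
FULL_HALF_OFFSET = 0xFEE0  # 65248, the fixed distance between the two ranges


def to_fullwidth_char(ch):
    """Map one half-width character to its full-width counterpart."""
    code = ord(ch)
    if code == 0x20:                  # the space is the one special case
        return chr(0x3000)
    if 0x21 <= code <= 0x7E:          # printable ASCII shifts by a fixed offset
        return chr(code + FULL_HALF_OFFSET)
    return ch                         # anything else is left unchanged


def to_halfwidth_char(ch):
    """Map one full-width character to its half-width counterpart."""
    code = ord(ch)
    if code == 0x3000:
        return chr(0x20)
    if 0xFF01 <= code <= 0xFF5E:
        return chr(code - FULL_HALF_OFFSET)
    return ch


print(hex(ord(to_fullwidth_char("A"))))  # 0xff21
```

Characters outside both ranges, such as Chinese characters, pass through unchanged.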
Some functions used
chr()
Takes an integer in range(256), that is 0 to 255, as its argument and returns the corresponding character.
unichr()
Works like chr(), but returns a Unicode character.
ord()
The inverse of chr() (for 8-bit strings) and of unichr() (for Unicode objects). It takes a single character (a string of length 1) as its argument and returns the corresponding ASCII or Unicode value.
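The functions above are the Python 2 built-ins. As a side note for readers on Python 3, where unichr() no longer exists and chr() covers the whole Unicode range, the same round trip might look like this short sketch (the variable roundtrip_ok is only for illustration):

```python
# Python 3: chr() accepts any Unicode code point, so it takes the place of unichr().
print(ord("A"))        # 65
print(chr(65))         # A
print(ord("\uff21"))   # 65313, the full-width capital A
print(chr(65 + 65248) == "\uff21")  # True

# ord() and chr() are inverses of each other over the half-width range.
roundtrip_ok = all(ord(chr(i)) == i for i in range(33, 127))
print(roundtrip_ok)    # True
```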
First, print the mapping:
for i in xrange(33,127): print i,chr(i),i+65248,unichr(i+65248)
Output:
33 ! 65281 !
34 " 65282 "
35 # 65283 #
36 $ 65284 $
37 % 65285 %
38 & 65286 &
39 ' 65287 '
40 ( 65288 (
41 ) 65289 )
42 * 65290 *
43 + 65291 +
44 , 65292 ,
45 - 65293 -
46 . 65294 .
47 / 65295 /
48 0 65296 0
49 1 65297 1
50 2 65298 2
51 3 65299 3
52 4 65300 4
53 5 65301 5
54 6 65302 6
55 7 65303 7
56 8 65304 8
57 9 65305 9
58 : 65306 :
59 ; 65307 ;
60 < 65308 <
61 = 65309 =
62 > 65310 >
63 ? 65311 ?
64 @ 65312 @
65 A 65313 A
66 B 65314 B
67 C 65315 C
68 D 65316 D
69 E 65317 E
70 F 65318 F
71 G 65319 G
72 H 65320 H
73 I 65321 I
74 J 65322 J
75 K 65323 K
76 L 65324 L
77 M 65325 M
78 N 65326 N
79 O 65327 O
80 P 65328 P
81 Q 65329 Q
82 R 65330 R
83 S 65331 S
84 T 65332 T
85 U 65333 U
86 V 65334 V
87 W 65335 W
88 X 65336 X
89 Y 65337 Y
90 Z 65338 Z
91 [ 65339 [
92 \ 65340 \
93 ] 65341 ]
94 ^ 65342 ^
95 _ 65343 _
96 ` 65344 `
97 a 65345 a
98 b 65346 b
99 c 65347 c
100 d 65348 d
101 e 65349 e
102 f 65350 f
103 g 65351 g
104 h 65352 h
105 i 65353 i
106 j 65354 j
107 k 65355 k
108 l 65356 l
109 m 65357 m
110 n 65358 n
111 o 65359 o
112 p 65360 p
113 q 65361 q
114 r 65362 r
115 s 65363 s
116 t 65364 t
117 u 65365 u
118 v 65366 v
119 w 65367 w
120 x 65368 x
121 y 65369 y
122 z 65370 z
123 { 65371 {
124 | 65372 |
125 } 65373 }
126 ~ 65374 ~
Convert the fullwidth to halfwidth:
def full2half(s):
    n = []
    s = s.decode('utf-8')              # UTF-8 byte string -> unicode
    for char in s:
        num = ord(char)
        if num == 0x3000:              # full-width space
            num = 32
        elif 0xFF01 <= num <= 0xFF5E:  # other full-width characters
            num -= 0xfee0
        num = unichr(num)
        n.append(num)
    return ''.join(n)
Convert the halfwidth to fullwidth:
def half2full(s):
    n = []
    s = s.decode('utf-8')              # UTF-8 byte string -> unicode
    for char in s:
        num = ord(char)
        if num == 0x20:                # half-width space
            num = 0x3000
        elif 0x21 <= num <= 0x7E:      # other half-width characters
            num += 0xfee0
        num = unichr(num)
        n.append(num)
    return ''.join(n)
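For comparison, here is one possible Python 3 sketch of the same two conversions. It swaps the character-by-character loop for str.translate() with precomputed tables, which is a different technique from the functions above. It takes str rather than UTF-8 bytes, so no decode step is needed. The names FULL_TO_HALF, HALF_TO_FULL, full_to_half and half_to_full are chosen for this example only.

```python
# Build the tables once; str.translate() maps code points through a dict.
FULL_TO_HALF = {full: full - 0xFEE0 for full in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20  # the full-width space is handled separately

# The reverse table is the same mapping with keys and values swapped.
HALF_TO_FULL = {half: full for full, half in FULL_TO_HALF.items()}


def full_to_half(text):
    """Convert every full-width character in text to half-width."""
    return text.translate(FULL_TO_HALF)


def half_to_full(text):
    """Convert every half-width character in text to full-width."""
    return text.translate(HALF_TO_FULL)


# Full-width "Hello 123", written with escapes so the code points are visible.
sample = "\uff28\uff45\uff4c\uff4c\uff4f\u3000\uff11\uff12\uff13"
print(full_to_half(sample))  # Hello 123
```

Because the tables are built once, each call is a single pass over the string, and characters that are not in the table are left as they are.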
The implementation above is simple, but in practice you may not want to convert every character. In a Chinese document, for example, we usually want all letters and digits in half-width form, while common punctuation marks consistently use full-width form. The blanket conversion above does not fit that case.
The solution is to use custom mapping tables.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Each FH_* table holds (full-width, half-width) pairs.
FH_SPACE = FHS = ((u"\u3000", u" "),)
FH_NUM = FHN = (
    (u"0", u"0"), (u"1", u"1"), (u"2", u"2"), (u"3", u"3"), (u"4", u"4"),
    (u"5", u"5"), (u"6", u"6"), (u"7", u"7"), (u"8", u"8"), (u"9", u"9"),
)
FH_ALPHA = FHA = (
    (u"a", u"a"), (u"b", u"b"), (u"c", u"c"), (u"d", u"d"), (u"e", u"e"),
    (u"f", u"f"), (u"g", u"g"), (u"h", u"h"), (u"i", u"i"), (u"j", u"j"),
    (u"k", u"k"), (u"l", u"l"), (u"m", u"m"), (u"n", u"n"), (u"o", u"o"),
    (u"p", u"p"), (u"q", u"q"), (u"r", u"r"), (u"s", u"s"), (u"t", u"t"),
    (u"u", u"u"), (u"v", u"v"), (u"w", u"w"), (u"x", u"x"), (u"y", u"y"),
    (u"z", u"z"),
    (u"A", u"A"), (u"B", u"B"), (u"C", u"C"), (u"D", u"D"), (u"E", u"E"),
    (u"F", u"F"), (u"G", u"G"), (u"H", u"H"), (u"I", u"I"), (u"J", u"J"),
    (u"K", u"K"), (u"L", u"L"), (u"M", u"M"), (u"N", u"N"), (u"O", u"O"),
    (u"P", u"P"), (u"Q", u"Q"), (u"R", u"R"), (u"S", u"S"), (u"T", u"T"),
    (u"U", u"U"), (u"V", u"V"), (u"W", u"W"), (u"X", u"X"), (u"Y", u"Y"),
    (u"Z", u"Z"),
)
FH_PUNCTUATION = FHP = (
    (u".", u"."), (u",", u","), (u"!", u"!"), (u"?", u"?"), (u"”", u'"'),
    (u"’", u"'"), (u"‘", u"'"), (u"@", u"@"), (u"_", u"_"), (u":", u":"),
    (u";", u";"), (u"#", u"#"), (u"$", u"$"), (u"%", u"%"), (u"&", u"&"),
    (u"(", u"("), (u")", u")"),
    (u"\u2010", u"-"),  # U+2010 HYPHEN is assumed; the original dash was lost in extraction
    (u"=", u"="), (u"*", u"*"), (u"+", u"+"), (u"-", u"-"), (u"/", u"/"),
    (u"<", u"<"), (u">", u">"), (u"[", u"["), (u"¥", u"\\"), (u"]", u"]"),
    (u"^", u"^"), (u"{", u"{"), (u"|", u"|"), (u"}", u"}"), (u"~", u"~"),
)
FH_ASCII = HAC = lambda: ((fr, to) for m in (FH_ALPHA, FH_NUM, FH_PUNCTUATION) for fr, to in m)

# Each HF_* table holds (half-width, full-width) pairs.
HF_SPACE = HFS = ((u" ", u"\u3000"),)
HF_NUM = HFN = lambda: ((h, z) for z, h in FH_NUM)
HF_ALPHA = HFA = lambda: ((h, z) for z, h in FH_ALPHA)
HF_PUNCTUATION = HFP = lambda: ((h, z) for z, h in FH_PUNCTUATION)
HF_ASCII = ZAC = lambda: ((h, z) for z, h in FH_ASCII())


def convert(text, *maps, **ops):
    """Full-width/half-width conversion.

    Args:
        text: unicode string to convert
        maps: conversion maps (tuples of pairs, dicts, or callables returning pairs)
        skip: characters to leave unconverted, as a tuple or a string
    Returns:
        the converted unicode string
    """
    if "skip" in ops:
        skip = ops["skip"]
        if isinstance(skip, basestring):
            skip = tuple(skip)

        def replace(text, fr, to):
            return text if fr in skip else text.replace(fr, to)
    else:
        def replace(text, fr, to):
            return text.replace(fr, to)

    for m in maps:
        if callable(m):
            m = m()
        elif isinstance(m, dict):
            m = m.items()
        for fr, to in m:
            text = replace(text, fr, to)
    return text


if __name__ == '__main__':
    text = u"Narita Airport-【JR,2 stops】-Tokyo-【JR,6 stops】-Shin-Aomori-【JR,4 stops】"
    print convert(text, FH_ASCII,
                  {u"【": u"[", u"】": u"]", u",": u",", u".": u"。", u"?": u"?", u"!": u"!"},
                  skip=u",。?!“”")
Note: In English text, opening and closing quotation marks are not distinguished, so the different full-width quotation marks map to the same half-width character.
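The selective conversion described earlier (letters and digits to half-width, common punctuation to full-width) can also be sketched in Python 3 with a single translation table. This is only an illustration under stated assumptions: the set of punctuation marks in PUNCT_TO_FULL is a hypothetical choice, and the names ALNUM_TO_HALF, PUNCT_TO_FULL, NORMALIZE and normalize_mixed_text are not part of the script above.

```python
import string

OFFSET = 0xFEE0  # distance between half-width and full-width code points

# Letters and digits: full-width -> half-width.
ALNUM_TO_HALF = {
    ord(c) + OFFSET: ord(c) for c in string.ascii_letters + string.digits
}

# A chosen set of punctuation marks: half-width -> full-width.
# Which marks belong here is an assumption; adjust it to the document at hand.
PUNCT_TO_FULL = {ord(c): ord(c) + OFFSET for c in ",?!:;"}

# The two key sets do not overlap, so they can share one table.
NORMALIZE = {**ALNUM_TO_HALF, **PUNCT_TO_FULL}


def normalize_mixed_text(text):
    """Half-width letters and digits, full-width common punctuation."""
    return text.translate(NORMALIZE)


# Full-width "JR" and "2" become half-width; the ASCII "," and "!" become full-width.
print(normalize_mixed_text("\uff2a\uff32\u4e1c\u65e5\u672c,\uff12\u7ad9!"))
```

Because str.translate() makes a single pass over the text, a character converted by one rule is never converted back by another, so this version needs no skip argument. The Chinese full stop (U+3002) lies outside the offset range and would need its own table entry.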
Summary
This article described how to convert between full-width and half-width characters in Python. I hope it helps you in your study or work. If you have any questions, please leave a message.