Problem: Ensure that all Unicode strings have the same underlying representation.
Solution: When the same text can have several different representations, first normalize it to a canonical form. This can be done with the unicodedata module:
unicodedata.normalize(form, string), where the first argument names the normalization form to use and the second is the string to normalize.
In Unicode, certain characters can be represented by more than one valid sequence of code points.
NFC: fully composed characters (i.e., a single code point is used where possible);
NFD: fully decomposed characters (i.e., each character is broken into a base character followed by its combining characters).
import unicodedata

s1 = 'Spicy jalape\u00f1o'    # "ñ" as a single precomposed code point (U+00F1)
s2 = 'Spicy jalapen\u0303o'   # Latin "n" followed by a combining "~" (U+0303)

# (a) Print them out (they usually look identical)
print(s1)
print(s2)

# (b) Examine equality and length
print('s1 == s2?', s1 == s2)
print('len(s1) =', len(s1), 'len(s2) =', len(s2))
print('---------------------------')

# (c) Normalize and try the same experiment
n_s1 = unicodedata.normalize('NFC', s1)
n_s2 = unicodedata.normalize('NFC', s2)
print('n_s1 == n_s2?', n_s1 == n_s2)
print('len(n_s1) =', len(n_s1), 'len(n_s2) =', len(n_s2))
print('*****************************')

# (d) Normalize to a decomposed form and strip accents
t1 = unicodedata.normalize('NFD', s1)
t2 = unicodedata.normalize('NFD', s2)
print('t1 == t2?', t1 == t2)
print('len(t1) =', len(t1), 'len(t2) =', len(t2))
print(''.join(c for c in t1 if not unicodedata.combining(c)))
Output:
Spicy jalapeño
Spicy jalapeño
s1 == s2? False
len(s1) = 14 len(s2) = 15
---------------------------
n_s1 == n_s2? True
len(n_s1) = 14 len(n_s2) = 14
*****************************
t1 == t2? True
len(t1) = 15 len(t2) = 15
Spicy jalapeno
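To see why the lengths differ, it can help to list the code points of each form. The following is a minimal sketch; code_points is a hypothetical helper written for this illustration, not part of the unicodedata module.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as 'U+XXXX' strings (hypothetical helper)."""
    return ['U+{:04X}'.format(ord(ch)) for ch in text]

composed = '\u00f1'       # ñ as one precomposed code point
decomposed = 'n\u0303'    # n followed by a combining tilde

print(code_points(composed))      # ['U+00F1']
print(code_points(decomposed))    # ['U+006E', 'U+0303']

# NFC composes and NFD decomposes, whichever form the input started in.
print(code_points(unicodedata.normalize('NFC', decomposed)))  # ['U+00F1']
print(code_points(unicodedata.normalize('NFD', composed)))    # ['U+006E', 'U+0303']
```

The two spellings render the same but are different sequences, which is why s1 and s2 compare unequal until both are normalized to the same form.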
Additional notes:
Normalization also plays an important role in sanitizing and filtering text. Suppose you want to remove all diacritical marks from some text (perhaps for searching or matching):
t1 = unicodedata.normalize('NFD', s1)
print(''.join(c for c in t1 if not unicodedata.combining(c)))
unicodedata.combining() tests a character to determine whether it is a combining character.
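If accent stripping is needed in several places, the same steps can be wrapped in a function. This is only a sketch: strip_accents is a hypothetical name, and recomposing the result to NFC at the end is an assumption about what callers would want, not something the recipe prescribes.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # Decompose first so each accent becomes a separate combining character.
    decomposed = unicodedata.normalize('NFD', text)
    # combining() returns 0 for non-combining characters, so keep only those.
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: recompose whatever is left into NFC for consistency.
    return unicodedata.normalize('NFC', stripped)

print(strip_accents('Spicy jalape\u00f1o'))    # Spicy jalapeno
print(strip_accents('Spicy jalapen\u0303o'))   # Spicy jalapeno
```

Only characters that decompose into a base letter plus combining marks are affected. A letter with no canonical decomposition, such as ø (U+00F8), passes through unchanged.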
The accent-stripping example shows another important feature of the unicodedata module: it can test whether a character belongs to a particular character class.
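As a rough illustration of this kind of character testing, the sketch below prints a few properties for several characters. The describe helper is hypothetical, and unicodedata.name() and unicodedata.category() are standard-library calls that the text above does not itself mention.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, combining class) for one character."""
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.combining(ch))

for ch in ['n', '\u00f1', '\u0303']:
    print('U+{:04X}'.format(ord(ch)), describe(ch))

# U+006E ('LATIN SMALL LETTER N', 'Ll', 0)
# U+00F1 ('LATIN SMALL LETTER N WITH TILDE', 'Ll', 0)
# U+0303 ('COMBINING TILDE', 'Mn', 230)
```

A nonzero combining class, as for U+0303, is what makes unicodedata.combining(c) truthy in the filtering expression.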
Source: "Python Cookbook", "Strings and Text", 9. Normalizing Unicode Text to a Standard Representation