Data cleansing: this post uses Python to clean a CSV file that contains Chinese characters. The idea is to build a dictionary for each column that holds Chinese text, where the key is the Chinese string and the value is an auto-incrementing integer index. Because a dictionary cannot hold duplicate keys, every distinct string maps to exactly one index, and the text in the column can then be replaced by that index.
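As a minimal sketch of this mapping idea on a small in-memory list (the function name encode_values and the sample city names are illustrative assumptions, not part of the original script):

```python
def encode_values(values):
    """Map each distinct string to an integer index, in order of first appearance."""
    mapping = {}   # Chinese string -> integer index
    encoded = []   # the input values with each string replaced by its index
    for value in values:
        # setdefault stores len(mapping) only when the key is new;
        # for a repeated key it returns the index already stored.
        encoded.append(mapping.setdefault(value, len(mapping)))
    return mapping, encoded


# Hypothetical sample data, used only to illustrate the behaviour.
mapping, encoded = encode_values(["北京", "上海", "北京", "广州"])
print(mapping)  # {'北京': 0, '上海': 1, '广州': 2}
print(encoded)  # [0, 1, 0, 2]
```

Using len(mapping) as the next index keeps the indexes contiguous within a column, which is one reasonable reading of "auto-incrementing"; a shared counter would work as well.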
The Python code is as follows (the data is in CSV format):
import csv

# One dictionary per text column (the comment is the spreadsheet column letter).
# Each dictionary maps a Chinese string to the integer index that replaces it.
dict2 = {}   # C
dict4 = {}   # E
dict25 = {}  # Z
dict26 = {}  # AA
dict27 = {}  # AB
dict37 = {}  # AL
dict38 = {}  # AM
dict40 = {}  # AO
dict41 = {}  # AP
dict42 = {}  # AQ
dict45 = {}  # AT
dict49 = {}  # AX
# Column position -> dictionary, so the loop below treats every column the same way.
columns = {2: dict2, 4: dict4, 25: dict25, 26: dict26, 27: dict27, 37: dict37,
           38: dict38, 40: dict40, 41: dict41, 42: dict42, 45: dict45, 49: dict49}
flag = False  # becomes True once the header row has been passed

with open("E:/yuce/test.csv", "w", newline="") as csv_file_write:
    writer = csv.writer(csv_file_write)
    with open("E:/yuce/b.csv", "r", newline="") as csv_file_read:
        reader = csv.reader(csv_file_read)
        for row in reader:
            if flag:
                for col, mapping in columns.items():
                    # A string seen for the first time gets the next free index;
                    # a repeated string reuses the index it already has.
                    row[col] = mapping.setdefault(row[col], len(mapping))
            # The header row is written unchanged; data rows are written encoded.
            writer.writerow(row)
            flag = True
# Both files are closed automatically when the with blocks end.
print("done!")
The example above comes from a real data-processing task: the original data set has 200 attribute columns and 30,000 records. The data contains both Chinese characters and missing values, so it has to be cleaned step by step.
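One possible way to handle the missing values during the same pass is sketched below. It assumes that a missing value appears as an empty cell and that -1 is an acceptable placeholder; both are assumptions, as are the helper name encode_columns and the small in-memory sample, none of which come from the original data.

```python
import csv
import io

MISSING = -1  # assumed placeholder for empty cells; not from the original script


def encode_columns(rows, columns):
    """Replace text in the given column positions with per-column integer indexes."""
    mappings = {col: {} for col in columns}  # one string -> index dictionary per column
    encoded_rows = []
    for row in rows:
        row = list(row)  # copy so the caller's row is not modified
        for col in columns:
            text = row[col].strip()
            if text == "":
                # Empty cell: mark it as missing instead of giving it an index.
                row[col] = MISSING
            else:
                mapping = mappings[col]
                row[col] = mapping.setdefault(text, len(mapping))
        encoded_rows.append(row)
    return encoded_rows, mappings


# Hypothetical sample standing in for the real CSV file.
sample = "id,city,level\n1,北京,高\n2,,低\n3,北京,\n4,上海,高\n"
reader = csv.reader(io.StringIO(sample, newline=""))
header = next(reader)  # keep the header row as it is
encoded_rows, mappings = encode_columns(reader, columns=[1, 2])
for row in encoded_rows:
    print(row)
```

Keeping missing cells out of the dictionaries means the placeholder can later be replaced by whatever filling strategy suits the data, without disturbing the indexes assigned to real values.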