[Machine Learning] Data preprocessing: converting data of different types into numerical values
Before doing data analysis in Python, you must first preprocess the data.
Sometimes that data is non-numeric, so this post covers how to convert it into numerical values.
Three methods are available:
1. Use LabelEncoder for fast conversion;
2. Use a mapping dictionary to map each category to a value, although this method has limited applicability;
3. Use the get_dummies method for conversion.
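One possible reason the mapping approach is limited, offered here as an illustration rather than as the author's stated reason: `Series.map` with a dictionary turns any category that is missing from the dictionary into NaN. The sketch below uses hypothetical sample data (the `sizes` Series and the `'XXL'` value are assumptions, not from the original post).

```python
import pandas as pd

# Hypothetical sample data, not from the original post
sizes = pd.Series(['M', 'L', 'XL', 'XXL'])

# A hand-written mapping that happens to omit 'XXL'
size_mapping = {'M': 1, 'L': 2, 'XL': 3}

# Categories that are not keys of the dictionary become NaN
mapped = sizes.map(size_mapping)
print(mapped)

# List the categories the mapping did not cover
missing = sizes[mapped.isna()].tolist()
print(missing)
```

Checking for NaN after mapping, as in the last two lines, is a cheap way to catch categories that were left out of the dictionary.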
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# NOTE: the sample rows were garbled in the source; the values below are
# illustrative placeholders that keep the header and leave two cells empty.
csv_data = '''A,B,C,D
1,2,3,4
5,6,,8
10,11,12,'''

df = pd.read_csv(StringIO(csv_data))
print(df)
# Count the missing values in each column
print(df.isnull().sum())
print(df.values)

# dropna() returns a new DataFrame without the rows that contain nulls;
# df itself is left unchanged
print(df.dropna())
print('after', df)

# Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it and
# always works column by column (the old axis=0 behavior)
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr.fit(df.values)  # learn the mean of each column
imputed_data = imr.transform(df.values)  # fill missing values with those means
print(imputed_data)

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)

# Map the ordinal size feature to integers that preserve its order
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
print(df)

# Traverse a Series
for idx, label in enumerate(df['classlabel']):
    print(idx, label)

# 1. Use the LabelEncoder class for fast encoding. It is not suitable for
# color, because the integers would look like a size order among the colors.
class_le = LabelEncoder()
color_le = LabelEncoder()
df['classlabel'] = class_le.fit_transform(df['classlabel'].values)
# df['color'] = color_le.fit_transform(df['color'].values)
print(df)

# 2. Use a mapping dictionary to convert the class labels to integers.
# classlabel was already encoded in step 1, so this mapping is {0: 0, 1: 1};
# on the raw strings it would be {'class1': 0, 'class2': 1}.
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)
print('2,', df)

# 3. For color, where method 1 is not applicable, create dummy (one-hot) features
pf = pd.get_dummies(df[['color']])
df = pd.concat([df, pf], axis=1)
df.drop(['color'], axis=1, inplace=True)
print(df)
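To make the remark about color more concrete, here is a minimal sketch of what LabelEncoder and get_dummies each produce for a color column. The `colors` DataFrame is hypothetical sample data that mirrors the values used in the post, and `dtype=int` is passed only so the output looks the same across pandas versions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data mirroring the color column used in the post
colors = pd.DataFrame({'color': ['green', 'red', 'blue']})

# LabelEncoder numbers the classes in sorted order, so the integers
# carry an order that the colors themselves do not have
le = LabelEncoder()
codes = le.fit_transform(colors['color'])
print(list(le.classes_))   # ['blue', 'green', 'red']
print(codes.tolist())      # [1, 2, 0]

# get_dummies creates one 0/1 indicator column per color instead
dummies = pd.get_dummies(colors[['color']], dtype=int)
print(dummies.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
print(dummies.values.tolist())   # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

With the integer codes, a model could read red (2) as "greater than" blue (0). The indicator columns avoid that, at the cost of one extra column per category.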