I. Data preprocessing
The main tasks of data preprocessing are:
1. Data cleaning
2. Data integration
3. Data transformation
4. Data reduction
1. Data cleaning
Real-world data is generally incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
(1) Handling missing values
(The examples below use the data loaded from train.csv.)
① Ignore the tuple: This is usually done when the class label is missing. The method is not very effective unless the tuple has several attributes with missing values.
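import pandas as pd
data = pd.read_csv('train.csv')
# Note: this drops the whole Cabin column (an attribute with many missing values), not individual rows
data.drop('Cabin', axis=1, inplace=True)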
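The snippet above removes an attribute rather than tuples. As a minimal sketch of ignoring tuples themselves, assuming a small hypothetical DataFrame in which `Survived` plays the role of the class label (the toy values are made up for illustration), `dropna` can discard rows by label or by the number of non-missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'Survived' stands in for the class label.
toy = pd.DataFrame({
    "Survived": [1, np.nan, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Ignore tuples: drop every row whose class label is missing.
labelled = toy.dropna(subset=["Survived"])

# Variant: keep only rows that have at least 2 non-missing values,
# i.e. drop tuples in which most attributes are missing.
mostly_complete = toy.dropna(thresh=2)

print(labelled)
print(mostly_complete)
```

Neither call modifies `toy`; each returns a new DataFrame, so the original rows stay available for comparison.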
② Fill in the missing values manually: In general, this method is time-consuming.
③ Use a global constant to fill in the missing values: Replace all missing values with the same constant (such as "Unknown" or −∞). If missing values are replaced with "Unknown", the mining program may mistakenly conclude that they form an interesting concept, because they all share the same value "Unknown". This method is therefore simple but unreliable.
# Fill the missing values in the Cabin column of data with 'unknown'
data['Cabin'] = data['Cabin'].fillna('unknown')
④ Fill in the missing values with the attribute mean: For example, if the average customer income is 56,000 USD, use that value to replace the missing values of income.
# Fill the missing values in Age with the mean of Age
data['Age'] = data['Age'].fillna(data['Age'].mean())
⑤ Use the attribute mean of all samples that belong to the same class as the given tuple: For example, if customers are classified by credit_risk, replace a missing income value with the average income of the customers in the same credit risk class as the given tuple.
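A minimal sketch of the class-wise mean, assuming a hypothetical `customers` table with `credit_risk` and `income` columns (the values are invented for illustration). `groupby(...).transform("mean")` returns, for every row, the mean income of that row's own class, which can then be passed to `fillna`:

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with two missing incomes.
customers = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low", "high"],
    "income": [60000.0, np.nan, 30000.0, 34000.0, 52000.0, np.nan],
})

# Mean income of each row's own credit_risk class (NaN values are skipped).
class_mean = customers.groupby("credit_risk")["income"].transform("mean")

# Replace only the missing incomes with the mean of the matching class.
customers["income"] = customers["income"].fillna(class_mean)

print(customers)
```

Here the missing "low" income is filled with the mean of the other "low" customers, and the missing "high" income with the mean of the other "high" customers.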
⑥ Use the most probable value to fill in the missing value: The value can be determined with regression, with inference-based tools using a Bayesian formalism, or with decision tree induction. For example, using the other customer attributes in the dataset, a decision tree can be constructed to predict the missing values of income.
from sklearn.ensemble import GradientBoostingRegressor
# Select the attributes that contain no null values. Note: the data here has already been processed, and strings have been converted to numeric types
col = ['Pclass', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked', 'female', 'male']
notnull = data[pd.notnull(data.Age)]  # rows where Age is known (training set)
isnull = data[pd.isnull(data.Age)]    # rows where Age is missing
g = GradientBoostingRegressor()
g.fit(notnull[col].values, notnull.Age)
# Write the predicted ages back into the rows where Age is missing
data.loc[pd.isnull(data.Age), 'Age'] = g.predict(isnull[col].values)
(2) Handling noisy data
Noise is the random error or variance in a measured variable. Given a numeric attribute (such as price), how can the data be smoothed to remove the noise? Some data smoothing techniques are described below.
① Binning: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
Example: Sorted data for price (in US dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
c = [4, 8, 15, 21, 21, 24, 25, 28, 34]
c = pd.Series(c)
s = pd.qcut(c, [0, 0.33, 0.66, 1])  # three equal-frequency bins
c.groupby(s).mean()  # mean of each bin
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries (each value is replaced by the closer of its bin's minimum and maximum):
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
c.groupby(s).max()  # upper boundary of each bin
c.groupby(s).min()  # lower boundary of each bin
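The group-wise calls only return one number per bin. As a minimal sketch of producing the full smoothed series, `transform` can broadcast each bin's statistic back to every value in the bin. The variable names are my own, and sending ties to the lower boundary is an assumption, since the example has no ties:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency binning into 3 bins; labels=False yields bin indices 0, 1, 2.
bins = pd.qcut(prices, 3, labels=False)
grouped = prices.groupby(bins)

# Smoothing by bin means: every value becomes the mean of its bin.
by_mean = grouped.transform("mean")

# Smoothing by bin boundaries: every value becomes the closer of its
# bin's minimum and maximum (ties go to the minimum, by assumption).
lo = grouped.transform("min")
hi = grouped.transform("max")
by_boundary = lo.where((prices - lo) <= (hi - prices), hi)

print(by_mean.tolist())
print(by_boundary.tolist())
```

The two printed lists should match the bins shown above for smoothing by bin means and by bin boundaries.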
② Regression: Data can be smoothed by fitting it to a function (such as a regression function).
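A minimal sketch of smoothing by regression, assuming a hypothetical pair of attributes `x` and `y` whose relationship is roughly linear (the numbers are invented). A straight line is fitted by least squares with `np.polyfit`, and each noisy `y` is replaced by the value on the line:

```python
import numpy as np

# Hypothetical noisy observations: y is roughly 2 * x plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# The smoothed data are the fitted values on the line.
smoothed = slope * x + intercept

print(slope, intercept)
print(smoothed)
```

A straight line is only one possible choice; a higher `deg` or a different regression model may suit data that is not linear.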
③ Clustering: Outliers can be detected by clustering, which organizes similar values into groups, or clusters. Intuitively, values that fall outside the set of clusters can be considered outliers.
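import numpy as np
from sklearn.cluster import KMeans
c = [4, 8, 15, 21, 21, 24, 25, 28, 34]
k = KMeans(n_clusters=3)
c = np.array(c)
c = c.reshape(9, 1)  # KMeans expects a 2-D array: 9 samples, 1 feature
k.fit(c)
center = k.cluster_centers_
c = center[k.predict(c)]  # replace each value with the center of its cluster
print(c)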
... There are many other smoothing methods.
(3) Handling inconsistent data
As a data analyst, watch for inconsistent use of codes and inconsistent data representations (such as the dates "2004/12/25" and "25/12/2004"). Field overloading is another source of errors. It typically arises when developers squeeze the definition of a new attribute into the unused (bit) portion of an attribute that is already defined (for example, using the one unused bit of an attribute whose value range already occupies 31 of 32 bits).
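A minimal sketch of reconciling the two date representations mentioned above, using only the standard library. The helper `normalize_date` and the list `KNOWN_FORMATS` are hypothetical names, and the sketch assumes that every date uses one of exactly these two layouts (year first or day first, separated by "/"):

```python
from datetime import datetime

# Assumed layouts: year/month/day and day/month/year.
KNOWN_FORMATS = ("%Y/%m/%d", "%d/%m/%Y")

def normalize_date(text):
    """Return the date as an ISO 8601 string (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # this layout does not match, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")

dates = [normalize_date(d) for d in ["2004/12/25", "25/12/2004"]]
print(dates)
```

Both strings end up in the same representation. A string such as "01/02/2004" is read as day first here; a month-first source would need its own format and cannot be told apart from the string alone.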
Reference: https://www.douban.com/note/128949687/
I am a beginner and my ability is limited; if there are any mistakes, please correct me.