Data Preprocessing (1) -- Data Cleansing in Python (sklearn, pandas, numpy)

The main tasks of data preprocessing are:

1. Data cleaning

2. Data integration

3. Data transformation

4. Data reduction


1. Data cleaning
Real-world data is generally incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.


(1) Handling missing values
(The examples below use the Titanic train.csv dataset.)

① Ignore the tuple: this is usually done when the class label is missing. The method is not very effective unless the tuple has several attributes with missing values.

import pandas as pd
data = pd.read_csv('train.csv')
data.drop('Cabin', axis=1, inplace=True)

② Fill in missing values manually: in general, this method is time-consuming.
③ Use a global constant to fill in missing values: replace all missing values with the same constant (such as "unknown" or -∞). If missing values are replaced with "unknown", the mining program may mistakenly conclude that they form an interesting concept, because they all share the same value "unknown". This method is therefore simple but unreliable.

data.Cabin.fillna('unknown', inplace=True)
# fill missing values in Cabin with 'unknown'
④ Use the attribute mean to fill in the missing value: for example, if the average customer income is 56,000 USD, substitute that value for the missing values in income.

data.Age.fillna(data.Age.mean(), inplace=True)
# fill missing values in Age with the mean of Age

⑤ Use the attribute mean of all samples in the same class as the given tuple: for example, classify customers by credit_risk and replace a missing income value with the average income of customers who have the same credit risk as the given tuple.
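The original post gives no code for this step. A minimal sketch of the idea on the same Titanic data, assuming Pclass plays the role of the class label (a stand-in DataFrame is used here so the snippet is self-contained):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for train.csv
data = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Age':    [38.0, None, 30.0, 26.0, None, 22.0],
})

# Replace each missing Age with the mean Age of tuples in the same Pclass
data['Age'] = data.groupby('Pclass')['Age'].transform(
    lambda s: s.fillna(s.mean())
)
print(data['Age'].tolist())  # the Pclass-1 gap becomes 38.0, the Pclass-3 gap 22.0
```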



⑥ Use the most probable value to fill in the missing value: it can be determined with regression, inference-based formal tools such as the Bayesian formalism, or decision tree induction. For example, using the other attributes of the customers in the dataset, a decision tree can be built to predict the missing values of income.

col = ['Pclass', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked', 'female', 'male']
# select the attributes without missing values
# note: the data here has already been processed; strings have been converted to numeric types
notnull = data[pd.notnull(data.Age)]
isnull = data[pd.isnull(data.Age)]
from sklearn.ensemble import GradientBoostingRegressor
g = GradientBoostingRegressor()
g.fit(notnull[col].values, notnull.Age)
data.loc[pd.isnull(data.Age), 'Age'] = g.predict(isnull[col])

(2) Handling noisy data
Noise is random error or variance in a measured variable. Given a numeric attribute such as price, how can we smooth the data to remove the noise? Several data smoothing techniques are described below.
① Binning: binning smooths the values of sorted data by consulting their "nearest neighbors". The sorted values are distributed into a number of buckets, or bins. Because binning consults the neighborhood of values, it performs local smoothing.
Example: price data, sorted (in USD): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partitioned into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

import pandas as pd
c = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
s = pd.qcut(c, [0, 0.33, 0.66, 1])
c.groupby(s).mean()

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

c.groupby(s).min()  # lower boundary of each bin

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

c.groupby(s).max()  # upper boundary of each bin
# each value is then replaced by the closer of its bin's two boundaries


② Regression: data can be smoothed by fitting it with a function, such as a regression function.
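As a rough illustration (not from the original post), a linear regression can be fit to the sorted price data, using the position in the sorted order as the predictor, and each value replaced by its fitted value:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
x = np.arange(len(prices)).reshape(-1, 1)  # position in the sorted order

reg = LinearRegression().fit(x, prices)
smoothed = reg.predict(x)  # noisy values replaced by points on the fitted line
print(np.round(smoothed, 1))
```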


③ Clustering: outliers can be detected by clustering, which organizes similar values into groups, or "clusters". Intuitively, values that fall outside of the set of clusters are considered outliers.

import numpy as np
from sklearn.cluster import KMeans
c = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]).reshape(9, 1)
k = KMeans(n_clusters=3)
k.fit(c)
center = k.cluster_centers_
c = center[k.predict(c)]  # replace each value with the center of its cluster
print(c)
There are many other smoothing methods.

(3) Handling inconsistent data
As a data analyst, be alert to inconsistencies in code usage and in data representation (such as the dates "2004/12/25" and "25/12/2004"). Field overloading is another source of error; it typically results when developers squeeze the definition of a new attribute into unused (bit) portions of attributes that are already defined (for example, using an unused bit of an attribute that occupies only 31 of 32 bits).
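As one illustrative fix for the date example above (a sketch, not from the original post), inconsistent date representations can be normalized with pandas by parsing each with its own explicit format:

```python
import pandas as pd

# Two representations of the same date, as in the example above
dates = ['2004/12/25', '25/12/2004']
normalized = [
    pd.to_datetime(dates[0], format='%Y/%m/%d'),
    pd.to_datetime(dates[1], format='%d/%m/%Y'),
]
print(normalized[0] == normalized[1])  # both parse to the same 2004-12-25 timestamp
```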

Reference: https://www.douban.com/note/128949687/
I am a beginner and my abilities are limited; if there are any mistakes, please correct me.
