Data cleaning is the first step in data analysis, and it often takes a large share of the total time to clean the data or convert its format.
First. Data preprocessing
1. Set up the environment, import the analysis packages and the data
```python
import pandas as pd
import numpy as np

DataDF = pd.read_csv('C:/Users/jzgao/Desktop/ecommerce-data/data.csv',
                     encoding="ISO-8859-1", dtype=str)
# dtype=str: it is safest to read everything in as strings, otherwise values may be distorted
# For example, the code 0010008 may be read as the number 10008
# encoding="ISO-8859-1": the codec used for decoding, usually the system default;
# for Chinese text use "utf-8"
```
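The effect of `dtype=str` is easy to demonstrate. The sketch below uses a tiny in-memory CSV (hypothetical values, not the real `data.csv`) to show how a leading-zero code is distorted by type inference and preserved when read as a string:

```python
import io
import pandas as pd

# A minimal in-memory CSV standing in for the real file (hypothetical values)
csv_text = "InvoiceNo,StockCode\n536365,0010008\n536366,0071053\n"

# Without dtype=str, pandas infers StockCode as an integer column
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.loc[0, "StockCode"])   # 10008 — the leading zeros are lost

# With dtype=str, the code keeps its original form
as_str = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(as_str.loc[0, "StockCode"])     # '0010008'
```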
2. Try to understand this data set
By interrogating the data set we can judge whether it can answer our questions, and whether it is clean enough to need no further processing.
3. Let's take a look at the data in combination with the code
```python
# 1. Look at the data from a macro perspective: view the DataFrame info
DataDF.info()
```
![DataDF.info()](https://pic4.zhimg.com/80/v2-1e545431538653299eba0e3c716caa0f_hd.jpg)
```python
# 2. Check for missing data
# To count the missing values in each column, the following is the fastest method.
# It shows which columns are missing the most data, which helps decide
# how to carry out the next cleaning and analysis steps.
DataDF.isnull().sum().sort_values(ascending=False)
```
![Check for missing data](https://pic3.zhimg.com/80/v2-f92c16a31bfc8156e69dc24ab3e04b5a_hd.jpg)
```python
# 3. Sample a few rows to understand the data intuitively and spot problems early
DataDF.head()
```
![Extract some data](https://pic4.zhimg.com/80/v2-75282587dc4e5139f870d064713e3d9f_hd.jpg)
From this we can see:
1) InvoiceDate contains the specific time of day, which can be dropped.
2) Description was most likely entered by hand, and hand-typed data usually has formatting problems;
we can guess there will be punctuation issues, inconsistent capitalization, and so on.
```python
# View the distinct values of the product description
DataDF['Description'].unique()
```
![View duplicate items](https://pic2.zhimg.com/80/v2-97354e2574bc482209057091b4376bf5_hd.jpg)
```python
# Print all the content without truncation
# threshold sets how many items are shown before output is abbreviated
# (e.g. threshold=10 means output longer than 10 items is abbreviated)
np.set_printoptions(threshold=np.inf)
DataDF['Description'].unique()
```
![All content](https://pic4.zhimg.com/80/v2-2c0a927c6d6709fb4549708df0845be7_hd.jpg)
Here we find a problem: many values contain extra spaces.
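Before fixing the spaces, it can help to measure how widespread the problem is. A minimal sketch, using a hypothetical sample of Description values rather than the real column: compare each value with its stripped form and count the mismatches.

```python
import pandas as pd

# Hypothetical sample of Description values with the whitespace problems spotted above
desc = pd.Series(["WHITE HANGING HEART", " WHITE HANGING HEART", "RED WOOLLY HOTTIE  "])

# Flag values whose stripped form differs from the original
has_extra_space = desc != desc.str.strip()
print(has_extra_space.sum())  # 2 of the 3 values carry stray spaces
```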
Second. Adjust the data types
```python
DataDF.dtypes
```
![Before data type adjustment](https://pic1.zhimg.com/80/v2-3528fb2a4eadd539198913bfc9147e9c_hd.jpg)
```python
# Convert strings to numeric values (integer type)
DataDF['Quantity'] = DataDF['Quantity'].astype('int')
# Convert strings to numeric values (float type)
DataDF['UnitPrice'] = DataDF['UnitPrice'].astype('float')

DataDF.dtypes
```
![After data type adjustment](https://pic2.zhimg.com/80/v2-5ad6c37d0fd41c73669e9ef53cdd48b1_hd.jpg)
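One caveat with `astype`: it raises a `ValueError` if even a single string cannot be parsed. A more defensive alternative (a sketch on hypothetical values, not the real columns) is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so they can be handled in the missing-value step later:

```python
import pandas as pd

# Hypothetical quantity strings: one entry is not a valid number
qty = pd.Series(["6", "8", "bad"])

# qty.astype('int') would raise ValueError here; to_numeric can coerce instead
qty_num = pd.to_numeric(qty, errors="coerce")
print(qty_num)  # 6.0, 8.0, NaN — the invalid entry becomes NaN
```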
Third. Modify the column names
```python
DataDF.head(10)
```
![Before column name modification](https://pic3.zhimg.com/80/v2-47ef427e9fe02e05bab4d71751872816_hd.jpg)
```python
# Create a dictionary mapping old column names to new column names
# (note: 'InvolceDate' is misspelled here, so rename silently skips it
# and only StockCode is actually renamed to StockNo)
colNameDict = {'InvolceDate': 'SaleDate', 'StockCode': 'StockNo'}
# ⚠️ Be sure to put the old column name before the colon;
# the pairs are separated by commas
DataDF.rename(columns=colNameDict, inplace=True)

DataDF.head(10)
```
![Column name modified](https://pic2.zhimg.com/80/v2-cacb109690ff0eb5dc6f617688c4b8b9_hd.jpg)
Fourth. Select some subsets
```python
DataDF
```
![Data Set](https://pic2.zhimg.com/80/v2-96e7e990e1558bcd00a74b74b055dec9_hd.jpg)
```python
# Select a subset: a single column
subDataDF1 = DataDF["InvoiceDate"]
subDataDF1
```
![One column subset](https://pic4.zhimg.com/80/v2-0934de9fb5b296706b7f4bcc05bdeea7_hd.jpg)
```python
# Select a subset: two columns
subDataDF2 = DataDF[["InvoiceDate", "UnitPrice"]]
subDataDF2
```
![Two-column subset](https://pic4.zhimg.com/80/v2-b278ff19a5857645d6d75071ba904057_hd.jpg)
Slicing data with df.loc
The loc accessor works a bit like dragging with the left mouse button in Excel: you pull out exactly the slice of data you need.
A comma separates the two selectors: the row index goes on the left, the columns on the right.
```python
subDataDF3 = DataDF.loc[:, "InvoiceDate"]
subDataDF3
# A bare colon means "select everything" along that axis
```
![subDataDF3](https://pic1.zhimg.com/80/v2-ad7318105f9819980b745294977e76fc_hd.jpg)
```python
subDataDF4 = DataDF.loc[0:9, :]
subDataDF4
```
![subDataDF4](https://pic2.zhimg.com/80/v2-8722bc9b92101f6d47c9f1296906e879_hd.jpg)
```python
subDataDF5 = DataDF.loc[1:8, "StockNo":"CustomerID"]
subDataDF5
```
![subDataDF5](https://pic4.zhimg.com/80/v2-dd5a50421201e3163a8dfcb7c327609b_hd.jpg)
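A detail worth remembering about `.loc`: unlike positional slicing, label slices include both endpoints. A minimal sketch on a hypothetical small frame:

```python
import pandas as pd

# Small hypothetical frame to show that .loc slices are inclusive on BOTH ends
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]})

sub = df.loc[1:3, "A":"B"]  # rows labeled 1, 2 AND 3; columns A through B
print(len(sub))  # 3 rows — the end label 3 is included, unlike df.iloc[1:3]
```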
Fifth. Filter on logical conditions
This is also a feature of DataFrame.loc:
loc accepts a boolean condition and keeps the rows where it evaluates to True.
```python
DataDF.loc[:, 'UnitPrice'] > 0
```
![UnitPrice > 0](https://pic3.zhimg.com/80/v2-77f3843482b8cc62918c5b183126e6fe_hd.jpg)
Generally speaking, a price cannot be negative, so logically any row with a price below 0 should be filtered out.
```python
# Remove outliers: filter the data with a condition
# Build the query condition
querySer = DataDF.loc[:, 'Quantity'] > 0
# Apply the query condition
print('Before deleting outliers:', DataDF.shape)
DataDF = DataDF.loc[querySer, :]
print('After deleting outliers:', DataDF.shape)
```
![After deleting outliers](https://pic3.zhimg.com/80/v2-1e0b67efb65821fae3e43160a304c706_hd.jpg)
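Several conditions can also be combined into one boolean mask with `&` (each condition wrapped in parentheses). A minimal sketch on a hypothetical miniature of the sales data, filtering on quantity and price at once:

```python
import pandas as pd

# Hypothetical miniature of the sales data
df = pd.DataFrame({"Quantity": [6, -2, 3], "UnitPrice": [2.5, 1.0, -0.5]})

# Combine filters with & — each condition must be wrapped in parentheses
querySer = (df["Quantity"] > 0) & (df["UnitPrice"] > 0)
clean = df.loc[querySer, :]
print(clean.shape)  # (1, 2) — only the fully valid row survives
```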
Sixth. Format consistency
1. Unify capitalization / remove spaces
Convert everything in the Description column to uppercase:
```python
DataDF['Description'] = DataDF['Description'].str.upper()
DataDF.head()
```
2. Remove string symbols
3. Split on whitespace
Finally, assign the result back to the column.
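The three steps above can be chained into one expression. A minimal sketch on hypothetical Description values (not the real column), using `str.replace` with a regex to drop punctuation and `str.split`/`str.join` to collapse runs of whitespace:

```python
import pandas as pd

# Hypothetical Description values with mixed case, stray symbols and extra spaces
desc = pd.Series([" white hanging heart. ", "RED  WOOLLY   HOTTIE!!"])

cleaned = (
    desc.str.upper()                              # 1. unify capitalization
        .str.replace(r"[^\w\s]", "", regex=True)  # 2. drop punctuation symbols
        .str.split()                              # 3. split on runs of whitespace
        .str.join(" ")                            # rejoin with single spaces
)
print(cleaned.tolist())  # ['WHITE HANGING HEART', 'RED WOOLLY HOTTIE']
# Finally assign the result back, e.g. DataDF['Description'] = cleaned
```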
Seventh. Handle missing values
1. Remove missing values
```python
# Reminder: check for missing data again
DataDF.isnull().sum().sort_values(ascending=False)
```
Key knowledge point for removing missing values: DataFrame.dropna
```python
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# By default (axis=0), rows containing missing values are removed;
# set axis=1 to remove columns containing missing values instead
# how='any': remove the whole row (or column) if any of its values is NaN
# how='all': remove the row (or column) only if every value in it is NaN
DataDF.dropna(how='any')
DataDF.dropna(how='all')
# The finer-grained thresh parameter sets how many non-missing values
# a row (or column) must have in order to be kept
DataDF.dropna(thresh=6)
```
2. Fill in missing content. Some missing values can be filled in; there are four methods:
1) Fill missing values inferred from business knowledge or experience (a default value)
2) Fill missing values with a statistic of the same column (mean, median, mode, etc.)
3) Fill missing values with adjacent values
4) Fill missing values computed from other columns
Key knowledge point for filling missing values: DataFrame.fillna
1) Fill with a default value: df.fillna('')
We should replace those unfriendly NaN values. But with what? Here you need to combine your understanding of the data set to decide what value is most appropriate.
In this data set we can roughly judge that CustomerID is less important, so an empty string "" or another default value will do.
```python
DataDF.Country = DataDF.Country.fillna('Not Given')
```
2) Fill in the missing values with a statistic of the same column (mean, median, mode, etc.)
Mean: df.fillna(df.mean())
For numeric data, this can reduce the error introduced by filling.
Take the price in this case: filling with 0 or "Not Given" is inappropriate, but an approximate price can be estimated from the other rows.
```python
DataDF.UnitPrice = DataDF.UnitPrice.fillna(DataDF.UnitPrice.mean())
```
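The mean is sensitive to outliers, so the median is often the safer statistic for skewed data like prices. A minimal sketch on hypothetical values (not the real UnitPrice column):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with a missing value and one extreme outlier
price = pd.Series([1.0, 2.0, 3.0, np.nan, 100.0])

# The median is more robust to outliers than the mean
filled_mean = price.fillna(price.mean())      # mean = 26.5, pulled up by the 100.0
filled_median = price.fillna(price.median())  # median = 2.5
print(filled_mean[3], filled_median[3])
```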
3) Another common method is to fill with adjacent values.
This is quite common in time-series analysis: a gap can be filled forward with the previous value, or backward with the next value.
```python
print(DataDF)
print(DataDF.UnitPrice.fillna(method='ffill'))  # forward fill: propagate the previous value
print(DataDF.UnitPrice.fillna(method='bfill'))  # backward fill: propagate the next value
```
4) Fill in missing values computed from other columns.
With this method, for example, a missing age field can be computed from a partially masked ID number that still exposes the birth date.
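A minimal sketch of this idea, under two stated assumptions: the IDs follow the 18-digit Chinese layout where characters 7–14 encode the birth date (YYYYMMDD) and the year digits are left unmasked, and the reference year is taken to be 2020. The ID values are hypothetical.

```python
import pandas as pd

# Hypothetical masked ID numbers: assume the birth-year digits (positions 7-10
# in an 18-digit Chinese ID) are visible while other digits are masked with '*'
ids = pd.Series(["11****198509******", "32****199212******"])

birth_year = ids.str[6:10].astype(int)  # extract the YYYY part of the birth date
age = 2020 - birth_year                 # assuming 2020 as the reference year
print(age.tolist())  # [35, 28]
```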