Data cleaning is the first step in data analysis, and it often takes a large share of the total time to clean the data or convert its format.
First. Data preprocessing
1. Set up the environment, import the analysis packages and the data
```python
import pandas as pd
import numpy as np

DataDF = pd.read_csv('C:/Users/jzgao/Desktop/ecommerce-data/data.csv',
                     encoding="ISO-8859-1", dtype=str)
# dtype=str: it is safest to read everything in as strings, otherwise values may be distorted
# For example, the code 0010008 may be read as the number 10008
# encoding="ISO-8859-1": the codec used for decoding, usually the system default;
# for Chinese text use "utf-8"
```
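The effect of `dtype=str` is easy to demonstrate. The sketch below uses a tiny in-memory CSV (hypothetical values, not the real `data.csv`) to show how a leading-zero code is distorted by type inference and preserved when read as a string:

```python
import io
import pandas as pd

# A minimal in-memory CSV standing in for the real file (hypothetical values)
csv_text = "InvoiceNo,StockCode\n536365,0010008\n536366,0071053\n"

# Without dtype=str, pandas infers StockCode as an integer column
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.loc[0, "StockCode"])   # 10008 — the leading zeros are lost

# With dtype=str, the code keeps its original form
as_str = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(as_str.loc[0, "StockCode"])     # '0010008'
```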
2. Try to understand this data set
By interrogating the data set we can judge whether it can answer our questions, and whether it is clean enough to need no further processing.
3. Let's take a look at the data in combination with the code
```python
# 1. Look at the data from a macro perspective: view the DataFrame info
DataDF.info()
```
![DataDF.info()](https://pic4.zhimg.com/80/v2-1e545431538653299eba0e3c716caa0f_hd.jpg)
```python
# 2. Check for missing data
# To count the missing values in each column, the following is the fastest method.
# It shows which columns are missing the most data, which helps decide
# how to carry out the next cleaning and analysis steps.
DataDF.isnull().sum().sort_values(ascending=False)
```
![Check for missing data](https://pic3.zhimg.com/80/v2-f92c16a31bfc8156e69dc24ab3e04b5a_hd.jpg)
```python
# 3. Sample a few rows to understand the data intuitively and spot problems early
DataDF.head()
```
![Extract some data](https://pic4.zhimg.com/80/v2-75282587dc4e5139f870d064713e3d9f_hd.jpg)
From this we can see:
1) InvoiceDate contains the specific time of day, which can be dropped.
2) Description was most likely entered by hand, and hand-typed data usually has formatting problems;
we can guess there will be punctuation issues, inconsistent capitalization, and so on.
```python
# View the distinct values of the product description
DataDF['Description'].unique()
```
![View duplicate items](https://pic2.zhimg.com/80/v2-97354e2574bc482209057091b4376bf5_hd.jpg)
```python
# Print all the content without truncation
# threshold sets how many items are shown before output is abbreviated
# (e.g. threshold=10 means output longer than 10 items is abbreviated)
np.set_printoptions(threshold=np.inf)
DataDF['Description'].unique()
```
![All content](https://pic4.zhimg.com/80/v2-2c0a927c6d6709fb4549708df0845be7_hd.jpg)
Here we find a problem: many values contain extra spaces.
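Before fixing the spaces, it can help to measure how widespread the problem is. A minimal sketch, using a hypothetical sample of Description values rather than the real column: compare each value with its stripped form and count the mismatches.

```python
import pandas as pd

# Hypothetical sample of Description values with the whitespace problems spotted above
desc = pd.Series(["WHITE HANGING HEART", " WHITE HANGING HEART", "RED WOOLLY HOTTIE  "])

# Flag values whose stripped form differs from the original
has_extra_space = desc != desc.str.strip()
print(has_extra_space.sum())  # 2 of the 3 values carry stray spaces
```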
Second. Adjust the data types
```python
DataDF.dtypes
```
![Before data type adjustment](https://pic1.zhimg.com/80/v2-3528fb2a4eadd539198913bfc9147e9c_hd.jpg)
```python
# Convert strings to numeric values (integer type)
DataDF['Quantity'] = DataDF['Quantity'].astype('int')
# Convert strings to numeric values (float type)
DataDF['UnitPrice'] = DataDF['UnitPrice'].astype('float')

DataDF.dtypes
```
![After data type adjustment](https://pic2.zhimg.com/80/v2-5ad6c37d0fd41c73669e9ef53cdd48b1_hd.jpg)
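One caveat with `astype`: it raises a `ValueError` if even a single string cannot be parsed. A more defensive alternative (a sketch on hypothetical values, not the real columns) is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so they can be handled in the missing-value step later:

```python
import pandas as pd

# Hypothetical quantity strings: one entry is not a valid number
qty = pd.Series(["6", "8", "bad"])

# qty.astype('int') would raise ValueError here; to_numeric can coerce instead
qty_num = pd.to_numeric(qty, errors="coerce")
print(qty_num)  # 6.0, 8.0, NaN — the invalid entry becomes NaN
```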
Third. Modify the column names
```python
DataDF.head(10)
```
![Before column name modification](https://pic3.zhimg.com/80/v2-47ef427e9fe02e05bab4d71751872816_hd.jpg)
```python
# Create a dictionary mapping old column names to new column names
# (note: 'InvolceDate' is misspelled here, so rename silently skips it
# and only StockCode is actually renamed to StockNo)
colNameDict = {'InvolceDate': 'SaleDate', 'StockCode': 'StockNo'}
# ⚠️ Be sure to put the old column name before the colon;
# the pairs are separated by commas
DataDF.rename(columns=colNameDict, inplace=True)

DataDF.head(10)
```
![Column name modified](https://pic2.zhimg.com/80/v2-cacb109690ff0eb5dc6f617688c4b8b9_hd.jpg)
Fourth. Select some subsets
```python
DataDF
```
![Data Set](https://pic2.zhimg.com/80/v2-96e7e990e1558bcd00a74b74b055dec9_hd.jpg)
```python
# Select a subset: a single column
subDataDF1 = DataDF["InvoiceDate"]
subDataDF1
```
![One column subset](https://pic4.zhimg.com/80/v2-0934de9fb5b296706b7f4bcc05bdeea7_hd.jpg)
```python
# Select a subset: two columns
subDataDF2 = DataDF[["InvoiceDate", "UnitPrice"]]
subDataDF2
```
![Two-column subset](https://pic4.zhimg.com/80/v2-b278ff19a5857645d6d75071ba904057_hd.jpg)
Slicing data with df.loc
The loc accessor works a bit like dragging with the left mouse button in Excel: you pull out exactly the slice of data you need.
A comma separates the two selectors: the row index goes on the left, the columns on the right.
```python
subDataDF3 = DataDF.loc[:, "InvoiceDate"]
subDataDF3
# A bare colon means "select everything" along that axis
```
![subDataDF3](https://pic1.zhimg.com/80/v2-ad7318105f9819980b745294977e76fc_hd.jpg)
```python
subDataDF4 = DataDF.loc[0:9, :]
subDataDF4
```
![subDataDF4](https://pic2.zhimg.com/80/v2-8722bc9b92101f6d47c9f1296906e879_hd.jpg)
```python
subDataDF5 = DataDF.loc[1:8, "StockNo":"CustomerID"]
subDataDF5
```
![subDataDF5](https://pic4.zhimg.com/80/v2-dd5a50421201e3163a8dfcb7c327609b_hd.jpg)
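A detail worth remembering about `.loc`: unlike positional slicing, label slices include both endpoints. A minimal sketch on a hypothetical small frame:

```python
import pandas as pd

# Small hypothetical frame to show that .loc slices are inclusive on BOTH ends
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]})

sub = df.loc[1:3, "A":"B"]  # rows labeled 1, 2 AND 3; columns A through B
print(len(sub))  # 3 rows — the end label 3 is included, unlike df.iloc[1:3]
```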
Fifth. Filter on logical conditions
This is also a feature of DataFrame.loc:
loc accepts a boolean condition and keeps the rows where it evaluates to True.
```python
DataDF.loc[:, 'UnitPrice'] > 0
```
![UnitPrice > 0](https://pic3.zhimg.com/80/v2-77f3843482b8cc62918c5b183126e6fe_hd.jpg)
Generally speaking, a price cannot be negative, so logically any row with a price below 0 should be filtered out.
```python
# Remove outliers: filter the data with a condition
# Build the query condition
querySer = DataDF.loc[:, 'Quantity'] > 0
# Apply the query condition
print('Before deleting outliers:', DataDF.shape)
DataDF = DataDF.loc[querySer, :]
print('After deleting outliers:', DataDF.shape)
```
![After deleting outliers](https://pic3.zhimg.com/80/v2-1e0b67efb65821fae3e43160a304c706_hd.jpg)
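Several conditions can also be combined into one boolean mask with `&` (each condition wrapped in parentheses). A minimal sketch on a hypothetical miniature of the sales data, filtering on quantity and price at once:

```python
import pandas as pd

# Hypothetical miniature of the sales data
df = pd.DataFrame({"Quantity": [6, -2, 3], "UnitPrice": [2.5, 1.0, -0.5]})

# Combine filters with & — each condition must be wrapped in parentheses
querySer = (df["Quantity"] > 0) & (df["UnitPrice"] > 0)
clean = df.loc[querySer, :]
print(clean.shape)  # (1, 2) — only the fully valid row survives
```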
Sixth. Format consistency
1. Unify capitalization / remove spaces
Convert everything in the Description column to uppercase:
```python
DataDF['Description'] = DataDF['Description'].str.upper()
DataDF.head()
```
2. Remove string symbols
3. Split on whitespace
Finally, assign the result back to the column.
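The three steps above can be chained into one expression. A minimal sketch on hypothetical Description values (not the real column), using `str.replace` with a regex to drop punctuation and `str.split`/`str.join` to collapse runs of whitespace:

```python
import pandas as pd

# Hypothetical Description values with mixed case, stray symbols and extra spaces
desc = pd.Series([" white hanging heart. ", "RED  WOOLLY   HOTTIE!!"])

cleaned = (
    desc.str.upper()                              # 1. unify capitalization
        .str.replace(r"[^\w\s]", "", regex=True)  # 2. drop punctuation symbols
        .str.split()                              # 3. split on runs of whitespace
        .str.join(" ")                            # rejoin with single spaces
)
print(cleaned.tolist())  # ['WHITE HANGING HEART', 'RED WOOLLY HOTTIE']
# Finally assign the result back, e.g. DataDF['Description'] = cleaned
```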
Seventh. Handle missing values
1. Remove missing values
```python
# Reminder: check for missing data again
DataDF.isnull().sum().sort_values(ascending=False)
```
Key knowledge point for removing missing values: DataFrame.dropna
```python
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# By default (axis=0), rows containing missing values are removed;
# set axis=1 to remove columns containing missing values instead
# how='any': remove the whole row (or column) if any of its values is NaN
# how='all': remove the row (or column) only if every value in it is NaN
DataDF.dropna(how='any')
DataDF.dropna(how='all')
# The finer-grained thresh parameter sets how many non-missing values
# a row (or column) must have in order to be kept
DataDF.dropna(thresh=6)
```
2. Fill in missing content. Some missing values can be filled in; there are four methods:
1) Fill missing values inferred from business knowledge or experience (a default value)
2) Fill missing values with a statistic of the same column (mean, median, mode, etc.)
3) Fill missing values with adjacent values
4) Fill missing values computed from other columns
Key knowledge point for filling missing values: DataFrame.fillna
1) Fill with a default value: df.fillna('')
We should replace those unfriendly NaN values. But with what? Here you need to combine your understanding of the data set to decide what value is most appropriate.
In this data set we can roughly judge that CustomerID is less important, so an empty string "" or another default value will do.
```python
DataDF.Country = DataDF.Country.fillna('Not Given')
```
2) Fill in the missing values with a statistic of the same column (mean, median, mode, etc.)
Mean: df.fillna(df.mean())
For numeric data, this can reduce the error introduced by filling.
Take the price in this case: filling with 0 or "Not Given" is inappropriate, but an approximate price can be estimated from the other rows.
```python
DataDF.UnitPrice = DataDF.UnitPrice.fillna(DataDF.UnitPrice.mean())
```
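The mean is sensitive to outliers, so the median is often the safer statistic for skewed data like prices. A minimal sketch on hypothetical values (not the real UnitPrice column):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with a missing value and one extreme outlier
price = pd.Series([1.0, 2.0, 3.0, np.nan, 100.0])

# The median is more robust to outliers than the mean
filled_mean = price.fillna(price.mean())      # mean = 26.5, pulled up by the 100.0
filled_median = price.fillna(price.median())  # median = 2.5
print(filled_mean[3], filled_median[3])
```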
3) Another common method is to fill with adjacent values.
This is quite common in time-series analysis: a gap can be filled forward with the previous value, or backward with the next value.
```python
print(DataDF)
print(DataDF.UnitPrice.fillna(method='ffill'))  # forward fill: propagate the previous value
print(DataDF.UnitPrice.fillna(method='bfill'))  # backward fill: propagate the next value
```
4) Fill in missing values computed from other columns.
With this method, for example, a missing age field can be computed from a partially masked ID number that still exposes the birth date.
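A minimal sketch of this idea, under two stated assumptions: the IDs follow the 18-digit Chinese layout where characters 7–14 encode the birth date (YYYYMMDD) and the year digits are left unmasked, and the reference year is taken to be 2020. The ID values are hypothetical.

```python
import pandas as pd

# Hypothetical masked ID numbers: assume the birth-year digits (positions 7-10
# in an 18-digit Chinese ID) are visible while other digits are masked with '*'
ids = pd.Series(["11****198509******", "32****199212******"])

birth_year = ids.str[6:10].astype(int)  # extract the YYYY part of the birth date
age = 2020 - birth_year                 # assuming 2020 as the reference year
print(age.tolist())  # [35, 28]
```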