[Data cleansing]-clean "dirty" data in Pandas (3) and clean pandas
Preview Data
This time, we use Artworks.csv, And we select 100 rows of data to complete this content. Procedure:
DataFrame is the built-in data display structure of Pandas, and the display speed is very fast. With DataFrame, we can quickly preview and analyze data. The Code is as follows:
import pandas as pddf = pd.read_csv('../data/Artworks.csv').head(100)df.head(10)
Statistical date data
Let's take a closer look at the data in the Date column. Some data is in the year range (1976-1977), rather than a separate year. When we use year data to draw images, we cannot draw images as easily as a separate year. Now we will use value_counts () of Pandas to count the quantity of each data type.
First, select the column to be counted and call value_counts ():
df['Date'].value_counts()
Date data problems
Date column data, except that the Year is a range, there are three abnormal formats. Below we will list these types:
- Question 1, time range (1976-77)
- Question 2: Estimation (c. Before and after)
- Question 3: Unknown)
- Question 4: meaningless data (n. d .)
Next, we will deal with each of the above problems and use Pandas to convert these irregular data into a unified format.
Problem 1 and problem 2 are that only the format of the data is incorrect. Problem 3 and Problem 4 are not actually valid data. For the first two problems, we can format the data in the code to achieve the goal of cleaning. However, for the last two problems, the code can only treat them as missing values. For simplicity, We will process problem 3 and 4 as 0.
Troubleshooting Question 1
The data in question 1 is within two years. We select one year as the cleaned data. For the sake of simplicity, we use the start time to replace the problematic data, because this time is a four-digit number. If we want to use the end year, we need to add the first two digits.
First, we need to find the data for problem 1 so that we can update it. Ensure that other data is not updated, because other data may have been formatted or processed below.
The time range of the data to be processed, including "-", so that we can filter the data to be processed through this special string, and then use split () use "-" to split the data and use the first part of the result as the final result for processing.
The Code is as follows:
row_with_dashes = df['Date'].str.contains('-').fillna(False)for i, dash in df[row_with_dashes].iterrows(): df.at[i,'Date'] = dash['Date'][0:4]df['Date'].value_counts()
Problem 2
The data in question 2 reflects the inaccuracy of the data itself. It is an estimated year time. If we convert it into a year, we only need to keep the last four digits, the feature of this data is that the data contains "c", so that we can filter the data to be converted through this feature.
row_with_cs = df['Date'].str.contains('c').fillna(False)for i,row in df[row_with_cs].iterrows(): df.at[i,'Date'] = row['Date'][-4:]df[row_with_cs]
Issue 3 and 4
Assign the value of the data in question 3 and 4 to the initial value 0.
df['Date'] = df['Date'].replace('Unknown','0',regex=True)df['Date'] = df['Date'].replace('n.d.','0',regex=True)df['Date']
Code Integration
mport pandas as pddf = pd.read_csv('../data/Artworks.csv').head(100)df.head(10)df['Date'].value_counts()row_with_dashes = df['Date'].str.contains('-').fillna(False)for i, dash in df[row_with_dashes].iterrows(): df.at[i,'Date'] = dash['Date'][0:4]df['Date'].value_counts()row_with_cs = df['Date'].str.contains('c').fillna(False)for i,row in df[row_with_cs].iterrows(): df.at[i,'Date'] = row['Date'][-4:]df['Date'].value_counts()df['Date'] = df['Date'].replace('Unknown','0',regex=True)df['Date'] = df['Date'].replace('n.d.','0',regex=True)df['Date'].value_counts()