[Data cleansing]-clean "dirty" data in Pandas (3) and clean pandas

Source: Internet
Author: User

[Data cleansing]-clean "dirty" data in Pandas (3) and clean pandas
Preview Data

This time, we use Artworks.csv, And we select 100 rows of data to complete this content. Procedure:

DataFrame is the built-in data display structure of Pandas, and the display speed is very fast. With DataFrame, we can quickly preview and analyze data. The Code is as follows:

import pandas as pddf = pd.read_csv('../data/Artworks.csv').head(100)df.head(10)

 

 

Statistical date data

Let's take a closer look at the data in the Date column. Some data is in the year range (1976-1977), rather than a separate year. When we use year data to draw images, we cannot draw images as easily as a separate year. Now we will use value_counts () of Pandas to count the quantity of each data type.

First, select the column to be counted and call value_counts ():

df['Date'].value_counts()

 

 

 

Date data problems

Date column data, except that the Year is a range, there are three abnormal formats. Below we will list these types:

  • Question 1, time range (1976-77)
  • Question 2: Estimation (c. Before and after)
  • Question 3: Unknown)
  • Question 4: meaningless data (n. d .)

Next, we will deal with each of the above problems and use Pandas to convert these irregular data into a unified format.

Problem 1 and problem 2 are that only the format of the data is incorrect. Problem 3 and Problem 4 are not actually valid data. For the first two problems, we can format the data in the code to achieve the goal of cleaning. However, for the last two problems, the code can only treat them as missing values. For simplicity, We will process problem 3 and 4 as 0.

 

Troubleshooting Question 1

The data in question 1 is within two years. We select one year as the cleaned data. For the sake of simplicity, we use the start time to replace the problematic data, because this time is a four-digit number. If we want to use the end year, we need to add the first two digits.

First, we need to find the data for problem 1 so that we can update it. Ensure that other data is not updated, because other data may have been formatted or processed below.

The time range of the data to be processed, including "-", so that we can filter the data to be processed through this special string, and then use split () use "-" to split the data and use the first part of the result as the final result for processing.

The Code is as follows:

row_with_dashes = df['Date'].str.contains('-').fillna(False)for i, dash in df[row_with_dashes].iterrows():    df.at[i,'Date'] = dash['Date'][0:4]df['Date'].value_counts()

 

 

Problem 2

The data in question 2 reflects the inaccuracy of the data itself. It is an estimated year time. If we convert it into a year, we only need to keep the last four digits, the feature of this data is that the data contains "c", so that we can filter the data to be converted through this feature.

row_with_cs = df['Date'].str.contains('c').fillna(False)for i,row in df[row_with_cs].iterrows():    df.at[i,'Date'] = row['Date'][-4:]df[row_with_cs]

 

 

Issue 3 and 4

Assign the value of the data in question 3 and 4 to the initial value 0.

df['Date'] = df['Date'].replace('Unknown','0',regex=True)df['Date'] = df['Date'].replace('n.d.','0',regex=True)df['Date']

 

 

Code Integration
mport pandas as pddf = pd.read_csv('../data/Artworks.csv').head(100)df.head(10)df['Date'].value_counts()row_with_dashes = df['Date'].str.contains('-').fillna(False)for i, dash in df[row_with_dashes].iterrows():    df.at[i,'Date'] = dash['Date'][0:4]df['Date'].value_counts()row_with_cs = df['Date'].str.contains('c').fillna(False)for i,row in df[row_with_cs].iterrows():    df.at[i,'Date'] = row['Date'][-4:]df['Date'].value_counts()df['Date'] = df['Date'].replace('Unknown','0',regex=True)df['Date'] = df['Date'].replace('n.d.','0',regex=True)df['Date'].value_counts()

 

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.