[Data cleansing]-clean "dirty" data in Pandas (3) and clean pandas

Last Update:2018-01-05 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

[Data cleansing]-clean "dirty" data in Pandas (3) and clean pandas
Preview Data

This time, we use Artworks.csv, And we select 100 rows of data to complete this content. Procedure:

DataFrame is the built-in data display structure of Pandas, and the display speed is very fast. With DataFrame, we can quickly preview and analyze data. The Code is as follows:

import pandas as pddf = pd.read_csv('../data/Artworks.csv').head(100)df.head(10)

Statistical date data

Let's take a closer look at the data in the Date column. Some data is in the year range (1976-1977), rather than a separate year. When we use year data to draw images, we cannot draw images as easily as a separate year. Now we will use value_counts () of Pandas to count the quantity of each data type.

First, select the column to be counted and call value_counts ():

df['Date'].value_counts()

Date data problems

Date column data, except that the Year is a range, there are three abnormal formats. Below we will list these types:

Question 1, time range (1976-77)
Question 2: Estimation (c. Before and after)
Question 3: Unknown)
Question 4: meaningless data (n. d .)

Next, we will deal with each of the above problems and use Pandas to convert these irregular data into a unified format.

Problem 1 and problem 2 are that only the format of the data is incorrect. Problem 3 and Problem 4 are not actually valid data. For the first two problems, we can format the data in the code to achieve the goal of cleaning. However, for the last two problems, the code can only treat them as missing values. For simplicity, We will process problem 3 and 4 as 0.

Troubleshooting Question 1

The data in question 1 is within two years. We select one year as the cleaned data. For the sake of simplicity, we use the start time to replace the problematic data, because this time is a four-digit number. If we want to use the end year, we need to add the first two digits.

First, we need to find the data for problem 1 so that we can update it. Ensure that other data is not updated, because other data may have been formatted or processed below.

The time range of the data to be processed, including "-", so that we can filter the data to be processed through this special string, and then use split () use "-" to split the data and use the first part of the result as the final result for processing.

The Code is as follows:

row_with_dashes = df['Date'].str.contains('-').fillna(False)for i, dash in df[row_with_dashes].iterrows():    df.at[i,'Date'] = dash['Date'][0:4]df['Date'].value_counts()

Problem 2

The data in question 2 reflects the inaccuracy of the data itself. It is an estimated year time. If we convert it into a year, we only need to keep the last four digits, the feature of this data is that the data contains "c", so that we can filter the data to be converted through this feature.

row_with_cs = df['Date'].str.contains('c').fillna(False)for i,row in df[row_with_cs].iterrows():    df.at[i,'Date'] = row['Date'][-4:]df[row_with_cs]

Issue 3 and 4

Assign the value of the data in question 3 and 4 to the initial value 0.

df['Date'] = df['Date'].replace('Unknown','0',regex=True)df['Date'] = df['Date'].replace('n.d.','0',regex=True)df['Date']

Code Integration

mport pandas as pddf = pd.read_csv('../data/Artworks.csv').head(100)df.head(10)df['Date'].value_counts()row_with_dashes = df['Date'].str.contains('-').fillna(False)for i, dash in df[row_with_dashes].iterrows():    df.at[i,'Date'] = dash['Date'][0:4]df['Date'].value_counts()row_with_cs = df['Date'].str.contains('c').fillna(False)for i,row in df[row_with_cs].iterrows():    df.at[i,'Date'] = row['Date'][-4:]df['Date'].value_counts()df['Date'] = df['Date'].replace('Unknown','0',regex=True)df['Date'] = df['Date'].replace('n.d.','0',regex=True)df['Date'].value_counts()

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

[Data cleansing]-clean "dirty" data in Pandas (3) and clean pandas

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

[Data cleansing]-clean "dirty" data in Pandas (3) and clean pandas

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support