[Data cleansing] Cleaning data that only looks like numbers

Source: Internet
Author: User

Incorrect data (wrong formats, inaccurate values, and missing values) is a fact of life. Data cleansing is the first step in data analysis and also the most time-consuming one. It is tedious, but the more sophisticated your data cleansing techniques become, the more likely you are to get useful information out of other people's documents.

This time I want to discuss data that appears to be correct numerical data, but that humans and machines read differently.

After Pandas loads data, we usually preview it with head(). The data may look fine, but the way it is displayed can be misleading. In Python, 2 is a number and "2" is a string. They are different data types, yet both can be used with arithmetic operators.
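A minimal sketch of the difference. The variable names are hypothetical and chosen only for illustration:

```python
# A number and a string that merely looks like a number
two_num = 2
two_char = "2"

print(type(two_num).__name__)   # int
print(type(two_char).__name__)  # str

# Both accept the * operator, with very different results
num_result = two_num * 2    # arithmetic multiplication
char_result = two_char * 2  # string repetition

print(num_result)   # 4
print(char_result)  # 22
```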

 

People who are new to Python may be confused by this. Is it an unexpected result? "2" * 2 gives "22", while 2 * 2 gives 4. This is not limited to *. In Python, + also works for both types, as long as the operands on both sides have the same data type. Note: if you add a numeric string and a number, the error "TypeError: must be str, not int" appears.
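The same point can be sketched for +. The exact wording of the error message depends on the Python version (newer versions phrase it differently from the message quoted above), so this sketch captures the message rather than assuming its text:

```python
# + adds numbers and concatenates strings
num_sum = 2 + 2
str_sum = "2" + "2"
print(num_sum)  # 4
print(str_sum)  # 22

# Mixing a string and a number raises TypeError
try:
    mixed = "2" + 2
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```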

 

The "*" and "+" operators are flexible, and this behavior is not hard to understand. It is simply a decision made by the language designers, who chose to use the same operator to concatenate strings and to add numbers. Next, we will create some data in a DataFrame that looks like numerical data.

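A sketch of such a DataFrame, using the same values as the integrated code at the end of this post. Row data1 holds real integers, while row data2 holds strings:

```python
import pandas as pd

# data1 contains numbers, data2 contains strings that look like numbers
df = pd.DataFrame(
    [[1, 2, 3, 4, 16], ["1", "2", "3", "4", "F"]],
    index=["data1", "data2"],
)

# In the printed table, the quotes around the strings are not shown
print(df)
```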
 

Judging from the output, the data appears to be numerical. Next, we will perform a simple analysis. Suppose the requirement is to multiply every value by ten.
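A sketch of that step, rebuilding the sample frame so that the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 16], ["1", "2", "3", "4", "F"]],
    index=["data1", "data2"],
)

# Multiply every value by ten, column by column
result = df.apply(lambda x: x * 10)
print(result)

# Numbers are multiplied, strings are repeated ten times
print(result.loc["data1", 0])  # 10
print(result.loc["data2", 0])  # 1111111111
```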

 

The results differ from what we assumed. The data in row data2 looks numerical, but the result shows that it does not behave like numbers. What we need to know now is the data type of each column. Pandas provides the dtypes attribute for viewing the data type of each column in a DataFrame.
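A sketch of the check on the same sample frame. Each column mixes an integer and a string, so Pandas falls back to the generic object type:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 16], ["1", "2", "3", "4", "F"]],
    index=["data1", "data2"],
)

# One dtype per column
print(df.dtypes)

all_object = all(dtype == object for dtype in df.dtypes)
print(all_object)  # True
```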

 

Pandas did not recognize the data as numerical: every column is of the object type. Therefore, data cleansing is necessary before data analysis starts. Pandas provides a function for converting values to a numerical type, to_numeric(). Let us try to convert the data in row data2 to a numerical type.
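A sketch of this first attempt. The try/except wrapper is an addition for illustration, so that the snippet keeps running after the failure:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 16], ["1", "2", "3", "4", "F"]],
    index=["data1", "data2"],
)

# With the default settings, a value that cannot be parsed raises ValueError
try:
    pd.to_numeric(df.loc["data2"])
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print(exc)

print(conversion_failed)  # True
```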

 

The conversion failed. to_numeric() cannot convert the string "F" to a numerical type, and we did not handle this case in the code, so an exception is thrown. to_numeric() has an optional parameter, errors. When you pass errors='coerce', Pandas assigns NaN (Not a Number) to any value that cannot be converted.
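A sketch of the conversion with errors='coerce', again on the sample frame:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 16], ["1", "2", "3", "4", "F"]],
    index=["data1", "data2"],
)

# Values that cannot be parsed become NaN instead of raising an exception
converted = pd.to_numeric(df.loc["data2"], errors="coerce")
print(converted)

print(converted.iloc[:4].tolist())  # [1.0, 2.0, 3.0, 4.0]
print(pd.isna(converted.iloc[4]))   # True
```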

 

Judging from the result, every value has been converted to the corresponding number except "F", which became NaN. Now we run the multiplication by ten again.
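A sketch of the repeated calculation. The converted row is written back into the frame first, as in the integrated code at the end of this post:

```python
import math

import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 16], ["1", "2", "3", "4", "F"]],
    index=["data1", "data2"],
)

# Replace row data2 with its numerical version, then multiply by ten
df.loc["data2"] = pd.to_numeric(df.loc["data2"], errors="coerce")
df = df.apply(lambda x: x * 10)
print(df)

data1 = list(df.loc["data1"])
data2 = list(df.loc["data2"])
print(data2[:4])             # [10.0, 20.0, 30.0, 40.0]
print(math.isnan(data2[4]))  # True, NaN stays NaN
```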

 

Next, we look at the data types again.
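A sketch of one way to check. Because each column of this sample frame holds one value from data1 and one from data2, df.dtypes may still report object for the columns, so this sketch also inspects the converted row directly. The all_floats check is an addition for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 16], ["1", "2", "3", "4", "F"]],
    index=["data1", "data2"],
)

# dtype of the converted row on its own
converted = pd.to_numeric(df.loc["data2"], errors="coerce")
print(converted.dtype)  # float64

df.loc["data2"] = converted
df = df.apply(lambda x: x * 10)

# Column dtypes as reported by Pandas
print(df.dtypes)

# Check the values actually stored in row data2
all_floats = all(isinstance(v, float) for v in df.loc["data2"])
print(all_floats)  # True
```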

 

Now the data is what we expected. This post uses lambda. If you would like an article about lambda, please leave a comment so that I can plan time for it.

Integrated code

 

# Demonstrate the difference between numbers and strings
two_char = '2'
two_num = 2

def double(x):
    return x * 2

print('char: {}'.format(double(two_char)))
print('num: {}'.format(double(two_num)))
print('text: {}'.format(double('test text end')))

# Error: incompatible types (left commented out so that the script runs)
# print("2" + 2)

# Simulate data
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 16], ['1', '2', '3', '4', 'F']], index=['data1', 'data2'])
print(df)

# Multiply by ten and compare the result with what was expected
print(df.apply(lambda x: x * 10))

# View the data types
print(df.dtypes)

# Raises an exception because "F" cannot be converted
# df.loc['data2'] = pd.to_numeric(df.loc['data2'])

# Convert the values that can be converted; the rest become NaN (Not a Number)
df.loc['data2'] = pd.to_numeric(df.loc['data2'], errors='coerce')

# View the converted result
print(df.loc['data2'])

# Calculate again and compare the result with what was expected
df = df.apply(lambda x: x * 10)
print(df)

# View the data types
print(df.dtypes)
