Some tips for data type conversion in pandas

Source: Internet
Author: User
This article introduces some techniques for converting data types in pandas. It is shared here as a reference for anyone who needs it.

Preface

pandas is an important data analysis tool in Python. When you analyze data with pandas, it is important to make sure the correct data types are used; otherwise, unpredictable errors may occur.

pandas data types: a data type is essentially an internal construct that a programming language uses to understand how to store and manipulate data. For example, a program needs to understand that two numbers can be added together, such as 5 + 10 to get 15. Or, if you have two strings, such as "cat" and "hat", you can concatenate (add) them to get "cathat". One possible source of confusion about pandas data types is that there is some overlap between the data types of pandas, Python, and NumPy.

In most cases, you don't have to worry about whether you should explicitly cast a pandas type to the corresponding NumPy type. In general, you can use pandas's default int64 and float64. The only reason I'm listing this table is that you will sometimes see NumPy types in code or in your own analysis.
Data types are one of those things you don't care about until you hit an error or an unexpected result. But when you load new data into pandas for further analysis, they are the first thing you should check.

I have been using pandas for some time, but I still make mistakes on minor issues. They usually trace back to feature columns whose types pandas cannot handle when manipulating the data. So this article discusses some tips for turning Python's basic data types into data types that pandas can handle.

Data types supported by pandas, NumPy, and Python


As the table above shows, pandas supports the richest set of data types. In some cases pandas data types can be converted to NumPy data types; after all, pandas is built on top of NumPy.
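As a rough illustration of how these types line up, the following sketch builds a small DataFrame from plain Python values and prints the dtype pandas assigns to each column. The column names and values are hypothetical and are not taken from the article's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only to get one column per common dtype.
df = pd.DataFrame({
    "name": ["cat", "hat"],          # Python str
    "count": [5, 10],                # Python int
    "price": [2.5, 3.0],             # Python float
    "active": [True, False],         # Python bool
    "when": pd.to_datetime(["2017-01-01", "2018-01-01"]),  # timestamps
})

# One dtype per column; text columns are reported as object in the
# pandas versions this article targets.
print(df.dtypes)

# The numeric columns are backed by NumPy dtypes.
print(df["count"].dtype == np.int64)
print(df["price"].dtype == np.float64)
```

Checking `df.dtypes` like this right after loading data is usually the quickest way to see which columns need conversion.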

Introducing real data for analysis

Data types are something you might not normally care about until you get wrong results, so here is a real-world data analysis example to deepen your understanding.

import numpy as np
import pandas as pd

data = pd.read_csv('data.csv', encoding='gbk')  # the data contains Chinese text
data


The data is loaded. Suppose you now want to do something with it, for example add the 2016 and 2017 columns together.

data['2016'] + data['2017']  # the naive approach


The result is not the numeric sum you might expect, because adding object-type columns in pandas is equivalent to concatenating strings in Python.
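The effect can be reproduced with a minimal sketch. The amounts below are hypothetical stand-ins for currency values stored as text; they are not the article's actual data.

```python
import pandas as pd

# Hypothetical sales figures stored as text, as after a naive CSV load.
sales = pd.DataFrame({
    "2016": ["¥125,000.00", "¥920,000.00"],
    "2017": ["¥162,500.00", "¥1,012,000.00"],
})

# "+" on text columns joins the strings element by element
# instead of adding numbers.
combined = sales["2016"] + sales["2017"]
print(combined.tolist())
```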

data.info()  # inspect the loaded data before processing it


Looking at the information about the loaded data, you can spot the following problems:

    • The customer number column is int64 when it should be object

    • The 2016 and 2017 columns are object when they should be numeric (int64, float64)

    • The growth rate and owning group columns should be numeric instead of object

    • The year, month, and day columns should be datetime64 instead of object
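A check of this kind can be sketched as follows. The two-row CSV text is hypothetical and only mimics the shape of the article's file, using the translated column names; `pd.api.types.is_numeric_dtype` is used to list the columns pandas did not parse as numbers.

```python
import io
import pandas as pd

# Hypothetical two-row extract; not the article's real data.csv.
csv_text = (
    "customer number,2016,growth rate,state\n"
    '10002,"¥125,000.00",30.00%,Y\n'
    '552278,"¥920,000.00",10.00%,N\n'
)
raw = pd.read_csv(io.StringIO(csv_text))

raw.info()  # prints column names, non-null counts, and dtypes

# Columns that pandas could not parse as numbers.
non_numeric = [
    col for col in raw.columns
    if not pd.api.types.is_numeric_dtype(raw[col])
]
print(non_numeric)
```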

There are three basic approaches to data type conversion in pandas:

    • Use the astype() function to force a type conversion

    • Write a custom function for the conversion

    • Use helper functions provided by pandas, such as to_numeric() and to_datetime()

Type conversion using the astype() function

The simplest way to convert the data type of a column is to use the astype() function.

data['customer number'].astype('object')
data['customer number'] = data['customer number'].astype('object')  # convert and overwrite the original column

The result above looks good. Next are a few examples where astype() is applied to a column but fails.

data['2016'].astype('float')

data['owning group'].astype('int')
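Both failures can be reproduced with a short sketch. The two Series below are hypothetical stand-ins for the 2016 and owning group columns, built with object dtype to match the situation described in the article.

```python
import pandas as pd

# Hypothetical columns: currency text and a group column with a bad value.
amounts = pd.Series(["¥125,000.00", "¥920,000.00"], dtype=object)
groups = pd.Series(["1", "2", "errorvalue"], dtype=object)

errors = []
for series, target in [(amounts, "float"), (groups, "int")]:
    try:
        series.astype(target)
    except ValueError as exc:
        # astype() raises as soon as one value cannot be converted.
        errors.append(str(exc))
        print(f"astype({target!r}) failed: {exc}")
```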


As the two examples above show, astype() fails when the column to be converted contains values that cannot be converted (¥, errorvalue, and so on in these examples). And sometimes astype() runs successfully, but that does not necessarily mean the result is what you expected (a nasty trap!).

data['state'].astype('bool')


At first glance the result looks fine, but on closer inspection there is a big problem: every value has become True, even though the column contains several N flags. This happens because any non-empty string is truthy when converted to bool, so astype() is not a valid conversion for this column.
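A minimal sketch of the pitfall, using a hypothetical three-value column of flags with object dtype:

```python
import pandas as pd

# Hypothetical Y/N flags stored as text.
state = pd.Series(["Y", "N", "Y"], dtype=object)

# The conversion runs without error, but every non-empty string is truthy.
as_bool = state.astype("bool")
print(as_bool.tolist())

# Plain Python behaves the same way for a single value.
print(bool("N"))
```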

To summarize, astype() works in the following cases:

    • Every value in the column can be simply interpreted as a number (2, 2.12, etc.)

    • Every value in the column is numeric and is being converted to the string (object) type

If the data contains missing values or special characters, astype() may fail.
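The cases above can be sketched with a few hypothetical Series: clean numeric strings, integer IDs converted to text, and a column with a missing value that cannot be cast to an integer type.

```python
import pandas as pd

# Case 1: every value is a clean numeric string (hypothetical values).
numeric_text = pd.Series(["2", "2.12", "10"], dtype=object)
as_float = numeric_text.astype("float")

# Case 2: numeric values converted to strings.
ids = pd.Series([10002, 552278])
as_text = ids.astype(str)

# Failure case: a missing value cannot be represented as a plain int.
with_gap = pd.Series([1.0, None])
try:
    with_gap.astype("int")
    missing_ok = True
except ValueError as exc:
    missing_ok = False
    print(f"astype('int') failed: {exc}")
```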

Using a custom function for data type conversion

This approach is especially suitable when the column to be converted contains messy data: you write a function, apply it to every value in the column, and convert each value to the appropriate data type.

For the currency values in the data above, which need to be converted to float, you can write a conversion function:

def convert_currency(value):
    """
    Convert a numeric string to float
    - remove ¥ and ,
    - convert to float
    """
    new_value = value.replace(',', '').replace('¥', '')
    return float(new_value)

You can now use pandas's apply function to apply convert_currency to every value in the 2016 column.

data['2016'].apply(convert_currency)


All the values in this column are converted to a numeric type, so you can now perform ordinary mathematical operations on the column. Rewriting the code with a lambda expression is shorter but less friendly to beginners.

data['2016'].apply(lambda x: x.replace('¥', '').replace(',', '')).astype('float')

When a function needs to be applied to more than one column, the first approach is recommended. Another benefit of defining the function first is that it can be used with the read_csv() function (described later).

# Complete conversion code for the 2016 and 2017 columns
data['2016'] = data['2016'].apply(convert_currency)
data['2017'] = data['2017'].apply(convert_currency)
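As a self-contained sketch of the whole step, the following applies the conversion function to two hypothetical currency columns and then adds them. The values are made up for illustration.

```python
import pandas as pd

def convert_currency(value):
    """Strip ¥ and thousands separators, then convert to float."""
    return float(value.replace(",", "").replace("¥", ""))

# Hypothetical amounts in the same text format as the 2016 and 2017 columns.
sales = pd.DataFrame({
    "2016": ["¥125,000.00", "¥920,000.00"],
    "2017": ["¥162,500.00", "¥1,012,000.00"],
})

for column in ["2016", "2017"]:
    sales[column] = sales[column].apply(convert_currency)

# With float columns, "+" is now a numeric sum rather than concatenation.
total = sales["2016"] + sales["2017"]
print(total.tolist())
```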

The same approach applies to the growth rate column. First, build a custom function:

def convert_percent(value):
    """
    Convert a percentage string to a float decimal
    - remove %
    - divide by 100 to get a decimal
    """
    new_value = value.replace('%', '')
    return float(new_value) / 100

Then use pandas's apply function to apply convert_percent to every value in the growth rate column.

data['growth rate'].apply(convert_percent)

Using a lambda expression:

data['growth rate'].apply(lambda x: x.replace('%', '')).astype('float') / 100

The results are the same:
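The equivalence of the two forms can be checked with a small sketch. The growth rates below are hypothetical values in the same "30.00%" text format.

```python
import pandas as pd

def convert_percent(value):
    """Strip the % sign and convert the percentage to a decimal fraction."""
    return float(value.replace("%", "")) / 100

# Hypothetical growth rates stored as text.
growth = pd.Series(["30.00%", "10.00%", "25.00%"])

with_function = growth.apply(convert_percent)
with_lambda = growth.apply(lambda x: x.replace("%", "")).astype("float") / 100

# Both approaches produce the same float64 Series.
print(with_function.equals(with_lambda))
```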


To convert the state column, you can use NumPy's where function to map the value Y to True and every other value to False.

data['state'] = np.where(data['state'] == 'Y', True, False)

You could also solve this with a custom function or a lambda expression; this is just one way of approaching it.
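A minimal sketch of this mapping on a hypothetical column of flags, alongside the plain comparison it is built on:

```python
import numpy as np
import pandas as pd

# Hypothetical flags, including a stray lowercase value.
state = pd.Series(["Y", "N", "Y", "n"])

# np.where returns a NumPy boolean array: True only where the value is "Y".
flags = np.where(state == "Y", True, False)

# The comparison on its own already yields a boolean Series.
same = state == "Y"

print(flags.tolist())
print(same.tolist())
```

Note that only an exact "Y" maps to True here; any other spelling falls through to False.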

Using pandas helper functions for type conversion

Between the astype() function and a full custom function there is a middle ground: the helper functions that pandas provides. These helpers, such as to_numeric() and to_datetime(), are useful for converting certain data types. The owning group column contains a non-numeric value, so converting it with astype() raises an error, but handling it with to_numeric() is much more elegant.

pd.to_numeric(data['owning group'], errors='coerce').fillna(0)


As you can see, non-numeric values are replaced with 0.0. You can of course choose a different fill value; see the documentation:
pandas.to_numeric, pandas 0.22.0 documentation
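The two steps can be seen separately in a short sketch. The Series is a hypothetical stand-in for the owning group column.

```python
import pandas as pd

# Hypothetical group column with one value that is not a number.
groups = pd.Series(["1", "2", "errorvalue"], dtype=object)

# errors="coerce" turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(groups, errors="coerce")

# fillna() then replaces the NaN with the chosen fill value.
filled = coerced.fillna(0)

print(coerced.isna().tolist())
print(filled.tolist())
```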

The to_datetime() function in pandas can combine the three separate year, month, and day columns into a single timestamp.

pd.to_datetime(data[['day', 'month', 'year']])

To complete the replacement of the data columns:

data['new_date'] = pd.to_datetime(data[['day', 'month', 'year']])  # a newly created column
data['owning group'] = pd.to_numeric(data['owning group'], errors='coerce').fillna(0)
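A self-contained sketch of the date assembly, assuming the columns are named year, month, and day (to_datetime() relies on these names to know which part is which). The dates are hypothetical.

```python
import pandas as pd

# Hypothetical date parts in three separate integer columns.
parts = pd.DataFrame({
    "year": [2015, 2014],
    "month": [1, 6],
    "day": [10, 15],
})

# The column order does not matter; the column names identify each part.
dates = pd.to_datetime(parts[["day", "month", "year"]])
print(dates.dt.strftime("%Y-%m-%d").tolist())
```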

All the data columns are converted and the final data is displayed:


Converting data types while reading the data, in one step

data2 = pd.read_csv("data.csv",
                    converters={
                        'customer number': str,
                        '2016': convert_currency,
                        '2017': convert_currency,
                        'growth rate': convert_percent,
                        'owning group': lambda x: pd.to_numeric(x, errors='coerce'),
                        'state': lambda x: np.where(x == "Y", True, False)
                    },
                    encoding='gbk')

This also shows that a custom function is much more convenient here than a lambda expression. (In most cases lambdas are still very concise, and the author likes to use them too.)
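The one-step approach can be tried without a file on disk by reading from an in-memory string. This is a sketch under stated assumptions: the CSV text is hypothetical, the column names are the translated ones used in this article, and the state flag uses a plain comparison in place of np.where so that each cell becomes an ordinary Python bool.

```python
import io
import pandas as pd

def convert_currency(value):
    """Strip ¥ and thousands separators, then convert to float."""
    return float(value.replace(",", "").replace("¥", ""))

def convert_percent(value):
    """Strip the % sign and convert the percentage to a decimal fraction."""
    return float(value.replace("%", "")) / 100

# Hypothetical two-row file; not the article's real data.csv.
csv_text = (
    "customer number,2016,growth rate,owning group,state\n"
    '10002,"¥125,000.00",30.00%,500,Y\n'
    '552278,"¥920,000.00",10.00%,errorvalue,N\n'
)

# Each converter receives the raw cell text for its column.
data2 = pd.read_csv(
    io.StringIO(csv_text),
    converters={
        "customer number": str,
        "2016": convert_currency,
        "growth rate": convert_percent,
        "owning group": lambda x: pd.to_numeric(x, errors="coerce"),
        "state": lambda x: x == "Y",
    },
)
print(data2.dtypes)
```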

Summary

The first step in working with a dataset is to make sure the correct data types are set; only then can the data be analyzed, visualized, and so on. pandas provides a number of very handy functions for this, and with them the data can be analyzed with ease.
