Python data cleansing: merging, conversion, filtering, and sorting

Source: Internet
Author: User


Previously, we used pandas to perform some basic operations. Next, we look at data operations in more detail.
Data cleansing is an extremely important part of data analysis.

Data Merging

In pandas, you can use merge to merge data.

import numpy as np
import pandas as pd

data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'],
                      'numeber': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'],
                      'numeber': [2, 3, 6, 10]})
print(data1)

Result:

print(data2) 

Result:

print(pd.merge(data1,data2)) 

Result:


By default, merge joins on the columns that the two DataFrames have in common (here both level and numeber) and keeps only the rows whose key values appear in both, so only the row with level b and numeber 3 remains. All other rows are discarded, which is equivalent to an inner join in SQL.
Other join types, such as outer, right, and left, are selected with the how keyword.
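The join types selected by how can be compared side by side. The sketch below is a minimal illustration, not the article's own example: the frames left and right and their value columns are made up here, and on='level' is passed explicitly so that only that column is used as the key.

```python
import pandas as pd

# Hypothetical frames for illustration; only 'level' is shared.
left = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'left_value': [1, 3, 5, 7]})
right = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'right_value': [2, 3, 6, 10]})

inner = pd.merge(left, right, on='level', how='inner')       # keys present in both frames
outer = pd.merge(left, right, on='level', how='outer')       # keys present in either frame
left_join = pd.merge(left, right, on='level', how='left')    # every key from the left frame
right_join = pd.merge(left, right, on='level', how='right')  # every key from the right frame

for name, frame in [('inner', inner), ('outer', outer),
                    ('left', left_join), ('right', right_join)]:
    print(name, sorted(frame['level']))
```

Rows that have no partner in the other frame are kept by the outer, left, or right joins, with NaN in the columns that come from the missing side.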

data3 = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'],
                      'numeber1': [1, 3, 5, 7]})
data4 = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'],
                      'numeber2': [2, 3, 6, 10]})
print(pd.merge(data3, data4, left_on='level1', right_on='level2'))

Result:


If the key columns have different names in the two DataFrames, we can join them by specifying the left_on and right_on parameters.
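One detail worth seeing in code is that a merge with left_on and right_on keeps both key columns in the result. The following is a small sketch under assumed data: the column names value1 and value2 are hypothetical, and dropping the second key column is just one possible clean-up step.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
data3 = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'], 'value1': [1, 3, 5, 7]})
data4 = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'], 'value2': [2, 3, 6, 10]})

# Inner join (the default): only keys found in both frames survive.
merged = pd.merge(data3, data4, left_on='level1', right_on='level2')
print(list(merged.columns))  # both key columns are present

# After an inner join the two key columns hold identical values,
# so one of them can be dropped.
tidy = merged.drop(columns='level2')
print(tidy)
```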

print(pd.merge(data3,data4,left_on='level1',right_on='level2',how='left')) 

Result:


Merge overlapping data

Sometimes two datasets overlap and need to be combined into one. In this case, we can use the combine_first method.

data3 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'],
                      'numeber1': [1, 3, 5, np.nan]})
data4 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'],
                      'numeber2': [2, np.nan, 6, 10]})
print(data3.combine_first(data4))

Result:


For labels that appear in both DataFrames, the values of data3 take priority. Where data3 has a missing value, or lacks a row or column entirely, the corresponding element of data4 is filled in.

This is similar to np.where(pd.isnull(a), b, a).
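The correspondence with np.where is easiest to check on two Series that share the same index. This is a minimal sketch with made-up values; a and b are illustrative names, and the equivalence holds here because both Series have identical labels.

```python
import numpy as np
import pandas as pd

# Illustrative Series with the same index and gaps in different places.
a = pd.Series([1.0, np.nan, 5.0, np.nan], index=['w', 'x', 'y', 'z'])
b = pd.Series([2.0, 4.0, np.nan, 8.0], index=['w', 'x', 'y', 'z'])

# Values of a win; gaps in a are patched from b.
patched = a.combine_first(b)

# The same result written with np.where.
same = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

print(patched.tolist())  # [1.0, 4.0, 5.0, 8.0]
print(patched.equals(same))
```

Unlike np.where, combine_first also aligns on labels, so it still works when the two objects have different indexes or columns.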

Data remodeling and axial rotation

This topic was covered in the previous pandas article. Reshaping an array mainly uses the NumPy reshape function, while pivoting between rows and columns mainly uses the stack and unstack methods.
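As a rough sketch of how the two pivoting methods relate, the block below stacks a small frame into a Series and unstacks it again. The variable names stacked, restored, and unstacked are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['a', 'b', 'c', 'd'],
                    index=['wang', 'li', 'zhang'])

# stack: columns move into an inner index level,
# giving a Series keyed by (row label, column label).
stacked = data.stack()
print(stacked.loc[('li', 'c')])  # 6

# unstack on that Series moves the inner level back into columns.
restored = stacked.unstack()
print(restored.loc['zhang', 'd'])  # 11

# unstack called directly on the DataFrame gives a Series
# keyed by (column label, row label).
unstacked = data.unstack()
print(unstacked.loc[('c', 'li')])  # 6
```

The row order of restored is not guaranteed to match the original frame, so it is safer to look values up by label than by position.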

data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['a', 'b', 'c', 'd'],
                    index=['wang', 'li', 'zhang'])
print(data)

Result:

print(data.unstack()) 

Result:

Data Conversion

Delete duplicate row data

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})
print(data)

Result:

print(data.duplicated()) 

Result:


The duplicated method returns a Boolean value for each row. The third row repeats the second row, so its result is True.
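Which copy gets flagged depends on the keep argument, and the comparison can be limited to some columns with subset. A small sketch using the same data as above:

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

# Default keep='first': later copies are flagged.
first = data.duplicated().tolist()
print(first)  # [False, False, True, False]

# keep='last': earlier copies are flagged instead.
last = data.duplicated(keep='last').tolist()
print(last)   # [False, True, False, False]

# keep=False: every member of a duplicate group is flagged.
every = data.duplicated(keep=False).tolist()
print(every)  # [False, True, True, False]

# subset: compare only the listed columns.
by_a = data.duplicated(subset=['a']).tolist()
print(by_a)   # [False, False, True, False]
```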

In addition, the drop_duplicates method can be used to remove duplicate rows.

print(data.drop_duplicates()) 

Result:

Replacement value

In addition to the fillna method mentioned in the previous article, you can also use the replace method, which is often simpler because it can substitute any value, not only missing ones.

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})
print(data.replace(1, 2))

Result:


Replace several values at once:

print(data.replace([1,4],np.nan)) 
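Besides a list mapped to one value, replace also accepts a pair of lists or a dictionary. The sketch below shows the three forms on the same data; the replacement values are arbitrary and chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

# Several values replaced by one value.
masked = data.replace([1, 4], np.nan)

# Two lists of equal length: values are replaced pairwise (1 -> 10, 4 -> 40).
paired = data.replace([1, 4], [10, 40])

# A dictionary maps each old value to its new value.
mapped = data.replace({3: 30, 5: 50})

print(paired['a'].tolist())  # [10, 3, 3, 40]
print(mapped['b'].tolist())  # [1, 30, 30, 50]
```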

Data Segmentation

data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]
print(data)
print(pd.cut(data, bins))

Result:
[11, 15, 18, 20, 25, 26, 27, 24]
[NaN, NaN, (15, 20], (15, 20], (20, 25], NaN, NaN, (20, 25]]
Categories (2, object): [(15, 20] < (20, 25]]

The result shows the bin assigned to each value. Bins are closed on the right by default, so (15, 20] contains 20 but not 15. Values that fall outside every bin are shown as NaN.
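How the bin edges are treated can be changed with the right and include_lowest parameters of cut. The following sketch compares the three variants on the same data by looking at the integer bin codes, where -1 means the value is in no bin.

```python
import pandas as pd

data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]

# Default: bins (15, 20] and (20, 25], closed on the right.
default = pd.cut(data, bins)

# right=False: bins [15, 20) and [20, 25), closed on the left.
left_closed = pd.cut(data, bins, right=False)

# include_lowest=True: the first bin also contains its left edge, 15.
with_lowest = pd.cut(data, bins, include_lowest=True)

print(default.codes.tolist())      # [-1, -1, 0, 0, 1, -1, -1, 1]
print(left_closed.codes.tolist())  # [-1, 0, 0, 1, -1, -1, -1, 1]
print(with_lowest.codes.tolist())  # [-1, 0, 0, 0, 1, -1, -1, 1]
```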

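print(pd.cut(data, bins).codes)  # older pandas versions called this attribute .labels

Result:

[-1 -1  0  0  1 -1 -1  1]

This shows the integer code of the bin that each value falls into; -1 marks a value outside every bin.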

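print(pd.cut(data, bins).categories)  # older pandas versions called this attribute .levels

Result:

Index(['(15, 20]', '(20, 25]'], dtype='object')

This shows the bin labels. Recent pandas versions store the bins as Interval objects, so the same call prints an IntervalIndex instead.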

print(pd.cut(data, bins).value_counts())

Result:


This shows the number of values in each bin.

In addition, the qcut function cuts the data at sample quantiles, for example into quartiles with qcut(data, 4), so that each bin holds roughly the same number of values. Its usage is similar to cut.
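A short sketch of qcut on the same eight values used for cut. Passing 4 asks for quartiles; a list of quantiles between 0 and 1 can be passed instead when unequal groups are wanted.

```python
import pandas as pd

data = [11, 15, 18, 20, 25, 26, 27, 24]

# Four bins with edges at the 25%, 50%, and 75% sample quantiles.
quartiles = pd.qcut(data, 4)
print(quartiles.codes.tolist())  # [0, 0, 1, 1, 2, 3, 3, 2]

# Count the values per bin, listed in bin order.
counts = pd.Series(quartiles).value_counts().sort_index()
print(counts.tolist())  # [2, 2, 2, 2]

# Explicit quantiles: lower half and upper half.
halves = pd.qcut(data, [0, 0.5, 1.0])
print(halves.codes.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Because the edges come from the data itself, every value lands in some bin, unlike cut with fixed edges.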

Arrangement and sampling

Several methods can sort or rank data, such as sort_values, sort_index, and rank (older pandas versions used sort and order).
Here we look instead at putting data in random order (permutation).

data = np.random.permutation(5)
print(data)

Result:

[1 0 4 2 3]

Here, the permutation function returns the integers 0 to 4 in random order.
You can also sample the data.

df = pd.DataFrame(np.arange(12).reshape(4, 3))
samp = np.random.permutation(3)
print(df)

Result:

print(samp)

Result:
[1 0 2]

print(df.take(samp))

Result:


Here, take returns the rows of df at the positions listed in samp, in that order. Because samp is a permutation of 0 to 2, only the first three of the four rows are selected.
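The same idea extends to shuffling every row or drawing a smaller random subset. The sketch below is one possible way to do it, assuming a seeded NumPy generator for repeatable runs; the seed value and variable names are arbitrary, and DataFrame.sample is shown as a built-in alternative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

# Seeded generator so that repeated runs give the same order.
rng = np.random.default_rng(0)

# A random ordering of all row positions.
order = rng.permutation(len(df))

# Every row, in shuffled order.
shuffled = df.take(order)

# The first two shuffled positions: a sample of two rows without replacement.
subset = df.take(order[:2])

# pandas can also draw the sample directly.
sampled = df.sample(n=2, random_state=0)

print(shuffled)
print(subset)
print(sampled)
```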
