Python data cleansing: data merging, conversion, filtering, and sorting

Last Update:2017-02-15 Source: Internet

Author: User


Previously, we used pandas to perform some basic operations. Next, we look at data operations in more detail.
Data cleansing is an extremely important part of data analysis.

Data Merging

In pandas, you can use merge to merge data.

import numpy as np
import pandas as pd

data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'],
                      'numeber': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'],
                      'numeber': [2, 3, 6, 10]})
print(data1)

Result:

print(data2)

Result:

print(pd.merge(data1,data2))

Result:

When no key is specified, merge joins on every column name the two DataFrames share (here both level and numeber) and keeps only the rows whose key values appear in both; all other rows are discarded. This is equivalent to an inner join in SQL.
Other join types (outer, right, and left) are selected with the how parameter.
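
As a rough sketch of the join types described above, the following reuses data1 and data2 and joins on level only. The on and suffixes arguments and the variable names are choices made for this illustration, not part of the original example.

```python
import pandas as pd

data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'],
                      'numeber': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'],
                      'numeber': [2, 3, 6, 10]})

# No key given: merge joins on both shared columns (level and numeber).
default = pd.merge(data1, data2)

# Join on level only; the clashing numeber columns get the suffixes _1 and _2.
inner = pd.merge(data1, data2, on='level', suffixes=('_1', '_2'))
left = pd.merge(data1, data2, on='level', how='left', suffixes=('_1', '_2'))
right = pd.merge(data1, data2, on='level', how='right', suffixes=('_1', '_2'))
outer = pd.merge(data1, data2, on='level', how='outer', suffixes=('_1', '_2'))

print(default)  # only the row where level and numeber both match
print(inner)    # levels present in both frames
print(left)     # every level of data1; unmatched rows get NaN in numeber_2
print(right)    # every level of data2; unmatched rows get NaN in numeber_1
print(outer)    # every level of either frame
```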

data3 = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'],
                      'numeber1': [1, 3, 5, 7]})
data4 = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'],
                      'numeber2': [2, 3, 6, 10]})
print(pd.merge(data3, data4, left_on='level1', right_on='level2'))

Result:

If the key columns have different names in the two DataFrames, we can join them by specifying the left_on and right_on parameters.

print(pd.merge(data3,data4,left_on='level1',right_on='level2',how='left'))

Result:
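
To see which rows of a left join found no partner, one option is merge's indicator parameter, which adds a _merge column. This is a small sketch built on data3 and data4; the indicator argument and the variable names merged and unmatched are additions for illustration and do not appear in the original article.

```python
import pandas as pd

data3 = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'],
                      'numeber1': [1, 3, 5, 7]})
data4 = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'],
                      'numeber2': [2, 3, 6, 10]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(data3, data4, left_on='level1', right_on='level2',
                  how='left', indicator=True)

# Rows of data3 that have no matching level2 value in data4.
unmatched = merged[merged['_merge'] == 'left_only']

print(merged)
print(unmatched)
```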


Merge overlapping data

Sometimes we encounter overlapping data that needs to be merged. In this case, we can use the combine_first function.

data3 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'],
                      'numeber1': [1, 3, 5, np.nan]})
data4 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'],
                      'numeber2': [2, np.nan, 6, 10]})
print(data3.combine_first(data4))

Result:

You can see that, for positions with the same row and column labels, the values of data3 take priority. Where data3 has a missing value or lacks the column entirely, the value from data4 is filled in.

The usage here is similar to np.where(pd.isnull(a), b, a).
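
The comparison with np.where can be checked on two small Series. This is a minimal sketch; the Series a and b and their index labels are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Two Series with the same index; each has gaps the other can fill.
a = pd.Series([1.0, np.nan, 5.0, np.nan], index=['w', 'x', 'y', 'z'])
b = pd.Series([2.0, 3.0, np.nan, 10.0], index=['w', 'x', 'y', 'z'])

# combine_first keeps a's values and falls back to b where a is missing.
combined = a.combine_first(b)

# The same logic written with np.where: take b where a is null, else a.
by_where = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

print(combined)
print(combined.equals(by_where))
```

Unlike np.where, combine_first aligns the two objects on their labels first, so it also works when the indexes differ.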

Data reshaping and pivoting

This topic was covered in the previous pandas article. Reshaping mainly uses the reshape function, while pivoting between rows and columns mainly uses the unstack and stack functions.

data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['a', 'b', 'c', 'd'],
                    index=['wang', 'li', 'zhang'])
print(data)

Result:

print(data.unstack())

Result:
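
The article only prints unstack, so here is a short sketch that puts stack next to it on the same DataFrame. The variable names stacked, restored, and unstacked are chosen for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['a', 'b', 'c', 'd'],
                    index=['wang', 'li', 'zhang'])

# stack moves the column labels into an inner index level, giving a Series
# indexed by (row label, column label).
stacked = data.stack()

# unstack on that Series moves the inner level back out to the columns.
restored = stacked.unstack()

# unstack on the original DataFrame gives a Series indexed by
# (column label, row label).
unstacked = data.unstack()

print(stacked['li'])                # the four values of row 'li'
print(restored)
print(unstacked.loc[('a', 'li')])   # value in column 'a', row 'li'
```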

Data Conversion

Delete duplicate rows

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})
print(data)

Result:

print(data.duplicated())

Result:

The third row is an exact duplicate of the second row, so duplicated returns True for it and False for the others.

In addition, the drop_duplicates method can be used to remove duplicate rows.

print(data.drop_duplicates())

Result:
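
Both methods take options that control what counts as a duplicate and which copy is kept. The sketch below uses the same data; the keep and subset arguments are not used in the original article and are shown only as an illustration.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})

# By default the first occurrence is kept and later copies are flagged.
flags = data.duplicated()

# keep='last' keeps the last copy of each duplicated row instead.
last_kept = data.drop_duplicates(keep='last')

# subset restricts the comparison to the listed columns.
by_a = data.drop_duplicates(subset=['a'])

print(flags.tolist())            # [False, False, True, False]
print(last_kept.index.tolist())  # [0, 2, 3]
print(by_a.index.tolist())       # [0, 1, 3]
```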

Replacing values

In addition to the fillna method mentioned in the previous article, you can also use the replace method, which is often simpler when specific values need to be swapped.

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})
print(data.replace(1, 2))

Result:

Replace several values at once

print(data.replace([1,4],np.nan))
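
replace also accepts paired lists and dictionaries when each value needs its own replacement. A brief sketch on the same data follows; the replacement values and variable names are arbitrary choices for the illustration.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})

# Two lists of equal length: 1 becomes 100 and 4 becomes 400.
paired = data.replace([1, 4], [100, 400])

# A dict maps each old value to its new value.
mapped = data.replace({3: 30, 5: 50})

# A nested dict limits the replacement to one column (here column a).
col_only = data.replace({'a': {3: -3}})

print(paired)
print(mapped)
print(col_only)
```

replace returns a new DataFrame each time, so data itself is unchanged.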

Data Segmentation

data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]
print(data)
print(pd.cut(data, bins))

Result:
[11, 15, 18, 20, 25, 26, 27, 24]
[NaN, NaN, (15, 20], (15, 20], (20, 25], NaN, NaN, (20, 25]]
Categories (2, object): [(15, 20] < (20, 25]]

The result shows the bin each value falls into. Values outside every bin are shown as NaN. Each interval is open on the left and closed on the right, so 15 itself lies outside (15, 20].
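
The side on which the intervals are closed, and the names of the bins, can both be changed. This is a small sketch with the same data and bins; the right and labels arguments and the names 'low' and 'high' are assumptions made for the illustration.

```python
import pandas as pd

data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]

# Default: intervals (15, 20] and (20, 25], closed on the right.
default = pd.cut(data, bins)

# right=False closes the intervals on the left instead: [15, 20) and [20, 25).
left_closed = pd.cut(data, bins, right=False)

# labels gives the bins names in place of interval notation.
named = pd.cut(data, bins, labels=['low', 'high'])

print(default.codes.tolist())      # [-1, -1, 0, 0, 1, -1, -1, 1]
print(left_closed.codes.tolist())  # [-1, 0, 0, 1, -1, -1, -1, 1]
print(list(named))
```

With right=False, 15 and 20 move into a bin while 25 drops out, which is why the two code lists differ.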

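print(pd.cut(data,bins).codes)  # this attribute was named labels in old pandas releases

Result:

[-1 -1  0  0  1 -1 -1  1]

Display the integer code of the bin each value falls into (-1 marks a value outside every bin)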

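print(pd.cut(data,bins).categories)  # this attribute was named levels in old pandas releases

Result:

Index(['(15, 20]', '(20, 25]'], dtype='object')

Display the bin labels (recent pandas releases return an IntervalIndex here rather than an Index of strings)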

print(pd.cut(data,bins).value_counts())

Result:

Show the number of values in each bin

In addition, the qcut function cuts the data at quantiles (for example, into quartiles), so that each bin holds roughly the same number of values. Its usage is similar to cut.
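
A short sketch of qcut splitting the same data into quartiles. The variable names are chosen for the illustration, and labels=False is an optional argument not mentioned in the original.

```python
import pandas as pd

data = [11, 15, 18, 20, 25, 26, 27, 24]

# Split into 4 quantile-based bins; the edges are computed from the data.
quartiles = pd.qcut(data, 4)

# Each bin receives the same number of values for this data set.
counts = quartiles.value_counts()

# labels=False returns the bin number of each value instead of an interval.
bin_numbers = pd.qcut(data, 4, labels=False)

print(quartiles)
print(counts.tolist())       # [2, 2, 2, 2]
print(bin_numbers.tolist())  # [0, 0, 1, 1, 2, 3, 3, 2]
```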

Permutation and sampling

We know that there are several functions for ordering data, such as sort, order, and rank (in recent pandas releases, sort and order have been replaced by sort_values and sort_index).
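
For reference, the ordering methods can be sketched as follows with the current method names. The DataFrame, its name and score columns, and its values are invented for this illustration.

```python
import pandas as pd

# Hypothetical data with an unordered index and a tie in the score column.
df = pd.DataFrame({'name': ['wang', 'li', 'zhang'],
                   'score': [70, 90, 70]},
                  index=[2, 0, 1])

# Sort rows by index label.
by_index = df.sort_index()

# Sort rows by a column, largest first.
by_score = df.sort_values('score', ascending=False)

# rank does not reorder anything; it assigns each value its rank.
# Tied values share the average of the ranks they occupy.
ranks = df['score'].rank()

print(by_index)
print(by_score)
print(ranks.tolist())  # [1.5, 3.0, 1.5]
```
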
Now we look at putting data in random order (permutation):

data = np.random.permutation(5)
print(data)

Result:

[1 0 4 2 3]

Here, the permutation function returns the integers 0 to 4 in random order.
You can also use it to sample data.

df = pd.DataFrame(np.arange(12).reshape(4, 3))
samp = np.random.permutation(3)
print(df)

Result:

print(samp)

Result:
[1 0 2]

print(df.take(samp))

Result:

Here, take extracts rows from df by position, in the order given by samp. Because samp is a permutation of 0 to 2, only the first three of the four rows can be selected.
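
To shuffle or sample over every row rather than the first three, the permutation can be built from the length of the DataFrame. The sketch below also shows DataFrame.sample, which the original does not mention, as a built-in alternative; the variable names and random_state value are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

# Shuffle all rows: permute every row position, then take in that order.
shuffled = df.take(np.random.permutation(len(df)))

# Sample 2 rows without replacement: keep the first 2 permuted positions.
subset = df.take(np.random.permutation(len(df))[:2])

# Built-in alternative; random_state makes the draw repeatable.
sampled = df.sample(n=2, random_state=0)

print(shuffled)
print(subset)
print(sampled)
```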

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
