Python data cleansing: merging, converting, filtering, and sorting data

Last Update:2017-02-13 Source: Internet

Author: User


In the previous article we used pandas for some basic operations. Here we look further at how to manipulate the data.

Data cleansing has always been a very important part of data analysis.

Data merge

In pandas, you can merge data with the merge function.

import numpy as np
import pandas as pd
data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'],
                      'numeber': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'],
                      'numeber': [2, 3, 6, 10]})
print(data1)

The result is:

print(data2)

The result is:

print(pd.merge(data1, data2))

The result is:

You can see that only the rows whose key values match in both data1 and data2 are kept, while the other rows are discarded, which is equivalent to an INNER JOIN in SQL. Because no key is specified, merge joins on every column name the two frames share.
In addition, there are outer, right, and left joins, selected with the how keyword.
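To make the effect of the how keyword concrete, here is a minimal sketch. The frames and the column names key, left_val, and right_val are made up for illustration and are not the article's data.

```python
import pandas as pd

# Hypothetical frames: the column names are illustrative only.
left = pd.DataFrame({"key": ["a", "b", "c", "d"], "left_val": [1, 3, 5, 7]})
right = pd.DataFrame({"key": ["a", "b", "c", "e"], "right_val": [2, 3, 6, 10]})

# Record which keys survive each join type.
keys_by_how = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, on="key", how=how)
    keys_by_how[how] = sorted(merged["key"])
    print(how, keys_by_how[how])
```

An inner join keeps only the keys present in both frames, an outer join keeps the keys from either frame, and left and right keep every key from the corresponding side, filling the missing values with NaN.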

data3 = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'],
                      'numeber1': [1, 3, 5, 7]})
data4 = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'],
                      'numeber2': [2, 3, 6, 10]})
print(pd.merge(data3, data4, left_on='level1', right_on='level2'))

The result is:

If the key columns of the two data frames have different names, we can join them by specifying the left_on and right_on parameters.

print(pd.merge(data3, data4, left_on='level1', right_on='level2', how='left'))

The result is:

For the other parameters, see the pandas documentation for merge.

Overlapping data merging

Sometimes we encounter overlapping data that needs to be merged. In that case we can use the combine_first function.

data3 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'],
                      'numeber1': [1, 3, 5, np.nan]})
data4 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'],
                      'numeber2': [2, np.nan, 6, 10]})
print(data3.combine_first(data4))

The result is:

You can see that under the same label the contents of data3 take precedence, and where a value is missing in one data frame, the corresponding element from the other data frame is used to fill it.

The usage here is similar to np.where(pd.isnull(a), b, a).
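The comparison with np.where can be checked directly. The sketch below uses two small made-up Series, a and b, that share the same index. With differing indexes, combine_first aligns by label while np.where works by position, so the two would no longer agree.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with identical indexes, for illustration only.
a = pd.Series([1.0, np.nan, 5.0, np.nan], index=["w", "x", "y", "z"])
b = pd.Series([10.0, 20.0, np.nan, 40.0], index=["w", "x", "y", "z"])

# Values of a win; gaps in a are patched from b.
patched = a.combine_first(b)

# The same selection written with np.where.
manual = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

print(patched)
print(patched.equals(manual))
```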

Data reshaping and pivoting

We covered this in the previous pandas article. Reshaping mainly uses the reshape function, while pivoting mainly uses the stack and unstack functions.
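As a rough sketch of how stack and unstack relate, the example below uses a small made-up frame (the row and column labels are illustrative). stack moves the column labels into an inner index level, and unstack moves that level back into columns.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame holding the integers 0..5.
frame = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["r1", "r2"],
    columns=["x", "y", "z"],
)

# Columns become the inner level of a two-level index; the result is a Series.
stacked = frame.stack()

# Moving the inner level back to columns restores the original layout.
restored = stacked.unstack()

# Calling unstack on a frame with a single-level index also yields a Series,
# this time indexed by (column, row).
by_column = frame.unstack()

print(stacked)
print(restored)
print(by_column)
```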

data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['a', 'b', 'c', 'd'],
                    index=['Wang', 'Li', 'Zhang'])
print(data)

The result is:

print(data.unstack())

The result is:

Data conversion

Removing duplicate rows

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})
print(data)

The result is:

print(data.duplicated())

The result is:

You can see that the third row repeats the second row, so its result is True.

In addition, the drop_duplicates method can be used to remove duplicate rows.
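Both methods can be sketched together as follows. The frame mirrors the one above under the hypothetical name records, and keep="last" is shown as one option for choosing which copy of a duplicated row survives.

```python
import pandas as pd

# Same shape of data as in the text; the name `records` is illustrative.
records = pd.DataFrame({"a": [1, 3, 3, 4], "b": [1, 3, 3, 5]})

# True marks a row that repeats an earlier row.
dup_mask = records.duplicated()

# By default the first occurrence is kept.
deduped = records.drop_duplicates()

# keep="last" keeps the final occurrence instead.
keep_last = records.drop_duplicates(keep="last")

print(dup_mask.tolist())
print(list(deduped.index))
print(list(keep_last.index))
```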

print(data.drop_duplicates())

The result is:

Replacing values

In addition to the fillna method mentioned in our previous article, you can also use the replace method, which is simpler and quicker.

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})
print(data.replace(...))  # the arguments to replace are missing from the source

The result is:

Several values can be replaced at once:

print(data.replace([1, 4], np.nan))
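The common call shapes of replace can be sketched as below. The replacement values (0, 100, 500) and the name scores are assumptions chosen for illustration, not values from the article.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the replacement values below are assumed, not the article's.
scores = pd.DataFrame({"a": [1, 3, 3, 4], "b": [1, 3, 3, 5]})

# Replace one value with another.
single = scores.replace(3, 0)

# Replace several values with the same new value.
several = scores.replace([1, 4], np.nan)

# Replace each value with its own new value using a mapping.
mapped = scores.replace({1: 100, 5: 500})

print(single)
print(several)
print(mapped)
```

replace returns a new frame and leaves the original untouched, so the three results above are independent of one another.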

Data segmentation (binning)

data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]
print(data)
print(pd.cut(data, bins))

The result is:
[11, 15, 18, 20, 25, 26, 27, 24]
[NaN, NaN, (15, 20], (15, 20], (20, 25], NaN, NaN, (20, 25]]
Categories (2, object): [(15, 20] < (20, 25]]

You can see the result of the segmentation: values that fall outside every segment are shown as NaN, and the others show the segment they belong to. Each segment is open on the left and closed on the right, which is why 15 is not in (15, 20].

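print(pd.cut(data, bins).labels)

The result is:

[-1 -1  0  0  1 -1 -1  1]

This shows the integer label of the segment each value falls into, with -1 marking values outside every segment. In recent pandas versions this attribute is named codes.

print(pd.cut(data, bins).levels)

The result is:

Index(['(15, 20]', '(20, 25]'], dtype='object')

This shows the labels of the segments. In recent pandas versions this attribute is named categories.

print(pd.value_counts(pd.cut(data, bins)))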

The result is:

This shows the number of values in each segment.
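The same three steps can be sketched with the attribute names used by recent pandas versions, codes and categories. The variable names ages, segments, codes, and counts are illustrative.

```python
import pandas as pd

ages = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]

# Each value is assigned to (15, 20] or (20, 25]; anything else becomes NaN.
segments = pd.cut(ages, bins)

# Integer position of each value's segment, with -1 for values outside all bins.
codes = segments.codes.tolist()

# Number of values per segment; values outside the bins are not counted.
counts = segments.value_counts()

print(codes)
print(segments.categories)
print(counts)
```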

There is also a qcut function that cuts the data by quantiles (for example, into quartiles). Its usage is similar to cut.
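A minimal sketch of qcut, assuming a made-up list of the integers 1 to 12: asking for 4 groups splits the data at its quartiles, so each group receives the same number of values.

```python
import pandas as pd

# Hypothetical data: the integers 1..12.
values = list(range(1, 13))

# Split at the quartiles instead of at fixed bin edges.
quartiles = pd.qcut(values, 4)

# Group number for each value and the size of each group.
quartile_codes = quartiles.codes.tolist()
sizes = quartiles.value_counts().tolist()

print(quartiles.categories)
print(quartile_codes)
print(sizes)
```

Whereas cut takes the bin edges as input, qcut derives them from the data, which is useful when equally sized groups matter more than fixed boundaries.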

Permutation and sampling

We know that there are several methods for sorting, such as sort, order, and rank.
Now let's look at randomly reordering data (permutation).

data = np.random.permutation(5)
print(data)

The result is:

[1 0 4 2 3]

The permutation function here produces a random ordering of the integers 0 to 4.
The data can also be sampled:

df = pd.DataFrame(np.arange(12).reshape(4, 3))
samp = np.random.permutation(3)
print(df)

The result is:

print(samp)

The result is:
[1 0 2]

print(df.take(samp))

The result is:

Using take here extracts the rows of df in the order given by samp.
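The same idea extends to shuffling all rows or drawing a random subset without replacement. The sketch below assumes a seeded NumPy generator so that runs are repeatable. The seed value and the sample size of 2 are arbitrary choices, not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

# Seeded generator: the seed 0 is an arbitrary choice for repeatable runs.
rng = np.random.default_rng(0)

# Random ordering of all row positions 0..3.
order = rng.permutation(len(df))

# take reorders the rows according to the permuted positions.
shuffled = df.take(order)

# The first two permuted positions give a random sample without replacement.
subset = df.take(order[:2])

print(order)
print(shuffled)
print(subset)
```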

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
