Python data cleansing: merging, conversion, filtering, and sorting

Source: Internet
Author: User
We previously used pandas for some basic operations; here we look further at how to manipulate data.

Data cleansing has always been a very important part of data analysis.

Data merge

In pandas, you can merge data with the merge function.

import numpy as np
import pandas as pd

# Two small frames that share the column names 'level' and 'numeber'
data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'],
                      'numeber': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'],
                      'numeber': [2, 3, 6, 10]})
print(data1)

The result is:

print(data2)

The result is:

print(pd.merge(data1, data2))

The result is:


You can see that only the rows whose values match in both data1 and data2 are kept, and the rest are discarded, which is equivalent to an inner join in SQL. Because no key is specified, merge joins on every column name the two frames share, here both level and numeber.
Other join types, outer, right, and left, are also available and are selected with the how keyword.
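
To make the effect of how concrete, here is a minimal sketch that runs all four join types on one key column. The frame names left and right and the column names left_val and right_val are made up for this illustration; the key is passed explicitly with on so that only level is used for matching.

```python
import pandas as pd

# Hypothetical frames: they share only the key column 'level'
left = pd.DataFrame({"level": ["a", "b", "c", "d"], "left_val": [1, 3, 5, 7]})
right = pd.DataFrame({"level": ["a", "b", "c", "e"], "right_val": [2, 3, 6, 10]})

# Run the same merge once per join type
joins = {how: pd.merge(left, right, on="level", how=how)
         for how in ("inner", "left", "right", "outer")}

for how, frame in joins.items():
    # Show which keys survive each join type
    print(how, sorted(frame["level"]))
```

The inner join keeps only the keys present on both sides, left and right keep every key of one side, and outer keeps the union; cells with no partner are filled with NaN.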

# The key columns have different names in the two frames
data3 = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'],
                      'numeber1': [1, 3, 5, 7]})
data4 = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'],
                      'numeber2': [2, 3, 6, 10]})
print(pd.merge(data3, data4, left_on='level1', right_on='level2'))

The result is:


If the key columns have different names in the two data frames, we can still join them by specifying the left_on and right_on parameters.

print(pd.merge(data3, data4, left_on='level1', right_on='level2', how='left'))

The result is:

Other detailed parameter description

Overlapping data merging

Sometimes we encounter overlapping data that needs to be merged; for this we can use the combine_first method.

# Both frames contain missing values (np.nan)
data3 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'],
                      'numeber1': [1, 3, 5, np.nan]})
data4 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'],
                      'numeber2': [2, np.nan, 6, 10]})
print(data3.combine_first(data4))

The result is:


You can see that where both frames have a value under the same label, the value from data3 takes precedence; where a value is missing in one data frame, the element from the other data frame is used to fill it in.

The usage here is similar to np.where(pd.isnull(a), b, a).
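
The comparison with np.where can be checked on two small Series. This is only a sketch: the Series a and b and their index labels are invented here, and both share the same index so that the position-based np.where and the label-based combine_first see the same pairs.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with gaps in different places
a = pd.Series([1.0, np.nan, 5.0, np.nan], index=["w", "x", "y", "z"])
b = pd.Series([2.0, 4.0, np.nan, 8.0], index=["w", "x", "y", "z"])

# Values of a win; gaps in a are filled from b
patched = a.combine_first(b)

# The same rule written by hand with np.where
manual = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

print(patched.equals(manual))
```

The difference is that combine_first aligns on labels first, so it also works when the two objects have different or differently ordered indexes, whereas np.where compares position by position.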

Data reshaping and axial rotation

We covered this topic in the previous pandas article. Data reshaping mainly uses the reshape function, while axial rotation mainly uses the two functions stack and unstack.
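
As a small illustration of rotation, the sketch below stacks a frame into a long Series and unstacks it again. The variable names wide, long_form, and round_trip are chosen here for illustration, and the frame is assumed to hold the integers 0 to 11.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=["a", "b", "c", "d"],
                    index=["wang", "li", "zhang"])

# stack: columns move into the index, giving a Series keyed by (row, column)
long_form = wide.stack()

# unstack: the inner index level moves back out to the columns
round_trip = long_form.unstack()

# Calling unstack directly on a frame with a flat index gives a Series
# keyed by (column, row) instead
by_column = wide.unstack()

print(long_form.loc[("li", "c")])
print(by_column.loc[("c", "li")])
```

Both lookups address the same cell of the original frame; only the order of the two index levels differs.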

# A 3 x 4 frame; reshape(3, 4) needs 12 values
data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['a', 'b', 'c', 'd'],
                    index=['wang', 'li', 'zhang'])
print(data)

The result is:

print(data.unstack())

The result is:

Data conversion

Delete duplicate row data

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})
print(data)

The result is:

print(data.duplicated())

The result is:


You can see that the third row repeats the second row, so the result for it is True.

In addition, the drop_duplicates method can be used to remove duplicate rows.

print(data.drop_duplicates())

The result is:

Replace value

In addition to the fillna method mentioned in our previous article, you can also use the replace method, which is simpler and quicker.

data = pd.DataFrame({'a': [1, 3, 3, 4],
                     'b': [1, 3, 3, 5]})
print(data.replace(...))

The result is:


Several values can be replaced at once:

print(data.replace([1, 4], np.nan))
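
The common calling patterns of replace can be sketched side by side. The replacement values 30, 100, and 500 are arbitrary choices made for this illustration, not values from the original example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [1, 3, 3, 4], "b": [1, 3, 3, 5]})

# One value replaced by another
single = data.replace(3, 30)

# Several values replaced by the same value (here: marked as missing)
many_to_one = data.replace([1, 4], np.nan)

# A dict gives each old value its own replacement
mapping = data.replace({1: 100, 5: 500})

print(single)
print(many_to_one)
print(mapping)
```

In each case replace returns a new frame and leaves data unchanged.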

Data segmentation


data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]
print(data)
print(pd.cut(data, bins))

The result is:
[11, 15, 18, 20, 25, 26, 27, 24]
[NaN, NaN, (15, 20], (15, 20], (20, 25], NaN, NaN, (20, 25]]
Categories (2, object): [(15, 20] < (20, 25]]

You can see the result of the segmentation: values that fall outside every segment are shown as NaN, and the others show the segment they belong to. Each segment is open on the left and closed on the right, which is why 15 is not assigned to (15, 20].
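
How the segment edges are treated can be changed with two parameters of cut, include_lowest and right. The sketch below compares the three variants on the same data by looking at the integer codes, where -1 means the value fell outside every segment; the variable names are chosen here for illustration.

```python
import pandas as pd

data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]

# Default: segments (15, 20] and (20, 25], so 15 itself is left out
default = pd.cut(data, bins)

# include_lowest=True also accepts the lowest edge, so 15 joins the first segment
with_lowest = pd.cut(data, bins, include_lowest=True)

# right=False flips the closed side: segments [15, 20) and [20, 25)
left_closed = pd.cut(data, bins, right=False)

print(default.codes)
print(with_lowest.codes)
print(left_closed.codes)
```

With right=False the value 20 moves into the second segment and 25 drops out, since the upper edge is no longer included.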

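print(pd.cut(data, bins).codes)

The result is:

[-1 -1  0  0  1 -1 -1  1]

This shows the integer code of the segment each value falls in, with -1 marking values outside every segment. Older pandas releases named this attribute labels.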

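print(pd.cut(data, bins).categories)

The result is:

Index(['(15, 20]', '(20, 25]'], dtype='object')

This displays all of the segment labels. Older pandas releases named this attribute levels and printed a plain Index of strings as shown; recent releases print an IntervalIndex holding the same two intervals.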

print(pd.cut(data, bins).value_counts())

The result is:


This shows the number of values in each segment.

There is also a qcut function, which cuts the data at quantiles (for example into quartiles); its usage is similar to cut.
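
A minimal qcut sketch, assuming the integers 1 to 12 as sample data (the data and variable names are invented for this illustration). Unlike cut, the segment edges are not given by hand; they are computed from the data so that each segment holds roughly the same number of values.

```python
import pandas as pd

values = list(range(1, 13))  # hypothetical sample: 1, 2, ..., 12

# Split into 4 quantile-based segments (quartiles)
quartiles = pd.qcut(values, 4)

# labels=False returns the segment number of each value instead of an interval
quartile_ids = pd.qcut(values, 4, labels=False)

print(quartiles.value_counts())
print(quartile_ids)
```

Each of the four segments receives three of the twelve values here.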

Permutation and sampling

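We know that there are several methods for sorting data, such as sort, order, and rank (in current pandas releases, sort and order have been replaced by sort_values and sort_index).
Here we look at putting data into random order instead (permutation).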

data = np.random.permutation(5)
print(data)

The result is:

[1 0 4 2 3]

The permutation function here returns the integers 0 to 4 in random order.
The data can also be sampled

# A 4 x 3 frame; reshape(4, 3) needs 12 values
df = pd.DataFrame(np.arange(12).reshape(4, 3))
samp = np.random.permutation(3)
print(df)

The result is:

print(samp)

The result is:
[1 0 2]

print(df.take(samp))

The result is:


Using take here extracts the rows of df at the positions listed in samp, in the order in which they appear in samp.
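
The same idea can be used to shuffle every row or to draw a random sample without replacement. This is a sketch under a few assumptions: it uses NumPy's seeded default_rng generator rather than np.random.permutation so that the run is repeatable, and the sample size of 2 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

rng = np.random.default_rng(0)      # seeded so the result is repeatable
order = rng.permutation(len(df))    # random ordering of all row positions

shuffled = df.take(order)           # every row, in random order
sample = df.take(order[:2])         # first two positions: a sample of 2 rows

print(shuffled)
print(sample)
```

Permuting len(df) positions, rather than a smaller number, gives every row a chance of being selected.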
