Having used pandas for some basic operations, we now look further at how to manipulate data.
Data cleaning has always been a very important part of data analysis.
Data merge
In pandas, you can merge data with the merge function.
import numpy as np
import pandas as pd
data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'numeber': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'numeber': [2, 3, 6, 10]})
print(data1)
The result is:
print(data2)
The result is:
print(pd.merge(data1, data2))
The result is:
Because no key is given, merge joins on every column name the two frames share (here both level and numeber). Only the rows whose key values appear in both data1 and data2 are kept, while the other rows are discarded, which is equivalent to an inner join in SQL.
In addition, there are the outer, right and left join methods, selected with the how keyword.
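To make the difference between the join methods concrete, here is a minimal sketch. The frames left and right and the columns x and y are hypothetical names chosen for illustration, with level as the only shared column so that the join key is unambiguous.

```python
import pandas as pd

# Hypothetical frames: 'level' is the only shared column, so it is the join key
left = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'x': [1, 3, 5, 7]})
right = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'y': [2, 3, 6, 10]})

inner = pd.merge(left, right, on='level', how='inner')       # keys present in both frames
outer = pd.merge(left, right, on='level', how='outer')       # keys present in either frame
left_join = pd.merge(left, right, on='level', how='left')    # every key from left
right_join = pd.merge(left, right, on='level', how='right')  # every key from right

# Print which keys survive each kind of join
for name, frame in [('inner', inner), ('outer', outer),
                    ('left', left_join), ('right', right_join)]:
    print(name, list(frame['level']))
```

Rows that have no partner in the other frame get NaN in the columns that come from that frame.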
data3 = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'], 'numeber1': [1, 3, 5, 7]})
data4 = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'], 'numeber2': [2, 3, 6, 10]})
print(pd.merge(data3, data4, left_on='level1', right_on='level2'))
The result is:
If the two data frames use different column names for the key, we can join them by specifying the left_on and right_on parameters.
print(pd.merge(data3, data4, left_on='level1', right_on='level2', how='left'))
The result is:
The other parameters are described in detail in the pandas documentation for merge.
Overlapping data merging
Sometimes we encounter overlapping data that needs to be merged; for this we can use the combine_first function.
data3 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'numeber1': [1, 3, 5, np.nan]})
data4 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'numeber2': [2, np.nan, 6, 10]})
print(data3.combine_first(data4))
The result is:
You can see that, for the same label, the contents of data3 take precedence, and where a value is missing in one data frame, the element from the other data frame is used instead.
The usage here is similar to np.where(pd.isnull(a), b, a).
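The comparison with np.where can be checked directly. This is a small sketch on two hypothetical Series, a and b, that share the same index; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series sharing the same index
a = pd.Series([1.0, np.nan, 5.0, np.nan], index=['w', 'x', 'y', 'z'])
b = pd.Series([2.0, 3.0, np.nan, 10.0], index=['w', 'x', 'y', 'z'])

# combine_first: values of a win, b only fills the gaps in a
combined = a.combine_first(b)

# The same rule written by hand: where a is null take b, otherwise keep a
manual = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

print(combined)
print(combined.equals(manual))
```

The np.where version relies on the two inputs being aligned already, whereas combine_first aligns on the index first.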
Data reshaping and axial rotation
We mentioned this topic in the previous pandas article. Reshaping mainly uses the reshape function, while axial rotation mainly uses the two functions stack and unstack.
data = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'], index=['wang', 'li', 'zhang'])
print(data)
The result is:
print(data.unstack())
The result is:
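To see how stack and unstack relate to each other, the following sketch rotates a frame like the one above in both directions. The variable names stacked, restored and flipped are ours.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['a', 'b', 'c', 'd'],
                    index=['wang', 'li', 'zhang'])

# stack: the column labels become the inner level of the row index
stacked = data.stack()

# unstack on the stacked Series: the inner index level goes back to the columns
restored = stacked.unstack()

# unstack on a plain DataFrame: a Series indexed by (column, row)
flipped = data.unstack()

print(stacked)
print(restored)
print(flipped['a'])
```

Here stacked is indexed by (row, column) pairs, while flipped is indexed by (column, row) pairs, so flipped['a'] gives the whole column a.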
Data conversion
Delete duplicate row data
data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})
print(data)
The result is:
print(data.duplicated())
The result is:
You can see that the third row repeats the second row, so the result for that row is True.
In addition, the drop_duplicates method can be used to remove duplicate rows.
print(data.drop_duplicates())
The result is:
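As a small sketch of how the two methods fit together, the block below uses the same frame. The keep and subset arguments are not covered in the text above; they are standard pandas parameters, shown here only as an optional extra.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

mask = data.duplicated()                        # True for rows that repeat an earlier row
deduped = data.drop_duplicates()                # keeps the first occurrence by default
keep_last = data.drop_duplicates(keep='last')   # keeps the last occurrence instead
by_a = data.drop_duplicates(subset=['a'])       # compares column 'a' only

print(mask.tolist())
print(deduped.index.tolist())
print(keep_last.index.tolist())
print(by_a.index.tolist())
```

Note that drop_duplicates keeps the original row labels, so the index of the result has gaps where rows were removed.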
Replace value
In addition to the fillna method mentioned in our previous article, you can also use the replace method, which is simpler and quicker.
data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})
print(data.replace(...))
The result is:
Several values can be replaced at once:
print(data.replace([1, 4], np.nan))
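The common calling styles of replace can be sketched as follows. The replacement values 30, 50, 100 and 400 are arbitrary choices for illustration, not values from the example above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

single = data.replace(3, 30)                  # one value -> one value
to_nan = data.replace([1, 4], np.nan)         # several values -> the same value
pairwise = data.replace([1, 4], [100, 400])   # list -> list, matched by position
mapped = data.replace({3: 30, 5: 50})         # dict of old value -> new value

print(single)
print(to_nan)
print(pairwise)
print(mapped)
```

In every case replace returns a new DataFrame and leaves data unchanged.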
Data segmentation
data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]
print(data)
print(pd.cut(data, bins))
The result is:
[11, 15, 18, 20, 25, 26, 27, 24]
[NaN, NaN, (15, 20], (15, 20], (20, 25], NaN, NaN, (20, 25]]
Categories (2, object): [(15, 20] < (20, 25]]
You can see the result of the segmentation: values that fall outside every segment are shown as NaN, and the others show the segment they belong to. Each segment is open on the left and closed on the right, which is why 15 itself falls outside (15, 20].
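print(pd.cut(data, bins).codes)
The result is:
[-1 -1  0  0  1 -1 -1  1]
This shows the integer code of the segment each value falls in; -1 marks a value outside every segment. (Older pandas versions called this attribute labels.)
print(pd.cut(data, bins).categories)
The result is:
Index(['(15, 20]', '(20, 25]'], dtype='object')
This shows all of the segment labels. (Older pandas versions called this attribute levels; recent versions return the categories as an IntervalIndex rather than an Index of strings.)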
print(pd.cut(data, bins).value_counts())
The result is:
This shows the number of values in each segment.
There is also a qcut function that cuts the data by quantiles, for example into quartiles; its usage is similar to cut.
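A minimal sketch of qcut on the same list, assuming we want quartiles (four bins). Unlike cut, the bin edges are computed from the data so that each bin holds roughly the same number of values.

```python
import pandas as pd

data = [11, 15, 18, 20, 25, 26, 27, 24]

# Cut into 4 quantile-based bins; the edges are derived from the data itself
quartiles = pd.qcut(data, 4)

print(quartiles)
print(quartiles.value_counts())
```

Because the edges span the whole range of the data, no value ends up as NaN here.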
Permutation and sampling
We know that there are several methods for ordering data, such as sort and order (replaced by sort_values and sort_index in current pandas) and rank.
Here we look at putting data into a random order (permutation).
data = np.random.permutation(5)
print(data)
The result is:
[1 0 4 2 3]
The permutation function here returns the integers 0 to 4 in a random order.
The data can also be sampled
df = pd.DataFrame(np.arange(12).reshape(4, 3))
samp = np.random.permutation(3)
print(df)
The result is:
print(samp)
The result is:
[1 0 2]
print(df.take(samp))
The result is:
Using take here extracts the rows of df at the positions listed in samp, in the order given by samp.
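The same idea can be used to shuffle a whole frame or to draw a smaller sample without replacement. The sketch below is one way to do it; the seed and the sample size of 2 are arbitrary assumptions, and default_rng is NumPy's newer random generator interface, used here in place of np.random.permutation only to make the run repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

rng = np.random.default_rng(0)      # seeded only so the run is repeatable
order = rng.permutation(len(df))    # random ordering of all row positions

shuffled = df.take(order)           # every row, in random order
sample = df.take(order[:2])         # first two positions: 2 rows, no repeats

print(order)
print(shuffled)
print(sample)
```

Slicing the permutation guarantees that no row is picked twice, since each position appears in it exactly once.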