Python data cleansing: data merging, conversion, filtering, and sorting
Previously, we used pandas to perform some basic operations. Next, we will learn more about data operations.
Data cleansing has always been an extremely important part of data analysis.
Data Merging
In pandas, you can use merge to merge data.
import numpy as np
import pandas as pd
data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'number': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'number': [2, 3, 6, 10]})
print(data1)
Result:
print(data2)
Result:
print(pd.merge(data1,data2))
Result:
By default, merge joins data1 and data2 on every column name they share (here, both columns) and keeps only the rows whose values match in both DataFrames. All other rows are discarded, which is equivalent to an inner join in SQL.
In addition, other join types such as outer, right, and left can be selected with the how keyword argument.
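To make the join types concrete, here is a minimal sketch that merges two small DataFrames with each value of how and records which keys survive. The frames left and right and their column names are made up for this illustration, and on='level' is passed explicitly so that only that column is used as the key.

```python
import pandas as pd

# Hypothetical example frames; only the 'level' column is shared.
left = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'left_value': [1, 3, 5, 7]})
right = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'right_value': [2, 3, 6, 10]})

# Merge with each join type and collect the key values that remain.
keys_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='level', how=how)
    keys_by_how[how] = sorted(merged['level'])
    print(how, keys_by_how[how])
```

The inner join keeps only keys present on both sides, left and right keep every key from one side, and outer keeps the union. Missing values on the other side are filled with NaN.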
data3 = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'], 'number1': [1, 3, 5, 7]})
data4 = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'], 'number2': [2, 3, 6, 10]})
print(pd.merge(data3, data4, left_on='level1', right_on='level2'))
Result:
If the key columns have different names in the two DataFrames, we can join them by specifying the left_on and right_on parameters.
print(pd.merge(data3,data4,left_on='level1',right_on='level2',how='left'))
Result:
For descriptions of the other parameters, see the pandas documentation for merge.
Merge overlapping data
Sometimes we may encounter overlapping data that needs to be merged. In this case, we can use the combine_first method.
data3 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'number1': [1, 3, 5, np.nan]})
data4 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'number2': [2, np.nan, 6, 10]})
print(data3.combine_first(data4))
Result:
You can see that combine_first aligns the two DataFrames by index and column label. Where both have a value under the same label, the value from data3 is shown. Where data3 is missing a value, the corresponding element from data4 is filled in.
The usage here is similar to np.where(pd.isnull(a), b, a).
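The comparison with np.where can be checked on a small case. The sketch below uses two made-up Series, a and b, that share the same index, and builds the patched result both ways.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with the same index; a has gaps that b can fill.
a = pd.Series([1.0, np.nan, 5.0, np.nan], index=['w', 'x', 'y', 'z'])
b = pd.Series([2.0, 4.0, np.nan, 8.0], index=['w', 'x', 'y', 'z'])

# Take a's value where present, otherwise fall back to b.
patched = a.combine_first(b)

# The same logic written element by element with np.where.
via_where = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

print(patched)
print(patched.equals(via_where))
```

One difference worth keeping in mind: combine_first aligns on labels, so it also works when the two objects have different indexes, while np.where compares values purely by position.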
Data reshaping and pivoting
This content was mentioned in the previous pandas article. Reshaping mainly uses the reshape function, while pivoting mainly uses the stack and unstack functions.
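As a small sketch of how stack and unstack relate, the example below stacks a made-up DataFrame (the name frame and its labels are assumptions for illustration) into a Series and then unstacks it again.

```python
import numpy as np
import pandas as pd

# reshape turns the flat range 0..5 into a 3x2 array.
frame = pd.DataFrame(np.arange(6).reshape(3, 2),
                     index=['li', 'wang', 'zhang'],
                     columns=['a', 'b'])

# stack moves the column labels into an inner index level,
# giving a Series indexed by (row label, column label).
stacked = frame.stack()
print(stacked)

# unstack moves that inner level back out to the columns.
restored = stacked.unstack()
print(restored)
```

For a simple frame like this one, unstack undoes stack, so restored holds the same values under the same labels as frame.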
data = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'], index=['wang', 'li', 'zhang'])
print(data)
Result:
print(data.unstack())
Result:
Data Conversion
Delete duplicate row data
data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})
print(data)
Result:
print(data.duplicated())
Result:
The third row duplicates the second row, so the result for that row is True.
In addition, the drop_duplicates method can be used to remove duplicate rows.
print(data.drop_duplicates())
Result:
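Duplicates can also be judged on only some of the columns. The following is a minimal sketch, assuming a made-up DataFrame named records, that uses the subset and keep parameters of duplicated and drop_duplicates.

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 agree in column 'a' but not in 'b'.
records = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 9, 5]})

# Only column 'a' is compared, so row 2 counts as a duplicate of row 1.
mask = records.duplicated(subset=['a'])
print(mask)

# By default the first occurrence is kept.
first_kept = records.drop_duplicates(subset=['a'])
print(first_kept)

# keep='last' keeps the last occurrence instead.
last_kept = records.drop_duplicates(subset=['a'], keep='last')
print(last_kept)
```

Both calls return a new DataFrame and leave records unchanged.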
Replacing values
In addition to the fillna method mentioned in the previous article, you can also use the replace method, which is often the simpler and quicker way to substitute values.
data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})
print(data.replace(1, 2))
Result:
Replace multiple values at once:
print(data.replace([1,4],np.nan))
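When each old value needs its own replacement, replace also accepts a dict or a pair of lists. A short sketch using the same data as above (the variable names swapped and paired are only for illustration):

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

# A dict maps each old value to its own new value.
swapped = data.replace({1: 10, 3: 30})
print(swapped)

# Two lists of equal length are matched up position by position.
paired = data.replace([1, 4], [100, 400])
print(paired)
```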
Data Segmentation
data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]
print(data)
print(pd.cut(data, bins))
Result:
[11, 15, 18, 20, 25, 26, 27, 24]
[NaN, NaN, (15, 20], (15, 20], (20, 25], NaN, NaN, (20, 25]]
Categories (2, object): [(15, 20] < (20, 25]]
We can see the result of the binning: values that fall outside every bin are shown as NaN, and the others are shown as the interval they fall into.
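The intervals are open on the left and closed on the right by default, which is why 15 falls outside the first bin. The sketch below, which reuses the same numbers under the made-up name ages, shows two optional parameters of cut: labels to name the bins and right=False to close the intervals on the left instead.

```python
import pandas as pd

ages = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]

# Give the two bins readable names instead of interval notation.
named = pd.cut(ages, bins, labels=['low', 'high'])
print(named)

# right=False makes the bins [15, 20) and [20, 25),
# so 15 is now included and 25 is not.
left_closed = pd.cut(ages, bins, right=False)
print(left_closed)
```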
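print(pd.cut(data, bins).codes)  # called .labels in very old pandas versions
Result:
[-1 -1  0  0  1 -1 -1  1]
This displays the integer code of the bin each value falls into; -1 marks values that are outside every bin.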
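print(pd.cut(data, bins).categories)  # called .levels in very old pandas versions
Result:
Index(['(15, 20]', '(20, 25]'], dtype='object')
This displays the bin labels. Recent pandas versions return an IntervalIndex containing the same two intervals.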
print(pd.cut(data, bins).value_counts())
Result:
This shows the number of values in each bin.
In addition, the qcut function cuts the data at sample quantiles, for example into quartiles (four bins holding roughly equal numbers of values). Its usage is similar to cut.
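A minimal sketch of qcut on the same numbers (under the made-up name values), asking for four quantile-based bins:

```python
import pandas as pd

values = [11, 15, 18, 20, 25, 26, 27, 24]

# Passing 4 asks for quartiles: the bin edges are computed from the data
# so that each bin receives about the same number of values.
quartiles = pd.qcut(values, 4)
print(quartiles)

# Count how many values landed in each bin.
counts = quartiles.value_counts()
print(counts)
```

Unlike cut, where the bin edges are given and the counts vary, qcut derives the edges from the data so that the counts come out even. Here each of the four bins holds two of the eight values.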
Permutation and sampling
We know that several functions, such as sort, order, and rank, can sort data (in recent pandas versions, sort_values and sort_index take the place of sort and order).
Now we want to talk about putting data in random order (permutation).
data = np.random.permutation(5)
print(data)
Result:
[1 0 4 2 3]
Here, the permutation function returns the integers 0 to 4 in random order.
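Because the order changes on every run, it can help to fix a seed while experimenting. The sketch below is one way to do that, assuming a NumPy version that provides np.random.default_rng (1.17 or later). The seed value 0 and the array scores are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator gives the same "random" order on every run.
rng = np.random.default_rng(0)

# Passing an integer n permutes the integers 0..n-1.
order = rng.permutation(5)
print(order)

# Passing an array returns a shuffled copy; the original is untouched.
scores = np.array([10, 20, 30, 40, 50])
shuffled = rng.permutation(scores)
print(shuffled)
print(scores)
```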
You can also sample the data.
df = pd.DataFrame(np.arange(12).reshape(4, 3))
samp = np.random.permutation(3)
print(df)
Result:
print(samp)
Result:
[1 0 2]
print(df.take(samp))
Result:
Here, take extracts rows from df in the order given by samp.
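To draw a random subset of rows without replacement, one option is to slice the first k entries of a permutation of all row positions and pass them to take. The sketch below does this with k = 2 and, for comparison, uses the built-in DataFrame.sample method. The seed and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

# Seeded generator so the example is repeatable.
rng = np.random.default_rng(0)

# Permute all row positions, then keep the first two:
# two distinct rows chosen at random.
picked = rng.permutation(len(df))[:2]
subset = df.take(picked)
print(subset)

# pandas also offers sample, which does the selection in one call.
builtin = df.sample(n=2, random_state=0)
print(builtin)
```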