This article introduces data merging, conversion, filtering, and sorting for data cleansing in Python with pandas, and continues our look at data operations.
Data cleansing has always been an extremely important part of data analysis.
Data merging
In pandas, you can use merge to merge data.
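import numpy as np
import pandas as pd

data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'numeber': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'numeber': [2, 3, 6, 10]})
print(data1)
# merge joins on every column name the two frames share (here both 'level' and 'numeber')
print(pd.merge(data1, data2))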
By default, merge performs an inner join on the columns the two frames share. Other join types (outer, right, and left) are selected with the how keyword.
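As a minimal sketch of how the how keyword changes the result, the snippet below builds two small frames in which level is the only shared column, so it becomes the join key. The names left, right, left_value, and right_value are illustrative and do not come from the original example.

```python
import pandas as pd

# Illustrative frames: 'level' is the only shared column, so it is the join key.
left = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'left_value': [1, 3, 5, 7]})
right = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'right_value': [2, 3, 6, 10]})

inner = pd.merge(left, right)                    # default how='inner': keys found in both
outer = pd.merge(left, right, how='outer')       # keys found in either frame
left_join = pd.merge(left, right, how='left')    # every key from `left`
right_join = pd.merge(left, right, how='right')  # every key from `right`

print(inner)
print(outer)
```

Rows that have no partner in the other frame are kept by the outer, left, and right joins, with NaN filling the missing side.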
# When the key columns have different names, name them with left_on and right_on.
data3 = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'], 'numeber1': [1, 3, 5, 7]})
data4 = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'], 'numeber2': [2, 3, 6, 10]})
print(pd.merge(data3, data4, left_on='level1', right_on='level2'))
Merge overlapping data
Sometimes two data sets overlap and need to be merged so that one fills the gaps in the other. In this case, we can use the combine_first function.
data3 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'numeber1': [1, 3, 5, np.nan]})
data4 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'numeber2': [2, np.nan, 6, 10]})
print(data3.combine_first(data4))
The usage here is similar to np.where(pd.isnull(a), b, a): take the value from a, and fall back to b wherever a is missing.
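To make the comparison with np.where concrete, here is a small sketch on two Series that share the same index. The Series a and b and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two illustrative Series with the same index; `a` has gaps that `b` can fill.
a = pd.Series([1.0, np.nan, 5.0, np.nan], index=['w', 'x', 'y', 'z'])
b = pd.Series([2.0, 4.0, np.nan, 8.0], index=['w', 'x', 'y', 'z'])

patched = a.combine_first(b)           # take a, fall back to b where a is NaN
manual = np.where(pd.isnull(a), b, a)  # the same rule applied position by position

print(patched)
print(manual)
```

One difference worth keeping in mind: combine_first aligns the two objects on their index labels and returns a pandas object, while np.where works purely by position and returns a plain NumPy array.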
Data reshaping and pivoting
This content was mentioned in the previous pandas article. Reshaping mainly uses the reshape function, while pivoting mainly uses the stack and unstack functions.
data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['a', 'b', 'c', 'd'],
                    index=['wang', 'li', 'zhang'])
print(data)
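A minimal sketch of the stack and unstack rotation mentioned above, using the same frame. The variable names stacked and restored are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['a', 'b', 'c', 'd'],
                    index=['wang', 'li', 'zhang'])

stacked = data.stack()        # column labels become the inner level of the row index
restored = stacked.unstack()  # the inner index level moves back out to the columns

print(stacked)
print(restored)
```

Here stack turns the 3 x 4 frame into a Series of 12 values indexed by (row label, column label), and unstack rotates it back into a frame.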
Data Conversion
Delete duplicate rows
data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})
print(data)
# drop_duplicates removes rows that repeat an earlier row
print(data.drop_duplicates())
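As a sketch of the options around duplicate handling on the frame above: duplicated flags the repeated rows, while the keep and subset parameters of drop_duplicates control which copy survives and which columns are compared. The variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

flags = data.duplicated()                       # True where a row repeats an earlier one
keep_last = data.drop_duplicates(keep='last')   # keep the final copy instead of the first
by_column = data.drop_duplicates(subset=['a'])  # compare rows on column 'a' only

print(flags)
print(keep_last)
print(by_column)
```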
Replace values
In addition to the fillna method mentioned in the previous article, you can also use the replace method, which is often simpler when specific values need to be swapped.
data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})
print(data.replace(1, 2))
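replace also accepts several values at once. The sketch below, on the same frame, uses a dict to map each old value to a new one and a list to send several old values to one replacement. The replacement values chosen here are arbitrary.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

# A dict maps each old value to its own new value in a single call.
swapped = data.replace({1: 2, 3: 30})
# A list of old values can share one replacement, here NaN.
masked = data.replace([4, 5], np.nan)

print(swapped)
print(masked)
```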
# pd.cut assigns each value to one of the bins (15, 20] and (20, 25]
data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]
print(data)
print(pd.cut(data, bins))
Result:
[11, 15, 18, 20, 25, 26, 27, 24]
[NaN, NaN, (15, 20], (15, 20], (20, 25], NaN, NaN, (20, 25]]
Categories (2, object): [(15, 20] < (20, 25]]
This is the result of the segmentation. Values that fall outside every bin are shown as NaN; every other value is shown as the interval that contains it. Note that each interval is open on the left, so 15 itself is not in (15, 20].
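Two pd.cut parameters change how the bins above behave. As a sketch with the same data and bins: right=False closes each interval on the left instead of the right, and labels gives the bins readable names. The names 'low' and 'high' are illustrative.

```python
import pandas as pd

data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]

# right=False closes each bin on the left instead: [15, 20) and [20, 25)
left_closed = pd.cut(data, bins, right=False)
# labels= names the bins; 'low' and 'high' are illustrative names
named = pd.cut(data, bins, labels=['low', 'high'])

print(left_closed)
print(named)
```

With right=False the value 15 now lands in the first bin, while 25 falls outside the last one.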
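print(pd.cut(data, bins).codes)  # this attribute was called labels in old pandas releases
Result:
[-1 -1  0  0  1 -1 -1  1]
This displays the integer code of the bin each value falls into; -1 marks a value outside every bin.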
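print(pd.cut(data, bins).categories)  # this attribute was called levels in old pandas releases
Result (from an older pandas release; recent releases print an IntervalIndex here):
Index(['(15, 20]', '(20, 25]'], dtype='object')
This displays the bin labels themselves.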
# Count how many values fall into each bin
print(pd.Series(pd.cut(data, bins)).value_counts())
Next, let's look at randomly reordering data (permutation).
data = np.random.permutation(5)
print(data)
Result:
[1 0 4 2 3]
Here, the permutation function returns the integers 0 to 4 in random order, so the output differs from run to run.
You can also sample the data.
df = pd.DataFrame(np.arange(12).reshape(4, 3))
samp = np.random.permutation(3)
print(df)
# take selects rows by position, in the order given by samp
print(df.take(samp))
Here, take extracts rows from df in the order given by samp.
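A slightly fuller sketch of the same idea, assuming a seeded NumPy generator so the result is repeatable. The seed value and the names rng, order, shuffled, and subset are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

rng = np.random.default_rng(0)    # fixed seed, chosen only for repeatability
order = rng.permutation(len(df))  # a random ordering of all row positions
shuffled = df.take(order)         # every row, in shuffled order
subset = df.take(order[:2])       # the first two picks: a sample without replacement

print(shuffled)
print(subset)
```

pandas also offers DataFrame.sample, which draws random rows directly; the permutation approach is shown here because it matches the technique used above.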