Data Analysis and Presentation: Pandas Data Feature Analysis
Pandas data feature analysis: sorting data
Basic statistics (including sorting), distribution and cumulative statistics, and data features (correlation, periodicity, etc.) are obtained by summarizing a data set, which is a lossy process of extracting its features. Data mining then turns those features into knowledge.
- The .sort_index() method sorts by index labels along the specified axis, in ascending order by default.
- .sort_index(axis=0, ascending=True)
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: b = pd.DataFrame(np.arange(20).reshape(4,5), index=['c','a','d','b'])

In [4]: b
Out[4]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

In [5]: b.sort_index()
Out[5]:
    0   1   2   3   4
a   5   6   7   8   9
b  15  16  17  18  19
c   0   1   2   3   4
d  10  11  12  13  14

In [6]: b.sort_index(ascending=False)
Out[6]:
    0   1   2   3   4
d  10  11  12  13  14
c   0   1   2   3   4
b  15  16  17  18  19
a   5   6   7   8   9

In [7]: c = b.sort_index(axis=1, ascending=False)

In [8]: c
Out[8]:
    4   3   2   1   0
c   4   3   2   1   0
a   9   8   7   6   5
d  14  13  12  11  10
b  19  18  17  16  15

In [9]: c = c.sort_index()

In [10]: c
Out[10]:
    4   3   2   1   0
a   9   8   7   6   5
b  19  18  17  16  15
c   4   3   2   1   0
d  14  13  12  11  10
- The .sort_values() method sorts by values along the specified axis, in ascending order by default.
Series.sort_values(axis=0, ascending=True)
DataFrame.sort_values(by, axis=0, ascending=True)  # by: a label or list of labels to sort by (column labels when axis=0, row labels when axis=1)
In [11]: c = b.sort_values(2, ascending=False)

In [12]: c
Out[12]:
    0   1   2   3   4
b  15  16  17  18  19
d  10  11  12  13  14
a   5   6   7   8   9
c   0   1   2   3   4

In [13]: c = c.sort_values('a', axis=1, ascending=False)

In [14]: c
Out[14]:
    4   3   2   1   0
b  19  18  17  16  15
d  14  13  12  11  10
a   9   8   7   6   5
c   4   3   2   1   0
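Since `by` also accepts a list of labels, a sort can use several keys at once. The following is a minimal sketch on a small made-up frame (the column names `group` and `score` are illustrative assumptions, not part of the example above); it sorts by one column ascending and breaks ties with a second column descending.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {"group": ["b", "a", "b", "a"], "score": [1, 3, 2, 4]},
    index=["w", "x", "y", "z"],
)

# Sort by 'group' ascending, then by 'score' descending within each group.
# 'ascending' takes one flag per key when 'by' is a list.
ordered = df.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered)
```

The original frame is left unchanged; `sort_values` returns a new, sorted object.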
By default, NaN values are placed at the end of the result, whether the sort is ascending or descending.
In [15]: a = pd.DataFrame(np.arange(12).reshape(3,4), index=['a','b','c'])

In [16]: a
Out[16]:
   0  1   2   3
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [17]: c = a + b

In [18]: c
Out[18]:
      0     1     2     3   4
a   5.0   7.0   9.0  11.0 NaN
b  19.0  21.0  23.0  25.0 NaN
c   8.0  10.0  12.0  14.0 NaN
d   NaN   NaN   NaN   NaN NaN

In [19]: c.sort_values(2, ascending=False)
Out[19]:
      0     1     2     3   4
b  19.0  21.0  23.0  25.0 NaN
c   8.0  10.0  12.0  14.0 NaN
a   5.0   7.0   9.0  11.0 NaN
d   NaN   NaN   NaN   NaN NaN

In [20]: c.sort_values(2, ascending=True)
Out[20]:
      0     1     2     3   4
a   5.0   7.0   9.0  11.0 NaN
c   8.0  10.0  12.0  14.0 NaN
b  19.0  21.0  23.0  25.0 NaN
d   NaN   NaN   NaN   NaN NaN
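The placement of NaN can be changed with the `na_position` parameter of `sort_values`. A small sketch on a made-up Series (the values are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical Series containing one missing value.
s = pd.Series([3.0, np.nan, 1.0, 2.0], index=["a", "b", "c", "d"])

# Default behaviour: NaN goes last, for ascending and descending sorts alike.
asc = s.sort_values()
desc = s.sort_values(ascending=False)

# na_position="first" moves NaN to the front instead.
nan_first = s.sort_values(na_position="first")

print(asc.index.tolist())
print(desc.index.tolist())
print(nan_first.index.tolist())
```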
Basic statistical analysis of data
Basic statistical functions, applicable to Series and DataFrame types
| Method | Description |
| --- | --- |
| .sum() | Sum of the data along axis 0 (the same default axis applies to the methods below) |
| .count() | Number of non-NaN values |
| .mean() .median() | Arithmetic mean and median of the data |
| .var() .std() | Variance and standard deviation of the data |
| .min() .max() | Minimum and maximum values of the data |
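As a quick illustration of the default axis, the sketch below rebuilds the frame `b` from the sorting examples under the name `df` and compares axis 0 with axis 1. It also shows that pandas computes the sample standard deviation (`ddof=1`) unless told otherwise.

```python
import numpy as np
import pandas as pd

# Same data as the frame b used earlier in the article.
df = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])

col_sums = df.sum()         # axis=0 (default): one result per column
row_sums = df.sum(axis=1)   # axis=1: one result per row

sample_std = df[0].std()            # ddof=1, divides by n - 1
population_std = df[0].std(ddof=0)  # divides by n

print(col_sums.tolist())
print(row_sums.tolist())
print(sample_std, population_std)
```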
Applicable to Series type (.idxmin() and .idxmax() are also available on DataFrame)
| Method | Description |
| --- | --- |
| .argmin() .argmax() | Integer position (automatic index) of the minimum and maximum values |
| .idxmin() .idxmax() | Index label (custom index) of the minimum and maximum values |
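The difference between position and label is easiest to see on a Series with a custom index. This sketch assumes pandas 1.0 or later, where `Series.argmin()` and `Series.argmax()` return integer positions; some older releases behaved differently.

```python
import pandas as pd

s = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

max_pos = s.argmax()    # integer position of the largest value
max_label = s.idxmax()  # index label of the largest value
min_pos = s.argmin()
min_label = s.idxmin()

print(max_pos, max_label)
print(min_pos, min_label)
```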
Applicable to Series and DataFrame types
| Method | Description |
| --- | --- |
| .describe() | Statistical summary along axis 0 (one summary per column) |
In [21]: a = pd.Series([9,8,7,6], index=['a','b','c','d'])

In [22]: a
Out[22]:
a    9
b    8
c    7
d    6
dtype: int64

In [23]: a.describe()
Out[23]:
count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64

In [24]: type(a.describe())
Out[24]: pandas.core.series.Series

In [25]: a.describe()['count']
Out[25]: 4.0

In [26]: a.describe()['max']
Out[26]: 9.0

In [27]: b.describe()
Out[27]:
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

In [28]: type(b.describe())
Out[28]: pandas.core.frame.DataFrame

In [30]: b.describe().loc['max']  # .ix is deprecated and was removed in pandas 1.0; use .loc for label-based indexing
Out[30]:
0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64

In [31]: b.describe()[2]
Out[31]:
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64
Cumulative statistical analysis of data
Cumulative functions, applicable to Series and DataFrame types
| Method | Description |
| --- | --- |
| .cumsum() | Sum of the first 1, 2, ..., n values |
| .cumprod() | Product of the first 1, 2, ..., n values |
| .cummax() | Maximum of the first 1, 2, ..., n values |
| .cummin() | Minimum of the first 1, 2, ..., n values |
In [32]: b.cumsum()
Out[32]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   7   9  11  13
d  15  18  21  24  27
b  30  34  38  42  46

In [33]: b.cumprod()
Out[33]:
   0     1     2     3     4
c  0     1     2     3     4
a  0     6    14    24    36
d  0    66   168   312   504
b  0  1056  2856  5616  9576

In [34]: b.cummin()
Out[34]:
   0  1  2  3  4
c  0  1  2  3  4
a  0  1  2  3  4
d  0  1  2  3  4
b  0  1  2  3  4

In [35]: b.cummax()
Out[35]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19
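The cumulative methods run down each column by default. Passing `axis=1` accumulates across each row instead, as this short sketch shows (the frame `b` is rebuilt here under the name `df`):

```python
import numpy as np
import pandas as pd

# Same data as the frame b used above.
df = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])

# Accumulate from left to right within each row.
row_cumsum = df.cumsum(axis=1)
print(row_cumsum)

# The last column of a row-wise cumulative sum equals the row totals.
print((row_cumsum[4] == df.sum(axis=1)).all())
```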
Rolling (window) functions, applicable to Series and DataFrame types
| Method | Description |
| --- | --- |
| .rolling(w).sum() | Sum of each window of w adjacent elements |
| .rolling(w).mean() | Arithmetic mean of each window of w adjacent elements |
| .rolling(w).var() | Variance of each window of w adjacent elements |
| .rolling(w).std() | Standard deviation of each window of w adjacent elements |
| .rolling(w).min() .max() | Minimum and maximum of each window of w adjacent elements |
In [36]: b.rolling(2).sum()
Out[36]:
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   5.0   7.0   9.0  11.0  13.0
d  15.0  17.0  19.0  21.0  23.0
b  25.0  27.0  29.0  31.0  33.0

In [37]: b.rolling(3).sum()
Out[37]:
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   NaN   NaN   NaN   NaN   NaN
d  15.0  18.0  21.0  24.0  27.0
b  30.0  33.0  36.0  39.0  42.0
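The leading NaN rows appear because a window of size w needs w observations before it produces a result. If partial windows are acceptable, the `min_periods` argument of `rolling` lowers that requirement. A minimal sketch on the same data (rebuilt here as `df`):

```python
import numpy as np
import pandas as pd

# Same data as the frame b used above.
df = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])

strict = df.rolling(2).sum()                 # first row is NaN
filled = df.rolling(2, min_periods=1).sum()  # first row uses a window of one

print(filled)
```

Whether partial windows make sense depends on the analysis; they are not comparable with full windows for statistics such as the sum.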
Correlation analysis of data
Given two variables X and Y, how can we determine whether they are correlated?
Correlation
- When X increases, Y increases: the two variables are positively correlated.
- When X increases, Y decreases: the two variables are negatively correlated.
- When X increases, Y shows no consistent change: the two variables are uncorrelated.
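Covariance
- Covariance > 0: X and Y are positively correlated
- Covariance < 0: X and Y are negatively correlated
- Covariance = 0: X and Y are uncorrelated, meaning they have no linear relationship (this does not by itself prove that they are independent)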
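A zero covariance only rules out a linear relationship. The sketch below uses made-up values in which Y is completely determined by X (Y = X squared), yet the covariance is zero because the relationship is symmetric around the mean of X.

```python
import pandas as pd

# Hypothetical data: y depends entirely on x, but not linearly.
x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

cov_xy = x.cov(y)
print(cov_xy)
```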
Pearson correlation coefficient
The coefficient r takes values in [-1, 1]. As a rule of thumb, its absolute value is read as follows:
- 0.8-1.0: very strong correlation
- 0.6-0.8: strong correlation
- 0.4-0.6: moderate correlation
- 0.2-0.4: weak correlation
- 0.0-0.2: very weak correlation or none
Applicable to Series and DataFrame types
| Method | Description |
| --- | --- |
| .cov() | Computes the covariance (a covariance matrix for a DataFrame) |
| .corr() | Computes the correlation coefficient (a correlation matrix for a DataFrame) using the Pearson, Spearman, or Kendall method |
In [38]: import pandas as pd

In [39]: hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=['2008', '2009', '2010', '2011', '2012'])

In [40]: m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=['2008', '2009', '2010', '2011', '2012'])

In [41]: hprice.corr(m2)
Out[41]: 0.5239439145220387
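To connect the Pearson coefficient back to covariance, the following sketch recomputes the value above by dividing the covariance by the product of the two standard deviations. The variable names `r_manual` and `r_pandas` are introduced here for illustration.

```python
import pandas as pd

years = ["2008", "2009", "2010", "2011", "2012"]
hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=years)
m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=years)

# Pearson r = cov(X, Y) / (std(X) * std(Y)).
# Series.cov and Series.std both use ddof=1, so the n - 1 factors cancel.
r_manual = hprice.cov(m2) / (hprice.std() * m2.std())

# Series.corr uses the Pearson method by default.
r_pandas = hprice.corr(m2)

print(r_manual, r_pandas)
```

On the rule-of-thumb scale above, a value of about 0.52 indicates a moderate correlation.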