Data Analysis and Presentation: Pandas Data Feature Analysis

Source: Internet
Author: User

Pandas data feature analysis: sorting data

Basic statistics (including sorting), distribution and cumulative statistics, and data features (correlation, periodicity, and so on) can be extracted from a data set through summarization, which is a lossy process of extracting data features, and through data mining, which turns those features into knowledge.

  • The .sort_index() method sorts by index labels along the specified axis, in ascending order by default.
  • .sort_index(axis=0, ascending=True)
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: b = pd.DataFrame(np.arange(20).reshape(4,5), index=['c','a','d','b'])

In [4]: b
Out[4]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

In [5]: b.sort_index()
Out[5]:
    0   1   2   3   4
a   5   6   7   8   9
b  15  16  17  18  19
c   0   1   2   3   4
d  10  11  12  13  14

In [6]: b.sort_index(ascending=False)
Out[6]:
    0   1   2   3   4
d  10  11  12  13  14
c   0   1   2   3   4
b  15  16  17  18  19
a   5   6   7   8   9

In [7]: c = b.sort_index(axis=1, ascending=False)

In [8]: c
Out[8]:
    4   3   2   1   0
c   4   3   2   1   0
a   9   8   7   6   5
d  14  13  12  11  10
b  19  18  17  16  15

In [9]: c = c.sort_index()

In [10]: c
Out[10]:
    4   3   2   1   0
a   9   8   7   6   5
b  19  18  17  16  15
c   4   3   2   1   0
d  14  13  12  11  10
  • The .sort_values() method sorts by values along the specified axis, in ascending order by default.
Series.sort_values(axis=0, ascending=True)
DataFrame.sort_values(by, axis=0, ascending=True)  # by: the label (or list of labels) whose values are sorted: a column label when axis=0, a row label when axis=1
In [11]: c = b.sort_values(2,ascending=False)

In [12]: c
Out[12]:
    0   1   2   3   4
b  15  16  17  18  19
d  10  11  12  13  14
a   5   6   7   8   9
c   0   1   2   3   4

In [13]: c = c.sort_values('a',axis=1,ascending=False)

In [14]: c
Out[14]:
    4   3   2   1   0
b  19  18  17  16  15
d  14  13  12  11  10
a   9   8   7   6   5
c   4   3   2   1   0
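Since `by` also accepts a list of labels, a sort can use several keys at once. The following is a minimal sketch under assumed data: the column names `group` and `score` and their values are made up for illustration and do not come from the session above.

```python
import pandas as pd

# Hypothetical data: two groups with one score per row.
df = pd.DataFrame(
    {"group": ["x", "y", "x", "y"], "score": [3, 1, 2, 4]},
    index=["c", "a", "d", "b"],
)

# Sort by "group" ascending, then by "score" descending within each group.
out = df.sort_values(by=["group", "score"], ascending=[True, False])
print(out)
```

Passing a list to `ascending` gives each key its own direction; the list must have the same length as `by`.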

By default, NaN values are placed at the end of the result, whether the sort is ascending or descending.

In [15]: a = pd.DataFrame(np.arange(12).reshape(3,4), index=['a','b','c'])

In [16]: a
Out[16]:
   0  1   2   3
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [17]: c = a + b

In [18]: c
Out[18]:
      0     1     2     3   4
a   5.0   7.0   9.0  11.0 NaN
b  19.0  21.0  23.0  25.0 NaN
c   8.0  10.0  12.0  14.0 NaN
d   NaN   NaN   NaN   NaN NaN

In [19]: c.sort_values(2,ascending=False)
Out[19]:
      0     1     2     3   4
b  19.0  21.0  23.0  25.0 NaN
c   8.0  10.0  12.0  14.0 NaN
a   5.0   7.0   9.0  11.0 NaN
d   NaN   NaN   NaN   NaN NaN

In [20]: c.sort_values(2,ascending=True)
Out[20]:
      0     1     2     3   4
a   5.0   7.0   9.0  11.0 NaN
c   8.0  10.0  12.0  14.0 NaN
b  19.0  21.0  23.0  25.0 NaN
d   NaN   NaN   NaN   NaN NaN
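Where NaN ends up can be changed with the `na_position` argument of `sort_values`. This short sketch uses a made-up Series (not part of the session above) to show the default behavior next to `na_position="first"`.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value.
s = pd.Series([3.0, np.nan, 1.0, 2.0], index=["a", "b", "c", "d"])

last_asc = s.sort_values()                    # default: NaN at the end
last_desc = s.sort_values(ascending=False)    # NaN still at the end
first = s.sort_values(na_position="first")    # NaN moved to the front

print(list(last_asc.index))
print(list(last_desc.index))
print(list(first.index))
```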
Basic statistical analysis of data

Basic statistical analysis functions

Applicable to Series and DataFrame types

Method               Description
.sum()               Sum of the data along axis 0 (axis 0 is also the default for the methods below)
.count()             Number of non-NaN values
.mean() .median()    Arithmetic mean and median of the data
.var() .std()        Variance and standard deviation of the data
.min() .max()        Minimum and maximum values of the data
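To make the role of the axis concrete, here is a small sketch that applies a few of these methods to a 4x5 frame of the integers 0 to 19. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])

col_sum = b.sum()          # axis=0 (default): one value per column
row_sum = b.sum(axis=1)    # axis=1: one value per row
col_mean = b.mean()        # column means
sample_var = b[0].var()    # sample variance (ddof=1 is the pandas default)

print(col_sum.tolist())    # [30, 34, 38, 42, 46]
print(row_sum.tolist())    # [10, 35, 60, 85]
```

Note that `.var()` and `.std()` divide by n - 1 by default; pass `ddof=0` for the population versions.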

Applicable to Series type

Method                 Description
.argmin() .argmax()    Integer position of the minimum and maximum values (automatic index)
.idxmin() .idxmax()    Index label of the minimum and maximum values (custom index)
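The difference between the two pairs shows up as soon as the Series has a custom index. This sketch assumes pandas 1.0 or later, where `Series.argmin()` and `Series.argmax()` return integer positions.

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

pos_min = a.argmin()   # integer position of the smallest value
lab_min = a.idxmin()   # index label of the smallest value
pos_max = a.argmax()   # integer position of the largest value
lab_max = a.idxmax()   # index label of the largest value

print(pos_min, lab_min)   # 3 d
print(pos_max, lab_max)   # 0 a
```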

Applicable to Series and DataFrame types

Method         Description
.describe()    Summary statistics along axis 0 (one summary per column)

 

In [21]: a = pd.Series([9,8,7,6], index=['a','b','c','d'])

In [22]: a
Out[22]:
a    9
b    8
c    7
d    6
dtype: int64

In [23]: a.describe()
Out[23]:
count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64

In [24]: type(a.describe())
Out[24]: pandas.core.series.Series

In [25]: a.describe()['count']
Out[25]: 4.0

In [26]: a.describe()['max']
Out[26]: 9.0

In [27]: b.describe()
Out[27]:
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

In [28]: type(b.describe())
Out[28]: pandas.core.frame.DataFrame

In [30]: b.describe().loc['max']  # .loc replaces the deprecated .ix for label-based indexing
Out[30]:
0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64

In [31]: b.describe()[2]
Out[31]:
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64
Cumulative statistical analysis of data

Cumulative statistical analysis functions

Applicable to Series and DataFrame types, accumulative computing

Method        Description
.cumsum()     Cumulative sum of the first 1, 2, ..., n values
.cumprod()    Cumulative product of the first 1, 2, ..., n values
.cummax()     Running maximum of the first 1, 2, ..., n values
.cummin()     Running minimum of the first 1, 2, ..., n values

 

In [32]: b.cumsum()
Out[32]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   7   9  11  13
d  15  18  21  24  27
b  30  34  38  42  46

In [33]: b.cumprod()
Out[33]:
   0     1     2     3     4
c  0     1     2     3     4
a  0     6    14    24    36
d  0    66   168   312   504
b  0  1056  2856  5616  9576

In [34]: b.cummin()
Out[34]:
   0  1  2  3  4
c  0  1  2  3  4
a  0  1  2  3  4
d  0  1  2  3  4
b  0  1  2  3  4

In [35]: b.cummax()
Out[35]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

Applicable to Series and DataFrame types, rolling computing (window computing)

Method                                Description
.rolling(w).sum()                     Sum of each window of w adjacent elements
.rolling(w).mean()                    Arithmetic mean of each window of w adjacent elements
.rolling(w).var()                     Variance of each window of w adjacent elements
.rolling(w).std()                     Standard deviation of each window of w adjacent elements
.rolling(w).min() .rolling(w).max()   Minimum and maximum of each window of w adjacent elements

 

In [36]: b.rolling(2).sum()
Out[36]:
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   5.0   7.0   9.0  11.0  13.0
d  15.0  17.0  19.0  21.0  23.0
b  25.0  27.0  29.0  31.0  33.0

In [37]: b.rolling(3).sum()
Out[37]:
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   NaN   NaN   NaN   NaN   NaN
d  15.0  18.0  21.0  24.0  27.0
b  30.0  33.0  36.0  39.0  42.0
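The leading NaN rows appear because a window of w elements is not yet full at the start of the data. As a hedged sketch on a made-up Series, the `min_periods` argument of `rolling` lets a window produce a result once it holds at least that many values.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Default: the first w - 1 results are NaN because the window is incomplete.
full = s.rolling(3).mean()

# With min_periods=1, partial windows at the start are averaged as well.
partial = s.rolling(3, min_periods=1).mean()

print(full.tolist())
print(partial.tolist())
```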
Correlation analysis of data

Given two variables X and Y, how can we determine whether they are correlated?

Correlation
  • As X increases, Y increases: the two variables are positively correlated.
  • As X increases, Y decreases: the two variables are negatively correlated.
  • As X increases, Y shows no consistent change: the two variables are uncorrelated.
Covariance

  • Covariance > 0: X and Y are positively correlated
  • Covariance < 0: X and Y are negatively correlated
  • Covariance = 0: X and Y are linearly uncorrelated (this alone does not imply independence)
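The sign rules can be checked with a small sketch. The series below are made up for illustration; the manual formula is the usual sample covariance (dividing by n - 1), which is also what `Series.cov()` computes by default. The last case shows that a covariance of zero only rules out a linear relationship: there `y_sym` is fully determined by `x`.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y_up = pd.Series([2.0, 4.0, 6.0, 8.0])      # rises with x
y_down = pd.Series([8.0, 6.0, 4.0, 2.0])    # falls as x rises
y_sym = (x - x.mean()) ** 2                 # depends on x, but symmetrically

def sample_cov(u, v):
    """Sample covariance computed by hand (illustrative helper)."""
    return ((u - u.mean()) * (v - v.mean())).sum() / (len(u) - 1)

print(sample_cov(x, y_up), x.cov(y_up))       # positive
print(sample_cov(x, y_down), x.cov(y_down))   # negative
print(x.cov(y_sym))                           # zero, yet not independent
```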
Pearson Correlation Coefficient

Range of r: [-1, 1]. The strength of the correlation is judged by the absolute value |r|:

  • 0.8-1.0 very strong correlation
  • 0.6-0.8 strong correlation
  • 0.4-0.6 moderate correlation
  • 0.2-0.4 weak correlation
  • 0.0-0.2 very weak correlation or no correlation
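The Pearson coefficient is the covariance divided by the product of the two standard deviations. The sketch below computes it by hand for two short yearly series and compares the result with `Series.corr()`. The `strength` helper is hypothetical: it simply encodes the bands listed above, and treating each lower bound as inclusive is an assumption.

```python
import pandas as pd

years = ["2008", "2009", "2010", "2011", "2012"]
hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=years)
m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=years)

# r = cov(X, Y) / (std(X) * std(Y)); cov and std both use ddof=1 here.
r_manual = hprice.cov(m2) / (hprice.std() * m2.std())
r_pandas = hprice.corr(m2)

def strength(r):
    """Map |r| to the qualitative bands above (hypothetical helper)."""
    a = abs(r)
    if a >= 0.8:
        return "very strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "very weak or none"

print(r_manual, r_pandas, strength(r_pandas))
```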

Applicable to Series and DataFrame types

Method     Description
.cov()     Compute the covariance matrix
.corr()    Compute the correlation coefficient matrix using the Pearson, Spearman, or Kendall coefficient
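On a DataFrame, both methods return a square matrix with one row and one column per numeric column. The sketch below uses made-up columns `x` and `y`. `corr()` also accepts a `method` argument, but to keep the example dependent on pandas alone, the Spearman coefficient is computed here as the Pearson coefficient of the ranks, which is an equivalent formulation rather than the library's own code path.

```python
import pandas as pd

# Illustrative data: y grows monotonically but not linearly with x.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [1.0, 4.0, 9.0, 16.0, 25.0]})

cov_matrix = df.cov()    # 2x2 covariance matrix
pearson = df.corr()      # 2x2 Pearson correlation matrix (the default)

# Spearman: Pearson correlation applied to the ranks of the values.
spearman = df["x"].rank().corr(df["y"].rank())

print(cov_matrix)
print(pearson)
print(spearman)
```

Because `y` is a monotonic function of `x`, the rank-based coefficient reaches 1 while the Pearson coefficient stays below 1.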

 

In [38]: import pandas as pd

In [39]: hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=['2008', '2009', '2010', '2011', '2012'])

In [40]: m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=['2008', '2009', '2010','2011', '2012'])

In [41]: hprice.corr(m2)
Out[41]: 0.5239439145220387

 
