Python Data Analysis Package: Pandas basics

Source: Internet
Author: User

Pandas is a data analysis package built on NumPy that provides more advanced data structures and tools.

The core of NumPy is the ndarray; pandas likewise revolves around two core data structures, Series and DataFrame, which correspond to a one-dimensional sequence and a two-dimensional table, respectively. The conventional way to import pandas is:

from pandas import Series, DataFrame
import pandas as pd

Series

A Series can be seen as a fixed-length, ordered dictionary. Almost any one-dimensional data can be used to construct a Series object:

>>> s = Series([1, 2, 3.0, 'abc'])
>>> s
0      1
1      2
2    3.0
3    abc
dtype: object

Although a Series with dtype: object can hold a mix of basic data types, this generally hurts performance, so it is best to keep a simple, uniform dtype.

A Series object has two main attributes, index and values, which are the left and right columns in the example above. Because a list was passed to the constructor, the index is a sequence of integers counting up from 0. If a dictionary is passed instead, a Series is built whose index comes from the keys and whose values come from the corresponding dictionary values. Alternatively, the index can be given explicitly with the index keyword argument at construction time:

>>> s = Series(data=[1, 3, 5, 7], index=['a', 'b', 'x', 'y'])
>>> s
a    1
b    3
x    5
y    7
dtype: int64
>>> s.index
Index(['a', 'b', 'x', 'y'], dtype='object')
>>> s.values
array([1, 3, 5, 7], dtype=int64)

The elements of a Series object are built strictly according to the given index. This means that if the data parameter is a dictionary, only the keys that appear in index are used, and if a label in index has no corresponding key in data, the label is still added, with NaN as its value.
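As a minimal sketch of this rule (the dictionary contents and variable names are made up for illustration), building a Series from a dictionary together with an explicit index keeps only the listed keys and fills unmatched labels with NaN:

```python
import pandas as pd

# Hypothetical data with keys 'a', 'b' and 'c'
data = {'a': 1, 'b': 3, 'c': 5}

# 'c' is not in the requested index, so it is left out;
# 'z' has no key in data, so it is added with the value NaN.
s = pd.Series(data, index=['a', 'b', 'z'])
print(s)
```

Because NaN is a floating-point value, the resulting Series has a float dtype even though the dictionary held integers.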

Note that although the index of a Series and the elements of values correspond to each other, this is not the same as a dictionary mapping. The index and the values are still stored as separate arrays, so a Series object performs well.

The greatest benefit of this key-value design is that indexes are aligned automatically when arithmetic is performed between Series.

In addition, both the Series object and its index have a name attribute:

>>> s.name = 'a_series'
>>> s.index.name = 'the_index'
>>> s
the_index
a    1
b    3
x    5
y    7
Name: a_series, dtype: int64

DataFrame

A DataFrame is a tabular data structure that contains an ordered set of columns (which work much like an index), each of which can have a different value type (unlike an ndarray, which can have only one dtype). You can basically think of a DataFrame as a collection of Series that share the same index.

DataFrame is constructed in a similar way to Series, except that it can accept multiple one-dimensional data sources at the same time, and each one becomes a separate column:

>>> data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
...         'year': [2000, 2001, 2002, 2001, 2002],
...         'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
>>> df = DataFrame(data)
>>> df
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

[5 rows x 3 columns]

Although the data argument is a dictionary, its keys do not become the index of the DataFrame; they become the column names, that is, the name attribute of each column Series. The index generated here is still the default 0 to 4.

The fuller form of the DataFrame constructor is DataFrame(data=None, index=None, columns=None), where columns gives the column names:

>>> df = DataFrame(data, index=['one', 'two', 'three', 'four', 'five'],
...                columns=['year', 'state', 'pop', 'debt'])
>>> df
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

[5 rows x 4 columns]

As before, missing values are filled with NaN. Now take a look at the index, the columns, and the type of a single column:

>>> df.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
>>> df.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
>>> type(df['debt'])
<class 'pandas.core.series.Series'>

Row-oriented and column-oriented operations on a DataFrame are largely symmetric, and any single column extracted from it is a Series.
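The construction rules described in this section can be sketched as follows. This is a shortened, illustrative version of the data used above, not the full example; the variable names are chosen here for the sketch.

```python
import pandas as pd

data = {'state': ['Ohio', 'Nevada'],
        'year': [2001, 2002],
        'pop': [1.7, 2.9]}

# columns selects and orders the columns; 'debt' is not a key of
# data, so it is created and filled with NaN.
df = pd.DataFrame(data, index=['one', 'two'],
                  columns=['year', 'state', 'pop', 'debt'])

col = df['state']           # a single column is a Series
print(type(col).__name__)   # class name of the extracted column
print(col.name)             # the column label becomes the Series name
```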

Object Properties

Re-indexing

Re-indexing a Series object is done with its .reindex(index=None, **kwargs) method. Two parameters are commonly passed through **kwargs: method=None and fill_value=np.NaN.

>>> ser = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
>>> a = ['a', 'b', 'c', 'd', 'e']
>>> ser.reindex(a)
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
>>> ser.reindex(a, fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
>>> ser.reindex(a, method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
>>> ser.reindex(a, fill_value=0, method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64

The .reindex() method returns a new object whose index strictly follows the given argument. The method parameter, one of {'backfill', 'bfill', 'pad', 'ffill', None}, specifies the interpolation (filling) method; when it is not given, missing entries are filled with fill_value, which defaults to NaN. (ffill is the same as pad, and bfill is the same as backfill; they fill a missing entry from the preceding value or from the following value, respectively.)
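A minimal sketch of the three filling behaviours. The Series here is deliberately built with a sorted index, since method-based filling expects a monotonic index; the variable names are illustrative.

```python
import pandas as pd

ser = pd.Series([-5.3, 7.2, 3.6, 4.5], index=['a', 'b', 'c', 'd'])
new_index = ['a', 'b', 'c', 'd', 'e']

plain = ser.reindex(new_index)                    # 'e' becomes NaN
zeroed = ser.reindex(new_index, fill_value=0)     # 'e' becomes 0.0
filled = ser.reindex(new_index, method='ffill')   # 'e' copies 'd'

print(filled)
```

In each case ser itself is left untouched; reindex always hands back a new object.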

The re-indexing method of a DataFrame object is .reindex(index=None, columns=None, **kwargs). Compared with Series, it has one more optional parameter, columns, for the column index. Usage is similar to the previous example, except that the interpolation parameter method applies only to rows, that is, axis 0.

>>> state = ['Texas', 'Utah', 'California']
>>> df.reindex(columns=state, method='ffill')
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[3 rows x 3 columns]
>>> df.reindex(index=['a', 'b', 'c', 'd'], columns=state, method='ffill')
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[4 rows x 3 columns]

fill_value, however, still works on both axes. A sharp reader may have guessed that interpolation along the columns can be achieved with df.T.reindex(index, method='**').T, and indeed it can. Also note that when reindex(index, method='**') is used, index must be monotonic, otherwise ValueError: Must be monotonic for forward fill is raised. For example, the last call in the previous example would fail if index=['a', 'b', 'd', 'c'] were used.
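The row and column cases can be sketched separately. The DataFrame below is an assumed reconstruction of the Ohio/Texas/California table used in this article (values 0 to 8), and the sketch keeps method on the rows and fill_value on the columns so that it does not depend on version-specific behaviour.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

# Row 'b' does not exist, so it is forward-filled from row 'a'.
rows = df.reindex(index=['a', 'b', 'c', 'd'], method='ffill')

# 'Utah' is a new column; fill_value supplies its contents.
cols = df.reindex(columns=['Texas', 'Utah', 'California'], fill_value=0)

print(rows)
print(cols)
```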

Delete an item on a specified axis

That is, deleting elements of a Series or rows (columns) of a DataFrame, using the object's .drop(labels, axis=0) method:

>>> ser
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> ser.drop('c')
d    4.5
b    7.2
a   -5.3
dtype: float64
>>> df.drop('a')
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.drop(['Ohio', 'Texas'], axis=1)
   California
a           2
c           5
d           8

[3 rows x 1 columns]

.drop() returns a new object; the original object is not changed.
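A self-contained sketch of the same calls, which also checks that the originals are left alone. The DataFrame is an assumed reconstruction of the table shown above.

```python
import numpy as np
import pandas as pd

ser = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

ser_small = ser.drop('c')                      # drop one element
df_rows = df.drop('a')                         # drop a row (axis=0)
df_cols = df.drop(['Ohio', 'Texas'], axis=1)   # drop two columns

print('c' in ser.index)   # the original Series still has 'c'
print(df.shape)           # the original DataFrame keeps its shape
```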

Indexes and slices

Like NumPy, pandas supports indexing and slicing with obj[::], as well as filtering with a Boolean array.

Note, however, that because the index of a pandas object is not limited to integers, a slice that uses non-integer labels includes its end point.

>>> foo
a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64
>>> bar
0    4.5
1    7.2
2   -5.3
3    3.6
dtype: float64
>>> foo[:2]
a    4.5
b    7.2
dtype: float64
>>> bar[:2]
0    4.5
1    7.2
dtype: float64
>>> foo[:'c']
a    4.5
b    7.2
c   -5.3
dtype: float64

Here foo and bar differ only in their index; the index of bar is an integer sequence. As you can see, slicing with integers gives the same result as the default behaviour of a Python list or NumPy array, whereas slicing with a string label such as 'c' includes the boundary element in the result.
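The difference between the two kinds of slice can be made explicit with the .iloc (position-based) and .loc (label-based) accessors. This sketch uses them instead of bare brackets so that the intent is unambiguous:

```python
import pandas as pd

foo = pd.Series([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])

by_position = foo.iloc[:2]   # positions 0 and 1; the end is excluded
by_label = foo.loc[:'c']     # labels 'a' through 'c'; the end is included

print(by_position)
print(by_label)
```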

Another special case is the way a DataFrame object is indexed, because it has two axes (it is doubly indexed).

Think of it this way: the standard slicing syntax for DataFrame objects is .ix[::,::]. The ix object accepts two slices, one along the rows (axis=0) and one along the columns (axis=1):

>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.ix[:2, :2]
   Ohio  Texas
a     0      1
c     3      4

[2 rows x 2 columns]
>>> df.ix['a', 'Ohio']
0

Without ix, indexing the DataFrame directly is a special case:

    • When indexing with a label, a column is selected
    • When slicing, rows are selected

This may seem illogical, but the author explains that "this syntax came out of practice", and we take his word for it.

>>> df['Ohio']
a    0
c    3
d    6
Name: Ohio, dtype: int32
>>> df[:'c']
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
>>> df[:2]
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]

When using a Boolean array, note the different ways of slicing rows and columns (for column slicing, the : cannot be omitted):

>>> df['Texas'] >= 4
a    False
c     True
d     True
Name: Texas, dtype: bool
>>> df[df['Texas'] >= 4]
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.ix[:, df.ix['c'] >= 4]
   Texas  California
a      1           2
c      4           5
d      7           8

[3 rows x 2 columns]
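Current pandas releases no longer provide the ix accessor; the same selections are written with .iloc (by position) and .loc (by label). A hedged sketch of the equivalents, using an assumed reconstruction of the table above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

corner = df.iloc[:2, :2]             # first two rows, first two columns
cell = df.loc['a', 'Ohio']           # a single value selected by labels
column = df['Ohio']                  # plain indexing selects a column
rows = df[df['Texas'] >= 4]          # a Boolean array selects rows
cols = df.loc[:, df.loc['c'] >= 4]   # for columns, the ':' is required

print(rows)
print(cols)
```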

Arithmetic operations and data alignment

One of the most important features of pandas is that it can perform arithmetic on objects with different indexes. When objects are added, the index of the result is the union of the two indexes. Automatic data alignment introduces missing values, NaN by default, at the labels that do not overlap.

>>> foo = Series({'a': 1, 'b': 2})
>>> foo
a    1
b    2
dtype: int64
>>> bar = Series({'b': 3, 'd': 4})
>>> bar
b    3
d    4
dtype: int64
>>> foo + bar
a   NaN
b     5
d   NaN
dtype: float64

DataFrame alignment operations occur on both rows and columns.

When you do not want NA values to appear in the result, you can use the fill_value argument mentioned earlier for reindex. To pass this argument, however, you need to call the object's method instead of using the operator: df1.add(df2, fill_value=0). The other arithmetic methods are sub(), div(), and mul().
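A minimal sketch contrasting the operator with the method form, shown on Series for brevity (the same add call exists on DataFrame):

```python
import pandas as pd

foo = pd.Series({'a': 1, 'b': 2})
bar = pd.Series({'b': 3, 'd': 4})

total = foo + bar                     # NaN where only one side has the label
padded = foo.add(bar, fill_value=0)   # the missing side is treated as 0

print(total)
print(padded)
```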

Arithmetic between a Series and a DataFrame involves broadcasting, which is not covered here.

Function Application and Mapping

NumPy's ufuncs (element-wise array functions) can also be applied to pandas objects.

To apply a function to each row or column of a DataFrame object, use the .apply(func, axis=0, args=(), **kwds) method.

>>> f = lambda x: x.max() - x.min()
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.apply(f)
Ohio          6
Texas         6
California    6
dtype: int64
>>> df.apply(f, axis=1)
a    2
c    2
d    2
dtype: int64
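The two mechanisms in this section, a ufunc applied element-wise and apply along an axis, can be sketched together. The helper name value_range is chosen here for illustration, and the DataFrame is an assumed reconstruction of the table above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

def value_range(x):
    """Spread between the largest and smallest value of a Series."""
    return x.max() - x.min()

per_column = df.apply(value_range)         # one result per column
per_row = df.apply(value_range, axis=1)    # one result per row
roots = np.sqrt(df)                        # a ufunc works element-wise

print(per_column)
print(per_row)
```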

Sort and rank

The sort_index(ascending=True) method of a Series sorts by index; the ascending parameter controls ascending or descending order and defaults to ascending.

To sort a Series by value, use the .order() method; any missing values are placed at the end of the Series by default.

On a DataFrame, the .sort_index(axis=0, by=None, ascending=True) method adds an axis parameter and a by parameter; by sorts according to one or more columns (it cannot be used on rows):

>>> df.sort_index(by='Ohio')
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(by=['California', 'Texas'])
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(axis=1)
   California  Ohio  Texas
a           2     0      1
c           5     3      4
d           8     6      7

[3 rows x 3 columns]
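In current pandas releases, sorting by value is done with sort_values, which took over the role of Series.order() and of the by argument of sort_index. A hedged sketch with a deliberately shuffled table (the data is made up so that the sorting is visible):

```python
import pandas as pd

df = pd.DataFrame({'Ohio': [6, 0, 3],
                   'Texas': [7, 1, 4],
                   'California': [8, 2, 5]},
                  index=['d', 'a', 'c'])

by_index = df.sort_index()             # rows ordered by label
by_columns = df.sort_index(axis=1)     # columns ordered by name
by_value = df.sort_values(by='Ohio', ascending=False)

ser = pd.Series([3.0, None, 1.0])
ordered = ser.sort_values()            # the missing value goes last

print(by_value)
print(ordered)
```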

Ranking ( Series.rank(method='average', ascending=True) ) differs from sorting in that it replaces each value of the object with its rank (from 1 to N). The only question is how to handle ties, and that is what the method parameter is for. It has four possible values: average, min, max, and first.

>>> ser = Series([3, 2, 0, 3], index=list('abcd'))
>>> ser
a    3
b    2
c    0
d    3
dtype: int64
>>> ser.rank()
a    3.5
b    2.0
c    1.0
d    3.5
dtype: float64
>>> ser.rank(method='min')
a    3
b    2
c    1
d    3
dtype: float64
>>> ser.rank(method='max')
a    4
b    2
c    1
d    4
dtype: float64
>>> ser.rank(method='first')
a    3
b    2
c    1
d    4
dtype: float64

Note how the different method values rank the tied pair ser[0] == ser[3] differently.
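The four tie-breaking rules, collected in one runnable sketch with the same data as above:

```python
import pandas as pd

ser = pd.Series([3, 2, 0, 3], index=list('abcd'))

average = ser.rank()                # ties share the mean of their ranks
lowest = ser.rank(method='min')     # ties all take the lowest rank
highest = ser.rank(method='max')    # ties all take the highest rank
first = ser.rank(method='first')    # ties are ranked in order of appearance

print(pd.DataFrame({'average': average, 'min': lowest,
                    'max': highest, 'first': first}))
```

Putting the results side by side in a DataFrame makes it easy to compare the treatment of 'a' and 'd', the two tied entries.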

The DataFrame .rank(axis=0, method='average', ascending=True) method adds an axis parameter, so you can rank along rows or along columns; there does not currently seem to be a method that ranks across all elements.

Statistical methods

pandas objects come with a set of statistical methods. Most of them are reductions and summary statistics, used to extract a single value from a Series, or a Series from the rows or columns of a DataFrame.

Take DataFrame.mean(axis=0, skipna=True) as an example: when the data contains NA values, they are simply skipped, unless the entire slice (row or column) is NA. If you do not want this behaviour, disable it with skipna=False:

>>> df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

[4 rows x 2 columns]
>>> df.mean()
one    3.083333
two   -2.900000
dtype: float64
>>> df.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>> df.mean(axis=1, skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
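A self-contained version of the example, rebuilding the table from the values shown in the output above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.40, 7.10, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

col_means = df.mean()                     # NA skipped within each column
row_means = df.mean(axis=1)               # NA skipped within each row
strict = df.mean(axis=1, skipna=False)    # any NA makes the result NaN

print(row_means)
print(strict)
```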

Other commonly used statistical methods are:

Method          Description
count           Number of non-NA values
describe        Summary statistics for a Series or for each DataFrame column
min, max        Minimum and maximum values
argmin, argmax  Index positions (integers) of the minimum and maximum values
idxmin, idxmax  Index labels of the minimum and maximum values
quantile        Sample quantile (from 0 to 1)
sum             Sum of values
mean            Mean of values
median          Median of values
mad             Mean absolute deviation from the mean
var             Variance
std             Standard deviation
skew            Skewness of the sample values (third moment)
kurt            Kurtosis of the sample values (fourth moment)
cumsum          Cumulative sum of the sample values
cummin, cummax  Cumulative minimum and cumulative maximum of the sample values
cumprod         Cumulative product of the sample values
diff            First-order difference (useful for time series)
pct_change      Percent change

Processing missing data

In pandas, NA is mainly represented by np.nan; Python's built-in None is also treated as NA.

There are four methods for handling NA: dropna, fillna, isnull, and notnull.

isnull and notnull

This pair of methods is applied element-wise to the object and returns a Boolean array, which is typically used for Boolean indexing.

dropna

For a Series, dropna returns a Series that contains only the non-null data and their index values.

The question is how to handle a DataFrame, because dropping means losing at least one whole row (or column). The solution is similar to before, again through extra parameters: dropna(axis=0, how='any', thresh=None). The how parameter can be 'any' or 'all'; 'all' drops a row (column) only if every element in that slice is NA. Another useful parameter is thresh, an integer: with thresh=3, for example, a row is kept only if it has at least 3 non-NA values.
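The how and thresh options, together with notnull as a Boolean index, can be sketched on a small made-up table (the column names x, y, z are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0],
                   'y': [2.0, 5.0, np.nan, 6.0],
                   'z': [3.0, np.nan, np.nan, 7.0]})

any_na = df.dropna()             # drop rows with at least one NA
all_na = df.dropna(how='all')    # drop only rows that are entirely NA
enough = df.dropna(thresh=3)     # keep rows with at least 3 non-NA values
mask = df['x'].notnull()         # Boolean Series, usable as an index

print(all_na)
print(df[mask])
```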

fillna

fillna(value=None, method=None, axis=0): besides a scalar, the value parameter accepts a dictionary, which lets different columns be filled with different values. The method parameter works the same way as in .reindex() above, so it is not repeated here.
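A short sketch of scalar and per-column filling. For forward filling it uses the ffill() method, which in recent pandas releases is the preferred spelling of method-based filling; the data and the fill values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 5.0, np.nan]})

same = df.fillna(0)                              # one value everywhere
per_column = df.fillna({'one': -1, 'two': 99})   # a value per column
carried = df.ffill()                             # copy the previous row's value

print(per_column)
print(carried)
```

Note that forward filling cannot fill a gap in the first row, since there is nothing before it to copy.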

The inplace parameter

One point was left out earlier, and after writing all the examples it turned out to be quite important. Methods of Series and DataFrame objects that modify the data and return a new object usually have an optional parameter inplace=False. If it is set to True, the original object is modified instead.
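A minimal sketch of the in-place option, using dropna as the example method (the keyword in pandas is inplace). With inplace=True the method changes the object itself and returns None:

```python
import pandas as pd

ser = pd.Series([1.0, None, 3.0])

copy = ser.dropna()                 # default: new object, ser untouched
length_before = len(ser)            # still 3 at this point

result = ser.dropna(inplace=True)   # modifies ser itself

print(result)     # the in-place call has no return value
print(len(ser))   # ser has lost its missing entry
```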
