Pandas basics, pandas

Last Update:2017-06-15 Source: Internet

Author: User



Pandas is a data analysis package built on NumPy that provides higher-level data structures and tools.

Just as NumPy is built around the ndarray, pandas is centered on two core data structures, Series and DataFrame, which correspond to a one-dimensional sequence and a two-dimensional table respectively. The examples below assume the following imports:

from pandas import Series, DataFrame
import pandas as pd

Series

Series can be seen as an ordered dictionary with a fixed length. Almost any one-dimensional data can be used to construct a Series object:

>>> s = Series([1,2,3.0,'abc'])
>>> s
0      1
1      2
2      3
3    abc
dtype: object

Although dtype: object can hold a variety of basic data types, it tends to hurt performance. It is best to keep a simple dtype.

A Series object has two main attributes, index and values, shown as the left and right columns in the preceding example. Because a list was passed to the constructor, the index is a sequence of integers increasing from 0. If a dictionary-like structure of key-value pairs is passed instead, a Series with the corresponding index-value pairs is generated. Alternatively, the index can be given explicitly with a keyword argument at construction:

>>> s = Series(data=[1,3,5,7],index = ['a','b','x','y'])
>>> s
a    1
b    3
x    5
y    7
dtype: int64
>>> s.index
Index(['a', 'b', 'x', 'y'], dtype='object')
>>> s.values
array([1, 3, 5, 7], dtype=int64)

The elements of a Series are constructed strictly according to the given index. If the data argument consists of key-value pairs, only the keys that appear in index are used, and if a key in index is missing from data, it is still added with NaN as its value.
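The rule above can be illustrated with a minimal sketch. The dictionary contents and the label 'z' are made up for illustration and do not come from the original article:

```python
import pandas as pd

# Hypothetical data: 'b' is not requested in index, 'z' is not present in data.
data = {'a': 1, 'b': 3, 'x': 5}
s = pd.Series(data, index=['a', 'x', 'z'])

# 'b' is left out, and 'z' is added with NaN as its value.
print(s)
```

Because NaN is a floating-point value, the resulting Series typically ends up with a float dtype even though the input values were integers.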

Note that although index labels correspond to the elements of values, this is not a dictionary-style mapping. The index and the values are actually stored as independent arrays, so Series objects perform well.

The biggest benefit of using a data structure such as a key-value pair is that the index is automatically aligned during arithmetic operations between Series.

In addition, both the Series object and its index have a name attribute:

>>> s.name = 'a_series'
>>> s.index.name = 'the_index'
>>> s
the_index
a            1
b            3
x            5
y            7
Name: a_series, dtype: int64

DataFrame

A DataFrame is a tabular data structure that contains an ordered collection of columns (the columns play a role similar to the index). Each column can have a different value type (unlike an ndarray, which has a single dtype). A DataFrame can be thought of as a set of Series that share the same index.

The construction method of DataFrame is similar to that of Series, except that multiple one-dimensional data sources can be accepted at the same time. Each data source is a separate column:

>>> data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
...         'year':[2000,2001,2002,2001,2002],
...         'pop':[1.5,1.7,3.6,2.4,2.9]}
>>> df = DataFrame(data)
>>> df
   pop   state  year
0  1.5   Ohino  2000
1  1.7   Ohino  2001
2  3.6   Ohino  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
[5 rows x 3 columns]

Although the data argument looks like a dictionary, its keys do not become the index of the DataFrame; they become the name attribute of each column Series. The index generated here is still the default 0 through 4.

The main DataFrame constructor parameters are DataFrame(data=None, index=None, columns=None), where columns gives the column names:

>>> df = DataFrame(data,index=['one','two','three','four','five'],
...                columns=['year','state','pop','debt'])
>>> df
       year   state  pop debt
one    2000   Ohino  1.5  NaN
two    2001   Ohino  1.7  NaN
three  2002   Ohino  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
[5 rows x 4 columns]

Missing values are again filled with NaN. Let's look at the index, the columns, and the type of a single column:

>>> df.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
>>> df.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
>>> type(df['debt'])
<class 'pandas.core.series.Series'>

The row-oriented and column-oriented operations of DataFrame are basically balanced. Any column extracted is a Series.

Re-indexing object properties

A Series is reindexed with its .reindex(index=None, **kwargs) method. Two commonly used keyword arguments are method=None and fill_value=np.nan:

>>> ser = Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
>>> a = ['a','b','c','d','e']
>>> ser.reindex(a)
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
>>> ser.reindex(a,fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
>>> ser.reindex(a,method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
>>> ser.reindex(a,fill_value=0,method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64

.reindex() returns a new object whose index strictly follows the given argument. The method parameter, one of {'backfill', 'bfill', 'pad', 'ffill', None}, specifies the interpolation (fill) method. When no method is given, missing entries are filled with fill_value, which defaults to NaN. (ffill is the same as pad and bfill is the same as backfill; they take the preceding or the following value respectively.)
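The three fill behaviours can be compared side by side in a small sketch. It assumes a recent pandas release, where the existing index has to be monotonic before method='ffill' can be used, so the Series is sorted first; the variable names are illustrative only:

```python
import pandas as pd

# Sort the index first so that forward fill is allowed.
ser = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c']).sort_index()
labels = ['a', 'b', 'c', 'd', 'e']

plain = ser.reindex(labels)                    # 'e' becomes NaN
zero = ser.reindex(labels, fill_value=0)       # 'e' becomes 0.0
ffilled = ser.reindex(labels, method='ffill')  # 'e' takes the value of 'd'

print(plain, zero, ffilled, sep='\n\n')
```

In every case ser itself is left untouched, since reindex returns a new object.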

A DataFrame is reindexed with .reindex(index=None, columns=None, **kwargs). Compared with Series, the only addition is the optional columns parameter for the column index. Usage is similar to the previous example, except that the method parameter applies only to rows, that is, axis 0.

>>> state = ['Texas','Utha','California']
>>> df.reindex(columns=state,method='ffill')
   Texas  Utha  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
[3 rows x 3 columns]
>>> df.reindex(index=['a','b','c','d'],columns=state,method='ffill')
   Texas  Utha  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
[4 rows x 3 columns]

However, fill_value still works. Astute readers may already have thought of df.T.reindex(index, method='**').T as a way to interpolate along columns, and that does work. Note that when reindex(index, method='**') is used, the index must be monotonic; otherwise a ValueError: Must be monotonic for forward fill is raised. For example, index=['a','b','d','c'] will not work.

Deleting items on a specified axis

That is, deleting an element of a Series or a row (column) of a DataFrame, using the .drop(labels, axis=0) method:

>>> ser
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
[3 rows x 3 columns]
>>> ser.drop('c')
d    4.5
b    7.2
a   -5.3
dtype: float64
>>> df.drop('a')
   Ohio  Texas  California
c     3      4           5
d     6      7           8
[2 rows x 3 columns]
>>> df.drop(['Ohio','Texas'],axis=1)
   California
a           2
c           5
d           8
[3 rows x 1 columns]

.drop() returns a new object; the original object is not changed.
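As a quick check that drop leaves its input alone, here is a self-contained sketch that rebuilds the example DataFrame; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

without_row = df.drop('a')                       # drop a row (axis=0 is the default)
without_cols = df.drop(['Ohio', 'Texas'], axis=1)  # drop two columns

# The original DataFrame still has all three rows and columns.
print(df.shape, without_row.shape, without_cols.shape)
```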

Indexing and Slicing

Like NumPy, pandas supports slicing with obj[::] and filtering with a Boolean array.

Note that the index of a pandas object is not limited to integers, and when non-integer labels are used as slice bounds, the end label is included.

>>> foo
a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64
>>> bar
0    4.5
1    7.2
2   -5.3
3    3.6
dtype: float64
>>> foo[:2]
a    4.5
b    7.2
dtype: float64
>>> bar[:2]
0    4.5
1    7.2
dtype: float64
>>> foo[:'c']
a    4.5
b    7.2
c   -5.3
dtype: float64

Here foo and bar differ only in their index: the index of bar is an integer sequence. Slicing with integers behaves by default like a Python list or NumPy array, excluding the end point, while slicing with a string label such as 'c' includes that boundary element.
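The difference between the two kinds of slice can be made explicit with .iloc (position-based) and .loc (label-based). This is a minimal sketch using the same values as foo; the explicit accessors are a choice made here, not something the original text uses:

```python
import pandas as pd

foo = pd.Series([4.5, 7.2, -5.3, 3.6], index=list('abcd'))

by_position = foo.iloc[:2]   # positions 0 and 1, end point excluded
by_label = foo.loc[:'c']     # labels 'a' through 'c', end point included

print(by_position)
print(by_label)
```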

Another special case is indexing a DataFrame, because it has an index on each of its two axes.

The standard slicing syntax for a DataFrame can be thought of as .ix[::,::]. (.ix was deprecated in pandas 0.20 and removed in 1.0, where .loc and .iloc take its place.) The ix object accepts two sets of slices, one for the rows (axis = 0) and one for the columns (axis = 1):

>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
[3 rows x 3 columns]
>>> df.ix[:2,:2]
   Ohio  Texas
a     0      1
c     3      4
[2 rows x 2 columns]
>>> df.ix['a','Ohio']
0
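On pandas releases that no longer provide .ix, the two selections above can be sketched with .iloc for positions and .loc for labels. The DataFrame is rebuilt here so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

top_left = df.iloc[:2, :2]    # first two rows and first two columns, by position
single = df.loc['a', 'Ohio']  # one cell, by row label and column label

print(top_left)
print(single)
```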

Without ix, direct indexing with [] behaves in a special way:

When indexing with a label, columns are selected
When slicing, rows are selected

This seems a bit illogical, but the author explains that "this syntax setting comes from practice" and we trust him.

>>> df['Ohio']
a    0
c    3
d    6
Name: Ohio, dtype: int32
>>> df[:'c']
   Ohio  Texas  California
a     0      1           2
c     3      4           5
[2 rows x 3 columns]
>>> df[:2]
   Ohio  Texas  California
a     0      1           2
c     3      4           5
[2 rows x 3 columns]

When using a Boolean array, note the different ways of selecting rows and columns (when selecting columns, the leading : cannot be omitted):

>>> df['Texas']>=4
a    False
c     True
d     True
Name: Texas, dtype: bool
>>> df[df['Texas']>=4]
   Ohio  Texas  California
c     3      4           5
d     6      7           8
[2 rows x 3 columns]
>>> df.ix[:,df.ix['c']>=4]
   Texas  California
a      1           2
c      4           5
d      7           8
[3 rows x 2 columns]
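The same Boolean row and column selections can be written with .loc, which is one way to express them on pandas releases without .ix. A minimal sketch, with the DataFrame rebuilt so it is self-contained:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

# Row selection: the mask is indexed by row labels.
rows = df[df['Texas'] >= 4]

# Column selection: the mask is indexed by column labels, and the
# leading ':' (all rows) has to be written out.
cols = df.loc[:, df.loc['c'] >= 4]

print(rows)
print(cols)
```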

Arithmetic Operations and Data Alignment

One of the most important features of pandas is that it can perform arithmetic on objects with different indexes. When objects are added, the index of the result is the union of the two indexes. Automatic data alignment introduces missing values, NaN by default, at labels that do not overlap.

>>> foo = Series({'a':1,'b':2})
>>> foo
a    1
b    2
dtype: int64
>>> bar = Series({'b':3,'d':4})
>>> bar
b    3
d    4
dtype: int64
>>> foo + bar
a   NaN
b     5
d   NaN
dtype: float64

The alignment operation of DataFrame occurs simultaneously on rows and columns.

If you do not want NA values in the result, you can supply a fill_value, like the one used with reindex earlier. To pass it, use the arithmetic method rather than the operator: df1.add(df2, fill_value=0). The other arithmetic methods are sub(), div(), and mul().
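A short sketch of the difference between the operator and the method, reusing the foo and bar values from the alignment example:

```python
import pandas as pd

foo = pd.Series({'a': 1, 'b': 2})
bar = pd.Series({'b': 3, 'd': 4})

plain = foo + bar                     # non-overlapping labels become NaN
filled = foo.add(bar, fill_value=0)   # a missing operand is treated as 0

print(plain)
print(filled)
```

With fill_value=0, a label that exists on only one side keeps that side's value instead of turning into NaN.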

Arithmetic Operations between Series and DataFrame involve broadcasting.
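Broadcasting here means that the Series is matched against one axis of the DataFrame and repeated along the other. The following sketch, with illustrative variable names, shows both directions:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

# The operator matches the Series index against the columns
# and broadcasts down the rows.
by_row = df - df.loc['a']

# To match against the row index instead, use the method form with axis=0.
by_col = df.sub(df['Ohio'], axis=0)

print(by_row)
print(by_col)
```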

Function Application and Mapping

NumPy's ufuncs (element-wise array functions) can also operate on pandas objects.

To apply a function to each row or column of a DataFrame, use the .apply(func, axis=0, args=(), **kwds) method.

>>> f = lambda x:x.max()-x.min()
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
[3 rows x 3 columns]
>>> df.apply(f)
Ohio          6
Texas         6
California    6
dtype: int64
>>> df.apply(f,axis=1)
a    2
c    2
d    2
dtype: int64
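For instance, np.abs and np.sqrt can be applied directly to a DataFrame and return a DataFrame with the same labels. The data below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the results are easy to read.
df = pd.DataFrame([[-1.5, 4.0], [9.0, -16.0]],
                  index=['a', 'b'], columns=['x', 'y'])

abs_df = np.abs(df)        # element-wise absolute value
root_df = np.sqrt(abs_df)  # element-wise square root

print(abs_df)
print(root_df)
```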

Sorting and ranking

The sort_index(ascending=True) method of Series sorts by index. The ascending parameter selects ascending or descending order and defaults to ascending.

To sort a Series by value, use the .order() method (replaced by .sort_values() in later pandas releases). By default, all missing values are placed at the end of the Series.

On a DataFrame, .sort_index(axis=0, by=None, ascending=True) adds an axis parameter and a by parameter. The by parameter sorts by the values of one or more columns and cannot be used for rows. (Later pandas releases moved this to .sort_values(by=...).)

>>> df.sort_index(by='Ohio')
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
[3 rows x 3 columns]
>>> df.sort_index(by=['California','Texas'])
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
[3 rows x 3 columns]
>>> df.sort_index(axis=1)
   California  Ohio  Texas
a           2     0      1
c           5     3      4
d           8     6      7
[3 rows x 3 columns]
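On recent pandas releases, sorting by value goes through sort_values, while sort_index is reserved for sorting by labels. A minimal sketch under that assumption, with illustrative variable names:

```python
import pandas as pd

ser = pd.Series([3, 2, 0, 3], index=list('abcd'))
df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

by_value = ser.sort_values()                           # Series sorted by value
by_column = df.sort_values(by='Ohio', ascending=False)  # rows ordered by one column
by_labels = df.sort_index(axis=1)                      # columns ordered by label

print(by_value)
print(by_column)
print(by_labels)
```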

Ranking (Series.rank(method='average', ascending=True)) replaces each value with its rank, from 1 to n. The only question is how to handle ties, which is what the method parameter is for. The options covered here are average, min, max, and first.

>>> ser=Series([3,2,0,3],index=list('abcd'))
>>> ser
a    3
b    2
c    0
d    3
dtype: int64
>>> ser.rank()
a    3.5
b    2.0
c    1.0
d    3.5
dtype: float64
>>> ser.rank(method='min')
a    3
b    2
c    1
d    3
dtype: float64
>>> ser.rank(method='max')
a    4
b    2
c    1
d    4
dtype: float64
>>> ser.rank(method='first')
a    3
b    2
c    1
d    4
dtype: float64

Note how the different method values rank the tied items ser[0] = ser[3].

DataFrame.rank(axis=0, method='average', ascending=True) adds an axis parameter, so you can rank down columns or across rows. There does not currently appear to be a method that ranks all elements together.
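The effect of the axis parameter can be seen in a small sketch. The data is made up for illustration and includes one tie so that the default average method is visible:

```python
import pandas as pd

# Hypothetical values; column 'y' contains a tie.
df = pd.DataFrame({'x': [3, 1, 2], 'y': [1, 1, 5]}, index=list('abc'))

down_columns = df.rank()        # axis=0: rank within each column
across_rows = df.rank(axis=1)   # axis=1: rank within each row

print(down_columns)
print(across_rows)
```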

Statistical methods

Pandas objects have some statistical methods. Most of them belong to reduction and summary statistics, which are used to extract a single value from Series or extract a Series from a row or column of DataFrame.

An example is DataFrame.mean(axis=0, skipna=True). NA values in the data are skipped, unless the entire slice (row or column) is NA. To turn this behavior off, pass skipna=False:

>>> df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
[4 rows x 2 columns]
>>> df.mean()
one    3.083333
two   -2.900000
dtype: float64
>>> df.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>> df.mean(axis=1,skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

Other common statistical methods include:

Method	Description
count	Number of non-NA values
describe	Summary statistics for a Series or for each DataFrame column
min, max	Minimum and maximum
argmin, argmax	Integer positions of the minimum and maximum values
idxmin, idxmax	Index labels of the minimum and maximum values
quantile	Sample quantile (0 to 1)
sum	Sum
mean	Mean
median	Median
mad	Mean absolute deviation from the mean
var	Variance
std	Standard deviation
skew	Skewness of sample values (third moment)
kurt	Kurtosis of sample values (fourth moment)
cumsum	Cumulative sum of sample values
cummin, cummax	Cumulative minimum and cumulative maximum of sample values
cumprod	Cumulative product of sample values
diff	First-order difference (useful for time series)
pct_change	Percentage change
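A few of these methods applied to the DataFrame from the mean example, as a sketch. It shows a reduction that counts, a reduction that returns labels, and a cumulative method that keeps the original shape:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.4, 7.1, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=list('abcd'))

counts = df.count()     # non-NA values per column
labels = df.idxmax()    # index label of the maximum in each column
running = df.cumsum()   # cumulative sum down each column, NaN stays NaN

print(counts)
print(labels)
print(running)
```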

Handling Missing Data

In pandas, NA is mainly represented by np.nan. Python's built-in None is also treated as NA.

There are four methods for handling NA: dropna, fillna, isnull, and notnull.

isnull, notnull

These methods are applied element-wise and return a Boolean array, which can be used for Boolean indexing.

dropna

For a Series, dropna returns a Series containing only non-null data and index values.

The complication is with DataFrame, because dropping always removes at least a whole row (or column). The solution is, as before, an extra set of parameters: dropna(axis=0, how='any', thresh=None). The how parameter is either 'any' or 'all'; with 'all', a row (column) is dropped only when every element in it is NA. Another useful parameter is thresh, an integer: with thresh=3, for example, a row is kept only if it has at least three non-NA values.
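The three variants can be compared on a small DataFrame. The data below is made up for illustration, and how and thresh are passed separately rather than together:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 2 is entirely NA.
df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])

any_na = df.dropna()              # drop rows containing any NA
all_na = df.dropna(how='all')     # drop only rows that are entirely NA
at_least_two = df.dropna(thresh=2)  # keep rows with at least two non-NA values

print(any_na)
print(all_na)
print(at_least_two)
```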

fillna

With fillna(value=None, method=None, axis=0), value can be a scalar or a dictionary that fills different columns with different values. The method parameter works the same way as in .reindex().
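A minimal sketch of filling with a dictionary, on made-up data. For forward filling it uses the .ffill() method, which is a choice made here because recent pandas releases prefer it over passing method to fillna:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 5.0, np.nan]})

per_column = df.fillna({'one': 0, 'two': -1})  # a different fill value per column
forward = df.ffill()                           # propagate the last valid value down

print(per_column)
print(forward)
```

A leading NaN has nothing before it to propagate, so it stays NaN after a forward fill.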

Inplace Parameter

One point was left out earlier, and after writing out all the examples it turned out to be important. Among the methods of Series and DataFrame objects, any method that modifies the data and returns a new object usually has an inplace=False parameter. If it is set to True, the original object is modified in place instead.
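The difference can be sketched with dropna, one of the methods that accepts inplace. The data and variable names are illustrative only:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])

cleaned = ser.dropna()        # default inplace=False: returns a new Series
length_before = len(ser)      # ser itself still has all three entries

returned = ser.dropna(inplace=True)  # modifies ser and returns None
length_after = len(ser)

print(length_before, length_after, returned)
```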

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
