"Python Data Analysis" Note--pandas

Source: Internet
Author: User
Tags: instance method, scalar

Pandas

Pandas is a popular open-source Python project whose name comes from "panel data" and "Python data analysis".

Pandas has two important data structures: DataFrame and Series.

The pandas DataFrame data structure

The pandas DataFrame data structure is a labeled two-dimensional object, very similar to an Excel spreadsheet or a relational database table.

You can create a DataFrame in the following ways (a short sketch follows the list):

1. Create a DataFrame from another DataFrame

2. Generate a DataFrame from a NumPy array of two-dimensional shape, or from an array of structured records

3. Similarly, a DataFrame can be created from pandas's other data structure, Series, which is covered later in this article

4. A DataFrame can also be generated from a file, such as a CSV file
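As a brief illustration, here is a minimal sketch of a few of these creation methods; the array contents and column names are illustrative, and only the CSV file name is taken from this article:

import numpy as np
import pandas as pd

# From a two-dimensional NumPy array (column names are illustrative)
arr = np.arange(6).reshape(2, 3)
df_from_array = pd.DataFrame(arr, columns=["a", "b", "c"])

# From another DataFrame (a new frame over the same data)
df_copy = pd.DataFrame(df_from_array)

# From a CSV file, such as the one used later in this article
# df_from_csv = pd.read_csv("who_first9cols.csv")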

Examining the pandas DataFrame and its properties

(1) First, load the data file into a DataFrame and display its contents:

from pandas.io.parsers import read_csv

df = read_csv("who_first9cols.csv")
print("DataFrame", df)

(2) A DataFrame has a shape attribute that holds the DataFrame's dimensions as a tuple, much like ndarray, and we can query the number of rows with len():

Print ("shape", Df.shape) print ("Length" , Len (DF))


(3) Next, use other attributes to examine the column headers and the data type of each column:

Print ("Column Headers", Df.columns) print ("Data types" , Df.dtypes)


(4) A pandas DataFrame has an index, similar to the primary key of a table in a relational database. We can either specify the index manually or let pandas create one automatically. The index is accessed through the corresponding attribute:

Print ("Index", Df.index)


(5) Sometimes we want to traverse the underlying data of a DataFrame. Traversing column values with pandas iterators can be very inefficient; a better solution is to extract the values as the underlying NumPy array and process them there. A DataFrame attribute helps us here:

Print ("Values", Df.values)


The pandas Series data structure

The pandas Series data structure is a one-dimensional, labeled array that can hold elements of different types. You can create a pandas Series in the following ways:

1. Create a Series from a Python dictionary

2. Create a Series from a NumPy array

3. Create a Series from a single scalar value

When you create a Series, you can pass the constructor a set of axis labels, commonly called an index; it is an optional parameter. By default, if you use a NumPy array as the input data, pandas numbers the index incrementally from 0. If the data passed to the constructor is a Python dictionary, the sorted dictionary keys become the index. If the input data is a scalar value, you must provide an index, and the scalar is repeated for each entry of the index. The interfaces of pandas's Series and DataFrame borrow their characteristics and behavior from NumPy arrays and Python dictionaries.
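A minimal sketch of the three creation methods and the index behavior just described (all values are illustrative):

import numpy as np
import pandas as pd

# From a dictionary: the keys become the index
s_dict = pd.Series({"b": 2, "a": 1})

# From a NumPy array: pandas numbers the index from 0 by default
s_array = pd.Series(np.arange(3))

# From a scalar: an explicit index is required; the scalar is repeated per label
s_scalar = pd.Series(9, index=["x", "y", "z"])

print(s_dict, s_array, s_scalar, sep="\n")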

(1) First, select the first column of the input file, the country column, then show the type of the object in the local scope:

country_col = df["country"]
print("Type df", type(df))
print("Type country col", type(country_col))


(2) The pandas Series data structure not only shares some attributes with the DataFrame, but also provides a name attribute:

Print ("series shape", Country_col.shape) print ("series Index " , Country_col.index) print("Series Values", country_col.values)print ("  Series name", Country_col.name)


(3) To demonstrate Series slicing, take the last two countries in the country column as an example:

Print ("last 2 countries", country_col[-2:]) print ("last2 Countries type", type (country_col[-2:]))


(4) NumPy functions also work on pandas's DataFrame and Series data structures.


You can perform various kinds of numeric operations between DataFrames, Series, and NumPy arrays.
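For instance, a NumPy universal function can be applied directly to a Series, and a Series can participate in element-wise arithmetic with an ndarray (a small illustrative sketch, not from the original article):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# A NumPy ufunc applied to a Series returns a Series
print(np.sqrt(s))

# Element-wise arithmetic between a Series and an ndarray
print(s + np.ones(3))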

Querying data with pandas
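The examples in this section query a DataFrame named sunspots that has a date index, but the article never shows how it is loaded. One plausible way to build such a frame, assuming the statsmodels sample datasets are available:

import pandas as pd
import statsmodels.api as sm

# Yearly sunspot numbers shipped with statsmodels (an assumed data source;
# the article never shows where `sunspots` comes from)
data = sm.datasets.sunspots.load_pandas().data

# Turn the YEAR column into a DatetimeIndex so that date-string slicing works
data.index = pd.to_datetime(data["YEAR"].astype(int).astype(str), format="%Y")
sunspots = data.drop(columns=["YEAR"])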

(1) The head() and tail() functions are similar to the Unix commands of the same name: they select the first n and last n records of a DataFrame, where n is an integer parameter:

Print ("Head 2", Sunspots.head (2)) print ("Tail 2  ", Sunspots.tail (2))


(2) The following uses the last date in the index to query the most recent sunspot data:

last_date = sunspots.index[-1]
print("Last value", sunspots.loc[last_date])


(3) The following shows how to query with date strings in the YYYYMMDD format:

Print ("Values Slice by date", sunspots["20020101": "20131231"])


(4) A list of indices can also be used to query:

Print ("Slice from a list of indices", sunspots.iloc[[2,4,-4, 2])


(5) To select a single scalar value there are two methods, the second of which has a clear speed advantage. Both take two integers as arguments, where the first integer represents the row and the second the column:

Print ("Scalar with Iloc", sunspots.iloc[0,0]) print ("  Scalar with iat", sunspots.iat[1,0])


(6) Querying with Boolean conditions is very close to the WHERE clause of SQL:

Print ("Boolean selection", Sunspots[sunspots>sunspots.mean ()])


Statistical calculations with the pandas DataFrame

The pandas DataFrame data structure provides us with a number of statistical methods:

describe: returns descriptive statistics

count: returns the number of non-NaN data items

mad: calculates the mean absolute deviation, a robust statistic that plays a role similar to the standard deviation

median: returns the median

min: returns the minimum value

max: returns the maximum value

mode: returns the mode, i.e. the most frequent value

std: returns the standard deviation

var: returns the variance

skew: returns the skewness coefficient, which describes the degree of symmetry of the data distribution

kurt: returns the kurtosis, which reflects how peaked or flat the top of the data distribution curve is
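As a quick illustration, here are a few of these methods applied to the sunspots DataFrame from the querying section (mad() is omitted because it was removed in pandas 2.0):

# A sampling of the statistics methods listed above
print("Describe\n", sunspots.describe())
print("Non-NaN observations\n", sunspots.count())
print("Median\n", sunspots.median())
print("Standard deviation\n", sunspots.std())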


Data aggregation with the pandas DataFrame

(1) Specify a seed for the NumPy random number generator, so that repeated runs of the program generate the same data:

import numpy as np
import pandas as pd
from numpy.random import rand, seed

seed(42)
# numpy.random.random_integers is deprecated; randint(1, 10) draws from the same 1..9 range
df = pd.DataFrame({'Weather': ['cold', 'hot', 'cold', 'hot', 'cold', 'hot', 'cold'],
                   'Food': ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'],
                   'Price': 10 * rand(7),
                   'Number': np.random.randint(1, 10, size=(7,))})
print(df)


(2) Group the data by weather, then traverse the data of each group:

weather_group = df.groupby('Weather')

i = 0
for name, group in weather_group:
    i = i + 1
    print("Group", i, name)
    print(group)


(3) The weather_group variable is a special pandas object generated by groupby(). This object provides aggregation functions; the following shows how they are used:

Print ("Weather groupFirst", Weather_group.first ())    print ("  Weather_grouplast", Weather_group.last ())    print ("weather_group mean  ", Weather_group.mean ())


(4) As with database queries, you can also group on multiple columns (the original gives no code here; see the sketch below).
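A minimal sketch of grouping on two columns; the name wf_group is an assumption, chosen to match the "WF" label used in the next step:

# Group on the Weather and Food columns together; each group key is a tuple
wf_group = df.groupby(['Weather', 'Food'])
print("WF Groups", wf_group.groups)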


(5) Through the agg() method, a sequence of NumPy functions can be applied to the data set:

Print ("WF aggregrated\n", Weather_group.agg ([Np.mean,np.median]))


Concatenating and appending DataFrames

Database tables support two kinds of join operations, inner and outer. Pandas DataFrames offer similar operations, and we can also concatenate and append data rows. We will use the DataFrame from the previous section to practice concatenating and appending data rows.

The concat() function concatenates DataFrames. For example, a DataFrame consisting of the first 3 rows can be concatenated with the remaining rows to reconstruct the original DataFrame:

Print ("Concat back together\n", Pd.concat ([df[:3],df[3:]]))


To append data rows, you can use the append() method (deprecated since pandas 1.4 and removed in 2.0 in favor of concat()):

Print ("appending rows\n", df[:3].append (df[5:]))

Joining DataFrames

The merge() function provided by pandas, or the join() instance method of DataFrame, implements join operations similar to those of a database. By default, the join() instance method joins on the index, which sometimes does not meet our requirements.

Pandas supports all of the common join types: inner, left outer, right outer, and full outer.
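The examples below join two DataFrames named dests and tips that the article never defines. Hypothetical stand-ins sharing an 'EmpNr' key column:

import pandas as pd

# Illustrative data; only the shared 'EmpNr' key column matters for the joins
dests = pd.DataFrame({'EmpNr': [5, 3, 9],
                      'Dest': ['The Hague', 'Amsterdam', 'Rotterdam']})
tips = pd.DataFrame({'EmpNr': [5, 9, 7],
                     'Amount': [10.0, 5.0, 2.5]})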

(1) Use the merge() function to join on the employee number:

Print ("Merge () on key\n", Pd.merge (dests,tips,on='empnr') )


(2) When performing a join with the join() method, you need suffixes to label the left and right operands:

Print ("dests Join () tips\n", Dests.join (tips,lsuffix='Dest' , rsuffix='Tips'))

This method joins on the index values, so the result differs from a SQL inner join.

(3) When using merge() to perform an inner join, the more explicit form is as follows:

Print ("Inner join with merge () \ n", Pd.merge (dests,tips,how='Inner  '))

A small modification turns it into a full outer join:

Print ("Outer join\n", Pd.merge (dests,tips,how='Outer'))


Handling missing data

Pandas marks missing values as NaN, which stands for "not a number"; the related symbol NaT marks a missing datetime64 value. Any arithmetic involving NaN yields NaN.
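The checks below run against a DataFrame df that contains missing values; the article reuses the name without showing its contents. A hypothetical two-row stand-in, for this section only:

import numpy as np
import pandas as pd

# Hypothetical stand-in: the second column is missing a value in the first row
df = pd.DataFrame({'Country': ['Afghanistan', 'Albania'],
                   'Ratio': [np.nan, 94.0]})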

Pandas's isnull() function helps us examine the missing data; it is used as follows:

Print ("Null values\n", Pd.isnull (DF))


Similarly, non-missing data can be examined with the DataFrame notnull() method:

Print ("notNull values\n", Df.notnull ())


With the fillna() method, you can replace missing data with a scalar such as 0. Replacing missing values with 0 is sometimes appropriate, but not always:

Print ("zero filled\n", Df.fillna (0))

Pivot tables

Pivot tables aggregate the data in the rows and columns specified from a flat file; the aggregation can be a sum, a mean, a standard deviation, and so on.

The pandas API provides the top-level pivot_table() function and a corresponding DataFrame method. To make the aggregation apply a NumPy function such as sum(), set the aggfunc parameter. The cols parameter (renamed to columns in later pandas versions) tells pandas which columns to aggregate over; df here is presumably the weather and food DataFrame from the aggregation section, given the Food column:

# cols= was renamed to columns= in later pandas versions
print(pd.pivot_table(df, columns=['Food'], aggfunc=np.sum))
