Sources for this article:
Python for Data Analysis, chapter 5
10 Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html#min
1. Pandas Introduction
After several years of development, pandas has become the most commonly used Python package for data processing. The following goals guided pandas from the start of its development, and they are still its most commonly used features:
A: Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and from working with differently-indexed data coming from different sources.
B: Integrated time series functionality.
C: The same data structures handle both time series data and non-time series data.
D: Arithmetic operations and reductions (like summing across an axis) pass on the metadata (axis labels).
E: Flexible handling of missing data.
F: Merge and other relational operations found in popular databases (SQL-based, for example).
The article "Don't use Hadoop - your data isn't that big" points out that Hadoop is a reasonable technology option only at a scale of more than 5 TB of data. So for data volumes below 5 TB, Python with pandas is often enough.
2. Pandas data structures
2.1 Series
A Series is a one-dimensional array-like object consisting of two parts: 1. an array of values of any NumPy data type; 2. the associated data labels, called the index.
So a Series has two main attributes: values and index.
For example, after creating a Series you can read its data and labels through the values and index attributes.
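The original example is not reproduced here, so the following is a minimal sketch of creating a Series and reading its values and index. The sample numbers and labels are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; without an explicit index, pandas uses 0..n-1.
obj = pd.Series([4, 7, -5, 3])
print(obj.values)  # the underlying NumPy array
print(obj.index)   # the default integer index

# The same values with explicit labels as the index.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(obj2["a"])   # look up a value by its label
```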
A Series can also be created by passing a dict, which pandas converts into the equivalent labeled structure.
The dict keys become the index. You can also pass the index parameter to specify the order of the index; each value is matched to its label by key, and labels with no matching key get NaN.
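A short sketch of building a Series from a dict, with and without an explicit index. The state names and numbers are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data: dict keys become the index labels.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Passing index= fixes the order and selects by key:
# "California" has no matching key, so it becomes NaN;
# "Utah" is not requested, so it is left out.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
print(obj4)
```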
An important feature of Series is data alignment: in arithmetic operations, pandas automatically aligns data with different indexes so that the operation is applied to values that share the same label.
Both the Series object itself and its index have a name attribute, for example obj.name = 'population' and obj.index.name = 'state'.
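The alignment behavior and the name attributes can be sketched as follows. The labels and values are hypothetical.

```python
import pandas as pd

# Two Series with partly different labels (hypothetical data).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Values are added label by label; labels present in only one
# operand ("x" and "w") produce NaN in the result.
total = a + b
print(total)

# Both the Series and its index carry a name attribute.
total.name = "population"
total.index.name = "state"
```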
2.2 DataFrame
A DataFrame represents tabular data, such as a spreadsheet or a relational database table. It contains an ordered collection of columns; every value within a column has the same data type, but different columns can have different data types.
A DataFrame has two indexes: a row index and a column index.
Ways to create a DataFrame: from a dict of equal-length lists, arrays or tuples; from a nested dict of dicts; from a dict of Series; and so on. See Table 5.1 in the book.
Fetch a column: use obj3['state'] or obj3.year. The result is a Series with the same index as the DataFrame.
Extract a row: use the ix indexer with the position or the label of the row. (ix was removed in pandas 1.0; loc and iloc replace it.)
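A minimal sketch of creating a DataFrame from a dict of equal-length lists and then fetching a column and a row. The data is hypothetical, and because ix is no longer available in current pandas, this sketch swaps in loc (by label) and iloc (by position) for row access.

```python
import pandas as pd

# Hypothetical sample data: a dict of equal-length lists.
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}
frame = pd.DataFrame(data, index=["one", "two", "three"])

# Fetch a column: both forms return a Series sharing the frame's index.
col = frame["state"]
same_col = frame.year

# Extract a row by label (loc) or by position (iloc).
row = frame.loc["two"]
row_pos = frame.iloc[1]
```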
Common functions:
del: deletes a column, for example del obj['year']
Common attributes: index and columns each have a name attribute; values returns the data as a 2D ndarray
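The column deletion and the attributes above might be used like this. The frame contents and the name "item" are hypothetical.

```python
import pandas as pd

# Hypothetical sample frame.
obj = pd.DataFrame(
    {"year": [2000, 2001], "pop": [1.5, 1.7]},
    index=["Ohio", "Nevada"],
)

# del removes the column from the frame in place.
del obj["year"]

# The row index and the column index each have a name attribute.
obj.index.name = "state"
obj.columns.name = "item"

# values returns the underlying data as a 2D NumPy array.
arr = obj.values
```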
2.3 Index object and reindexing
The role of the pandas Index: holding the axis labels and other metadata (like the axis name or names).
An Index object is immutable: the user cannot modify it in place, so assigning to one of its elements raises an error. This immutability supports feature A in the introduction, because an index can be shared safely between data structures.
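As a sketch of the immutability described above, the assignment below fails with a TypeError. The Series and its labels are hypothetical.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

raised = False
try:
    index[1] = "d"  # item assignment on an Index is not allowed
except TypeError as exc:
    raised = True
    print("Index is immutable:", exc)
```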
The reindex() method can reorder, add or remove labels on the specified axis. It returns a new object and leaves the original data unchanged.
Parameters of reindex():
index: the new index to use in place of the original one. An Index passed here is used as is, without copying. The data itself is copied by pandas by default, which differs from ndarray operations that return views.
method: the fill method, ffill (forward fill) or bfill (backward fill)
fill_value: the value to use for missing entries introduced by reindexing, instead of NaN
copy, etc.
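A minimal sketch of reindex() with the fill_value and method parameters. All values and labels are hypothetical.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# Reorder the labels and add a new one; "e" has no data, so it becomes NaN.
obj2 = obj.reindex(["a", "b", "c", "d", "e"])

# fill_value replaces the NaN introduced for new labels.
obj3 = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

# method="ffill" carries the last known value forward
# (the original index must be sorted for this to work).
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
filled = colors.reindex(range(6), method="ffill")

# The original object is unchanged.
print(obj.index)
```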
3. View data
3.1 Sorting: Returns a sorted object
A: Sort by axis labels (rows or columns)
sort_index()
Parameter description: sorts by row labels by default; axis=1 sorts by column labels
Ascending by default; pass ascending=False for descending order
B: Sort by value
sort_values() (named order() in old pandas versions): missing values are placed at the end by default
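Both kinds of sorting can be sketched as follows. The data is hypothetical, and sort_values() is used in place of the older order(), which no longer exists in current pandas.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=["three", "one"],
    columns=["d", "a", "b", "c"],
)

# A: sort by labels. Rows by default, columns with axis=1.
by_row = frame.sort_index()
by_col = frame.sort_index(axis=1, ascending=False)

# B: sort by value. NaN entries end up last by default.
s = pd.Series([4, np.nan, 7, np.nan, -3, 2])
by_value = s.sort_values()
print(by_value)
```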
3.2 Ranking
rank(): assigns each value its rank, that is, its position in sorted order starting from 1, and returns a new object. When values are tied, each of them gets the mean of the ranks they share by default.
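A small sketch of the default tie handling, next to method="first", which breaks ties by order of appearance. The values are hypothetical.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the mean of their ranks.
# The two 4s occupy ranks 4 and 5, so both get 4.5;
# the two 7s occupy ranks 6 and 7, so both get 6.5.
ranks = obj.rank()

# method="first": ties are ranked in the order they appear.
first = obj.rank(method="first")
print(ranks.tolist())
print(first.tolist())
```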
3.3 Unique
is_unique: tells whether the values are unique; returns True or False
unique(): returns the distinct values as an array
3.4 value_counts(): counts the number of occurrences of each value in a Series
3.5 describe(): gives a quick statistical summary of the data
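Sections 3.3 to 3.5 can be sketched together on hypothetical data:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

unique_flag = obj.is_unique   # False, because values repeat
uniques = obj.unique()        # distinct values in order of first appearance
counts = obj.value_counts()   # occurrences of each value

# describe() summarizes numeric data: count, mean, std, min, quartiles, max.
summary = pd.Series([1.0, 2.0, 3.0, 4.0]).describe()
print(summary)
```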
4. Select data
4.1 Drop
Drop a row:
pandas returns a new object instead of modifying the original, which differs from ndarray views. For example, if you inspect the original object after calling drop on a row, you will find that it has not changed.
Drop a column: obj4.drop('Nevada', axis=1)
Many pandas functions operate on rows by default, which is why they take an axis parameter:
axis=1 refers to the columns
axis=0 refers to the rows (the default)
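A minimal sketch of dropping a row and a column. The frame is hypothetical; only the name obj4 and the column 'Nevada' come from the text above.

```python
import numpy as np
import pandas as pd

obj4 = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["a", "b", "c"],
    columns=["Ohio", "Nevada"],
)

no_row = obj4.drop("b")               # axis=0 is the default: drop a row
no_col = obj4.drop("Nevada", axis=1)  # axis=1: drop a column

# drop returned new objects; the original still has every row and column.
print(obj4.shape)
```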
4.2 Selection, slicing and indexing
A: Select a single column, which returns a Series; df['A'] and df.A mean the same thing
B: Slicing with [] selects rows
C: Select by label: the endpoint is inclusive, i.e. obj['b':'c'] includes the 'c' row
D: Select a subset of rows and columns: ix (removed in pandas 1.0; use loc or iloc instead)
E: Index by label: loc
F: Index by position: iloc
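The selection methods above can be sketched on one hypothetical frame. The ix indexer is left out because it is no longer available in current pandas; loc and iloc cover its uses here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=["a", "b", "c", "d"],
    columns=["A", "B", "C"],
)

col = df["A"]              # a single column as a Series (same as df.A)
first_two = df[0:2]        # slicing with [] selects rows by position
b_to_c = df["b":"c"]       # label slice: the endpoint "c" is included

subset = df.loc[["a", "c"], ["A", "C"]]  # rows and columns by label
by_pos = df.iloc[1:3, 0:2]               # rows and columns by position
```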
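4.3 Filtering with the isin() method
isin() returns a boolean mask that marks which values belong to a given set; the mask can then be used to filter the data.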
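A short sketch of filtering rows with isin(). The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Ohio", "Texas", "Utah", "Ohio"],
    "pop": [1.5, 2.6, 0.9, 1.7],
})

# True where the state is one of the listed values.
mask = df["state"].isin(["Ohio", "Utah"])

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```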
5. Missing value handling
5.1 Missing value
pandas uses NaN (a floating-point value) to represent missing data
5.2 Remove rows or columns that contain missing values
dropna
Parameter description: how='all' drops only the rows in which every value is NA
axis=1 drops columns instead of rows
thresh=3 keeps only the rows with at least 3 non-NA observations
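The dropna parameters can be sketched on a small hypothetical frame. It has only three columns, so thresh=2 is used here instead of thresh=3, and the column name "extra" is made up for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.0],
])

any_na = data.dropna()           # drop rows containing any NA
all_na = data.dropna(how="all")  # drop only rows that are entirely NA
thresh = data.dropna(thresh=2)   # keep rows with at least 2 non-NA values

# axis=1 works on columns: add an all-NA column, then drop it again.
with_empty = data.assign(extra=np.nan)
cols = with_empty.dropna(axis=1, how="all")
```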
5.3 Filling in missing values
fillna
5.4 isnull: returns an object of the same shape containing boolean values that indicate whether each value is missing
notnull: the negation of isnull
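A minimal sketch of fillna, isnull and notnull on hypothetical data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan])

filled = s.fillna(0)   # replace every missing value with 0
missing = s.isnull()   # True where a value is missing
present = s.notnull()  # the negation of isnull
```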
6. Calculation functions
A: Adding two DataFrame objects with different indexes using "+" gives a result whose labels are the union of both, similar to a union in a database; positions that are not present in both objects become NaN
B: For explicit addition or subtraction use add() or sub(); values missing from one operand can be replaced through fill_value
C: sum, count, min, max and other reduction and summary methods
D: Correlation and covariance
.corr()
.cov()
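Points A to D can be sketched as follows. All frames and values are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["x", "y"])
df2 = pd.DataFrame({"b": [10.0, 20.0], "c": [30.0, 40.0]}, index=["y", "z"])

# A: "+" aligns on both axes; only the cell (y, b) exists in both frames.
plus = df1 + df2

# B: add() with fill_value treats a value missing from one frame as 0.
# Cells missing from both frames stay NaN.
added = df1.add(df2, fill_value=0)

# C: reductions work column by column by default.
col_sums = df1.sum()

# D: correlation and covariance between two Series.
s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([2.0, 4.0, 6.0, 8.0])
corr = s1.corr(s2)
cov = s1.cov(s2)
```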
7. Merging and reshaping
8. Grouping
A group by operation usually involves one or more of the following steps:
(splitting) divide the data into groups according to some criteria;
(applying) apply a function to each group independently;
(combining) combine the results into a data structure.
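The three steps above can be sketched with groupby. The column names key and value are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# Splitting: group the rows by the values in "key".
grouped = df.groupby("key")["value"]

# Applying and combining: each aggregation runs per group and the
# results come back as one Series indexed by the group keys.
totals = grouped.sum()
means = grouped.mean()
print(totals)
```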
Note: This article is not comprehensive and only summarizes the parts I need at the moment.
Python data processing: Pandas basics