I. Introduction to pandas
1. pandas (the Python Data Analysis Library) is a NumPy-based tool created for data analysis tasks. It builds on a number of libraries and provides standard data models, together with the tools needed to manipulate large datasets efficiently. Its many functions and methods let us process data quickly and easily. You will soon see that pandas is one of the main reasons Python is a powerful and efficient environment for data analysis.
2. pandas is a Python data analysis package. It was originally developed at AQR Capital Management starting in April 2008 and was open-sourced at the end of 2009. It is currently developed and maintained by the PyData development team, which focuses on Python package development, as part of the PyData project. pandas began as a tool for financial data analysis, so it has good support for time series analysis. The name pandas comes from "panel data" and "Python data analysis". Panel data is the economics term for multidimensional data sets, and pandas also provided a Panel data type.
3. Data structures:
Series: a one-dimensional array, similar to a one-dimensional array in NumPy. Both resemble Python's built-in list. The difference is that a list can hold elements of different data types, whereas an array or a Series normally stores elements of a single data type, which uses memory more efficiently and speeds up operations.
Time series: a Series indexed by time.
DataFrame: a two-dimensional tabular data structure. Many of its functions resemble those of data.frame in R. A DataFrame can be understood as a container of Series. The material below is mainly about DataFrame.
Panel: a three-dimensional array that can be understood as a container of DataFrames. (Panel was deprecated and then removed in pandas 0.25.)
Because pandas is still a Python library, all of Python's own data types still apply, and you can also define your own types with classes. On top of these, pandas defines two basic data structures of its own, Series and DataFrame, which make data easier to manipulate.
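The relationship between the structures described above can be checked with a minimal sketch. The variable names (numbers, table, column) and the sample values are illustrative assumptions, not part of the original article.

```python
import pandas as pd

# A Series built from integers is stored with one numeric dtype for all elements.
numbers = pd.Series([1, 2, 3])
print(numbers.dtype)          # an integer dtype (typically int64)

# A small two-column table; selecting one column returns a Series.
table = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})
column = table["x"]
print(type(column).__name__)  # Series
```

Each column of the table is a Series that shares the table's index, which is what "a container of Series" means in practice.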
II. Installing pandas
Because pandas is a third-party Python library, you need to install it before you can use it. Running pip install pandas installs pandas and its dependencies automatically.
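One way to confirm that the installation worked is to import the package and print its version. This is only a quick check; the version string you see depends on what pip installed.

```python
import pandas as pd

# If the import succeeds, pandas is installed; __version__ reports which release.
installed_version = pd.__version__
print(installed_version)
```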
III. Using pandas
Note: the examples below are run in IPython.
1. Import the pandas module under an alias, and import the Series class. Everything below assumes these imports.
In [1]: from pandas import Series
In [2]: import pandas as pd
2. Series
A Series is like a list: a sequence of data in which every item has a corresponding index value.
You can think of a Series as an "upgraded" list:
In [3]: s = Series([1, 4, 'ww', 'TT'])
In [4]: s
Out[4]:
0     1
1     4
2    ww
3    TT
dtype: object
Another way a Series resembles a list is that the types of its elements are up to you (in practice, you choose them as needed). When the types are mixed, as here, pandas stores the elements with dtype object.
Here we have in fact created a Series object, which naturally has its own attributes and methods. For example, the following two attributes show the index and the data values of a Series object, respectively:
In [5]: s.index
Out[5]: RangeIndex(start=0, stop=4, step=1)
In [8]: s.values
Out[8]: array([1, 4, 'ww', 'TT'], dtype=object)
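The same steps can be written as a standalone script rather than an IPython session. This is a small sketch; the names index_labels and values are illustrative and do not appear in the original.

```python
import pandas as pd

# Build the Series from a plain list; pandas supplies a default 0..n-1 index.
s = pd.Series([1, 4, "ww", "TT"])

# .index holds the labels, .values holds the underlying data array.
index_labels = list(s.index)
values = list(s.values)

print(index_labels)
print(values)
print(s.dtype)  # object, because integers and strings are mixed
```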
The index of a list can only be integers starting at 0, and by default a Series is indexed the same way. Unlike a list, however, a Series can be given a custom index:
In [9]: s2 = Series(['wangxing', 'man', 24], index=['name', 'sex', 'age'])
In [10]: s2
Out[10]:
name    wangxing
sex          man
age           24
dtype: object
Each element has an index label, and elements can be manipulated through that label. Remember the operations available on a list? A Series has similar ones. Start with the simplest: reading a value by its index label and changing it:
In [12]: s2['name']
Out[12]: 'wangxing'
In [ ]: s2['name'] = 'wudadiao'
In [46]: s2
Out[46]:
name    wudadiao
sex          man
age           24
dtype: object
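Label-based reading and writing, as shown above, can be sketched in a self-contained form. The variable original_name is an illustrative addition used only to keep the value before it is overwritten.

```python
import pandas as pd

# A Series with string labels instead of the default integer positions.
s2 = pd.Series(["wangxing", "man", 24], index=["name", "sex", "age"])

# Read a value by its label, then overwrite it in place.
original_name = s2["name"]
s2["name"] = "wudadiao"

print(original_name)
print(s2["name"])
```

Only the value stored under the label changes; the labels and the other values stay as they were.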
Doesn't this look a bit like a dict as well? It does. The following makes that clear.
You may have noticed that in the earlier definitions, the first argument to Series() is a list holding the data values, and if you want to define an index, it comes after as a second list. Besides this approach, a Series object can also be defined as follows:
In [ ]: sd = {'Python': 9000, 'C++': 9001, 'C#': 9000}
In [ ]: s3 = Series(sd)
In [15]: s3
Out[15]:
C#        9000
C++       9001
Python    9000
dtype: int64
Now do you see why a Series resembles a dict? It can be defined directly from one. Note that the output above lists the keys in sorted order, as older pandas versions did; recent versions keep the dictionary's insertion order.
Even in this case the index can still be customized. This is where pandas shows its strength: if you supply a custom index, each of its labels is matched against the original index (here, the dictionary keys). Where a label matches, the corresponding value is used; where it does not, the value is missing. This can be called "automatic alignment".
In [ ]: s4 = Series(sd, index=['Java', 'C++', 'C#'])
In [17]: s4
Out[17]:
Java         NaN
C++       9001.0
C#        9000.0
dtype: float64
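The dict-based construction and the alignment behavior described above can be reproduced with the following sketch, which reuses the article's values in a self-contained script.

```python
import pandas as pd

sd = {"Python": 9000, "C++": 9001, "C#": 9000}

# Without an explicit index, the dictionary keys become the labels.
s3 = pd.Series(sd)

# With an explicit index, each label is looked up among the keys:
# matching labels take the dictionary's value, unknown labels get NaN,
# and keys that are not requested ("Python") are dropped.
s4 = pd.Series(sd, index=["Java", "C++", "C#"])

print(s3)
print(s4)
```

Because NaN is a floating-point value, the aligned Series is stored as floats even though the dictionary contained only integers.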
In pandas, a missing value is shown as NaN.
pandas has a dedicated function for checking whether values are missing:
In [19]: pd.isnull(s4)
Out[19]:
Java     True
C++     False
C#      False
dtype: bool
The Series object also has a method that does the same thing:
In [20]: s4.isnull()
Out[20]:
Java     True
C++     False
C#      False
dtype: bool
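Since both forms return a boolean Series, the result can be used for more than display. The sketch below counts the missing entries; the names missing_mask and n_missing are illustrative additions.

```python
import pandas as pd

sd = {"Python": 9000, "C++": 9001, "C#": 9000}
s4 = pd.Series(sd, index=["Java", "C++", "C#"])

# Element-wise test for missing values: True where the value is NaN.
missing_mask = s4.isnull()

# True counts as 1 when summed, so this is the number of missing entries.
n_missing = int(missing_mask.sum())

print(missing_mask)
print(n_missing)
```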
The index labels can also be redefined after the fact:
In [ ]: s4.index = ['Chinese', 'Math', 'English']
In [22]: s4
Out[22]:
Chinese       NaN
Math       9001.0
English    9000.0
dtype: float64
Series data also supports operations like the following (operations are covered in more detail later):
In [23]: s4 * 2
Out[23]:
Chinese        NaN
Math       18002.0
English    18000.0
dtype: float64
In [24]: s4[s4 > 9000]
Out[24]:
Math    9001.0
dtype: float64
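Both operations work element by element. The sketch below rebuilds a Series with the same values so that it runs on its own; the labels and the names doubled and above are illustrative assumptions.

```python
import pandas as pd

s4 = pd.Series([float("nan"), 9001.0, 9000.0],
               index=["Chinese", "Math", "English"])

# Arithmetic is applied to every element; NaN stays NaN.
doubled = s4 * 2

# The comparison yields a boolean Series, which then selects matching rows.
# NaN > 9000 is False, so the missing entry is filtered out as well.
above = s4[s4 > 9000]

print(doubled)
print(above)
```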
That is enough about Series for now. Next we look at the other pandas data structure, DataFrame.
3. DataFrame
A DataFrame is a two-dimensional data structure, very close in form to a spreadsheet or a table in a database such as MySQL. Its vertical lines are called columns, and its horizontal rows, as with Series, are labeled by the index, so the position of any cell can be determined by its column and its index.
First, import the classes.
In [ ]: from pandas import Series, DataFrame
In [ ]: data = {'name': ['Google', 'Baidu', 'Yahoo'], 'marks': [100, 200, 300], 'price': [1, 2, 3]}
In [ ]: f1 = DataFrame(data)
In [29]: f1
Out[29]:
   marks    name  price
0    100  Google      1
1    200   Baidu      2
2    300   Yahoo      3
This is a common way to define a DataFrame object: from a dict. The dictionary keys ("name", "marks", "price") become the names of the DataFrame's columns, and the value of each key is a list holding the data for that column. No index is specified in the definition above, so by convention (the same one used by Series) it is a sequence of integers starting from 0. The result is clearly a two-dimensional structure, similar to what you would see in Excel or MySQL.
In the output above, the order of the columns was not specified by us. Unlike dictionary keys, however, the order of a DataFrame's columns can be specified, as follows:
In [ ]: f2 = DataFrame(data, columns=['name', 'price', 'marks'])
In [32]: f2
Out[32]:
     name  price  marks
0  Google      1    100
1   Baidu      2    200
2   Yahoo      3    300
As with Series, the index of a DataFrame can also be customized:
In [ ]: f3 = DataFrame(data, columns=['name', 'marks', 'price'], index=['a', 'b', 'c'])
In [36]: f3
Out[36]:
     name  marks  price
a  Google    100      1
b   Baidu    200      2
c   Yahoo    300      3
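The three constructions above (default layout, explicit column order, explicit index) can be compared side by side in one script. This sketch reuses the article's data and adds nothing beyond the calls already shown.

```python
import pandas as pd

data = {
    "name": ["Google", "Baidu", "Yahoo"],
    "marks": [100, 200, 300],
    "price": [1, 2, 3],
}

# Default: columns come from the dictionary keys, index is 0..n-1.
f1 = pd.DataFrame(data)

# columns= fixes the column order; index= replaces the default row labels.
f2 = pd.DataFrame(data, columns=["name", "price", "marks"])
f3 = pd.DataFrame(data, columns=["name", "marks", "price"], index=["a", "b", "c"])

print(f1)
print(f2)
print(f3)
```

The list passed as index must have one label per row, since the rows themselves come from the lists in data.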
Besides the approach above, a DataFrame can also be defined from a nested dictionary (a dictionary of dictionaries).
In [ ]: newdata = {'lang': {'first': 'Python', 'second': 'Java'}, 'price': {'first': 5000, 'second': 2000}}
In [ ]: f4 = DataFrame(newdata)
In [42]: f4
Out[42]:
          lang  price
first   Python   5000
second    Java   2000
The dictionary specifies the column names (the first-level keys), the index label of each row (the second-level keys), and the corresponding data (the second-level values). In other words, the dictionary specifies the content of every cell, and any cell it does not specify is left empty.
>>> newdata = {"lang": {"firstline": "python", "secondline": "java"}, "price": {"firstline": 8000}}
>>> f4 = DataFrame(newdata)
>>> f4
              lang  price
firstline   python   8000
secondline    java    NaN
>>> DataFrame(newdata, index=["firstline", "secondline", "thirdline"])
              lang  price
firstline   python   8000
secondline    java    NaN
thirdline      NaN    NaN
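The way unspecified cells are filled can be verified programmatically. In this sketch, f5 is an illustrative name for the second, unnamed DataFrame from the session above.

```python
import pandas as pd

newdata = {
    "lang": {"firstline": "python", "secondline": "java"},
    "price": {"firstline": 8000},  # no price given for "secondline"
}

# Outer keys become columns, inner keys become row labels.
f4 = pd.DataFrame(newdata)

# Asking for a row label that appears nowhere yields a row of missing values.
f5 = pd.DataFrame(newdata, index=["firstline", "secondline", "thirdline"])

print(f4)
print(f5)
```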
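The columns attribute of a DataFrame object shows all of its column names. In addition, the dictionary-like syntax below returns the full contents of a column (including the index, of course):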
In [44]: f3['name']
Out[44]:
a    Google
b     Baidu
c     Yahoo
Name: name, dtype: object
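Both the columns attribute and dictionary-style column access are shown in the sketch below. The names col_names and names are illustrative additions.

```python
import pandas as pd

data = {
    "name": ["Google", "Baidu", "Yahoo"],
    "marks": [100, 200, 300],
    "price": [1, 2, 3],
}
f3 = pd.DataFrame(data, columns=["name", "marks", "price"], index=["a", "b", "c"])

# .columns lists the column labels in order.
col_names = list(f3.columns)

# Selecting one column returns a Series that keeps the DataFrame's index
# and carries the column label as its name.
names = f3["name"]

print(col_names)
print(names)
```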
The following assigns the same value to every row of a column:
In [ ]: newdata1 = {'username': {'first': 'wangxing', 'second': 'dadiao'}, 'age': {'first': 24, 'second': 25}}
In [ ]: f6 = DataFrame(newdata1, columns=['username', 'age', 'sex'])
In [68]: f6
Out[68]:
        username  age  sex
first   wangxing   24  NaN
second    dadiao   25  NaN
In [ ]: f6['sex'] = 'man'
In [70]: f6
Out[70]:
        username  age  sex
first   wangxing   24  man
second    dadiao   25  man
Values can also be assigned individually. Besides assigning one value to the whole column, you can add values "point to point". Recalling what was said about Series: since each column of a DataFrame is a Series, you can define a Series object first and then put it into the DataFrame, as follows:
In [ ]: ssex = Series(['male', 'female'], index=['first', 'second'])
In [ ]: f6['sex'] = ssex
In [73]: f6
Out[73]:
        username  age     sex
first   wangxing   24    male
second    dadiao   25  female
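The two kinds of column assignment can be combined in one sketch. As an illustrative assumption, the Series here lists its labels in the opposite order to the DataFrame, to show that the values are matched by label rather than by position; all_missing is likewise an added name.

```python
import pandas as pd

newdata1 = {
    "username": {"first": "wangxing", "second": "dadiao"},
    "age": {"first": 24, "second": 25},
}
# "sex" is requested as a column but absent from the data, so it starts as NaN.
f6 = pd.DataFrame(newdata1, columns=["username", "age", "sex"])
all_missing = bool(f6["sex"].isnull().all())

# A scalar is broadcast to every row of the column.
f6["sex"] = "man"

# A Series is aligned on the index labels before being stored.
ssex = pd.Series(["female", "male"], index=["second", "first"])
f6["sex"] = ssex

print(f6)
```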
Can a single value be modified more precisely? Of course. The operation mirrors that of a dictionary exactly:
In [ ]: f6['age']['second'] = 30
In [75]: f6
Out[75]:
        username  age     sex
first   wangxing   24    male
second    dadiao   30  female
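The chained form above first selects a column and then assigns into the result. Depending on the pandas version, this may emit a warning or leave the DataFrame unchanged. A more robust alternative, which is not part of the original article, is the .loc indexer, which takes the row label and column label in one step. The sketch below rebuilds a small frame so that it runs on its own.

```python
import pandas as pd

f6 = pd.DataFrame(
    {"username": ["wangxing", "dadiao"], "age": [24, 25]},
    index=["first", "second"],
)

# .loc[row_label, column_label] addresses exactly one cell of f6 itself,
# so the assignment does not depend on an intermediate object.
f6.loc["second", "age"] = 30

print(f6)
```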
Reference: http://wiki.jikexueyuan.com/project/start-learning-python/312.html
Python Pandas simple introduction and use (i)