Pandas
Pandas is the most powerful data analysis and exploration tool available for Python. It provides advanced data structures and sophisticated tools that make data processing in Python fast and simple. Pandas is built on top of Numpy, which makes it easy to use in Numpy-centric applications. Pandas is very powerful: it supports SQL-like operations for adding, deleting, querying, and modifying data, offers rich data-processing functions, supports time-series analysis, and handles missing data flexibly.
Installing Pandas is relatively easy: once Numpy is installed, you can install it directly, either with pip install pandas or by downloading the source code and running python setup.py install. Because we frequently need to read and write Excel files, and Pandas cannot do so by default, you also need to install the xlrd (read) and xlwt (write) libraries to add Excel support, as follows:
pip install xlrd  # Add Excel reading support to Python
pip install xlwt  # Add Excel writing support to Python
The basic data structures of Pandas are Series and DataFrame. As the name implies, a Series is a sequence, similar to a one-dimensional array; a DataFrame is equivalent to a two-dimensional table, similar to a two-dimensional array, and each of its columns is a Series. To locate the elements in a Series, Pandas provides Index objects: each Series has a corresponding Index that labels its elements. The contents of an Index are not necessarily numbers; they can also be letters, Chinese characters, and so on, much like a primary key in SQL.
Similarly, a DataFrame is equivalent to a combination of multiple Series sharing the same Index, where each Series has a unique header used to identify it. For example:
# -*- coding: utf-8 -*-
import pandas as pd  # Usually pd is used as an alias for pandas
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])  # Create a series s
d = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])  # Create a table (DataFrame)
d2 = pd.DataFrame(s)  # An existing series can also be used to create a table
print(d.head())  # Preview the first 5 rows of data
print(d.describe())  # Basic descriptive statistics of the data
pd.read_excel('data.xls')  # Read an Excel file and create a DataFrame
pd.read_csv('data.csv', encoding='utf-8')  # Read text-format data; the encoding is usually specified explicitly
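Since xlwt adds Excel-writing support, here is a minimal sketch of writing a DataFrame back to an Excel file; the output file name data_out.xls is only an illustrative assumption:
d.to_excel('data_out.xls')  # Write the table d created above to an Excel file (writing .xls files relies on xlwt)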
StatsModels
Pandas focuses on reading, processing, and exploring data, while StatsModels is more concerned with statistical modeling and analysis, giving Python something of the flavor of the R language. StatsModels supports data interaction with Pandas, so the two together form a powerful data mining combination in Python.
Installing StatsModels is fairly simple: it can be installed via pip or from source, and for Windows users the official website even provides compiled exe installers for download. If you install it manually, you need to resolve the dependencies yourself: StatsModels depends on Pandas (and, of course, on everything Pandas depends on), and also on patsy (a library for describing statistical models).
The following is an example of using StatsModels to perform an ADF stationarity test.
# -*- coding: utf-8 -*-
from statsmodels.tsa.stattools import adfuller as ADF  # Import the ADF test function
import numpy as np
ADF(np.random.rand(100))  # The returned result includes the ADF statistic and the p-value, among other values
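The value returned by adfuller() is a tuple whose first two elements are the test statistic and the p-value; a minimal sketch of unpacking them (the variable names here are only illustrative) follows on from the code above:
result = ADF(np.random.rand(100))  # Run the ADF test on 100 random numbers
adf_stat, p_value = result[0], result[1]  # ADF test statistic and p-value
print(adf_stat, p_value)  # A large p-value means the unit-root (non-stationarity) hypothesis cannot be rejected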
Scikit-Learn
Scikit-Learn is a powerful machine learning toolkit for Python. It provides a complete set of machine learning tools, including data preprocessing, classification, regression, clustering, prediction, and model analysis. Scikit-Learn depends on Numpy, Scipy, and Matplotlib, so as long as those libraries are installed in advance, installing Scikit-Learn is basically problem-free; the installation method is the same as before, either pip install scikit-learn or downloading the source code and installing it yourself.
Creating a machine learning model is simple:
# -*- coding: utf-8 -*-
from sklearn.linear_model import LinearRegression  # Import the linear regression model
model = LinearRegression()  # Build a linear regression model
print(model)
1) The interface provided by all models (a usage sketch follows this list):
model.fit(): train the model; fit(X, y) for supervised models, fit(X) for unsupervised models.
2) The interfaces provided by supervised models:
model.predict(X_new): predict new samples.
model.predict_proba(X_new): predict probabilities; only available for certain models (such as LR).
model.score(): the higher the score, the better the fit.
3) The interfaces provided by unsupervised models:
model.transform(): learn a new "base space" from the data.
model.fit_transform(): learn a new base from the data and transform the data according to this set of "bases".
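As a minimal sketch of these interfaces (the numbers below are made up purely for illustration, not real data), the linear regression model created above can be trained and used like this:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3], [4]])  # Illustrative feature matrix
y = np.array([2.0, 4.1, 5.9, 8.2])  # Illustrative target values
model = LinearRegression()
model.fit(X, y)  # Train the supervised model with fit(X, y)
print(model.predict([[5]]))  # Predict a new sample
print(model.score(X, y))  # R^2 score: the higher, the better the fit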
Scikit-Learn itself provides some example datasets; the most common are the Anderson iris dataset and the handwritten digits dataset. The following uses the iris dataset to write a simple machine learning example. For more on this dataset, see "R Language Data Mining Practice - Introduction to Data Mining".
# -*- coding: utf-8 -*-
from sklearn import datasets  # Import the datasets module
iris = datasets.load_iris()  # Load the iris dataset
print(iris.data.shape)  # View the size of the dataset
from sklearn import svm  # Import the SVM model
clf = svm.LinearSVC()  # Create a linear SVM classifier
clf.fit(iris.data, iris.target)  # Train the model with the data
clf.predict([[5.0, 3.6, 1.3, 0.25]])  # After training, make a prediction on new data
clf.coef_  # View the parameters of the trained model
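Following on from the example above, the score() interface described earlier gives a rough check of the fit; scoring on the training data itself is only for illustration, not proper model validation:
print(clf.score(iris.data, iris.target))  # Mean accuracy of the classifier on the training data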