Python Data Analysis Tools: Pandas, StatsModels, Scikit-Learn

Source: Internet
Author: User

Pandas
Pandas is one of the most powerful data analysis and exploration libraries for Python. It provides high-level data structures and tools that make data processing in Python fast and simple. Pandas is built on top of NumPy, so it fits naturally into NumPy-centric applications. It supports SQL-like operations for adding, deleting, querying, and updating data; offers a rich set of data processing functions; supports time series analysis; and handles missing data flexibly.
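As a rough illustration of the SQL-like operations and missing-data handling described above, here is a minimal sketch. The table, its column names, and its values are made up for the example and do not come from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample table; the column names are made up for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "score": [90.0, np.nan, 75.0],
})

# Add: append a new row (similar to SQL INSERT).
new_row = pd.DataFrame({"name": ["Dee"], "score": [60.0]})
df = pd.concat([df, new_row], ignore_index=True)

# Delete: keep only rows whose name is not "Cid" (similar to SQL DELETE).
df = df[df["name"] != "Cid"].reset_index(drop=True)

# Update: change one value selected by a condition (similar to SQL UPDATE).
df.loc[df["name"] == "Dee", "score"] = 65.0

# Missing data: fill the missing score with the mean of the known scores.
df["score"] = df["score"].fillna(df["score"].mean())

# Query: select rows that satisfy a condition (similar to SQL SELECT ... WHERE).
high = df[df["score"] > 70]
print(high)
```

Filling with the mean is only one possible choice; dropna() or a fixed fill value may suit other data better.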

Installing Pandas is relatively easy. Once NumPy is installed, you can install Pandas with pip install pandas, or download the source code and run python setup.py install. We frequently need to read and write Excel files, but a default Pandas installation cannot do this on its own; you need to install the xlrd (read) and xlwt (write) libraries to add Excel support. These two libraries handle the older .xls format; for .xlsx files, recent Pandas versions use openpyxl instead. The commands are as follows:

pip install xlrd  # Adds Excel reading support to Python

pip install xlwt  # Adds Excel writing support to Python

The basic data structures of Pandas are Series and DataFrame. As the name implies, a Series is a sequence, similar to a one-dimensional array; a DataFrame is equivalent to a two-dimensional table, similar to a two-dimensional array, and each of its columns is a Series. To locate the elements in a Series, Pandas provides Index objects. Every Series has a corresponding Index that labels its elements. The contents of an Index do not have to be numbers; they can also be letters, Chinese characters, and so on, which makes the Index similar to a primary key in SQL.

Similarly, a DataFrame is equivalent to a combination of several Series that share the same Index; each Series has a unique header that identifies it. For example:

# -*- coding: utf-8 -*-

import pandas as pd  # pd is the conventional alias for pandas

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])  # Create a Series s

d = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])  # Create a table

d2 = pd.DataFrame(s)  # A table can also be created from an existing Series

print(d.head())  # Preview the first 5 rows of data

print(d.describe())  # Basic descriptive statistics

pd.read_excel('data.xls')  # Read an Excel file and create a DataFrame

pd.read_csv('data.csv', encoding='utf-8')  # Read text-format data; encoding specifies the file encoding
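To make the relationship between Series, Index, and DataFrame more concrete, the following sketch looks up elements by label and combines two Series that share an Index. The second Series and the column headers "first" and "second" are hypothetical names chosen only for this illustration.

```python
import pandas as pd

# A Series whose Index holds labels rather than positions.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(s["b"])        # look up by label
print(s.iloc[0])     # look up by position

# Two Series that share the same Index combine into a DataFrame;
# the dictionary keys become the column headers.
t = pd.Series([10, 20, 30], index=["a", "b", "c"])
d = pd.DataFrame({"first": s, "second": t})

col = d["second"]            # each column is itself a Series
print(type(col).__name__)
print(d.loc["c", "first"])   # row label, then column header
```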

StatsModels
Pandas focuses on reading, processing, and exploring data, while StatsModels concentrates on statistical modeling and analysis, which gives Python something of the flavor of the R language. StatsModels can exchange data with Pandas, so the two together form a powerful data mining combination for Python.

Installing StatsModels is fairly simple. It can be installed either with pip or from source. For Windows users, compiled exe installers are even available for download on the official website. If you install it manually, you need to resolve the dependencies yourself. StatsModels depends on Pandas (and therefore on everything Pandas depends on), and also on patsy (a library for describing statistical models).

The following is an example of using StatsModels to perform an ADF stationarity test.

# -*- coding: utf-8 -*-

from statsmodels.tsa.stattools import adfuller as ADF  # Import the ADF test

import numpy as np

ADF(np.random.rand(100))  # The returned result includes the ADF statistic and the p-value

Scikit-Learn
Scikit-Learn is a powerful machine learning toolkit for Python. It provides a complete machine learning toolbox, including data preprocessing, classification, regression, clustering, prediction, and model analysis. Scikit-Learn depends on NumPy and SciPy (and uses Matplotlib for plotting), so install these libraries first; after that, installing Scikit-Learn is usually trouble-free. The installation method is the same as before: either run pip install scikit-learn, or download the source code and install it yourself.

Creating a machine learning model is simple:

# -*- coding: utf-8 -*-

from sklearn.linear_model import LinearRegression  # Import the linear regression model

model = LinearRegression()  # Build a linear regression model

print(model)
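A model built this way has not learned anything until it is trained. As a minimal sketch, the following fits the same model class to made-up data that follows y = 2x + 1 exactly; the data is an assumption for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])   # shape (n_samples, n_features)
y = 2.0 * X[:, 0] + 1.0

model = LinearRegression()
model.fit(X, y)

print(model.coef_)        # estimated slope(s)
print(model.intercept_)   # estimated intercept
print(model.predict(np.array([[4.0]])))
```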

1) The interface provided by all models:

model.fit(): trains the model; fit(X, y) for supervised models, fit(X) for unsupervised models.

2) The interfaces provided by supervised models:

model.predict(X_new): predicts new samples

model.predict_proba(X_new): predicts probabilities; only available for certain models (such as LR, logistic regression)

model.score(): the higher the score, the better the fit

3) The interfaces provided by unsupervised models:

model.transform(): maps data into the new "basis space" learned from the data

model.fit_transform(): learns a new basis from the data and transforms the data according to that basis.
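The interfaces listed above can be sketched as follows. The choice of LogisticRegression as the supervised model and PCA as the unsupervised one is an assumption made for illustration; the article does not prescribe them, and max_iter=1000 is an arbitrary setting to help the solver converge.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: fit(X, y), then predict, predict_proba and score.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
labels = clf.predict(X[:3])
proba = clf.predict_proba(X[:3])   # one row per sample, one column per class
accuracy = clf.score(X, y)         # mean accuracy on the given data

# Unsupervised: fit(X) learns the new basis, transform projects onto it.
pca = PCA(n_components=2)
pca.fit(X)
X_2d = pca.transform(X)

# fit_transform does both steps in one call.
X_2d_again = PCA(n_components=2).fit_transform(X)

print(labels, proba.shape, round(accuracy, 3), X_2d.shape)
```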

Scikit-Learn itself ships with some example datasets; the most common are Anderson's iris dataset and the handwritten digits dataset. The following simple machine learning example uses the iris dataset. For more about this dataset, see "R Language Data Mining Practice-Introduction to Data Mining".

# -*- coding: utf-8 -*-

from sklearn import datasets  # Import the datasets module

iris = datasets.load_iris()  # Load the dataset

print(iris.data.shape)  # View the dataset size

from sklearn import svm  # Import the SVM module

clf = svm.LinearSVC()  # Create a linear SVM classifier

clf.fit(iris.data, iris.target)  # Train the model with the data

clf.predict([[5.0, 3.6, 1.3, 0.25]])  # After training, predict on new data

clf.coef_  # View the parameters of the trained model
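The example above trains and predicts on the same data. One way to get a rough idea of how the classifier behaves on unseen samples is to hold some data back and call score on it. This is a sketch under assumptions: the 30% split, the random_state, and the raised max_iter are arbitrary choices, not part of the original example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 30% of the samples; test_size and random_state are arbitrary choices.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

clf = svm.LinearSVC(max_iter=10000)   # higher max_iter to help convergence
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))      # mean accuracy on unseen samples
print(clf.coef_.shape)                # (n_classes, n_features)
```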
