Python Data Analysis Module

There are many data science Python libraries available today. Some are already popular, while others are steadily improving to reach the acceptance levels of their peers.

numpy

·—— numpy is a third-party Python library that supports mathematical operations on matrices (together with scipy it covers much of Python's statistics and linear algebra); it is often used in deep learning, for example in conjunction with TensorFlow.
numpy provides many advanced numerical programming tools, such as matrices, vectors, and sophisticated arithmetic libraries.
It is convenient to use numpy together with the sparse-matrix computation package scipy.
(Broadcasting: when a matrix and a scalar are multiplied together, the scalar is expanded to the shape of the matrix; this expansion process is called broadcasting.)
(One advantage of using it is that explicit loop statements are significantly reduced and numerical computation is more efficient, since the underlying operations call directly into C.)
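A minimal sketch of broadcasting and loop-free (vectorized) arithmetic:

    import numpy as np

    # A 3x3 matrix times a scalar: the scalar is "broadcast" to the matrix's shape.
    m = np.arange(9).reshape(3, 3)
    print(m * 2)            # elementwise multiply, no explicit loop

    # Broadcasting also works between arrays of compatible shapes:
    row = np.array([10, 20, 30])
    print(m + row)          # row is expanded across each of m's rows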

scipy
Similar to numpy, but it extends numpy's functionality.
Both numpy and scipy implement Fourier analysis; the most efficient algorithm is the fast Fourier transform (FFT).

Fourier analysis: a mathematical method, based on Fourier series, for representing functions; it usually expresses a function as an infinite series of sine and cosine terms.
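A small sketch that uses scipy's FFT to recover the frequency of a sampled sine wave (the 64 Hz sampling rate and 5 Hz signal are arbitrary illustration values):

    import numpy as np
    from scipy.fft import fft, fftfreq

    n, dt = 256, 1.0 / 64            # 256 samples at 64 Hz
    t = np.arange(n) * dt
    signal = np.sin(2 * np.pi * 5 * t)

    spectrum = np.abs(fft(signal))
    freqs = fftfreq(n, dt)
    print(freqs[np.argmax(spectrum[: n // 2])])   # ~5.0, the input frequency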

matplotlib
·—— A data display and visualization tool; it can be used to display classified data and for related processing such as plotting linear regressions.
Its procedural API is similar to MATLAB's. pandas also wraps some related matplotlib functions and can complete some data-visualization tasks on its own.
Create a graph using the plot() function in the pyplot sub-library.
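For example, a minimal pyplot graph:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 100)
    plt.plot(x, np.sin(x), label="sin(x)")   # create the line graph
    plt.legend()
    plt.show()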

pandas (panel data)
·—— A commonly used third-party library for data manipulation and data analysis. Together with numpy it can complete various statistical analysis tasks, and it supports data retrieval, processing, and storage.
It can read and write CSV, JSON, Excel, etc.
Its DataFrame data structure has many statistical functions for data statistics.
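A short sketch of the DataFrame workflow (the column names and values are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"city": ["A", "B", "A"], "sales": [10, 20, 30]})
    print(df.describe())                   # built-in summary statistics
    print(df.groupby("city")["sales"].sum())
    df.to_csv("sales.csv", index=False)    # CSV out; JSON, Excel, etc. work too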

NLTK
·—— A third-party library for natural language processing, commonly used in the NLP field. It can build bag-of-words models (word counts) and supports word-frequency analysis (counting word occurrences), pattern recognition, association analysis, sentiment analysis (word-frequency analysis plus scoring metrics), visualization (with matplotlib for analysis graphs), naive Bayes classification, and more.
Corpora are downloaded through the download method; the full collection is one or two GB.
Stop words: commonly used words that carry little information, such as the Chinese particle "的", English "to", and various conjunctions.
NLTK cannot create feature vectors by itself, but sklearn can assist with that.
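A minimal word-frequency sketch; the stopwords download is a one-off step, and the example sentence is invented:

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")    # one-off; the full corpus collection is far larger

    text = "NLTK makes it easy to count the words that carry real meaning"
    words = [w for w in text.lower().split() if w.isalpha()]
    filtered = [w for w in words if w not in stopwords.words("english")]
    print(nltk.FreqDist(filtered).most_common(3))   # word-frequency analysis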

Naive Bayes: a way of using new evidence to update the probability of events (Bayes' theorem).
It is called "naive" because it assumes that the features are independent of each other.

Metrics: term frequency and inverse document frequency, TF-IDF (an important part of information retrieval).
It weights the words (distinguishing stop words, common words, domain-specific words, etc.).
sklearn implements it and produces a scipy sparse matrix.
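A sketch with sklearn's TfidfVectorizer (the documents are invented; get_feature_names_out assumes a recent scikit-learn):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets",
    ]
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)           # a scipy sparse matrix
    print(X.shape)
    print(vec.get_feature_names_out())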

sklearn (scikit learn)
·—— A third-party library that implements many machine learning algorithms (its neural-network support is still limited).
sklearn is a commonly used Python third-party module for machine learning that encapsulates many common methods. When performing machine learning tasks, there is no need to implement every algorithm yourself; simply calling sklearn can accomplish most machine learning tasks.
Machine learning tasks usually include classification and regression. Commonly used classifiers include SVM, KNN, naive Bayes, linear regression, logistic regression, decision trees, random forests, XGBoost, GBDT, boosting, and neural networks (NN).
Common dimensionality-reduction methods include TF-IDF, the topic model LDA, principal component analysis (PCA), etc.

SVM maps data points into a higher-dimensional space through a kernel function, which can be linear or non-linear.
In this way, the classification problem is reduced to finding a hyperplane that divides the space in two, or several hyperplanes that separate the data points well into different regions (categories).
It classifies using the concept of a hyperplane, with a soft margin to express tolerance for errors.
Kernel function types include: sigmoid, radial basis (RBF), polynomial, and linear.

Support vector regression: SVR
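A minimal classification and regression sketch on the iris dataset bundled with sklearn:

    from sklearn import datasets, svm
    from sklearn.model_selection import train_test_split

    X, y = datasets.load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = svm.SVC(kernel="rbf", C=1.0)   # C controls the soft-margin tolerance
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))     # held-out accuracy

    reg = svm.SVR(kernel="linear")       # the regression counterpart (SVR)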

————
Clustering: grouping data without target labels; it belongs to unsupervised learning, and the number of clusters usually has to be inferred.
The affinity propagation (AP) algorithm is a clustering method that does not need the number of clusters to be specified.
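A sketch showing that AffinityPropagation infers the cluster count itself (the blobs are synthetic; the random_state argument assumes a recent scikit-learn):

    from sklearn.cluster import AffinityPropagation
    from sklearn.datasets import make_blobs

    # Three blobs, but the algorithm is never told the number of clusters.
    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
    ap = AffinityPropagation(random_state=0).fit(X)
    print(len(ap.cluster_centers_indices_))   # inferred number of clusters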

There is also the mean shift algorithm (MeanShift), which does not need the number of clusters to be guessed either.
Basic idea: move toward increasing density to find the cluster centers.
The algorithm finds the maxima of a density function through iteration.

Applications of the MeanShift algorithm:
clustering (K-means-style clustering), image segmentation (map the image into feature space and run mean shift clustering on the sampled points), object contour detection (ray propagation algorithm), and target tracking (optimizing the Bhattacharyya coefficient function).

Steps: randomly select a center point from the data points.
Find all points within the bandwidth distance; call this set M; these points belong to cluster C.
Compute the vector from the center point to each element of M and sum them to get the offset vector.
Move the center point along the offset vector; the distance moved is the magnitude of the offset vector.
Repeatedly recompute the offset vector and shift the center point.
After the center point converges to an optimum, randomly select another center point and repeat, until all points are classified.
For each point, the cluster that visited it most frequently during the iterations is taken as that point's cluster. (A runnable sketch follows.)
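A sketch of the same idea through sklearn's MeanShift, where the bandwidth parameter plays the role of the search radius in the steps above (the blobs are synthetic):

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

    bandwidth = estimate_bandwidth(X, quantile=0.2)   # the search radius
    ms = MeanShift(bandwidth=bandwidth).fit(X)
    print(np.unique(ms.labels_).size)   # clusters found, never specified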

————————————

ANN artificial neural network
A computational model inspired by the brains of higher animals. A so-called neural network is simply a network composed of neurons, each with inputs and outputs...

Decision tree
Very similar to an old-fashioned flowchart, except that a flowchart allows loops and a tree does not.
Decision tree learning: the end nodes are usually called leaf nodes and store the class labels of the classification problem; each non-leaf node corresponds to a Boolean condition on feature values. sklearn uses Gini impurity and entropy as information measures; both indicators estimate the probability that a data item is misclassified.
Decision trees are very easy to understand, use, visualize, and verify.
To visualize a decision tree, you can use Graphviz.
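A minimal sketch on the bundled iris data that exports Graphviz dot text for rendering:

    from sklearn import datasets, tree

    X, y = datasets.load_iris(return_X_y=True)
    clf = tree.DecisionTreeClassifier(criterion="gini", max_depth=3)  # or "entropy"
    clf.fit(X, y)

    # Dot-format text; render it with the Graphviz "dot" tool.
    print(tree.export_graphviz(clf, out_file=None))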

GPU (graphics processing unit): an integrated circuit dedicated to displaying images efficiently

————————————
With the development of big data, NoSQL databases have become popular because of their flexibility.
(For lightweight relational storage, Python also ships the built-in sqlite3 module.)

SQLAlchemy
It is a well-known object-relational mapper (ORM) built on design patterns; it can map Python classes to tables in the database.
You can populate and query the database through it.
dataset is a "lazy" database library that wraps SQLAlchemy.
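A minimal sketch in the SQLAlchemy 1.4+ style against an in-memory SQLite database (the User class and its columns are invented for illustration):

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class User(Base):                    # a Python class mapped to a table
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String)

    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)

    with Session(engine) as session:     # fill and query through the ORM
        session.add(User(name="alice"))
        session.commit()
        print(session.query(User).filter_by(name="alice").one().id)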

PonyORM
Another ORM package; it can optimize queries automatically, and you can query the database through Python generator expressions.

MongoDB
A NoSQL database; documents are stored as JSON (BSON on disk).
redis
A NoSQL key-value database; the underlying layer is implemented in C, and read/write speed is remarkable.
Apache Cassandra
A hybrid of a key-value and a traditional database; data is organized by column families, and each row can use its columns flexibly.

statsmodels
Works together with numpy, scipy, and pandas to complete signal processing, cointegration, filtering, spectral analysis, etc.
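A minimal ordinary-least-squares sketch (the synthetic data is invented for illustration):

    import numpy as np
    import statsmodels.api as sm

    # Fit y = 2x + 1 plus noise.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

    X = sm.add_constant(x)          # adds the intercept column
    model = sm.OLS(y, X).fit()
    print(model.params)             # approximately [1, 2]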

Restful API
The web uses the REST architectural style. Over HTTP you can use methods such as GET, PUT, POST, and DELETE, corresponding to reading, creating, updating, and deleting data items.
A RESTful API returns a JSON string, with resource identifiers passed as positional parameters in the request path.
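A sketch of the four verbs with the requests library; the endpoint URL and payloads are hypothetical:

    import requests   # third-party HTTP client

    BASE = "https://api.example.com/items"              # hypothetical endpoint

    requests.post(BASE, json={"name": "pen"})           # create
    item = requests.get(f"{BASE}/1").json()             # read (JSON response)
    requests.put(f"{BASE}/1", json={"name": "pencil"})  # update
    requests.delete(f"{BASE}/1")                        # delete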

Beautiful Soup, urllib, and scrapy are all used for web scraping: Beautiful Soup parses HTML/XML, urllib fetches pages, and scrapy is a full crawler framework.

Genetic algorithm
An independent branch of machine learning. Sample vectors are the individuals; the population is evaluated with an objective function (natural selection); individuals are selected, crossed over (exchanged), and mutated according to their evaluation (fitness) to obtain a new population.
Suitable for complex environments, such as scenes with a lot of noise and irrelevant data that are constantly updated and run for a long time, much like neural networks.
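A toy sketch of the select/crossover/mutate loop; the all-ones objective and every parameter are invented for illustration:

    import random

    # Evolve bit strings toward all ones; counting ones stands in for fitness.
    LENGTH, POP, GENS = 20, 30, 40

    def fitness(ind):
        return sum(ind)

    pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: POP // 2]                  # selection
        children = []
        while len(children) < POP - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, LENGTH)        # crossover (exchange)
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:                # mutation
                i = random.randrange(LENGTH)
                child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children

    print(fitness(max(pop, key=fitness)))            # close to LENGTH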



