Python Machine Learning and Practice: Knowledge Summary


Supervised learning in machine learning focuses on predicting the target/label of an unknown sample based on existing empirical knowledge.

Depending on the type of the target variable to be predicted, supervised learning tasks are divided into two categories: classification learning and regression prediction.

Supervised learning

Basic workflow of a supervised learning task (see the sketch below):

1. First prepare the training data, which can be text, images, audio, etc.
2. Extract the required features to form feature vectors.
3. Feed the feature vectors, together with their corresponding labels/targets, into the learning algorithm to train a predictive model.
4. Apply the same feature-extraction method to new test data to obtain the feature vectors used for testing.
5. Finally, feed these test feature vectors into the trained predictive model to obtain the predictions.
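A minimal sketch of this workflow with scikit-learn; the digits dataset, StandardScaler, and LogisticRegression are illustrative assumptions rather than choices made in the original notes.

```python
# Sketch of the supervised-learning workflow (prepare -> featurize -> train -> transform test -> predict).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Steps 1-2: prepare data; here the pixel values already serve as feature vectors.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Step 3: feed feature vectors and labels into a learning algorithm to train a model.
scaler = StandardScaler().fit(X_train)          # fit the feature transform on training data only
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Step 4: apply the same feature transform to the new test data.
X_test_scaled = scaler.transform(X_test)

# Step 5: predict with the trained model and inspect the result.
print("test accuracy:", model.score(X_test_scaled, y_test))
```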

Classification learning: binary classification, multi-class classification, and multi-label classification (deciding whether a sample belongs to several different categories at once).

A linear classifier assumes a linear relationship between the features and the classification result; it supports the category decision by computing the product of each feature dimension with its corresponding weight and summing them.

① f = w·x + b (vector form)

For a binary classification problem we want f ∈ {0, 1}, so we need a function that maps the value f ∈ ℝ obtained in ① into the interval (0, 1); this leads to ②, the logistic (sigmoid) function g(z) = 1 / (1 + e^(−z)).

Combining ① and ② gives the classic linear classification model: logistic regression.

Stochastic gradient ascent/descent provides a fast algorithm for estimating the model parameters.

In scikit-learn:

LogisticRegression solves for the parameters with an exact (analytical/optimization-based) method, whereas SGDClassifier estimates the parameters by stochastic gradient descent.

The former takes longer to compute but usually yields slightly better model performance; the latter is the opposite.

When the training data exceeds roughly 100,000 samples, the latter (SGDClassifier) is recommended.
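A hedged sketch of the two estimators side by side; the breast-cancer dataset and the specific parameters are illustrative assumptions.

```python
# LogisticRegression (exact solver) vs. SGDClassifier (stochastic gradient estimation).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Exact optimization; slower on very large datasets.
lr = LogisticRegression().fit(X_train, y_train)

# Same model family estimated with stochastic gradients; preferred above ~100,000 samples.
# (On scikit-learn < 1.1 the loss name is "log" instead of "log_loss".)
sgd = SGDClassifier(loss="log_loss", random_state=33).fit(X_train, y_train)

print("LogisticRegression accuracy:", lr.score(X_test, y_test))
print("SGDClassifier accuracy:     ", sgd.score(X_test, y_test))
```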

Support Vector Machine (classification). Principle: based on the distribution of the training samples, search among all possible linear classifiers for the one that separates the classes best.

The data points that actually determine the optimal linear classification model are called "support vectors" (in a two-dimensional feature space, these are the points of the two classes that lie closest to the separating boundary).
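A small sketch with a linear-kernel SVC; the digits dataset is an illustrative choice, and the point is only to show that the fitted model exposes its support vectors.

```python
# Linear support vector classification; inspect the support vectors after fitting.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

svc = SVC(kernel="linear").fit(X_train, y_train)
print("accuracy:", svc.score(X_test, y_test))
# Only these training points actually determine the decision boundary.
print("support vectors per class:", svc.n_support_)
```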

The Naive Bayes classifier is constructed on the basis of Bayes' theorem.

Abstractly, the naive Bayes classifier considers the conditional probability of each feature dimension separately and then combines these probabilities to make a class prediction for the feature vector. The mathematical assumption of the model is therefore that the class-conditional probabilities of the individual feature dimensions are independent of one another.

Under this assumption, if each of the n features takes only 2 possible values and there are k classes, only 2kn parameters need to be estimated: P(X1=0|Y=C1), P(X1=1|Y=C1), ..., P(Xn=1|Y=Ck).
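A hedged sketch of naive Bayes in practice; pairing MultinomialNB with a bag-of-words vectorizer is a common choice for text, and the 20-newsgroups subset used here is an illustrative assumption (it downloads on first use).

```python
# Naive Bayes text classification under the feature-independence assumption.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
test = fetch_20newsgroups(subset="test", categories=["sci.space", "rec.autos"])

# Each word count is treated as an (approximately) independent feature dimension.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train.data, train.target)
print("accuracy:", model.score(test.data, test.target))
```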

K-nearest-neighbour classification produces different results for different choices of K. As a non-parametric model it has very high computational complexity and memory consumption at prediction time.
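A brief sketch showing how the choice of K changes the result; the iris dataset is an illustrative choice.

```python
# K-nearest-neighbour classification with several values of K.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k:2d} accuracy:", knn.score(X_test, y_test))
```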

Decision tree: the order in which feature nodes are selected must be considered, based on information entropy (or similar impurity measures such as the Gini index).

Ensemble models (classification)

Random Forest classifier

Gradient boosting decision tree
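A minimal sketch of the two ensemble classifiers just listed; the dataset and default parameters are illustrative assumptions.

```python
# Random forest vs. gradient boosting on the same classification task.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

rfc = RandomForestClassifier(random_state=33).fit(X_train, y_train)
gbc = GradientBoostingClassifier(random_state=33).fit(X_train, y_train)

print("Random forest accuracy:    ", rfc.score(X_test, y_test))
print("Gradient boosting accuracy:", gbc.score(X_test, y_test))
```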

Linear regression: LinearRegression and SGDRegressor

Support Vector Machine regression

K-nearest-neighbour regression: predict with the arithmetic mean of the K nearest points, or with a distance-weighted average of them.

Regression tree

Ensemble models (regression)

Ordinary random forest

Boosted tree model (gradient boosting)

Extremely randomized trees (extra trees): when constructing a split node of a tree, the splitting feature is not chosen arbitrarily; instead a random subset of the features is collected first, and then the best node feature is selected from that subset using criteria such as information entropy or the Gini index.
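A hedged sketch that runs several of the regressors listed above on one built-in dataset; the diabetes dataset and the distance-weighted KNN setting are illustrative assumptions.

```python
# Comparing the regression models mentioned above on a single illustrative dataset.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor)

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

models = {
    "LinearRegression": LinearRegression(),
    "SVR": SVR(),
    "KNN regression (distance-weighted)": KNeighborsRegressor(weights="distance"),
    "RandomForestRegressor": RandomForestRegressor(random_state=33),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=33),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=33),
}
for name, model in models.items():
    print(name, "R^2:", model.fit(X_train, y_train).score(X_test, y_test))
```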

Classic unsupervised learning models

Unsupervised learning focuses on discovering the distribution characteristics of the data itself. It does not require labelled data, which saves a great deal of manual labour and allows the scale of usable data to grow almost without limit.

1. Discovering groups in the data: data clustering, which can also be used to find outlier samples.

2. Feature dimensionality reduction: preserving low-dimensional features that still discriminate between the data.

These are very useful techniques in mass data processing.

Data clustering

K-means algorithm (the number of clusters is preset; the cluster centres are updated iteratively until the sum of squared distances from all data points to their cluster centres stabilises).

Process

① First, K points are placed at random in the feature space as the initial cluster centres.

② Then, for each data point's feature vector, the closest of the K cluster centres is found, and the data point is marked as belonging to that cluster centre.

③ After all data points have been assigned, each cluster centre is recomputed from the data newly allocated to that cluster.

④ If, after a full pass, no data point's cluster assignment has changed from the previous pass, the iteration stops; otherwise return to ② and continue the loop (see the sketch below).
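A minimal sketch of this procedure using scikit-learn's KMeans; the synthetic blobs data and K=3 are illustrative assumptions.

```python
# K-means: preset K, then iterate assign/recompute until the assignments stabilise.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=33)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=33).fit(X)
print("cluster centres:\n", kmeans.cluster_centers_)
print("within-cluster sum of squares (inertia):", kmeans.inertia_)
```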

Performance evaluation:

① When the data used for evaluation carries correct category labels, the Adjusted Rand Index (ARI) is used. ARI is similar to the way accuracy is computed for classification problems, while also taking into account that clusters cannot be matched one-to-one with the class labels.

② If the data used for evaluation has no category labels, we usually use the silhouette coefficient to measure the quality of the clustering results. The silhouette coefficient takes into account both the cohesion and the separation of the clusters.

It is used to evaluate the clustering effect and takes values in [-1, 1]; the larger the silhouette coefficient, the better the clustering.

The K-means algorithm has two major defects: ① it easily converges to a locally optimal solution; ② the number of clusters must be set in advance.

The "elbow" observation method can be used to roughly estimate a reasonably appropriate number of clusters.
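A hedged sketch of the evaluation metrics above (ARI with labels, silhouette without) plus a textual elbow scan over candidate K; the synthetic data and the range of K values are illustrative assumptions.

```python
# Evaluating K-means: ARI (labels available), silhouette (no labels), and the elbow method.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=33)

labels = KMeans(n_clusters=3, n_init=10, random_state=33).fit_predict(X)
print("ARI:       ", adjusted_rand_score(y_true, labels))
print("silhouette:", silhouette_score(X, labels))

# Elbow method: inspect (or plot) inertia for several K and look for the bend.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=33).fit(X)
    print(f"K={k}: inertia={km.inertia_:.1f}")
```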

Feature dimensionality reduction

1. Real projects often involve training samples with very high feature dimensionality, and it is often impossible to construct effective features manually from domain knowledge alone.

2. The naked eye can only perceive up to three dimensions; reducing the data dimensionality also makes it possible to visualise high-dimensional data.

Principal Component Analysis (PCA)

PCA can be regarded as a kind of feature selection, but unlike the usual understanding of the term, this "selection" first maps the original feature space so that the data in the new, mapped feature space are mutually orthogonal.

In this way, principal component analysis preserves, as far as possible, low-dimensional data features that still discriminate between the data.
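A minimal sketch of PCA in scikit-learn; reducing the 64-dimensional digits features to 2 components is an illustrative assumption chosen for visualisation.

```python
# PCA: map the original features onto a small number of mutually orthogonal components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                       # the new components are orthogonal
print(X_2d.shape)                                 # (1797, 2)
print("explained variance ratio:", pca.explained_variance_ratio_)
```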

Feature enhancement (feature extraction and feature screening)

Feature extraction (raw data → feature vectors)

Feature extraction is the transformation of raw data into feature vectors; it involves a quantitative representation of the data's characteristics.

Raw data:

1. Digitised signal data (voiceprints, images).

2. A large amount of symbolic text.

① We cannot use symbolic text directly for computation; some preprocessing is required to quantise the text into feature vectors.

Some data features represented by symbols are already relatively structured and are stored in a dictionary data structure.

We then use DictVectorizer to extract and quantise the features.

DictVectorizer handles structured data stored in dictionaries.

How DictVectorizer handles a feature (dictionary entry):

1. Categorical (string) features are encoded as 0/1 binary values (one-hot).

2. Numerical features simply keep their original values.
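A small sketch of DictVectorizer showing both behaviours; the records below are made-up examples.

```python
# DictVectorizer: categorical values become 0/1 one-hot columns, numerical values stay as-is.
from sklearn.feature_extraction import DictVectorizer

records = [
    {"city": "Beijing", "temperature": 22.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)
# On scikit-learn < 1.0 use get_feature_names() instead.
print(vec.get_feature_names_out())   # e.g. ['city=Beijing', 'city=London', ..., 'temperature']
print(X)
```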

② Other text data is more raw, essentially just a series of strings; we use the bag-of-words method to extract and quantise features.

Two ways of computing the bag-of-words representation:

CountVectorizer

TfidfVectorizer

(a) CountVectorizer: counts the frequency with which each word (term) appears in the training text (term frequency);

(b) TfidfVectorizer: in addition to the frequency with which a term appears in the current text (term frequency), it also considers the reciprocal of the number of documents that contain the term (inverse document frequency). The more documents there are in the training text, the more advantageous TfidfVectorizer's way of quantising features becomes, because the point of re-weighting the term frequency is to find the important words that contribute most to the meaning of the text in which they appear.

If a word appears in almost every document, it is a common word and does not help the model categorise the text. Common words that appear in every text are called stop words, and stop words need to be filtered out during text feature extraction.
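A minimal sketch contrasting the two vectorizers with English stop words filtered out; the two sample sentences are made up for illustration.

```python
# Bag-of-words counts vs. TF-IDF weights, with stop words removed.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The cat sat on the mat.", "The dog chased the cat."]

count_vec = CountVectorizer(stop_words="english")
tfidf_vec = TfidfVectorizer(stop_words="english")

print(count_vec.fit_transform(docs).toarray())   # raw term frequencies
print(count_vec.get_feature_names_out())
print(tfidf_vec.fit_transform(docs).toarray())   # term frequency * inverse document frequency
```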

Feature screening (preserving a limited set of good data features)

A good combination of data features does not need to be large to make the model perform outstandingly.

Redundant features do not affect the model's performance, but they waste CPU computation.

Principal component analysis is mainly used to remove redundant, linearly correlated feature combinations, because such redundant combinations contribute little to model training, while genuinely bad features naturally reduce the model's accuracy.

Feature screening differs slightly from PCA, which reconstructs features by selecting principal components: for PCA, we often cannot interpret the reconstructed features;

feature screening, on the other hand, does not modify the feature values and focuses on finding a small number of features that improve the model's performance.
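A hedged sketch of feature screening with a univariate filter; SelectPercentile with the chi-squared score is one common choice, and both it and the digits dataset are illustrative assumptions rather than the method named in the original notes.

```python
# Feature screening: keep only the top-scoring original features, values unchanged.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2

X, y = load_digits(return_X_y=True)

selector = SelectPercentile(score_func=chi2, percentile=20)  # keep the top 20% of features
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)   # selected columns keep their original values
```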

Regularization of the model

Under-fitting and over-fitting

Generalization ability of a model: the model's ability to predict unknown data.

Fitting: the process during training in which a machine learning model updates its parameters so that it fits the observed data (the training set) ever more closely.

Overfitting and underfitting are the two problematic states a model can end up in.

Regularization methods (L1 regularization and L2 regularization)

The aim of regularization is to improve the model's generalization to unknown test data and to avoid overfitting of the parameters.

Therefore, the common regularization approach is to add a penalty term on the parameters to the model's original optimization objective.

L1 regularization drives many elements of the parameter vector towards 0, making the effective features sparse; the corresponding L1-regularized linear regression model is called Lasso.
L2 regularization makes most elements of the parameter vector very small, suppressing differences between the parameters; the corresponding L2-regularized model is called Ridge.
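A hedged sketch contrasting the two penalties; the synthetic regression data (100 features, only 10 informative) and the alpha values are illustrative assumptions.

```python
# L1 (Lasso) vs. L2 (Ridge) regularized linear regression.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 features, but only 10 actually carry signal.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=33)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives many coefficients exactly to 0 (sparse, feature-selecting effect);
# L2 only shrinks coefficients and suppresses differences between them.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0), "of", lasso.coef_.size)
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0), "of", ridge.coef_.size)
```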

Model validation

When no test set is available, we split the existing data into a training set and a development set (validation set).

Model validation methods: hold-out validation and cross-validation.

Hold-out validation (used early on)

A certain proportion of the data is randomly sampled as the training set, and the rest is kept as the validation set (a 7:3 split is common); however, because the random sampling of the validation set is uncertain, the measured model performance is unstable.

Cross-validation (an improved version of hold-out validation)

Hold-out validation is performed multiple times and the results are averaged.
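A minimal sketch of both validation styles; the iris dataset and the SVC model are illustrative assumptions.

```python
# Hold-out validation (one random 7:3 split) vs. 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold-out: a single random split, sensitive to how the split happens to fall.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=33)
print("hold-out accuracy:", SVC().fit(X_train, y_train).score(X_val, y_val))

# Cross-validation: repeat the split several times and average the results.
scores = cross_val_score(SVC(), X, y, cv=5)
print("cross-validation accuracy:", scores.mean(), "+/-", scores.std())
```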

Hyperparameter search

Grid Search

A brute-force search over the space of multiple hyperparameter combinations. Each combination of hyperparameters is plugged into the learning function as a new model, and, to compare the performance of these models, each model is evaluated by cross-validation on the same training and development data.

Parallel search

Training with all of the CPU's cores at once is the parallel approach.
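A minimal sketch of grid search run in parallel; the SVC model and the particular parameter grid are illustrative assumptions.

```python
# Grid search with cross-validation, parallelised across all CPU cores via n_jobs=-1.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)   # n_jobs=-1 -> parallel search
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated score:", search.best_score_)
```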
