I have used a number of statistical algorithms off and on, so here I sort them out briefly and try to explain each algorithm model in "elevator pitch" style (this is the first post, finally done, hehe). The parts I still don't understand will need more work; more important is to reuse the wisdom of others.
Statistical Learning Overview
For statistical learning, my first recommendation is teacher Li Hang's book, Statistical Learning Methods. Here is one sentence defining statistical learning: statistical learning is the discipline in which a computer builds probability models based on data and uses those models to predict and analyze data. From this we can see the two important points in statistical learning: data and probability models.
A statistical learning method has three elements: model, strategy, and algorithm. The model is the probability function or decision function to be learned. The strategy is the criterion we define for learning or choosing the optimal model (without a strategy we cannot evaluate models and make a choice). The algorithm is the concrete method used for learning.
1. k-Nearest Neighbor Method
This section introduces the k-nearest neighbor method in two parts: the idea behind the method, and its three elements.

The idea of k-nearest neighbors. The k-nearest neighbor method is a basic classification method that can also be used for regression. It uses data to express the saying "birds of a feather flock together": if most of the k samples most similar to a given sample (i.e., its nearest neighbors in feature space) belong to a certain category, then that sample belongs to the category too. If most of your k closest friends are rich, you are probably rich as well (of course there is an accuracy problem with that, hh).

The three elements. The idea of k-nearest neighbors embodies three elements: the distance metric, i.e., the criterion that defines how far someone is from you and therefore whether they count as a friend; the choice of k, i.e., how many friends to consult when guessing your situation; and the classification decision rule, usually majority vote, i.e., if most of them are rich then you are judged rich (you could pick a different decision rule, e.g., if even one of your k friends is rich, decide that you belong to the rich; that works too). A minimal sketch follows below.
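Here is a minimal k-NN sketch in Python, purely for illustration; the toy "friends" data and the choice of k = 3 are my own assumptions, not anything from the post. It shows the three elements side by side: the distance metric, the choice of k, and the majority-vote decision rule.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Distance metric: how "close" a neighbor (friend) is.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    # Choice of k: how many nearest neighbors to consult.
    neighbors = sorted(zip(train_x, train_y), key=lambda p: euclidean(p[0], query))[:k]
    # Decision rule: majority vote among the k neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train_x = [(1.0, 1.1), (1.2, 0.9), (8.0, 8.2), (7.9, 8.1)]
train_y = ["poor", "poor", "rich", "rich"]
print(knn_predict(train_x, train_y, (7.5, 8.0), k=3))  # -> "rich"
```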
2. Clustering
This section mainly introduces the ideas behind two clustering algorithms: hierarchical clustering and k-means clustering.
Hierarchical clustering builds a hierarchy of groups by continuously merging the most similar groups. In each iteration, the algorithm computes the pairwise distances between groups and merges the two closest groups into a single group, as sketched below.
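A minimal sketch of that merge loop, under assumptions of my own: single-linkage distance between groups, Euclidean distance between points, and a made-up stopping condition of "merge until two groups remain".

```python
import math
from itertools import combinations

def single_linkage(c1, c2):
    # Distance between two groups = distance of their closest pair of points.
    return min(math.dist(p, q) for p in c1 for q in c2)

def hierarchical(points, target=2):
    clusters = [[p] for p in points]   # start with every point as its own group
    while len(clusters) > target:
        # find the two closest groups among all pairs...
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        # ...and merge them into one group
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

print(hierarchical([(0, 0), (0, 1), (5, 5), (6, 5)]))  # -> two well-separated groups
```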
K-means (I once thought that k-means and KNN referred to the same algorithm, hh). The k-means algorithm works as follows: first select k of the n data objects as the initial cluster centers; assign each remaining object to the cluster whose center (the cluster's representative) it is most similar to, i.e. closest to; then recompute the center of each new cluster as the mean of all objects in the cluster. Repeat this process until the criterion function starts to converge; the mean squared error is commonly used as the criterion function. The resulting k clusters have the property that each cluster itself is as compact as possible, while the clusters are as separated from one another as possible.
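A minimal sketch of that loop, again with made-up toy data; here convergence is simply detected when the centers stop moving, which is a simplification of the "criterion function converges" condition above.

```python
import math
import random

def kmeans(points, k=2, max_iter=100):
    random.seed(0)                                 # reproducible toy run
    centers = random.sample(points, k)             # pick k initial cluster centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                           # assign each point to its nearest center
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # recompute each center as the mean of its cluster
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:                 # centers stopped moving: converged
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], k=2)
print(centers)   # roughly (1, 1.5) and (8.5, 8)
```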
3. Naive Bayes
The naive Bayes method is a classification method based on Bayes' theorem and the conditional independence assumption for features. The basic process: for a given training data set, first learn the joint probability distribution of inputs and outputs under the conditional independence assumption; then, for a given input x, use Bayes' theorem to compute the posterior probability of each class and output the class y with the largest posterior probability (a small sketch follows after the definitions below).
Several terms appear in this section: Bayes' theorem, the feature conditional independence assumption, prior probability, posterior probability, joint probability.
The encyclopedia gives a textual interpretation of Bayes' theorem: in general, the probability that event A occurs given that B has occurred is not equal to the probability that B occurs given that A has occurred; Bayes' theorem describes the relationship between the two. So what exactly is that relationship? It is a formula: the Bayes formula/theorem.
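For reference, that formula is

P(A|B) = P(B|A) · P(A) / P(B)

i.e., the posterior probability of A given B is obtained from the conditional probability of B given A, the prior probability of A, and the probability of B.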
Prior probability: the probability assigned before knowing anything about the specific instance to be classified, judged only from past experience of which class instances tend to belong to.
Posterior probability: a conditional probability computed after some knowledge about the instance to be classified has been taken into account.
Conditional independence assumption: when multiple features are used to determine an instance's class, the features are assumed to be independent of one another given the class.
Joint probability: the probability that two events occur together.
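A minimal naive Bayes sketch for categorical features, with a tiny made-up weather/play dataset (none of it comes from the post). It estimates the prior P(y) and the per-feature likelihoods P(x_i | y) from counts, multiplies them under the conditional independence assumption, and outputs the class with the largest (unnormalized) posterior; the +1/+2 Laplace smoothing is a simplifying assumption to avoid zero counts.

```python
from collections import Counter, defaultdict

def train(samples, labels):
    prior = Counter(labels)                         # class counts, used for the prior P(y)
    cond = defaultdict(Counter)                     # counts for P(feature_i = value | y)
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            cond[(i, y)][v] += 1
    return prior, cond, len(labels)

def predict(x, prior, cond, n):
    best, best_score = None, -1.0
    for y, cy in prior.items():
        score = cy / n                              # prior P(y)
        for i, v in enumerate(x):
            # conditional independence: multiply per-feature likelihoods P(x_i | y),
            # with rough Laplace smoothing so unseen values do not zero out the product
            score *= (cond[(i, y)][v] + 1) / (cy + 2)
        if score > best_score:
            best, best_score = y, score
    return best                                     # class with the largest posterior

samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["play", "play", "stay", "stay"]
print(predict(("rainy", "hot"), *train(samples, labels)))  # -> "stay"
```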
4. Support Vector Machine
A support vector machine (SVM) is a binary classification model, or rather a general term for a family of methods. I have read some material, but have not worked through the derivations in depth. Here are my two main impressions: first, the criterion an SVM pursues when classifying; second, what it does when the given data is not linearly separable.

Imagine many black dots and white dots in a two-dimensional space, and assume the dots can be separated by color. A support vector machine does not just find some line that separates them; it finds the line with the largest margin. How is the maximum margin computed? Intuitively, if there are only two points, that line is the perpendicular bisector of the segment joining them. In higher dimensions, the maximum-margin line becomes a hyperplane.

What if the data is not linearly separable? The data is mapped into a higher-dimensional space where it can be separated; there, the maximum-margin separator is again a hyperplane (a small sketch follows).
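A minimal sketch using scikit-learn (my own choice of library; the post does not name one), with made-up toy data. The first model uses a linear kernel to find the maximum-margin line for separable data; the second uses an RBF kernel, which implicitly maps XOR-like, non-separable data into a higher-dimensional space.

```python
from sklearn.svm import SVC

# Linearly separable black/white dots: a linear kernel finds the max-margin line.
X = [[0, 0], [0, 1], [3, 3], [3, 4]]
y = [0, 0, 1, 1]
linear_svm = SVC(kernel="linear").fit(X, y)
print(linear_svm.support_vectors_)       # the points that pin down the margin

# XOR-like data cannot be separated by any straight line; an RBF kernel
# separates it by working in a higher-dimensional feature space.
X2 = [[0, 0], [1, 1], [0, 1], [1, 0]]
y2 = [0, 0, 1, 1]
rbf_svm = SVC(kernel="rbf").fit(X2, y2)
print(rbf_svm.predict([[0.9, 0.9]]))     # classify a new point near (1, 1)
```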
5. Maximum Entropy Model
This section introduces two parts in turn: entropy, and the maximum entropy principle.

Entropy is a measure of the amount of information. Given a random symbol, how do we measure how much information it carries? Shannon first introduced the concept of information entropy in the field of communication; it is defined mainly through its calculation formula, H = -Σ p(x) log p(x).

The maximum entropy principle is a criterion for learning probability models. It holds that if there are many models that satisfy the current constraints, then the model with the largest entropy is the optimal one. Intuitively: satisfy the known constraints, make no extra guesses about what is not known, and treat the unknown possibilities as equally probable (a small sketch follows).
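A small sketch with made-up numbers, computing the entropy formula above and illustrating the intuition behind the maximum entropy principle: among these distributions over four outcomes, the uniform one, which makes no extra assumptions, has the largest entropy.

```python
import math

def entropy(p):
    # H(p) = -sum(p_i * log2(p_i)); terms with p_i == 0 contribute nothing.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes -> 2.0 bits (maximum)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # a more "opinionated" guess -> about 1.36 bits
print(entropy([1.0, 0.0, 0.0, 0.0]))      # no uncertainty at all -> 0.0 bits
```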
6. Decision Tree Model
A decision tree is a basic method for classification and regression; learning one usually consists of three steps: feature selection, decision tree generation, and decision tree pruning. This section covers two parts: the basic concepts of decision trees, and the ID3 algorithm.

A decision tree model is a tree structure that describes how instances are classified. It consists of nodes and directed edges. There are two kinds of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, and a leaf node represents a class. Classifying with a decision tree works as follows: starting from the root node, test the instance on a certain feature and, according to the result, send it to one of the child nodes, each of which corresponds to one value of that feature; then test and route the instance recursively until a leaf node is reached, which completes the classification.

From this description, the closer a feature sits to the root, the more influence it has on the classification, and the root node should be the most distinguishing feature of all. The ID3 algorithm recursively selects the most distinguishing feature to build each node and eventually forms a decision tree; the C4.5 algorithm does the same, but the two measure how distinguishing a feature is with different criteria (a sketch of ID3's feature selection follows).
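A minimal sketch of the feature-selection step at the heart of ID3, i.e., picking the feature with the largest information gain for the next node. The tiny weather dataset is made up for illustration; a full ID3 implementation would apply this step recursively and stop at pure leaves.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    # Gain = entropy before the split minus the weighted entropy after it.
    gain = entropy(labels)
    for value, count in Counter(row[feature] for row in rows).items():
        subset = [y for row, y in zip(rows, labels) if row[feature] == value]
        gain -= count / len(rows) * entropy(subset)
    return gain

# features: (outlook, humidity); label: whether to go out and play
rows = [("sunny", "high"), ("sunny", "normal"), ("rainy", "high"), ("rainy", "normal")]
labels = ["no", "yes", "no", "yes"]
best = max(range(2), key=lambda f: information_gain(rows, labels, f))
print("split on feature", best)  # -> 1: humidity separates the classes perfectly
```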
Impressions
After trying to learn Chinese word segmentation, tagging prediction, and text classification, I felt I knew something about the models above and wanted to describe them simply in words. But once I opened the blog editor and started writing, it turned out to be really hard! For quite a few of the models, apart from the formulas I have no way to explain them. For some models I only grasped the idea and then used some off-the-shelf toolkits to get the results, at most tuning the parameters. But when it comes to explaining them, I find I have nothing to say!
Originally I planned to spend an afternoon writing a simple summary, but once the models were listed I did not know what to do and habitually started copying and pasting. Then I found there was no end to it, pasting one picture after another! Besides, for each model there are already many summaries better than my blog post, and writing even a single model up well takes real effort. So I repositioned myself: for each model, I write down what I can according to my own understanding, read a bit more, and on that basis try to explain it as clearly as possible. Lastly, my learning is still far from complete, so keep working!
Learning materials:
" Statistical learning method "Li Hang  
"Collective intelligence programming." Programming Collective inteligence "
Shes Yan's blog. K-means algorithm and KNN (k nearest neighbor) algorithm
Understanding the three-layer realm of SVM-a popular introduction to support vector machines