[Language Processing and Python] 6.4 Decision Tree / 6.5 Naive Bayes Classifier / 6.6 Maximum Entropy Classifier


6.4 Decision Tree

A decision tree is a simple flowchart for selecting labels for input values. The flowchart consists of decision nodes, which check feature values, and leaf nodes, which assign labels. To choose a label for an input value, we begin at the flowchart's initial decision node, known as its root node.
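To make the flowchart idea concrete, here is a minimal sketch (not NLTK's implementation) in which a decision node is a small dictionary naming the feature it checks, and a leaf node is simply a label; the tree shape and feature names are purely illustrative.

# Minimal sketch of a decision tree as a flowchart: decision nodes check a
# feature value, leaf nodes assign a label.  (Illustrative, not NLTK's API.)

def classify(node, featureset):
    """Walk the tree from the root node until a leaf node (a plain label) is reached."""
    while isinstance(node, dict):               # decision node
        feature = node["feature"]               # the feature this node checks
        value = featureset.get(feature)         # the input's value for that feature
        node = node["branches"].get(value, node["default"])
    return node                                 # leaf node: the label itself

# Root node checks 'last_letter'; unmatched values fall back to a default leaf.
tree = {
    "feature": "last_letter",
    "branches": {"a": "female", "k": "male"},
    "default": "male",
}

print(classify(tree, {"last_letter": "a"}))     # -> 'female'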

 

Entropy and information gain are used to select the feature tested at each decision node when building the tree (relevant material is easy to find; see the reference and the sketch below).

See http://blog.csdn.net/athenaer/article/details/8425479
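For a concrete starting point, the sketch below shows one common way entropy and information gain can be computed over a list of (featureset, label) pairs; the toy data and the last_letter feature are made up for illustration.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of labels: -sum over labels of p * log2(p)."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labeled_data, feature):
    """Reduction in label entropy obtained by splitting on `feature`.
    `labeled_data` is a list of (featureset, label) pairs."""
    labels = [label for _, label in labeled_data]
    base = entropy(labels)
    groups = {}                                  # group labels by feature value
    for featureset, label in labeled_data:
        groups.setdefault(featureset.get(feature), []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder

data = [({"last_letter": "a"}, "female"), ({"last_letter": "k"}, "male"),
        ({"last_letter": "a"}, "female"), ({"last_letter": "e"}, "male")]
print(information_gain(data, "last_letter"))     # 1.0: this split separates the labels perfectly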

Disadvantages of decision trees:

1. Overfitting may occur.

Since each branch of the decision tree splits the training data, the amount of training data available to train nodes lower in the tree can become quite small. As a result, these lower decision nodes may overfit the training set, learning idiosyncrasies of the training data rather than linguistically meaningful patterns in the underlying problem. One solution is to stop splitting nodes once the amount of training data becomes too small. Another is to grow a full decision tree and then prune the decision nodes that do not improve performance on a dev-test set.

2. Features are forced to be checked in a specific order.

Decision trees force features to be checked in a specific order, even when the features may act relatively independently of one another. For example, when classifying documents by topic (such as sports, automotive, or murder mystery), a feature such as hasword(football) is highly indicative of a particular label regardless of the other feature values. Since space near the top of the tree is limited, most such features need to be repeated on many different branches of the tree, and because the number of branches grows exponentially as we go down the tree, the amount of repetition can become very large.

The naive Bayes classification method discussed below overcomes this limitation by allowing all features to act "in parallel".

6.5 Naive Bayes Classifier

In naive Bayes classification, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, determined by checking the frequency of each label in the training set. The contribution from each feature is then combined with this prior probability to arrive at a likelihood estimate for each label, and the label with the highest likelihood estimate is assigned to the input value.
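Assuming NLTK is installed, the sketch below trains its built-in naive Bayes classifier on a handful of made-up (featureset, label) pairs; the gender_features extractor and the tiny name list are purely illustrative.

import nltk

def gender_features(name):
    # One simple feature per name; illustrative only.
    return {"last_letter": name[-1]}

train_names = [("Anna", "female"), ("Emma", "female"), ("Olga", "female"),
               ("Mark", "male"), ("Jack", "male"), ("Steve", "male")]
train_set = [(gender_features(n), g) for n, g in train_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(gender_features("Laura")))   # label with the highest likelihood
classifier.show_most_informative_features(3)           # features with the strongest say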

The Underlying Probabilistic Model

Another way to understand the naive Bayes classifier is that it chooses the most likely label for an input.

We can calculate the expression P(label | features): the probability that an input has a particular label, given a particular set of features.

P(label | features) = P(features, label) / P(features), where P(features, label) is the likelihood of the label.

P(features, label) = P(label) × P(features | label) = P(label) × ∏_{f ∈ features} P(f | label)

P(label) is the prior probability of a given label, and each P(f | label) is the contribution of a single feature to the likelihood of that label.
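A worked sketch of these formulas on a tiny made-up data set: the prior P(label) and the per-feature likelihoods P(f | label) are estimated from counts, multiplied together, and normalized by P(features).

from collections import Counter, defaultdict

# Tiny labelled corpus of featuresets (values are made up for illustration).
data = [({"outlook": "sunny", "windy": False}, "play"),
        ({"outlook": "sunny", "windy": True}, "stay"),
        ({"outlook": "rainy", "windy": True}, "stay"),
        ({"outlook": "sunny", "windy": False}, "play")]

label_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)            # per-label counts of (feature, value)
for features, label in data:
    for f, v in features.items():
        feature_counts[label][(f, v)] += 1

def posterior(features):
    # P(features, label) = P(label) * product over f of P(f | label)
    joint = {}
    for label, n in label_counts.items():
        p = n / len(data)                        # prior P(label)
        for f, v in features.items():
            p *= feature_counts[label][(f, v)] / n
        joint[label] = p
    z = sum(joint.values())                      # P(features)
    return {label: p / z for label, p in joint.items()}

print(posterior({"outlook": "sunny", "windy": False}))   # {'play': 1.0, 'stay': 0.0}

Note how a single unseen (feature, label) combination drives one label's probability all the way to zero, which motivates the smoothing discussed next.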

Zero Counts and Smoothing

When building a naive Bayes model, we usually employ more sophisticated techniques to prevent the likelihood of a given label from being driven to zero; these are known as smoothing techniques.
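As one concrete example of the idea, the sketch below applies add-one (Laplace) smoothing to a per-label feature count; the numbers are made up, and NLTK's classifier uses its own smoothing estimator rather than exactly this formula.

from collections import Counter

# Counts of the binary feature 'windy' among training examples labelled 'stay'
# (illustrative numbers: windy=False was never observed with this label).
counts = Counter({("windy", True): 2})
label_total = 2          # number of 'stay' training examples
n_values = 2             # 'windy' can take 2 possible values (True / False)

def smoothed(feature, value):
    """Add-one (Laplace) estimate of P(feature=value | label)."""
    return (counts[(feature, value)] + 1) / (label_total + n_values)

print(smoothed("windy", False))   # 0.25 rather than 0, so the label is not ruled out
print(smoothed("windy", True))    # (2 + 1) / (2 + 2) = 0.75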

Non-Binary Features

The Naivety of Independence

Why is it called "naive"? Because it is unrealistic to assume that all features are independent of one another (given the label).

The Cause of Double-Counting

P(features, label) = w[label] × ∏_{f ∈ features} w[f, label] (a form that lets us account for possible interactions between feature contributions during training)

Here, w[label] is the "starting score" for a given label, and w[f, label] is the contribution of a given feature towards the likelihood of that label. The values w[label] and w[f, label] are the parameters, or weights, of the model. Using the naive Bayes algorithm, we set each of these parameters independently:

w[label] = P(label)
w[f, label] = P(f | label)

The classifier discussed in the next section chooses the values of these parameters in a way that takes possible interactions between them into account.

6.6 Maximum Entropy Classifier

Instead of using probabilities to set the model's parameters, the maximum entropy classifier uses search techniques to find the set of parameters that maximizes the total likelihood of the training corpus:

P(features) = Σ_{x ∈ corpus} P(label(x) | features(x))

Here P(label | features) is the probability that an input with the given features has the given class label. It is defined as: P(label | features) = P(label, features) / Σ_{label} P(label, features)
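In NLTK this search is carried out by iterative numerical algorithms. The sketch below assumes NLTK (and NumPy, which its built-in trainers rely on) is installed and reuses the toy gender features from the naive Bayes example; the 'iis' algorithm choice and the small max_iter are arbitrary illustrative settings.

import nltk

def gender_features(name):
    return {"last_letter": name[-1]}

train_names = [("Anna", "female"), ("Emma", "female"), ("Olga", "female"),
               ("Mark", "male"), ("Jack", "male"), ("Steve", "male")]
train_set = [(gender_features(n), g) for n, g in train_names]

# Improved iterative scaling ('iis') is one of the trainers NLTK ships with; each
# iteration adjusts the weights to increase the total likelihood of the training corpus.
maxent = nltk.MaxentClassifier.train(train_set, algorithm="iis", max_iter=10)
print(maxent.classify(gender_features("Laura")))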

Maximum Entropy Model

The maximum entropy classifier model is a generalization of the naive Bayes classifier model.

The following content is taken from: http://wiki.52nlp.cn/%E6%9C%80%E5%A4%A7%E7%86%B5%E6%A8%A1%E5%9E%8B%E4%B8%8E%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86

Maximum Entropy Model and Natural Language Processing

In daily life, many phenomena exhibit a certain degree of randomness: the outcome of an experiment is often uncertain, and the probability distribution governing the random phenomenon is unknown; all we have are some samples or sample features. A question statistics is often concerned with is: in this situation, how can we make a reasonable inference about the distribution? The maximum entropy method infers an unknown distribution from exactly this kind of sample information.

The principle of maximum entropy was proposed by E. T. Jaynes. Its main idea is that, when we have only partial knowledge about an unknown distribution, we should choose the probability distribution that is consistent with that knowledge but has the highest entropy; there may be more than one distribution consistent with the known facts. We know that entropy measures the uncertainty of a random variable: when entropy is largest, the random variable is the most uncertain, that is, the most random and the hardest to predict accurately. In this sense, the essence of the maximum entropy principle is that, given some known knowledge, the most reasonable inference about the unknown distribution is the one that is consistent with that knowledge while remaining maximally uncertain, or random, about everything else. This is the only unbiased choice we can make; any other choice would amount to adding constraints and assumptions that cannot be justified by the information we possess.

Many problems in natural language processing can be cast as statistical classification, and many machine learning methods can be applied. In natural language processing, statistical classification typically takes the form of estimating the probability P(a, b) that a class a co-occurs with a context b. The content and meaning of the class a and the context b vary from task to task. In part-of-speech tagging, a class is a tag from the tagset, while the context is, for example, the word preceding the word to be tagged, the following word and its tag, or several surrounding words and tags. More generally, a context is sometimes a word, sometimes a tag, and sometimes a previous decision in the processing history. A large-scale corpus usually contains co-occurrence information about a and b, but occurrences of b in the corpus are often sparse, so the corpus is never large enough to compute a reliable P(a, b) for every possible pair (a, b). The problem is therefore to find a method that can reliably estimate P(a, b) under sparse-data conditions; different approaches adopt different estimation methods.

p* = argmax_{p ∈ P} H(p)

P = { p | p is a probability distribution over X that satisfies the constraints }

A feature is a pair (x, y), where:

y: the information to be predicted by the feature
x: the context information in the feature

A sample of a feature (x, y) is the distribution, in the standard (annotated) set, of the syntactic phenomenon described by the feature: a set of pairs (xi, yi), where each yi is an instance of y and each xi is the context of yi.

Feature function: for a feature (x0, y0), define the feature function:

f(x, y) = 1, if x = x0 and y = y0
f(x, y) = 0, otherwise

Expected value of the feature function:

For a feature (x0, y0), its expectation with respect to the sample is:

E_{P'}(f) = Σ_{(x, y)} P'(x, y) f(x, y)

where P'(x, y) is the empirical probability that (x, y) appears in the sample.
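A small sketch of these two definitions, using a made-up sample of (context, outcome) pairs: the feature function is an indicator for a single event (x0, y0), so its empirical expectation is simply the relative frequency of that event in the sample.

from collections import Counter

# A sample of (context, outcome) pairs, i.e. (xi, yi); the data is made up.
sample = [("the", "DET"), ("the", "DET"), ("book", "NOUN"), ("book", "VERB")]

def make_feature(x0, y0):
    """Indicator feature function for the single event (x0, y0)."""
    return lambda x, y: 1 if (x, y) == (x0, y0) else 0

def empirical_expectation(f, sample):
    """E_P'(f) = sum over (x, y) of P'(x, y) * f(x, y), where P'(x, y) is the
    relative frequency of (x, y) in the sample."""
    counts = Counter(sample)
    n = len(sample)
    return sum((c / n) * f(x, y) for (x, y), c in counts.items())

f = make_feature("book", "NOUN")
print(empirical_expectation(f, sample))   # 1/4 = 0.25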

Constraint:

For each feature, the conditional probability distribution produced by the model must agree with the distribution observed in the training sample, i.e. the model expectation of each feature function f_j must equal its empirical expectation:

E_p(f_j) = Σ_{(x, y)} P'(x) p(y | x) f_j(x, y) = Σ_{(x, y)} P'(x, y) f_j(x, y) = E_{P'}(f_j)

Therefore, the maximum entropy model can be expressed as:

p* = argmax_{p ∈ P} H(p) = argmax_{p ∈ P} ( -Σ_{(x, y)} P'(x) p(y | x) log p(y | x) )

P = { p(y | x) | ∀ f_i: Σ_{(x, y)} P'(x) p(y | x) f_i(x, y) = Σ_{(x, y)} P'(x, y) f_i(x, y), and ∀ x: Σ_y p(y | x) = 1 }

That is, the task reduces to a constrained optimization problem: finding the extremum of H(p) subject to these constraints.
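Solving that constrained problem requires Lagrange multipliers and iterative algorithms such as GIS or IIS, which are beyond a short example, but the objective itself is easy to evaluate. The sketch below computes H(p) = -Σ P'(x) p(y | x) log p(y | x) for a made-up empirical distribution P'(x) and a made-up candidate model p(y | x).

import math
from collections import Counter

# Empirical distribution P'(x) over contexts, from a made-up sample.
sample_contexts = ["the", "the", "book", "book"]
p_tilde = {x: c / len(sample_contexts) for x, c in Counter(sample_contexts).items()}

# A candidate conditional model p(y | x); the probabilities are made up.
model = {
    "the":  {"DET": 0.9, "NOUN": 0.1},
    "book": {"NOUN": 0.5, "VERB": 0.5},
}

def conditional_entropy(p_tilde, model):
    """H(p) = -sum over x of P'(x) * sum over y of p(y|x) * log p(y|x):
    the quantity the maximum entropy model maximizes subject to the constraints."""
    return -sum(
        p_tilde[x] * sum(p * math.log(p) for p in dist.values() if p > 0)
        for x, dist in model.items()
    )

print(conditional_entropy(p_tilde, model))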

 

Generative Classifiers vs. Conditional Classifiers

The naive Bayes classifier is an example of a generative classifier, which builds a model that predicts P(input, label), the joint probability of (input, label) pairs.

Therefore, generative models can be used to answer the following questions:

1. What is the most likely tag for a given input?
2. How likely is a given tag for a given input?
3. What is the most likely input value?
4. How likely is a given input value?
5. How likely is a given input value together with a given tag?
6. What is the most likely tag for an input that might have one of two possible values (but we don't know which)?

The maximum entropy classifier is an example of a conditional classifier. A conditional classifier builds a model of P(label | input), the probability of a label given the input value. Conditional models can therefore still be used to answer questions 1 and 2.
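For instance, NLTK's classifiers expose a prob_classify method that returns a probability distribution over labels, which answers questions 1 and 2 directly; the example below reuses the toy naive Bayes model sketched earlier.

import nltk

def gender_features(name):
    return {"last_letter": name[-1]}

train_names = [("Anna", "female"), ("Emma", "female"), ("Olga", "female"),
               ("Mark", "male"), ("Jack", "male"), ("Steve", "male")]
train_set = [(gender_features(n), g) for n, g in train_names]

nb = nltk.NaiveBayesClassifier.train(train_set)
dist = nb.prob_classify(gender_features("Laura"))

print(dist.max())            # question 1: the most likely label for this input
print(dist.prob("female"))   # question 2: how likely a given label is for this input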
