"Reading notes-data mining concepts and Technologies" Category: Basic concepts

Source: Internet
Author: User
Tags: ID3

Two stages of data classification: the learning phase (constructing the classification model) and the classification phase (using the model to predict the class label of given data).

Decision Tree Induction

Constructs a tree; the path from the root to a leaf node encodes a sequence of attribute tests, and the leaf node stores the predicted class for the tuple.

Constructing a decision tree classifier requires no domain knowledge or parameter setting, so it is well suited to exploratory knowledge discovery. Decision trees can also handle high-dimensional data.

When splitting a node, there are three main attribute-selection measures (a small sketch of each follows the list):

1. ID3: information gain;

2. C4.5: gain ratio;

3. CART: Gini index.
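
As an illustration (not the book's code), a minimal Python sketch of entropy, information gain, and the Gini index on lists of class labels; the example split mirrors the book's 9-yes/5-no age example:

from collections import Counter
from math import log2

def entropy(labels):
    # Info(D): expected information needed to classify a tuple in D
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    # Gini(D): impurity measure used by CART
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    # Gain(A) = Info(D) - sum_j (|Dj|/|D|) * Info(Dj)
    total = len(parent_labels)
    weighted = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(parent_labels) - weighted

D = ["yes"] * 9 + ["no"] * 5
parts = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(information_gain(D, parts))   # ~0.246, like the book's Gain(age)
print(gini(D))                      # ~0.459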

How to prevent overfitting?

Tree pruning: cut off the least reliable branches.

Methods: a) prepruning; b) postpruning.

Although pruned trees are generally more compact than unpruned ones, they can still be large and complex. Decision trees can suffer from repetition and replication, making them hard to interpret. Repetition occurs when the same attribute is tested repeatedly along a branch, e.g., age < 60 followed by age < 45. Replication occurs when duplicate subtrees appear in the tree. Both problems affect the accuracy and interpretability of decision trees.

Two solutions: a) multivariate splits (splitting on combinations of attributes); b) using a different form of knowledge representation (such as rules) instead of decision trees.

If-then rules can be used to construct a rule-based classifier.

Scalability and decision tree induction:

Problem: existing decision tree algorithms such as ID3, C4.5, and CART were designed for relatively small datasets, with the training tuples assumed to fit in memory. When these algorithms must be applied to mining very large real-world databases, their effectiveness becomes a concern.

One approach: sampling, so that the training data fits in memory.

Another approach: the RainForest algorithm, which keeps an AVC-set (Attribute-Value, Classlabel) at each node to summarize that node's training tuples.
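
To make the idea concrete, a rough sketch of what an AVC-set looks like, assuming tuples are stored as Python dicts; this only illustrates the counting structure, not the RainForest implementation itself:

from collections import defaultdict

def avc_set(tuples, attribute, class_attr="class"):
    # Count class labels for each distinct value of `attribute` among this node's tuples.
    counts = defaultdict(lambda: defaultdict(int))
    for t in tuples:
        counts[t[attribute]][t[class_attr]] += 1
    return {v: dict(c) for v, c in counts.items()}

tuples = [{"age": "youth", "class": "no"},
          {"age": "youth", "class": "yes"},
          {"age": "senior", "class": "yes"}]
print(avc_set(tuples, "age"))  # {'youth': {'no': 1, 'yes': 1}, 'senior': {'yes': 1}}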

BOAT (Bootstrapped Optimistic Algorithm for Tree construction) needs fewer scans of the data than RainForest; another advantage is that it can be updated incrementally.

Decision tree induction can also be made interactive: perception-based classification (PBC) is an interactive approach based on multidimensional visualization that lets users bring their background knowledge to the data while the tree is being built.

Bayesian Classification method

Based on Bayes' theorem: anyone who has studied probability theory knows it; the key is to distinguish the prior probability from the posterior probability.
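
For reference, Bayes' theorem, where H is a hypothesis (e.g., a class label) and X is a data tuple:

P(H|X) = P(X|H) * P(H) / P(X)

P(H) is the prior probability and P(H|X) is the posterior probability.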

You can refer to a previous article using Bayesian to do text categorization: http://www.cnblogs.com/XBWer/p/3840736.html

With multiple attributes, naive Bayes assumes class-conditional independence: given the class label, the attributes are assumed to be mutually independent.

When a zero probability is encountered, the Laplacian correction (Laplace smoothing) can be used, i.e., add 1 to each count (also mentioned in the Stanford machine learning public course).
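
A minimal sketch of naive Bayes with add-1 (Laplacian) smoothing for categorical attributes; the attribute names and rows are made up for illustration:

from collections import Counter, defaultdict

def train_naive_bayes(rows, class_attr):
    # Estimate class counts and per-(attribute, class) value counts.
    class_counts = Counter(r[class_attr] for r in rows)
    value_counts = defaultdict(Counter)        # (attr, class) -> Counter of values
    values_per_attr = defaultdict(set)
    for r in rows:
        for a, v in r.items():
            if a != class_attr:
                value_counts[(a, r[class_attr])][v] += 1
                values_per_attr[a].add(v)
    return class_counts, value_counts, values_per_attr

def predict(x, class_counts, value_counts, values_per_attr):
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / total                         # prior P(c)
        for a, v in x.items():
            # add-1 smoothing avoids zero probabilities for unseen attribute values
            p *= (value_counts[(a, c)][v] + 1) / (cc + len(values_per_attr[a]))
        if p > best_p:
            best, best_p = c, p
    return best

rows = [{"outlook": "sunny", "windy": "no", "play": "no"},
        {"outlook": "rainy", "windy": "yes", "play": "no"},
        {"outlook": "sunny", "windy": "yes", "play": "yes"}]
model = train_naive_bayes(rows, "play")
print(predict({"outlook": "sunny", "windy": "no"}, *model))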

Rule-based classification

Uses if-then rules: the IF part is the rule antecedent (precondition); the THEN part is the rule consequent (conclusion).

If a tuple satisfies all the conditions in rule X's antecedent, we say that rule X is satisfied and that it covers the tuple.

How to evaluate a rule: coverage and accuracy. Coverage = number of tuples covered by the rule / total number of tuples; accuracy = number of covered tuples that are correctly classified / number of tuples covered.
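
A quick sketch of these two formulas, treating a rule as a predicate over a tuple (the rule, field names, and data below are illustrative only):

def rule_coverage_accuracy(rule, data, class_attr, target_class):
    # coverage = |covered| / |D|; accuracy = |covered and correct| / |covered|
    covered = [t for t in data if rule(t)]
    correct = [t for t in covered if t[class_attr] == target_class]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

data = [{"age": "youth", "buy_computer": "yes"},
        {"age": "senior", "buy_computer": "no"}]
rule = lambda t: t["age"] == "youth"
print(rule_coverage_accuracy(rule, data, "buy_computer", "yes"))  # (0.5, 1.0)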

What if a tuple can trigger multiple rules?

A conflict resolution strategy decides which rule is activated and assigns its class prediction to tuple X.

How do I create a rule-based classifier?

One way is to extract if-then rules from a decision tree (each root-to-leaf path becomes one rule) and build a rule-based classifier from them. If-then rules may be easier to understand than the corresponding decision tree, which can be very large.

Mapping one rule to every leaf node is clearly impractical when the tree is large: the resulting rule set is too big and hard to interpret.

So how to prune the rule set becomes a problem.

For a given rule, any condition that does not improve the estimated accuracy of the rule can be pruned (removed), thereby generalizing the rule.

Any rule that does not contribute to the overall accuracy of the entire rule set is also pruned.

Of course, pruning introduces a new problem: the resulting rules are no longer mutually exclusive and exhaustive.

Rule induction using a sequential covering algorithm: if-then rules can be extracted directly from the training data (i.e., without first building a decision tree).

Common algorithms: AQ, CN2, RIPPER.

General strategy:

As you can see, the role of Learn_One_Rule is to find the "best" rule for the tuples that remain uncovered.
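
The general strategy, written out as a hedged Python-style sketch; the real Learn_One_Rule performs a greedy attribute-test search, and the rule object's quality and covers attributes are only placeholders here:

def sequential_covering(data, classes, learn_one_rule, min_quality):
    # Learn one rule at a time, remove the tuples it covers, repeat per class.
    rule_set = []
    for c in classes:
        remaining = list(data)
        while True:
            rule = learn_one_rule(remaining, c)   # find the "best" rule for the uncovered tuples
            if rule is None or rule.quality < min_quality:
                break
            rule_set.append(rule)
            # keep only tuples the new rule does not cover
            remaining = [t for t in remaining if not rule.covers(t)]
    return rule_set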

So, how are rules generated?

From general to specific, using a greedy depth-first strategy: each time a new attribute test could be added to the current rule, the algorithm chooses the test that most improves rule quality on the training samples.

How do you measure the quality of a rule?

1. Entropy (favors conditions that cover a large number of tuples of a single class and few tuples of other classes)

2. Information gain (favors rules that have high accuracy and cover many positive tuples; see the sketch after this list)

3. Coverage
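
One concrete measure of the second kind is FOIL's information gain, as I recall it from the book (worth double-checking against the text); pos/neg are the positive/negative tuples covered by the current rule, and pos'/neg' those covered by the extended rule. The counts below are illustrative:

from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    # FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))
    if pos == 0 or pos_new == 0:
        return 0.0
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

# Extending a rule: it covered 100 pos / 50 neg; the extended rule covers 60 pos / 5 neg.
print(foil_gain(100, 50, 60, 5))   # ~28.2: higher accuracy while still covering many positives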

Rule pruning: Preventing overfitting

Model evaluation and selection

Measures for evaluating classifier performance: accuracy (recognition rate), sensitivity (recall), specificity, precision, F1, and Fβ. These measures assume that the tuples of the various classes are roughly evenly distributed:

P denotes the positive tuples and N the negative tuples; e.g., positive tuple: buy_computer = yes, negative tuple: buy_computer = no.

Confusion matrix: a useful tool for analyzing how well a classifier recognizes tuples of different classes.

From the matrix it is easy to read off the accuracy (recognition rate).
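
A minimal sketch of the measures above computed from the four confusion-matrix cells; the counts in the example call are made up for illustration:

def classifier_measures(tp, tn, fp, fn, beta=1.0):
    p, n = tp + fn, tn + fp                 # actual positives / actual negatives
    accuracy = (tp + tn) / (p + n)
    sensitivity = tp / p                    # recall, true positive rate
    specificity = tn / n                    # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    f_beta = (1 + beta**2) * precision * sensitivity / (beta**2 * precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1, f_beta

print(classifier_measures(tp=90, tn=9560, fp=140, fn=210))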

Class imbalance (the main class of interest is rare: the dataset distribution has negative tuples in the significant majority and positive tuples in the minority, e.g., "non-fraud >> fraud"):

Sensitivity and specificity

Precision and recall

So far these measures assume that each object belongs to a unique class and that every object can be classified. If instead a tuple may belong to more than one class, we no longer require the classifier to return a single class label (i.e., exactly which class), but rather a probability distribution over the classes. The accuracy measure can then use a "guess twice" criterion: a class prediction is judged correct if it agrees with the most probable or the second most probable class. Although this accounts for the non-unique classification of tuples to some extent, it is not a complete solution.

Holdout method and random subsampling

What we usually do: part of the data is used as the training set and the rest as the test set.

Cross-validation

Here you can refer to lecture 11 of the Stanford machine learning course.

K-fold cross-validation

Stratified cross-validation

In general, take k = 10.
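
For instance, a hedged scikit-learn sketch of stratified 10-fold cross-validation; the dataset and the decision tree model are chosen only for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # k = 10, stratified folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())   # mean accuracy over the 10 folds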

Bootstrap method (suitable for small samples)

The most common is the .632 bootstrap: a training set of d tuples is sampled d times with replacement, so on average about 63.2% of the original tuples appear in the bootstrap sample (the chance that a given tuple is never picked is (1 - 1/d)^d ≈ 0.368).
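
As I recall the book's formulation (worth double-checking), the overall accuracy estimate over k bootstrap iterations combines test-set and training-set accuracy:

Acc(M) = (1/k) * Σ_i [ 0.632 * Acc(M_i on its test set) + 0.368 * Acc(M_i on its training set) ]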

Selecting models using statistical significance tests

First, understand what a significance test is:

A significance test makes a hypothesis in advance about a population (random variable) parameter or about the form of the population distribution, and then uses sample information to judge whether this hypothesis is reasonable, i.e., whether the true situation of the population differs significantly from the original hypothesis. In other words, it asks whether the difference between the sample and the assumption we make about the population is merely a chance variation, or is caused by a genuine inconsistency between our assumption and the population. The significance test thus tests the assumption we make about the population; its principle is to accept or reject the hypothesis based on the idea that "a small-probability event is practically impossible in a single trial".

Sampling experiments produce sampling error, so we cannot draw conclusions simply by comparing two results (means or rates) from the experimental data; we must carry out statistical analysis to determine whether the difference between the two is caused by sampling error or by the specific experimental treatment.
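
In practice, comparing two classifiers' errors over the same 10 cross-validation folds is often done with a paired t-test; a hedged sketch using scipy, where the fold-wise error rates below are made-up illustrative numbers:

from scipy.stats import ttest_rel

# error rates of models M1 and M2 on the same 10 cross-validation folds (illustrative)
err_m1 = [0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.13]
err_m2 = [0.14, 0.13, 0.16, 0.12, 0.15, 0.11, 0.15, 0.14, 0.12, 0.16]
t_stat, p_value = ttest_rel(err_m1, err_m2)
# If p_value is below the chosen significance level (e.g. 0.05), the difference is
# unlikely to be due to chance alone, and we reject the null hypothesis of "no difference".
print(t_stat, p_value)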

Comparing classifiers using cost-benefit analysis and ROC curves
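
A hedged scikit-learn sketch of an ROC comparison; the dataset and the logistic regression model are chosen only for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)   # ROC: true positive rate vs. false positive rate
print(roc_auc_score(y_te, proba))               # area under the ROC curve; closer to 1.0 is better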


Techniques to improve classification accuracy: ensemble (combination) methods (overview)

In short, several different classifiers are built using multiple classification methods, and the final classification result is obtained by combining their predictions.

Bagging, boosting, and random forests are examples of ensemble classification methods.

Bagging

Very simple and much like the bootstrap: sample with replacement to create k models; the ensemble classifier gives each model the same weight, and a new tuple is classified by returning the majority vote.
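
A bare-bones sketch of bagging with equal-weight majority voting; scikit-learn decision trees and numpy arrays are used only for illustration, and this is not the book's pseudocode:

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]       # equal-weight majority vote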

Boosting and AdaBoost

Boosting: the classifiers do not all carry the same weight; each classifier's vote is weighted according to its accuracy.

AdaBoost:
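
A quick, hedged scikit-learn usage sketch (the dataset is only for illustration). The key idea is that each round increases the relative weight of previously misclassified tuples, and each classifier's vote is weighted by its accuracy:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Each round re-weights the training tuples so that misclassified tuples get more attention.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(clf, X, y, cv=10).mean())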

Random Forest

Adapted from: http://www.cnblogs.com/tornadomeet/archive/2012/11/06/2756361.html

Each individual classifier is a decision tree, so the ensemble classifier is a "forest".

In machine learning, a random forest is made up of many decision trees; because these trees are grown using randomized procedures, they are called random decision trees. The trees in a random forest are independent of one another. When test data enters the random forest, each decision tree classifies it, and the class chosen by the most trees is taken as the final result. So a random forest is a classifier consisting of multiple decision trees, and its output class is the mode of the classes output by the individual trees. Random forests can handle discrete-valued attributes, as in the ID3 algorithm, as well as continuous-valued attributes, as in the C4.5 algorithm. In addition, random forests can also be used for unsupervised clustering and anomaly detection.

A random forest is built from decision trees, and a decision tree is essentially a way of partitioning the feature space with hyperplanes: each split divides the current region in two. For example, the following decision tree (whose attribute values are continuous real numbers):

Divide the space into a form like this:

Advantages of random forests: well suited to multi-class classification problems; fast to train and to predict; tolerant of noise in the training data and an effective way to estimate missing values, maintaining accuracy even when a large proportion of the data is missing; able to handle large datasets efficiently; able to deal with thousands of input variables without variable deletion; produces an internal unbiased estimate of the generalization error during classification; can detect interactions between features and estimate their importance; resistant to overfitting; simple to implement and easy to parallelize.
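
A hedged scikit-learn usage sketch (the dataset is chosen only for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Each tree is grown on a bootstrap sample, choosing splits from a random subset of features.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(rf.score(X_test, y_test))
print(rf.feature_importances_[:5])   # the internal feature-importance estimate mentioned above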

Improving classification accuracy on class-imbalanced data

Traditional classification algorithms are designed to minimize classification error and assume that the costs of false positives and false negatives are equal. Therefore, they are not well suited to class-imbalanced data.

Other methods:

1. Oversampling (does not change the structure of the classification model; it changes the distribution of tuples in the training set so that the rare class is well represented: positive tuples are repeatedly sampled until the resulting training set contains equal numbers of positive and negative tuples; see the sketch after this list)

2. Undersampling (does not change the structure of the classification model; it changes the training distribution by reducing the number of negative tuples)

3. Threshold moving (does not change the structure of the classification model; it affects the decision threshold the model applies when classifying new data)

4. Ensemble techniques
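
A minimal sketch of random oversampling of the rare (positive) class, not taken from the book; the fraud example data are made up:

import random

def oversample(data, class_attr, positive_class, seed=0):
    # Repeat-sample positive tuples until the classes are balanced.
    rng = random.Random(seed)
    pos = [t for t in data if t[class_attr] == positive_class]
    neg = [t for t in data if t[class_attr] != positive_class]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return data + extra

data = [{"amount": 10, "fraud": "no"}] * 98 + [{"amount": 900, "fraud": "yes"}] * 2
balanced = oversample(data, "fraud", "yes")
print(sum(t["fraud"] == "yes" for t in balanced),
      sum(t["fraud"] == "no" for t in balanced))   # 98 98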

Summary:

"Reading notes-data mining concepts and Technologies" Category: Basic concepts
