Day 1: Machine Learning (ML) Basics


I. Introduction to Machine Learning

    • Definition

  The definition of machine learning given by Tom Mitchell: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. For example, in spam filtering, T is classifying incoming emails, P is the fraction of emails classified correctly, and E is a corpus of emails already labeled as spam or not spam.

The definition given by Baidu Encyclopedia: machine learning is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and many other subjects. It studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures, so as to continuously improve their own performance.

    • Classification

Supervised learning: the dataset is labeled, that is, each given sample comes with an answer. Most of the models covered here belong to this category, including the k-nearest neighbor algorithm, decision trees, naive Bayes, logistic regression, support vector machines, etc.

Unsupervised learning: in contrast to supervised learning, the dataset is completely unlabeled. The main assumption is that similar samples are generally close to each other in the data space, so samples can be grouped by computing distances between them; methods include clustering, the EM algorithm, etc. (a brief code sketch contrasting supervised and unsupervised learning follows this list).

Semi-supervised learning: typically used when the amount of data is very large but little or none of it is labeled, because obtaining labels is expensive; training then uses the labeled and unlabeled portions together.

Reinforcement learning: learning driven by rewards, where a reward (incentive) function guides the model to continuously adapt to its environment.
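To make the contrast between the first two categories concrete, here is a minimal Python sketch, assuming scikit-learn is installed; the dataset and parameter choices are only illustrative, not part of the original text.

    # Sketch contrasting supervised and unsupervised learning
    # (assumes scikit-learn; dataset and parameters are illustrative).
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X, y = load_iris(return_X_y=True)

    # Supervised: the labels y are given, so the model learns the mapping X -> y.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    print("supervised predictions:", clf.predict(X[:5]))

    # Unsupervised: only X is used; samples are grouped by distance in feature space.
    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    km.fit(X)
    print("cluster assignments:", km.labels_[:5])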

    • Related concepts

Training set (training set/data) / training samples (training examples): the dataset used for training, i.e., for producing the model or algorithm;

Test set (testing set/data) / test samples (testing examples): a dataset used specifically to test a model or algorithm that has already been learned;

Feature vector (features/feature vector): a set of attributes attached to an instance, usually represented as a vector;

Label: the category label of an instance;

Positive Example (positive example);

Negative example (negative example);

    • Deep Learning (deep learning)

 Deep learning is a newer field built on machine learning. It grew out of neural network algorithms inspired by the structure of the human brain, together with increases in the depth of model structures, and it is accompanied by a series of new algorithms made possible by advances in big data and computing power. As an extension of machine learning, deep learning is applied in fields such as image processing and computer vision, natural language processing, and speech recognition.

    • Machine learning steps

  First, split the data into a training set and a test set. Then train the algorithm on the feature vectors and labels of the training set, and finally use the learned model to evaluate the algorithm on the test set. A separate validation set may also be set aside for parameter tuning.
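A minimal sketch of these steps, assuming scikit-learn and an illustrative k-nearest-neighbor classifier; the split ratio and the value of k are arbitrary example choices.

    # Sketch of the basic workflow: split the data, train on the training set,
    # evaluate on the held-out test set (assumes scikit-learn; parameters are illustrative).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # 1. Split the data into a training set and a test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # 2. Train the algorithm on the feature vectors and labels of the training set.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)

    # 3. Evaluate the learned model on the test set.
    print("test accuracy:", model.score(X_test, y_test))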

II. Model Evaluation and Selection

    • Error-related concepts: overfitting and underfitting

  Error rate: the proportion of incorrectly classified samples out of the total number of samples.

Accuracy (accuracy): 1 - error rate.

Error: The difference between the actual predicted output of the learner and the true output of the sample;

Training error (training error): Also known as empirical error (empirical error), the error of the learner in the training set;

Generalization error (generalization error): The error of the learner on the new sample;

Overfitting (overfitting): when the learner learns the training samples "too well", it is likely to treat characteristics peculiar to the training samples themselves as general properties of all potential samples, which leads to a drop in generalization performance. This phenomenon is called overfitting: the learning ability is so strong that even the uncommon traits contained in the training samples are learned. Overfitting is unavoidable and can only be mitigated or reduced.

Underfitting (underfitting): the learner has not yet learned the general properties of the training samples well. Underfitting is comparatively easy to overcome, for example by expanding branches in decision tree learning or by increasing the number of training epochs in neural network learning.
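The sketch below is an illustration of my own, not from the original text: it fits decision trees of different depths and compares training accuracy with test accuracy. A depth-1 tree tends toward underfitting, while an unrestricted tree tends toward overfitting (high training accuracy, lower test accuracy).

    # Illustrative sketch: how model capacity relates to under- and overfitting
    # (assumes scikit-learn; the dataset and depth values are arbitrary examples).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in (1, 3, None):  # None = grow the tree without a depth limit
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        print("max_depth =", depth,
              "train acc =", round(tree.score(X_train, y_train), 3),
              "test acc =", round(tree.score(X_test, y_test), 3))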

    • Evaluation methods

Suppose there is a dataset D = {(x1, y1), (x2, y2), ..., (xm, ym)}. By handling D appropriately, a training set S and a test set T are produced from it. Here are a few common approaches.

Hold-out method (hold-out):

It divides the dataset D directly into two mutually exclusive sets, one used as the training set S and the other as the test set T. After the model is trained on S, the error on T is used as an estimate of the generalization error. The division into training set and test set should keep the data distribution as consistent as possible.
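A minimal hold-out sketch, assuming scikit-learn; stratified splitting is one common way to keep the class distribution of S and T consistent with D, and the 70/30 ratio is an illustrative choice.

    # Hold-out split that preserves the class distribution via stratified sampling
    # (assumes scikit-learn; the split ratio is illustrative).
    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    print("class counts in S:", Counter(y_train))
    print("class counts in T:", Counter(y_test))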

Cross-validation (cross validation):

  It first divides the dataset D into k mutually exclusive subsets of similar size, each subset D_i keeping the data distribution as consistent as possible. Then, each time, the union of k-1 subsets is used as the training set and the remaining subset as the test set; this yields k pairs of training and test sets, so k rounds of training and testing can be performed, and the final result returned is the mean of the k test results. k typically takes the value 10, which is called 10-fold cross-validation.
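A sketch of 10-fold cross-validation under the same assumptions (scikit-learn, an illustrative classifier); the stratified splitter is one way to keep each subset's distribution consistent.

    # 10-fold cross-validation: train on k-1 folds, test on the remaining fold,
    # and report the mean of the k test results (assumes scikit-learn).
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # StratifiedKFold keeps the data distribution of each subset D_i consistent.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv)
    print("fold accuracies:", np.round(scores, 3))
    print("mean accuracy:", scores.mean())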

Bootstrap method (bootstrapping):

For a given dataset D containing m samples, sample from it m times with replacement; the resulting new dataset is D'. About 36.8% of the samples in the initial dataset D never appear in D', so D' can be used as the training set and D \ D' as the test set.
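The 36.8% figure comes from (1 - 1/m)^m approaching 1/e as m grows. The sketch below, assuming only numpy and an arbitrary dataset size, draws a bootstrap sample and measures the out-of-bag fraction empirically.

    # Bootstrap sampling: draw m samples with replacement from D to form D',
    # and use the samples never drawn (roughly 1/e ~ 36.8% of D) as the test set.
    import numpy as np

    rng = np.random.default_rng(0)
    m = 10000                      # size of the original dataset D (illustrative)
    indices = np.arange(m)

    drawn = rng.choice(indices, size=m, replace=True)   # D'
    out_of_bag = np.setdiff1d(indices, drawn)           # D \ D'

    print("out-of-bag fraction:", len(out_of_bag) / m)  # close to 0.368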

    • Performance metrics

The most common performance measure for regression tasks is the mean squared error (MSE);

In classification problems, commonly used performance measures include the confusion matrix, precision, recall, and the F1 measure.
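A short sketch computing these measures, assuming scikit-learn; the toy labels are made up purely for illustration.

    # Computing common performance measures (assumes scikit-learn; toy data is illustrative).
    from sklearn.metrics import (mean_squared_error, confusion_matrix,
                                 precision_score, recall_score, f1_score)

    # Regression: mean squared error.
    y_true_reg = [3.0, -0.5, 2.0, 7.0]
    y_pred_reg = [2.5,  0.0, 2.0, 8.0]
    print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))

    # Classification: confusion matrix, precision, recall, F1.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print("confusion matrix:")
    print(confusion_matrix(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:", recall_score(y_true, y_pred))
    print("F1:", f1_score(y_true, y_pred))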

  

 

  
