[Deep-learning-with-python] Machine learning basics

    • Machine learning types
    • Machine learning model evaluation
    • Data preparation for deep learning
    • Feature engineering
    • Overfitting
    • The general process for solving machine learning problems
The four branches of machine learning

Binary classification, multiclass classification, and regression problems are all examples of supervised learning: the goal is to learn the relationship between training inputs and their corresponding labels.
Supervised learning is just the tip of the iceberg. Machine learning can be divided into four broad categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Supervised learning

The most common type of machine learning: learning the mapping between input data and corresponding labels. Almost all deep learning applications in production today are of this kind, such as optical character recognition, speech recognition, image classification, and machine translation.
Although supervised learning mostly consists of classification and regression, it also includes some other variants:

    • Sequence generation: given a picture, generate a caption describing it. Sequence generation can be reformulated as a series of classification problems;
    • Syntax tree prediction: given a sentence, predict its corresponding syntax tree;
    • Object detection: given a picture, draw bounding boxes around the objects it contains;
    • And so on.

Unsupervised learning

Finding interesting transformations of the input data without the help of any labels, for the purposes of data visualization, data compression, data denoising, or understanding the correlations present in the data. Dimensionality reduction and clustering are typical unsupervised learning tasks.

Semi-supervised learning

A special case of supervised learning: supervised learning without manually annotated labels. The training process still uses labels, but those labels are generated from the input data itself, typically by a heuristic algorithm.
For example, autoencoders are a common instance of this kind of learning.

Reinforcement learning

In reinforcement learning, an agent receives information about its environment and learns to choose actions that maximize some reward. For example, a neural network that "looks at" a video-game screen and outputs actions that maximize the game score.
Today, reinforcement learning is still mostly a research area and has yet to produce a landmark practical application.

Model evaluation

The main goal of machine learning is to obtain models that generalize, that is, perform well on new data, and overfitting is the central obstacle. Since we only have access to the data at hand, we must measure the model's generalization ability with an appropriate evaluation method.

Training sets, validation sets, and test sets

Evaluating a model typically means splitting the data into a training set, a validation set, and a test set. You train on the training set, evaluate the model on the validation set, and once the model is ready for use, you run a final check on the test set.
Why not simply use two sets, a training set and a test set? Because developing a model always involves tuning its configuration, such as the number of layers and the number of neurons per layer. The evaluation results on the validation set serve as the feedback signal for this tuning. Tuning hyperparameters is itself a kind of learning: a search for a good configuration in hyperparameter space. As a result of tuning based on validation results, the model ends up fitted to the validation set, even though it was never trained on it.
The cause of this phenomenon is information leakage: every time you tune a hyperparameter of the model based on its performance on the validation set, some information about the validation set leaks into the model. If you adjust a parameter only once, little information leaks and the evaluation stays reliable; but if you repeat the cycle many times (run an experiment, evaluate on the validation set, modify the model), the amount of leaked information grows steadily.
In the end, the trained model performs very well on the validation set, because that is exactly what it was optimized for. But what we ultimately care about is how the model performs on new data, so we need a completely separate, never-before-seen dataset to evaluate it: the test set. The model must not be exposed to the test set until the very end.
Splitting the data into training, validation, and test sets may look straightforward, but there are a few more advanced ways to do it when little data is available: simple hold-out validation, K-fold validation, and iterated K-fold validation with shuffling.

Simple hold-out validation

Set aside some fraction of the data as a held-out set, train on the remaining data, and evaluate on the held-out data. To avoid information leakage, the model's parameters must not be tuned based on its performance on the test set, which is why a separate validation set is still needed.

num_validation_samples = 10000
np.random.shuffle(data)
validation_data = data[:num_validation_samples]
data = data[num_validation_samples:]
training_data = data[:]

model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)
# Evaluate on the validation set, adjust the configuration based on the results,
# then repeatedly train, evaluate, and tune.
# Once the hyperparameters are finalized, train the final model on all
# non-test data (training data plus validation data).
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)

This is the simplest evaluation protocol, but it has one flaw: if little data is available, the validation and test sets will contain too few samples to be statistically representative of the data. This is easy to recognize: if different random shuffles of the data before splitting yield very different evaluation results, you have this problem. K-fold validation and iterated K-fold validation can address it.

K-fold validation

Split the data into K partitions of equal size. For each partition i, train a model on the remaining K-1 partitions and evaluate it on partition i. The final score is the average of the K validation scores obtained.
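A minimal sketch of K-fold validation, written in the same schematic style as the hold-out example above (data, get_model, and the train/evaluate calls are placeholders rather than real Keras APIs):

import numpy as np

k = 4
num_validation_samples = len(data) // k
np.random.shuffle(data)

validation_scores = []
for fold in range(k):
    # Partition number `fold` is used for validation in this round
    validation_data = data[num_validation_samples * fold:
                           num_validation_samples * (fold + 1)]
    # The remaining K-1 partitions are used for training
    training_data = np.concatenate([data[:num_validation_samples * fold],
                                    data[num_validation_samples * (fold + 1):]])
    model = get_model()                     # a brand-new, untrained model for each fold
    model.train(training_data)
    validation_scores.append(model.evaluate(validation_data))

# The final score is the average of the K validation scores
validation_score = np.average(validation_scores)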

Iterated K-fold validation with shuffling

This approach is useful when relatively little data is available but you want to evaluate the model as precisely as possible; it has proven very useful in Kaggle competitions.
It consists of applying K-fold validation multiple times, shuffling the data each time before splitting it into K partitions. The final score is the average of the scores obtained across all runs of K-fold validation. This means training and evaluating P x K models (where P is the number of iterations of K-fold validation), which can be very expensive.
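Schematically, this is just the K-fold loop above repeated P times with a fresh shuffle before each run; run_k_fold below is a hypothetical wrapper around that loop:

p = 5
all_scores = []
for iteration in range(p):
    np.random.shuffle(data)                    # re-shuffle before every K-fold run
    all_scores.append(run_k_fold(data, k=4))   # hypothetical wrapper around the K-fold loop above
final_score = np.average(all_scores)           # average over the P x K evaluations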

Key points

When choosing an evaluation method, you need to be aware of:

    • Data representativeness: both the training set and the test set should be representative of the data. For example, in handwritten-digit classification, if the dataset is sorted by class and you take the first 80% as the training set and the last 20% as the test set, the training set will contain only digits 0-7 and the test set only digits 8-9, and the model will perform poorly. This mistake is surprisingly common; the usual remedy is to randomly shuffle the data before splitting it, as sketched after this list.
    • The arrow of time: if the model is meant to predict the future from past data (for example, weather forecasting), you must not randomly shuffle the data before splitting it.
    • Redundancy in the data: if some samples appear more than once in the dataset, shuffling and then splitting into training and test sets will put duplicates in both, so the model is effectively evaluated on part of its training data and the scores come out inflated. Make sure the training set, validation set, and test set do not overlap.
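For the representativeness point, a minimal sketch of shuffling before splitting (data and labels are assumed to be NumPy arrays of the same length):

import numpy as np

indices = np.random.permutation(len(data))   # one shared random order for samples and labels
data = data[indices]
labels = labels[indices]

num_train = int(0.8 * len(data))             # 80/20 split after shuffling
x_train, y_train = data[:num_train], labels[:num_train]
x_test, y_test = data[num_train:], labels[num_train:]
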
Data preprocessing, feature engineering and feature learning

Besides model evaluation, another question must be settled before training begins: how should the data and labels be processed before being fed to the model? Many data-processing and feature-engineering techniques are domain-specific, and the techniques used differ from one field to another.

Data preprocessing

Data preprocessing aims to make the raw data better suited to the input requirements of the network. It includes vectorization, normalization, handling missing values, and feature extraction.

Vectorization

The inputs and labels of a neural network must be tensors of floating-point values (or, in some cases, tensors of integers). Whatever data you need to process (sound, images, text), it must first be converted into tensors, a step called data vectorization.
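For example, text that has already been turned into lists of word indices is often vectorized by multi-hot encoding; a minimal sketch (the 10000-word vocabulary size and variable names are illustrative):

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sample, one column per word index in the vocabulary
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.   # set the positions of the words present in this sample to 1
    return results

# x_train = vectorize_sequences(train_sequences)   # train_sequences: a list of word-index lists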

Normalization

Overall, it isn't safe to feed a neural network data that takes large values (for example, values much larger than the initial values of the network's weights): learning suffers. To make training easier, the data should satisfy the following (a common normalization recipe is sketched after this list):

    • Take small values: typically, most values should lie in the 0-1 range;
    • Be homogeneous: all features should take values in roughly the same range.
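A common recipe is to normalize each feature to zero mean and unit standard deviation, using statistics computed on the training data only (x_train and x_test are assumed to be NumPy arrays of shape (samples, features)):

mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

x_train -= mean
x_train /= std

# The test data must be normalized with the training-set statistics,
# never with statistics computed on the test set itself
x_test -= mean
x_test /= std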

Handling missing values

In general, filling in missing values with 0 (assuming 0 is not already a meaningful value) works for neural networks: the network learns from exposure to the data that 0 means "missing" and starts ignoring it.
Note that if the training data has no missing values while the test data does, the network will not have learned to ignore 0. In that case you should artificially generate training samples with missing entries: copy some training samples several times and drop some of the features that are likely to be missing in the test data.
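A minimal sketch of that augmentation step, assuming x_train and y_train are NumPy arrays and missing_cols lists the feature columns that may be absent at test time (both names are illustrative):

import numpy as np

def add_samples_with_missing_values(x_train, y_train, missing_cols, copies=1):
    xs, ys = [x_train], [y_train]
    for _ in range(copies):
        x_copy = x_train.copy()
        # Zero out the potentially-missing features in roughly half of the copied samples
        mask = np.random.rand(len(x_copy)) < 0.5
        for col in missing_cols:
            x_copy[mask, col] = 0.0
        xs.append(x_copy)
        ys.append(y_train)
    return np.concatenate(xs), np.concatenate(ys)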

Feature Engineering

Feature engineering: using your own prior knowledge about the data and about the machine learning algorithm to transform the data before it is fed to the model, so that the algorithm works better. You cannot expect the algorithm to learn from arbitrarily represented data; the data should be expressed in a way that makes the model's job easier.
Before deep learning, feature engineering was critical, because classical machine learning algorithms did not have hypothesis spaces rich enough to learn useful features on their own. The way the data was represented was essential to the success of the algorithm.
Fortunately, deep learning is less demanding in this respect, because neural networks can automatically extract useful features from the data. Feature engineering is nevertheless still useful for deep learning:

    • Good features let you solve a problem more quickly and with fewer resources.
    • Good features let you solve a problem with much less data; feature engineering is especially important when only a small amount of data is available.
Overfitting and underfitting

Overfitting happens in every machine learning problem. Learning how to avoid it is essential to mastering machine learning.
The fundamental issue in machine learning is the tension between optimization and generalization. Optimization means adjusting the model to get the best possible performance on the training data; generalization means how well the trained model performs on new data. We can directly work on optimization, but we cannot directly control generalization: we can only adjust the model based on its performance on the training data.
At the beginning of training, optimization and generalization are correlated: as the loss on the training data decreases, the loss on the test data decreases too; at this stage the model is underfitting, that is, it has not yet learned all the relevant patterns in the data. But after a certain number of iterations, generalization stops improving and the metrics on the validation set begin to degrade: the model is overfitting. It has started to learn patterns that are specific to the training data but are misleading when new data arrives.
The best way to keep a model from learning misleading or irrelevant patterns is to gather more training data: the more training data, the better the generalization. When collecting more data is not possible, the next best option is to limit how much information the model can store, or to add constraints on the information it stores. If the model can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most important data characteristics.
The process of fighting overfitting in this way is called regularization. Some of the most common regularization techniques are described below.

Reducing the model size

The simplest way to prevent overfitting is to simplify the model, that is, reduce its size: the number of learnable parameters (determined by the number of layers and the number of neurons per layer). In deep learning, the number of learnable parameters in a network is often referred to as the model's capacity. A model with more parameters has more memorization capacity, so it can easily learn a dictionary-like mapping from training samples to their labels. Deep learning models are always good at fitting the training data, but the real challenge is generalization, not fitting.
On the other hand, if the network's memorization capacity is limited, learning the mapping by rote becomes impossible; to minimize the loss, the model has to learn a compressed representation of the data. We should use models that have enough parameters not to underfit, while not giving them so much capacity that they simply memorize: a trade-off between too much capacity and too little.
Unfortunately, there is no effective rule or formula to determine the right number of parameters. You have to keep experimenting and evaluate candidate sizes on the validation set. A general approach: start with a relatively simple model and gradually increase (or decrease) the number of neurons or layers until the validation loss stops improving.
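As an illustration, a hedged sketch of a lower-capacity alternative to the two-layer, 16-unit classifier used in the regularization examples below (the unit counts are only examples):

from keras import models, layers

# Reference capacity: two hidden layers of 16 units each
original_model = models.Sequential()
original_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
original_model.add(layers.Dense(16, activation='relu'))
original_model.add(layers.Dense(1, activation='sigmoid'))

# Lower capacity: the same architecture with far fewer units per layer
smaller_model = models.Sequential()
smaller_model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
smaller_model.add(layers.Dense(4, activation='relu'))
smaller_model.add(layers.Dense(1, activation='sigmoid'))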

Regularization of weights

A common way to mitigate overfitting is to put constraints on the complexity of the model by forcing its weight coefficients to take only small values. This is weight regularization: a cost associated with large weights is added to the model's loss function. It comes in two flavors:

    • L1 regularization: the added cost is proportional to the L1 norm (the absolute values) of the weight coefficients;
    • L2 regularization: the added cost is proportional to the L2 norm (the squares) of the weight coefficients.

In Keras, weight regularization is added by passing a weight regularizer instance to a layer as a keyword argument. For example, to add L2 regularization:

from keras import models, layers, regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
Dropout layer

Dropout is one of the most effective and most commonly used regularization techniques for neural networks. Applied to a layer, dropout randomly "drops out" (sets to zero) a fraction of the layer's output features during training. The dropout rate is the fraction of the features that are zeroed out, usually set between 0.2 and 0.5. At test time no units are dropped out; instead, the layer's output values are scaled down by a factor equal to the dropout rate, to balance for the fact that more units are active than at training time. (An equivalent alternative is to perform both operations at training time, scaling the outputs up by that factor, and to leave the test-time outputs unchanged.)

In Keras, dropout is introduced via the Dropout layer, which applies dropout to the output of the layer right before it:

model.add(layers.Dropout(0.25))

Practical application:

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
The general workflow of machine learning

Problem definition and data collection

Problem definition:

    • What will the input data be? What are you trying to predict? You can only learn to predict something if you have training data available.
    • What type of problem is it? Binary classification? Multiclass classification? Scalar regression? Vector regression? And so on.
      After answering these questions, be aware of the hypotheses you are making at this stage:
    • The labels can be predicted from the input data;
    • The available data is sufficient to learn the relationship between the inputs and the labels.
      Machine learning can only learn patterns present in the training data, and can only recognize what it has already seen. Using a model trained on data collected in the past to predict future data assumes that the future will behave like the past, and that assumption does not always hold.

Selecting evaluation metrics

Accuracy, precision, recall, and so on. The chosen evaluation metric also guides the choice of the loss function.

Deciding on an evaluation method

Hold-out validation, K-fold cross-validation, or iterated K-fold validation with shuffling.

Preparing the data

Format the data and labels so they can be fed into the model for learning:

    • The data should be expressed as tensors;
    • The values taken by the tensors should be small, typically in the [0, 1] or [-1, 1] range;
    • If different features take values in different ranges, the data should be normalized;
    • Do some feature engineering, especially for small datasets.

Developing a model that beats a baseline

The goal of model training at this stage is to do better than a trivial baseline model: random guessing. For MNIST, for example, the random baseline is an accuracy of 0.1.
Before building your first model, you need to be clear about:

    • The last-layer activation: it constrains the output of the network (sigmoid, relu, and so on);
    • The loss function: it should match the problem type, for example MSE for regression and binary_crossentropy for binary classification;
    • The optimization configuration: which optimizer, and which learning rate? rmsprop with its defaults works well for most problems and is a safe default.

The most common pairings are summarized below.
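The usual pairings of problem type, last-layer activation, and loss function (standard Keras practice, given as guidance rather than hard rules):

    • Binary classification: sigmoid activation, binary_crossentropy loss;
    • Multiclass, single-label classification: softmax activation, categorical_crossentropy loss;
    • Multiclass, multilabel classification: sigmoid activation, binary_crossentropy loss;
    • Regression to arbitrary values: no last-layer activation, mse loss;
    • Regression to values between 0 and 1: sigmoid activation, mse or binary_crossentropy loss.

These choices come together in the compile step, for example:

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])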

Developing a model that overfits

To figure out how large a model you need, you must first develop a model that overfits:

    • Add layers;
    • Add more neurons per layer;
    • Train for more epochs.

Monitor the loss on the training set and the validation set, as well as the training and validation accuracy. When the model's performance on the validation data starts to degrade, overfitting has set in (a minimal monitoring sketch follows).
The next stage is to adjust the model.
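A minimal sketch of that monitoring step, assuming x_train, y_train, x_val, and y_val have already been prepared as in the earlier examples (the epoch count and batch size are illustrative):

history = model.fit(x_train, y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

# history.history holds the per-epoch 'loss' and 'val_loss' (plus any accuracy metrics);
# the epoch where val_loss starts rising is where overfitting begins
val_loss = history.history['val_loss']
best_epoch = val_loss.index(min(val_loss)) + 1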

Model regularization and hyperparameter tuning

This stage takes the most time: you repeatedly train the model, evaluate it on the validation set, modify it, and train again, until it is as good as you can make it. Things to try:

    • Add dropout;
    • Try different architectures: add or remove layers;
    • Add L1 and/or L2 regularization;
    • Try different hyperparameters;
    • Iterate on feature engineering: add new features, remove uninformative ones, and so on.

Every time you use the model's performance on the validation set to adjust it, information about the validation set leaks into the model. Doing this a few times is harmless, but repeating it too many times will eventually cause the model to overfit the validation set and make the evaluation results untrustworthy.

Once you have settled on the best model parameters and configuration, train the final model on all available non-test data, and then evaluate it one last time on the test set.
