Engineering Applicability Analysis of Major Machine Learning Algorithms
A while ago, news of AlphaGo's match against Lee Sedol and the deep learning behind it flooded everyone's social feeds. That event, however, mainly pushed machine learning further along the depth axis; along the breadth axis (engineering practice) there has been no comparable breakthrough in theory or practice: fields that already use machine learning keep using it, and fields that do not still do not.
The role of engineering analysis
The mission of machine learning is to turn a computer's raw computing power and storage capacity into deductive ability. What it can be converted into is one question; the efficiency of the conversion is another. A research project like AlphaGo, with nearly unlimited computing resources, need not worry about the latter, but in real engineering applications we must always account for mundane issues such as implementation difficulty, memory, efficiency, power consumption, and stability.
This article lists a series of problems that machine learning runs into in practical engineering applications, and discusses the space and time complexity and the parallelizability of the common algorithms.
Evaluation criteria
A machine learning algorithm is usually applied by training a model offline and then deploying it for online prediction. On a server cluster, training and prediction may run on the same kind of device, but in many cases, especially in client-side development, the trained model runs on a device with far fewer computational resources than the training environment.
Offline training is not tightly constrained in computing resources, nor does it need to be real-time; still, if the data volume is large and training takes long, one would like to be able to solve the problem by adding machines (scaling out).
Online prediction has real-time requirements, and on mobile devices power consumption matters as well.
Offline (Training)
The offline training process mainly involves the following factors.
Algorithm implementation difficulty
Although engineering work mostly uses off-the-shelf code libraries, the implementation difficulty of an algorithm affects how many such libraries exist and how good they are. Moreover, as hardware and software architectures evolve, algorithm implementations must keep up with the times (distribution, hardware acceleration, and so on), so implementation difficulty also determines how hard the code is to update.
In addition, for a company "with ideals and ambition", controlling the whole pipeline of a project is necessary; to go really deep, you must implement the machine learning algorithms yourself (see http://www.infoq.com/cn/news/2015/12/manual-machine-algorithm-reason). At that point, implementation difficulty becomes a very real problem.
Algorithm usage difficulty
Usage difficulty includes the robustness of the algorithm itself, which is the most important consideration in engineering: on one hand, whether it is sensitive to dirty data and whether elaborate data cleaning is needed; on the other hand, how many parameters must be set and how strongly those parameters affect the results.
If an algorithm is hard to use and requires specialists, the project's labor cost rises significantly.
Another consequence of high usage difficulty is that results become unpredictable and require repeated experimentation, incurring extra time and labor costs.
Training resources required and parallelizability
For large companies, machines are cheaper than people: with hundreds of servers, computing resources are not a concern. For small companies it is likely the opposite, and an algorithm whose training time or space complexity is too high may simply be unaffordable.
An algorithm's parallelizability determines, on one hand, whether it can scale out (add machines) and, on the other, whether it can use GPU acceleration (which can greatly reduce the resources and energy required).
Online (Prediction)
Model file size
After a supervised learning algorithm is trained on data, a model file is produced for subsequent prediction. The size of this model file determines whether it can be migrated to other devices to run.
The model file size is also one indicator of whether the learning algorithm has overfit.
Prediction speed
Prediction speed is not only a real-time concern but also a power-consumption one: fast prediction generally means fewer computations, and therefore lower power consumption.
On mobile devices this is a very important factor.
Evaluation of individual algorithms
SVM (Support Vector Machine)
Training time complexity consists of inner-product computation + kernel function mapping + gradient descent/SMO. The kernel computation is negligible relative to the inner products, so linear and nonlinear SVMs have similar training time complexity. However, once parameter tuning of a nonlinear SVM is taken into account, its complexity must be multiplied by the number of tuning iterations (grid-tuning an RBF kernel at a granularity of 100 per parameter means 10,000 trainings).
Inner-product computation can be data-parallel. On massive data the quadratic program is solved with SMO or gradient descent; SMO has lower algorithmic complexity but is hard to parallelize, so with large data volumes gradient descent is still preferred. Data-parallel SVM is not hard in principle, but because of the algorithm's complexity, the implementation itself is rather cumbersome.
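As a rough sanity check on that tuning multiplier, here is a sketch; the grid sizes and candidate values are the hypothetical granularity-100 grid from the text, not the defaults of any particular library:

```python
# Hypothetical tuning grid at granularity 100: 100 candidate values each for
# the RBF SVM's C and gamma hyperparameters.
C_grid = [2 ** (i / 2) for i in range(100)]
gamma_grid = [2 ** (-i / 2) for i in range(100)]

# Grid search trains one full SVM per (C, gamma) pair, so the base training
# cost is multiplied by the size of the grid.
n_trainings = len(C_grid) * len(gamma_grid)
print(n_trainings)  # 100 * 100 = 10000 full trainings
```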
Linear and nonlinear SVMs differ significantly at prediction time:
After an SVM finishes training, the samples whose final weights are non-zero are kept as support vectors.
A nonlinear SVM must keep all support vectors and, at prediction time, compute an inner product between every support vector and the target vector, so the number of support vectors determines both the model size and the prediction time complexity. (With massive data, the number of support vectors can itself be massive.)
A linear SVM can merge all support vectors into a single hyperplane equation, which is just a set of parameters on the order of the feature dimension.
In addition, because an SVM only supports binary classification, a combinational scheme is needed to extend the binary classifier to multi-class problems.
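The hyperplane-merging step can be sketched in a few lines of NumPy. The support vectors, dual weights, and bias below are made-up stand-ins for a trained SVM's state (not the output of a real trainer); with a linear kernel, folding them into one weight vector gives the same score at O(m) instead of O(#SV · m) per prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical trained state: support vectors, their labels, dual weights, bias.
sv = rng.normal(size=(50, 3))           # 50 support vectors, 3 features
y = rng.choice([-1.0, 1.0], size=50)    # labels of the support vectors
alpha = rng.uniform(0.1, 1.0, size=50)  # dual coefficients
b = 0.5

x = rng.normal(size=3)  # a target vector to score

# Kernel-style prediction: one inner product per support vector.
score_sv = np.sum(alpha * y * (sv @ x)) + b

# Linear SVM: fold all support vectors into one hyperplane w up front,
# then each prediction is a single O(m) inner product.
w = (alpha * y) @ sv
score_w = w @ x + b
# score_sv and score_w agree; only the per-prediction cost differs.
```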
Reference: http://blog.sina.com.cn/s/blog_5eef0840010147pa.html
This also reduces prediction speed.
Besides the well-known machine learning toolkits, SVM has a dedicated implementation, LIBSVM, which is suitable when the data volume is not too large. With this toolkit you can train a model almost without thinking, and only need to integrate code at prediction time.
Logistic regression
Logistic regression is linear regression with a different loss function. The coefficients are solved with the gradient descent algorithm, and the number of coefficients equals the number of features.
Since gradient descent is used to solve for the coefficients, there is some computation for the derivatives and the offset, but it is significantly less than for SVM, and it scales easily to applications with large data volumes.
Logistic regression underfits in most cases; only occasionally, when there are too many features, does it overfit.
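A minimal sketch of that solving process, assuming plain batch gradient descent on the logistic loss; the toy data set and hyperparameters are invented for illustration:

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=500):
    """Batch gradient descent: one coefficient per feature plus an offset."""
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad_w = X.T @ (p - y) / n               # derivative w.r.t. weights
        grad_b = np.mean(p - y)                  # derivative w.r.t. offset
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Invented toy data: the label is determined by the sign of the first feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

w, b = train_logreg(X, y)
pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
acc = np.mean(pred == y)
```

Each epoch is one pass over the n×m matrix, which is where the O(k·n·m)-style training cost in the summary table comes from.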
Naive Bayes
Training a naive Bayes model is just counting frequencies in the data, which is very fast and yields a set of probability values; at prediction time, the probability of each class is computed and the highest one is taken.
As a simple brute force algorithm, it is widely used in engineering.
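A minimal sketch of that count-then-compare process, assuming a bag-of-words multinomial flavor with Laplace smoothing; the tiny spam/ham data set is invented:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Training is pure frequency counting: class priors + per-class feature counts."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)
    for feats, label in samples:
        class_counts[label] += 1
        for f in feats:
            feat_counts[label][f] += 1
    return class_counts, feat_counts

def predict_nb(class_counts, feat_counts, feats):
    """Score every class in the log domain (Laplace-smoothed) and take the highest."""
    total = sum(class_counts.values())
    vocab = {f for counts in feat_counts.values() for f in counts}
    best, best_score = None, float("-inf")
    for label, cnt in class_counts.items():
        score = math.log(cnt / total)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in feats:
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Invented toy data: (word list, label) pairs.
data = [(["cheap", "pills"], "spam"), (["meeting", "today"], "ham"),
        (["cheap", "offer"], "spam"), (["lunch", "today"], "ham")]
class_counts, feat_counts = train_nb(data)
```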
Decision tree (C4.5)
Training a decision tree is a process of repeatedly finding the optimal split point. Selecting a split may require trying every candidate once, but as the tree keeps splitting, the data volume at each node keeps shrinking, so overall training is very fast.
The resulting model is relatively simple and can be controlled with parameters, which makes it very easy to integrate.
Because it selects features during training, a small number of useless features does little harm. But when the tree grows deep, it overfits.
In general it is faster and more accurate than logistic regression, but with large data volumes the decision tree must scan all the data at each split, so its efficiency is no higher than logistic regression trained with stochastic gradient descent.
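The split-point search can be sketched with plain information gain (C4.5 proper uses the gain ratio, but the scan-every-candidate idea is the same; the toy feature values below are invented):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at `threshold`.
    The tree evaluates this for every candidate split and keeps the best."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return entropy(labels) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))

# A threshold that separates the classes perfectly recovers the full entropy.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
gain = information_gain(values, labels, 3)
```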
Random Forest
A random forest draws random samples of the data and builds several decision trees, combining their outputs by voting.
During training, each tree sees only a sample of the data, but because multiple subtrees must be built, overall complexity is generally higher than a single decision tree.
At prediction time, because the single decision tree has become many (typically dozens), prediction complexity is noticeably higher.
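A sketch of why prediction cost grows with the forest: each query must run through every tree before the vote. The "trees" below are hypothetical one-feature threshold stubs, not trained models:

```python
from collections import Counter

def forest_predict(trees, x):
    """Majority vote over all trees, so prediction cost scales with len(trees)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stub "trees": each thresholds the first feature differently,
# standing in for trees trained on different random samples.
trees = [lambda x, t=t: "pos" if x[0] > t else "neg" for t in (0.2, 0.4, 0.6)]
label = forest_predict(trees, [0.5])  # two "pos" votes vs one "neg"
```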
K-nearest neighbors (KNN)
KNN is a lazy learning method: the training phase does no processing at all, and only at prediction time are the k nearest data points looked up to make a decision.
Because it keeps all the training data, its model file is inevitably huge, which limits this algorithm to server-side use.
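A minimal sketch of that lazy query process, assuming Euclidean distance and a majority vote over the k nearest stored points; the toy training set is invented:

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Lazy learning: no training step; every query scans all stored data,
    which is why the 'model' is the O(m*n) training set itself."""
    nearest = sorted(train, key=lambda point: math.dist(point[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented toy data: (feature vector, label) pairs kept verbatim in memory.
train = [([0.0, 0.0], "a"), ([0.1, 0.1], "a"),
         ([1.0, 1.0], "b"), ([0.9, 1.1], "b")]
```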
The following table summarizes everything.
The training input is an n × m matrix, where n is the number of samples and m the number of dimensions (features); in general m is far smaller than n. A few order-of-magnitude coefficients: t: order of 10; p: order of 100; k: order of 1,000; q: order of 10,000; f: a real number between 0 and 1.
The prediction accuracy column is estimated from past experience with data; take it as a rough impression only.
| Algorithm | Training time | Training memory | Model file size | Prediction time | Implementation difficulty | Usage difficulty | Prediction accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Nonlinear SVM | Long, O(p·m·n²) + O(p·k·n²) | High, O(n²) | O(m·n^f) | Fairly slow | High | Medium | High |
| Linear SVM | Fairly long, O(m·n²) + O(k·n²) | Same as nonlinear SVM | O(m) | Fast | High | Medium | Medium |
| Logistic regression | Moderate, O(k·n·m) | Low, O(m) | O(m) | Fast | Low | Medium | Low |
| Naive Bayes | Low, O(n·m) | Low, O(m) | O(m·t) | Fast | Low | Low | Low |
| C4.5 decision tree | Low, O(t·n·m) | Low, O(m) | O(m²) | Fast | Low | Medium | Medium |
| Random forest | Moderate, O(t·t·n·m) | Low, O(m) | O(m²·t) | Fairly fast | Low | Fairly low | High |
| BP neural network | Moderate, O(p·q·n·m) | Moderate, O(m·p) | O(m·p) | Medium | Medium | Medium | Medium |
| KNN | None | None | O(m·n) | Slow | Low | Low | Medium |
Deep learning
As explained in the previous article, deep learning is a combination of unsupervised and supervised learning algorithms, so its space and time complexity are not easy to pin down.
The number of parameters in a deep learning model is generally independent of the data magnitude. But on one hand, because it is less picky about input than shallow learning, its input vectors tend to be high-dimensional; on the other hand, it benefits from stacking layers on top of each other, so its parameter count ends up very large.
Because the basic deep learning algorithms (convolutional neural networks/CNN, deep belief networks/DBN, etc.) are naturally data-parallel, hardware acceleration is relatively easy, and stochastic gradient descent (SGD) thrives on massive data, so training on massive data is not too slow.
Neural-network-style algorithms have many model parameters, and prediction requires considerable memory and time, so they are difficult to use outside of server environments.
(Part 3 of the series "Writing Machine Learning from a Software Engineering Perspective": an engineering analysis of the major supervised learning algorithms.)