Project applicability analysis of main machine learning algorithms
Some time ago, AlphaGo's match against Lee Sedol and the related deep-learning coverage flooded everyone's social feeds. That event, however, only pushed machine learning further in depth; in breadth (that is, engineering practice) there has been no breakthrough in either theory or practice: the fields that already use it keep using it, and the fields that don't still don't.
The role of engineering trivia in a project
The mission of machine learning is to turn a computer's powerful computing and storage capabilities into inferential capability. Whether that transformation is possible is one question; how efficiently it happens is another. A research system like AlphaGo has almost unlimited computational resources and does not need to worry much about this. But in real project applications we must constantly weigh mundane issues such as implementation difficulty, memory, efficiency, energy consumption, and stability.
Real project applications of machine learning run into a whole series of such problems; this article discusses only the space-time complexity and parallelism of the various algorithms.
Evaluation criteria
In engineering, a machine-learning model is usually trained offline and then deployed online for prediction. On server clusters, training and prediction may run on the same kind of machine, but in many other cases, especially client-side development, the trained model ends up running on a device with far fewer computational resources than the training environment.
Offline training is not tightly constrained by computing resources and does not need to be real-time, but when the data volume is huge and training takes very long, you still want to be able to solve the problem by adding machines (scaling out).
Online prediction, on the other hand, has real-time requirements and, on mobile devices, power-consumption requirements as well.
Offline (Training)
In offline training, the main factors to consider are the following.
Algorithm implementation difficulty
Although in most projects a ready-made code library is used, the implementation difficulty of an algorithm affects the number and quality of the libraries available. Moreover, as computer hardware and software architectures evolve, implementations must keep up with the times (distributed execution, hardware acceleration, and so on), so the difficulty of the algorithm also determines how hard it is to keep its implementation up to date.
In addition, for "have the ideal, have the pursuit of" the company, the control of a project all the process is necessary, really do deep, you must implement the machine learning algorithm, reference article http://www.infoq.com/cn/news/2015/12/ Manual-machine-algorithm-reason.
At that point, how hard the algorithm is to implement becomes a very real problem.
Algorithm use difficulty
The difficulty of using an algorithm also includes the robustness of the algorithm itself, and this is the consideration that matters most in engineering.
On one hand, is it sensitive to dirty data, i.e., does it require elaborate data cleansing? On the other hand, how many parameters must be set, and how strongly do those parameters themselves affect the results?
If an algorithm is hard to use and needs specialists to operate it, the project's labor cost rises significantly.
One consequence of high usage difficulty is that results are unpredictable and work has to be redone repeatedly, which incurs extra time and labor costs.
Resources required for training, and parallelism
For big companies, machines are cheaper than people: with hundreds of servers, computing resources are not a worry. For small companies, however, people may well be cheaper than machines, and a training algorithm whose time and space complexity is too high is not necessarily affordable.
The parallelism of an algorithm determines, on one hand, whether it can scale out (add machines) and, on the other, whether it can use GPU acceleration (which can greatly reduce the resources and energy required).
Online (prediction)
Model file size
After a supervised learning algorithm is trained on data, you obtain a model file to be used for prediction. The size of that model file determines whether it can be moved onto other devices for execution.
The size of the model file is also an indicator of whether the learning algorithm has overfit.
Prediction speed
Prediction speed is not only a matter of real-time response but also of power consumption. High speed usually comes from a low computational load, which in turn means low power consumption.
On mobile devices, this is a very important factor to consider.
Algorithm-by-algorithm evaluation
SVM (support vector machine)
The time complexity of SVM training consists of inner-product computation + kernel mapping + gradient descent/SMO. The kernel computation is negligible compared with the inner-product computation, so the training time complexity of linear and nonlinear SVMs is almost the same.
However, once the hyper-parameter tuning of a nonlinear SVM is taken into account, its complexity must be multiplied by the number of tuning iterations (for example, a grid search over the RBF kernel parameters), roughly on the order of 100 to 10,000 extra runs.
The inner-product computation can be data-parallel. For massive data, the quadratic optimization is solved with SMO or gradient descent; SMO has lower algorithmic complexity but is hard to parallelize, so with large data volumes gradient descent is still the better choice.
The principle of data-parallel SVM is not difficult, but because the SVM algorithm itself is complex, the implementation is fairly troublesome.
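To make the gradient-descent route concrete, here is a minimal sketch (plain NumPy, hypothetical toy data, not any particular library's implementation) of training a linear SVM by stochastic sub-gradient descent on the hinge loss; the per-sample inner products are exactly the part that lends itself to data parallelism.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=20, lr=0.1):
    """Stochastic sub-gradient descent on the hinge loss (a Pegasos-style sketch).

    X: (n, m) data matrix, y: labels in {-1, +1}.
    Each step touches one sample; the X[i] @ w inner products are the part
    that could be distributed across machines or offloaded to a GPU.
    """
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                      # sample violates the margin
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                               # only the regularizer pulls on w
                w -= lr * lam * w
    return w, b

# Toy usage on random, roughly separable data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```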
There is a very large difference between linear and nonlinear SVMs at prediction time:
After SVM training finishes, the samples whose weights are non-zero are kept as support vectors.
A nonlinear SVM must keep all of its support vectors; at prediction time the inner product between the target vector and every support vector has to be computed, so the number of support vectors determines both the model size and the prediction time complexity. (With massive data, the number of support vectors can easily be massive as well.)
A linear SVM, by contrast, can fold all of its support vectors into a single hyperplane equation, which is just one set of parameters on the order of the feature dimension.
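This prediction-time gap can be seen in a small sketch (NumPy, with hypothetical arrays standing in for a trained model): a kernel SVM must evaluate the kernel against every stored support vector, while a linear SVM collapses to a single dot product.

```python
import numpy as np

def predict_rbf_svm(x, support_vectors, dual_coefs, b, gamma=0.5):
    """Kernel SVM: cost and model size grow with the number of support vectors."""
    k = np.exp(-gamma * np.sum((support_vectors - x) ** 2, axis=1))  # one kernel value per SV
    return np.sign(dual_coefs @ k + b)

def predict_linear_svm(x, w, b):
    """Linear SVM: the support vectors were folded into w, so one dot product suffices."""
    return np.sign(w @ x + b)

# Hypothetical trained models: 10,000 support vectors of dimension 50
# versus a single 50-dimensional weight vector.
sv = np.random.randn(10_000, 50)
alpha = np.random.randn(10_000)
w = np.random.randn(50)
x = np.random.randn(50)
print(predict_rbf_svm(x, sv, alpha, b=0.0), predict_linear_svm(x, w, b=0.0))
```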
In addition, because the support vector machine natively supports only binary classification, handling a multi-class problem requires a layer of combination logic that extends the binary classifier into a multi-class classifier, which also slows prediction down.
Reference: http://blog.sina.com.cn/s/blog_5eef0840010147pa.html
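One common form of that combination logic is one-vs-rest. The sketch below (NumPy; it assumes a hypothetical binary trainer returning a weight vector and bias, such as the `train_linear_svm` sketch above) trains one binary classifier per class and, at prediction time, runs all of them and keeps the class with the largest score, which is where the extra prediction cost comes from.

```python
import numpy as np

def train_one_vs_rest(X, y, train_binary):
    """Train one binary classifier per class; y holds integer class labels.

    train_binary is assumed to return a (w, b) pair for labels in {-1, +1}.
    """
    models = {}
    for c in np.unique(y):
        y_bin = np.where(y == c, 1, -1)       # current class vs. everything else
        models[c] = train_binary(X, y_bin)
    return models

def predict_one_vs_rest(x, models):
    """Run every binary model and keep the class with the highest decision score."""
    scores = {c: (w @ x + b) for c, (w, b) in models.items()}
    return max(scores, key=scores.get)
```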
Besides the well-known general machine-learning toolkits, SVM has a dedicated tool, LIBSVM, which is suitable when the data volume is not very large. With this toolkit you can train a model almost without thinking, and only need to integrate its code when making predictions.
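As an illustration of that "train without thinking" workflow, here is a minimal sketch using scikit-learn's SVC (which wraps LIBSVM) on hypothetical data; the data and parameter values are assumptions, not recommendations.

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC is built on top of LIBSVM

X = np.random.randn(500, 10)             # hypothetical training data
y = (X[:, 0] > 0).astype(int)

model = SVC(kernel="rbf", C=1.0, gamma="scale")   # default-ish settings, fine for modest data sizes
model.fit(X, y)
print("support vectors kept:", model.n_support_.sum())   # this count drives model size and prediction cost
print(model.predict(X[:5]))
```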
Logistic regression
Logistic regression is linear regression with a different loss function; the coefficients are solved for with gradient descent, and the number of coefficients equals the number of features.
Because gradient descent is used to solve for the coefficients, repeatedly computing derivatives and updates takes some work, but the computation is noticeably less than for SVM, and it scales very easily to massive-data applications.
The result of logistic regression training is underfitting in most cases; only occasionally, when there are too many features, does it overfit.
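A minimal sketch of that training loop (NumPy, hypothetical toy data): each update is one m-dimensional matrix-vector product and gradient step, which is what makes the method cheap and easy to scale.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=100):
    """Batch gradient descent on the log loss; one coefficient per feature plus a bias."""
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # predicted probabilities
        grad_w = X.T @ (p - y) / n      # derivative of the log loss w.r.t. w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage with labels in {0, 1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(float)
w, b = train_logistic_regression(X, y)
print("training accuracy:", np.mean((sigmoid(X @ w + b) > 0.5) == y))
```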
Naive Bayes
Training naive Bayes is just a matter of counting frequencies in the data, which is very fast and yields a set of probability values; at prediction time, the probability of every class is computed and the highest one is chosen.
As a simple, brute-force algorithm, it is very widely used in engineering.
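For categorical features the whole training step really is just counting, as in this minimal sketch (NumPy; the Laplace smoothing term is my own assumption, added to avoid zero probabilities):

```python
import numpy as np
from collections import defaultdict

def train_naive_bayes(X, y, alpha=1.0):
    """Count frequencies of each (feature value, class) pair; alpha is Laplace smoothing."""
    classes, class_counts = np.unique(y, return_counts=True)
    priors = {c: cnt / len(y) for c, cnt in zip(classes, class_counts)}
    likelihoods = defaultdict(dict)       # likelihoods[(j, value)][class] = P(value | class)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for c in classes:
            col = X[y == c, j]
            for v in values:
                likelihoods[(j, v)][c] = (np.sum(col == v) + alpha) / (len(col) + alpha * len(values))
    return classes, priors, likelihoods

def predict_naive_bayes(x, classes, priors, likelihoods):
    """Pick the class with the highest posterior (log-space to avoid underflow)."""
    scores = {}
    for c in classes:
        score = np.log(priors[c])
        for j, v in enumerate(x):
            score += np.log(likelihoods.get((j, v), {}).get(c, 1e-9))  # tiny prob for unseen values
        scores[c] = score
    return max(scores, key=scores.get)
```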
Decision tree (C4.5)
Training a decision tree is a process of repeatedly searching for the optimal split point. Choosing a split point requires trying every possible candidate once, but as the splitting proceeds the amount of data at each node keeps shrinking, so overall training is very fast.
The model a decision tree produces is relatively simple and its size can be controlled through parameters, which makes it very easy to integrate.
Because features are selected during training, invalid features have little effect on it. But when the tree grows deep, it overfits.
In general it trains faster than logistic regression and is also more accurate, but with large data volumes, because the decision tree must traverse all of the data during computation, its efficiency is not necessarily higher than logistic regression trained with stochastic gradient descent.
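The core of that training loop is the split search. This sketch (NumPy, binary 0/1 labels, numeric features; information gain is used here for simplicity, whereas C4.5 itself uses gain ratio) shows the "try every candidate once" step, which is why each level of the tree costs roughly one pass over the remaining data.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a binary label vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def best_split(X, y):
    """Try every (feature, threshold) candidate once and keep the highest information gain."""
    base = entropy(y)
    best = (None, None, 0.0)            # (feature index, threshold, gain)
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
            if gain > best[2]:
                best = (j, t, gain)
    return best
```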
Random Forest
A random forest samples the data randomly and generates a number of decision trees for use.
During training, each tree sees only a portion of the sampled data, but because multiple sub-trees must be generated, the overall complexity is higher than for a single decision tree.
At prediction time, because one decision tree has become many (usually dozens), the prediction complexity is obviously higher as well.
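A minimal sketch of the sampling-and-voting idea, assuming binary 0/1 labels and using scikit-learn's DecisionTreeClassifier purely as a convenient building block (it is not the only choice): each tree is trained on a bootstrap sample, and prediction has to run every tree, which is exactly where the extra prediction cost comes from.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # assumed available as the per-tree building block

def train_forest(X, y, n_trees=30, seed=0):
    """Each tree is trained on a bootstrap sample (random sampling with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                     # bootstrap sample of the data
        trees.append(DecisionTreeClassifier(max_depth=8).fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Prediction runs every tree and takes a majority vote: dozens of times the cost of one tree."""
    votes = np.stack([t.predict(X) for t in trees])          # shape: (n_trees, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)
```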
K-nearest neighbors (KNN)
KNN is a lazy learner: the training process does no processing at all; only at prediction time are the nearest n data points queried to make the inference.
Because it keeps all of the training data, its model file is inevitably very large, which limits this algorithm to use on servers.
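A minimal sketch of that lazy lookup (NumPy, brute force, hypothetical data): "training" just stores the data, and every prediction scans all n stored samples, which is exactly why the "model" is the whole training set.

```python
import numpy as np

class BruteForceKNN:
    """Lazy learner: fit() only stores the data; all the work happens at prediction time."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        self.X = X          # the "model file" is literally the whole training set
        self.y = y
        return self

    def predict_one(self, x):
        dists = np.sum((self.X - x) ** 2, axis=1)      # distance to every stored sample
        nearest = np.argsort(dists)[: self.k]
        values, counts = np.unique(self.y[nearest], return_counts=True)
        return values[np.argmax(counts)]               # majority vote among the k neighbours
```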
Summarizing in a single table:
The training input matrix is n × m, where n is the amount of data and m is the dimensionality (number of features). In general m is much smaller than n. A few rough order-of-magnitude coefficients are also used: t: order of 10; p: order of 100; k: order of 1,000; q: order of 10,000; f: a real number between 0 and 1.
The prediction accuracy column is estimated from past experience working with data; treat it as a rough impression only.
| Algorithm | Training time | Training memory | Model file size | Prediction time | Implementation difficulty | Usage difficulty | Prediction accuracy |
|---|---|---|---|---|---|---|---|
| Nonlinear SVM | Long, O(p*m*n*n) + O(p*k*n*n) | High, O(n*n) | O(m*n^f) | Slower | High | Medium | High |
| Linear SVM | Fairly long, O(m*n*n) + O(k*n*n) | Same as nonlinear SVM | O(m) | Fast | High | Medium | Medium |
| Logistic regression | Moderate, O(k*n*m) | Low, O(m) | O(m) | Fast | Low | Medium | Low |
| Naive Bayes | Low, O(n*m) | Low, O(m) | O(m*t) | Fast | Low | Low | Low |
| C4.5 decision tree | Low, O(t*n*m) | Low, O(m) | O(m*m) | Fast | Low | Medium | Medium |
| Random forest | Moderate, O(t*t*n*m) | Low, O(m) | O(m*m*t) | Faster | Low | Fairly low | High |
| BP neural network | Moderate, O(p*q*n*m) | Moderate, O(m*p) | O(m*p) | Medium | Medium | Medium | Medium |
| KNN | None | None | O(m*n) | Slow | Low | Low | Medium |
Deep learning
As explained in a previous article, deep learning is a combination of unsupervised and supervised learning algorithms, so its space-time complexity is not easy to pin down.
The number of model parameters in deep learning is generally independent of the data magnitude, but because deep learning is less selective about its input than shallow learning, the input vectors have high dimensionality, and models are stacked layer upon layer, so the number of parameters ends up very large.
Because several of the basic deep-learning algorithms (convolutional neural networks / CNNs, deep belief networks / DBNs, etc.) are naturally data-parallel, hardware acceleration is relatively easy; combined with stochastic gradient descent (SGD), which can sweep through massive data, training on huge data volumes does not become unbearably slow.
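The "not too slow on huge data" claim rests largely on mini-batch SGD: each update touches only a small batch, and the per-batch matrix multiply is what maps naturally onto GPUs. Here is a minimal sketch for a single linear layer (NumPy, squared loss, hypothetical data; real deep-learning frameworks do the same loop over many layers).

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=128, lr=0.01, epochs=5, seed=0):
    """Mini-batch SGD for a linear model with squared loss; each step sees only one batch."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)   # batch matrix multiply: the GPU-friendly part
            w -= lr * grad
    return w
```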
Neural-network algorithms have many model parameters, and the memory and time required for prediction are correspondingly large, so they are hard to use outside a server environment.
Writing about machine learning from a software engineering perspective, part 3: project analysis of the main supervised learning algorithms.