Writing Machine Learning from the Perspective of Software Engineering (4): The Engineering Implementation of the C4.5 Decision Tree


Engineering implementation of C4.5 decision tree

This article begins a series on the engineering implementation of machine learning algorithms. Because it is both common and simple, the C4.5 decision tree was chosen as the first algorithm.

Engineering Framework

Since this is the first algorithm implementation, it is worth introducing the overall engineering framework first.
For optimal performance, the framework is designed for C/C++. It can nevertheless be implemented in other languages, with the modules streamlined as needed.

Project positioning:
1. A "no-brainer" machine learning library: users need not understand any algorithm details, nor the meaning of any algorithm parameter.
2. A separable library: the model files it outputs can conveniently be parsed and used by other projects.
3. A high-performance library: one that can, in principle, be pushed to the limits of performance.

The overall design is as follows:

Data layer

This layer solves the problem of how data is defined, how it is produced, how it is used, and how it is destroyed.
Define two types of data:
Matrix: static data, stored as a matrix; floating point by default, with globally configurable precision.
Chain: dynamic data, also known as time-series data.

Ideally, the data layer presents a uniform interface to both the operations layer and the algorithm layer: whether the matrix lives in CPU memory or GPU memory, and whether it is a common dense matrix or a sparse one, should be transparent to the outside. However, some algorithms must exploit the specific properties of a matrix (for example, the QR-based eigenvector algorithm is expected to be accelerated on sparse matrices), so in addition to the unified interface, the layer also supports type identification, letting the operations layer adapt to local conditions.

Separating discrete from continuous data is another difficult point of the abstraction. By default all data is treated as continuous, but meta information is introduced into the matrix and chain types; this metadata records which columns are discrete and which discrete values they may take.

Operations Layer

On the one hand, machine learning algorithms are mostly matrix operations, so packaging them into one layer makes reuse easy; on the other hand, subsequent performance optimization calls for a dedicated computation layer. As the focus of optimization work, this layer needs a companion optimization module.

It provides the following features:
1. Basic matrix operations: matrix multiplication, matrix inversion, eigenvalue computation, and so on.
2. Complex matrix-based operations: gradient descent, convolution, kernel function mapping, and so on.
3. Data conversion: expanding time series into matrix data (AR models, phase-space reconstruction, etc.), converting between sparse and dense matrices, random sampling of data, and so on.

Optimization

Optimization consists mainly of two techniques, applied as the practical application requires:
1. GPGPU: OpenCL/CUDA
2. SIMD: NEON/SSE

There is a famous saying: premature optimization is the root of all evil. But after stepping into a series of code-optimization pits, I deeply feel that never considering optimization is the root of all evil: optimizing a poorly written program costs far more than writing it efficiently in the first place. Paying attention to code efficiency does not mean pushing performance to the extreme from the outset; it means designing a framework that benefits later performance work, so that meeting performance requirements never demands a large-scale rewrite.

Algorithm Layer

This layer implements the basic machine learning algorithms, such as SVM, the C4.5 decision tree, naive Bayes, logistic regression, and so on.
We would like all computation to go through the matrix operations layer, so that it can be optimized uniformly. But the ideal is beautiful and reality is cruel: some operations are not easy to express through the matrix operations layer, so this layer is also allowed to interact directly with the data layer and manipulate the actual data.
The interface definitions for the algorithm layer follow:

    /* Predictive model */
    class ipredictor {
    public:
        virtual matrix* vPredict(matrix* X) const = 0;
        /* The following two interfaces are used for classification */
        // output the probability of each class
        virtual matrix* vPredictProb(matrix* X) const = 0;
        // get the value of each class
        virtual matrix* vGetValues() const = 0;
        virtual node* vDump() const = 0;
    };

    /* Supervised learning algorithm */
    class ilearner {
    public:
        virtual ipredictor* vLearn(matrix* X, matrix* Y) const = 0;
    };

    /* Unsupervised learning algorithm */
    class ifreelearner {
    public:
        virtual ipredictor* vLearn(matrix* X) const = 0;
    };

    /* Time series prediction model */
    class ichainpredictor {
    public:
        virtual chain* vPredict(chain* X) const = 0;
        virtual node* vDump() const = 0;
    };

    /* Time series training algorithm */
    class ichainlearner {
    public:
        virtual ichainpredictor* vLearn(matrix* X) const = 0;
    };

Note that for classification models we require probability prediction, which is very important in some application scenarios, such as the series of image-processing algorithms built on face recognition. In fact, every classification algorithm can compute such a probability.

For broader engineering use, we need to be able to serialize predictors (that is, the models produced by training). The recommended approach is to dump the model into a syntax tree (the node type above is its tree node) and then serialize that tree with a JSON or XML library of your choice. Deserialization parses the syntax tree first, then restores the model from it.

Algorithm Abstraction Layer

The purpose of this layer is to implement composite algorithms: random forest, AdaBoost, and GBDT built on decision trees; multi-class SVM built on binary SVM; scoring of training algorithms (cross-validation); and so on. This layer may call the algorithm layer's interfaces and the operations layer's data-conversion interfaces (such as sampling).

Implementation of the C4.5 Decision Tree Algorithm

For the principles behind the decision tree algorithm, refer to this article:
http://www.cnblogs.com/bourneli/archive/2013/03/15/2961568.html

Algorithm description and Parameters

Following that article, the algorithm steps are:

1. At the start, treat all records as one node.
2. Traverse every possible split of every variable and find the best split point.
3. Split into two nodes, N1 and N2.
4. Repeat steps 2-3 on N1 and N2 until every node is sufficiently "pure".

The detailed implementation is as follows.
Training algorithm:

1. Accept the independent-variable matrix X and the dependent-variable matrix Y as input.
2. Compute the purity of Y. If any of the following conditions holds, terminate and go to step 5 to generate a result node:
   (1) the maximum tree depth limit has been reached;
   (2) the purity reaches targetPrune;
   (3) the number of samples in Y is smaller than minNumber.
3. From the statistics of X, enumerate all candidate split points; traverse each one, compute its purity gain, and select the split point with the largest gain:
   (1) split the corresponding Y into Y1 and Y2 at that point;
   (2) compute the purity of Y1 and of Y2 by a given formula;
   (3) compute the purity gain.
4. Generate a split node from the best split point, divide Y into Y1 and Y2 and X into X1 and X2, and run from step 1 on each half.
5. Generate a result node:
   (1) list all class values occurring in Y, recorded as the value matrix;
   (2) compute each class's proportion in Y, recorded as the probability matrix;
   (3) take the class with the highest probability as the node's default prediction.

The following parameters are involved:
1. the maximum tree depth limit, maxDepth;
2. the purity requirement for early termination, targetPrune;
3. the minimum amount of data each active node must cover, minNumber.
These three parameters are mapped onto the [0,1] interval by a fixed rule, so that users can set them without understanding their individual meanings.

Split Node

A split node stores:
1. the split information: the column index and the split value;
2. two pointers/references to the left and right child nodes.

Result Node

A result node includes:
1. the probability matrix;
2. the value matrix;
3. the default value.

At prediction time, we locate the result node by following the split information layer by layer, then read off that node's probability matrix and value matrix.

Class and Module Placement

Whether ID3, C4.5, or C5.0, the predictive model of a decision tree is the same tree structure; C4.5 is just one algorithm for constructing that tree. In naming, therefore, the predictive model carries no C45 prefix.

The algorithm involves the following classes:

