Using in-database analytics to implement SGD-based machine learning algorithms on large-scale data

Last Update:2017-02-27 Source: Internet

Author: User

Tags svm

As application data grows, statistical analysis and machine learning on large datasets are becoming a major challenge. Many languages and libraries already exist for these tasks: the R language, designed for data analysis; the Python machine learning library scikit-learn; Mahout, which is built on MapReduce and scales out in distributed environments; and MLlib, the machine learning library of the distributed in-memory computing framework Spark. Spark also provides SparkR, an interface for the R language. This article, however, discusses a different design approach: implementing statistical analysis and machine learning algorithms inside the database, known as in-database analytics. The MADlib library is representative of this approach.

A machine learning library built into the database (through database UDFs) has many advantages. Running a machine learning algorithm only requires writing the corresponding SQL statement, and the database itself serves as the data source for the analysis, which is convenient and greatly lowers the barrier to applying machine learning. The drawbacks are just as clear. Because the algorithms must be written against the UDF programming interface that the database provides, their implementation is subject to many constraints, and many optimizations are hard to achieve. Machine learning on large datasets, especially with algorithms that require iterative computation, usually places high demands on performance and convergence speed; without these, the approach is hard to use in practice. This article focuses on how to implement SGD (stochastic gradient descent) efficiently within the in-database analytics framework. Many machine learning algorithms, such as linear SVM classification, k-means, and logistic regression, can be implemented with SGD; only the objective function differs from one algorithm to the next. A high-performance SGD framework in the database can therefore run a large class of machine learning algorithms.

Take MADlib as an example. To train a model on a dataset with MADlib's SVM algorithm, you can execute the following SQL statement:

SELECT madlib.lsvm_classification('my_schema.my_train_data',
                                   'myexpc',
                                   false
                                 );

madlib.lsvm_classification is MADlib's linear SVM training function, which trains the classifier with the SGD algorithm. my_schema.my_train_data is the training data table, which must have the following structure:

TABLE/VIEW my_schema.my_train_data
(
        id    INT,       -- point ID
        ind   FLOAT8[],  -- data point
        label FLOAT8     -- label of the data point, that is, the classification result
);

The model produced by the call is saved to the table named by the second parameter, 'myexpc'. The third parameter (true/false) specifies whether the algorithm should run in parallel.

Once the model table myexpc has been generated, the following SQL statement makes a prediction:

SELECT madlib.lsvm_predict('myexpc',
                            '{10,-2,4,20,10}'
                          );

The madlib.lsvm_classification function accepts further parameters. For example, you can set a kernel function, in which case the computation is no longer done with SGD, and you can set the iteration precision threshold, among others. See the MADlib documentation for details.

As this example shows, the MADlib library is very convenient to use: once you have created the data table in the form the library function requires and imported the data, you can train a model and make predictions by calling SQL functions in the MADlib library. MADlib currently supports PostgreSQL, Greenplum, and Pivotal HAWQ.

In many cases, however, MADlib performs very poorly. Take training a model with the SGD algorithm as an example: on large datasets it often takes tens or even hundreds of hours to complete. To improve computation speed and model convergence speed, two areas are well worth optimizing: the concurrency model used for model updates, and the order in which the data is read.

First, consider how the basic SGD framework is implemented on top of the database's UDFA (user-defined aggregate function) interface. Taking linear SVM as an example, the following objective function f(x) can be defined:

Here <xi, yi> are the training data points and w is the model being trained. A gradient descent algorithm such as the following iteratively searches for a w that makes the value of this objective function as small as possible:

In this update, the factor that scales the gradient is the step size.
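For reference, one common way to write a linear SVM objective and its gradient descent update is given below. This is a sketch under assumptions rather than the article's own formula: the hinge loss, the L2 regularization term, and the symbols \lambda (regularization weight), \alpha (step size), and n (number of training points) are not taken from the text, and the objective is written as f(w) because it is minimized over the model w.

```latex
% Assumed objective: hinge loss over all n training points plus L2 regularization
f(w) = \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \langle w, x_i \rangle\bigr)
       + \frac{\lambda}{2}\,\lVert w \rVert_2^2

% (Sub)gradient descent: each step k uses all n training points
w^{(k+1)} = w^{(k)} - \alpha \, \nabla f\bigl(w^{(k)}\bigr),
\qquad
\nabla f(w) = \lambda w - \sum_{i \,:\, y_i \langle w, x_i \rangle < 1} y_i \, x_i
```

Only the points that violate the margin (y_i <w, x_i> < 1) contribute to the sum in the gradient, but finding them still requires a pass over every training point at every step.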

Computed directly, however, every iteration has to traverse all of the data, which is impractical in real scenarios. This is where SGD (stochastic gradient descent) comes in: at each iteration step, the sum over all data points in the original f(x) is approximated by the term for a single randomly chosen data point:
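A minimal Python sketch of this single-point update follows. It rests on assumptions not stated in the text: labels are -1 or +1, the per-point loss is the hinge loss max(0, 1 - y<w, x>) plus an L2 penalty weighted by a per-point coefficient, and the names dot, sgd_step, sgd, step, and reg are hypothetical.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists of floats."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_step(w, x, y, step, reg):
    """Apply one stochastic update using a single point (x, y), y in {-1, +1}.

    Assumed per-point loss: max(0, 1 - y*<w, x>) + (reg / 2) * ||w||^2.
    """
    margin = y * dot(w, x)  # computed with the weights before the update
    # Gradient of the L2 penalty shrinks every weight a little.
    w = [wi * (1.0 - step * reg) for wi in w]
    # The hinge term contributes only when the point violates the margin.
    if margin < 1.0:
        w = [wi + step * y * xi for wi, xi in zip(w, x)]
    return w


def sgd(points, dim, step=0.1, reg=0.01, epochs=20, seed=0):
    """Run several passes over the data, visiting points in shuffled order."""
    rng = random.Random(seed)
    w = [0.0] * dim
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)  # a random order makes each step a random sample
        for i in order:
            x, y = points[i]
            w = sgd_step(w, x, y, step, reg)
    return w


if __name__ == "__main__":
    toy = [([2.0, 1.0], 1), ([1.5, 2.0], 1),
           ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
    model = sgd(toy, dim=2)
    print([1 if dot(model, x) > 0 else -1 for x, _ in toy])
```

Each call to sgd_step reads one data point, so the cost of a step does not depend on the size of the dataset, whereas a full gradient step must read every point.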

When the original data is in random order, each iteration simply takes the next record in sequence and applies the update above. This style of iterative update matches the UDFA extension interface that the database provides. The computation can be abstracted into three functions:

Initialize(state)
Transition(state, row_data)
Terminate
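The three functions can be sketched in Python as below, with a plain loop standing in for the database scanning rows and invoking the aggregate. This is an illustration only, not MADlib's implementation. The signatures are adapted for the example (initialize takes the feature dimension and returns a fresh state, terminate takes the state), and the hinge-loss update, the state layout, and the names step, reg, and run_aggregate are assumptions.

```python
def initialize(dim):
    """Create the aggregate state: model weights plus a processed-row counter."""
    return {"w": [0.0] * dim, "rows": 0}


def transition(state, row_data, step=0.1, reg=0.01):
    """Fold one row into the state by applying a single SGD update.

    row_data is a (features, label) pair with label in {-1, +1}.
    Assumed per-row loss: hinge loss plus an L2 penalty weighted by reg.
    """
    x, y = row_data
    w = state["w"]
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    w = [wi * (1.0 - step * reg) for wi in w]  # L2 shrinkage
    if margin < 1.0:                            # row violates the margin
        w = [wi + step * y * xi for wi, xi in zip(w, x)]
    state["w"] = w
    state["rows"] += 1
    return state


def terminate(state):
    """Extract the final result (the trained weights) from the state."""
    return state["w"]


def run_aggregate(rows, dim):
    """Stand-in for the database: scan rows once, calling transition per row."""
    state = initialize(dim)
    for row in rows:
        state = transition(state, row)
    return terminate(state)


if __name__ == "__main__":
    rows = [([2.0, 1.0], 1), ([1.5, 2.0], 1),
            ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
    print(run_aggregate(rows, dim=2))
```

One call to run_aggregate corresponds to one pass over the table. Because the state carries the whole model from row to row, the result depends on the order in which rows are scanned.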

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
