Using in-database analytics to implement SGD-based machine learning on large-scale data

Source: Internet
Author: User
Tags svm

As application data grows, statistical analysis and machine learning on large datasets are becoming a major challenge. Many languages and libraries already exist for statistical analysis and machine learning: the R language, designed for data analysis; the Python machine learning library scikit-learn; Mahout, which is built on MapReduce and scales out in distributed environments; and MLlib, the machine learning library of the distributed in-memory computing framework Spark. Spark has also introduced SparkR, an interface for the R language. This article, however, discusses a different design approach: implementing statistical analysis and machine learning algorithms inside the database, known as in-database analytics. The MADlib library is the representative of this approach.

A machine learning library built into the database (through database UDFs) has many advantages. Running a machine learning algorithm only requires writing the corresponding SQL statement, and the database itself serves as the data source for the analysis, which is very convenient and greatly lowers the barrier to applying machine learning. The drawbacks are equally clear. Because the algorithm has to be written against the UDF programming interface that the database provides, the implementation is subject to many constraints and many optimizations are hard to achieve. Machine learning on large datasets, especially when iterative computation is needed, places high demands on algorithm performance and convergence speed; without them it is impractical. This article focuses on how to implement SGD (stochastic gradient descent) efficiently within the in-database analytics framework. Many machine learning algorithms, such as linear SVM classifiers, k-means, and logistic regression, can be implemented with SGD; only the objective function has to be designed differently for each algorithm. A high-performance SGD framework in the database can therefore run a large class of machine learning algorithms.

Take MADlib as an example. To train on a dataset with MADlib's SVM algorithm, you can execute the following SQL statement:

SELECT madlib.lsvm_classification('my_schema.my_train_data',
                                  'myexpc',
                                  false
                                 );

madlib.lsvm_classification is the SVM function implemented in MADlib; it trains a linear SVM classifier with the SGD algorithm. my_schema.my_train_data is the training data table, which must have the following structure:

TABLE/VIEW my_schema.my_train_data
(
        id    INT,       -- point ID
        ind   FLOAT8[],  -- data point
        label FLOAT8     -- label of the data point, that is, the classification result
)

After execution, the generated model is saved to the table named by the second parameter, 'myexpc'. The third parameter (true/false) specifies whether the algorithm should be executed in parallel.

Once the model table myexpc has been generated, the following SQL statement makes a prediction:

SELECT madlib.lsvm_predict('myexpc',
                           '{10,-2,4,20,10}'
                          );

The madlib.lsvm_classification function accepts further parameters. For example, you can set a kernel function, in which case the computation is no longer done with SGD, and you can set the iteration precision threshold. See the MADlib documentation for details.

As this example shows, the MADlib library is very convenient to use: once you create the data table in the form the library function requires and import the data, you can train models and make predictions by calling the SQL functions in the MADlib library. MADlib currently supports PostgreSQL, Greenplum, and Pivotal HAWQ.

In many cases, however, MADlib's performance is very poor. Training a model with the SGD-based algorithms, for example, often takes tens or even hundreds of hours when the dataset is large. To improve computation speed and model convergence speed, two areas are well worth optimizing: the concurrency of model updates, and the order in which the data is read.

First, consider how the basic SGD algorithm framework is implemented on the database's user-defined aggregate function (UDFA). Taking a linear SVM as an example, one defines an objective function f that sums a loss term over all training data points.
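One common way to write such an objective for a linear SVM is the regularized hinge loss. This is an assumed form given for illustration, not necessarily the one the article intended; N (the number of data points) and the regularization weight \lambda are symbols introduced here.

```latex
f(w) = \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i\, w^{\top} x_i\bigr) + \frac{\lambda}{2}\,\lVert w \rVert^{2}
```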

Here <x_i, y_i> are the training data points and w is the model being trained. Gradient descent finds a w that makes the value of the objective function as small as possible by iterating: each step moves w in the direction opposite to the gradient of f, by an amount controlled by the step size.
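Written as a formula, a standard gradient descent step could look like the following. The notation is assumed here for illustration: \alpha is the step size and w^{(k)} is the model after k iterations. For a non-smooth loss such as the hinge loss, \nabla f is read as a subgradient.

```latex
w^{(k+1)} = w^{(k)} - \alpha \, \nabla f\bigl(w^{(k)}\bigr)
```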

Computed directly, however, every iteration has to traverse all the data, which is hard to apply in practice. This is the motivation for SGD (stochastic gradient descent): at each iteration, the sum over all data points in the original f is approximated by the term for a single randomly chosen data point.
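As a sketch of that substitution, assume the objective decomposes as f(w) = \sum_{i=1}^{N} f_i(w), where f_i depends only on the data point <x_i, y_i> (with any regularization term split across the f_i). An SGD step then uses the gradient of a single term. The symbols \alpha (step size) and i_k (index of the data point picked at step k) are notation assumed here.

```latex
w^{(k+1)} = w^{(k)} - \alpha \, \nabla f_{i_k}\bigl(w^{(k)}\bigr)
```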

If the original data is in random order, each iteration only needs to take the next record in sequence and apply the update rule described above. This style of iterative update fits the UDFA extension interface that the database provides. The computation can be abstracted into three functions:

Initialize(state)
Transition(state, row_data)
Terminate
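The three-function pattern can be sketched outside the database as a fold over rows. The Python below is a minimal illustration under stated assumptions, not MADlib's implementation: the hinge loss with a small L2 penalty, the `step` and `reg` values, the state layout, and the toy rows are all hypothetical choices, and `initialize` takes the model dimension rather than an existing state for convenience.

```python
from functools import reduce


def initialize(dim):
    """Create the aggregate state: a zero model vector and a row counter."""
    return {"w": [0.0] * dim, "n": 0}


def transition(state, row, step=0.1, reg=0.01):
    """Fold one row into the state with a single SGD step.

    Assumed per-row objective: max(0, 1 - y * <w, x>) + (reg / 2) * ||w||^2.
    """
    x, y = row
    w = state["w"]
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1.0:
        # The hinge term is active, so its subgradient -y * x contributes.
        grad = [reg * wi - y * xi for wi, xi in zip(w, x)]
    else:
        # Only the regularization term contributes.
        grad = [reg * wi for wi in w]
    state["w"] = [wi - step * gi for wi, gi in zip(w, grad)]
    state["n"] += 1
    return state


def terminate(state):
    """Extract the final model from the aggregate state."""
    return state["w"]


def predict(w, x):
    """Classify a point by the sign of <w, x>."""
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1.0


# Toy, linearly separable training rows: (data point, label).
rows = [
    ([2.0, 1.0], 1.0),
    ([1.5, 2.0], 1.0),
    ([-1.0, -2.0], -1.0),
    ([-2.0, -1.5], -1.0),
]

# Each pass over the rows plays the role of one aggregate call in the database.
state = initialize(2)
for _ in range(20):
    state = reduce(transition, rows, state)
model = terminate(state)
```

Keeping all mutable information in `state` is what lets a database drive the loop: the engine only has to call the transition function once per row and the terminate function at the end.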
