Using SQL for Data Mining with MADlib (Part 1): Installation

Last Update:2017-07-03 Source: Internet

Author: User

Tags postgresql svm pmml


I. Introduction to MADlib

MADlib is an open-source machine learning library developed in collaboration with the University of California, Berkeley. It provides accurate, data-parallel implementations of statistical and machine learning methods for analyzing structured and unstructured data. Its main purpose is to extend the analytical capabilities of the database, into which it can be loaded easily. In July 2015 MADlib became an Apache Software Foundation incubator project. The latest version at the time of writing, MADlib 1.11, can be used with the Greenplum, PostgreSQL, and HAWQ database systems.
　　
1. Design Ideas
　　
The main ideas driving the MADlib architecture are consistent with those behind Hadoop, chiefly in the following respects:

Operate on data locally, inside the database, without moving it unnecessarily between multiple runtime environments.

Take advantage of the capabilities of the database engine, but isolate machine learning logic from the implementation details of any particular database.

Leverage the parallelism and scalability of shared-nothing MPP databases such as Greenplum and HAWQ.

Keep maintenance activities open to the Apache community and to ongoing academic research.

To sum up MADlib in one sentence, as the title suggests: it lets you do data analysis, data mining, and machine learning with SQL.
　　
2. Features
　　
(1) Classification
　　
If the desired output is categorical, you can use a classification method to build a model and predict which category new data belongs to. The goal of classification is to label each input record with the correct category.

Classification example: suppose we have data describing demographics, together with individuals' loan applications and loan defaults. We can then build a model that estimates how likely a new set of demographic data is to correspond to a loan default. In this scenario, the output categories are "default" and "normal".
　　
(2) Regression

If the desired output is continuous, we use a regression method to build the model and predict the output value.

Regression example: given real data describing the attributes of properties, we can build a model that predicts the price of a house from its known features. This is a regression problem because the output is a continuous value rather than a category.
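
As a rough illustration of the regression idea, the sketch below fits a straight line to a few made-up (area, price) pairs with ordinary least squares and predicts the price of an unseen house. The data and the helper name fit_line are assumptions for illustration only; this is plain Python, not MADlib's own linear regression function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y cross-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: floor area (square metres) and price (thousands).
areas = [50, 70, 90, 110]
prices = [150, 190, 230, 270]

slope, intercept = fit_line(areas, prices)
# Predict a continuous value for a house that was not in the training data.
predicted_price = slope * 100 + intercept
print(slope, intercept, predicted_price)
```

The output is a number on a continuous scale, which is what distinguishes this from the "default" or "normal" labels of the classification example.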
　　
(3) Clustering
　　
Identify groupings in the data such that the items in a group are more similar to each other than to the items in other groups.

Clustering example: in customer segmentation analysis, the goal is to identify groups of customers with similar behavior, so that marketing campaigns can be designed for customers with different characteristics. If the customer segments are known in advance, this is a supervised classification task. When we let the data reveal its own groupings, it is a clustering task.
　　
(4) Topic modeling

Topic modeling is similar to clustering in that it also identifies groups of mutually similar data. Here, however, similarity usually means documents in the text domain that share the same topic.
　　
(5) Mining Association Rules
　　
Also called market basket analysis or frequent itemset mining. It determines which items occur together more often than would be expected by chance, which points to a potential relationship between those items.

Association rule mining example: in an online store, association rule mining can be used to determine which products tend to be sold together. These items are then fed into the customer recommendation engine to create promotional opportunities, as in the famous beer and diapers story.
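
To make "more often together than by chance" concrete, the sketch below computes the support, confidence, and lift of one rule over a handful of made-up baskets. The basket contents and the function name rule_metrics are illustrative assumptions, not MADlib's API.

```python
from fractions import Fraction

# Hypothetical shopping baskets, invented for illustration only.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def rule_metrics(baskets, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs.

    Assumes lhs and rhs each occur in at least one basket.
    """
    n = len(baskets)
    lhs, rhs = set(lhs), set(rhs)
    count_lhs = sum(1 for b in baskets if lhs <= b)
    count_rhs = sum(1 for b in baskets if rhs <= b)
    count_both = sum(1 for b in baskets if (lhs | rhs) <= b)
    support = Fraction(count_both, n)              # share of baskets with both sides
    confidence = Fraction(count_both, count_lhs)   # P(rhs | lhs)
    lift = confidence / Fraction(count_rhs, n)     # confidence relative to P(rhs)
    return support, confidence, lift


support, confidence, lift = rule_metrics(BASKETS, {"beer"}, {"diapers"})
print(support, confidence, lift)
```

A lift above 1 means the two sides of the rule appear together more often than they would if they were independent. For these baskets the rule beer -> diapers has a lift of 10/9.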
　　
(6) Descriptive statistics
　　
Descriptive statistics do not produce a model and are therefore not considered a machine learning method. They can, however, help analysts understand the underlying data, offer valuable insight into it, and may influence the choice of data model.

Example of descriptive statistics: calculating the distribution of values for each variable in a dataset helps determine which variables should be treated as categorical and which as continuous, and shows how their values are distributed.
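
A minimal sketch of this kind of profiling is shown below: it counts the distinct values in a column and uses a simple threshold to guess whether the column is categorical or continuous. The function name profile_column, the threshold of 10, and the sample columns are all assumptions made for illustration; a real analysis would look at the full distribution rather than a single cutoff.

```python
from collections import Counter


def profile_column(values, max_categories=10):
    """Summarize one column: distinct count, most frequent value, guessed kind."""
    counts = Counter(values)
    numeric = all(isinstance(v, (int, float)) for v in values)
    # Heuristic only: numeric columns with many distinct values look continuous.
    kind = "continuous" if numeric and len(counts) > max_categories else "categorical"
    return {
        "distinct": len(counts),
        "mode": counts.most_common(1)[0][0],
        "kind": kind,
    }


# Made-up columns for illustration.
loan_status = ["normal", "default", "normal", "normal", "default"]
income = [31.5, 48.0, 52.3, 27.9, 61.0, 44.2, 39.8, 70.1, 33.3, 58.6, 46.7, 49.9]

print(profile_column(loan_status))
print(profile_column(income))
```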
　　
(7) Model validation
　　
Using a model without knowing its accuracy can lead to bad results. It is therefore important to understand the model's weaknesses and to evaluate its accuracy on test data. This means separating the training data from the test data, analyzing the data regularly, verifying the validity of the statistical model, and checking that the model does not overfit the training data. N-fold cross-validation is also commonly used.
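
The splitting step behind N-fold cross-validation can be sketched as follows. The helper name k_fold_indices is hypothetical, and the folds are assigned deterministically to keep the example short; in practice the rows would usually be shuffled first.

```python
def k_fold_indices(n_rows, k):
    """Yield (train, test) lists of row indices for k-fold cross-validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    # Fold i takes every k-th row starting at offset i,
    # so fold sizes differ by at most one row.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training rows are all rows that are not in the held-out fold.
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, test


for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

Each row is held out exactly once, so every accuracy estimate comes from data the model did not see during training.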
　　
3. Functions

The functions provided by MADlib are shown in Figure 1.
　　
Figure 1
　　
- Data Types and Transformations
  - Arrays and Matrices
    - Array Operations
    - Matrix Operations
    - Matrix Factorization
      - Low-Rank Matrix Factorization
      - Singular Value Decomposition (SVD)
    - Norms and Distance Functions
    - Sparse Vectors
  - Dimensionality Reduction
    - Principal Component Analysis (PCA)
    - Principal Component Projection (PCP)
  - Encoding Categorical Variables
  - Stemming
- Model Evaluation
  - Cross Validation
- Statistics
  - Descriptive Statistics
    - Pearson's Correlation
    - Summary
  - Inferential Statistics
    - Hypothesis Tests
  - Probability Functions
- Supervised Learning
  - Conditional Random Field
  - Regression Models
    - Clustered Variance
    - Cox-Proportional Hazards Regression
    - Elastic Net Regularization
    - Generalized Linear Models
    - Linear Regression
    - Logistic Regression
    - Marginal Effects
    - Multinomial Regression
    - Ordinal Regression
    - Robust Variance
  - Support Vector Machines (SVM)
  - Tree Methods
    - Decision Tree
    - Random Forest
- Time Series Analysis
  - ARIMA (autoregressive integrated moving average)
- Unsupervised Learning
  - Association Rules
    - Apriori Algorithm
  - Clustering
    - k-Means Clustering
  - Topic Modelling
    - Latent Dirichlet Allocation (LDA)
- Utility Functions
  - Developer Database Functions
  - Linear Solvers
    - Dense Linear Systems
    - Sparse Linear Systems
  - Path Functions
  - PMML Export
  - Text Analysis
    - Term Frequency (TF)
　　
II. Installation

1. Determine the installation platform

The latest release of MADlib is 1.11. It can be installed on PostgreSQL, Greenplum, and HAWQ, and the installation process differs from one database to another. I installed it on HAWQ 2.1.1.0.

2. Download the MADlib binary installation package

Download address: https://network.pivotal.io/products/pivotal-hdb. HAWQ 2.1.1.0 comes with four MADlib installation files, as shown in Figure 2. In my testing, only the MADlib 1.10.0 file installed properly.
　　
Figure 2
　　
3. Installing Madlib
　　
Run the following commands on the HAWQ master host as the gpadmin user.

(1) Extract the archive

tar -zxvf madlib-ossv1.10.0_pv1.9.7_hawq2.1-rhel5-x86_64.tar.gz
　　
(2) Install the MADlib gppkg file

gppkg -i madlib-ossv1.10.0_pv1.9.7_hawq2.1-rhel5-x86_64.gppkg

This command creates the MADlib installation directories and files on all nodes of the HAWQ cluster (master and segments). The default directory is /usr/local/hawq_2_1_1_0/madlib.
　　
(3) Deploy MADlib in the specified database

$GPHOME/madlib/bin/madpack install -c /dm -s madlib -p hawq

This command creates the madlib schema in the dm database of HAWQ; the -p parameter specifies HAWQ as the platform. After the command finishes, you can view the database objects created in the madlib schema.
　　
dm=# set search_path=madlib;
SET
dm=# \dt
                    List of relations
 Schema |       Name       | Type  |  Owner  |   Storage
--------+------------------+-------+---------+-------------
 madlib | migrationhistory | table | gpadmin | append only
(1 row)

dm=# \ds
                       List of relations
 Schema |          Name           |   Type   |  Owner  | Storage
--------+-------------------------+----------+---------+---------
 madlib | migrationhistory_id_seq | sequence | gpadmin | heap
(1 row)

dm=# select type, count(*)
dm-#   from (select p.proname as name,
dm(#                case when p.proisagg then 'agg'
dm(#                     when p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype then 'trigger'
dm(#                     else 'normal'
dm(#                end as type
dm(#           from pg_catalog.pg_proc p, pg_catalog.pg_namespace n
dm(#          where n.oid = p.pronamespace and n.nspname = 'madlib') t
dm-#  group by rollup (type);
  type  | count
--------+-------
 agg    |   135
 normal |  1324
        |  1459
(3 rows)

As you can see, the MADlib deployment tool madpack first creates the madlib schema and then creates the database objects in that schema: one table, one sequence, 1324 normal functions, and 135 aggregate functions. All machine learning and data mining models, algorithms, operations, and functions are actually carried out by calling these functions.
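
The madpack call in this step takes the same three options as the install-check and uninstall calls used later in this article: -c for the connection, -s for the schema, and -p for the platform. The sketch below assembles that command line in one place so the three calls differ only in the subcommand. The helper name madpack_argv is hypothetical, and the GPHOME value passed in is an assumption inferred from the default directory mentioned above.

```python
import posixpath


def madpack_argv(gphome, command, database, schema="madlib", platform="hawq"):
    """Build the argument list for one madpack call (hypothetical helper)."""
    madpack = posixpath.join(gphome, "madlib", "bin", "madpack")
    # "-c /dm" is the connection string used in this article: it names only
    # the database and leaves user, host, and port at their defaults.
    return [madpack, command, "-c", "/" + database, "-s", schema, "-p", platform]


# Assumed GPHOME; adjust to the actual installation.
GPHOME = "/usr/local/hawq_2_1_1_0"

for command in ("install", "install-check", "uninstall"):
    print(" ".join(madpack_argv(GPHOME, command, "dm")))

# To execute a call instead of printing it, the list can be passed to
# subprocess.run(argv, check=True) on the HAWQ master host.
```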
　　
(4) Verify the installation

$GPHOME/madlib/bin/madpack install-check -c /dm -s madlib -p hawq

This command verifies that all modules work correctly by running 77 test cases across 28 modules. The command output is as follows:
　　
$ $GPHOME/madlib/bin/madpack install-check -c /dm -s madlib -p hawq
madpack.py : INFO : Detected HAWQ version 2.1.
TEST CASE RESULT|Module: array_ops|array_ops.sql_in|PASS|Time: 1851 milliseconds
TEST CASE RESULT|Module: bayes|gaussian_naive_bayes.sql_in|PASS|Time: 24222 milliseconds
TEST CASE RESULT|Module: bayes|bayes.sql_in|PASS|Time: 70634 milliseconds
TEST CASE RESULT|Module: crf|crf_train_small.sql_in|PASS|Time: 27186 milliseconds
TEST CASE RESULT|Module: crf|crf_train_large.sql_in|PASS|Time: 32602 milliseconds
TEST CASE RESULT|Module: crf|crf_test_small.sql_in|PASS|Time: 22410 milliseconds
TEST CASE RESULT|Module: crf|crf_test_large.sql_in|PASS|Time: 21711 milliseconds
TEST CASE RESULT|Module: elastic_net|elastic_net_install_check.sql_in|PASS|Time: 931563 milliseconds
TEST CASE RESULT|Module: graph|sssp.sql_in|PASS|Time: 18174 milliseconds
TEST CASE RESULT|Module: linalg|svd.sql_in|PASS|Time: 72105 milliseconds
TEST CASE RESULT|Module: linalg|matrix_ops.sql_in|PASS|Time: 58312 milliseconds
TEST CASE RESULT|Module: linalg|linalg.sql_in|PASS|Time: 2836 milliseconds
TEST CASE RESULT|Module: pmml|table_to_pmml.sql_in|PASS|Time: 34508 milliseconds
TEST CASE RESULT|Module: pmml|pmml_rf.sql_in|PASS|Time: 35993 milliseconds
TEST CASE RESULT|Module: pmml|pmml_ordinal.sql_in|PASS|Time: 15540 milliseconds
TEST CASE RESULT|Module: pmml|pmml_multinom.sql_in|PASS|Time: 12546 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_poisson.sql_in|PASS|Time: 7321 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_normal.sql_in|PASS|Time: 8597 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_ig.sql_in|PASS|Time: 8861 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_gamma.sql_in|PASS|Time: 26212 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_binomial.sql_in|PASS|Time: 12977 milliseconds
TEST CASE RESULT|Module: pmml|pmml_dt.sql_in|PASS|Time: 9401 milliseconds
TEST CASE RESULT|Module: prob|prob.sql_in|PASS|Time: 1917 milliseconds
TEST CASE RESULT|Module: sketch|support.sql_in|PASS|Time: 143 milliseconds
TEST CASE RESULT|Module: sketch|mfv.sql_in|PASS|Time: 720 milliseconds
TEST CASE RESULT|Module: sketch|fm.sql_in|PASS|Time: 7301 milliseconds
TEST CASE RESULT|Module: sketch|cm.sql_in|PASS|Time: 19777 milliseconds
TEST CASE RESULT|Module: svm|svm.sql_in|PASS|Time: 205677 milliseconds
TEST CASE RESULT|Module: tsa|arima_train.sql_in|PASS|Time: 75680 milliseconds
TEST CASE RESULT|Module: tsa|arima.sql_in|PASS|Time: 76236 milliseconds
TEST CASE RESULT|Module: conjugate_gradient|conj_grad.sql_in|PASS|Time: 6757 milliseconds
TEST CASE RESULT|Module: knn|knn.sql_in|PASS|Time: 9835 milliseconds
TEST CASE RESULT|Module: lda|lda.sql_in|PASS|Time: 20510 milliseconds
TEST CASE RESULT|Module: stats|wsr_test.sql_in|PASS|Time: 2766 milliseconds
TEST CASE RESULT|Module: stats|t_test.sql_in|PASS|Time: 3686 milliseconds
TEST CASE RESULT|Module: stats|robust_and_clustered_variance_coxph.sql_in|PASS|Time: 17499 milliseconds
TEST CASE RESULT|Module: stats|pred_metrics.sql_in|PASS|Time: 14032 milliseconds
TEST CASE RESULT|Module: stats|mw_test.sql_in|PASS|Time: 1852 milliseconds
TEST CASE RESULT|Module: stats|ks_test.sql_in|PASS|Time: 2465 milliseconds
TEST CASE RESULT|Module: stats|f_test.sql_in|PASS|Time: 2358 milliseconds
TEST CASE RESULT|Module: stats|cox_prop_hazards.sql_in|PASS|Time: 39932 milliseconds
TEST CASE RESULT|Module: stats|correlation.sql_in|PASS|Time: 10520 milliseconds
TEST CASE RESULT|Module: stats|chi2_test.sql_in|PASS|Time: 3581 milliseconds
TEST CASE RESULT|Module: stats|anova_test.sql_in|PASS|Time: 1801 milliseconds
TEST CASE RESULT|Module: svec_util|svec_test.sql_in|PASS|Time: 14043 milliseconds
TEST CASE RESULT|Module: svec_util|gp_sfv_sort_order.sql_in|PASS|Time: 3399 milliseconds
TEST CASE RESULT|Module: utilities|text_utilities.sql_in|PASS|Time: 6579 milliseconds
TEST CASE RESULT|Module: utilities|sessionize.sql_in|PASS|Time: 3901 milliseconds
TEST CASE RESULT|Module: utilities|pivot.sql_in|PASS|Time: 15634 milliseconds
TEST CASE RESULT|Module: utilities|path.sql_in|PASS|Time: 9321 milliseconds
TEST CASE RESULT|Module: utilities|encode_categorical.sql_in|PASS|Time: 7665 milliseconds
TEST CASE RESULT|Module: utilities|drop_madlib_temp.sql_in|PASS|Time: 153 milliseconds
TEST CASE RESULT|Module: assoc_rules|assoc_rules.sql_in|PASS|Time: 31975 milliseconds
TEST CASE RESULT|Module: convex|lmf.sql_in|PASS|Time: 66775 milliseconds
TEST CASE RESULT|Module: glm|poisson.sql_in|PASS|Time: 19117 milliseconds
TEST CASE RESULT|Module: glm|ordinal.sql_in|PASS|Time: 23446 milliseconds
TEST CASE RESULT|Module: glm|multinom.sql_in|PASS|Time: 18780 milliseconds
TEST CASE RESULT|Module: glm|inverse_gaussian.sql_in|PASS|Time: 20931 milliseconds
TEST CASE RESULT|Module: glm|gaussian.sql_in|PASS|Time: 23795 milliseconds
TEST CASE RESULT|Module: glm|gamma.sql_in|PASS|Time: 43365 milliseconds
TEST CASE RESULT|Module: glm|binomial.sql_in|PASS|Time: 39437 milliseconds
TEST CASE RESULT|Module: linear_systems|sparse_linear_sytems.sql_in|PASS|Time: 5405 milliseconds
TEST CASE RESULT|Module: linear_systems|dense_linear_sytems.sql_in|PASS|Time: 3331 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|random_forest.sql_in|PASS|Time: 294832 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|decision_tree.sql_in|PASS|Time: 91311 milliseconds
TEST CASE RESULT|Module: regress|robust.sql_in|PASS|Time: 55325 milliseconds
TEST CASE RESULT|Module: regress|multilogistic.sql_in|PASS|Time: 25330 milliseconds
TEST CASE RESULT|Module: regress|marginal.sql_in|PASS|Time: 73750 milliseconds
TEST CASE RESULT|Module: regress|logistic.sql_in|PASS|Time: 76501 milliseconds
TEST CASE RESULT|Module: regress|linear.sql_in|PASS|Time: 7517 milliseconds
TEST CASE RESULT|Module: regress|clustered.sql_in|PASS|Time: 40661 milliseconds
TEST CASE RESULT|Module: sample|sample.sql_in|PASS|Time: 890 milliseconds
TEST CASE RESULT|Module: summary|summary.sql_in|PASS|Time: 14644 milliseconds
TEST CASE RESULT|Module: kmeans|kmeans.sql_in|PASS|Time: 52173 milliseconds
TEST CASE RESULT|Module: pca|pca_project.sql_in|PASS|Time: 229016 milliseconds
TEST CASE RESULT|Module: pca|pca.sql_in|PASS|Time: 523230 milliseconds
TEST CASE RESULT|Module: validation|cross_validation.sql_in|PASS|Time: 33685 milliseconds
$

As you can see, all cases passed, indicating that the MADlib installation was successful.
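
With this many result lines it is easy to miss a failure by eye. The sketch below tallies the results and finds the slowest case, assuming result lines shaped like those in the output above (five fields separated by "|"). The function name summarize and the three sample lines are illustrative; the pattern is deliberately case-insensitive and tolerant of extra spaces, since the exact spacing of the log may vary.

```python
import re
from collections import Counter

# One install-check result line: prefix | module | file | result | time.
LINE_RE = re.compile(
    r"TEST CASE RESULT\|\s*Module:\s*(?P<module>[^|]+?)\s*\|\s*(?P<file>[^|]+?)\s*\|"
    r"\s*(?P<result>\w+)\s*\|\s*Time:\s*(?P<ms>\d+) milliseconds",
    re.IGNORECASE,
)


def summarize(log_text):
    """Return (Counter of results, (case name, milliseconds) of the slowest case)."""
    results = Counter()
    slowest = None
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # skip prompts, INFO lines, and blank lines
        results[m.group("result").upper()] += 1
        ms = int(m.group("ms"))
        if slowest is None or ms > slowest[1]:
            slowest = (m.group("module") + "/" + m.group("file"), ms)
    return results, slowest


# Three lines copied from the output above, used as sample input.
SAMPLE = """\
madpack.py : INFO : Detected HAWQ version 2.1.
TEST CASE RESULT|Module: array_ops|array_ops.sql_in|PASS|Time: 1851 milliseconds
TEST CASE RESULT|Module: svm|svm.sql_in|PASS|Time: 205677 milliseconds
TEST CASE RESULT|Module: sample|sample.sql_in|PASS|Time: 890 milliseconds
"""

results, slowest = summarize(SAMPLE)
print(dict(results), slowest)
```

Any key other than PASS in the returned counter is a signal to inspect the corresponding test case before relying on the installation.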
　　
III. Uninstallation

The uninstallation process is basically the reverse of the installation.

1. Drop the madlib schema

Method 1: use the madpack deployment tool.

$GPHOME/madlib/bin/madpack uninstall -c /dm -s madlib -p hawq

Method 2: drop the schema manually with a SQL command.

drop schema madlib cascade;
　　
2. Drop other leftover database objects

(1) Drop schemas

If an error occurs partway through the tests, the database may still contain test schemas. Their names are prefixed with madlib_installcheck_, and they can only be removed manually with SQL commands, for example:

drop schema madlib_installcheck_kmeans cascade;
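
When several test schemas are left behind, the drop statements can be generated rather than typed one by one. The sketch below is a hypothetical helper, drop_statements, that takes a list of schema names (for example, the nspname values read from pg_catalog.pg_namespace) and returns one statement for each name carrying the madlib_installcheck_ prefix. It only builds the SQL text; running it against the database is left to the reader.

```python
def drop_statements(schema_names, prefix="madlib_installcheck_"):
    """Return DROP SCHEMA statements for leftover install-check schemas."""
    statements = []
    for name in schema_names:
        if not name.startswith(prefix):
            continue  # never touch schemas outside the test prefix
        # Quote the identifier defensively and double any embedded quotes.
        quoted = '"' + name.replace('"', '""') + '"'
        statements.append("DROP SCHEMA " + quoted + " CASCADE;")
    return statements


# Made-up list of schema names for illustration.
names = ["madlib", "madlib_installcheck_kmeans", "madlib_installcheck_pca", "public"]
for statement in drop_statements(names):
    print(statement)
```

Filtering on the prefix keeps the madlib schema itself, and any unrelated schema, out of the generated statements.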
　　
(2) Drop users

If a leftover test user exists, drop it.

drop user if exists madlib_1100_installcheck;
　　
3. Remove the MADlib RPM package

(1) Query the package name

gppkg -q --all

The output is as follows:

$ gppkg -q --all
20170630:16:19:53:076493 gppkg:hdp3:gpadmin-[INFO]:-Starting gppkg with args: -q --all
madlib-ossv1.10.0_pv1.9.7_hawq2.1

(2) Remove the RPM package

gppkg -r madlib-ossv1.10.0_pv1.9.7_hawq2.1
