Data Mining with SQL Using MADlib (I): Installation

Tags: postgresql, svm, pmml

I. Introduction to MADlib
  
MADlib is an open-source machine learning library developed in collaboration with the University of California, Berkeley. It provides data-parallel implementations of mathematical, statistical, and machine learning methods for analyzing structured and unstructured data. Its main purpose is to extend the analytical capabilities of a database: it loads easily into the database and adds analysis functions to it. In July 2015 MADlib became an Apache Software Foundation incubator project. At the time of writing, the latest version is MADlib 1.11, which can be used with the Greenplum, PostgreSQL, and HAWQ database systems.
  
1. Design Ideas
  
The main ideas driving the MADlib architecture are consistent with those behind Hadoop, mainly in the following respects:

Operate on data locally, inside the database, without moving it unnecessarily between multiple runtime environments.

Take advantage of the capabilities of the database engine, but isolate the machine learning logic from the implementation details of any particular database.

Leverage the parallelism and scalability of shared-nothing MPP technology, such as the Greenplum database and HAWQ.

Keep development and maintenance open to the Apache community and to ongoing academic research.

To summarize MADlib in one sentence, as the title puts it: you can use SQL for data analysis, data mining, and machine learning.
  
2. Features
  
(1) Classification
  
If the desired output is categorical, you can use a classification method to build a model that predicts which category new data belongs to. The goal of classification is to label each input record with the correct category.

Example of classification: suppose we have data describing demographics, together with each individual's loan applications and loan defaults. We can then build a model that estimates how likely a new set of demographic data is to default on a loan. In this scenario, the output categories are "default" and "normal".
  
(2) Regression

If the desired output is continuous, we use a regression method to build the model and predict the output value.

Example of regression: given real data describing the attributes of real estate, we can build a model that predicts a house's price from its known features. This is a regression problem because the output is a continuous value rather than a category.
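
To make the regression idea concrete, here is a minimal sketch in plain Python (not MADlib itself) that fits a one-variable least-squares line. The helper name fit_line and the house data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: floor area (square meters) and price (thousands).
# The prices are exactly 3 * area, so the fit is easy to check by hand.
areas = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(areas, prices)
predicted = slope * 90 + intercept  # predicted price for a 90 square meter house
print(slope, intercept, predicted)  # 3.0 0.0 270.0
```

The output is a continuous number (a price), which is what separates this from the classification case above.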
  
(3) Clustering
  
Identify groupings in the data such that the items within a group are more similar to one another than to items in other groups.

Example of clustering: in customer segmentation analysis, the goal is to identify groups of customers with similar behavior, so that marketing campaigns can be designed for customers with different characteristics in order to meet market objectives. If the customer segments are known in advance, this is a supervised classification task. When we let the data identify the groups itself, it is a clustering task.
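
As a rough illustration of clustering, the sketch below runs a tiny one-dimensional k-means in plain Python. It is not MADlib's implementation; the function name kmeans_1d, the fixed starting centroids, and the spending figures are assumptions chosen so the result is easy to follow.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means for 1-D data with caller-supplied starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spending of six customers: two obvious groups.
spending = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(spending, centroids=[1, 12])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

No labels were supplied; the two groups come out of the data alone, which is the difference from supervised classification.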
  
(4) Topic modeling

Topic modeling is similar to clustering in that it also identifies groups of data items that are similar to one another. Here, however, similarity usually means documents in the text domain that share the same topic.
  
(5) Association rule mining

Also called market basket analysis or frequent itemset mining. It determines which items occur together more often than random chance would predict, which points to a potential relationship between them.

Example of association rule mining: in an online store, association rule mining can be used to determine which products tend to be sold together. These products are then fed into the customer recommendation engine to create promotional opportunities, as in the famous beer and diapers story.
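
The usual measures behind "occurs together more often than chance" are support, confidence, and lift. The following plain-Python sketch computes them for a single rule; the function name rule_metrics and the baskets are hypothetical, and this is not how MADlib itself is invoked.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes the antecedent and the consequent each occur in at least one transaction.
    """
    n = len(transactions)
    a = set(antecedent)
    c = set(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_both = sum(1 for t in transactions if (a | c) <= t)
    support = count_both / n              # share of baskets containing both sides
    confidence = count_both / count_a     # share of antecedent baskets that also hold the consequent
    lift = confidence / (count_c / n)     # above 1 means together more often than chance
    return support, confidence, lift


# Hypothetical shopping baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

support, confidence, lift = rule_metrics(baskets, {"diapers"}, {"beer"})
print(round(support, 3), round(confidence, 3), round(lift, 3))  # 0.4 0.667 1.111
```

A lift slightly above 1 says that, in this toy data, baskets with diapers contain beer a little more often than baskets in general.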
  
(6) Descriptive statistics
  
Descriptive statistics do not produce a model and are therefore not considered a machine learning method. They can, however, give analysts information for understanding the underlying data, offer valuable insight into it, and may influence the choice of data model.

Example of descriptive statistics: calculating the distribution of each variable in a dataset helps to determine which variables should be treated as categorical, which as continuous, and how their values are distributed.
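
A small plain-Python sketch of this kind of per-variable summary follows. The describe helper and the two sample columns are made up for illustration; the number of distinct values is one simple hint as to whether a column behaves like a categorical or a continuous variable.

```python
import statistics


def describe(values):
    """Return a few summary figures for one numeric column."""
    return {
        "count": len(values),
        "distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# Hypothetical columns: ages look continuous, region codes look categorical.
ages = [23, 35, 35, 41, 58]
region_codes = [1, 2, 1, 3, 2, 1]

age_summary = describe(ages)
region_summary = describe(region_codes)
print(age_summary)
print(region_summary)
```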
  
(7) Model validation
  
Putting a model to use without knowing its accuracy can lead to bad results. It is therefore important to understand a model's weaknesses and to evaluate its accuracy on test data. This requires separating the training data from the test data, analyzing the data regularly, verifying the validity of the statistical model, and checking that the model does not overfit the training data. N-fold cross-validation is also often used.
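
The splitting behind N-fold cross-validation can be sketched in a few lines of plain Python. The function name k_fold_indices is hypothetical, and rows are taken in their existing order; in practice they would normally be shuffled first.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k folds over range(n_rows)."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row lands in the test set exactly once, so every row is used both for training and for evaluation, but never in the same fold.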
  
3. Functions

The functions provided by MADlib are shown in Figure 1.
  
Figure 1
  
• Data Types and Transformations
  • Arrays and Matrices
    o Array Operations
    o Matrix Operations
    o Matrix Factorization
      o Low-Rank Matrix Factorization
      o Singular Value Decomposition (SVD)
    o Norms and Distance Functions
    o Sparse Vectors
  • Dimensionality Reduction
    o Principal Component Analysis (PCA)
    o Principal Component Projection (PCP)
  • Encoding Categorical Variables
  • Stemming
• Model Evaluation
  • Cross Validation
• Statistics
  • Descriptive Statistics
    o Pearson's Correlation
    o Summary
  • Inferential Statistics
    o Hypothesis Tests
  • Probability Functions
• Supervised Learning
  • Conditional Random Field
  • Regression Models
    o Clustered Variance
    o Cox-Proportional Hazards Regression
    o Elastic Net Regularization
    o Generalized Linear Models
    o Linear Regression
    o Logistic Regression
    o Marginal Effects
    o Multinomial Regression
    o Ordinal Regression
    o Robust Variance
  • Support Vector Machines (SVM)
  • Tree Methods
    o Decision Tree
    o Random Forest
• Time Series Analysis
  • ARIMA (autoregressive integrated moving average)
• Unsupervised Learning
  • Association Rules
    o Apriori Algorithm
  • Clustering
    o k-Means Clustering
  • Topic Modelling
    o Latent Dirichlet Allocation (LDA)
• Utility Functions
  • Developer Database Functions
  • Linear Solvers
    o Dense Linear Systems
    o Sparse Linear Systems
  • Path Functions
  • PMML Export
  • Text Analysis
    o Term Frequency (TF)
  
II. Installation

1. Determine the installation platform

The latest release of MADlib is 1.11. It can be installed on PostgreSQL, Greenplum, and HAWQ, and the installation procedure differs from one database to another. I installed it on HAWQ 2.1.1.0.

2. Download the MADlib binary installation package

The download address is https://network.pivotal.io/products/pivotal-hdb. HAWQ 2.1.1.0 provides four installation files, as shown in Figure 2. In my testing, only the MADlib 1.10.0 file could be installed properly.
  
Figure 2
  
3. Install MADlib

The following commands must be run on the HAWQ master host as the gpadmin user.

(1) Extract the archive

tar -zxvf madlib-ossv1.10.0_pv1.9.7_hawq2.1-rhel5-x86_64.tar.gz
  
(2) Install the MADlib gppkg file

gppkg -i madlib-ossv1.10.0_pv1.9.7_hawq2.1-rhel5-x86_64.gppkg

This command creates the MADlib installation directories and files on all nodes of the HAWQ cluster (master and segments). The default directory is /usr/local/hawq_2_1_1_0/madlib.
  
(3) Deploy MADlib in the specified database

$GPHOME/madlib/bin/madpack install -c /dm -s madlib -p hawq

This command creates the madlib schema in the dm database of HAWQ; the -p parameter specifies HAWQ as the platform. After the command finishes, you can view the database objects created in the madlib schema.
  
dm=# set search_path=madlib;
SET
dm=# \dt
                  List of relations
 Schema |       Name       | Type  |  Owner  |   Storage
--------+------------------+-------+---------+-------------
 madlib | migrationhistory | table | gpadmin | append only
(1 row)

dm=# \ds
                     List of relations
 Schema |          Name           |   Type   |  Owner  | Storage
--------+-------------------------+----------+---------+---------
 madlib | migrationhistory_id_seq | sequence | gpadmin | heap
(1 row)

dm=# select type, count(*)
dm-#   from (select p.proname as name,
dm(#                case when p.proisagg then 'agg'
dm(#                     when p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype then 'trigger'
dm(#                     else 'normal'
dm(#                end as type
dm(#           from pg_catalog.pg_proc p, pg_catalog.pg_namespace n
dm(#          where n.oid = p.pronamespace and n.nspname = 'madlib') t
dm-#  group by rollup (type);
  type  | count
--------+-------
 agg    |   135
 normal |  1324
        |  1459
(3 rows)
  
As you can see, madpack, the MADlib deployment tool, first creates the madlib database schema and then creates the database objects in that schema: one table, one sequence, 1324 normal functions, and 135 aggregate functions. All machine learning and data mining models, algorithms, operations, and functions are actually carried out by calling these functions.
  
(4) Verifying the installation
  
$GPHOME/madlib/bin/madpack install-check -c /dm -s madlib -p hawq

This command verifies that all modules work correctly by running 77 test cases; 28 distinct modules appear in the output. The command output is as follows:
  
[gpadmin@hdp3 madlib]$ $GPHOME/madlib/bin/madpack install-check -c /dm -s madlib -p hawq
madpack.py:INFO:Detected HAWQ version 2.1.
TEST CASE RESULT|Module: array_ops|array_ops.sql_in|PASS|Time: 1851 milliseconds
TEST CASE RESULT|Module: bayes|gaussian_naive_bayes.sql_in|PASS|Time: 24222 milliseconds
TEST CASE RESULT|Module: bayes|bayes.sql_in|PASS|Time: 70634 milliseconds
TEST CASE RESULT|Module: crf|crf_train_small.sql_in|PASS|Time: 27186 milliseconds
TEST CASE RESULT|Module: crf|crf_train_large.sql_in|PASS|Time: 32602 milliseconds
TEST CASE RESULT|Module: crf|crf_test_small.sql_in|PASS|Time: 22410 milliseconds
TEST CASE RESULT|Module: crf|crf_test_large.sql_in|PASS|Time: 21711 milliseconds
TEST CASE RESULT|Module: elastic_net|elastic_net_install_check.sql_in|PASS|Time: 931563 milliseconds
TEST CASE RESULT|Module: graph|sssp.sql_in|PASS|Time: 18174 milliseconds
TEST CASE RESULT|Module: linalg|svd.sql_in|PASS|Time: 72105 milliseconds
TEST CASE RESULT|Module: linalg|matrix_ops.sql_in|PASS|Time: 58312 milliseconds
TEST CASE RESULT|Module: linalg|linalg.sql_in|PASS|Time: 2836 milliseconds
TEST CASE RESULT|Module: pmml|table_to_pmml.sql_in|PASS|Time: 34508 milliseconds
TEST CASE RESULT|Module: pmml|pmml_rf.sql_in|PASS|Time: 35993 milliseconds
TEST CASE RESULT|Module: pmml|pmml_ordinal.sql_in|PASS|Time: 15540 milliseconds
TEST CASE RESULT|Module: pmml|pmml_multinom.sql_in|PASS|Time: 12546 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_poisson.sql_in|PASS|Time: 7321 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_normal.sql_in|PASS|Time: 8597 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_ig.sql_in|PASS|Time: 8861 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_gamma.sql_in|PASS|Time: 26212 milliseconds
TEST CASE RESULT|Module: pmml|pmml_glm_binomial.sql_in|PASS|Time: 12977 milliseconds
TEST CASE RESULT|Module: pmml|pmml_dt.sql_in|PASS|Time: 9401 milliseconds
TEST CASE RESULT|Module: prob|prob.sql_in|PASS|Time: 1917 milliseconds
TEST CASE RESULT|Module: sketch|support.sql_in|PASS|Time: 143 milliseconds
TEST CASE RESULT|Module: sketch|mfv.sql_in|PASS|Time: 720 milliseconds
TEST CASE RESULT|Module: sketch|fm.sql_in|PASS|Time: 7301 milliseconds
TEST CASE RESULT|Module: sketch|cm.sql_in|PASS|Time: 19777 milliseconds
TEST CASE RESULT|Module: svm|svm.sql_in|PASS|Time: 205677 milliseconds
TEST CASE RESULT|Module: tsa|arima_train.sql_in|PASS|Time: 75680 milliseconds
TEST CASE RESULT|Module: tsa|arima.sql_in|PASS|Time: 76236 milliseconds
TEST CASE RESULT|Module: conjugate_gradient|conj_grad.sql_in|PASS|Time: 6757 milliseconds
TEST CASE RESULT|Module: knn|knn.sql_in|PASS|Time: 9835 milliseconds
TEST CASE RESULT|Module: lda|lda.sql_in|PASS|Time: 20510 milliseconds
TEST CASE RESULT|Module: stats|wsr_test.sql_in|PASS|Time: 2766 milliseconds
TEST CASE RESULT|Module: stats|t_test.sql_in|PASS|Time: 3686 milliseconds
TEST CASE RESULT|Module: stats|robust_and_clustered_variance_coxph.sql_in|PASS|Time: 17499 milliseconds
TEST CASE RESULT|Module: stats|pred_metrics.sql_in|PASS|Time: 14032 milliseconds
TEST CASE RESULT|Module: stats|mw_test.sql_in|PASS|Time: 1852 milliseconds
TEST CASE RESULT|Module: stats|ks_test.sql_in|PASS|Time: 2465 milliseconds
TEST CASE RESULT|Module: stats|f_test.sql_in|PASS|Time: 2358 milliseconds
TEST CASE RESULT|Module: stats|cox_prop_hazards.sql_in|PASS|Time: 39932 milliseconds
TEST CASE RESULT|Module: stats|correlation.sql_in|PASS|Time: 10520 milliseconds
TEST CASE RESULT|Module: stats|chi2_test.sql_in|PASS|Time: 3581 milliseconds
TEST CASE RESULT|Module: stats|anova_test.sql_in|PASS|Time: 1801 milliseconds
TEST CASE RESULT|Module: svec_util|svec_test.sql_in|PASS|Time: 14043 milliseconds
TEST CASE RESULT|Module: svec_util|gp_sfv_sort_order.sql_in|PASS|Time: 3399 milliseconds
TEST CASE RESULT|Module: utilities|text_utilities.sql_in|PASS|Time: 6579 milliseconds
TEST CASE RESULT|Module: utilities|sessionize.sql_in|PASS|Time: 3901 milliseconds
TEST CASE RESULT|Module: utilities|pivot.sql_in|PASS|Time: 15634 milliseconds
TEST CASE RESULT|Module: utilities|path.sql_in|PASS|Time: 9321 milliseconds
TEST CASE RESULT|Module: utilities|encode_categorical.sql_in|PASS|Time: 7665 milliseconds
TEST CASE RESULT|Module: utilities|drop_madlib_temp.sql_in|PASS|Time: 153 milliseconds
TEST CASE RESULT|Module: assoc_rules|assoc_rules.sql_in|PASS|Time: 31975 milliseconds
TEST CASE RESULT|Module: convex|lmf.sql_in|PASS|Time: 66775 milliseconds
TEST CASE RESULT|Module: glm|poisson.sql_in|PASS|Time: 19117 milliseconds
TEST CASE RESULT|Module: glm|ordinal.sql_in|PASS|Time: 23446 milliseconds
TEST CASE RESULT|Module: glm|multinom.sql_in|PASS|Time: 18780 milliseconds
TEST CASE RESULT|Module: glm|inverse_gaussian.sql_in|PASS|Time: 20931 milliseconds
TEST CASE RESULT|Module: glm|gaussian.sql_in|PASS|Time: 23795 milliseconds
TEST CASE RESULT|Module: glm|gamma.sql_in|PASS|Time: 43365 milliseconds
TEST CASE RESULT|Module: glm|binomial.sql_in|PASS|Time: 39437 milliseconds
TEST CASE RESULT|Module: linear_systems|sparse_linear_sytems.sql_in|PASS|Time: 5405 milliseconds
TEST CASE RESULT|Module: linear_systems|dense_linear_sytems.sql_in|PASS|Time: 3331 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|random_forest.sql_in|PASS|Time: 294832 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|decision_tree.sql_in|PASS|Time: 91311 milliseconds
TEST CASE RESULT|Module: regress|robust.sql_in|PASS|Time: 55325 milliseconds
TEST CASE RESULT|Module: regress|multilogistic.sql_in|PASS|Time: 25330 milliseconds
TEST CASE RESULT|Module: regress|marginal.sql_in|PASS|Time: 73750 milliseconds
TEST CASE RESULT|Module: regress|logistic.sql_in|PASS|Time: 76501 milliseconds
TEST CASE RESULT|Module: regress|linear.sql_in|PASS|Time: 7517 milliseconds
TEST CASE RESULT|Module: regress|clustered.sql_in|PASS|Time: 40661 milliseconds
TEST CASE RESULT|Module: sample|sample.sql_in|PASS|Time: 890 milliseconds
TEST CASE RESULT|Module: summary|summary.sql_in|PASS|Time: 14644 milliseconds
TEST CASE RESULT|Module: kmeans|kmeans.sql_in|PASS|Time: 52173 milliseconds
TEST CASE RESULT|Module: pca|pca_project.sql_in|PASS|Time: 229016 milliseconds
TEST CASE RESULT|Module: pca|pca.sql_in|PASS|Time: 523230 milliseconds
TEST CASE RESULT|Module: validation|cross_validation.sql_in|PASS|Time: 33685 milliseconds
[gpadmin@hdp3 madlib]$
  
As you can see, all test cases passed, indicating that MADlib was installed successfully.
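
With this many lines it is easy to overlook a failure, so a small script can tally the results instead of reading them by eye. The sketch below is plain Python, not part of madpack. The function name summarize_install_check is hypothetical, and the sample log is shortened with one status changed to FAIL purely for illustration; the FAIL status string itself is an assumption. Matching is case-insensitive, so differences in capitalization or spacing around the separators do not matter.

```python
def summarize_install_check(log_text):
    """Count install-check results by status and list the cases that did not pass."""
    counts = {}
    failed = []
    for line in log_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # A result line has five |-separated fields and starts with "TEST CASE RESULT".
        if len(fields) != 5 or fields[0].upper() != "TEST CASE RESULT":
            continue
        module = fields[1].split(":", 1)[-1].strip()
        sql_file = fields[2]
        status = fields[3].upper()
        counts[status] = counts.get(status, 0) + 1
        if status != "PASS":
            failed.append((module, sql_file))
    return counts, failed


# Shortened sample; the kmeans status is altered to FAIL for illustration only.
sample_log = """madpack.py:INFO:Detected HAWQ version 2.1.
TEST CASE RESULT|Module: array_ops|array_ops.sql_in|PASS|Time: 1851 milliseconds
TEST CASE RESULT|Module: bayes|bayes.sql_in|PASS|Time: 70634 milliseconds
TEST CASE RESULT|Module: kmeans|kmeans.sql_in|FAIL|Time: 52173 milliseconds
"""

counts, failed = summarize_install_check(sample_log)
print(counts)  # {'PASS': 2, 'FAIL': 1}
print(failed)  # [('kmeans', 'kmeans.sql_in')]
```

To use it on a real run, the output of the install-check command could be redirected to a file and the file contents passed to the function.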
  
III. Uninstallation

The uninstallation process is essentially the reverse of the installation.

1. Drop the madlib schema

Method 1: use the madpack deployment tool.

$GPHOME/madlib/bin/madpack uninstall -c /dm -s madlib -p hawq

Method 2: drop the schema manually with an SQL command.

drop schema madlib cascade;
  
2. Drop other leftover database objects

(1) Drop schemas

If an error occurs partway through the installation check, test schemas may be left behind in the database. Their names are prefixed with madlib_installcheck_, and they can only be removed manually with SQL commands, for example:

drop schema madlib_installcheck_kmeans cascade;

(2) Drop users

If a leftover test user exists, drop it.

drop user if exists madlib_1100_installcheck;
  
3. Remove the MADlib RPM package

(1) Query the package name

gppkg -q --all

The output is as follows:

[gpadmin@hdp3 madlib]$ gppkg -q --all
20170630:16:19:53:076493 gppkg:hdp3:gpadmin-[INFO]:-Starting gppkg with args: -q --all
madlib-ossv1.10.0_pv1.9.7_hawq2.1

(2) Remove the RPM package

gppkg -r madlib-ossv1.10.0_pv1.9.7_hawq2.1
