I. Introduction to MADlib
MADlib is an open-source machine learning library developed in collaboration with the University of California, Berkeley. It provides accurate, data-parallel implementations of statistical and machine learning methods for analyzing structured and unstructured data, and its main purpose is to extend the analytical capabilities of the database itself. In July 2015 MADlib became an Apache Software Foundation incubator project. The latest version is MADlib 1.11, which can be used with the Greenplum, PostgreSQL, and HAWQ database systems.
1. Design Ideas
The main ideas driving the MADlib architecture are similar to those behind Hadoop, chiefly:
Operate on local data inside the database, avoiding unnecessary data movement between multiple runtime environments.
Take advantage of the capabilities of the database engine, while isolating the machine learning logic from the implementation details of any particular database.
Leverage the parallelism and scalability of shared-nothing MPP technology, such as the Greenplum database and HAWQ.
Keep maintenance activities open to the Apache community and to ongoing academic research.
To summarize MADlib in one sentence, as the title suggests: use SQL to do data analysis, data mining, and machine learning.
2. Features
(1) Classification
If the desired output is essentially categorical, you can use a classification method to build a model and predict which category new data will belong to. The goal of classification is to label each input record with the correct category.
Classification example: suppose we have demographic data, along with data on individual loan applications and loan defaults. We can then build a model that predicts the likelihood of default for a new set of demographic data. In this scenario the output categories are "default" and "normal".
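As a sketch of what this looks like in MADlib's SQL interface, the example below trains a logistic regression classifier. The table and column names (loan_applicants, new_applicants, defaulted, income, debt_ratio) are hypothetical:

```sql
-- Train a logistic regression classifier on a hypothetical table
-- loan_applicants(defaulted boolean, income numeric, debt_ratio numeric).
-- Model coefficients are written to the output table loan_model.
SELECT madlib.logregr_train(
    'loan_applicants',                 -- source table
    'loan_model',                      -- output (model) table
    'defaulted',                       -- dependent variable
    'ARRAY[1, income, debt_ratio]'     -- independent variables (1 = intercept)
);

-- Score new applicants with the trained coefficients.
SELECT a.id,
       madlib.logregr_predict(m.coef, ARRAY[1, a.income, a.debt_ratio]) AS will_default
FROM new_applicants a, loan_model m;
```

This requires a database with MADlib deployed, so it is an illustration rather than a standalone script.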
(2) Regression
If the desired output is continuous, we use a regression method to build the model and predict the output value.
Regression example: given real data describing real estate attributes, we can build a model that predicts the price based on the known features of a house. This scenario is a regression problem because the output is a continuous value rather than a category.
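A corresponding MADlib sketch using linear regression; houses, new_listings, and their columns are hypothetical names:

```sql
-- Train a linear regression model on a hypothetical table
-- houses(price numeric, area numeric, num_rooms integer).
SELECT madlib.linregr_train(
    'houses',                        -- source table
    'houses_model',                  -- output (model) table
    'price',                         -- dependent variable
    'ARRAY[1, area, num_rooms]'      -- independent variables (1 = intercept)
);

-- Predict prices for new listings from the fitted coefficients.
SELECT h.id,
       madlib.linregr_predict(m.coef, ARRAY[1, h.area, h.num_rooms]) AS predicted_price
FROM new_listings h, houses_model m;
```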
(3) Clustering
Identify groupings in the data such that the items within a group are more similar to each other than to items in other groups.
Clustering example: in customer segmentation analysis, the goal is to identify groups of customers with similar behavioral features, so that marketing campaigns can be designed around the characteristics of each group. If the customer segments are known in advance, this is a supervised classification task; when we let the data itself identify the groups, it is a clustering task.
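In MADlib this kind of segmentation can be done with the k-means module. The sketch below assumes a hypothetical table customer_features whose points column holds a numeric feature array per customer:

```sql
-- Cluster customers into 3 groups using k-means with k-means++ seeding.
-- customer_features(points double precision[]) is a hypothetical table.
SELECT * FROM madlib.kmeanspp(
    'customer_features',   -- source table
    'points',              -- column (or expression) holding the feature vector
    3                      -- number of clusters k
);
```

The result contains the cluster centroids and fit statistics; madlib.closest_column() can then be used to assign each row to its nearest centroid.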
(4) Topic modeling
Topic modeling is similar to clustering in that it also identifies groups of mutually similar data items. Here, however, similarity usually means documents on the same topic in a text corpus.
(5) Mining Association Rules
Also called market basket analysis or frequent itemset mining, this determines which items occur together more often than random chance would predict, indicating a potential relationship between them.
Association rule mining example: in an online store, association rule mining can determine which goods tend to be sold together. These items can then be fed into a customer recommendation engine to offer promotional opportunities, as in the famous beer-and-diapers story.
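MADlib implements this with the Apriori algorithm. A sketch, where orders is a hypothetical transactions table with one row per (order_id, product) pair:

```sql
-- Find association rules with at least 25% support and 50% confidence.
-- The output table assoc_rules is created in the given schema (here: public).
SELECT * FROM madlib.assoc_rules(
    0.25,         -- minimum support
    0.5,          -- minimum confidence
    'order_id',   -- transaction id column
    'product',    -- item column
    'orders',     -- input table (hypothetical)
    'public',     -- schema for the output table
    TRUE          -- verbose output
);
```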
(6) Descriptive statistics
Descriptive statistics do not produce a model and are therefore not considered a machine learning method. However, they help analysts understand the underlying data, provide valuable insight into it, and may influence the choice of data model.
Descriptive statistics example: calculating the distribution of values for each variable in a dataset helps determine which variables should be treated as categorical and which as continuous, as well as how the values are distributed.
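MADlib's summary() function produces exactly this kind of per-column overview; mytable is a hypothetical table name:

```sql
-- Compute descriptive statistics (distinct values, min/max, mean,
-- most frequent values, ...) for every column of mytable,
-- storing the result in mytable_summary.
SELECT * FROM madlib.summary('mytable', 'mytable_summary');

-- Inspect a few of the computed statistics per column.
SELECT target_column, distinct_values, mean, min, max
FROM mytable_summary;
```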
(7) Model validation
Using a model without knowing its accuracy can lead to bad results. For this reason it is important to understand a model's limitations and evaluate its accuracy against test data. Training data must be kept separate from test data; analysts should regularly validate their statistical models and check that a model does not overfit the training data. N-fold cross-validation is also commonly used.
3. Functions
The functions of MADlib are shown in Figure 1.
Figure 1
- Data Types and Transformations
- Arrays and Matrices
  - Array Operations
  - Matrix Operations
  - Matrix Factorization
  - Low-Rank Matrix Factorization
  - Singular Value Decomposition (SVD)
  - Norms and Distance Functions
  - Sparse Vectors
- Dimensionality Reduction
  - Principal Component Analysis (PCA)
  - Principal Component Projection (PCP)
- Encoding Categorical Variables
- Stemming
- Model Evaluation
- Cross Validation
- Statistics
- Descriptive Statistics
  - Pearson's Correlation
  - Summary
- Inferential Statistics
  - Hypothesis Tests
- Probability Functions
- Supervised Learning
- Conditional Random Field
- Regression Models
  - Clustered Variance
  - Cox-Proportional Hazards Regression
  - Elastic Net Regularization
  - Generalized Linear Models
  - Linear Regression
  - Logistic Regression
  - Marginal Effects
  - Multinomial Regression
  - Ordinal Regression
  - Robust Variance
- Support Vector Machines (SVM)
- Tree Methods
  - Decision Tree
  - Random Forest
- Time Series Analysis
  - ARIMA
- Unsupervised Learning
- Association Rules
  - Apriori Algorithm
- Clustering
  - K-Means Clustering
- Topic Modelling
  - Latent Dirichlet Allocation (LDA)
- Utility Functions
- Developer Database Functions
- Linear Solvers
  - Dense Linear Systems
  - Sparse Linear Systems
- Path Functions
- PMML Export
- Text Analysis
  - Term Frequency (TF)
II. Installation
1. Determine the installation platform
The latest release of MADlib is 1.11, which can be installed on PostgreSQL, Greenplum, and HAWQ; the installation process varies slightly between databases. I installed it on HAWQ 2.1.1.0.
2. Download the MADlib binary installation package
Download address: https://network.pivotal.io/products/pivotal-hdb. For HAWQ 2.1.1.0 four installation files are provided, as shown in Figure 2. After testing, only the MADlib 1.10.0 file could be installed properly.
Figure 2
3. Installing Madlib
The following commands must be executed as the gpadmin user on the HAWQ master host.
(1) Unpack the tarball
tar -zxvf madlib-ossv1.10.0_pv1.9.7_hawq2.1-rhel5-x86_64.tar.gz
(2) Install the MADlib gppkg file
gppkg -i madlib-ossv1.10.0_pv1.9.7_hawq2.1-rhel5-x86_64.gppkg
This command creates the MADlib installation directories and files on all nodes of the HAWQ cluster (master and segments); the default directory is /usr/local/hawq_2_1_1_0/madlib.
(3) Deploy MADlib in the target database
$GPHOME/madlib/bin/madpack install -c /dm -s madlib -p hawq
This command creates the madlib schema in HAWQ's dm database; the -p parameter specifies the platform as hawq. After the command completes, you can view the database objects created in the madlib schema.
dm=# set search_path=madlib;
SET
dm=# \dt
              List of relations
 Schema |       Name       | Type  |  Owner  |   Storage
--------+------------------+-------+---------+-------------
 madlib | migrationhistory | table | gpadmin | append only
(1 row)

dm=# \ds
                      List of relations
 Schema |          Name           |   Type   |  Owner  | Storage
--------+-------------------------+----------+---------+---------
 madlib | migrationhistory_id_seq | sequence | gpadmin | heap
(1 row)

dm=# select type, count(*)
dm-# from (select p.proname as name,
dm(#              case when p.proisagg then 'agg'
dm(#                   when p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype then 'trigger'
dm(#                   else 'normal'
dm(#              end as type
dm(#         from pg_catalog.pg_proc p, pg_catalog.pg_namespace n
dm(#        where n.oid = p.pronamespace and n.nspname = 'madlib') t
dm-# group by rollup (type);
  type  | count
--------+-------
 agg    |   135
 normal |  1324
        |  1459
(3 rows)
As you can see, the madpack deployment utility first creates the database schema madlib, then creates the database objects in that schema: one table, one sequence, 1324 normal functions, and 135 aggregate functions. All machine learning and data mining models, algorithms, and operations are actually performed by invoking these functions.
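A quick sanity check after deployment is to call one of the newly created functions, for example:

```sql
-- Confirm the deployment by calling a MADlib function;
-- version() reports the installed MADlib version string.
SELECT madlib.version();
```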
(4) Verifying the installation
$GPHOME/madlib/bin/madpack install-check -c /dm -s madlib -p hawq
This command verifies that all modules work correctly by executing 77 test cases across 29 modules. The command output is as follows:
[gpadmin@hdp3 madlib]$ $GPHOME/madlib/bin/madpack install-check -c /dm -s madlib -p hawq
madpack.py:INFO:Detected HAWQ version 2.1.
TEST Case result| module:array_ops|array_ops.sql_in| pass| time:1851 milliseconds
TEST Case result| module:bayes|gaussian_naive_bayes.sql_in| pass| time:24222 milliseconds
TEST Case result| module:bayes|bayes.sql_in| pass| time:70634 milliseconds
TEST Case result| module:crf|crf_train_small.sql_in| pass| time:27186 milliseconds
TEST Case result| module:crf|crf_train_large.sql_in| pass| time:32602 milliseconds
TEST Case result| module:crf|crf_test_small.sql_in| pass| time:22410 milliseconds
TEST Case result| module:crf|crf_test_large.sql_in| pass| time:21711 milliseconds
TEST Case result| module:elastic_net|elastic_net_install_check.sql_in| pass| time:931563 milliseconds
TEST Case result| module:graph|sssp.sql_in| pass| time:18174 milliseconds
TEST Case result| module:linalg|svd.sql_in| pass| time:72105 milliseconds
TEST Case result| module:linalg|matrix_ops.sql_in| pass| time:58312 milliseconds
TEST Case result| module:linalg|linalg.sql_in| pass| time:2836 milliseconds
TEST Case result| module:pmml|table_to_pmml.sql_in| pass| time:34508 milliseconds
TEST Case result| module:pmml|pmml_rf.sql_in| pass| time:35993 milliseconds
TEST Case result| module:pmml|pmml_ordinal.sql_in| pass| time:15540 milliseconds
TEST Case result| module:pmml|pmml_multinom.sql_in| pass| time:12546 milliseconds
TEST Case result| module:pmml|pmml_glm_poisson.sql_in| pass| time:7321 milliseconds
TEST Case result| module:pmml|pmml_glm_normal.sql_in| pass| time:8597 milliseconds
TEST Case result| module:pmml|pmml_glm_ig.sql_in| pass| time:8861 milliseconds
TEST Case result| module:pmml|pmml_glm_gamma.sql_in| pass| time:26212 milliseconds
TEST Case result| module:pmml|pmml_glm_binomial.sql_in| pass| time:12977 milliseconds
TEST Case result| module:pmml|pmml_dt.sql_in| pass| time:9401 milliseconds
TEST Case result| module:prob|prob.sql_in| pass| time:1917 milliseconds
TEST Case result| module:sketch|support.sql_in| pass| time:143 milliseconds
TEST Case result| module:sketch|mfv.sql_in| pass| time:720 milliseconds
TEST Case result| module:sketch|fm.sql_in| pass| time:7301 milliseconds
TEST Case result| module:sketch|cm.sql_in| pass| time:19777 milliseconds
TEST Case result| module:svm|svm.sql_in| pass| time:205677 milliseconds
TEST Case result| module:tsa|arima_train.sql_in| pass| time:75680 milliseconds
TEST Case result| module:tsa|arima.sql_in| pass| time:76236 milliseconds
TEST Case result| module:conjugate_gradient|conj_grad.sql_in| pass| time:6757 milliseconds
TEST Case result| module:knn|knn.sql_in| pass| time:9835 milliseconds
TEST Case result| module:lda|lda.sql_in| pass| time:20510 milliseconds
TEST Case result| module:stats|wsr_test.sql_in| pass| time:2766 milliseconds
TEST Case result| module:stats|t_test.sql_in| pass| time:3686 milliseconds
TEST Case result| module:stats|robust_and_clustered_variance_coxph.sql_in| pass| time:17499 milliseconds
TEST Case result| module:stats|pred_metrics.sql_in| pass| time:14032 milliseconds
TEST Case result| module:stats|mw_test.sql_in| pass| time:1852 milliseconds
TEST Case result| module:stats|ks_test.sql_in| pass| time:2465 milliseconds
TEST Case result| module:stats|f_test.sql_in| pass| time:2358 milliseconds
TEST Case result| module:stats|cox_prop_hazards.sql_in| pass| time:39932 milliseconds
TEST Case result| module:stats|correlation.sql_in| pass| time:10520 milliseconds
TEST Case result| module:stats|chi2_test.sql_in| pass| time:3581 milliseconds
TEST Case result| module:stats|anova_test.sql_in| pass| time:1801 milliseconds
TEST Case result| module:svec_util|svec_test.sql_in| pass| time:14043 milliseconds
TEST Case result| module:svec_util|gp_sfv_sort_order.sql_in| pass| time:3399 milliseconds
TEST Case result| module:utilities|text_utilities.sql_in| pass| time:6579 milliseconds
TEST Case result| module:utilities|sessionize.sql_in| pass| time:3901 milliseconds
TEST Case result| module:utilities|pivot.sql_in| pass| time:15634 milliseconds
TEST Case result| module:utilities|path.sql_in| pass| time:9321 milliseconds
TEST Case result| module:utilities|encode_categorical.sql_in| pass| time:7665 milliseconds
TEST Case result| module:utilities|drop_madlib_temp.sql_in| pass| time:153 milliseconds
TEST Case result| module:assoc_rules|assoc_rules.sql_in| pass| time:31975 milliseconds
TEST Case result| module:convex|lmf.sql_in| pass| time:66775 milliseconds
TEST Case result| module:glm|poisson.sql_in| pass| time:19117 milliseconds
TEST Case result| module:glm|ordinal.sql_in| pass| time:23446 milliseconds
TEST Case result| module:glm|multinom.sql_in| pass| time:18780 milliseconds
TEST Case result| module:glm|inverse_gaussian.sql_in| pass| time:20931 milliseconds
TEST Case result| module:glm|gaussian.sql_in| pass| time:23795 milliseconds
TEST Case result| module:glm|gamma.sql_in| pass| time:43365 milliseconds
TEST Case result| module:glm|binomial.sql_in| pass| time:39437 milliseconds
TEST Case result| module:linear_systems|sparse_linear_sytems.sql_in| pass| time:5405 milliseconds
TEST Case result| module:linear_systems|dense_linear_sytems.sql_in| pass| time:3331 milliseconds
TEST Case result| module:recursive_partitioning|random_forest.sql_in| pass| time:294832 milliseconds
TEST Case result| module:recursive_partitioning|decision_tree.sql_in| pass| time:91311 milliseconds
TEST Case result| module:regress|robust.sql_in| pass| time:55325 milliseconds
TEST Case result| module:regress|multilogistic.sql_in| pass| time:25330 milliseconds
TEST Case result| module:regress|marginal.sql_in| pass| time:73750 milliseconds
TEST Case result| module:regress|logistic.sql_in| pass| time:76501 milliseconds
TEST Case result| module:regress|linear.sql_in| pass| time:7517 milliseconds
TEST Case result| module:regress|clustered.sql_in| pass| time:40661 milliseconds
TEST Case result| module:sample|sample.sql_in| pass| time:890 milliseconds
TEST Case result| module:summary|summary.sql_in| pass| time:14644 milliseconds
TEST Case result| module:kmeans|kmeans.sql_in| pass| time:52173 milliseconds
TEST Case result| module:pca|pca_project.sql_in| pass| time:229016 milliseconds
TEST Case result| module:pca|pca.sql_in| pass| time:523230 milliseconds
TEST Case result| module:validation|cross_validation.sql_in| pass| time:33685 milliseconds
[gpadmin@hdp3 madlib]$
As you can see, all test cases pass, indicating that the MADlib installation was successful.
III. Uninstallation
The uninstallation process is basically the inverse of the installation.
1. Drop the madlib schema
Method 1: use the madpack deployment utility.
$GPHOME/madlib/bin/madpack uninstall -c /dm -s madlib -p hawq
Method 2: drop the schema manually with an SQL command.
drop schema madlib cascade;
2. Delete other legacy database objects
(1) Drop leftover test schemas
If an error occurred in the middle of testing, the database may contain leftover test schemas. These schema names are prefixed with madlib_installcheck_ and can only be dropped manually with SQL commands, for example:
drop schema madlib_installcheck_kmeans cascade;
(2) Drop leftover test users
If there is a leftover test user, drop it.
drop user if exists madlib_1100_installcheck;
3. Remove the MADlib RPM package
(1) Query the package name
gppkg -q --all
The output is as follows:
[gpadmin@hdp3 madlib]$ gppkg -q --all
20170630:16:19:53:076493 gppkg:hdp3:gpadmin-[INFO]:-Starting gppkg with args: -q --all
madlib-ossv1.10.0_pv1.9.7_hawq2.1
(2) Remove the RPM package
gppkg -r madlib-ossv1.10.0_pv1.9.7_hawq2.1
Using SQL for Data Mining with MADlib (1): Installation