Classification and Interpretation of 39 Spark Machine Learning Libraries

Source: Internet
Author: User
Tags: pmml, spark, mllib

As a follow-up to an earlier article on the College site (http://xxwenda.com/article/584), the plan is to test each of these libraries individually; of course, many have already been tested.


Apache Spark itself

1. MLlib

AMPLab

Spark was originally born in UC Berkeley's AMPLab, and several of the projects below are still AMPLab projects: they are not part of the Apache Spark project itself, yet they hold a considerable place in the day-to-day GitHub activity around Spark.

ML Base

Spark's own MLlib forms the bottom layer of the three-layer MLBase stack; MLI is the middle layer, and the ML Optimizer is the topmost abstraction.

2. MLI

3. ML Optimizer (also known as Ghostface)

The Ghostface project began in 2014 but has never been released. Among these 39 machine learning libraries, it is the only piece of vaporware; it makes the list solely on the strength of AMPLab and MLBase.

Outside of MLBase

4. Splash

A recent project from June 2015. When running stochastic gradient descent (SGD), this set of stochastic learning algorithms claims to be 25%-75% faster than Spark MLlib. It is an AMPLab project, and is therefore worth watching.
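For background, stochastic gradient descent updates the model from one randomly chosen sample at a time rather than from the full dataset, which is what makes it amenable to the kinds of speedups Splash claims. Below is a minimal least-squares sketch in plain Python; it is not Splash's API, and the data, learning rate, and epoch count are made up for illustration:

```python
import random

def sgd_linear_fit(xs, ys, lr=0.01, epochs=500, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent:
    each update uses a single randomly chosen sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        for _ in range(n):
            i = rng.randrange(n)           # pick one sample at random
            err = (w * xs[i] + b) - ys[i]  # prediction error on that sample
            w -= lr * err * xs[i]          # gradient of 0.5*err^2 w.r.t. w
            b -= lr * err                  # gradient of 0.5*err^2 w.r.t. b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
w, b = sgd_linear_fit(xs, ys)
```

Because each step touches only one sample, the per-step cost is constant regardless of dataset size; the price is noisier convergence, which systems like Splash try to mitigate at scale.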

5. KeystoneML

KeystoneML brought end-to-end machine learning pipelines to Spark, although pipelines have since matured in recent Spark releases. It also promises some computer vision capability; I have noted some of its limitations in an earlier blog post.

6. Velox

A server dedicated to managing large collections of machine learning models.

7. CoCoA

Achieves faster machine learning by optimizing communication patterns and shuffles, as described in the paper "Communication-Efficient Distributed Dual Coordinate Ascent".

Frameworks

GPU-based

8. DeepLearning4j

I once wrote a blog post covering "DeepLearning4j Adds Spark GPU Support".

9. Elephas

Brand new, and the original reason I set out to write this post. It provides an interface between Spark and Keras.

Non-GPU-based

10. DistML

A parameter-server extension to Spark MLlib that is model-parallel rather than data-parallel.

11. Aerosolve

From Airbnb, used in their automated pricing.

12. Zen

Logistic regression, latent Dirichlet allocation (LDA), factorization machines, neural networks, and restricted Boltzmann machines.

13. Distributed Data Frame

Similar to Spark's DataFrame, but engine-agnostic (for example, in the future it may run on engines other than Spark). It includes cross-validation and interfaces to external machine learning libraries.
Interfaces to other machine learning systems

14. spark-corenlp

Wraps Stanford CoreNLP.

15. sparkit-learn

An interface to Python's scikit-learn.

16. Sparkling Water

An interface to H2O.

17. hivemall-spark

Wraps Hivemall, machine learning in Hive.

18. spark-pmml-exporter-validator

Exports models as Predictive Model Markup Language (PMML), an industry-standard XML format for exchanging machine learning models.

Add-ons that enhance existing MLlib algorithms
19. mllib-dropout

Adds dropout capability to Spark MLlib. Based on the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting".
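The dropout idea itself is simple: during training, each unit's activation is zeroed with probability p, and in the "inverted" formulation the survivors are scaled by 1/(1-p) so the expected activation is unchanged. A toy sketch in plain Python, unrelated to mllib-dropout's actual API:

```python
import random

def dropout(activations, p_drop, rng):
    """Inverted dropout: zero each unit with probability p_drop and
    scale survivors by 1/(1 - p_drop) to preserve the expected value."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() >= p_drop else 0.0
            for a in activations]

rng = random.Random(42)
out = dropout([1.0, 1.0, 1.0, 1.0], p_drop=0.5, rng=rng)
# each entry is either dropped (0.0) or rescaled (1.0 / 0.5 = 2.0)
```

At inference time the mask is simply not applied; because of the 1/(1-p) rescaling, no further correction is needed.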

20. generalized-kmeans-clustering

Allows the k-means algorithm to be used with an arbitrary distance function.
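To make the "arbitrary distance function" idea concrete, here is a minimal Lloyd's-algorithm sketch with a pluggable distance, in plain Python rather than this library's Scala API. Note the caveat baked into the comment: updating centroids as coordinate means is only the true minimizer for squared Euclidean distance, a limitation shared by any generalized k-means:

```python
def kmeans(points, k, distance, iters=20):
    """Lloyd's algorithm with a caller-supplied distance function.
    Centroids are still updated as coordinate means, which is a true
    minimizer only for squared Euclidean distance."""
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: distance(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters

# Manhattan (L1) distance instead of the usual Euclidean distance
manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = kmeans(pts, 2, manhattan)
```

Swapping in cosine or Bregman divergences works the same way; only the `distance` argument changes.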

21. spark-ml-streaming

Visualizes the streaming machine learning algorithms in Spark MLlib.
Algorithms

Supervised learning

22. spark-libFM

Factorization machines.
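A second-order factorization machine predicts y = w0 + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨Vᵢ, Vⱼ⟩ xᵢxⱼ, where each feature i gets a small latent vector Vᵢ, so pairwise interaction weights are factorized instead of stored densely. A sketch of just the prediction step (the weights below are made-up toy values, not a trained model, and this is not spark-libFM's API):

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction:
    y = w0 + sum_i w_i*x_i + sum_{i<j} <V[i], V[j]> * x_i * x_j."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))  # <V_i, V_j>
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

x = [1.0, 2.0, 0.0]                       # feature vector (often sparse)
w0 = 0.5                                  # global bias
w = [0.1, 0.2, 0.3]                       # per-feature linear weights
V = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # rank-2 latent vectors
y = fm_predict(x, w0, w, V)
```

Because interactions go through latent vectors, the model can estimate the strength of feature pairs it has never seen co-occur, which is why FMs do well on sparse recommendation-style data.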

23. ScalaNetwork

Recursive neural networks (RNNs).

24. dissolve-struct

Support vector machines (SVMs), built on CoCoA, the high-performance Spark communication framework mentioned above.

25. Sparkling Ferns

Based on the implementation in the paper "Image Classification using Random Forests and Ferns".

26. streaming-matrix-factorization

A matrix-factorization recommendation system.
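The core idea of matrix-factorization recommenders is to approximate the sparse user-item ratings matrix as a product of two low-rank factor matrices, trained by SGD over the observed entries. A self-contained toy sketch of that classic formulation (not this library's streaming API; the ratings, rank, and hyperparameters are invented for illustration):

```python
import random

def train_mf(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02,
             epochs=500, seed=0):
    """Factor observed (user, item, rating) triples as P @ Q^T by SGD
    with L2 regularization -- the classic recommender formulation."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(rank))
            err = r - pred
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # gradient step + L2
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# (user, item, rating) observations
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0)]
P, Q = train_mf(ratings, n_users=2, n_items=2)
# reconstruction error over the observed entries
sse = sum((r - sum(P[u][f] * Q[i][f] for f in range(2))) ** 2
          for u, i, r in ratings)
```

Unobserved (user, item) cells of P @ Qᵀ then serve as predicted ratings; a streaming variant updates P and Q incrementally as new ratings arrive.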

Unsupervised learning

27. PatchWork

Claims clustering 40% faster than the k-means implementation in Spark MLlib.

28. bisecting-kmeans-clustering

A k-means variant that tends to produce more evenly sized clusters, based on the paper "A Comparison of Document Clustering Techniques".
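Bisecting k-means works top-down: start with all points in one cluster, then repeatedly split the largest cluster in two with an ordinary 2-means step until k clusters remain. A toy sketch over 1-D points, assuming nothing about this library's API:

```python
def two_means(points, iters=10):
    """Split one cluster of 1-D points into two with plain 2-means."""
    c0, c1 = min(points), max(points)  # init centers at the extremes
    for _ in range(iters):
        a = [p for p in points if abs(p - c0) <= abs(p - c1)]
        b = [p for p in points if abs(p - c0) > abs(p - c1)]
        if a: c0 = sum(a) / len(a)
        if b: c1 = sum(b) / len(b)
    return a, b

def bisecting_kmeans(points, k):
    """Start from one cluster; repeatedly bisect the largest cluster
    until k clusters remain, which tends to even out cluster sizes."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()       # always split the largest cluster
        a, b = two_means(biggest)
        clusters += [a, b]
    return clusters

data = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0, 9.1, 9.2]
clusters = bisecting_kmeans(data, 3)
```

Because the largest cluster is always the one split, sizes stay closer to uniform than with flat k-means, and the sequence of splits also yields a cluster hierarchy for free.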

29. spark-knn-graphs

Builds graphs using the k-nearest-neighbors algorithm and locality-sensitive hashing (LSH).
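A k-NN graph connects each point to its k closest neighbors. The brute-force construction below (plain Python, not this library's API) makes the O(n²) cost explicit; LSH is precisely the trick the library uses to avoid comparing every pair, by hashing nearby points into the same buckets and searching only within them:

```python
def knn_graph(points, k, distance):
    """Build a k-nearest-neighbor graph by brute force: O(n^2) distance
    computations, which LSH-based construction approximates cheaply."""
    graph = {}
    for i, p in enumerate(points):
        others = [(distance(p, q), j) for j, q in enumerate(points) if j != i]
        others.sort()                       # nearest first; ties break by index
        graph[i] = [j for _, j in others[:k]]
    return graph

euclidean = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
g = knn_graph(pts, k=2, distance=euclidean)
```

The resulting adjacency structure is the input for graph-based downstream tasks such as clustering or label propagation.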

30. TopicModeling

Online latent Dirichlet allocation, Gibbs-sampling latent Dirichlet allocation, and the online hierarchical Dirichlet process (HDP).

Algorithm building blocks

31. sparkboost

AdaBoost and the MP-Boost algorithm.
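The mechanism behind AdaBoost is worth seeing once: each round picks the weak learner with the lowest weighted error, then re-weights the samples so the misclassified ones count more in the next round. A minimal sketch with hard-coded decision stumps (toy data and stumps of my own invention, not sparkboost's API):

```python
import math

def adaboost(samples, labels, stumps, rounds=3):
    """Minimal AdaBoost over labels in {-1, +1}. `stumps` are callables
    mapping a sample to -1 or +1; each round picks the stump with the
    lowest weighted error and up-weights the samples it misclassified."""
    n = len(samples)
    w = [1.0 / n] * n
    ensemble = []  # (alpha, stump) pairs
    for _ in range(rounds):
        errs = [sum(wi for wi, x, y in zip(w, samples, labels) if h(x) != y)
                for h in stumps]
        best = min(range(len(stumps)), key=lambda i: errs[i])
        e = max(min(errs[best], 1 - 1e-10), 1e-10)  # clamp away from 0 and 1
        alpha = 0.5 * math.log((1 - e) / e)         # stump's vote weight
        h = stumps[best]
        # multiplicative re-weighting, then renormalize to a distribution
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, samples, labels)]
        z = sum(w)
        w = [wi / z for wi in w]
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    s = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if s >= 0 else -1

samples = [-2.0, -1.0, 1.0, 2.0]
labels = [-1, -1, 1, 1]
stumps = [lambda x: 1 if x > 0 else -1,
          lambda x: 1 if x > 1.5 else -1]
ens = adaboost(samples, labels, stumps)
```

MP-Boost extends this scheme for multi-label text classification; the weighted-error/re-weighting loop is the shared building block.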

32. spark-tfocs

A port of TFOCS (Templates for First-Order Conic Solvers) to Spark. If your machine learning cost function happens to be convex, TFOCS can solve it.

33. lazy-linalg

Linear algebra operations that work with the linalg package in Spark MLlib.

Feature extraction
34. spark-infotheoretic-feature-selection

An information-theoretic basis for feature selection. Implements the paper "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection".
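The common currency of these criteria is mutual information: a feature is informative about the label when I(X;Y) = Σ p(x,y) log [p(x,y) / (p(x)p(y))] is large. A small sketch computing it for discrete features (illustrative only; not this library's criterion set or API):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in nats for two discrete-valued sequences:
    sum over (x, y) of p(x,y) * log( p(x,y) / (p(x) * p(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px = Counter(xs)             # marginal counts of X
    py = Counter(ys)             # marginal counts of Y
    mi = 0.0
    for (x, y), c in pxy.items():
        # (c/n) / ((px[x]/n) * (py[y]/n)) simplifies to c*n / (px*py)
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# Feature 1 determines the label exactly; feature 2 is independent noise.
labels = [0, 0, 1, 1, 0, 0, 1, 1]
feat1  = [0, 0, 1, 1, 0, 0, 1, 1]   # identical to the labels
feat2  = [0, 1, 0, 1, 0, 1, 0, 1]   # statistically independent of them
mi1 = mutual_information(feat1, labels)
mi2 = mutual_information(feat2, labels)
```

Ranking features by scores like this (with corrections for redundancy between already-selected features) is the family of methods the paper unifies.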

Spark-mdlp-discretization

For data labels, a part of the continuous digital dimension is "discretized", so that the data classes in each case can be evenly distributed. This is the basic idea of the cart and ID3 algorithm to generate decision trees. The realization of this paper is based on the multi-interval discretization of continuous value attribute in categorical learning.
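The shared mechanism with CART/ID3 is the entropy-based cut point: scan the sorted feature values and pick the boundary that most reduces label entropy. A sketch of that single step (the full MDLP method additionally applies a minimum-description-length stopping rule and recurses; the data here is invented):

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_cut_point(values, labels):
    """Find the boundary on one continuous feature that maximizes
    information gain -- the criterion MDLP discretization (and CART/ID3
    node splits) are built on."""
    order = sorted(zip(values, labels))
    vs = [v for v, _ in order]
    ys = [y for _, y in order]
    n = len(ys)
    base = entropy(ys)
    best_gain, best_cut = -1.0, None
    for i in range(1, n):
        if vs[i] == vs[i - 1]:
            continue  # no boundary between equal values
        left, right = ys[:i], ys[i:]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_cut = gain, (vs[i - 1] + vs[i]) / 2
    return best_cut, best_gain

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
cut, gain = best_cut_point(values, labels)
```

Applying this recursively within each interval, until the MDL criterion says a further split is not worth its description cost, yields the multi-interval discretization of the paper.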

36. spark-tsne

Distributed t-SNE for dimensionality reduction.

37. modelmatrix

Focuses on sparse feature vectors.
Specific domains

Spatial and time-series data

K-means, regression, and statistical methods.

Twitter data
