This list is adapted from an article (http://xxwenda.com/article/584); the follow-up plan is to try each of these libraries individually, and of course many of them have already been tested.
Apache Spark itself
1.MLlib
AMPLab
Spark was originally born in UC Berkeley's AMPLab, and the projects below are still AMPLab projects. They are not part of Apache Spark itself, but they still see a considerable amount of day-to-day activity on GitHub.
MLbase
Spark's own MLlib sits at the bottom of the three-layer MLbase stack, MLI is the middle layer, and ML Optimizer is the topmost abstraction.
2.MLI
3.ML Optimizer (also known as Ghostface)
The Ghostface project began in 2014 but has never been released. Of these 39 machine learning libraries, it is the only piece of vaporware on the list, included solely on the strength of its AMPLab and MLbase backing.
Outside of MLbase
4.Splash
A recent project (June 2015). When running stochastic gradient descent (SGD), this set of stochastic learning algorithms claims to be 25%-75% faster than Spark MLlib. It is an AMPLab project, and therefore worth keeping an eye on.
5.Keystone ML
KeystoneML brought end-to-end machine learning pipelines to Spark, although pipelines have since matured in recent Spark versions. It also promises some computer vision capability, but as I have mentioned on the blog before, it has some limitations.
6.Velox
A server dedicated to managing a large collection of machine learning models.
7.CoCoA
Achieves faster machine learning by optimizing communication patterns and shuffles, as described in detail in the paper "Communication-Efficient Distributed Dual Coordinate Ascent".
Frameworks
GPU-based
8.deeplearning4j
I covered this in an earlier blog post, "DeepLearning4j adds Spark GPU support".
9.Elephas
Brand new, and the reason I set out to write this post in the first place. It provides an interface to Keras.
Non-GPU-based
10.DistML
A parameter server that enables model parallelism rather than data parallelism (which is what Spark MLlib uses).
11.Aerosolve
From Airbnb, for their automated pricing.
12.Zen
Logistic regression, latent Dirichlet allocation (LDA), factorization machines, neural networks, and restricted Boltzmann machines.
13.Distributed Data Frame
Similar to Spark's DataFrame, but engine-agnostic (for example, in the future it could run on engines other than Spark). It includes cross-validation and interfaces to external machine learning libraries.
Interfaces to other machine learning systems
14.spark-corenlp
Wraps Stanford CoreNLP.
15.sparkit-learn
An interface to Python's scikit-learn.
16.Sparkling Water
An interface to H2O.
17.hivemall-spark
Wraps Hivemall, the machine learning library for Hive.
18.spark-pmml-exporter-validator
Exports Predictive Model Markup Language (PMML), an industry-standard XML format for exchanging machine learning models.
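For context, here is a minimal sketch of what PMML export looks like with the toPMML method that Spark's RDD-based MLlib models provide (KMeansModel is used as the example; the exporter-validator project's own workflow may differ):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object PmmlExportSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pmml-export").setMaster("local[*]"))

        // Toy data: two well-separated clusters.
        val points = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

        // Train a K-means model with the RDD-based MLlib API (k = 2, 20 iterations).
        val model = KMeans.train(points, 2, 20)

        // MLlib models that mix in PMMLExportable can be serialized to PMML XML.
        println(model.toPMML())

        sc.stop()
      }
    }

The resulting XML can then be handed to a validator or imported into another PMML-aware scoring engine.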
Additional components: enhancements to existing MLlib algorithms
19.mllib-dropout
Adds dropout capability to Spark MLlib. Based on the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting".
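As a one-line reminder of what dropout means (paraphrasing the cited paper, not describing this package's API): during training each unit is kept with probability p, so an activation vector x is replaced by a randomly masked version

    \tilde{x}_i = m_i \, x_i, \qquad m_i \sim \mathrm{Bernoulli}(p)

and at test time the full network is used, with the weights scaled so that the expected training-time activations are matched.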
20.generalized-kmeans-clustering
Extends the K-means algorithm to support arbitrary distance functions, as sketched below.
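To illustrate the idea (a hypothetical sketch, not this project's actual API): the only place the K-means assignment step needs geometry is a distance function, so it can be made pluggable:

    import org.apache.spark.mllib.linalg.Vector

    object CustomDistanceSketch {
      // Any distance measure can be supplied; Euclidean and Manhattan are shown here.
      type Distance = (Vector, Vector) => Double

      val euclidean: Distance = (a, b) =>
        math.sqrt(a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum)

      val manhattan: Distance = (a, b) =>
        a.toArray.zip(b.toArray).map { case (x, y) => math.abs(x - y) }.sum

      // Assignment step: pick the centroid index that minimizes the chosen distance.
      def closestCentroid(point: Vector, centroids: Seq[Vector], dist: Distance): Int =
        centroids.indices.minBy(i => dist(point, centroids(i)))
    }

The project generalizes this idea across a full Spark implementation of K-means.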
21.spark-ml-streaming
Visualization of the streaming machine learning algorithms in Spark MLlib.
Algorithms
Supervised learning
22.spark-libfm
Factorization machines.
23.ScalaNetwork
Recursive neural networks (RNNs).
24.dissolve-struct
Support vector machines (SVMs) built on top of CoCoA, the high-performance Spark communication framework mentioned above.
25.Sparkling Ferns
An implementation based on the paper "Image Classification Using Random Forests and Ferns".
26.streaming-matrix-factorization
A matrix-factorization recommendation system.
Unsupervised learning
27.PatchWork
Clustering that claims to be 40% faster than the K-means implementation in Spark MLlib.
28.Bisecting K-means Clustering
A K-means variant that tends to produce clusters of more uniform size, based on the paper "A Comparison of Document Clustering Techniques". A sketch of the procedure follows.
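The basic driver loop looks like this (a simplified local illustration assuming some 2-cluster K-means routine twoMeans is available; it is not the package's Spark code):

    // Simplified bisecting K-means: keep splitting the largest cluster in two
    // until the requested number of clusters is reached.
    // Assumes twoMeans always returns two non-empty partitions.
    def bisectingKMeans[P](points: Seq[P], k: Int,
                           twoMeans: Seq[P] => (Seq[P], Seq[P])): Seq[Seq[P]] = {
      var clusters: Seq[Seq[P]] = Seq(points)
      while (clusters.size < k) {
        val largest = clusters.maxBy(_.size)        // choose which cluster to split
        val (left, right) = twoMeans(largest)       // ordinary K-means with k = 2
        clusters = clusters.filterNot(_ eq largest) ++ Seq(left, right)
      }
      clusters
    }

Because the split is always applied to the largest remaining cluster, cluster sizes stay closer together than with a single flat K-means run.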
29.spark-knn-graphs
Builds graphs using the k-nearest-neighbor algorithm and locality-sensitive hashing (LSH).
30.TopicModeling
Online latent Dirichlet allocation (LDA), Gibbs-sampling LDA, and the online hierarchical Dirichlet process (HDP).
Algorithm building blocks
31.sparkboost
AdaBoost and MP-Boost.
32.spark-tfocs
A Spark port of TFOCS (Templates for First-Order Conic Solvers). If your machine learning cost function happens to be convex, TFOCS can solve it; an example problem is sketched below.
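As an illustration of the kind of problem in scope (not a statement about spark-tfocs's exact interface), the LASSO is a typical composite convex objective:

    \min_{x} \; \tfrac{1}{2} \| A x - b \|_2^2 + \lambda \| x \|_1

where the smooth least-squares term and the non-smooth L1 penalty are exactly the kind of split that first-order solver templates such as TFOCS are designed to handle.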
33.lazy-linalg
Linear algebra operators that work with the linalg package in Spark MLlib.
Feature Extraction
34.spark-infotheoretic-feature-selection
Feature selection on an information-theoretic foundation. An implementation of the paper "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection"; the general form of its selection criterion is sketched below.
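The unifying form of the criterion, as I recall it from the paper (so treat the details as an approximation): a candidate feature X_k is scored against the already selected set S by

    J(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) + \gamma \sum_{X_j \in S} I(X_k; X_j \mid Y)

with different settings of beta and gamma recovering familiar criteria such as mutual-information maximization, mRMR, and JMI.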
35.spark-mdlp-discretization
Given labeled data, "discretizes" a continuous numeric dimension so that each resulting interval is as homogeneous as possible with respect to the class labels. This is the same basic idea that CART and ID3 use to grow decision trees. An implementation of the paper "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning".
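Concretely, a candidate cut point T on attribute A splits the examples S into S_1 and S_2, and is accepted when the entropy-based information gain

    \mathrm{Gain}(A, T; S) = H(S) - \frac{|S_1|}{|S|} H(S_1) - \frac{|S_2|}{|S|} H(S_2)

exceeds the minimum-description-length threshold derived in the paper, with the procedure applied recursively inside each interval.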
36.spark-tsne
Distributed t-SNE for dimensionality reduction.
37.Modelmatrix
Sparse feature vectors.
Specific areas
Spatial and time-series data
K-means, regression, and statistical methods.
Twitter data