MLlib is a distributed machine learning library built on Spark. It takes advantage of Spark's in-memory computing, which suits iterative computation well, to improve performance substantially. At the same time, because Spark's operators are highly expressive, developing large-scale machine learning algorithms is no longer complicated.
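To make the point about iterative computing concrete, here is a minimal sketch in plain Python (no Spark involved). It assumes a toy dataset and hypothetical helper names (gradient, add_pairs, fit); the in-memory list merely stands in for a dataset that Spark would keep cached between iterations. Every iteration makes one map pass and one reduce pass over the same data, which is why keeping that data in memory pays off.

```python
from functools import reduce

# Toy dataset for y = 2x + 1. The list stands in for a dataset that would be
# cached in memory between iterations (an assumption made for illustration).
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]


def gradient(point, w, b):
    """Gradient of the squared error 0.5 * (w*x + b - y)^2 for one point."""
    x, y = point
    err = w * x + b - y
    return (err * x, err)


def add_pairs(a, b):
    """Combine two partial gradients (the 'reduce' step)."""
    return (a[0] + b[0], a[1] + b[1])


def fit(data, steps=2000, lr=0.05):
    """Batch gradient descent: each step is one map plus one reduce over `data`."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # The same in-memory data is rescanned on every iteration.
        gw, gb = reduce(add_pairs, map(lambda p: gradient(p, w, b), data))
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


w, b = fit(points)
```

With the data reloaded from disk on each pass, the loop above would be dominated by I/O; holding it in memory removes that cost from every iteration.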
MLlib implements a set of commonly used machine learning algorithms and libraries on the Spark platform. It is the underlying component of AMPLab's MLbase machine learning project.
MLbase is a machine learning platform; for details, see http://www.cnblogs.com/zlslch/p/5726445.html
MLI is an interface layer that provides many constructs, and MLlib is the underlying algorithm implementation layer, as shown in Figure 1.
Figure 1 MLbase
MLlib includes components for classification and regression, clustering, collaborative filtering, and dimensionality reduction, together with an underlying optimization library, as shown in Figure 2.
Figure 2 MLlib component diagram
Figure 2 gives a high-level view of MLlib's overall components and the libraries it depends on.
A brief introduction to the underlying components:
BLAS/LAPACK layer: LAPACK is an algorithm library written in Fortran. As its name (Linear Algebra PACKage) implies, it solves general linear algebra problems. Another package that must be mentioned is BLAS (Basic Linear Algebra Subprograms); in fact, LAPACK itself is built on top of the BLAS library. Many computer vendors provide BLAS/LAPACK packages optimized for their own processors.
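The layering described above can be sketched in plain Python. The functions below are simplified reference versions named after the BLAS routines they resemble (DDOT, DAXPY, DGEMV); the real routines are Fortran subprograms with different signatures, and solve_upper_triangular is a hypothetical helper, not LAPACK code. It only illustrates how a solver can be written on top of a BLAS-style kernel.

```python
def dot(x, y):
    """Level-1 BLAS style inner product (cf. DDOT)."""
    return sum(a * b for a, b in zip(x, y))


def axpy(alpha, x, y):
    """Level-1 BLAS style update alpha * x + y (cf. DAXPY); returns a new list."""
    return [alpha * a + b for a, b in zip(x, y)]


def gemv(matrix, x):
    """Level-2 BLAS style matrix-vector product (cf. DGEMV, without alpha/beta scaling)."""
    return [dot(row, x) for row in matrix]


def solve_upper_triangular(u, rhs):
    """Back substitution for U x = rhs, written on top of dot().

    This mirrors how higher-level solvers are layered on BLAS kernels;
    it is not LAPACK's actual implementation.
    """
    n = len(rhs)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        # Contribution of the unknowns that have already been solved.
        known = dot(u[i][i + 1:], x[i + 1:])
        x[i] = (rhs[i] - known) / u[i][i]
    return x


u = [[2.0, 1.0], [0.0, 4.0]]
x = solve_upper_triangular(u, [4.0, 8.0])
```

Because the solver only calls dot(), swapping in a faster kernel speeds it up without changing the solver itself, which is the reason vendor-optimized BLAS builds matter.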
netlib-java (project page: https://github.com/fommil/netlib-java/) is a Java interface layer that wraps the underlying BLAS and LAPACK libraries.
Breeze (project page: https://github.com/scalanlp/breeze) is a numerical processing library written in Scala that provides APIs for vectors, matrix operations, and so on.
Library dependencies: MLlib uses the Scala linear algebra library Breeze, which in turn relies on netlib-java, and netlib-java depends on native Fortran routines. Users who need these routines must therefore pre-install the gfortran runtime library on each node (download address: https://github.com/mikiobraun/jblas/wiki/Missing-Libraries). Because of licensing issues, MLlib's official dependency set does not include netlib-java's native libraries. If no native library is available in the runtime environment, the user will see a warning message. To use the netlib-java native libraries in your program, add the com.github.fommil.netlib:all:1.1.2 dependency to your project, or follow the guide (URL: https://github.com/fommil/netlib-java/blob/master/readme.md#machine-optimised-system-libraries) to build your own project. The Python interface requires NumPy 1.4 or later. (Note: APIs marked Experimental or DeveloperApi in the MLlib source may be adjusted or changed in future releases; the official documentation provides a migration guide between versions.)
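As a small how-to for the NumPy requirement, the following sketch checks whether the installed NumPy meets the 1.4 minimum. The helpers version_tuple and meets_minimum are hypothetical names introduced here, not part of MLlib or NumPy, and the parsing assumes version strings made of dot-separated numbers with optional suffixes.

```python
import re


def version_tuple(version):
    """Turn a version string such as '1.26.4' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum=(1, 4)):
    """True if `version` is at least `minimum` (NumPy 1.4 by default)."""
    return version_tuple(version) >= minimum


try:
    import numpy
    numpy_ok = meets_minimum(numpy.__version__)
except ImportError:
    # NumPy is not installed, so the requirement is not met.
    numpy_ok = False
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.10.0" would sort before "1.4.0" even though it is the newer release.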
Introduction to Apache Spark MLlib