Spark Machine Learning MLlib Series 1 (for Python): data types, vectors, distributed matrices, API
Key words: local vector, labeled point, local matrix, distributed matrix, RowMatrix, IndexedRowMatrix, CoordinateMatrix, BlockMatrix.
MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. A training example used in supervised learning is called a "labeled point" in MLlib.
1. Local vector
A local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array holding all of its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, the vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector. (Note: 3 is the length, that is, the number of elements; [0, 2] are the indices and [1.0, 3.0] are the values.)
1.1 MLlib considers the following data types to be dense vectors:
~ NumPy's array
~ Python's list
1.2 MLlib considers the following data types to be sparse vectors:
~ MLlib's SparseVector
~ SciPy's csc_matrix with a single column
For efficiency, we recommend using NumPy arrays rather than lists, and using the factory methods implemented in Vectors to create sparse vectors.
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors
# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Use a single-column SciPy csc_matrix as a sparse vector.
# The first argument is a (data, row_indices, column_pointers) tuple.
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
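To make the relationship between the two representations concrete, here is a minimal sketch in plain Python. It does not use MLlib; the helper names sparse_to_dense and dense_to_sparse are hypothetical and exist only for illustration.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def dense_to_sparse(dense):
    """Collapse a dense list into a (size, indices, values) triple."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values


dense = sparse_to_dense(3, [0, 2], [1.0, 3.0])
sparse = dense_to_sparse([1.0, 0.0, 3.0])
print(dense)   # [1.0, 0.0, 3.0]
print(sparse)  # (3, [0, 2], [1.0, 3.0])
```

The sparse form only pays off when most entries are zero, since it stores an index alongside every non-zero value.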
2. Labeled point
A labeled point is a local vector, either dense or sparse, associated with a label (response). In MLlib, labeled points are used in supervised learning algorithms. We use a double to store the label, so labeled points can be used in both classification and regression. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
A labeled point is represented by LabeledPoint.
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
Sparse data
Sparse training data is very common in practice. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector, as follows:
label index1:value1 index2:value2 ...
The indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.
from pyspark.mllib.util import MLUtils
# sc is an existing SparkContext.
examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
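As an illustration of the line format and the one-based to zero-based index conversion, the following plain-Python sketch parses a single LIBSVM line by hand. It is not how MLlib implements loading; parse_libsvm_line is a hypothetical helper and the sample line is made up for this example.

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 index2:value2 ...' into (label, indices, values)."""
    tokens = line.split()
    label = float(tokens[0])
    indices, values = [], []
    for token in tokens[1:]:
        index_str, value_str = token.split(":")
        indices.append(int(index_str) - 1)  # one-based in the file, zero-based in memory
        values.append(float(value_str))
    return label, indices, values


label, indices, values = parse_libsvm_line("1 1:1.0 3:3.0")
print(label, indices, values)  # 1.0 [0, 2] [1.0, 3.0]
```

The resulting indices and values are exactly the two parallel arrays of a sparse vector, which is why the format maps so directly onto a labeled point with sparse features.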
3. Local matrix
A local matrix has integer-typed row and column indices and double-typed values, and is stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order. For example, the following dense matrix:
    1.0  2.0
    3.0  4.0
    5.0  6.0
is stored in a one-dimensional array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0] with the matrix size (3, 2).
The base class of local matrices is Matrix, and two implementations are provided: DenseMatrix and SparseMatrix. We recommend using the factory methods implemented in Matrices to create local matrices. Remember that local matrices in MLlib are stored in column-major order.
from pyspark.mllib.linalg import Matrix, Matrices
# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).
# Values are listed in column-major order.
dm2 = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])
# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)).
# Arguments: numRows, numCols, column pointers, row indices, values.
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
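The column-major layout is easy to get wrong, so here is a small plain-Python sketch of how the flat arrays map back to rows and columns. It uses the array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0] from the text and the three sparse arrays from the sparse example. The function names are hypothetical, and the sparse layout is assumed to be compressed sparse column (column pointers, row indices, values).

```python
def dense_from_column_major(num_rows, num_cols, values):
    """Entry (i, j) of a column-major array lives at values[i + j * num_rows]."""
    return [[float(values[i + j * num_rows]) for j in range(num_cols)]
            for i in range(num_rows)]


def dense_from_csc(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand CSC arrays: column j owns positions col_ptrs[j] .. col_ptrs[j + 1] - 1."""
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            rows[row_indices[k]][j] = float(values[k])
    return rows


dm = dense_from_column_major(3, 2, [1, 3, 5, 2, 4, 6])
sm = dense_from_csc(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
print(dm)  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(sm)  # [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
```

Note that passing [1, 2, 3, 4, 5, 6] to the first function would instead produce the rows (1, 4), (2, 5), (3, 6), which is a common source of confusion.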
4. Distributed Matrix
A distributed matrix has long-typed row and column indices and double-typed values, and is stored in a distributed fashion in one or more RDDs. It is very important to choose the right format for storing large distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.
The basic type is called RowMatrix. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, for example a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. For a RowMatrix we assume that the number of columns is not huge, so that a single local vector can reasonably be communicated to the driver and can also be stored and operated on using a single node.
An IndexedRowMatrix is similar to a RowMatrix but has row indices, which can be used for identifying rows and executing joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries.
A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlock, where a MatrixBlock is a tuple of ((Int, Int), Matrix).
Note
The underlying RDDs of a distributed matrix must be deterministic, because the matrix size is cached. In general, using non-deterministic RDDs can lead to errors.
Rowmatrix
A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range, but in practice it should be much smaller than that.
from pyspark.mllib.linalg.distributed import RowMatrix
# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)
# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3
# Get the rows as an RDD of vectors again.
rowsRDD = mat.rows
Indexedrowmatrix
An IndexedRowMatrix is similar to a RowMatrix but has meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.
An IndexedRowMatrix can be created from an RDD of IndexedRow objects or of (long, vector) tuples. An IndexedRowMatrix can be converted to a RowMatrix by dropping its row indices.
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
# Create an RDD of indexed rows.
# - This can be done explicitly with the IndexedRow class:
indexedRows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
                              IndexedRow(1, [4, 5, 6]),
                              IndexedRow(2, [7, 8, 9]),
                              IndexedRow(3, [10, 11, 12])])
# - or by using (long, vector) tuples:
indexedRows = sc.parallelize([(0, [1, 2, 3]), (1, [4, 5, 6]),
                              (2, [7, 8, 9]), (3, [10, 11, 12])])
# Create an IndexedRowMatrix from an RDD of IndexedRows.
mat = IndexedRowMatrix(indexedRows)
# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3
# Get the rows as an RDD of IndexedRows.
rowsRDD = mat.rows
# Convert to a RowMatrix by dropping the row indices.
rowMat = mat.toRowMatrix()
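The point of keeping row indices is that rows from two data sets can be matched up by index. The following plain-Python sketch shows the idea on ordinary lists of (index, vector) pairs. It does not use Spark, and join_indexed_rows and drop_indices are hypothetical helpers written only for this illustration.

```python
def join_indexed_rows(left, right):
    """Inner-join two lists of (index, vector) pairs on the row index."""
    right_by_index = dict(right)
    return [(idx, vec + right_by_index[idx])
            for idx, vec in left if idx in right_by_index]


def drop_indices(indexed_rows):
    """Keep only the vectors, as converting to a row matrix would."""
    return [vec for _, vec in indexed_rows]


left = [(0, [1, 2, 3]), (1, [4, 5, 6]), (3, [10, 11, 12])]
right = [(1, [0.5]), (3, [0.25]), (7, [0.125])]
joined = join_indexed_rows(left, right)
print(joined)                # [(1, [4, 5, 6, 0.5]), (3, [10, 11, 12, 0.25])]
print(drop_indices(joined))  # [[4, 5, 6, 0.5], [10, 11, 12, 0.25]]
```

Once the indices are dropped, the rows can no longer be matched against another data set, which is the practical difference between the two matrix types.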
Coordinatematrix
A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
A CoordinateMatrix can be created from an RDD of MatrixEntry entries or of (long, long, float) tuples. A CoordinateMatrix can be converted to a RowMatrix by calling toRowMatrix, or to an IndexedRowMatrix with sparse rows by calling toIndexedRowMatrix.
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
# Create an RDD of coordinate entries.
# - This can be done explicitly with the MatrixEntry class:
entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(2, 1, 3.7)])
# - or by using (long, long, float) tuples:
entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)])
# Create a CoordinateMatrix from an RDD of MatrixEntries.
mat = CoordinateMatrix(entries)
# Get its size.
m = mat.numRows()  # 3
n = mat.numCols()  # 2
# Get the entries as an RDD of MatrixEntries.
entriesRDD = mat.entries
# Convert to a RowMatrix.
rowMat = mat.toRowMatrix()
# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()
# Convert to a BlockMatrix.
blockMat = mat.toBlockMatrix()
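To show where the size 3 x 2 comes from and what a conversion to sparse rows involves, here is a plain-Python sketch over a list of (i, j, value) tuples. It assumes the size is taken as the largest index in each dimension plus one when it is not given explicitly, and that the entry list is non-empty. The helper names coo_size and coo_to_sparse_rows are hypothetical.

```python
def coo_size(entries):
    """Infer (num_rows, num_cols) as the largest index in each dimension plus one."""
    num_rows = max(i for i, _, _ in entries) + 1
    num_cols = max(j for _, j, _ in entries) + 1
    return num_rows, num_cols


def coo_to_sparse_rows(entries):
    """Group entries by row index into {row: (col_indices, values)}."""
    rows = {}
    for i, j, value in sorted(entries):
        cols, vals = rows.setdefault(i, ([], []))
        cols.append(j)
        vals.append(value)
    return rows


entries = [(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)]
size = coo_size(entries)
sparse_rows = coo_to_sparse_rows(entries)
print(size)         # (3, 2)
print(sparse_rows)  # {0: ([0], [1.2]), 1: ([0], [2.1]), 2: ([1], [3.7])}
```

Grouping by row index requires bringing all entries of a row together, which on a cluster is the kind of shuffle that makes format conversions expensive.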
Blockmatrix
A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix). The (Int, Int) is the index of the block, and Matrix is the sub-matrix at that index, with size rowsPerBlock x colsPerBlock.
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix
# Create an RDD of sub-matrix blocks.
blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])),
                         ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))])
# Create a BlockMatrix from an RDD of sub-matrix blocks,
# with rowsPerBlock = 3 and colsPerBlock = 2.
mat = BlockMatrix(blocks, 3, 2)
# Get its size.
m = mat.numRows()  # 6
n = mat.numCols()  # 2
# Get the blocks as an RDD of sub-matrix blocks.
blocksRDD = mat.blocks
# Convert to a LocalMatrix.
localMat = mat.toLocalMatrix()
# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()
# Convert to a CoordinateMatrix.
coordinateMat = mat.toCoordinateMatrix()
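The way block indices map to positions in the full matrix can be sketched in plain Python as follows. This is only an illustration of the layout, not the library's implementation: assemble_blocks is a hypothetical helper, each block is given as a flat column-major list, and every block is assumed to have the full rowsPerBlock x colsPerBlock size.

```python
def assemble_blocks(blocks, rows_per_block, cols_per_block):
    """Stitch ((block_row, block_col), column_major_values) pairs into a list of rows."""
    num_rows = (max(bi for (bi, _), _ in blocks) + 1) * rows_per_block
    num_cols = (max(bj for (_, bj), _ in blocks) + 1) * cols_per_block
    full = [[0.0] * num_cols for _ in range(num_rows)]
    for (bi, bj), values in blocks:
        for j in range(cols_per_block):
            for i in range(rows_per_block):
                # Column-major: entry (i, j) of a block sits at i + j * rows_per_block.
                full[bi * rows_per_block + i][bj * cols_per_block + j] = float(
                    values[i + j * rows_per_block])
    return full


blocks = [((0, 0), [1, 2, 3, 4, 5, 6]), ((1, 0), [7, 8, 9, 10, 11, 12])]
local = assemble_blocks(blocks, 3, 2)
print(len(local), len(local[0]))  # 6 2
for row in local:
    print(row)
```

Block (1, 0) lands in rows 3 to 5 of the result because its block row index is multiplied by the number of rows per block.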
Original source: Data Types - RDD-based API