MLlib design principle: represent the data as RDDs, then call the various algorithms on the distributed dataset. MLlib is a collection of functions that can be invoked on RDDs.
Operation Steps:
1. Represent the input text as an RDD of strings.
2. Run one of MLlib's feature extraction algorithms to convert the text data into numeric features; this operation returns an RDD of vectors.
3. Call a classification algorithm on the RDD of vectors; it returns a model object that can be used to classify new data points.
4. Use one of MLlib's evaluation functions to evaluate the model on a test dataset. (A sketch of these steps follows below.)
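A minimal end-to-end sketch of these steps for a spam classifier, assuming an existing SparkContext sc and two hypothetical input files spam.txt and normal.txt; the feature size and the choice of classifier are illustrative:
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
// Step 1: read the raw messages as RDDs of strings
val spam = sc.textFile("spam.txt")
val normal = sc.textFile("normal.txt")
// Step 2: map each message to a term-frequency vector with 10,000 features
val tf = new HashingTF(numFeatures = 10000)
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))
// Label the examples (1 = spam, 0 = normal) and build the training set
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples).cache()
// Step 3: train a classifier and use it to classify a new message
val model = new LogisticRegressionWithSGD().run(trainingData)
val testExample = tf.transform("GET cheap stuff by sending money".split(" "))
println("Prediction: " + model.predict(testExample))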
Machine Learning Basics:
A machine learning algorithm attempts to optimize a mathematical objective that represents the algorithm's behavior, based on training data, and then uses what it has learned to make predictions or decisions. Common tasks include classification, regression, and clustering, each with a different objective. All learning algorithms require defining a set of features for each data point; these feature values are what get passed to the learning function.
Even more important is defining the features correctly. For example, in a product recommendation task, adding just one extra feature (say, recognizing that the books recommended to a user may also depend on the movies the user has watched) can greatly improve the results. Once the data is represented as feature vectors, most machine learning algorithms optimize a well-defined mathematical function over those vectors. At the end of the run, the algorithm returns a model representing its learned decisions.
MLlib Data Types
1. Vector
A mathematical vector. MLlib supports both dense and sparse vectors: the former stores every element of the vector, while the latter stores only the non-zero elements to save space.
Dense vectors: store all values in an array of floating-point numbers.
Sparse vectors: store only the non-zero values and their indices. When no more than about 10% of the elements are non-zero, sparse vectors are generally preferable.
Vectors are created in Spark as follows:
import org.apache.spark.mllib.linalg.Vectors
// Create the dense vector <1.0, 2.0, 3.0>; Vectors.dense accepts either a sequence of values or an array
val denseVec1 = Vectors.dense(1.0, 2.0, 3.0)
val denseVec2 = Vectors.dense(Array(1.0, 2.0, 3.0))
// Create the sparse vector <1.0, 0.0, 2.0, 0.0>; Vectors.sparse takes the size of the vector,
// an array of the non-zero indices, and an array of the corresponding values
val sparseVec1 = Vectors.sparse(4, Array(0, 2), Array(1.0, 2.0))
2. LabeledPoint
Used in supervised learning algorithms such as classification and regression to represent a labeled data point. It contains a feature vector and a label (represented as a floating-point number), and lives in the mllib.regression package.
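A small sketch of creating LabeledPoint instances; the label values and feature values here are illustrative:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// a positive example (label 1.0) with a dense feature vector
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// a negative example (label 0.0) with a sparse feature vector
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))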
3. Rating
A user's rating of a product, located in the mllib.recommendation package, used for product recommendation.
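A one-line illustrative sketch; the user ID, product ID, and score are made up:
import org.apache.spark.mllib.recommendation.Rating
// user 42 gave product 7 a rating of 5.0
val rating = Rating(42, 7, 5.0)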
4. Various model classes
Each model class is the result of a training algorithm, and each typically has a predict() method that can be applied to a single new data point or to an RDD of data points.
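For example, assuming model is a trained model (such as a LogisticRegressionModel) and testData is an RDD of feature vectors, both hypothetical here, a usage sketch:
import org.apache.spark.mllib.linalg.Vectors
// predict the class or value of a single data point
val singlePrediction = model.predict(Vectors.dense(0.5, 1.2, -0.3))
// predict every point in an RDD of vectors at once
val allPredictions = model.predict(testData)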
Feature Extraction:
TF-IDF (term frequency-inverse document frequency) is a simple way to generate feature vectors from text documents. It computes two statistics for each term in each document: the term frequency (TF), the number of times the term appears in the document, and the inverse document frequency (IDF), which measures how rarely the term occurs across the whole corpus.
MLlib provides two algorithms for computing TF-IDF: HashingTF and IDF, both in the mllib.feature package.
HashingTF computes a term-frequency vector of a given size from a document. To map terms to vector indices it uses hashing: the hash value of each term is taken modulo the desired vector size S, so every term is mapped to a number between 0 and S-1. This guarantees that an S-dimensional vector is produced. Once the term-frequency vectors are built, IDF can be used to compute the inverse document frequencies, which are then multiplied by the term frequencies to obtain the TF-IDF values.
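A hedged sketch of the HashingTF/IDF flow, assuming an existing SparkContext sc and a toy corpus of pre-tokenized documents:
import org.apache.spark.mllib.feature.{HashingTF, IDF}
// toy corpus: each document is a sequence of words
val documents = sc.parallelize(Seq(
  Seq("spark", "mllib", "machine", "learning"),
  Seq("spark", "rdd", "dataset")))
val hashingTF = new HashingTF(10000)      // map words into a 10,000-dimensional space
val tf = hashingTF.transform(documents)   // RDD of term-frequency vectors
tf.cache()                                // IDF makes two passes over the data, so cache the TF vectors
val idf = new IDF().fit(tf)               // compute inverse document frequencies over the corpus
val tfidf = idf.transform(tf)             // RDD of TF-IDF vectors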
MLlib Statistics
1. Statistics.colStats(rdd)
Computes a statistical summary of an RDD of vectors, including the maximum, minimum, mean, and variance of each column of the vector set.
2. Statistics.corr(rdd, method)
Computes the correlation matrix between the columns of an RDD of vectors, using either the Pearson or the Spearman correlation.
3. Statistics.corr(rdd1, rdd2, method)
Computes the correlation between two RDDs of floating-point values, using either the Pearson or the Spearman correlation.
4. Statistics.chiSqTest(rdd)
Computes Pearson's independence test between each feature and the label in an RDD of LabeledPoint objects. Returns one ChiSqTestResult object per feature, containing the p-value, test statistic, and degrees of freedom. (A usage sketch follows this list.)
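A minimal sketch of the summary-statistics and correlation calls, assuming an existing SparkContext sc and a toy RDD of vectors:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)))
val summary = Statistics.colStats(observations)
println(summary.mean)       // mean of each column
println(summary.variance)   // variance of each column
println(summary.max)        // maximum of each column
val corrMatrix = Statistics.corr(observations, "pearson")   // column-wise correlation matrix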
Classification and regression
Supervised learning refers to algorithms that try to predict an outcome from an object's features, using labeled training data (data points whose outcomes are known). In classification, the predicted variable is discrete (a value from a finite set, called a class); for example, the classes might be spam and non-spam, or the language a text is written in. In regression, the predicted variable is continuous (for example, predicting a person's height from their age and weight).
Linear regression:
1. numIterations
Number of iterations to run (default: 100).
2. stepSize
Step size for gradient descent (default: 1.0).
3. intercept
Whether to add an intercept (bias) feature to the data, that is, an extra feature whose value is always 1 (default: false).
4. regParam
Regularization parameter for Lasso and ridge regression (default: 1.0).
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
// points: an RDD[LabeledPoint] built from the training data
val lr = new LinearRegressionWithSGD().setIntercept(true)
lr.optimizer.setNumIterations(200)   // run 200 iterations of SGD (an illustrative value)
val model = lr.run(points)
println("weights: %s, intercept: %s".format(model.weights, model.intercept))
Logistic regression
A binary classification method that looks for a linear separating plane between negative and positive examples. In MLlib it takes an RDD of LabeledPoint with labels of 0 or 1 and returns a LogisticRegressionModel object that can predict the class of new points.
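A minimal sketch, assuming trainingData is an RDD[LabeledPoint] with 0/1 labels (for example the spam training set built earlier); the iteration count and the test vector are illustrative:
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// train with 100 iterations of SGD
val lrModel = LogisticRegressionWithSGD.train(trainingData, 100)
// classify a new point represented as a feature vector (dimension must match the training features)
val predictedClass = lrModel.predict(Vectors.dense(0.1, 2.5, 1.0))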
Decision trees and random forests
A decision tree is a flexible model that can be used for either classification or regression. It is represented as a tree of nodes; each internal node makes a binary decision based on a feature of the data (for example, is this person older than 20?), and each leaf node of the tree contains a prediction (for example, will this person buy the product?). The appeal of decision trees is that the model itself is easy to inspect, and that they support both categorical and continuous features. (A training sketch follows below.)
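A hedged sketch of training a decision-tree classifier with MLlib's DecisionTree API, assuming trainingData is an RDD[LabeledPoint] with 0/1 labels; the hyperparameter values are illustrative:
import org.apache.spark.mllib.tree.DecisionTree
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()   // empty map: treat all features as continuous
val impurity = "gini"                           // split criterion for classification
val maxDepth = 5
val maxBins = 32
val treeModel = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)
println(treeModel.toDebugString)                // the learned tree is easy to inspect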
Reference: Spark Rapid Big Data Analysis (Learning Spark)