The classification algorithms implemented by Mahout are:
– Stochastic gradient descent (SGD)
– Naive Bayes classification (Bayes)
– Online learning (online passive-aggressive)
– Hidden Markov model (HMM)
– Random forest (decision forest, DF)
Example 1: Using position as a predictor variable
A simple example with synthetic data demonstrates how to select predictor variables so that a Mahout model can accurately predict the desired target variable.
[Figure: Example 1 historical data, filled and unfilled shapes plotted by position]
The figure above shows a collection of historical data. Suppose we are interested in the color fill of the shapes: the color fill is the target variable.
The features that can be considered as predictor variables are shape and position.
– Position appears well suited as a predictor variable: the horizontal (x) coordinate alone may be sufficient.
– Shape does not seem to matter.
The target variable, color fill, clearly has two possible values: filled or unfilled.
– You now need to select a feature to use as a predictor variable. Which features express the pattern correctly?
– First exclude the color fill (it is the target variable); either position or shape can then be used as a predictor variable.
– Position can be described by x and y coordinates. From a single data table, you can create a record for each sample containing the target variable and the fields of the predictor variables under consideration, as sketched below.
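To make that record structure concrete, here is a minimal training sketch using Mahout's SGD-based OnlineLogisticRegression (from the org.apache.mahout.classifier.sgd package). The feature layout (a bias term plus the x and y coordinates), the synthetic sample points, and the hyperparameters are illustrative assumptions, not the data behind the figure.

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class FillByPosition {
    public static void main(String[] args) {
        // 2 target categories (unfilled = 0, filled = 1); 3 features: bias, x, y.
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, 3, new L1()).learningRate(0.5);

        // Hypothetical synthetic samples: {x, y, fill label}.
        double[][] samples = {
            {-2.0,  1.0, 0}, {-1.5, -0.5, 0}, {-0.7,  2.0, 0},
            { 0.8,  1.2, 1}, { 1.5, -1.0, 1}, { 2.2,  0.3, 1}
        };

        // Several passes of online (stochastic) training over the records.
        for (int pass = 0; pass < 20; pass++) {
            for (double[] s : samples) {
                Vector features = new DenseVector(new double[] {1.0, s[0], s[1]});
                learner.train((int) s[2], features);
            }
        }

        // For a binary model, classifyScalar returns the estimated probability
        // of category 1 (filled) for the query point.
        Vector query = new DenseVector(new double[] {1.0, 1.0, 0.0});
        System.out.println("P(filled | x=1.0, y=0.0) = " + learner.classifyScalar(query));
    }
}
```

With more than two target categories, classifyFull can be used instead; it returns a vector with one estimated probability per category.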
Example 2: Different data require different predictor variables
[Figure: Example 2 historical data, filled and unfilled shapes plotted by position]
Look at another set of historical data that has the same features as the previous data.
– In this case, neither the x nor the y coordinate has any effect on whether a shape is filled with color.
– Position is no longer useful, but shape now becomes a useful feature.
The feature selected as the predictor variable (shape) has three values (circle, triangle, square). Orientation can be introduced to further differentiate the shapes (triangles facing up versus triangles facing down), as sketched below.
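Unlike the continuous x coordinate in Example 1, shape is a categorical feature and must be encoded numerically before it can be used as a predictor variable. A common approach is one-hot encoding: one indicator component per category. The sketch below uses a hypothetical ShapeEncoder helper in plain Java; the category list (including the two triangle orientations) is an assumption for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class ShapeEncoder {
    // The categorical values the shape feature can take.
    private static final List<String> SHAPES =
        Arrays.asList("circle", "triangle-up", "triangle-down", "square");

    // One-hot encode a shape: 1.0 in the position of the matching
    // category, 0.0 everywhere else.
    static double[] encodeShape(String shape) {
        int index = SHAPES.indexOf(shape);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown shape: " + shape);
        }
        double[] encoded = new double[SHAPES.size()];
        encoded[index] = 1.0;
        return encoded;
    }

    public static void main(String[] args) {
        // "triangle-up" -> [0.0, 1.0, 0.0, 0.0]
        System.out.println(Arrays.toString(encodeShape("triangle-up")));
    }
}
```

Mahout also ships hashed feature encoders (for example StaticWordValueEncoder in org.apache.mahout.vectorizer.encoders) for categorical and text features, but for a feature with only a handful of values a plain one-hot encoding is easiest to reason about.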
Different algorithms have different strengths.
The two examples above illustrate this:
– In Example 1, the training algorithm should use the x coordinate to determine the color fill; in Example 2, the shape is more useful.
The x coordinate of a point is a continuous variable, which requires an algorithm that can handle continuous variables.
– In Mahout, the SGD and random forest algorithms can use continuous variables.
– The naive Bayes and complementary naive Bayes algorithms cannot use continuous variables directly; see the sketch below.
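One common workaround, if you want to feed a continuous value such as the x coordinate to a Bayes-style algorithm, is to discretize it into a small number of buckets and treat the bucket label as a categorical feature. The bucket boundaries and the helper name below are arbitrary choices for illustration.

```java
public class Discretizer {
    // Bucket a continuous x coordinate into a small set of categorical labels
    // so it can be fed to an algorithm that expects discrete features.
    // The boundaries here are arbitrary, chosen only for illustration.
    static String bucketX(double x) {
        if (x < -1.0) {
            return "x_far_left";
        } else if (x < 0.0) {
            return "x_left";
        } else if (x < 1.0) {
            return "x_right";
        } else {
            return "x_far_right";
        }
    }

    public static void main(String[] args) {
        System.out.println(bucketX(-2.3)); // x_far_left
        System.out.println(bucketX(0.4));  // x_right
    }
}
```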
Tradeoffs between serial and parallel algorithms
[Figure: run time of a serial algorithm versus a parallel scalable algorithm as the data size grows]
Parallel algorithms carry considerable additional overhead: it takes time to set up the computing environment before sample processing can begin.
For some medium-sized datasets, a serial algorithm may not only be sufficient but is often preferable.
The figure illustrates this tradeoff by comparing the run time of a hypothetical serial algorithm with that of a parallel, scalable algorithm.
The drops in the jagged curve correspond to the points at which a new machine is added.