Classification algorithms in Mahout: two examples of choosing predictor variables for the desired target variable

Source: Internet
Author: User

The classification algorithms implemented by Mahout are:

– Stochastic gradient descent (SGD)

– Bayesian classification (naive Bayes)

– Online learning (online passive-aggressive)

– Hidden Markov model (HMM)

– Decision forest (random forest, DF)


Example 1: Using position as a predictor variable


This simple example uses synthetic data to show how to select predictor variables so that a Mahout model can accurately predict the desired target variable.

[Figure: historical data, shapes at various positions, some filled and some unfilled]


The figure shows a collection of historical data. Suppose the goal is to determine whether a shape is filled with color: the color fill is the target variable.

The features that can be considered as predictor variables are shape and position.

– Position looks like a suitable predictor variable: the horizontal (x) coordinate may be sufficient.

– Shape does not seem to matter.


The color fill has two possible values: filled or unfilled.

– You now need to select a feature to use as a predictor variable. Which features describe the data correctly?

– First exclude the color fill (it is the target variable); that leaves position and shape as candidate variables.

– Position can be described by x and y coordinates. In a single data table, you can create one record per sample, containing the target variable and a field for each predictor variable under consideration.
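
The single-table layout described above can be sketched as follows. This is a minimal illustration in plain Python, not Mahout's own API. The sample values, the field names, and the 0.5 threshold are all invented for illustration and are not taken from the original figure.

```python
# Hypothetical synthetic records: one row per sample.
# "filled" is the target variable; "shape", "x", and "y" are candidate predictors.
records = [
    {"shape": "circle",   "x": 0.10, "y": 0.80, "filled": False},
    {"shape": "triangle", "x": 0.25, "y": 0.30, "filled": False},
    {"shape": "square",   "x": 0.40, "y": 0.65, "filled": False},
    {"shape": "circle",   "x": 0.60, "y": 0.20, "filled": True},
    {"shape": "square",   "x": 0.75, "y": 0.90, "filled": True},
    {"shape": "triangle", "x": 0.90, "y": 0.45, "filled": True},
]


def accuracy_of_threshold(rows, field, threshold):
    """Fraction of rows where (row[field] > threshold) matches the target."""
    hits = sum((row[field] > threshold) == row["filled"] for row in rows)
    return hits / len(rows)


# Compare the two coordinates as single-variable predictors (assumed threshold 0.5).
x_accuracy = accuracy_of_threshold(records, "x", 0.5)
y_accuracy = accuracy_of_threshold(records, "y", 0.5)
print(f"x > 0.5 predicts fill with accuracy {x_accuracy:.2f}")
print(f"y > 0.5 predicts fill with accuracy {y_accuracy:.2f}")
```

In this made-up table the x coordinate separates filled from unfilled samples, while the y coordinate does not, which mirrors the reasoning in Example 1.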


Example 2: Different data call for different predictor variables

[Figure: a second set of historical data, shapes at various positions, some filled and some unfilled]

The figure shows another set of historical data with the same features as the previous set.

– In this case, neither the x nor the y coordinate has any effect on whether a shape is filled.

– Position is no longer useful, but shape is now a useful feature.


The feature selected as the predictor variable (shape) has three values: circle, triangle, and square. Orientation can be introduced as an additional feature to distinguish shapes further (triangles pointing up versus triangles pointing down).
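
To make the role of a categorical predictor concrete, here is a small sketch in plain Python (not Mahout code). It scores a feature by predicting, for each of its values, the most common target among samples with that value. The samples, including which shapes are filled, are invented for illustration; the original figure may differ.

```python
from collections import defaultdict

# Hypothetical synthetic samples: (shape, orientation, x, filled).
samples = [
    ("circle",   "none", 0.1, True),
    ("circle",   "none", 0.9, True),
    ("square",   "none", 0.2, False),
    ("square",   "none", 0.8, False),
    ("triangle", "up",   0.3, True),
    ("triangle", "up",   0.7, True),
    ("triangle", "down", 0.4, False),
    ("triangle", "down", 0.6, False),
]


def majority_accuracy(rows, key):
    """Accuracy of predicting, for each feature value, its most common target."""
    counts = defaultdict(lambda: [0, 0])  # feature value -> [unfilled, filled]
    for row in rows:
        counts[key(row)][int(row[3])] += 1
    correct = sum(max(pair) for pair in counts.values())
    return correct / len(rows)


shape_only = majority_accuracy(samples, lambda r: r[0])
shape_and_orientation = majority_accuracy(samples, lambda r: (r[0], r[1]))
x_side = majority_accuracy(samples, lambda r: r[2] > 0.5)

print("shape only:           ", shape_only)
print("shape and orientation:", shape_and_orientation)
print("left/right of x = 0.5:", x_side)
```

With this made-up data, position carries no information, shape alone is partly informative, and adding orientation separates the two kinds of triangle.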



Different algorithms have their advantages.


Take the previous examples as an illustration:


– In Example 1, the training algorithm should use the x coordinate to determine the color fill. In Example 2, the shape is more useful.


The x coordinate of a point is a continuous variable, so the algorithm must be able to handle continuous variables.


– In Mahout, SGD and random forest can use continuous variables.

– Naive Bayes and complementary naive Bayes cannot use continuous variables.
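
As an illustration of how an SGD-trained model consumes a continuous variable directly, the sketch below fits a one-variable logistic regression in plain Python. It is not Mahout's implementation: the data, learning rate, and epoch count are assumptions chosen for the example, and samples are visited in a fixed order (a real SGD run would normally shuffle them) so the result is deterministic.

```python
import math

# Hypothetical training data: continuous x coordinate -> 1 (filled) or 0 (unfilled).
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(xs, ys, learning_rate=0.5, epochs=1000):
    """Fit p(filled | x) = sigmoid(w * x + b), updating after every sample."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Per-sample gradient of the log-likelihood with respect to the score.
            error = y - sigmoid(w * x + b)
            w += learning_rate * error * x
            b += learning_rate * error
    return w, b


w, b = train_sgd(xs, ys)


def predict(x):
    """Return 1 (filled) if the modeled probability exceeds 0.5, else 0."""
    return 1 if sigmoid(w * x + b) > 0.5 else 0


print("prediction at x = 0.15:", predict(0.15))
print("prediction at x = 0.85:", predict(0.85))
```

The x coordinate enters the model as a plain number, with no binning or encoding step. That is what being able to use continuous variables means in practice.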



Tradeoffs between parallel and serial algorithms

[Figure: run time of a serial algorithm compared with a parallel scalable algorithm]

Parallel algorithms have considerable additional overhead, and it takes some time to set up the computing environment before starting to process the samples.


For some medium-sized datasets, a serial algorithm may not only be sufficient but is often preferable.

The figure illustrates this tradeoff by comparing the run time of a hypothetical serial algorithm with that of a parallel scalable algorithm.

Each drop in the jagged curve corresponds to the addition of a new machine.
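
The shape of this tradeoff can be reproduced with a toy cost model. The sketch below is only an assumption about how such curves might arise: every constant (per-sample cost, setup overhead, samples per machine) is hypothetical and is not measured from Mahout or taken from the original figure.

```python
import math

# All constants are hypothetical and chosen only to make the shape visible.
SERIAL_SECONDS_PER_SAMPLE = 0.001
PARALLEL_SECONDS_PER_SAMPLE = 0.001
PARALLEL_SETUP_SECONDS = 60.0      # fixed overhead before any sample is processed
SAMPLES_PER_MACHINE = 100_000      # a new machine is added beyond this load


def serial_time(n_samples):
    """Run time grows linearly with the number of samples."""
    return SERIAL_SECONDS_PER_SAMPLE * n_samples


def machines_needed(n_samples):
    """One machine per SAMPLES_PER_MACHINE samples, and at least one."""
    return max(1, math.ceil(n_samples / SAMPLES_PER_MACHINE))


def parallel_time(n_samples):
    """Fixed setup cost plus the work divided evenly across machines."""
    work = PARALLEL_SECONDS_PER_SAMPLE * n_samples / machines_needed(n_samples)
    return PARALLEL_SETUP_SECONDS + work


for n in (10_000, 100_000, 100_001, 1_000_000):
    print(n, machines_needed(n), serial_time(n), round(parallel_time(n), 2))
```

Under these assumptions the serial version is faster for small inputs because it pays no setup cost, the parallel run time drops each time the machine count increases, and the parallel version is faster once the dataset is large enough.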


This article is from the "CAS Computer Training" blog; reprinting is not permitted.
