Mahout Series: Core Functions in Practice

Source: Internet
Author: User

The previous article covered Mahout Math, Mahout's computational module. It contains many commonly used mathematical and statistical routines that the rest of Mahout relies on, so a good understanding of these foundations is worthwhile. Mahout also provides a number of command-line tools, all of which are listed below. The list changes between versions and each command takes its own parameters, but the commands have much in common, so becoming familiar with a few of them makes the rest easier to pick up. A quick look through the list shows what Mahout can do and which functions can be used directly; it is provided here for reference:

Command | Comment | Detail
arff.vector | Generate vectors from ARFF files | Generate Vectors from an ARFF file or directory
baumwelch | Baum-Welch training algorithm for HMMs | Baum-Welch algorithm for unsupervised HMM training
buildforest | Build a random forest classifier | Build the random forest classifier
canopy | Canopy clustering | Canopy clustering
cat | Print a file or resource for easy viewing | Print a file or resource as the logistic regression models would see it
cleansvd | Clean up and verify SVD output | Cleanup and verification of SVD output
clusterdump | Dump clustering results as text | Dump cluster output to text
clusterpp | Group clustering output by cluster | Groups clustering output in clusters
cmdump | Dump a confusion matrix in HTML or text format | Dump confusion matrix in HTML or text formats
concatmatrices | Merge two matrices of the same cardinality into one | Concatenates 2 matrices of same cardinality into a single matrix
cvb | LDA | LDA via collapsed variation Bayes (0th deriv. approx)
cvb0_local | Local LDA | LDA via collapsed variation Bayes, in memory locally
describe | Describe the fields and target variable of a data set | Describe the fields and target variable in a data set
evaluateFactorization | Compute RMSE and MAE | Compute RMSE and MAE of a rating matrix factorization against probes
fkmeans | Fuzzy k-means clustering | Fuzzy K-means clustering
hmmpredict | Generate a random observation sequence from a given HMM | Generate random sequence of observations by given HMM
itemsimilarity | Item similarity | Compute the item-item-similarities for item-based collaborative filtering
kmeans | K-means clustering | K-means clustering
lucene.vector | Generate vectors from a Lucene index | Generate Vectors from a Lucene index
lucene2seq | Generate text sequence files from a Lucene index | Generate Text SequenceFiles from a Lucene index
matrixdump | Dump a matrix in CSV format | Dump matrix in CSV format
matrixmult | Multiply two matrices | Take the product of two matrices
parallelALS | Parallel ALS | ALS-WR factorization of a rating matrix
qualcluster | Run clustering experiments and summarize the results | Runs clustering experiments and summarizes results in a CSV
recommendfactorized | Compute recommendations from a matrix factorization | Compute recommendations using the factorization of a rating matrix
recommenditembased | Item-based collaborative filtering recommendations | Compute recommendations using item-based collaborative filtering
regexconverter | Convert text files line by line using regular expressions | Convert text files on a per line basis based on regular expressions
resplit | Split sequence files into several equal parts | Splits a set of SequenceFiles into a number of equal splits
rowid | Map sequence files | Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
rowsimilarity | Compute pairwise similarities of matrix rows | Compute the pairwise similarities of the rows of a matrix
runAdaptiveLogistic | Run adaptive logistic regression | Score new production data using a probably trained and validated AdaptiveLogisticRegression model
runlogistic | Run logistic regression on CSV data | Run a logistic regression model against CSV data
seq2encoded | Generate encoded sparse vectors from text sequence files | Encoded Sparse Vector generation from Text sequence files
seq2sparse | Generate sparse vectors from text sequence files | Sparse Vector generation from Text sequence files
seqdirectory | Create sequence files from a directory | Generate sequence files (of Text) from a directory
seqdumper | Generic sequence file dumper | Generic Sequence File dumper
seqmailarchives | Create sequence files from a directory of compressed mail archives | Creates SequenceFile from a directory containing gzipped mail archives
seqwiki | Convert a Wikipedia XML dump to a sequence file | Wikipedia XML dump to sequence file
spectralkmeans | Spectral k-means clustering | Spectral k-means clustering
split | Split input data into test and training sets | Split Input data into test and train sets
splitDataset | Split a rating data set into training and test parts | Split a rating dataset into training and probe parts
ssvd | Stochastic SVD | Stochastic SVD
streamingkmeans | Streaming k-means clustering | Streaming k-means clustering
svd | Lanczos singular value decomposition | Lanczos Singular Value Decomposition
testforest | Test the random forest classifier | Test the random forest classifier
testnb | Test the Bayes classifier | Test the Vector-based Bayes classifier
trainAdaptiveLogistic | Train an adaptive logistic regression model | Train an AdaptiveLogisticRegression model
trainlogistic | Train a logistic regression with stochastic gradient descent | Train a logistic regression using stochastic gradient descent
trainnb | Train the Bayes classifier | Train the Vector-based Bayes classifier
transpose | Transpose a matrix | Take the transpose of a matrix
validateAdaptiveLogistic | Validate an adaptive logistic regression model | Validate an AdaptiveLogisticRegression model against hold-out data set
vecdist | Compute vector distances | Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
vectordump | Dump vectors to a text file | Dump vectors from a sequence file to text
viterbi | Viterbi algorithm | Viterbi decoding of hidden states from given output states sequence

Some of the translated comments above may not be entirely accurate. I have not used every command, and actual use involves many details.

Mahout provides many algorithms for clustering, classification, and recommendation (collaborative filtering), all of which are useful for data analysis. Recommendation is currently the most mature of these: it has been applied in many real systems with good results. Clustering and classification, by comparison, are used in a narrower range of situations and need further study.

The previous articles in this series covered recommendation, from theory to practice. This one works through an example of a logistic regression model.


1. Data preparation

We use the iris data set, a widely used experimental data set that needs little introduction.

Open R and enter iris to see what the data looks like, then export it with the following command:

write.csv(iris, file = "D:/work_doc/doc/iris.csv")

The data looks like this:

"ID","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
"4",4.6,3.1,1.5,0.2,"setosa"

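To make the row format concrete, here is a minimal plain-Java sketch of turning one exported row into a feature array plus a numeric target. It is only an illustration: the class name IrisCsvSketch, its methods, and the category order are assumptions made for this example, and it is not how Mahout's CsvRecordFactory works internally (Mahout encodes the predictors into the vector itself rather than copying them by position).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hand-rolled stand-in for the row-to-vector step,
// not Mahout's CsvRecordFactory. All names here are hypothetical.
class IrisCsvSketch {

    // Assumed category order; a category's index in this list is its target value.
    static final List<String> CATEGORIES = Arrays.asList("setosa", "versicolor", "virginica");

    // Splits a simple CSV line on commas and strips surrounding double quotes.
    // Assumes no commas or escaped quotes inside fields, which holds for the rows above.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        for (String raw : line.split(",")) {
            String field = raw.trim();
            if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
                field = field.substring(1, field.length() - 1);
            }
            fields.add(field);
        }
        return fields;
    }

    // Fills 'features' with the measurements and returns the category index
    // (-1 if the species is not in CATEGORIES).
    static int processLine(String line, double[] features) {
        List<String> fields = splitLine(line);
        for (int i = 0; i < features.length; i++) {
            features[i] = Double.parseDouble(fields.get(i + 1)); // skip the ID column
        }
        String species = fields.get(fields.size() - 1);
        return CATEGORIES.indexOf(species);
    }

    public static void main(String[] args) {
        double[] features = new double[4];
        int target = processLine("\"1\",5.1,3.5,1.4,0.2,\"setosa\"", features);
        System.out.println(target + " " + Arrays.toString(features));
    }
}
```

The point of the sketch is the shape of the step: every row yields one numeric vector and one integer target, which is the pair a training call needs.
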
2. Working with the data in Java

import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.List;
import java.util.Locale;

import org.apache.commons.io.FileUtils;
import org.apache.mahout.classifier.sgd.CsvRecordFactory;
import org.apache.mahout.classifier.sgd.LogisticModelParameters;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

import com.google.common.base.Charsets;
import com.google.common.collect.Lists;

public class IrisLRTest {

    // Number of passes over the training data. The original value is missing
    // from the source; 100 is an assumed placeholder.
    private static final int TRAINING_PASSES = 100;

    private static LogisticModelParameters lmp;
    private static PrintWriter output;

    public static void main(String[] args) throws IOException {
        // Initialization
        lmp = new LogisticModelParameters();
        output = new PrintWriter(new OutputStreamWriter(System.out, Charsets.UTF_8), true);
        lmp.setLambda(0.001);
        lmp.setLearningRate(50);
        lmp.setMaxTargetCategories(3);
        lmp.setNumFeatures(4);
        // The three categories of the Species attribute
        List<String> targetCategories = Lists.newArrayList("setosa", "versicolor", "virginica");
        lmp.setTargetCategories(targetCategories);
        lmp.setTargetVariable("Species"); // the attribute to predict
        // The type of each predictor
        List<String> typeList = Lists.newArrayList("numeric", "numeric", "numeric", "numeric");
        // The name of each predictor; these must match the CSV header
        List<String> predictorList = Lists.newArrayList("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width");
        lmp.setTypeMap(predictorList, typeList);

        // Read the data
        List<String> raw = FileUtils.readLines(new File(
                "D:\\work_doc\\doc\\iris.csv"));
        String header = raw.get(0);
        List<String> content = raw.subList(1, raw.size());
        CsvRecordFactory csv = lmp.getCsvRecordFactory();
        csv.firstLine(header);

        // Train
        OnlineLogisticRegression lr = lmp.createRegression();
        for (int i = 0; i < TRAINING_PASSES; i++) {
            for (String line : content) {
                Vector input = new RandomAccessSparseVector(lmp.getNumFeatures());
                int targetValue = csv.processLine(line, input);
                lr.train(targetValue, input);
            }
        }

        // Evaluate the classification results (on the training data itself)
        double correctRate = 0;
        double sampleCount = content.size();

        for (String line : content) {
            Vector v = new SequentialAccessSparseVector(lmp.getNumFeatures());
            int target = csv.processLine(line, v);
            int score = lr.classifyFull(v).maxValueIndex(); // index of the most probable category
            System.out.println("target:" + target + "\treal:" + score);
            if (score == target) {
                correctRate++;
            }
        }
        output.printf(Locale.ENGLISH, "Rate = %.2f%n", correctRate / sampleCount);
    }

}

The code is commented, so the process should be easy to follow. This flow of preparing the data, training, and evaluating is not specific to this model: many other algorithms follow the same process and differ only in the specific training method and algorithm.

The code here is based on Mahout. R can fit many of the same models, and the basic steps are similar.
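To show the same train-then-evaluate flow without any Mahout dependency, here is a minimal plain-Java sketch of a binary logistic regression trained with stochastic gradient descent. It is an illustration under stated assumptions, not Mahout's OnlineLogisticRegression: the class SgdLogisticSketch, its methods, the fixed learning rate of 0.1, and the made-up two-feature data are all hypothetical, and it handles two classes rather than three.

```java
import java.util.Locale;

// Illustrative only: binary logistic regression trained with plain SGD.
// Not Mahout's implementation; all names and constants are assumptions.
class SgdLogisticSketch {

    final double[] weights;     // one weight per feature
    double bias;
    final double learningRate;

    SgdLogisticSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Probability that x belongs to class 1 (logistic function of a linear score).
    double classify(double[] x) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One SGD step on a single example; target is 0 or 1.
    void train(int target, double[] x) {
        double error = target - classify(x);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
        bias += learningRate * error;
    }

    // Trains a new model with the given number of passes over the data.
    static SgdLogisticSketch fit(double[][] xs, int[] ys, int passes) {
        SgdLogisticSketch model = new SgdLogisticSketch(xs[0].length, 0.1);
        for (int pass = 0; pass < passes; pass++) {
            for (int i = 0; i < xs.length; i++) {
                model.train(ys[i], xs[i]);
            }
        }
        return model;
    }

    // Fraction of examples whose predicted class matches the target.
    static double accuracy(SgdLogisticSketch model, double[][] xs, int[] ys) {
        int correct = 0;
        for (int i = 0; i < xs.length; i++) {
            int predicted = model.classify(xs[i]) >= 0.5 ? 1 : 0;
            if (predicted == ys[i]) {
                correct++;
            }
        }
        return (double) correct / xs.length;
    }

    public static void main(String[] args) {
        // Made-up, linearly separable data: class 0 has negative features, class 1 positive.
        double[][] xs = {{-2, -1}, {-1, -2}, {-1.5, -1}, {2, 1}, {1, 2}, {1.5, 1}};
        int[] ys = {0, 0, 0, 1, 1, 1};
        SgdLogisticSketch model = fit(xs, ys, 50);
        System.out.printf(Locale.ENGLISH, "Rate = %.2f%n", accuracy(model, xs, ys));
    }
}
```

As in the Mahout example, the rate here is measured on the same rows the model was trained on, so it says how well the model fits that data rather than how it would do on unseen data. Holding some rows out for evaluation is the usual way to estimate the latter.
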



