Last time we looked at Mahout's computational module, Mahout Math, which contains many commonly used mathematical and statistical operations; a good grasp of those foundations is worth having. Mahout also ships a number of command-line tools. All of the commands are listed below. The list changes between versions, and each command takes its own parameters, though many of them share similar options, so familiarity with one carries over to the others. Skimming the list gives a good sense of what Mahout can do out of the box and serves as a handy reference:
| Command | Description |
| --- | --- |
| arff.vector | Generate Vectors from an ARFF file or directory |
| baumwelch | Baum-Welch algorithm for unsupervised HMM training |
| buildforest | Build the random forest classifier |
| canopy | Canopy clustering |
| cat | Print a file or resource as the logistic regression models would see it |
| cleansvd | Cleanup and verification of SVD output |
| clusterdump | Dump cluster output to text |
| clusterpp | Groups clustering output in clusters |
| cmdump | Dump confusion matrix in HTML or text formats |
| concatmatrices | Concatenates 2 matrices of same cardinality into a single matrix |
| cvb | LDA via Collapsed Variation Bayes (0th deriv. approx) |
| cvb0_local | LDA via Collapsed Variation Bayes, in memory locally |
| describe | Describe the fields and target variable in a data set |
| evaluateFactorization | Compute RMSE and MAE of a rating matrix factorization against probes |
| fkmeans | Fuzzy K-means clustering |
| hmmpredict | Generate random sequence of observations by given HMM |
| itemsimilarity | Compute the item-item-similarities for item-based collaborative filtering |
| kmeans | K-means clustering |
| lucene.vector | Generate Vectors from a Lucene index |
| lucene2seq | Generate Text SequenceFiles from a Lucene index |
| matrixdump | Dump matrix in CSV format |
| matrixmult | Take the product of two matrices |
| parallelALS | ALS-WR factorization of a rating matrix |
| qualcluster | Runs clustering experiments and summarizes results in a CSV |
| recommendfactorized | Compute recommendations using the factorization of a rating matrix |
| recommenditembased | Compute recommendations using item-based collaborative filtering |
| regexconverter | Convert text files on a per line basis based on regular expressions |
| resplit | Splits a set of SequenceFiles into a number of equal splits |
| rowid | Map SequenceFile&lt;Text,VectorWritable&gt; to {SequenceFile&lt;IntWritable,VectorWritable&gt;, SequenceFile&lt;IntWritable,Text&gt;} |
| rowsimilarity | Compute the pairwise similarities of the rows of a matrix |
| runAdaptiveLogistic | Score new production data using a probably trained and validated AdaptiveLogisticRegression model |
| runlogistic | Run a logistic regression model against CSV data |
| seq2encoded | Encoded Sparse Vector generation from Text sequence files |
| seq2sparse | Sparse Vector generation from Text sequence files |
| seqdirectory | Generate sequence files (of Text) from a directory |
| seqdumper | Generic Sequence File dumper |
| seqmailarchives | Creates SequenceFile from a directory containing gzipped mail archives |
| seqwiki | Wikipedia XML dump to sequence file |
| spectralkmeans | Spectral k-means clustering |
| split | Split input data into test and train sets |
| splitDataset | Split a rating dataset into training and probe parts |
| ssvd | Stochastic SVD |
| streamingkmeans | Streaming k-means clustering |
| svd | Lanczos Singular Value Decomposition |
| testforest | Test the random forest classifier |
| testnb | Test the Vector-based Bayes classifier |
| trainAdaptiveLogistic | Train an AdaptiveLogisticRegression model |
| trainlogistic | Train a logistic regression using stochastic gradient descent |
| trainnb | Train the Vector-based Bayes classifier |
| transpose | Take the transpose of a matrix |
| validateAdaptiveLogistic | Validate an AdaptiveLogisticRegression model against hold-out data set |
| vecdist | Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors |
| vectordump | Dump vectors from a sequence file to text |
| viterbi | Viterbi decoding of hidden states from given output states sequence |
The descriptions above come from Mahout's built-in help and may gloss over details; I have not used every one of these commands, and each has many usage details of its own.
Mahout provides algorithms for clustering, classification, and recommendation (collaborative filtering), which is a real help for data analysis. At present the most mature area is recommendation, which has seen production use in many systems with good results; clustering and classification are, relatively speaking, applicable in more limited situations and need further study.
The previous posts analyzed recommenders, from theory to practice; here is an example of a logistic regression model.
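For background on what happens inside the training call in the code below: multinomial logistic regression scores each class with a linear function, turns the scores into probabilities with a softmax, and nudges the weights toward the observed class by stochastic gradient descent. A minimal plain-Java sketch of that update (independent of Mahout; the class name, learning rate, and all identifiers here are illustrative, not Mahout's API):

```java
import java.util.Arrays;

public class SoftmaxSketch {
    // beta[k][j]: weight of feature j for class k (3 classes, 4 features, as in iris)
    static double[][] beta = new double[3][4];
    static double rate = 0.1; // learning rate (illustrative value)

    // Softmax over the linear scores beta_k . x
    static double[] predict(double[] x) {
        double[] p = new double[beta.length];
        double sum = 0;
        for (int k = 0; k < beta.length; k++) {
            double s = 0;
            for (int j = 0; j < x.length; j++) s += beta[k][j] * x[j];
            p[k] = Math.exp(s);
            sum += p[k];
        }
        for (int k = 0; k < p.length; k++) p[k] /= sum;
        return p;
    }

    // One SGD step: move each class's weights along (indicator - probability) * x
    static void train(int target, double[] x) {
        double[] p = predict(x);
        for (int k = 0; k < beta.length; k++) {
            double grad = (k == target ? 1.0 : 0.0) - p[k];
            for (int j = 0; j < x.length; j++) beta[k][j] += rate * grad * x[j];
        }
    }

    public static void main(String[] args) {
        double[] x = {5.1, 3.5, 1.4, 0.2}; // one iris sample
        for (int i = 0; i < 50; i++) train(0, x); // repeatedly present class 0
        System.out.println(Arrays.toString(predict(x)));
        // after training, the probability of class 0 dominates
    }
}
```

Mahout's `OnlineLogisticRegression` layers regularization (lambda), learning-rate annealing, and sparse vectors on top of this basic update.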
1. Data preparation
We use the iris data set, one of the most commonly used experimental data sets. Open R and type iris to see what the data looks like, then export it with the following command:
write.csv(iris, file="D:/work_doc/doc/iris.csv")
The data looks like this:
"ID","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
"4",4.6,3.1,1.5,0.2,"setosa"
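In the Java example below, Mahout's `CsvRecordFactory.processLine` does the work of turning one such CSV line into a feature vector plus a target-class index. In spirit it amounts to something like this plain-Java sketch (a hypothetical helper, not Mahout code; the naive comma split is enough for this particular data):

```java
import java.util.Arrays;
import java.util.List;

public class CsvLineSketch {
    static final List<String> TARGETS = Arrays.asList("setosa", "versicolor", "virginica");

    // Mirrors the shape of csv.processLine(line, vector): fills `features`
    // with the four measurements and returns the index of the target class.
    static int processLine(String line, double[] features) {
        String[] cols = line.split(","); // naive split; fine for this data set
        for (int j = 0; j < features.length; j++) {
            features[j] = Double.parseDouble(cols[j + 1]); // skip the id column
        }
        String species = cols[cols.length - 1].replace("\"", "");
        return TARGETS.indexOf(species);
    }

    public static void main(String[] args) {
        double[] v = new double[4];
        int target = processLine("\"1\",5.1,3.5,1.4,0.2,\"setosa\"", v);
        System.out.println(target + " " + Arrays.toString(v));
        // prints: 0 [5.1, 3.5, 1.4, 0.2]
    }
}
```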
2. Working through it in Java code
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.List;
import java.util.Locale;

import org.apache.commons.io.FileUtils;
import org.apache.mahout.classifier.sgd.CsvRecordFactory;
import org.apache.mahout.classifier.sgd.LogisticModelParameters;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

import com.google.common.base.Charsets;
import com.google.common.collect.Lists;

public class IrisLRTest {

    private static LogisticModelParameters lmp;
    private static PrintWriter output;

    public static void main(String[] args) throws IOException {
        // Model parameters
        lmp = new LogisticModelParameters();
        output = new PrintWriter(new OutputStreamWriter(System.out, Charsets.UTF_8), true);
        lmp.setLambda(0.001);
        lmp.setLearningRate(50);
        lmp.setMaxTargetCategories(3);
        lmp.setNumFeatures(4);
        // The three categories of the Species attribute
        List<String> targetCategories = Lists.newArrayList("setosa", "versicolor", "virginica");
        lmp.setTargetCategories(targetCategories);
        lmp.setTargetVariable("Species"); // the attribute to predict
        // The type of each predictor attribute
        List<String> typeList = Lists.newArrayList("numeric", "numeric", "numeric", "numeric");
        // The names of the predictor attributes
        List<String> predictorList = Lists.newArrayList("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width");
        lmp.setTypeMap(predictorList, typeList);

        // Read the data
        List<String> raw = FileUtils.readLines(new File("D:\\work_doc\\doc\\iris.csv"));
        String header = raw.get(0);
        List<String> content = raw.subList(1, raw.size());
        CsvRecordFactory csv = lmp.getCsvRecordFactory();
        csv.firstLine(header);

        // Training
        OnlineLogisticRegression lr = lmp.createRegression();
        for (int i = 0; i < 100; i++) { // training passes (the original count was garbled; 100 is illustrative)
            for (String line : content) {
                Vector input = new RandomAccessSparseVector(lmp.getNumFeatures());
                int targetValue = csv.processLine(line, input);
                lr.train(targetValue, input);
            }
        }

        // Evaluate the classification results
        double correctRate = 0;
        double sampleCount = content.size();
        for (String line : content) {
            Vector v = new SequentialAccessSparseVector(lmp.getNumFeatures());
            int target = csv.processLine(line, v);
            int score = lr.classifyFull(v).maxValueIndex();
            System.out.println("target: " + target + "\treal: " + score);
            if (score == target) {
                correctRate++;
            }
        }
        output.printf(Locale.ENGLISH, "Rate = %.2f%n", correctRate / sampleCount);
    }
}
The code is commented and the flow is easy to follow. It is not just this model that works this way: many other algorithms follow the same overall process, differing in the specific training methods and algorithms. And of course this code is based on Mahout; many of the same models can be built in R, with broadly similar steps.
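One caveat: the example above measures accuracy on the same rows it was trained on, so the rate it prints is optimistic. For an honest estimate you would first hold part of the data out, which is exactly what the `split` command in the table does. A plain-Java sketch of that idea (illustrative names; the 80/20 ratio is an arbitrary choice):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HoldoutSketch {
    // Shuffle the lines and split them ~80/20 into train and test parts,
    // the same idea as Mahout's `split` command.
    static List<List<String>> split(List<String> content, long seed) {
        List<String> shuffled = new ArrayList<>(content);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * 0.8);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 150; i++) lines.add("row" + i); // iris has 150 rows
        List<List<String>> parts = split(lines, 42L);
        System.out.println(parts.get(0).size() + " train, " + parts.get(1).size() + " test");
        // prints: 120 train, 30 test
    }
}
```

You would then train only on the first part and compute the accuracy loop only over the second.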