Mahout item-based collaborative filtering source code analysis (3): RowSimilarityJob

Source: Internet
Author: User

Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.

This article checks whether the earlier analysis was correct, mainly by writing code that reads the final output files and by adding log statements that print the relevant variables.

First, write the following test file to analyze all the output:

package mahout.fansy.item;

import java.io.IOException;
import java.util.Map;

import junit.framework.TestCase;

import mahout.fansy.utils.read.ReadArbiKV;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.Vectors;

public class ReadRowSimilarityJobOut extends TestCase {

  // Test the weights output
  public void testWeights() throws IOException {
    String path = "hdfs://ubuntu:9000/user/mahout/item/temp/weights/part-r-00000";
    Map<Writable, Writable> map = ReadArbiKV.readFromFile(path);
    System.out.println("weights=================");
    System.out.println(map);
  }

  // normsPath
  public void testNormsPath() throws IOException {
    String path = "hdfs://ubuntu:9000/user/mahout/item/temp/norms.bin";
    Vector map = getVector(path);
    System.out.println("normspath=================");
    System.out.println(map);
  }

  // maxValues.bin
  public void testMaxValues() throws IOException {
    String path = "hdfs://ubuntu:9000/user/mahout/item/temp/maxValues.bin";
    Vector map = getVector(path);
    System.out.println("maxvalues=================");
    System.out.println(map);
  }

  // numNonZeroEntries.bin
  public void testNumNonZeroEntries() throws IOException {
    String path = "hdfs://ubuntu:9000/user/mahout/item/temp/numNonZeroEntries.bin";
    Vector map = getVector(path);
    System.out.println("numnonzeroentries=================");
    System.out.println(map);
  }

  // pairwiseSimilarityPath
  public void testPairwiseSimilarityPath() throws IOException {
    String path = "hdfs://ubuntu:9000/user/mahout/item/temp/pairwiseSimilarity/part-r-00000";
    Map<Writable, Writable> map = ReadArbiKV.readFromFile(path);
    System.out.println("pairwisesimilaritypath=================");
    System.out.println(map);
  }

  // similarityMatrix
  public void testSimilarityMatrix() throws IOException {
    String path = "hdfs://ubuntu:9000/user/mahout/item/temp/similarityMatrix/part-r-00000";
    Map<Writable, Writable> map = ReadArbiKV.readFromFile(path);
    System.out.println("similaritymatrix=================");
    System.out.println(map);
  }

  // Read a .bin file that holds a single vector
  public Vector getVector(String path) {
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "ubuntu:9001");
    Vector vector = null;
    try {
      vector = Vectors.read(new Path(path), conf);
    } catch (IOException e) {
      e.printStackTrace();
    }
    return vector;
  }
}
Run the above file to get the following output:

weights=================
{1={103:2.5,102:3.0,101:5.0},
2={101:2.0,104:2.0,103:5.0,102:2.5},
3={101:2.5,107:5.0,105:4.5,104:4.0},
4={101:5.0,106:4.0,104:4.5,103:3.0},
5={106:4.0,105:3.5,104:4.0,103:2.0,102:3.0,101:4.0}}
normspath=================
{107:25.0,106:32.0,105:32.5,104:56.25,103:44.25,102:24.25,101:76.25}
maxvalues=================
{}
numnonzeroentries=================
{}
pairwisesimilaritypath=================
{102={106:0.14972506706560876,105:0.14328432723886902,104:0.12789210656028413,103:0.1975496259559987},
103={106:0.1424339656566283,105:0.11208890297777215,104:0.14037600977966974},
101={107:0.10275248635596666,106:0.1424339656566283,105:0.1158457425543559,104:0.16015261286229274,103:0.15548737703860027,102:0.14201473202245876},
106={},
107={},
104={107:0.13472338607037426,106:0.18181818181818182,105:0.16736577623297264},
105={107:0.2204812092115424,106:0.14201473202245876}}
similaritymatrix=================
{102={101:0.14201473202245876,106:0.14972506706560876,105:0.14328432723886902,104:0.12789210656028413,103:0.1975496259559987},
103={101:0.15548737703860027,106:0.1424339656566283,105:0.11208890297777215,104:0.14037600977966974,102:0.1975496259559987},
101={107:0.10275248635596666,106:0.1424339656566283,105:0.1158457425543559,104:0.16015261286229274,103:0.15548737703860027,102:0.14201473202245876},
106={101:0.1424339656566283,105:0.14201473202245876,104:0.18181818181818182,103:0.1424339656566283,102:0.14972506706560876},
107={105:0.2204812092115424,104:0.13472338607037426,101:0.10275248635596666},
104={107:0.13472338607037426,106:0.18181818181818182,105:0.16736577623297264,103:0.14037600977966974,102:0.12789210656028413,101:0.16015261286229274},
105={107:0.2204812092115424,106:0.14201473202245876,104:0.16736577623297264,103:0.11208890297777215,102:0.14328432723886902,101:0.1158457425543559}}
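As a quick cross-check of the output above, the normspath values look like the sum of squared ratings for each index 101 to 107 across the weights vectors. The sketch below recomputes them in plain Java without Mahout or Hadoop. The class name NormsCheck and the helpers row, sampleWeights and squaredNorms are made up for this illustration, and the sum-of-squares reading is an assumption inferred from the numbers, not taken from the Mahout source.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class NormsCheck {

  // Builds one sparse vector from alternating (index, value) arguments.
  static Map<Integer, Double> row(double... pairs) {
    Map<Integer, Double> r = new LinkedHashMap<>();
    for (int i = 0; i < pairs.length; i += 2) {
      r.put((int) pairs[i], pairs[i + 1]);
    }
    return r;
  }

  // The weights output printed in the article: outer key -> (index -> rating).
  static Map<Integer, Map<Integer, Double>> sampleWeights() {
    Map<Integer, Map<Integer, Double>> w = new LinkedHashMap<>();
    w.put(1, row(103, 2.5, 102, 3.0, 101, 5.0));
    w.put(2, row(101, 2.0, 104, 2.0, 103, 5.0, 102, 2.5));
    w.put(3, row(101, 2.5, 107, 5.0, 105, 4.5, 104, 4.0));
    w.put(4, row(101, 5.0, 106, 4.0, 104, 4.5, 103, 3.0));
    w.put(5, row(106, 4.0, 105, 3.5, 104, 4.0, 103, 2.0, 102, 3.0, 101, 4.0));
    return w;
  }

  // Assumption: each norm is the sum of squared ratings for that index.
  static Map<Integer, Double> squaredNorms(Map<Integer, Map<Integer, Double>> weights) {
    Map<Integer, Double> norms = new TreeMap<>();
    for (Map<Integer, Double> ratings : weights.values()) {
      for (Map.Entry<Integer, Double> e : ratings.entrySet()) {
        norms.merge(e.getKey(), e.getValue() * e.getValue(), Double::sum);
      }
    }
    return norms;
  }

  public static void main(String[] args) {
    System.out.println(squaredNorms(sampleWeights()));
  }
}
```

Under that assumption the recomputed values are 76.25, 24.25, 44.25, 56.25, 32.5, 32.0 and 25.0 for 101 through 107, which agrees with the normspath output.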
The first of these, weights, matches the earlier analysis exactly, so it is not discussed again here. That leaves only pairwiseSimilarityPath and similarityMatrix to analyze:

(1) pairwiseSimilarityPath:

The previous article's analysis of the last reducer was wrong, or rather unfinished, as shown below (the screenshot is the variable information printed with log statements):

You can see that the previous article only analyzed the second line (the second and third rows), not the final output. What it left out is just one while loop:

while (dotsWith.hasNext()) {
        Vector.Element b = dotsWith.next();
        double similarityValue = similarity.similarity(b.get(), normA, norms.getQuick(b.index()), numberOfColumns);
        if (similarityValue >= treshold) {
          similarities.set(b.index(), similarityValue);
        }
      }
Here is how the values in the fourth row are computed from the values in the second row. First, normA is the value for 102 in norms, that is, 24.25. Next, look at the similarity function:

public double similarity(double dots, double normA, double normB, int numberOfColumns) {
    double euclideanDistance = Math.sqrt(normA - 2 * dots + normB);
    return 1.0 / (1.0 + euclideanDistance);
  }
For item 106 the call should therefore be similarity(12.0, 24.25, 32.0, 5), which returns 1/(1+sqrt(24.25-2*12+32)) = 0.149725067 and matches the value in the fourth row. The final output for row 102 has no entry for 102 itself because similarities.setQuick(row.get(), 0) sets that entry to 0, so it is not written out.
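The arithmetic above can be reproduced without a cluster. The sketch below takes the ratings for 102 and 106 from the weights output, computes their dot product and squared norms, and applies the same formula as the similarity function quoted above. The class name EuclideanCheck and the helpers dot, item102 and item106 are made up for this illustration; only the formula comes from the article.

```java
import java.util.HashMap;
import java.util.Map;

class EuclideanCheck {

  // Same formula as the similarity function quoted in the article.
  static double similarity(double dots, double normA, double normB, int numberOfColumns) {
    double euclideanDistance = Math.sqrt(normA - 2 * dots + normB);
    return 1.0 / (1.0 + euclideanDistance);
  }

  // Dot product of two sparse vectors stored as index -> value maps.
  static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        sum += e.getValue() * other;
      }
    }
    return sum;
  }

  // Ratings for 102, read off the weights output (keys 1, 2 and 5).
  static Map<Integer, Double> item102() {
    Map<Integer, Double> v = new HashMap<>();
    v.put(1, 3.0);
    v.put(2, 2.5);
    v.put(5, 3.0);
    return v;
  }

  // Ratings for 106, read off the weights output (keys 4 and 5).
  static Map<Integer, Double> item106() {
    Map<Integer, Double> v = new HashMap<>();
    v.put(4, 4.0);
    v.put(5, 4.0);
    return v;
  }

  public static void main(String[] args) {
    Map<Integer, Double> a = item102();
    Map<Integer, Double> b = item106();
    double dots = dot(a, b);   // only key 5 overlaps: 3.0 * 4.0
    double normA = dot(a, a);  // squared norm of 102
    double normB = dot(b, b);  // squared norm of 106
    System.out.println("dots=" + dots + " normA=" + normA + " normB=" + normB);
    System.out.println("similarity=" + similarity(dots, normA, normB, 5));
  }
}
```

The intermediate values come out as dots = 12.0, normA = 24.25 and normB = 32.0, and the resulting similarity should agree with the 0.14972506706560876 listed for 102 and 106 in the pairwisesimilaritypath output.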

(2) similarityMatrix

From the analysis in (1), the input to (2) is:

{102={106:0.14972506706560876,105:0.14328432723886902,104:0.12789210656028413,103:0.1975496259559987},
103={106:0.1424339656566283,105:0.11208890297777215,104:0.14037600977966974},
101={107:0.10275248635596666,106:0.1424339656566283,105:0.1158457425543559,104:0.16015261286229274,103:0.15548737703860027,102:0.14201473202245876},
106={},
107={},
104={107:0.13472338607037426,106:0.18181818181818182,105:0.16736577623297264},
105={107:0.2204812092115424,106:0.14201473202245876}}
The mapper analysis for this job was correct, but the analysis of the merge method used in the combiner was not. The code for merge is as follows:

public static Vector merge(Iterable<VectorWritable> partialVectors) {
    Iterator<VectorWritable> vectors = partialVectors.iterator();
    Vector accumulator = vectors.next().get();
    while (vectors.hasNext()) {
      VectorWritable v = vectors.next();
      if (v != null) {
        Iterator<Vector.Element> nonZeroElements = v.get().iterateNonZero();
        while (nonZeroElements.hasNext()) {
          Vector.Element nonZeroElement = nonZeroElements.next();
          accumulator.setQuick(nonZeroElement.index(), nonZeroElement.get());
        }
      }
    }
    return accumulator;
  }
What this code does is gather all the values that share the same key into a single vector. The log output is as follows:

First, the output of the map (keys 101~103):

(Keys 104~107):

Output of the combiner:

With this output in view, what the combiner does becomes easy to understand.
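The merge step can be imitated on plain Java maps. The sketch below assumes that the mapper emits each input row under its own key and, for every entry in that row, a one-element vector under the entry's index. That assumption is inferred from how the input of (2) differs from the similaritymatrix output and is not confirmed by the code shown in this article. The class name MergeSketch and the helpers emit and symmetrize are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MergeSketch {

  // Mirrors the quoted merge: start from the first partial vector and copy in
  // every entry of the others. Unlike the original, this copies the first
  // vector instead of modifying it in place.
  static Map<Integer, Double> merge(List<Map<Integer, Double>> partials) {
    Map<Integer, Double> accumulator = new TreeMap<>(partials.get(0));
    for (int i = 1; i < partials.size(); i++) {
      accumulator.putAll(partials.get(i));
    }
    return accumulator;
  }

  // Hypothetical stand-in for the mapper: emit each row under its own key,
  // plus one single-entry vector per element under that element's index.
  static Map<Integer, List<Map<Integer, Double>>> emit(Map<Integer, Map<Integer, Double>> rows) {
    Map<Integer, List<Map<Integer, Double>>> grouped = new TreeMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
      grouped.computeIfAbsent(row.getKey(), k -> new ArrayList<>())
             .add(new TreeMap<>(row.getValue()));
      for (Map.Entry<Integer, Double> e : row.getValue().entrySet()) {
        Map<Integer, Double> transposed = new TreeMap<>();
        transposed.put(row.getKey(), e.getValue());
        grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(transposed);
      }
    }
    return grouped;
  }

  // Groups the emitted vectors by key and merges each group.
  static Map<Integer, Map<Integer, Double>> symmetrize(Map<Integer, Map<Integer, Double>> rows) {
    Map<Integer, Map<Integer, Double>> result = new TreeMap<>();
    for (Map.Entry<Integer, List<Map<Integer, Double>>> g : emit(rows).entrySet()) {
      result.put(g.getKey(), merge(g.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    // A three-row excerpt of the pairwise similarities from the article.
    Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();
    rows.put(102, new TreeMap<>(Map.of(103, 0.1975496259559987, 104, 0.12789210656028413)));
    rows.put(103, new TreeMap<>(Map.of(104, 0.14037600977966974)));
    rows.put(104, new TreeMap<>());
    System.out.println(symmetrize(rows));
  }
}
```

On this excerpt, row 104 starts out empty and ends up with entries for 102 and 103, in the same way that 106 and 107 are empty in the input of (2) but filled in the similaritymatrix output.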

Finally, the reducer's job is to sort the output of the combiner:

Judging from the log output above, however, that does not seem to be what happens. I have not looked closely at the Vectors.topKElements method; it probably does something different from what I guessed. I will leave that for next time.


Share, grow, be happy

When reprinting, please cite the blog address: http://blog.csdn.net/fansy1990

