Mahout Source Code Analysis: MeanShiftCanopyDriver (3) — MeanShiftCanopyReducer Data Logic Flow


First, here is the imitation code for MeanShiftCanopyReducer:

package mahout.fansy.meanshift;

import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.iterator.ClusterWritable;
import org.apache.mahout.clustering.meanshift.MeanShiftCanopy;
import org.apache.mahout.clustering.meanshift.MeanShiftCanopyClusterer;
import org.apache.mahout.clustering.meanshift.MeanShiftCanopyConfigKeys;

import com.google.common.collect.Lists;

/**
 * MeanShiftCanopyReducer imitation code
 * @author fansy
 */
public class MeanShiftCanopyReducerFollow {

  private static int convergedClusters = 0;
  private static boolean allConverged = true;

  public static void main(String[] args) {
    // cleanup();  // debug the cleanup function
    reduce();      // debug the reduce function
  }

  /**
   * Imitation of the reduce operation
   */
  public static Map<Text, Collection<ClusterWritable>> reduce() {
    Collection<MeanShiftCanopy> canopies = Lists.newArrayList();
    // obtain the map output
    Collection<ClusterWritable> values =
        MeanShiftCanopyMapperFollow.cleanup().get(new Text("0"));
    MeanShiftCanopyClusterer clusterer = setup();
    Collection<ClusterWritable> v = Lists.newArrayList();
    for (ClusterWritable clusterWritable : values) {
      MeanShiftCanopy canopy = (MeanShiftCanopy) clusterWritable.getValue();
      clusterer.mergeCanopy(canopy.shallowCopy(), canopies);
    }
    Map<Text, Collection<ClusterWritable>> map =
        new HashMap<Text, Collection<ClusterWritable>>();
    for (MeanShiftCanopy canopy : canopies) {
      boolean converged = clusterer.shiftToMean(canopy);
      if (converged) {
        // System.out.println("Clustering converged clusters: " + convergedClusters++);
      }
      allConverged = converged && allConverged;
      ClusterWritable clusterWritable = new ClusterWritable();
      clusterWritable.setValue(canopy);
      v.add(clusterWritable);
      map.put(new Text(canopy.getIdentifier()), v);
      // System.out.println("key: " + canopy.getIdentifier()
      //     + ", value: " + clusterWritable.getValue().toString());
    }
    return map;
  }

  /**
   * Imitation of the setup function; directly calls the MapperFollow method
   * @return the MeanShiftCanopyClusterer
   */
  public static MeanShiftCanopyClusterer setup() {
    return MeanShiftCanopyMapperFollow.setup();
  }

  /**
   * Imitation of the cleanup function
   * @throws IOException
   */
  public static void cleanup() throws IOException {
    // int numReducers = 1;  // set it yourself; here it is 1
    Configuration conf = getConf();
    // determine whether all canopies meet the convergence criterion;
    // if so, create the control file
    if (allConverged) {
      Path path = new Path(conf.get(MeanShiftCanopyConfigKeys.CONTROL_PATH_KEY));
      FileSystem.get(path.toUri(), conf).createNewFile(path);
    }
  }

  /**
   * Build the Configuration used above
   * @return conf
   */
  public static Configuration getConf() {
    String measureClassName = "org.apache.mahout.common.distance.EuclideanDistanceMeasure";
    String kernelProfileClassName = "org.apache.mahout.common.kernel.TriangularKernelProfile";
    double convergenceDelta = 0.5;
    double t1 = 47.6;
    double t2 = 1;
    boolean runClustering = true;
    Configuration conf = new Configuration();
    conf.set(MeanShiftCanopyConfigKeys.DISTANCE_MEASURE_KEY, measureClassName);
    conf.set(MeanShiftCanopyConfigKeys.KERNEL_PROFILE_KEY, kernelProfileClassName);
    conf.set(MeanShiftCanopyConfigKeys.CLUSTER_CONVERGENCE_KEY, String.valueOf(convergenceDelta));
    conf.set(MeanShiftCanopyConfigKeys.T1_KEY, String.valueOf(t1));
    conf.set(MeanShiftCanopyConfigKeys.T2_KEY, String.valueOf(t2));
    conf.set(MeanShiftCanopyConfigKeys.CLUSTER_POINTS_KEY, String.valueOf(runClustering));
    return conf;
  }

  /**
   * Get the map output data, i.e. the canopies
   * @return Map<Text, Collection<ClusterWritable>> canopies
   */
  public static Map<Text, Collection<ClusterWritable>> getMapData() {
    return MeanShiftCanopyMapperFollow.cleanup();
  }
}

The setup function is the same as in the mapper, and the cleanup function only creates a control file once the convergence criterion is met, so we won't dwell on them. The focus of this post is the reduce function (whose main body is essentially the mapper's map + cleanup logic combined).

The first three records output by map are as follows:

MSC-0{n=100 c=[29.942, 30.443, 30.325, 30.018, 29.887, 29.777, 29.855, 29.883, 30.128, 29.984, 29.796, 29.845, 30.436, 29.729, 29.890, 29.518, 29.546, 30.052, 30.077, 30.001, 29.837, 29.928, 30.288, 30.347, 29.785, 29.799, 29.651, 30.008, 29.938, 30.104, 29.997, 29.684, 29.949, 29.754, 30.272, 30.106, 29.883, 30.221, 29.847, 29.848, 29.843, 30.577, 29.870, 29.785, 29.923, 29.864, 30.184, 29.977, 30.321, 30.068, 30.570, 30.224, 30.240, 29.969, 30.246, 30.544, 29.862, 30.099, 29.907, 30.169] r=[3.384, 3.383, 3.494, 3.523, 3.308, 3.605, 3.315, 3.518, 3.472, 3.519, 3.350, 3.444, 3.273, 3.274, 3.400, 3.443, 3.426, 3.499, 3.154, 3.506, 3.509, 3.436, 3.484, 3.475, 3.360, 3.164, 3.460, 3.491, 3.608, 3.484, 3.477, 3.748, 3.628, 3.378, 3.327, 3.600, 3.455, 3.562, 3.534, 3.566, 3.213, 3.645, 3.615, 3.274, 3.197, 3.373, 3.595, 3.452, 3.609, 3.518, 3.262, 3.477, 3.755, 3.830, 3.494, 3.676, 3.423, 3.491, 3.641, 3.374]}
MSC-1{n=101 c=[29.890, 30.422, 30.280, 30.046, 29.891, 29.805, 29.828, 29.875, 30.133, 30.035, 29.773, 29.900, 30.441, 29.751, 29.906, 29.490, 29.508, 30.013, 30.082, 30.049, 29.815, 29.934, 30.286, 30.294, 29.828, 29.831, 29.712, 30.005, 29.977, 30.128, 30.015, 29.675, 29.963, 29.766, 30.259, 30.095, 29.855, 30.139, 29.704, 29.797, 29.808, 30.530, 29.743, 29.745, 29.883, 29.741, 30.140, 29.935, 30.271, 29.934, 30.437, 30.184, 30.180, 29.823, 30.146, 30.494, 29.767, 30.061, 29.854, 30.130] r=[3.407, 3.373, 3.506, 3.517, 3.292, 3.598, 3.310, 3.502, 3.455, 3.538, 3.341, 3.471, 3.257, 3.265, 3.387, 3.437, 3.430, 3.504, 3.139, 3.522, 3.499, 3.419, 3.466, 3.497, 3.371, 3.165, 3.496, 3.474, 3.610, 3.475, 3.464, 3.730, 3.613, 3.363, 3.313, 3.584, 3.449, 3.639, 3.797, 3.585, 3.215, 3.658, 3.818, 3.282, 3.205, 3.573, 3.605, 3.460, 3.626, 3.748, 3.507, 3.482, 3.784, 4.079, 3.616, 3.692, 3.535, 3.495, 3.663, 3.380]}
MSC-2{n=100 c=[29.942, 30.443, 30.325, 30.018, 29.887, 29.777, 29.855, 29.883, 30.128, 29.984, 29.796, 29.845, 30.436, 29.729, 29.890, 29.518, 29.546, 30.052, 30.077, 30.001, 29.837, 29.928, 30.288, 30.347, 29.785, 29.799, 29.651, 30.008, 29.938, 30.104, 29.997, 29.684, 29.949, 29.754, 30.272, 30.106, 29.883, 30.221, 29.847, 29.848, 29.843, 30.577, 29.870, 29.785, 29.923, 29.864, 30.184, 29.977, 30.321, 30.068, 30.570, 30.224, 30.240, 29.969, 30.246, 30.544, 29.862, 30.099, 29.907, 30.169] r=[3.384, 3.383, 3.494, 3.523, 3.308, 3.605, 3.315, 3.518, 3.472, 3.519, 3.350, 3.444, 3.273, 3.274, 3.400, 3.443, 3.426, 3.499, 3.154, 3.506, 3.509, 3.436, 3.484, 3.475, 3.360, 3.164, 3.460, 3.491, 3.608, 3.484, 3.477, 3.748, 3.628, 3.378, 3.327, 3.600, 3.455, 3.562, 3.534, 3.566, 3.213, 3.645, 3.615, 3.274, 3.197, 3.373, 3.595, 3.452, 3.609, 3.518, 3.262, 3.477, 3.755, 3.830, 3.494, 3.676, 3.423, 3.491, 3.641, 3.374]}

With the preparation done, execution goes straight to clusterer.mergeCanopy(canopy.shallowCopy(), canopies). The analysis here is the same as for the mapper: the first input record behaves identically, but the second differs. The norm between the second record and canopies.get(1) is 0.44, which is less than both T1 and T2, so execution enters:

if (norm < t2 && (closestCoveringCanopy == null || norm < closestNorm)) {
  closestNorm = norm;
  closestCoveringCanopy = canopy;
}

After the loop, closestCoveringCanopy is therefore non-null, so execution takes the else branch rather than the if branch below:

if (closestCoveringCanopy == null) {
  canopies.add(aCanopy);
} else {
  closestCoveringCanopy.merge(aCanopy, runClustering);
}
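Putting the two snippets above together, the decision logic of mergeCanopy can be sketched as follows. This is a simplified, one-dimensional stand-in for illustration only: the real Mahout code works on Vector centers with a pluggable DistanceMeasure, and the T1 branch (which calls touch, discussed below) is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of MeanShiftCanopyClusterer.mergeCanopy's decision logic.
// A canopy here is just a 1-D center plus a mass counter (an assumption for
// brevity); the Canopy class and mergeCanopy signature are illustrative.
public class MergeCanopySketch {
  static final double T2 = 1.0;  // inner threshold, as in the driver's configuration

  static class Canopy {
    double center;
    int mass = 1;
    Canopy(double center) { this.center = center; }
  }

  // Returns the index of the canopy the new one was merged into, or -1 if added as new.
  static int mergeCanopy(Canopy aCanopy, List<Canopy> canopies) {
    Canopy closestCoveringCanopy = null;
    double closestNorm = Double.MAX_VALUE;
    for (Canopy canopy : canopies) {
      double norm = Math.abs(canopy.center - aCanopy.center); // stand-in for the distance measure
      if (norm < T2 && (closestCoveringCanopy == null || norm < closestNorm)) {
        closestNorm = norm;
        closestCoveringCanopy = canopy;
      }
    }
    if (closestCoveringCanopy == null) {
      canopies.add(aCanopy);   // no canopy within T2: start a new one
      return -1;
    }
    closestCoveringCanopy.mass += aCanopy.mass; // merge: combine mass (and boundPoints)
    return canopies.indexOf(closestCoveringCanopy);
  }
}
```

With centers 0.0 and 0.44 (norm 0.44 < T2, as in the second record above), the new canopy is merged; with a far-away center it is added as a new entry instead.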

Here, merge only updates the boundPoints and mass of the matching canopies.get(1) (the index happens to be 1 for all of the preceding data, though in general it could be any index). For example, if canopies.get(1) previously held three points, then after the merge its mass is 4 and its boundPoints is [0, 1, 2, 3]. Seen this way, the merge function is easy to understand.
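The merge described above can be sketched as follows. The class is a hypothetical stand-in (the real MeanShiftCanopy.merge also takes a runClustering flag governing whether bound points are carried over); only the two fields the post mentions are modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of what the merge combines: only boundPoints and mass,
// not the S0/S1/S2 running sums. Illustrative stand-in class, not Mahout's.
public class MergeSketch {
  List<Integer> boundPoints = new ArrayList<>();
  int mass;

  MergeSketch(List<Integer> boundPoints, int mass) {
    this.boundPoints.addAll(boundPoints);
    this.mass = mass;
  }

  void merge(MergeSketch aCanopy) {
    boundPoints.addAll(aCanopy.boundPoints); // absorb the merged-in canopy's bound points
    mass += aCanopy.mass;                    // and its mass
  }
}
```

Merging a canopy with boundPoints [3] and mass 1 into one with boundPoints [0, 1, 2] and mass 3 reproduces the example above: mass 4, boundPoints [0, 1, 2, 3].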

Now return to the touch function mentioned earlier, which I find less intuitive; its code is as follows:

void touch(MeanShiftCanopy canopy, double weight) {
  canopy.observe(getCenter(), weight * mass);
  observe(canopy.getCenter(), weight * canopy.mass);
}

It is called as aCanopy.touch(canopy, weight), where aCanopy is one of the input records and canopy is one of the elements of canopies.

Both statements in the code above update the running sums S0, S1, and S2. In the first statement, canopy's S0 is incremented by 1 (because aCanopy's mass is always 1), and canopy's S1 is incremented by aCanopy's center (S2 is computed similarly to S1, just with squared terms, which makes it slightly more involved). In the second statement, aCanopy (whose S0, S1, and S2 start out empty) gets its S1 set from canopy's center multiplied by canopy's mass, and its S0 set from canopy's mass.
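The running-sum bookkeeping above can be sketched in one dimension (the real Mahout code keeps S1 and S2 as Vectors; the class here is an illustrative stand-in). S0 is the weighted count, S1 the weighted sum of observed points, S2 the weighted sum of squares:

```java
// 1-D sketch of AbstractCluster.observe and MeanShiftCanopy.touch.
// Illustrative stand-in, assuming scalar centers instead of Vectors.
public class ObserveSketch {
  double s0, s1, s2;  // running sums: weighted count, sum, sum of squares
  double center;
  double mass = 1;    // an input record (aCanopy) always starts with mass 1

  ObserveSketch(double center) { this.center = center; }

  void observe(double x, double weight) {
    s0 += weight;
    s1 += x * weight;
    s2 += x * x * weight;
  }

  // aCanopy.touch(canopy, weight): the two canopies observe each other's
  // center, each weighted by the observed canopy's mass.
  void touch(ObserveSketch canopy, double weight) {
    canopy.observe(this.center, weight * this.mass);
    this.observe(canopy.center, weight * canopy.mass);
  }
}
```

For example, with aCanopy at center 2.0 (mass 1) touching a canopy at center 0.0 with mass 3 and weight 1: the canopy's sums become S0 = 1, S1 = 2, S2 = 4 (it observed aCanopy's center once), while aCanopy's sums become S0 = 3, S1 = 0 (it observed the canopy's center with weight 3).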

I understand why canopies.get(1)'s S0, S1, and S2 are updated, because they are used later. But why update aCanopy's S0, S1, and S2? That part I don't fully understand: the only place aCanopy is used afterwards is the merge function, which touches only its boundPoints and mass, not S0, S1, or S2. Well, on reflection, aCanopy is also used in the add branch, when closestCoveringCanopy is null, i.e. when a new element of canopies is created, and in that case the sums do matter.

In this way, the reduce output contains 479 values, which matches the number of reduce outputs obtained in the first loop in the first blog post.

This concludes the basic analysis of MeanShiftCanopyDriver.

Sharing, happiness, and growth

If reprinting, please cite the source: http://blog.csdn.net/fansy1990
