Apriori algorithm on Hadoop: implementation ideas and key parts of the code

I recently studied the Apriori algorithm, because mining massive data requires implementing it on the Hadoop platform.

I looked at some Apriori MapReduce code on the internet and felt none of it was directly usable. Moreover, most of it is not the original Apriori: it is either an improved variant, or the FP-growth algorithm, or it splits the data into blocks and runs Apriori independently on each block, which is not the Apriori algorithm in the strict sense.

After a few experiments, I finally arrived at a MapReduce-based distributed implementation of the original Apriori algorithm.

Problems with a distributed Apriori implementation: there are multiple inputs, one being the transaction database and the other the global candidate set, and the candidate set must not be chunked. If it were split into blocks processed by different nodes, some support counts would certainly be missed. The candidate set therefore has to be handed to the Apriori algorithm as a complete parameter, which led me to load the whole candidate set into memory. This is one of the key points of my distributed implementation.
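
As a minimal sketch of that idea (not from the original post), the driver can flatten the candidate itemsets into a single string, one itemset per line with items separated by spaces, and store it in the job Configuration so that every mapper receives the complete set. The name serializeCandidates is a hypothetical helper, not the author's code, and the format is chosen to match what the mapper below parses back out.

import java.util.List;

// Hypothetical helper (assumption, not the author's code): flatten the candidate itemsets
// into the newline/space-separated format that the mapper below reads from the Configuration.
static String serializeCandidates(List<List<String>> candidates) {
    StringBuilder sb = new StringBuilder();
    for (List<String> itemset : candidates) {
        sb.append(String.join(" ", itemset)).append("\n");
    }
    return sb.toString();
}

// In the driver, before submitting a job:
// conf.set("canditmsets", serializeCandidates(candidates));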

The idea of the distributed Apriori implementation. Mapper: the key-value input is the transaction database, plus an in-memory data structure that stores the candidate itemsets, which guarantees the candidate set stays intact and is never fragmented; the output is a candidate itemset (key) and its support count (value). Reducer: it simply sums the values, and when writing to the output file it first checks that the support threshold is met. These are the two basic atomic operations.

A complete Apriori run is then: execute one job, i.e. one MapReduce pass, to produce the frequent itemsets of the current order; read the job's output into memory and build the higher-order candidate set from it; pass that candidate set as the in-memory input of the next job; and start the next job. Iterate in this way until no new frequent itemsets are produced.
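
A hedged sketch of that driver loop, assuming the Mapper, Reducer, and job setup shown later in this post are used unchanged. serializeCandidates is the helper sketched above, while readFrequentItemsets (which would parse a job's part-r-* output on HDFS) and generateNextCandidates (the Apriori join step, sketched after the Reducer below) are hypothetical helpers, not the author's code:

// Hedged sketch of the iterative driver, not the author's original main function.
List<List<String>> candidates = new ArrayList<>();   // empty on the first pass: count single items
int pass = 0;
boolean moreFrequentSets = true;
while (moreFrequentSets) {
    Configuration conf = new Configuration();
    conf.setInt("support", 3);
    conf.set("canditmsets", candidates.isEmpty() ? "nofile" : serializeCandidates(candidates));

    Job job = new Job(conf, "apriori pass " + pass);
    job.setJarByClass(WordCount.class);
    job.setInputFormatClass(WholeInputFormat.class);
    job.setMapperClass(GetGlobalCandItmSupportMapper.class);
    job.setReducerClass(GetGlobalFrqItmSetsReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/bin/in"));
    Path out = new Path("hdfs://localhost:9000/bin/out" + pass);
    FileOutputFormat.setOutputPath(job, out);
    job.waitForCompletion(true);

    // Hypothetical helper: read the frequent itemsets this pass produced from the job output.
    List<List<String>> frequent = readFrequentItemsets(FileSystem.get(conf), out);
    // Stop once no new frequent itemsets are produced; otherwise build the next-order candidates.
    candidates = frequent.isEmpty() ? new ArrayList<List<String>>() : generateNextCandidates(frequent);
    moreFrequentSets = !candidates.isEmpty();
    pass++;
}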

The Mapper class is as follows:

public static class GetGlobalCandItmSupportMapper extends Mapper<Object, Text, Text, IntWritable>
{
    // Computes candidate-set support counts.
    // Input: the key is unused; the value is the whole transaction database (reading it in one piece is
    // somewhat inefficient; it also raises a question: with a 500 MB file and 64 MB splits, does one map call
    // see 500 MB or 64 MB? If it is 64 MB that would be fine. To be tested later.)
    // The global candidate set is kept in memory.
    // Output: key is a candidate itemset, value is its support count.
    // Drawback: every map call rebuilds the candidate-set structure from the configuration string.

    int support = 3;
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        // One structure holds the transaction database, another holds all candidate itemsets;
        // the candidate set is passed in through the Configuration.
        ArrayList<ArrayList<String>> rawDb = new ArrayList<>();
        ArrayList<ArrayList<String>> candItms = new ArrayList<>();
        String candItmSets = context.getConfiguration().get("canditmsets");
        System.out.println("candItmSets is " + candItmSets);

        // Build the in-memory structure of the transaction database.
        String[] tmpLinesRawDb = value.toString().split("\n");  // one transaction per line
        String[] tmpOneLineRawDb = null;
        for (int i = 0; i < tmpLinesRawDb.length; i++)
        {
            tmpOneLineRawDb = tmpLinesRawDb[i].split("\\s+");
            ArrayList<String> tmpOneLineFrqArr = new ArrayList<>();
            for (int j = 0; j < tmpOneLineRawDb.length - 1; j++)  // split one line; reduced by 1 for this experiment's data, remember to add it back
            {
                tmpOneLineFrqArr.add(tmpOneLineRawDb[j]);
            }
            if (!tmpOneLineFrqArr.isEmpty())
            {
                rawDb.add(tmpOneLineFrqArr);
            }
        }
        System.out.println("rawDb is " + rawDb.toString());

        // Traverse the transaction database against the candidate set to compute support.
        // Branch: if this is the first pass (the candidate set is empty), count single items;
        // otherwise count the occurrences of the candidate itemsets.
        if (candItmSets == null || candItmSets.isEmpty() || candItmSets.equalsIgnoreCase("nofile"))
        {
            // First pass: count the occurrences of each single item.
            String[] itr = value.toString().split("\\s+");
            for (int i = 0; i < itr.length; i++)
            {
                word.set(itr[i]);
                context.write(word, one);
            }
        }
        else
        {
            // Build the in-memory structure of the candidate set.
            String[] tmpLines = candItmSets.split("\n");
            String[] tmpOneLine = null;
            for (int i = 0; i < tmpLines.length; i++)
            {
                ArrayList<String> tmpLineArr = new ArrayList<>();  // declared inside the loop, which avoids having to clone the object
                tmpOneLine = tmpLines[i].trim().split("\\s+");
                for (int j = 0; j < tmpOneLine.length; j++)
                {
                    tmpLineArr.add(tmpOneLine[j]);
                }
                candItms.add(tmpLineArr);
            }
            // Traverse the transaction database and count how many times each candidate itemset appears.
            for (int i = 0; i < rawDb.size(); i++)
            {
                for (int j = 0; j < candItms.size(); j++)
                {
                    if (isArrAContainsArrB(rawDb.get(i), candItms.get(j)))
                    {
                        String arrToString = "";
                        for (int k = 0; k < candItms.get(j).size(); k++)  // ArrayList's default toString format is not suitable as an output key
                        {
                            arrToString = arrToString.concat(candItms.get(j).get(k) + " ");
                        }
                        System.out.println("rawDb contains frqItm" + " rawDb is " + rawDb.get(i) + " frqItm is " + candItms.get(j));
                        context.write(new Text(arrToString), new IntWritable(1));
                    }
                }
            }
        }
        // The block below appears to be left over from an earlier experiment (filtering (item, count) pairs
        // by support); it is commented out because it does not match the map input handled above.
        // String[] arr = value.toString().split("\\s+");
        // for (int i = 0; i < arr.length; i++)
        //     if (Integer.valueOf(arr[1]) > support)
        //         context.write(new Text(arr[0]), new IntWritable(Integer.valueOf(arr[1])));
    }
}
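
The mapper above calls isArrAContainsArrB, which the post does not include. A minimal sketch consistent with how it is used, returning true when transaction A contains every item of candidate itemset B, could look like this (my assumption, not the author's original helper; it can live as a static method in the enclosing class):

// Hypothetical sketch of the missing helper: does transaction arrA contain every item of candidate arrB?
private static boolean isArrAContainsArrB(ArrayList<String> arrA, ArrayList<String> arrB) {
    for (String item : arrB) {
        if (!arrA.contains(item)) {
            return false;  // one candidate item is missing from the transaction
        }
    }
    return true;
}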

The Reducer class is as follows:

public static class GetGlobalFrqItmSetsReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    // Each reduce call sees a single key with its values, not the whole input, so only limited work can be done
    // here: filter by the support threshold to decide which itemsets are frequent, and save those to the output
    // file. Generating the higher-order candidate set is left to the mapper side (the driver).
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        ArrayList<ArrayList<String>> frqItmSets = new ArrayList<>();
        ArrayList<String> frqOneItm = new ArrayList<>();
        ArrayList<String> frqElement = new ArrayList<>();  // frequent elements, fed to the combination routine that generates the candidate combinations
        int sum = 0;
        int support = context.getConfiguration().getInt("support", 2);
        for (IntWritable val : values)
        {
            sum += val.get();
        }
        if (sum > support)  // above the support threshold, so write one record; the plan is to change this later so frequent itemsets stay in memory and the next-order candidates are written to a file
        {
            String[] tmpOneItm = key.toString().split("\\s+");  // split one frequent itemset into a string array
            for (int i = 0; i < tmpOneItm.length; i++)
            {
                frqOneItm.add(tmpOneItm[i]);
                if (!frqElement.contains(tmpOneItm[i]))
                    frqElement.add(tmpOneItm[i]);
            }
            if (!frqOneItm.isEmpty())
            {
                frqItmSets.add(frqOneItm);  // frequent-set structure, can be removed later; kept only to check whether the output is correct
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
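
The reducer comment mentions a combination routine that builds the next-order candidate set from the frequent elements, but the post does not show it. A hedged sketch of the classic Apriori join step follows; the name generateNextCandidates and the code are my assumptions, not the author's:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the Apriori join step: two frequent k-itemsets that share their first k-1 items
// are combined into a (k+1)-candidate. Assumes the items inside each itemset are kept in sorted order.
static List<List<String>> generateNextCandidates(List<List<String>> frequent) {
    List<List<String>> next = new ArrayList<>();
    for (int i = 0; i < frequent.size(); i++) {
        for (int j = i + 1; j < frequent.size(); j++) {
            List<String> a = frequent.get(i);
            List<String> b = frequent.get(j);
            boolean samePrefix = a.subList(0, a.size() - 1).equals(b.subList(0, b.size() - 1));
            boolean differentLast = !a.get(a.size() - 1).equals(b.get(b.size() - 1));
            if (samePrefix && differentLast) {
                List<String> candidate = new ArrayList<>(a);
                candidate.add(b.get(b.size() - 1));
                Collections.sort(candidate);
                if (!next.contains(candidate)) {
                    next.add(candidate);
                }
            }
        }
    }
    // A full Apriori implementation would also prune candidates that contain an infrequent k-subset.
    return next;
}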

The main function is as follows:

Configuration conf = new Configuration();
conf.setInt("support", 3);
conf.set("canditmsets", "nofile");  // "nofile" marks the first pass, in which single items are counted
String[] inAndOut = new String[]{"hdfs://localhost:9000/bin/in", "hdfs://localhost:9000/bin/out0"};
Job job = new Job(conf, "word count");
job.setInputFormatClass(WholeInputFormat.class);
job.setJarByClass(WordCount.class);
job.setMapperClass(GetGlobalCandItmSupportMapper.class);
job.setCombinerClass(GetGlobalFrqItmSetsReducer.class);  // note: a combiner that filters by support is only safe if each map task already sees the entire database
job.setReducerClass(GetGlobalFrqItmSetsReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(inAndOut[0]));
FileOutputFormat.setOutputPath(job, new Path(inAndOut[1]));
if (job.waitForCompletion(true))
{
    System.out.println("First step had been done");
}
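
The job above uses a WholeInputFormat class that the post does not include, and whose behavior also decides the 500 MB vs. 64 MB question raised in the mapper comment. One common way to get the behavior the mapper assumes, namely the whole file delivered to a single map call, is a non-splittable whole-file input format. The sketch below is my assumption of what such a class could look like, not the author's original:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical whole-file input format: marks files as non-splittable and hands each file
// to the mapper as a single (null, Text) record.
public class WholeInputFormat extends FileInputFormat<Object, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split, so one map call sees the whole file
    }

    @Override
    public RecordReader<Object, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new RecordReader<Object, Text>() {
            private boolean processed = false;
            private final Text value = new Text();
            private FileSplit fileSplit;
            private TaskAttemptContext taskContext;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx) {
                fileSplit = (FileSplit) s;
                taskContext = ctx;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) return false;
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path path = fileSplit.getPath();
                FileSystem fs = path.getFileSystem(taskContext.getConfiguration());
                FSDataInputStream in = fs.open(path);
                try {
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }

            @Override public Object getCurrentKey() { return null; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}

With isSplitable returning false, one map call receives the entire file, which matches the mapper's assumption that value holds the whole transaction database.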

If you have any questions, please feel free to contact me. And if you know of a more elegant way to implement this, please share it.
