Implementing an Incremental Apriori Algorithm with HBase Coprocessors

Source: Internet
Author: User

    • Apriori is a classic frequent-itemset mining algorithm in data mining. Its central idea is that if an itemset is not frequent, then every itemset that contains it (every superset) must also be infrequent. The incremental Apriori implemented here resembles distributed Apriori: the already-mined transaction set and the newly added transaction set can be treated as two separate datasets. Mine the new transaction set to get its frequent itemsets, then take the union with the existing frequent itemsets. An itemset that is frequent on both sides must be globally frequent; an itemset that is frequent on only one side needs its support counted on the other side as well. This yields all globally frequent itemsets without re-mining the existing transaction set, which should improve efficiency.
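The pruning rule above can be sketched in plain Java. This is a minimal illustration, not the author's code: the class name `AprioriSketch`, the helper `tx`, and the use of an absolute count threshold `minCount` are all assumptions made for the example. Candidates of size k+1 are built only from frequent k-itemsets, and a candidate is discarded without counting if any of its k-subsets is not frequent.

```java
import java.util.*;

class AprioriSketch {

    /** Hypothetical helper: builds one transaction from its items. */
    static Set<String> tx(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    /** Counts the transactions that contain every item of the candidate. */
    static int support(List<Set<String>> transactions, Set<String> candidate) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(candidate)) count++;
        }
        return count;
    }

    /** True if every subset obtained by dropping one item is in the frequent level. */
    static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> level) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!level.contains(subset)) return false;
        }
        return true;
    }

    /** Returns each itemset with support count >= minCount, mapped to that count. */
    static Map<Set<String>, Integer> mine(List<Set<String>> transactions, int minCount) {
        Map<Set<String>, Integer> frequent = new LinkedHashMap<>();

        // Level 1 candidates: every distinct item.
        Set<Set<String>> candidates = new LinkedHashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) candidates.add(tx(item));
        }

        while (!candidates.isEmpty()) {
            // Count the candidates and keep those that reach the threshold.
            Set<Set<String>> level = new LinkedHashSet<>();
            for (Set<String> c : candidates) {
                int n = support(transactions, c);
                if (n >= minCount) {
                    level.add(c);
                    frequent.put(c, n);
                }
            }
            // Join frequent k-itemsets into (k+1)-candidates and prune any
            // candidate that has an infrequent k-subset.
            candidates = new LinkedHashSet<>();
            for (Set<String> a : level) {
                for (Set<String> b : level) {
                    Set<String> joined = new TreeSet<>(a);
                    joined.addAll(b);
                    if (joined.size() == a.size() + 1 && allSubsetsFrequent(joined, level)) {
                        candidates.add(joined);
                    }
                }
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<Set<String>> data = Arrays.asList(
                tx("a", "b", "c"), tx("a", "b"), tx("a", "c"), tx("b", "d"));
        System.out.println(mine(data, 2));
    }
}
```

In the sample data, {b, c} appears only once, so the candidate {a, b, c} is pruned before any counting is done for it.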

    • As for HBase coprocessors, which you are probably already familiar with: they are an open-source implementation modeled on the coprocessors of Google's Bigtable (the system underlying Percolator), aimed at use cases such as incremental processing and secondary indexes. HBase offers two kinds of coprocessor, Endpoint and Observer. An Endpoint resembles a stored procedure in a database: the program is deployed to every RegionServer in advance and then invoked by the client, which aggregates the results each RegionServer returns. An Observer resembles a database trigger: once deployed to a RegionServer, it exposes hooks such as preGet, postGet, prePut, postPut, preDelete, and postDelete, and it fires whenever the corresponding operation occurs on that RegionServer.

    • This article uses only the Endpoint type. Each RegionServer computes all frequent itemsets of its own transactions, and the client then takes the union of the per-region frequent itemsets. Itemsets whose combined count already meets the minimum support are marked as globally frequent; for the remaining itemsets, the client collects their counts from all regions to decide which of them are also globally frequent. The second step is to insert new transactions incrementally, tagged with a timestamp, and then obtain all globally frequent itemsets again in the same way.
      Note that starting with HBase 0.98, coprocessor remote calls use the protobuf standard, which requires the message format to be defined in advance. The proto file the algorithm needs is as follows:
    package apriori;

    option java_package = "dave.apriori.protos";
    option java_outer_classname = "AprioriProtos";
    option java_generic_services = true;
    option java_generate_equals_and_hash = true;
    option optimize_for = SPEED;

    message AprioriRequest {
      required int32 length = 1;
      required float support = 2;
    }

    message AprioriResponse {
      message FrequentSet {
        required bytes fset = 3;
        required int32 support = 4;
      }
      required int32 count = 5;
      repeated FrequentSet fsets = 6;
    }

    message SpecialRequest {
      repeated bytes fsets = 7;
    }

    message SpecialResponse {
      repeated int32 supportcount = 8;
    }

    message HelloRequest {
      required bytes hellostr = 9;
    }

    message HelloResponse {
      required bytes helloresp = 10;
    }

    service Apriori {
      rpc GetFrequentSet (AprioriRequest) returns (AprioriResponse);
      rpc GetSepecialSupport (SpecialRequest) returns (SpecialResponse);
      rpc SayHello (HelloRequest) returns (HelloResponse);
    }

The file defines three RPCs: one returns all frequent itemsets of a region, one returns the counts of given itemsets in that region, and the last, SayHello, is for testing.
After defining the file, run protoc --java_out=. Apriori.proto to generate the corresponding Java files in the current directory, then import them into the project to write the server and the client.
The deployment steps and source code have been uploaded; anyone who needs them can download them at http://download.csdn.net/detail/xanxus46/8801857
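The two-round exchange between client and regions described earlier can be sketched without HBase. The following is only an illustration under stated assumptions: `RegionMergeSketch`, `Region`, `localFrequent`, `count`, and `tx` are hypothetical names, each `Region` object stands in for one region's transactions (or for the existing and the new batch), and the local miner is a brute-force subset count suitable only for tiny examples. The real implementation would make these calls through the generated protobuf service.

```java
import java.util.*;

class RegionMergeSketch {

    /** Hypothetical helper: builds one transaction from its items. */
    static Set<String> tx(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    /** Stand-in for one region, which sees only its own transactions. */
    static class Region {
        final List<Set<String>> transactions;

        Region(List<Set<String>> transactions) {
            this.transactions = transactions;
        }

        /** Round 2 call: support count of one itemset inside this region. */
        int count(Set<String> itemset) {
            int n = 0;
            for (Set<String> t : transactions) {
                if (t.containsAll(itemset)) n++;
            }
            return n;
        }

        /** Round 1 call: locally frequent itemsets with counts (brute force). */
        Map<Set<String>, Integer> localFrequent(double minSupport) {
            Map<Set<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : transactions) {
                List<String> items = new ArrayList<>(t);
                // Enumerate every non-empty subset of this transaction once.
                for (int mask = 1; mask < (1 << items.size()); mask++) {
                    Set<String> subset = new TreeSet<>();
                    for (int i = 0; i < items.size(); i++) {
                        if ((mask & (1 << i)) != 0) subset.add(items.get(i));
                    }
                    counts.merge(subset, 1, Integer::sum);
                }
            }
            Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
            for (Map.Entry<Set<String>, Integer> e : counts.entrySet()) {
                if (e.getValue() >= minSupport * transactions.size()) {
                    frequent.put(e.getKey(), e.getValue());
                }
            }
            return frequent;
        }
    }

    /** Client side: merges per-region results into the globally frequent itemsets. */
    static Set<Set<String>> globalFrequent(List<Region> regions, double minSupport) {
        int total = 0;
        List<Map<Set<String>, Integer>> locals = new ArrayList<>();
        Set<Set<String>> union = new LinkedHashSet<>();
        for (Region r : regions) {                     // round 1
            total += r.transactions.size();
            Map<Set<String>, Integer> local = r.localFrequent(minSupport);
            locals.add(local);
            union.addAll(local.keySet());
        }
        double threshold = minSupport * total;
        Set<Set<String>> result = new LinkedHashSet<>();
        for (Set<String> itemset : union) {
            int sum = 0;
            for (Map<Set<String>, Integer> local : locals) {
                sum += local.getOrDefault(itemset, 0);
            }
            if (sum < threshold) {                     // round 2, only if needed
                for (int i = 0; i < regions.size(); i++) {
                    if (!locals.get(i).containsKey(itemset)) {
                        sum += regions.get(i).count(itemset);
                    }
                }
            }
            if (sum >= threshold) result.add(itemset);
        }
        return result;
    }
}
```

With an existing batch of {a,b}, {a,b}, {a,c}, {b,c}, a new batch of {a,c}, {a,c}, and a minimum support of 0.5, the itemset {a, c} is frequent only in the new batch; the second round adds its single occurrence in the existing batch, which brings it to the global threshold of 3.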
