Original: http://www.cnblogs.com/pinard/p/6340162.html
In the FP Tree algorithm principle summary and the PrefixSpan algorithm principle summary, we covered the theory behind these two association algorithms, FP Tree and PrefixSpan. This article introduces how to use the two algorithms from a practical point of view. Since scikit-learn has no library for association algorithms while Spark MLlib does, this article uses Spark MLlib as the environment.
1. Spark MLlib Association Algorithm Overview
In Spark MLlib, only two association algorithms are implemented, FP Tree and PrefixSpan; association algorithms such as Apriori and GSP are not included. These algorithms provide Python, Java, Scala and R interfaces. Since the previous practice articles are all based on Python, the rest of this article also introduces and uses the MLlib Python interface.
The Python interface of the Spark MLlib association algorithms lives in the pyspark.mllib.fpm package. The class corresponding to the FP Tree algorithm is pyspark.mllib.fpm.FPGrowth (hereinafter referred to as the FPGrowth class), available since Spark 1.4. The class corresponding to the PrefixSpan algorithm is pyspark.mllib.fpm.PrefixSpan (hereinafter referred to as the PrefixSpan class), available since Spark 1.6. So if your Spark learning environment is older than 1.6, you will not be able to run the examples below as expected.
Spark MLlib also provides classes for reading trained association models, namely pyspark.mllib.fpm.FPGrowthModel and pyspark.mllib.fpm.PrefixSpanModel. These two classes can read our previously saved FP Tree and PrefixSpan training models.
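If you need to persist and reload a trained model, a minimal sketch looks roughly like the following. This assumes that model is a trained FPGrowthModel (such as the one built in section 3), that the path is hypothetical, and that save/load support for FP-growth models depends on your Spark version, so treat it as an assumption rather than something guaranteed to work on every release:

from pyspark.mllib.fpm import FPGrowthModel

# hypothetical path on the local file system or HDFS
model_path = "fpgrowth_model"
# persist a trained FPGrowthModel (model comes from FPGrowth.train, see section 3)
model.save(sc, model_path)
# read it back later and query its frequent itemsets
same_model = FPGrowthModel.load(sc, model_path)
print same_model.freqItemsets().collect()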
2. Spark MLlib Association Algorithm Parameter Introduction
For the FPGrowth class, its training function train mainly requires three parameters: the itemset data, the support threshold minSupport, and the number of data partitions numPartitions used when the data is processed in parallel. The value of the support threshold minSupport affects the size of the final set of frequent itemsets: the higher the threshold, the fewer frequent itemsets are found; its default value is 0.3. The number of partitions numPartitions is mainly useful in a distributed environment; if you run Spark on a single machine, you can ignore this parameter.
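As a minimal sketch of the FPGrowth call, assuming an already-created SparkContext sc (as set up in section 3 below) and some hypothetical transaction data:

from pyspark.mllib.fpm import FPGrowth

# hypothetical transactions; each inner list is one itemset
transactions = sc.parallelize([["a", "b"], ["a", "c"], ["a", "b", "c"]])
# minSupport is the support threshold (default 0.3); numPartitions is the number of data partitions
fp_model = FPGrowth.train(transactions, minSupport=0.3, numPartitions=2)
print fp_model.freqItemsets().collect()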
For the PrefixSpan class, its training function train mainly requires four parameters: the sequence data, the support threshold minSupport, the maximum length of a frequent sequence maxPatternLength, and the maximum number of items in a single projected database maxLocalProjDBSize. The support threshold minSupport has the same meaning as in the FPGrowth class, the only difference being that its default value is 0.1. maxPatternLength limits the length of the longest frequent sequence: the smaller it is, the fewer frequent sequences are found. The maxLocalProjDBSize parameter exists to keep a single machine's memory from being exhausted; if you are only learning with a small amount of data, you can ignore this parameter.
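Similarly, a minimal sketch of the PrefixSpan call, again with a hypothetical SparkContext sc and made-up sequence data (each sequence is a list of itemsets):

from pyspark.mllib.fpm import PrefixSpan

# hypothetical sequences of itemsets
sequences = sc.parallelize([
    [["a"], ["a", "b"]],
    [["a"], ["b"]],
    [["b"], ["a"]]
])
# minSupport defaults to 0.1, maxPatternLength to 10, maxLocalProjDBSize to 32000000
ps_model = PrefixSpan.train(sequences, minSupport=0.5, maxPatternLength=4)
print ps_model.freqSequences().collect()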
As can be seen from the above description, there is no real barrier to using the FP Tree and PrefixSpan algorithms. When learning, you can control the size of the resulting frequent itemsets and sequences through the support threshold minSupport, and maxPatternLength can help the PrefixSpan algorithm filter out frequent sequences that are too long. In a distributed big-data environment, you also need to consider the number of data partitions numPartitions of the FPGrowth algorithm and the maximum size of a single projected database maxLocalProjDBSize of the PrefixSpan algorithm.
3. Example of Spark FP Tree and PrefixSpan Algorithm Use
Here we use a concrete example to demonstrate how to use the Spark FP Tree and PrefixSpan algorithms to mine frequent itemsets and frequent sequences.
To use Spark for the FP Tree and PrefixSpan algorithms, you first need to make sure that Hadoop and Spark (version not lower than 1.6) are installed and that the environment variables are set. We generally study in an IPython Notebook (Jupyter Notebook), so it is best to set up a notebook-based Spark environment. Of course, it does not matter if you do not have a notebook-based Spark environment, but you will need to set the environment variables before each run.
If you don't have a notebook-based Spark environment, you need to run the following code first. Of course, if you have already set it up, the code below does not need to be run again.
import os
import sys

# These paths are the Spark installation directory and Java installation directory on your own machine
os.environ['SPARK_HOME'] = "C:/Tools/spark-1.6.1-bin-hadoop2.6/"
sys.path.append("C:/Tools/spark-1.6.1-bin-hadoop2.6/bin")
sys.path.append("C:/Tools/spark-1.6.1-bin-hadoop2.6/python")
sys.path.append("C:/Tools/spark-1.6.1-bin-hadoop2.6/python/pyspark")
sys.path.append("C:/Tools/spark-1.6.1-bin-hadoop2.6/python/lib")
sys.path.append("C:/Tools/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("C:/Tools/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")
sys.path.append("C:/Program Files (x86)/Java/jdk1.8.0_102")

from pyspark import SparkContext
from pyspark import SparkConf

sc = SparkContext("local", "testing")
Before running the algorithms, it is recommended to print the Spark context as follows. If the memory address can be printed normally, the Spark running environment is ready.
print sc
For example, my output is:
<pyspark.context.SparkContext object at 0x07d9e2b0>
Now let's run the FP Tree algorithm on some data. In order to compare with the analysis in the FP Tree algorithm principle summary, we use the same itemset data as in that article and the same support threshold of 20% to train the model. The code is as follows:
from pyspark.mllib.fpm import FPGrowth

data = [["A", "B", "C", "E", "F", "O"], ["A", "C", "G"], ["E", "I"], ["A", "C", "D", "E", "G"], ["A", "C", "E", "G", "L"],
        ["E", "J"], ["A", "B", "C", "E", "F", "P"], ["A", "C", "D"], ["A", "C", "E", "G", "M"], ["A", "C", "E", "G", "N"]]
rdd = sc.parallelize(data, 2)
# support threshold of 20%
model = FPGrowth.train(rdd, 0.2, 2)
Let's take a look at the resulting frequent itemsets with the following code:
sorted(model.freqItemsets().collect())
The output is the set of frequent itemsets that meet the support requirement, which can be compared with the frequent itemsets derived in the principle article. The code output is as follows:
[FreqItemset(items=[u'A'], freq=8), FreqItemset(items=[u'B'], freq=2), FreqItemset(items=[u'B', u'A'], freq=2), FreqItemset(items=[u'B', u'C'], freq=2), FreqItemset(items=[u'B', u'C', u'A'], freq=2), FreqItemset(items=[u'B', u'E'], freq=2), FreqItemset(items=[u'B', u'E', u'A'], freq=2), FreqItemset(items=[u'B', u'E', u'C'], freq=2), FreqItemset(items=[u'B', u'E', u'C', u'A'], freq=2),
FreqItemset(items=[u'C'], freq=8), FreqItemset(items=[u'C', u'A'], freq=8), FreqItemset(items=[u'D'], freq=2), FreqItemset(items=[u'D', u'A'], freq=2), FreqItemset(items=[u'D', u'C'], freq=2), FreqItemset(items=[u'D', u'C', u'A'], freq=2),
FreqItemset(items=[u'E'], freq=8), FreqItemset(items=[u'E', u'A'], freq=6), FreqItemset(items=[u'E', u'C'], freq=6), FreqItemset(items=[u'E', u'C', u'A'], freq=6),
FreqItemset(items=[u'F'], freq=2), FreqItemset(items=[u'F', u'A'], freq=2), FreqItemset(items=[u'F', u'B'], freq=2), FreqItemset(items=[u'F', u'B', u'A'], freq=2), FreqItemset(items=[u'F', u'B', u'C'], freq=2), FreqItemset(items=[u'F', u'B', u'C', u'A'], freq=2), FreqItemset(items=[u'F', u'B', u'E'], freq=2), FreqItemset(items=[u'F', u'B', u'E', u'A'], freq=2), FreqItemset(items=[u'F', u'B', u'E', u'C'], freq=2), FreqItemset(items=[u'F', u'B', u'E', u'C', u'A'], freq=2), FreqItemset(items=[u'F', u'C'], freq=2), FreqItemset(items=[u'F', u'C', u'A'], freq=2), FreqItemset(items=[u'F', u'E'], freq=2), FreqItemset(items=[u'F', u'E', u'A'], freq=2), FreqItemset(items=[u'F', u'E', u'C'], freq=2), FreqItemset(items=[u'F', u'E', u'C', u'A'], freq=2),
FreqItemset(items=[u'G'], freq=5), FreqItemset(items=[u'G', u'A'], freq=5), FreqItemset(items=[u'G', u'C'], freq=5), FreqItemset(items=[u'G', u'C', u'A'], freq=5), FreqItemset(items=[u'G', u'E'], freq=4), FreqItemset(items=[u'G', u'E', u'A'], freq=4), FreqItemset(items=[u'G', u'E', u'C'], freq=4), FreqItemset(items=[u'G', u'E', u'C', u'A'], freq=4)]
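Since freqItemsets().collect() returns plain FreqItemset namedtuples (with items and freq fields), the results can be post-processed in ordinary Python. A small illustrative sketch using the model trained above, for example keeping only itemsets that contain 'A' and have at least two items, sorted by frequency:

# collect the frequent itemsets once and post-process them locally
freq_itemsets = model.freqItemsets().collect()
# keep only itemsets that contain 'A' and have at least two items
with_a = [fi for fi in freq_itemsets if 'A' in fi.items and len(fi.items) >= 2]
# print them sorted by frequency, highest first
for fi in sorted(with_a, key=lambda x: x.freq, reverse=True):
    print fi.items, fi.freq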
Next let's look at using the PrefixSpan class to mine frequent sequences. In order to compare with the analysis in the PrefixSpan algorithm principle summary, we use the same sequence data as in that article, the same support threshold of 50%, and set the maximum frequent sequence length to 4 to train the model. The code is as follows:
from pyspark.mllib.fpm import PrefixSpan
data = [[["a"], ["a", "b", "c