Frequent sequential pattern mining

Source: Internet
Author: User

1. Frequent sequential pattern mining

Sequential patterns are a special kind of frequent pattern, but the scenarios in which they are applied are quite different. Consider the following example:

User  Items bought
U1    diapers, beer, coke
U2    bread, diapers, beer

The table above shows the baskets of two users. Diapers and beer are frequently bought together, so based on this frequent-itemset analysis a supermarket can place diapers and beer closer to each other, or sell them together in a promotion, to increase sales.
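As a minimal illustration (the two baskets above, hand-encoded; the Python representation is only an assumption for this sketch), the support of the itemset {diapers, beer} can be counted like this:

    # Count how many baskets contain the whole itemset (market-basket view).
    baskets = [
        {"diapers", "beer", "coke"},   # user 1
        {"bread", "diapers", "beer"},  # user 2
    ]

    itemset = {"diapers", "beer"}
    support = sum(1 for basket in baskets if itemset <= basket)
    print(support)  # 2 -> diapers and beer appear together in both baskets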

A sequential pattern, by contrast, looks like this:

U1 purchase items:
1st trip: TV
2nd trip: HD set-top box

U2 purchase items:
1st trip: TV, refrigerator
2nd trip: HD set-top box, refrigerator cleaner

The table above shows two consecutive shopping trips for each of two users. Notice that the HD set-top box tends to appear in a trip that follows the purchase of a TV, but rarely appears before the TV is bought: the earlier purchase of the TV promotes the later sale of the set-top box. Based on this kind of frequent-sequence analysis, a supermarket can recommend set-top boxes to users who have bought a TV, or run a set-top-box promotion after a TV sale, to increase revenue.

For ordinary frequent-pattern mining we can use algorithms such as Apriori and FP-Growth; for frequent-pattern mining on datasets whose records are ordered in time, there are the GSP and SPADE algorithms.

Definition: in sequential pattern mining, assume dataset D contains n sequences (seq); each sequence consists of multiple events (event), and the events within a sequence are ordered by time.

For a sequence s that appears as a subsequence of some of the sequences in dataset D, its support is computed as:

support(s) = the number of the n sequences in D that contain s

If support(s) >= min_support, then s is called a frequent sequence, or a sequential pattern. Two common mining algorithms are described below:
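A minimal sketch of this support computation (representing a sequence as a time-ordered list of item sets is an assumption made for illustration; the dataset is the three-sequence example used below):

    # A data sequence is a time-ordered list of events; each event is a set of items.
    # s is contained in seq if the events of s can be matched, in order, to events of seq.
    def is_subsequence(s, seq):
        i = 0
        for event in seq:
            if i < len(s) and s[i] <= event:   # s[i] <= event: subset test
                i += 1
        return i == len(s)

    def support(s, dataset):
        return sum(1 for seq in dataset if is_subsequence(s, seq))

    dataset = [
        [{"i1", "i2"}, {"i3"}, {"i4"}],  # SID 1
        [{"i2"}, {"i3"}],                # SID 2
        [{"i2"}, {"i4"}],                # SID 3
    ]
    print(support([{"i2"}, {"i3"}], dataset))  # 2, so <i2,i3> is frequent for min_support = 2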


2. GSP algorithm


The GSP algorithm is essentially an Apriori-style algorithm; it also consists of two steps: 1. self-join (candidate generation); 2. pruning.

How are sequences self-joined? The definition is as follows: for sequences s1 and s2, if the sequence obtained by removing the first item of s1 is identical to the sequence obtained by removing the last item of s2, then s1 and s2 can be joined. The join appends the last item of s2 to s1, and the resulting new sequence is the candidate produced from s1 and s2.
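Here is a minimal sketch of this join rule, restricted to sequences whose events each contain a single item (a simplification; full GSP also handles joins that add an item to the last event):

    # GSP self-join for sequences represented as tuples of single items (simplified).
    def can_join(s1, s2):
        # s1 without its first item must equal s2 without its last item.
        return s1[1:] == s2[:-1]

    def join(s1, s2):
        # Append the last item of s2 to s1 to form the new candidate.
        return s1 + (s2[-1],)

    print(can_join(("i2", "i3"), ("i3", "i4")))  # True
    print(join(("i2", "i3"), ("i3", "i4")))      # ('i2', 'i3', 'i4')
    print(can_join(("i2", "i3"), ("i2", "i4")))  # False -> these two cannot be joined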

The pruning conditions of the GSP algorithm are the same as those of the Apriori algorithm (a small pruning sketch follows the list below):

1. If the support of a candidate sequence is less than min_support, it is pruned;

2. Every subsequence of a frequent sequence must itself be frequent, so a candidate that has an infrequent subsequence can be pruned without counting its support.
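A minimal sketch of the second condition (the Apriori property), again for single-item events and assuming the frequent patterns of the previous round are kept in a set:

    # Apriori-property pruning: drop a candidate if any (k-1)-subsequence,
    # obtained by deleting one item, is not among the frequent (k-1)-patterns.
    def survives_pruning(candidate, frequent_prev):
        for i in range(len(candidate)):
            sub = candidate[:i] + candidate[i + 1:]
            if sub not in frequent_prev:
                return False
        return True

    frequent_1 = {("i2",), ("i3",), ("i4",)}
    print(survives_pruning(("i2", "i3"), frequent_1))  # True
    print(survives_pruning(("i1", "i3"), frequent_1))  # False, because <i1> is not frequent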

We assume dataset D contains three sequences (SID = 1, 2, 3) and that min_support = 2:

Sid EID Items
1 1 i1, i2
1 2 i3
1 3 i4
2 1 i2
2 2 i3
3 1 i2
3 2 i4
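Encoded as a small Python structure (this representation is only an assumption for the sketch), the single-item supports below can be reproduced directly:

    # Each sequence: a time-ordered list of events; each event: a set of items.
    dataset = {
        1: [{"i1", "i2"}, {"i3"}, {"i4"}],
        2: [{"i2"}, {"i3"}],
        3: [{"i2"}, {"i4"}],
    }

    for item in ["i1", "i2", "i3", "i4"]:
        count = sum(1 for seq in dataset.values()
                    if any(item in event for event in seq))
        print(f"<{item}>: {count}")  # <i1>: 1, <i2>: 3, <i3>: 2, <i4>: 2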

We first generate the candidate 1-sequences (in the notation below, angle brackets <> mean that the items inside are ordered across events, while parentheses () mean that the items inside occur together in a single event):

<i1>, <i2>, <i3>, <i4>

Their supports are:

<i1>:1

<i2>:3

<i3>:2

<i4>:2

After pruning, the sequence patterns obtained are:

<i2>:3

<i3>:2

<i4>:2

Then, self-joining the sequence patterns obtained in the previous step gives the next round of candidate sequences (a small enumeration sketch follows the list below):

<i2,i2>:0

<i2,i3>:2

<i2,i4>:2

<i3,i2>:0

<i3,i3>:0

<i3,i4>:1

<i4,i2>:0

<i4,i3>:0

<i4,i4>:0

(i2,i3): 0

(i2,i4): 0

(i3,i4): 0
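A minimal sketch of this candidate generation from the three frequent 1-patterns (ordered pairs plus unordered same-event pairs; the representation is assumed for illustration):

    from itertools import combinations, product

    frequent_items = ["i2", "i3", "i4"]

    # Ordered candidates <a, b>: every ordered pair, including a == b.
    ordered = [(a, b) for a, b in product(frequent_items, repeat=2)]

    # Same-event candidates (a, b): unordered pairs of distinct items.
    same_event = list(combinations(frequent_items, 2))

    print(len(ordered), len(same_event))  # 9 3 -> the 12 candidates listed above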

After pruning, the resulting sequence patterns are:

<i2,i3>:2

<i2,i4>:2

Removing the first item of <i2,i3> gives <i3>, while removing the last item of <i2,i4> gives <i2>; these are not the same (and the check fails in the other join order as well), so the two patterns cannot be joined and the algorithm terminates. The resulting sequence patterns are:

<i2,i3>, <i2,i4>

That is, a user who buys i2 is likely to buy i3 in a later purchase, and a user who buys i2 is likely to buy i4 in a later purchase.

The GSP algorithm is straightforward, but it has a drawback: every time the supports of the candidate sequences are computed, a full scan of dataset D is required. As a rough estimate, suppose D contains m distinct items and that none of the candidate 1-sequences is pruned in the first step. The number of candidate 2-sequences generated from them is then m*m + m(m-1)/2 (m*m ordered sequences plus m(m-1)/2 pairs of items occurring together in one event); for m = 3 this is 9 + 3 = 12, exactly the twelve candidates listed above. Counting their supports requires reading on the order of (m*m + m(m-1)/2) * |D| records, where |D| is the number of records in D, and that is only the cost of pruning a single round of candidates, so the total amount of scanning is very large.

To address this problem, the SPADE algorithm was proposed.


3. SPADE algorithm

Building on the GSP algorithm, the SPADE algorithm introduces the concept of an id_list, which avoids scanning dataset D over and over. An id_list is a set of (SID, EID) pairs.

Let's revisit the example above and see how the SPADE algorithm proceeds.

For the first round of candidate sequences <i1>, <i2>, <i3>, <i4>, we build their id_lists:

I1

Sid EID
1 1

I2

Sid EID
1 1
2 1
3 1

I3

Sid EID
1 2
2 2

I4

Sid EID
1 3
3 2
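A minimal sketch of building these id_lists from the dataset (same hand-encoded representation as in the earlier sketches):

    from collections import defaultdict

    # dataset: SID -> time-ordered list of events (sets of items); EIDs start at 1.
    dataset = {
        1: [{"i1", "i2"}, {"i3"}, {"i4"}],
        2: [{"i2"}, {"i3"}],
        3: [{"i2"}, {"i4"}],
    }

    id_lists = defaultdict(list)          # item -> list of (SID, EID) pairs
    for sid, events in dataset.items():
        for eid, event in enumerate(events, start=1):
            for item in event:
                id_lists[item].append((sid, eid))

    for item in sorted(id_lists):
        print(item, id_lists[item])
    # i1 [(1, 1)]
    # i2 [(1, 1), (2, 1), (3, 1)]
    # i3 [(1, 2), (2, 2)]
    # i4 [(1, 3), (3, 2)]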

From the id_lists we can read off the supports directly (the number of distinct SIDs in each list):

<i1>:1

<i2>:3

<i3>:2

<i4>:2

After pruning, the sequence patterns obtained are:

<i2>:3

<i3>:2

<i4>:2


We then generate this round's candidate set from those sequence patterns, together with the id_list of each candidate (which can be derived from the id_lists of the previous round rather than by rescanning D). The candidate set is:

<i2,i2>:0

<i2,i3>:2

<i2,i4>:2

<i3,i2>:0

<i3,i3>:0

<i3,i4>:1

<i4,i2>:0

<i4,i3>:0

<i4,i4>:0

(i2,i3): 0

(i2,i4): 0

(i3,i4): 0


Take the id_list of <i2,i3> as an example:

The id_list of i2 has 3 entries: (1,1), (2,1), (3,1);

the id_list of i3 has 2 entries: (1,2), (2,2).

Matching entries that share the same SID and where the EID of i3 is greater than the EID of i2 (that is, i3 occurs after i2 in the same sequence), we find two matches:

(1,1) + (1,2), and (2,1) + (2,2)

which give the id_list of <i2,i3>:

Sid EID(i2) EID(i3)
1 1 2
2 1 2

Therefore, the support of <i2,i3> is 2. Pruning the second round in the same way gives the sequence patterns:

<i2,i3>:2

<i2,i4>:2

As in the GSP case, removing the first item of <i2,i3> gives <i3> while removing the last item of <i2,i4> gives <i2>; they are not the same (in either join order), so the two patterns cannot be joined and the algorithm terminates. The final sequence patterns are:

<i2,i3>, <i2,i4>
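A minimal sketch of the id_list temporal join used above (single-item events only, an assumption for illustration; full SPADE also joins id_lists for items inside the same event):

    def temporal_join(idlist_a, idlist_b):
        # Join the id_lists of <a> and <b> into the id_list of <a, b>:
        # same SID, and the event of b occurs strictly after the event of a.
        return [(sid_a, eid_a, eid_b)
                for (sid_a, eid_a) in idlist_a
                for (sid_b, eid_b) in idlist_b
                if sid_a == sid_b and eid_a < eid_b]

    idlist_i2 = [(1, 1), (2, 1), (3, 1)]
    idlist_i3 = [(1, 2), (2, 2)]

    joined = temporal_join(idlist_i2, idlist_i3)
    print(joined)                              # [(1, 1, 2), (2, 1, 2)]
    print(len({sid for sid, _, _ in joined}))  # support of <i2,i3> = 2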

In fact, SPADE is essentially the same algorithm as GSP. The difference is that, because id_lists are maintained, the id_list of each new candidate (and hence its support) is computed from the id_lists of the previous round, and the id_lists shrink steadily as pruning proceeds. This eliminates GSP's need to scan dataset D many times.

(Reference: http://blog.csdn.net/rongyongfeikai2/article/details/40478335)
