Frequent sequential pattern mining

Source: Internet
Author: User

1. Frequent sequential pattern mining

Sequential patterns are a special kind of frequent pattern, but the scenarios in which they are applied are quite different. Consider the following example:

User  Items bought
U1    diapers, beer, coke
U2    bread, diapers, beer

The table above shows the baskets of two users. Diapers and beer are frequently bought together, so based on this frequent-itemset analysis a supermarket can place diapers and beer closer to each other, or sell them together in a promotion, to increase sales.
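As a minimal illustration (the two baskets above, hand-encoded; the Python representation is only an assumption for this sketch), the support of the itemset {diapers, beer} can be counted like this:

    # Count how many baskets contain the whole itemset (market-basket view).
    baskets = [
        {"diapers", "beer", "coke"},   # user 1
        {"bread", "diapers", "beer"},  # user 2
    ]

    itemset = {"diapers", "beer"}
    support = sum(1 for basket in baskets if itemset <= basket)
    print(support)  # 2 -> diapers and beer appear together in both baskets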

A sequential pattern, by contrast, looks like this:

U1 purchase items:
1st trip: TV
2nd trip: HD set-top box

U2 purchase items:
1st trip: TV, refrigerator
2nd trip: HD set-top box, refrigerator cleaner

The table above shows two consecutive shopping trips for each of two users. Notice that the HD set-top box tends to appear in a trip that follows the purchase of a TV, but rarely appears before the TV is bought: the earlier purchase of the TV promotes the later sale of the set-top box. Based on this kind of frequent-sequence analysis, a supermarket can recommend set-top boxes to users who have bought a TV, or run a set-top-box promotion after a TV sale, to increase revenue.

For ordinary frequent-pattern mining we can use algorithms such as Apriori and FP-Growth; for frequent-pattern mining on datasets whose records are ordered in time, there are the GSP and SPADE algorithms.

Definition: in sequential pattern mining, assume dataset D contains n sequences (seq); each sequence consists of multiple events (event), and the events within a sequence are ordered by time.

For a sequence s that appears as a subsequence of some of the sequences in dataset D, its support is computed as:

support(s) = the number of the n sequences in D that contain s

If support(s) >= min_support, then s is called a frequent sequence, or a sequential pattern. Two common mining algorithms are described below:
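A minimal sketch of this support computation (representing a sequence as a time-ordered list of item sets is an assumption made for illustration; the dataset is the three-sequence example used below):

    # A data sequence is a time-ordered list of events; each event is a set of items.
    # s is contained in seq if the events of s can be matched, in order, to events of seq.
    def is_subsequence(s, seq):
        i = 0
        for event in seq:
            if i < len(s) and s[i] <= event:   # s[i] <= event: subset test
                i += 1
        return i == len(s)

    def support(s, dataset):
        return sum(1 for seq in dataset if is_subsequence(s, seq))

    dataset = [
        [{"i1", "i2"}, {"i3"}, {"i4"}],  # SID 1
        [{"i2"}, {"i3"}],                # SID 2
        [{"i2"}, {"i4"}],                # SID 3
    ]
    print(support([{"i2"}, {"i3"}], dataset))  # 2, so <i2,i3> is frequent for min_support = 2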


2. GSP algorithm


The GSP algorithm is essentially an Apriori-style algorithm; it also consists of two steps: 1. self-join (candidate generation); 2. pruning.

How are sequences self-joined? The definition is as follows: for sequences s1 and s2, if the sequence obtained by removing the first item of s1 is identical to the sequence obtained by removing the last item of s2, then s1 and s2 can be joined. The join appends the last item of s2 to s1, and the resulting new sequence is the candidate produced from s1 and s2.
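Here is a minimal sketch of this join rule, restricted to sequences whose events each contain a single item (a simplification; full GSP also handles joins that add an item to the last event):

    # GSP self-join for sequences represented as tuples of single items (simplified).
    def can_join(s1, s2):
        # s1 without its first item must equal s2 without its last item.
        return s1[1:] == s2[:-1]

    def join(s1, s2):
        # Append the last item of s2 to s1 to form the new candidate.
        return s1 + (s2[-1],)

    print(can_join(("i2", "i3"), ("i3", "i4")))  # True
    print(join(("i2", "i3"), ("i3", "i4")))      # ('i2', 'i3', 'i4')
    print(can_join(("i2", "i3"), ("i2", "i4")))  # False -> these two cannot be joined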

The pruning conditions of the GSP algorithm are the same as those of the Apriori algorithm (a small pruning sketch follows the list below):

1. If the support of a candidate sequence is less than min_support, it is pruned;

2. Every subsequence of a frequent sequence must itself be frequent, so a candidate that has an infrequent subsequence can be pruned without counting its support.
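A minimal sketch of the second condition (the Apriori property), again for single-item events and assuming the frequent patterns of the previous round are kept in a set:

    # Apriori-property pruning: drop a candidate if any (k-1)-subsequence,
    # obtained by deleting one item, is not among the frequent (k-1)-patterns.
    def survives_pruning(candidate, frequent_prev):
        for i in range(len(candidate)):
            sub = candidate[:i] + candidate[i + 1:]
            if sub not in frequent_prev:
                return False
        return True

    frequent_1 = {("i2",), ("i3",), ("i4",)}
    print(survives_pruning(("i2", "i3"), frequent_1))  # True
    print(survives_pruning(("i1", "i3"), frequent_1))  # False, because <i1> is not frequent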

We assume dataset D contains three sequences (SID = 1, 2, 3) and that min_support = 2:

Sid EID Items
1 1 i1, i2
1 2 i3
1 3 i4
2 1 i2
2 2 i3
3 1 i2
3 2 i4
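Encoded as a small Python structure (this representation is only an assumption for the sketch), the single-item supports below can be reproduced directly:

    # Each sequence: a time-ordered list of events; each event: a set of items.
    dataset = {
        1: [{"i1", "i2"}, {"i3"}, {"i4"}],
        2: [{"i2"}, {"i3"}],
        3: [{"i2"}, {"i4"}],
    }

    for item in ["i1", "i2", "i3", "i4"]:
        count = sum(1 for seq in dataset.values()
                    if any(item in event for event in seq))
        print(f"<{item}>: {count}")  # <i1>: 1, <i2>: 3, <i3>: 2, <i4>: 2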

We first generate the candidate 1-sequences (in the notation below, angle brackets <> mean that the items inside are ordered across events, while parentheses () mean that the items inside occur together in a single event):

<i1>, <i2>, <i3>, <i4>

Their supports are:

<i1>:1

<i2>:3

<i3>:2

<i4>:2

After pruning, the sequence patterns obtained are:

<i2>:3

<i3>:2

<i4>:2

Then, self-joining the sequence patterns obtained in the previous step gives the next round of candidate sequences (a small enumeration sketch follows the list below):

<i2,i2>:0

<i2,i3>:2

<i2,i4>:2

<i3,i2>:0

<i3,i3>:0

<i3,i4>:1

<i4,i2>:0

<i4,i3>:0

<i4,i4>:0

(i2,i3): 0

(i2,i4): 0

(i3,i4): 0
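A minimal sketch of this candidate generation from the three frequent 1-patterns (ordered pairs plus unordered same-event pairs; the representation is assumed for illustration):

    from itertools import combinations, product

    frequent_items = ["i2", "i3", "i4"]

    # Ordered candidates <a, b>: every ordered pair, including a == b.
    ordered = [(a, b) for a, b in product(frequent_items, repeat=2)]

    # Same-event candidates (a, b): unordered pairs of distinct items.
    same_event = list(combinations(frequent_items, 2))

    print(len(ordered), len(same_event))  # 9 3 -> the 12 candidates listed above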

After pruning, the resulting sequence patterns are:

<i2,i3>:2

<i2,i4>:2

Removing the first item of <i2,i3> gives <i3>, while removing the last item of <i2,i4> gives <i2>; these are not the same (and the check fails in the other join order as well), so the two patterns cannot be joined and the algorithm terminates. The resulting sequence patterns are:

<i2,i3>, <i2,i4>

That is, a user who buys i2 is likely to buy i3 in a later purchase, and a user who buys i2 is likely to buy i4 in a later purchase.

The GSP algorithm is straightforward, but it has a drawback: every time the supports of the candidate sequences are computed, a full scan of dataset D is required. As a rough estimate, suppose D contains m distinct items and that none of the candidate 1-sequences is pruned in the first step. The number of candidate 2-sequences generated from them is then m*m + m(m-1)/2 (m*m ordered sequences plus m(m-1)/2 pairs of items occurring together in one event); for m = 3 this is 9 + 3 = 12, exactly the twelve candidates listed above. Counting their supports requires reading on the order of (m*m + m(m-1)/2) * |D| records, where |D| is the number of records in D, and that is only the cost of pruning a single round of candidates, so the total amount of scanning is very large.

To address this problem, the SPADE algorithm was proposed.


3. SPADE algorithm

Building on the GSP algorithm, the SPADE algorithm introduces the concept of an id_list, which avoids scanning dataset D over and over. An id_list is a set of (SID, EID) pairs.

Let's revisit the example above and see how the SPADE algorithm proceeds.

For the first round of candidate sequences <i1>, <i2>, <i3>, <i4>, we build their id_lists:

I1

Sid EID
1 1

I2

Sid EID
1 1
2 1
3 1

I3

Sid EID
1 2
2 2

I4

Sid EID
1 3
3 2
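A minimal sketch of building these id_lists from the dataset (same hand-encoded representation as in the earlier sketches):

    from collections import defaultdict

    # dataset: SID -> time-ordered list of events (sets of items); EIDs start at 1.
    dataset = {
        1: [{"i1", "i2"}, {"i3"}, {"i4"}],
        2: [{"i2"}, {"i3"}],
        3: [{"i2"}, {"i4"}],
    }

    id_lists = defaultdict(list)          # item -> list of (SID, EID) pairs
    for sid, events in dataset.items():
        for eid, event in enumerate(events, start=1):
            for item in event:
                id_lists[item].append((sid, eid))

    for item in sorted(id_lists):
        print(item, id_lists[item])
    # i1 [(1, 1)]
    # i2 [(1, 1), (2, 1), (3, 1)]
    # i3 [(1, 2), (2, 2)]
    # i4 [(1, 3), (3, 2)]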

From the id_lists we can read off the supports directly (the number of distinct SIDs in each list):

<i1>:1

<i2>:3

<i3>:2

<i4>:2

After pruning, the sequence patterns obtained are:

<i2>:3

<i3>:2

<i4>:2


We then generate this round's candidate set from those sequence patterns, together with the id_list of each candidate (which can be derived from the id_lists of the previous round rather than by rescanning D). The candidate set is:

<i2,i2>:0

<i2,i3>:2

<i2,i4>:2

<i3,i2>:0

<i3,i3>:0

<i3,i4>:1

<i4,i2>:0

<i4,i3>:0

<i4,i4>:0

(i2,i3): 0

(i2,i4): 0

(i3,i4): 0


Take the id_list of <i2,i3> as an example:

The id_list of i2 has 3 entries: (1,1), (2,1), (3,1);

the id_list of i3 has 2 entries: (1,2), (2,2).

Matching entries that share the same SID and where the EID of i3 is greater than the EID of i2 (that is, i3 occurs after i2 in the same sequence), we find two matches:

(1,1) + (1,2), and (2,1) + (2,2)

which give the id_list of <i2,i3>:

Sid EID(i2) EID(i3)
1 1 2
2 1 2

Therefore, the support of <i2,i3> is 2. Pruning the second round in the same way gives the sequence patterns:

<i2,i3>:2

<i2,i4>:2

As in the GSP case, removing the first item of <i2,i3> gives <i3> while removing the last item of <i2,i4> gives <i2>; they are not the same (in either join order), so the two patterns cannot be joined and the algorithm terminates. The final sequence patterns are:

<i2,i3>, <i2,i4>
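A minimal sketch of the id_list temporal join used above (single-item events only, an assumption for illustration; full SPADE also joins id_lists for items inside the same event):

    def temporal_join(idlist_a, idlist_b):
        # Join the id_lists of <a> and <b> into the id_list of <a, b>:
        # same SID, and the event of b occurs strictly after the event of a.
        return [(sid_a, eid_a, eid_b)
                for (sid_a, eid_a) in idlist_a
                for (sid_b, eid_b) in idlist_b
                if sid_a == sid_b and eid_a < eid_b]

    idlist_i2 = [(1, 1), (2, 1), (3, 1)]
    idlist_i3 = [(1, 2), (2, 2)]

    joined = temporal_join(idlist_i2, idlist_i3)
    print(joined)                              # [(1, 1, 2), (2, 1, 2)]
    print(len({sid for sid, _, _ in joined}))  # support of <i2,i3> = 2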

In fact, SPADE is essentially the same algorithm as GSP. The difference is that, because id_lists are maintained, the id_list of each new candidate (and hence its support) is computed from the id_lists of the previous round, and the id_lists shrink steadily as pruning proceeds. This eliminates GSP's need to scan dataset D many times.

(Reference: http://blog.csdn.net/rongyongfeikai2/article/details/40478335)
