--GSP algorithm for sequential pattern mining of association rules

Source: Internet
Author: User

The association patterns discussed in the earlier article "Apriori algorithm for association rules" emphasize co-occurrence relationships while ignoring the sequence information (time/space) in the data:

Time sequence: a customer who buys product X is likely to buy product Y within some period of time;

Spatial sequence: if phenomenon X is found at one point, phenomenon Y is likely to be found at the next point.

Example: customers who purchased a Pentium PC 6 months ago are likely to order a new CPU chip within one month.

Note: 1) sequential pattern = association rule + time/space dimension

2) The sequential pattern mining discussed here refers to mining in the time dimension.

First, the basic definitions

Sequence: all events associated with an object A are ordered by timestamp, which gives the sequence s of object A.

Element (transaction): a sequence is an ordered list of elements (transactions), written s = <e1 e2 ... en>, where each element ei is a set of one or more events (items), that is, ei = {i1, i2, ..., ik}.

Length of a sequence: the number of elements in the sequence.

Size of a sequence: the number of events in the sequence. A k-sequence is a sequence that contains k events.

For example, the following course sequence contains 4 elements and 8 events.

Subsequence: a sequence t is a subsequence of another sequence s if each ordered element of t is a subset of an ordered element of s. That is, the sequence t = <t1 t2 ... tm> is a subsequence of the sequence s = <s1 s2 ... sn> if there exist integers 1 <= j1 < j2 < ... < jm <= n such that t1 ⊆ s_j1, t2 ⊆ s_j2, ..., tm ⊆ s_jm.
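The subsequence test can be sketched in a few lines of Python. This is only an illustration: the representation (a sequence as a list of sets, one set per element), the function name `is_subsequence`, and the sample sequence are assumptions made here, not part of the original text.

```python
def is_subsequence(t, s):
    """Return True if sequence t is a subsequence of sequence s.

    Both sequences are lists of sets (one set of events per element).
    Greedy matching: each element of t is mapped to the earliest
    remaining element of s that contains it.
    """
    pos = 0
    for element in t:
        while pos < len(s) and not element <= s[pos]:
            pos += 1
        if pos == len(s):
            return False
        pos += 1  # the next element of t must match a later element of s
    return True


# Hypothetical sample sequence with 3 elements and 6 events.
s = [{2, 4}, {3, 5, 6}, {8}]
print(is_subsequence([{2}, {3, 6}, {8}], s))  # True
print(is_subsequence([{2}, {8}], s))          # True
print(is_subsequence([{3}, {2}], s))          # False: order is violated
print(is_subsequence([{2, 3}], s))            # False: 2 and 3 are in different elements
```

Greedy earliest matching is sufficient here because no time constraints are involved: taking the earliest possible match never rules out a later one.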

Cases:

Sequence database: a dataset containing one or more data sequences, as follows:

Second, sequential pattern mining

Support of a sequence: the support of a sequence s is the fraction of all data sequences that contain s (a data sequence is the ordered list of events associated with a single data object, such as A, B, or C in the previous example). If the support of s is greater than or equal to minsup, then s is a sequential pattern (frequent sequence).

Sequential pattern mining: given a sequence dataset D and a user-specified minimum support minsup, find all sequences whose support is greater than or equal to minsup.

Example: in the following example, suppose minsup=50%. Because the sequence <{2} {2,3}> is contained in the data sequences A, B, and C, its support is 3/5=0.6, which is at least minsup; the other sequences are handled similarly.
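Support counting can be illustrated with a short Python sketch. The five data sequences below are hypothetical: they were chosen only so that <{2} {2,3}> is contained in A, B, and C, as in the example, and are not the original dataset. The names `contains` and `support` are likewise assumptions.

```python
def contains(data_seq, pattern):
    """True if `pattern` is a subsequence of `data_seq` (lists of sets)."""
    remaining = iter(data_seq)
    # `any` consumes the iterator up to the first matching element, so the
    # next pattern element is only matched against later elements.
    return all(any(element <= d for d in remaining) for element in pattern)


def support(database, pattern):
    """Return (fraction of data sequences containing pattern, their names)."""
    hits = [name for name, seq in database.items() if contains(seq, pattern)]
    return len(hits) / len(database), hits


# Hypothetical sequence database with five data objects.
database = {
    "A": [{1, 2, 4}, {2, 3}, {5}],
    "B": [{2}, {2, 3, 4}],
    "C": [{1, 2}, {2, 3, 4}, {2, 4, 5}],
    "D": [{2}, {3, 4}, {4, 5}],
    "E": [{1, 3}, {2, 4, 5}],
}
minsup = 0.5
value, hits = support(database, [{2}, {2, 3}])
print(value, hits, value >= minsup)  # 0.6 ['A', 'B', 'C'] True
```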

Generating sequential patterns

1. Brute-force method

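Enumerate all possible sequences and count the support of each. Note that the number of candidate sequences is much larger than the number of candidate itemsets, for two reasons: 1) an item can appear at most once in an itemset, but an event can appear several times in a sequence; 2) order matters in a sequence but not in an itemset.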

2. Apriori-like algorithm

Candidate generation: a pair of frequent (k-1)-sequences is merged to produce a candidate k-sequence. To avoid generating duplicate candidates, the merge rule is as follows:

A sequence s1 is merged with a sequence s2 only if the subsequence obtained by removing the first event of s1 is identical to the subsequence obtained by removing the last event of s2. The merged result is s1 extended with the last event of s2, which can be attached in two ways:

1) If the last two events of s2 belong to the same element, then the last event of s2 becomes part of the last element of s1 in the merged sequence;

2) If the last two events of s2 belong to different elements, then the last event of s2 becomes a separate element appended to the end of s1 in the merged sequence.

Examples:

< (1) (2) (3) > + < (2) (3) (4) > = < (1) (2) (3) (4) >: removing the first event (1) of s1 and the last event (4) of s2 leaves the same subsequence < (2) (3) >; the last two events (3) and (4) of s2 belong to different elements, so (4) is appended as a separate element.

< (2 5) (3) > + < (5) (3 4) > = < (2 5) (3 4) >: removing event 2 from s1 and event 4 from s2 leaves the same subsequence < (5) (3) >; because the last two events (3 4) of s2 belong to the same element, event 4 is merged into the last element of s1 rather than written as < (2 5) (3) (3 4) >.
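The merge rule can be sketched in Python as follows. The representation (a sequence as a tuple of sorted item tuples) and the function names are assumptions made for illustration. The sketch also leaves out the special case of joining two 1-sequences, where both <(x) (y)> and <(x y)> would be produced.

```python
def drop_first(seq):
    """Remove the first event of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last event of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def merge(s1, s2):
    """Merge two (k-1)-sequences into a candidate k-sequence, or return None."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) > 1:
        # Last two events of s2 share an element: extend s1's last element.
        return s1[:-1] + (s1[-1] + (last,),)
    # Otherwise the last event becomes a new element at the tail of s1.
    return s1 + ((last,),)


print(merge(((1,), (2,), (3,)), ((2,), (3,), (4,))))  # ((1,), (2,), (3,), (4,))
print(merge(((2, 5), (3,)), ((5,), (3, 4))))          # ((2, 5), (3, 4))
print(merge(((1,), (2,)), ((3,), (4,))))              # None
```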

Candidate pruning: a candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent.

In the example above, only <{1} {2,5} {3}> is left after candidate pruning.
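A minimal sketch of the pruning check, assuming sequences are tuples of sorted item tuples. The function names and the set of frequent 3-sequences are hypothetical, used only to show how the test behaves for the candidate <{1} {2,5} {3}>.

```python
def k_minus_1_subsequences(seq):
    """All subsequences obtained by deleting exactly one event from seq."""
    subs = set()
    for i, element in enumerate(seq):
        for j in range(len(element)):
            reduced = element[:j] + element[j + 1:]
            # An element left empty by the deletion is dropped entirely.
            subs.add(seq[:i] + ((reduced,) if reduced else ()) + seq[i + 1:])
    return subs


def survives_pruning(candidate, frequent):
    """Keep a candidate only if every (k-1)-subsequence is frequent."""
    return all(sub in frequent for sub in k_minus_1_subsequences(candidate))


candidate = ((1,), (2, 5), (3,))
# Hypothetical set of frequent 3-sequences.
frequent = {
    ((2, 5), (3,)),
    ((1,), (5,), (3,)),
    ((1,), (2,), (3,)),
    ((1,), (2, 5)),
}
print(survives_pruning(candidate, frequent))                        # True
print(survives_pruning(candidate, frequent - {((1,), (2, 5))}))     # False
```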

3. Time constraints

When time constraints are applied, each element of a sequential pattern is associated with a time window [l, u], where l is the earliest occurrence of an event within that time window and u is the latest occurrence of an event within that time window.

Maximum span constraint: the maximum allowed time difference between the latest and earliest occurrences of events in the entire sequence, denoted maxspan. In general, the longer maxspan is, the more likely a pattern is to be detected in a data sequence, but a longer maxspan may also capture spurious patterns.

Note: the maximum span affects support counting in sequential pattern discovery algorithms; once a maximum span constraint is applied, some data sequences no longer support a candidate pattern.

Minimum and maximum gap constraints: assume the maximum gap maxgap=3 (days) and the minimum gap mingap=1; then the events in an element must occur within three days of, and more than one day after, the events of the previous element.

Note: using the maximum gap constraint may violate the Apriori principle. Taking Figure 2.1 as an example: without constraints, <{2} {5}> and <{2} {3} {5}> both have a support of 60%. If the constraints mingap=0 and maxgap=1 are applied, the support of <{2} {5}> drops to 40% (data sequence D no longer supports it), while the support of <{2} {3} {5}> is still 60%. That is, the support of the supersequence is higher than that of its subsequence, which contradicts the Apriori principle. This problem can be avoided by using the concept of contiguous subsequences.

Cases:

The Apriori principle, modified to use contiguous subsequences, is as follows:

Revised Apriori principle: if a k-sequence is frequent, then all of its contiguous (k-1)-subsequences must also be frequent.

Note: by this principle, not all (k-1)-subsequences of a candidate need to be checked during the candidate pruning phase, because some of them violate the maximum gap constraint.

Example: if maxgap=1, there is no need to check whether the subsequence <{1} {2,3} {5}> of <{1} {2,3} {4} {5}> is frequent, because the time difference between {2,3} and {5} is 2, which is greater than one unit. Only its contiguous subsequences need to be examined: <{1} {2,3} {4}>, <{2,3} {4} {5}>, <{1} {2} {4} {5}>, <{1} {3} {4} {5}>.
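The contiguous (k-1)-subsequences in this example can be enumerated with a small Python sketch. It assumes the definition used in Srikant and Agrawal's paper: an event may be dropped from the first element, from the last element, or from any element that holds at least two events. The function name and tuple representation are assumptions.

```python
def contiguous_k_minus_1_subsequences(seq):
    """Contiguous subsequences obtained by deleting exactly one event."""
    subs = set()
    last_index = len(seq) - 1
    for i, element in enumerate(seq):
        # A single-event element in the middle may not be removed, because
        # that would widen the gap between its two neighbours.
        if len(element) < 2 and i not in (0, last_index):
            continue
        for j in range(len(element)):
            reduced = element[:j] + element[j + 1:]
            subs.add(seq[:i] + ((reduced,) if reduced else ()) + seq[i + 1:])
    return subs


candidate = ((1,), (2, 3), (4,), (5,))
for sub in sorted(contiguous_k_minus_1_subsequences(candidate)):
    print(sub)
```

For the candidate above, the result contains the four subsequences listed in the example and leaves out <{1} {2,3} {5}>.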

Window size constraint: the events of an element do not have to occur at the same time. A window size threshold (ws) can be defined to specify the maximum allowed time difference between the latest and earliest occurrences of events in any element of a sequential pattern (ws=0 means that all events in the same element must occur at the same time).

--GSP algorithm

Basic idea of the algorithm:

1. Obtain the set L1 of sequential patterns of length 1 and use it as the initial seed set;

2. From the seed set Li of length i, generate the candidate sequential patterns of length i+1 through the join operation and the prune operation; then scan the sequence database, compute the support of each candidate, and keep the sequential patterns of length i+1 as the new seed set;

3. Repeat step 2 until no new sequential pattern or no new candidate is generated.
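The three steps can be put together in a compact Python sketch. It is a simplified illustration under several assumptions: no time constraints, no hash tree (every candidate is checked against every data sequence), sequences stored as tuples of sorted item tuples, and a tiny hypothetical database. All function names are illustrative.

```python
from itertools import chain


def contains(data_seq, pattern):
    """True if pattern is a subsequence of data_seq."""
    remaining = iter(data_seq)
    return all(any(set(e) <= set(d) for d in remaining) for e in pattern)


def support(database, pattern):
    return sum(contains(d, pattern) for d in database) / len(database)


def drop_event(seq, i, j):
    """Remove the j-th event of the i-th element, dropping empty elements."""
    reduced = seq[i][:j] + seq[i][j + 1:]
    return seq[:i] + ((reduced,) if reduced else ()) + seq[i + 1:]


def join(seeds):
    """Join step: merge pairs of frequent (k-1)-sequences into k-candidates."""
    candidates = set()
    for s1 in seeds:
        for s2 in seeds:
            if drop_event(s1, 0, 0) != drop_event(s2, len(s2) - 1, len(s2[-1]) - 1):
                continue
            last = s2[-1][-1]
            if len(s2[-1]) > 1:   # last event joins the last element of s1
                candidates.add(s1[:-1] + (s1[-1] + (last,),))
            else:                 # last event becomes a new element
                candidates.add(s1 + ((last,),))
                # Two 1-sequences <(x)> and <(y)> also yield <(x y)>.
                if len(s1) == 1 and len(s1[0]) == 1 and s1[0][0] < last:
                    candidates.add(((s1[0][0], last),))
    return candidates


def prune(candidates, seeds):
    """Prune step: keep candidates whose (k-1)-subsequences are all frequent."""
    return {c for c in candidates
            if all(drop_event(c, i, j) in seeds
                   for i in range(len(c)) for j in range(len(c[i])))}


def gsp(database, minsup):
    items = set(chain.from_iterable(chain.from_iterable(database)))
    # Step 1: frequent 1-sequences form the initial seed set.
    seeds = {((i,),) for i in items if support(database, ((i,),)) >= minsup}
    patterns = set(seeds)
    # Steps 2 and 3: repeat join, prune, and count until no seeds remain.
    while seeds:
        candidates = prune(join(seeds), seeds)
        seeds = {c for c in candidates if support(database, c) >= minsup}
        patterns |= seeds
    return patterns


# Hypothetical database of three data sequences.
database = [((1,), (2,), (3,)), ((1,), (2,)), ((1, 2), (3,))]
for pattern in sorted(gsp(database, minsup=0.6)):
    print(pattern)
```

With this database and minsup=0.6, the frequent patterns are the three single events plus <(1) (2)>, <(1) (3)>, and <(2) (3)>; the only 3-candidate, <(1) (2) (3)>, is supported by a single data sequence and is discarded.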

Two major problems must be solved:

1. Candidate generation: merging + pruning, aiming to produce as few candidates as possible;

2. Support counting.

Two techniques are used for support counting:

1) Store the candidate sequences in a hash tree, which reduces the number of candidates that need to be checked for each data sequence.

2) Transform the representation of the original data sequences so that it is efficient to determine whether a candidate is a subsequence of a data sequence.

3. Concrete procedure:

Hash each item of each data sequence in the database to determine which leaf nodes of the hash tree hold candidate k-sequences that need to be examined. For each candidate k-sequence in such a leaf node, check whether it is contained in the data sequence; the count of every candidate contained in the data sequence is increased by 1.

How can we check whether a data sequence d contains a candidate k-sequence s? The check alternates between two phases, a forward phase and a backward phase:

Example: assume maxgap=30, mingap=5, ws=0, and check whether the candidate sequence s=< (1 2) (3) (4) > is contained in the following data sequence.

1) First find the first occurrence of the first element (1 2) of s in the data sequence; the corresponding time is 10;

2) Since mingap=5, search for the next element (3) after time 15 (10+5); its first occurrence is at time 45, and 45-10>30, so enter the backward phase;

3) Search again for the next occurrence of (1 2): time 50; then search for (3) after time 55: time 65, and 65-50<30, so the constraint is met;

4) Search for the first occurrence of the next element (4) after time 70 (65+5): time 90, and 90-65<30, so the constraint is met;

5) Conclusion: the data sequence contains s.
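The same containment question can be checked with a short Python sketch. Note the assumptions: it uses plain backtracking rather than GSP's forward and backward phases, it handles only ws=0 (each element must match a single transaction), and the data sequence below is hypothetical, chosen to be consistent with the timestamps in the walkthrough because the original table is not available. The function name `contains_timed` is also an assumption.

```python
def contains_timed(data_seq, pattern, mingap, maxgap):
    """Check containment under mingap/maxgap with ws=0.

    data_seq: list of (time, set_of_items), sorted by time.
    pattern:  list of item sets, one per element.
    Consecutive matched elements must satisfy mingap < gap <= maxgap.
    """
    def search(index, prev_time):
        if index == len(pattern):
            return True
        for time, items in data_seq:
            if prev_time is not None:
                gap = time - prev_time
                if gap <= mingap:
                    continue      # too close to the previous element
                if gap > maxgap:
                    break         # later transactions are even further away
            if pattern[index] <= items and search(index + 1, time):
                return True
        return False              # backtrack: try a later match upstream

    return search(0, None)


# Hypothetical data sequence consistent with the times 10, 45, 50, 65, 90.
data_seq = [
    (10, {1, 2}), (25, {4, 6}), (45, {3}), (50, {1, 2}),
    (65, {3}), (90, {2, 4}), (95, {6}),
]
pattern = [{1, 2}, {3}, {4}]
print(contains_timed(data_seq, pattern, mingap=5, maxgap=30))  # True
print(contains_timed(data_seq, pattern, mingap=5, maxgap=20))  # False
```

The first call succeeds through the match at times 50, 65, and 90, as in the walkthrough; with maxgap=20 the step from 65 to 90 is too long and no other match exists.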

References:

Srikant R, Agrawal R. Mining Sequential Patterns: Generalizations and Performance Improvements.

Pang-Ning Tan, et al. Introduction to Data Mining.
