FP-growth Sequence Frequent Pattern Mining - Data Mining

Source: Internet
Author: User
1 Algorithm Design Objectives

Entering commands is the basic way users interact with a Linux server. By collecting, over a long period, the command sequences that different users enter while using a server, and then mining the command sequences that occur frequently, we can learn the basic patterns of how users use the server.

In addition, if there is more than one server, we can mine the command sequences entered by users on all of them. The frequent patterns that emerge show the basic purposes for which users use these servers. If the servers were attacked by the same hacker, or suffered the same type of attack, then the attacker's commands that appear in the mined frequent command patterns can help us understand the attack techniques, reconstruct the attack, and lay the foundation for prevention.

This project uses the FP-growth algorithm to mine frequent patterns in user-entered command sequences. From the command sequences collected in different time periods, transactions are built per user login, where a user is identified by shell + IP + hostname (records in which all three match belong to the same user). Frequent pattern mining of the command sequences is then carried out on these transactions.

2 Basic Principles of the Algorithm

The FP-growth algorithm finds frequent itemsets: sets of items whose number of occurrences across many transactions reaches a given threshold. An FP tree is a compressed representation of the input data. It is constructed by reading the transactions one at a time and mapping each transaction to a path in the FP tree. Because different transactions may have several items in common, their paths may partially overlap. The more the paths overlap, the better the compression achieved by the FP tree structure. The following table shows a dataset that contains 10 transactions and 5 items.

TID   Items
1     {a, b}
2     {b, c, d}
3     {a, c, d, e}
4     {a, d, e}
5     {a, b, c}
6     {a, b, c, d}
7     {a}
8     {a, b, c}
9     {a, b, d}
10    {b, c, e}

2.1 The basic process of FP tree construction

The figure below shows the FP tree after the first three transactions have been read, and the FP tree once construction is complete. Initially, the FP tree contains only a root node, labeled null. The tree is then extended as follows:

STEP1: Scan the dataset once to determine the support count of each item. Discard the infrequent items and sort the frequent items in descending order of support. For the dataset above, the items from most to least frequent are a, b, c, d, e.

STEP2: The algorithm scans the dataset a second time to construct the FP tree. After reading the first transaction {a, b}, it creates nodes labeled a and b and forms the path null->a->b to encode the transaction. Every node on this path has a frequency count of 1.

STEP3: After reading the second transaction {b, c, d}, the algorithm creates a new set of nodes for the items b, c, and d and connects them as null->b->c->d to form a path representing the transaction. Each node on this path also has a frequency count of 1. Although the first two transactions share the item b, their paths do not overlap because the transactions do not have a common prefix.

STEP4: The third transaction {a, c, d, e} shares the prefix item a with the first transaction, so its path null->a->c->d->e overlaps with the first transaction's path null->a->b at node a. Because of this overlap, the frequency count of node a increases to 2, while the new nodes c, d, and e each have a frequency count of 1.

STEP5: Continue this process until every transaction has been mapped to a path in the FP tree. Once all transactions have been read, the FP tree is complete.

STEP6: The FP tree also contains a list of pointers connecting nodes that hold the same item. These pointers, drawn as dashed lines in the figure, help to quickly access individual items in the tree.
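The construction steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than the project's actual code: the names FPNode, build_fp_tree, and header are hypothetical, the minimum support of 2 is the value assumed later in the worked example, and ties in support are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

# The ten example transactions from the table above.
TRANSACTIONS = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]


class FPNode:
    """One FP-tree node: an item, its count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    # First scan: count the support of every item and drop infrequent ones.
    support = Counter(item for t in transactions for item in t)
    frequent = {i: n for i, n in support.items() if n >= min_support}
    # Rank items by descending support (ties broken alphabetically).
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: position for position, item in enumerate(ordered)}

    root = FPNode(None, None)          # the "null" root node
    header = {i: [] for i in ordered}  # item -> nodes holding it (the dashed links)

    # Second scan: insert each transaction as a path, sharing common prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


root, header = build_fp_tree(TRANSACTIONS, min_support=2)
print({item: node.count for item, node in root.children.items()})
print({item: len(nodes) for item, nodes in header.items()})
```

For the example dataset, the root ends up with two children (a with count 8 and b with count 2), and the header list for e points at three nodes, one for each path that ends in e.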


2.2 Frequent itemset mining process

FP-growth generates frequent itemsets from the FP tree by exploring it bottom-up. Given the FP tree built above, the algorithm first looks for frequent itemsets ending in e, then d, c, b, and finally a, recursively building a conditional FP tree for each suffix. Because each transaction is mapped to a path in the FP tree, the frequent itemsets ending with a particular item (for example, e) can be found by examining only the paths that contain the node e. These paths can be accessed quickly using the pointers associated with node e.
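As a small illustration of where the order e, d, c, b, a comes from, the sketch below counts item support in the example dataset and lists the frequent items from least to most frequent. The names MIN_SUPPORT and mining_order are hypothetical, and the threshold of 2 is the value assumed in the worked example.

```python
from collections import Counter

# The ten example transactions from the table in section 2.
TRANSACTIONS = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]
MIN_SUPPORT = 2  # assumed threshold for this example

# Support count of each single item.
support = Counter(item for t in TRANSACTIONS for item in t)
frequent = {i: n for i, n in support.items() if n >= MIN_SUPPORT}

# Suffix items are processed from least to most frequent, i.e. from the
# bottom of the FP tree upwards.
mining_order = sorted(frequent, key=lambda i: (frequent[i], i))
print(mining_order)  # ['e', 'd', 'c', 'b', 'a']
```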

The first step is to collect all the paths that contain node e. These initial paths are called prefix paths and are shown in Figure a below.

STEP1: From the prefix paths shown in Figure a, the support count of e is obtained by adding up the support counts associated with node e. Assuming that the minimum support count is 2, {e} is a frequent itemset because its support count is 3.

STEP2: Because {e} is frequent, the algorithm must solve the subproblems of finding the frequent itemsets ending in de, ce, be, and ae. Before solving these subproblems, it first converts the prefix paths into a conditional FP tree. A conditional FP tree is structurally similar to an FP tree, except that it is used to find frequent itemsets ending with a particular suffix. It is obtained by the following steps.

Step2.1: First, the support counts along the prefix paths must be updated, because some of the counts include transactions that do not contain item e. For example, the rightmost path in the figure below, null->b:2->c:2->e:1, counts a transaction (transaction 2, {b, c, d}) that shares the prefix b, c but does not contain item e. The counts along this prefix path must therefore be adjusted to 1 to reflect the actual number of transactions that contain {b, c, e}.

Step2.2: Truncate the prefix paths by removing the nodes for e. These nodes can be removed because the support counts along the prefix paths have been updated to reflect only the transactions that contain e, and the subproblems of finding frequent itemsets ending in de, ce, be, and ae no longer need information about node e.

Step2.3: After the support counts along the prefix paths have been updated, some items may no longer be frequent. For example, node b appears only once and has a support count of 1, which means that only one transaction contains both b and e. Every itemset ending in be must therefore be infrequent, so b can safely be ignored in the subsequent analysis.

Step2.4: The conditional FP tree for e is shown in Figure b below. It looks different from the original prefix paths because the frequency counts have been updated and the nodes for b and e have been removed. Because the tree is not a single path, mining must continue.

Step2.5: FP-growth uses the conditional FP tree for e to solve the subproblems of finding the frequent itemsets ending in de, ce, and ae (be was already ruled out in Step2.3). To find the frequent itemsets ending in de, the algorithm gathers the prefix paths of d from the conditional FP tree for e. Summing the frequency counts associated with node d gives the support count of the itemset {d, e}. Because this support count is 2, {d, e} is a frequent itemset. The algorithm then constructs the conditional FP tree for de using the method described in the previous steps. After the support counts are updated and the infrequent item c is removed, the conditional FP tree for de is as shown in Figure d below. This tree contains only one item, a, whose support equals the minimum support, and it is a single-path FP tree, so combining the nodes on the path with {d, e} yields the frequent itemset {a, d, e}. The algorithm then moves on to the next subproblem, generating the frequent itemsets ending in ce. After processing the prefix paths of c, only the itemset {c, e} is found to be frequent. Finally, the algorithm solves the last subproblem and finds that {a, e} is the only remaining frequent itemset.
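The prefix paths of e and the adjusted counts from Step2.1 to Step2.3 can be reproduced with a short sketch. As a simplification, it reads the prefix paths directly from the support-ordered transactions instead of following the node pointers of an FP tree; for this dataset both approaches give the same paths. The names prefix_paths, ORDER, and MIN_SUPPORT are hypothetical.

```python
from collections import Counter

# The ten example transactions from the table in section 2.
TRANSACTIONS = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]
MIN_SUPPORT = 2   # minimum support count assumed in the worked example
ORDER = "abcde"   # items in descending order of support (section 2.1, STEP1)


def prefix_paths(transactions, suffix_item):
    """Return {prefix path: count} for the transactions containing suffix_item.

    Each prefix is the part of the support-ordered transaction that precedes
    suffix_item, i.e. the path from the root down to that item's node.
    """
    paths = Counter()
    for t in transactions:
        if suffix_item in t:
            ordered = [i for i in ORDER if i in t]
            paths[tuple(ordered[:ordered.index(suffix_item)])] += 1
    return paths


paths_e = prefix_paths(TRANSACTIONS, "e")

# Support of each item among the transactions that contain e.
cond_support = Counter()
for path, count in paths_e.items():
    for item in path:
        cond_support[item] += count

# Items that stay frequent in the conditional data for e (b drops out).
kept = {i: n for i, n in cond_support.items() if n >= MIN_SUPPORT}
print(dict(paths_e))
print(kept)
```

In this run, b has a conditional support of 1 and is dropped, while a, c, and d each keep a support of 2, which matches the conditional FP tree for e described in Step2.4.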

After finding the frequent itemsets ending in e, the algorithm goes on to look for the frequent itemsets ending in d by processing the paths associated with node d, and continues in the same way until all the paths associated with nodes c, b, and a have been processed. In each recursion, the conditional FP tree is constructed by updating the support counts along the prefix paths and removing the infrequent items. Because the subproblems are disjoint, FP-growth does not generate any duplicate itemsets. In addition, the counts associated with the nodes allow the algorithm to perform support counting while generating the itemsets with a common suffix.
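The whole recursion can be sketched compactly in Python. Note the simplification: instead of building a compressed conditional FP tree at each level, this sketch recurses on conditional pattern bases stored as plain lists of (prefix path, count) pairs. It follows the same divide-and-conquer idea and yields the same itemsets, but without the memory savings of the tree. The function names mine and fp_growth are hypothetical.

```python
from collections import Counter

# The ten example transactions from the table in section 2.
TRANSACTIONS = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]


def mine(pattern_base, min_support, suffix=()):
    """Yield (itemset, support) pairs for all frequent itemsets ending in suffix.

    pattern_base is a list of (support-ordered item tuple, count) pairs.
    """
    # Support of each item within this conditional pattern base.
    support = Counter()
    for path, count in pattern_base:
        for item in path:
            support[item] += count
    for item, n in support.items():
        if n < min_support:
            continue  # infrequent here, so every extension is infrequent too
        itemset = (item,) + suffix
        yield itemset, n
        # Conditional pattern base of `item`: whatever precedes it on each path.
        conditional = [(path[:path.index(item)], count)
                       for path, count in pattern_base if item in path]
        yield from mine(conditional, min_support, itemset)


def fp_growth(transactions, min_support):
    # Order the frequent items by descending support (ties alphabetically).
    support = Counter(i for t in transactions for i in t)
    order = sorted((i for i in support if support[i] >= min_support),
                   key=lambda i: (-support[i], i))
    base = [(tuple(i for i in order if i in t), 1) for t in transactions]
    return dict(mine(base, min_support))


frequent_itemsets = fp_growth(TRANSACTIONS, min_support=2)
for itemset, n in frequent_itemsets.items():
    if itemset[-1] == "e":
        print(itemset, n)
```

With a minimum support of 2, the itemsets ending in e that this sketch reports are {e}, {d, e}, {a, d, e}, {c, e}, and {a, e}, the same ones found in the walkthrough above. Each itemset is produced exactly once because it is only ever built by prepending items that precede its current first item in the support order.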

