Preface

In the Enterprise Security Construction series I have occasionally mentioned applications of machine learning algorithms, and many readers wanted to learn more about this area, so I opened this sub-series to introduce the machine learning models commonly used in the security field, from entry-level SVM and Bayesian methods up to neural networks and deep learning (in fact, deep learning can be thought of as an enhanced version of neural networks).
Mining Association Rules

Association rule mining is usually unsupervised learning: by analyzing a data set, it digs out the potential association rules hidden inside, the most typical example being the diapers-and-beer story. According to legend, Wal-Mart's data analysts, after studying a large number of shopping lists, found that a significant portion of consumers bought diapers and beer at the same time; the store then placed diapers and beer together on the shelves, and sales of both grew. The result of association rule analysis is a manifestation of objective phenomena: some rules are obvious, such as buying salmon and mustard together; some can barely be explained, such as diapers and beer; and some are unthinkable, such as lighters and cheese. The most famous association rule algorithm is the Apriori algorithm.
Apriori Introduction

First, we introduce three basic concepts: support, confidence, and frequent K-itemsets.

Support, P(A∩B), is the frequency at which events A and B occur together relative to the entire data set. For example, if the support of diapers and beer is 0.2, then 20% of the shopping lists contain both diapers and beer.

Confidence, P(B|A) = P(A∩B)/P(A), is the probability that event B occurs given that event A occurs; it measures the degree of correlation between the two events. For example, if the confidence of diapers → beer is 0.8, then among the consumers who bought diapers, 80% also bought beer. As a special note, P(A∩B) = P(B∩A), but P(B|A) and P(A|B) are two different things.

If itemset A contains k elements, A is called a K-itemset; a K-itemset that meets the minimum support threshold is called a frequent K-itemset. The Apriori algorithm mines the association rules that satisfy both a minimum support threshold and a minimum confidence threshold.
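To make the two formulas concrete, here is a minimal sketch; the basket data below is made up purely for illustration:

# A minimal sketch of support and confidence on made-up baskets.
baskets = [
    ['diapers', 'beer', 'milk'],
    ['diapers', 'beer'],
    ['diapers', 'bread'],
    ['beer', 'bread'],
    ['diapers', 'beer', 'bread'],
]

n = float(len(baskets))
# support(A∩B): fraction of baskets containing both items
supp_ab = sum(1 for b in baskets if 'diapers' in b and 'beer' in b) / n
# support(A): fraction of baskets containing diapers
supp_a = sum(1 for b in baskets if 'diapers' in b) / n
# confidence(A -> B) = P(A∩B) / P(A)
conf = supp_ab / supp_a
print('support: %.2f, confidence: %.2f' % (supp_ab, conf))  # support: 0.60, confidence: 0.75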
Apriori Fundamentals

The Apriori algorithm uses prior knowledge of frequent itemsets and an iterative approach called layer-wise search, in which frequent K-itemsets are used to explore frequent (K+1)-itemsets. First, by scanning the transaction records, find all frequent 1-itemsets; that collection is called L1. Then use L1 to find the collection of frequent 2-itemsets L2, use L2 to find L3, and so on, until no more frequent K-itemsets can be found. Finally, the strong rules are extracted from all the frequent itemsets, that is, the association rules that are of interest to the user. The Apriori algorithm relies on the property that every non-empty subset of a frequent itemset must also be frequent. Equivalently, if P(I) is below the minimum support threshold, then for any element A added to I, the support of the resulting itemset A∪I cannot be higher than that of I, so A∪I is not frequent either.
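To make one layer-wise step concrete, here is a minimal sketch on made-up transactions; 'D' is pruned at level 1, so no 2-itemset containing it is ever considered. A full implementation follows in a later section:

# One layer-wise step of Apriori on made-up transactions: candidates for the
# next level are built only from itemsets that were frequent at this level.
from itertools import combinations

transactions = [set('ABC'), set('AB'), set('AC'), set('BC'), set('AD')]
min_support = 0.4
n = float(len(transactions))

def support(itemset):
    # fraction of transactions that contain the whole itemset
    return sum(1 for t in transactions if itemset <= t) / n

# Level 1: frequent 1-itemsets (here 'D' falls below min_support and is dropped)
items = set().union(*transactions)
L1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Level 2: candidates come only from frequent 1-itemsets (Apriori pruning),
# then are filtered again by support.
C2 = [a | b for a, b in combinations(L1, 2)]
L2 = [c for c in C2 if support(c) >= min_support]
print(L2)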
Application of Apriori

Apriori is widely used in the security field and can be tried on any data in which potential relationships need to be mined, for example correlating a WAF's access log with the SQL log of the backend database, or identifying anomalies in SSH operation logs. Here we take access-log analysis as an example. We extracted XSS attack samples from the xssed website together with the interception log of a WAF, such as:

/0_1/?%22onmouseover='Prompt(42873)'bad=%22%3e
/0_1/api.php?op=map&maptype=1&city=test%3Cscript%3Ealert%28/42873/%29%3C/script%3E
/0_1/api.php?op=map&maptype=1&defaultcity=%e5%22;alert%28/42873/%29;//

Our goal is to analyze the potential correlations, which can then serve as one of the bases for feature extraction for classification algorithms such as SVM and KNN. The machine has no way to understand the raw log directly, so each log line first needs to be vectorized; the simplest way is to cut it into a word vector along certain separators. The sample code is as follows:

import re

myDat = []
with open("xss-train.txt") as f:
    for line in f:
        # split on the separators that commonly delimit XSS payloads,
        # including the URL-encoded forms %3c/%3e (< and >)
        tokens = re.split('\=|&|\?|\%3e|\%3c|\%3E|\%3C|\%20|\%22|<|>|\\n|\(|\)|\'|\"|;|:|,', line)
        myDat.append(tokens)

The resulting vectors look like this:

['/0_1/', '', 'onmouseover', '', 'Prompt', '42873', '', '', '', '', '']
['/0_1/api.php', 'op', 'map', 'maptype', '1', 'city', 'test', 'script', 'alert%28/42873/%29', '/script', '', '']

We run with a very strict confidence threshold, trying to find associations that hold in close to 100% of cases:

L, suppData = apriori(myDat, 0.1)
rules = generateRules(L, suppData, minConf=0.99)

An interesting phenomenon arises:

frozenset(['/', '1']) --> frozenset(['', 'alert']) conf: 1.0
frozenset(['1', 'script']) --> frozenset(['', '/script']) conf: 1.0
frozenset(['/', 'script']) --> frozenset(['', '/script']) conf: 0.997576736672
frozenset(['type', 'title']) --> frozenset(['a', '']) conf: 0.996108949416
frozenset(['a', 'title']) --> frozenset(['', 'type']) conf: 0.993210475267
frozenset(['a', 'c']) --> frozenset(['', 'm']) conf: 1.0
frozenset(['1', '/', 'script']) --> frozenset(['', '/script']) conf: 1.0
frozenset(['1', 'alert', 'script']) --> frozenset(['', '/script']) conf: 1.0
frozenset(['alert', '/', 'script']) --> frozenset(['', '/script']) conf: 0.997416020672
frozenset(['1', 'alert', '/', 'script']) --> frozenset(['', '/script']) conf: 1.0

Some of the results are easy to understand: for example, 'script' and '1' appearing together imply '/script' with 100% probability. Others are unthinkable, such as 'a' and 'c' appearing together implying 'm' with 100% probability.
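As a sketch of the feature-extraction idea mentioned above: each mined rule can become one binary feature of a log line. Note that rule_to_feature is a hypothetical helper name introduced here for illustration, and rules is assumed to be the output of generateRules, i.e. a list of (antecedent, consequent, confidence) triples:

# A hedged sketch: turn mined rules into binary features for SVM/KNN.
# rule_to_feature is a hypothetical helper, not part of the original code.
def rule_to_feature(tokens, rules):
    # one feature per rule: 1 if both the antecedent and the consequent
    # appear in this tokenized log line, else 0
    token_set = set(tokens)
    return [1 if (ante | conseq) <= token_set else 0
            for ante, conseq, conf in rules]

# Example: build a feature matrix for a downstream classifier
# X = [rule_to_feature(tokens, rules) for tokens in myDat]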
Code Implementation of Apriori

There are many implementations of Apriori on the web; here is one of them.
def createC1(dataSet):
    # build the list of candidate 1-itemsets, each as a frozenset
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    return map(frozenset, C1)

def scanD(D, Ck, minSupport):
    # scan transactions D and keep the candidates in Ck meeting minSupport
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData

def aprioriGen(Lk, k):
    # generate candidate k-itemsets by joining frequent (k-1)-itemsets
    # whose first k-2 items agree
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            L1 = list(Lk[i])[:k - 2]
            L2 = list(Lk[j])[:k - 2]
            L1.sort()
            L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList

def apriori(dataSet, minSupport=0.5):
    # layer-wise search: find all frequent itemsets and their supports
    C1 = createC1(dataSet)
    D = map(set, dataSet)
    L1, suppData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        suppData.update(supK)
        L.append(Lk)
        k += 1
    return L, suppData

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    # evaluate rules (freqSet - conseq) --> conseq; keep those meeting minConf
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print freqSet - conseq, '-->', conseq, 'conf:', conf
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    # recursively build rules with longer consequents
    m = len(H[0])
    if len(freqSet) > m + 1:
        Hmp1 = aprioriGen(H, m + 1)
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

def generateRules(L, supportData, minConf=0.7):
    # generate association rules from the frequent itemsets in L
    bigRuleList = []
    for i in range(1, len(L)):
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if i > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList
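As a quick usage example, the implementation can be sanity-checked on a small hand-made data set; the four transactions below are made up, and the code above is Python 2:

# sanity check on a made-up data set
myDat = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
L, suppData = apriori(myDat, minSupport=0.5)
rules = generateRules(L, suppData, minConf=0.7)
# expected to print rules such as: frozenset([1]) --> frozenset([3]) conf: 1.0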