Learn Some Algorithms for Security: HMM (Part 1)

Source: Internet
Author: User

Preface

The hidden Markov model (HMM), jokingly nicknamed "Han Meimei" after its initials, is widely used in speech recognition, text processing, network security, and other fields. In 2009, I. Corona, D. Ariu, and G. Giacinto published a research paper on applying HMMs to web security, after which HMMs gradually drew the attention of the major security vendors. This article focuses on the most common and relatively basic application, HMM-based anomaly detection on URL parameters; follow-up articles will introduce applications of HMM combined with NLP to XSS, SQL injection, and RCE. As the saying goes, "every extra formula loses half the readers", which is why Hawking's "A Brief History of Time" sells as well as "Those Things About the Ming Dynasty". My machine learning series therefore keeps concepts to a minimum and uses more examples, in the hope that more people will understand and use machine learning.

HMM Fundamentals

In the real world, one class of problems has an obvious temporal order, such as traffic lights at an intersection, the weather over several consecutive days, or the context of what we say. The basic assumption of an HMM is that, in a continuous time series of events, each state is affected by the N events before it and only by them; the corresponding time series is called an N-th order Markov chain. If whether today is hazy is determined only by yesterday and the day before, that is a second-order Markov chain: for example, if yesterday and the day before were both sunny, the probability that today is sunny is 90%.
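The second-order chain described above can be sketched as a lookup table keyed by the two previous days. This is only an illustration: the 0.9 for two sunny days comes from the text, while every other probability and the name `sequence_probability` are assumptions made up for the example.

```python
# Transition table of a second-order Markov chain.
# Key: (day before yesterday, yesterday); value: distribution for today.
# Only the 0.9 entry comes from the article; all other numbers are invented.
TRANSITIONS = {
    ("sunny", "sunny"): {"sunny": 0.9, "haze": 0.1},
    ("sunny", "haze"): {"sunny": 0.4, "haze": 0.6},
    ("haze", "sunny"): {"sunny": 0.6, "haze": 0.4},
    ("haze", "haze"): {"sunny": 0.2, "haze": 0.8},
}


def sequence_probability(days):
    """Probability of days[2:], given that the first two days are known."""
    prob = 1.0
    for i in range(2, len(days)):
        history = (days[i - 2], days[i - 1])
        prob *= TRANSITIONS[history][days[i]]
    return prob


print(sequence_probability(["sunny", "sunny", "sunny"]))  # 0.9
```

Each day depends only on the two days before it, so a longer sequence is scored by multiplying one table entry per day.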
Now a slightly more complicated case: suppose you want to know the haze conditions in a city 2,000 kilometers away, but you cannot observe the local air quality directly and only know the local wind conditions. The air state is hidden and the wind is observable, so the hidden sequence has to be inferred from the observation sequence. Because wind does have a large effect on haze (assume, say, that strong wind means a 90% probability of a sunny day), learning from samples makes it possible to infer the hidden sequence from the observations. This is a hidden Markov model.

URL Parameter Modeling

In common GET-based XSS, SQL injection, and RCE attacks, the attack payload is mostly concentrated in the request parameters. Taking XSS as an example:
/0_1/include/dialog/select_media.php?userid=%3cscript%3ealert(1)%3c/script%3e
The value range of a parameter in a normal HTTP request is well defined. "Well defined" here means that it can be described in terms of letters, digits, and special characters; it is not always something that can be pinned down by a numeric range such as 1-200. Take the following log entries as an example:
/0_1/include/dialog/select_media.php?userid=admin123
/0_1/include/dialog/select_media.php?userid=root
/0_1/include/dialog/select_media.php?userid=maidou0806
/0_1/include/dialog/select_media.php?userid=52maidou
/0_1/include/dialog/select_media.php?userid=wjq_2014
/0_1/include/dialog/select_media.php?userid=mzc-cxy

A human reader can conclude that the userid field consists of letters, digits, and the special characters '-' and '_'. With enough stamina to read tens of thousands of normal samples, one could even summarize the value range as [0-9a-zA-Z-_]{4,}. But with millions of parameters across billions of log entries, how could this be done by hand? This is where machine learning comes in.
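For comparison, the hand-written rule mentioned above can be checked against the sample logs in a few lines. This is a sketch of the manual baseline, not of the HMM approach; the names `USERID_PATTERN` and `is_normal_userid` are hypothetical, and the hyphen is moved to the end of the character class so that it cannot be read as a range.

```python
import re

# Hand-written value range for userid, equivalent to [0-9a-zA-Z-_]{4,}
USERID_PATTERN = re.compile(r"[0-9a-zA-Z_-]{4,}")

SAMPLES = ["admin123", "root", "maidou0806", "52maidou", "wjq_2014", "mzc-cxy"]


def is_normal_userid(value):
    """True if the whole value fits the hand-written pattern."""
    return USERID_PATTERN.fullmatch(value) is not None


for sample in SAMPLES:
    print(sample, is_normal_userid(sample))

# The decoded XSS payload falls outside the pattern
print(is_normal_userid("<script>alert(1)</script>"))  # False
```

Writing and maintaining one such pattern per parameter is exactly the manual work that does not scale to millions of parameters.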

Take the userid field as an example, with the userid value as the observation sequence. For simplicity, the value can be generalized first. The whole model is a 3-state HMM: the hidden sequence has only three states, S1, S2, and S3. The generalization rules are:
    • [a-zA-Z] is generalized to A
    • [0-9] is generalized to N
    • [\-_] is generalized to C
    • All other characters are generalized to T
For example:
    • admin123 is generalized to AAAAANNN
    • root is generalized to AAAA
    • wjq_2014 is generalized to AAACNNNN
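The generalization rules above can be sketched as a small string-to-string function. This is an illustrative Python 3 version, assuming the four classes A, N, C, and T exactly as listed; the function name `generalize` is hypothetical.

```python
import string
from urllib.parse import unquote


def generalize(value):
    """Map each character to its class symbol: A, N, C, or T."""
    symbols = []
    for ch in value:
        if ch in string.ascii_letters:
            symbols.append("A")  # letter
        elif ch in string.digits:
            symbols.append("N")  # digit
        elif ch in "-_":
            symbols.append("C")  # allowed special character
        else:
            symbols.append("T")  # everything else
    return "".join(symbols)


print(generalize("admin123"))  # AAAAANNN
print(generalize("root"))      # AAAA
print(generalize("wjq_2014"))  # AAACNNNN

# The userid value from the XSS request, URL-decoded first
payload = unquote("%3cscript%3ealert(1)%3c/script%3e")
print(generalize(payload))     # TAAAAAATAAAAATNTTTAAAAAAT
```

Normal values collapse to short runs of A, N, and C, while the attack payload is dominated by T symbols that never appear in the benign samples.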

The hidden sequence cycles among the three hidden states S1 to S3; the probabilities of moving from one state to another form the transition probability matrix. Each hidden state in turn produces one of the four observable symbols A, C, N, and T with a certain probability; these probabilities form the emission probability matrix. HMM modeling is the process of learning these two matrices from samples. Generalization in a production environment should be done with care: at the very least, domain names, Chinese text, and other special characters need to be generalized separately.

Data Processing and Feature Extraction

Because the value range can differ for each URL of each domain (one userid may be [0-9]{4,}, another [0-9a-zA-Z-_]{3,}), each parameter of each URL of each domain has to be learned separately. The generalization process is as follows:
    import numpy as np

    def etl(value):
        # Generalize each character into the code of its class symbol (A, N, C, or T)
        vers = []
        for c in value:
            c = c.lower()
            if ord('a') <= ord(c) <= ord('z'):
                vers.append([ord('A')])   # letter
            elif ord('0') <= ord(c) <= ord('9'):
                vers.append([ord('N')])   # digit
            elif c in '-_':
                vers.append([ord('C')])   # '-' or '_'
            else:
                vers.append([ord('T')])   # any other character
        return np.array(vers)
Tip: to avoid interference from Chinese and similar characters, characters whose code is greater than 127 or less than 32 can simply be skipped without processing. Extracting URL parameters from web logs means dealing with URL encoding, parameter extraction, and other tedious problems; fortunately, Python has ready-made interfaces:
    import urllib
    import urlparse  # Python 2 standard library module

    with open(filename) as f:
        for line in f:
            # Split the URL into its components
            result = urlparse.urlparse(line)
            # URL-decode the query string
            query = urllib.unquote(result.query)
            params = urlparse.parse_qsl(query, True)
            for k, v in params:
                # k is the parameter name, v is the parameter value
                pass

Tip: when urlparse.parse_qsl splits the request parameters, a value is truncated at the first ';', so everything after the ';' in the parameter value is lost. This is a major pitfall that must be watched for in production.

Model Training

Installing hmmlearn

hmmlearn is an HMM implementation for Python, a project that was split off from scikit-learn. Its dependencies are as follows:
    • Python >= 2.6
    • NumPy (tested to work with >=1.9.3)
    • SciPy (tested to work with >=0.16.0)
    • Scikit-learn >= 0.16
The installation commands are as follows:
pip install -U --user hmmlearn
Training the Model

Pass in the generalized vector X together with the corresponding length array X_lens. X_lens is needed because the parameter samples may differ in length, so the length of each sample has to be supplied separately.
    from hmmlearn import hmm

    remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
    remodel.fit(X, X_lens)
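The article does not show how X and X_lens are assembled. One plausible way, sketched here with plain lists, is to stack the per-character rows of all samples into one long list and record each sample's length separately. The helper name `build_training_input` is hypothetical, and the input is assumed to be already generalized into A/N/C/T strings.

```python
def build_training_input(generalized_samples):
    """Stack all samples into one list of [code] rows plus a list of lengths."""
    X = []
    X_lens = []
    for symbols in generalized_samples:
        # One single-feature row per character, holding the symbol's character code
        X.extend([ord(symbol)] for symbol in symbols)
        X_lens.append(len(symbols))
    return X, X_lens


# Generalized forms of admin123, root, maidou0806, 52maidou, wjq_2014, mzc-cxy
generalized = ["AAAAANNN", "AAAA", "AAAAAANNNN", "NNAAAAAA", "AAACNNNN", "AAACAAA"]
X, X_lens = build_training_input(generalized)

print(X_lens)  # [8, 4, 10, 8, 8, 7]
print(len(X))  # 45
```

The lengths must add up to the number of rows in X, which is how the library knows where one parameter value ends and the next begins. In practice X would typically be converted with `numpy.array(X)` before being passed to `fit`.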
The scores of the training samples are:

    score:16 query param:admin123
    score:9 query param:root
    score:21 query param:maidou0806
    score:16 query param:52maidou
    score:15 query param:wjq_2014
    score:12 query param:mzc-cxy

Model Validation

Once trained, an HMM can usually solve three kinds of problems. The first is finding the most probable hidden sequence for a given observation sequence; the most typical applications are speech decoding and part-of-speech tagging. The second is predicting the most probable next value given part of an observation sequence, as in search-term suggestion. The third is computing the probability of a given observation sequence in order to judge whether it is legitimate. Parameter anomaly detection belongs to the third kind. We define a threshold T, and any parameter whose score is below T is flagged as anomalous. T is usually set slightly larger than the minimum score of the training set; in this example, 10 can be used.
    with open(filename) as f:
        for line in f:
            # Split the URL into its components
            result = urlparse.urlparse(line)
            # URL-decode the query string
            query = urllib.unquote(result.query)
            params = urlparse.parse_qsl(query, True)
            for k, v in params:
                # ischeck() and the minimum length N are not shown in the article
                if ischeck(v) and len(v) >= N:
                    vers = etl(v)
                    pro = remodel.score(vers)
                    # Scores at or below the threshold T are reported as anomalies
                    if pro <= T:
                        print("pro:%d v:%s line:%s" % (pro, v, line))
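Scoring an observation sequence, the third kind of problem, is classically done with the forward algorithm, and the value returned by `score` is a log-likelihood. The sketch below is a pure-Python illustration for a small discrete HMM based on the haze and wind example; every probability in it is invented for illustration, and it is not the Gaussian model trained in this article.

```python
import math

# Hidden states and model parameters; all numbers are made up for illustration.
STATES = ("sunny", "haze")
START = {"sunny": 0.5, "haze": 0.5}
TRANS = {
    "sunny": {"sunny": 0.7, "haze": 0.3},
    "haze": {"sunny": 0.4, "haze": 0.6},
}
EMIT = {
    "sunny": {"strong": 0.8, "weak": 0.2},
    "haze": {"strong": 0.1, "weak": 0.9},
}


def forward_log_likelihood(observations):
    """Log-probability of a non-empty observation sequence (forward algorithm)."""
    # alpha[s]: probability of the observations so far, ending in hidden state s
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMIT[s][obs] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return math.log(sum(alpha.values()))


print(forward_log_likelihood(["strong", "strong", "weak"]))
print(forward_log_likelihood(["weak", "strong", "weak", "strong"]))
```

A sequence that the model considers unlikely gets a lower log-likelihood, which is what the threshold comparison relies on. For long sequences a real implementation rescales alpha or works in log space throughout to avoid underflow.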
Take userid=%3cscript%3ealert(1)%3c/script%3e as an example. After decoding it becomes <script>alert(1)</script>, which generalizes to TAAAAAATAAAAATNTTTAAAAAAT. Its score is -13945, so it is recognized as an anomaly.

This article has introduced an application of HMM in web security. Because anomaly detection here relies only on the textual features of parameters, in theory almost all unknown attacks carried in GET request parameters can be recognized as long as there are enough benign samples; however, the lack of semantic-level anomaly detection leads to a high false positive rate. In addition, scanners and similar traffic have a large impact on the results. For how to improve detection further, see the next article.
