Objective
In the Enterprise Security Construction series I have occasionally mentioned applications of algorithms, and many readers wanted to learn more about this area, so I opened this sub-series to introduce the machine learning models commonly used in security, from entry-level SVM and Bayesian methods up to neural networks and deep learning (deep learning can in fact be thought of as an enhanced version of neural networks).
Rules vs. algorithms
Traditional security has pushed rule-based detection close to its limit: mature enterprise WAF, IPS, and anti-virus products can reach better than 99% accuracy and are very good at "intercepting what I know". But rules are built on existing security knowledge, so against 0-days, or even overlooked n-days, they have essentially no discovery capability, the so-called "I don't know what I don't know". This is where algorithms, that is, machine learning, come into play. Machine learning's strength is that it can mine latent patterns from large numbers of black (malicious) and white (benign) samples and flag anomalies, which semi-automated or manual analysis can then confirm as true or false alarms. In short, precisely judging intrusions is not machine learning's forte; surfacing what the rules missed is where it excels.
Scikit-learn
Scikit-learn is a well-known machine learning library with broad model coverage and detailed online documentation. It is a Python library, which keeps development cycles short and makes it especially suitable for proof-of-concept validation.
The environment requirement is Python; the installation command is:

pip install scikit-learn
SVM (Support Vector Machine)
SVM is one of the most widely used algorithms in machine learning, often applied to pattern recognition, classification, and regression analysis. Classification is especially useful for the gray areas of the security world that are neither purely black nor purely white, so we focus on classification-related knowledge here.
Assume each sample has only a two-dimensional feature vector, and we need to solve a classification problem: separating normal users from hackers. If the two classes can be divided by a straight line, the problem is linearly separable; if not, it is linearly inseparable. Consider the simplest case first, where the classes are linearly separable: the dividing line is called the hyperplane, and the samples closest to the hyperplane are called the support vectors.
In the linearly inseparable case, the data must be lifted into a higher-dimensional space to be separated; for example, points that cannot be split in a two-dimensional plane may become separable in three dimensions. This lifting relies on the kernel function.
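As a minimal sketch of this idea (the XOR-style data and the gamma value are invented for illustration), scikit-learn lets you compare a linear kernel against an RBF kernel on data that no straight line can separate:

```python
from sklearn import svm

# XOR-like data: no straight line in the 2-D plane separates the classes
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

linear_clf = svm.SVC(kernel='linear').fit(X, y)
rbf_clf = svm.SVC(kernel='rbf', gamma=2.0).fit(X, y)

# The RBF kernel implicitly lifts the points into a higher-dimensional
# space where they become separable; a linear hyperplane cannot fit XOR.
print(linear_clf.score(X, y))
print(rbf_clf.score(X, y))
```

The RBF model scores 1.0 on these four points, while the linear model cannot.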
Supervised learning and unsupervised learning
Supervised learning, simply put, trains on labeled data; unsupervised learning trains on unlabeled data. SVM training requires labels, so it belongs to supervised learning.
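A hedged sketch of the contrast (the toy points are invented): SVC must be given the labels y, while a clustering algorithm such as KMeans works on the same data without ever seeing them:

```python
from sklearn import svm
from sklearn.cluster import KMeans

X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
y = [0, 0, 1, 1]  # labels: required for supervised learning

clf = svm.SVC(kernel='linear').fit(X, y)     # supervised: uses y
km = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised: y never seen

print(clf.predict([[4.8, 5.2]]))  # SVC predicts a learned label
print(km.labels_)                 # cluster ids are arbitrary but consistent
```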
General steps for supervised learning

1. Data collection & data cleansing
2. Featurization
3. Data marking
4. Data splitting
5. Training
6. Model validation
Example
On a two-dimensional plane, assume there are only two training samples, [[0, 0], [1, 1]], with corresponding labels [0, 1], and you need to predict the label of [2., 2.]. The code is:
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)
print(clf.predict([[2., 2.]]))
The result is array([1]), as expected. Next, a common XSS detection scenario illustrates a simple application of SVM.
Data collection & data cleansing
Since our example is simple, we merge these two steps. Prepare equal numbers of normal web access logs and XSS attack logs; the simplest approach is to follow my earlier article, "WAVSEP-based Shooting Range Construction Guide", and use a scanner such as WVS, configured to scan only for XSS-related vulnerabilities, to generate a web log of XSS attacks.
Featurization
In practice, data collection & cleansing is the most time-consuming step, and featurization is the most brain-burning, because real-world objects are complex and have many properties, while machine learning usually understands only numeric vectors. The process of converting real-world objects into the numbers of the computing world is featurization, also called vectorization. For example, to featurize your ex-girlfriend, you cannot just say "beautiful" or "gentle"; you need to digitize her most representative characteristics.
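As a hedged sketch (the attribute names and numbers below are invented purely for illustration), featurization just maps each representative attribute to a number and collects them into a vector:

```python
# Hypothetical attributes of a person, already expressed as numbers
person = {
    "height_cm": 165,            # height
    "weight_kg": 50,             # weight
    "monthly_spend_cny": 20000,  # monthly spending power
}

# The feature vector is simply those numbers in a fixed order
feature_vector = [person["height_cm"],
                  person["weight_kg"],
                  person["monthly_spend_cny"]]
print(feature_vector)
```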
You will find that the value ranges of the features differ wildly; monthly spending power alone could drown out the effect of every other feature on the result. Although in real life that indicator does carry great weight, it should not erase all the others, so we also need to standardize the features. Common approaches are:
Min-max scaling (normalization)
Mean-variance scaling (standardization)
Mean removal
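A minimal sketch of these three options with scikit-learn's preprocessing module (the sample matrix is invented; each row is a sample, each column a feature):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[165.0, 20000.0],
              [170.0,  3000.0],
              [158.0,   800.0]])

# Min-max scaling: squeeze every column into [0, 1]
print(preprocessing.MinMaxScaler().fit_transform(X))

# Mean-variance scaling: zero mean and unit variance per column
print(preprocessing.StandardScaler().fit_transform(X))

# Mean removal only: subtract each column's mean, keep the variance
print(preprocessing.scale(X, with_std=False))
```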
Back to the XSS problem: we need to featurize the web log. The following features are one example:

URL length
Whether the URL embeds another URL (http:// or https://)
Number of suspicious characters such as < > , ' " /
Number of suspicious keywords such as alert, script=, onerror, eval
The sample code for feature extraction is as follows:
import re

def get_len(url):
    return len(url)

def get_url_count(url):
    if re.search('(http://)|(https://)', url, re.IGNORECASE):
        return 1
    else:
        return 0

def get_evil_char(url):
    return len(re.findall("[<>,\'\"/]", url, re.IGNORECASE))

def get_evil_word(url):
    return len(re.findall("(alert)|(script=)|(%3c)|(%3e)|(%20)|(onerror)|(onload)|(eval)|(src=)|(prompt)", url, re.IGNORECASE))
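Putting the four helpers together, a self-contained sketch of turning one log line into a feature vector (the sample URL is invented) might look like:

```python
import re

def featurize(url):
    # [length, embeds another URL, dangerous-char count, dangerous-word count]
    return [
        len(url),
        1 if re.search('(http://)|(https://)', url, re.IGNORECASE) else 0,
        len(re.findall("[<>,\'\"/]", url)),
        len(re.findall("(alert)|(script=)|(%3c)|(%3e)|(%20)|(onerror)|(onload)"
                       "|(eval)|(src=)|(prompt)", url, re.IGNORECASE)),
    ]

print(featurize("/search?q=<script>alert(1)</script>"))
```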
Data normalization uses the following code:

from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
x_min_max = min_max_scaler.fit_transform(x)
Data marking
This step is very easy, because we generated the samples ourselves and already know which are XSS and which are not.
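A trivial sketch (the URLs are invented): since we produced the two log sets ourselves, marking is just assigning 0 to the white samples and 1 to the black ones:

```python
white_urls = ["/index.php?id=1", "/home"]             # normal traffic
black_urls = ["/search?q=<script>alert(1)</script>"]  # XSS traffic

x_raw = white_urls + black_urls
y = [0] * len(white_urls) + [1] * len(black_urls)     # 0 = white, 1 = black
print(y)
```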
Data splitting
This step randomly divides the data into a training set and a test set, usually directly with cross_validation.train_test_split (moved to sklearn.model_selection in newer scikit-learn versions). A common split is 40% for testing and 60% for training; adjust the ratio to your needs.
x_train, x_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.4, random_state=0)
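A self-contained sketch of the split on invented data (using sklearn.model_selection, where the function lives in newer scikit-learn releases):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)  # 10 invented samples, 2 features each
y = np.arange(10) % 2             # invented labels

# 40% held out for testing, 60% for training; random_state makes it repeatable
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
print(len(x_train), len(x_test))
```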
Data training
We use Scikit-learn's SVM model; the SVM model for classification is called SVC. Here we use the simplest kernel function, linear:
clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
joblib.dump(clf, "xss-svm-200000-module.m")
Model validation
Load the trained model, predict on the test set, and compare the predictions with the labels:
clf = joblib.load("xss-svm-200000-module.m")
y_pred = clf.predict(x_test)
print(metrics.accuracy_score(y_test, y_pred))
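An end-to-end, self-contained sketch of the train / dump / load / score cycle on synthetic features (the two Gaussian clouds below are invented stand-ins for real black and white feature vectors, an assumption purely for illustration):

```python
import numpy as np
import joblib  # shipped as sklearn.externals.joblib in older scikit-learn
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Invented stand-ins: white URLs are short with few dangerous chars,
# black (XSS) URLs are long with many
white = np.column_stack([rng.normal(30, 5, 200), rng.normal(1, 0.5, 200)])
black = np.column_stack([rng.normal(80, 10, 200), rng.normal(8, 2, 200)])
x = np.vstack([white, black])
y = np.array([0] * 200 + [1] * 200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)

joblib.dump(clf, "xss-svm-demo.m")    # persist the trained model
clf2 = joblib.load("xss-svm-demo.m")  # reload it later
y_pred = clf2.predict(x_test)
print(metrics.accuracy_score(y_test, y_pred))
```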
We trained a model on 200,000 black and white samples and tested it on a set of 50,000 black and 50,000 white samples, counting any black or white prediction error as wrong. The final accuracy was 80%, which, relying on model tuning alone, is already quite high for machine learning. By training on larger data sets (such as the logs of large CDN and cloud WAF clusters), adding more features, and adding automated or semi-automated verification in the downstream steps, this ratio can be improved further; in the end we achieved an accuracy above 90%.
Anomalous data
Among the anomalies SVM identified, manual confirmation found many other attacks besides mutated XSS. Because the test enabled only the XSS signature rules, these other attacks were not intercepted and slipped into the white samples. Examples:
/.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /windows/win.ini
/index.php?op=viewarticle&articleid=9999/**/union/**/select/**/1331908730,1,1,1,1,1,1,1--&blogid=1
/index.php?go=detail&id=-99999/**/union/**/select/**/0,1,concat (1331919200,0x3a,512612977), 3,4,5,6,7,8,9,10,11,12,13,14,15,16
/examples/jsp/num/.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /windows/win.ini
/cgi-bin/modules/tinymce/content_css.php?templateid=-1/**/union/**/select/**/1331923205,1,131076807--
/MANAGER/AJAX.PHP?RS=__EXP__GETFEEDCONTENT&RSARGS[]=-99 Union Select 1161517009,2,1674610116,4,5,6,7,8,9,0,1,2,3--
We speculate that the attack features the machine learned from the XSS samples may partially cover the features of code-injection-style attacks such as SQL injection.
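A quick, hedged check of that speculation: running the dangerous-character counter from the featurizer over one of the SQL injection payloads above shows it fires far more strongly than on a benign URL (the benign URL is invented for comparison), which would explain the crossover:

```python
import re

def get_evil_char(url):
    # the same dangerous-character counter used in the XSS featurizer
    return len(re.findall("[<>,\'\"/]", url))

sqli = "/index.php?op=viewarticle&articleid=9999/**/union/**/select/**/1331908730,1,1,1,1,1,1,1--&blogid=1"
benign = "/index.php?op=viewarticle&articleid=42"

print(get_evil_char(sqli), get_evil_char(benign))
```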
Summary
Many people have asked me, "For what kinds of cases can algorithms be used?" There is no unified answer. My view: once rules and sandboxes are well in place, you can try machine learning; and for known vulnerabilities that are hard to defend against or detect, machine learning really is low-cost. The automobile was not as fast as the horse carriage when it first appeared; whether to build a faster carriage or to build the car of the future is a matter of opinion.