Learn some algorithms for security: SVM

Source: Internet
Author: User
Tags: svm

Objective

The Enterprise Security Construction series occasionally mentions applications of algorithms, and many readers want to learn more about this area, so I have opened a sub-topic to introduce the machine learning models commonly used in the security field, starting from the entry level: SVM, naive Bayes, neural networks, and deep learning (deep learning can in fact be thought of as an enhanced version of neural networks).

Rule vs algorithm

Traditional security has pushed the capability of rules nearly to its limit: a mature enterprise WAF, IPS, or anti-virus product can reach better than two-nines (99%) accuracy, and does very well at "I know what I can intercept." But rules are built on existing security knowledge; against 0days, or even neglected ndays, they have essentially no detection capability. This is the realm of "I don't know what I don't know," and it is where algorithms, i.e. machine learning, come into play. Machine learning's advantage is that it can mine latent patterns from large numbers of black and white samples and identify anomalies, which are then confirmed as true or false through semi-automatic or manual analysis. In short, accurately judging an intrusion is not machine learning's forte; its strength is spotting what the rules missed.

Scikit-learn

Scikit-learn is a well-known machine learning library with extensive model support and detailed online documentation. It supports Python, which keeps development cycles short and makes it especially suitable for validating ideas.

The environment requirements are:

    • Python (>= 2.6 or >= 3.3),

    • NumPy (>= 1.6.1),

    • SciPy (>= 0.9).

The installation command is:

pip install scikit-learn

SVM (Support Vector Machine)

SVM is one of the most widely used algorithms in machine learning, often applied to pattern recognition, classification, and regression analysis. It is especially well suited to the security world, where things are rarely black and white, so we focus on classification-related knowledge.

Assume our feature vectors are only two-dimensional and we need to solve a classification problem: separating normal users from hackers. If the two classes can be distinguished by a straight line, the problem is linearly separable; if not, it is linearly inseparable. Consider the simplest case and assume the classification problem is linearly separable. The separating line is then called the hyperplane, and the samples closest to the hyperplane are called the support vectors.

In the non-linear case, the data needs to be mapped to a higher-dimensional space where it can be separated; for example, what cannot be separated in a two-dimensional plane may become separable in three dimensions. This mapping relies on the kernel function.
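A minimal sketch of this idea (my own example, not from the original article): the XOR pattern is the classic case no straight line can separate in two dimensions, yet an SVC with the RBF kernel handles it, because the kernel implicitly maps the points into a higher-dimensional space where a separating hyperplane exists.

```python
from sklearn import svm

# XOR labels: no straight line in 2-D separates the two classes
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# The RBF kernel implicitly maps the points into a higher-dimensional
# space where a separating hyperplane does exist (gamma chosen by hand here).
clf = svm.SVC(kernel='rbf', gamma=10.0).fit(X, y)
print(clf.predict(X))  # classifies all four training points correctly
```

A linear kernel fit on the same data cannot get all four points right, which is exactly the situation the paragraph above describes.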

Supervised learning and unsupervised learning

Supervised learning simply means training on labeled data; unsupervised learning means training on unlabeled data. SVM training requires labels, so it belongs to supervised learning.
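The contrast can be shown in a few lines (an illustrative sketch with toy data of my own): SVC cannot be fit without the labels y, while a clustering model such as KMeans groups the same points using only X.

```python
from sklearn import svm
from sklearn.cluster import KMeans

X = [[0, 0], [0.1, 0.1], [1, 1], [1.1, 0.9]]
y = [0, 0, 1, 1]

# Supervised: SVC requires the labels y to train.
svc = svm.SVC().fit(X, y)

# Unsupervised: KMeans groups the same points using only X.
km = KMeans(n_clusters=2, n_init=10).fit(X)

print(svc.predict([[0.05, 0.0]]))
print(km.labels_)
```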

General steps for supervised learning

The general flow, which the rest of this article follows, is:

    • Data collection & data cleansing

    • Feature extraction

    • Data marking

    • Data splitting

    • Data training

    • Model validation

Example

In a two-dimensional plane, assume there are only two training samples, [[0, 0], [1, 1]], with corresponding labels [0, 1], and you need to predict the label of [2., 2.]. The code is:

from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]

clf = svm.SVC()
clf.fit(X, y)

print(clf.predict([[2., 2.]]))

The result is array([1]), as expected. Next, a common XSS test case illustrates a simple application of SVM.

Data collection & data cleansing

Since our example is simple, we merge these two steps, preparing equal numbers of normal Web access logs and XSS-attack Web logs. The simplest way is to follow my earlier article, "WAVSEP-based shooting range construction guide": a Web log of XSS attacks can be obtained by using a scanner such as WVS configured to scan only for XSS-related vulnerabilities.

Feature extraction

In practice, data collection and cleansing are the most time-consuming steps, and feature extraction is the most brain-burning, because everything in the real world is complex and has many properties, while machine learning usually only understands numeric vectors. The process of converting real-world objects into numbers in the computing world is feature extraction, also called vectorization. For example, to describe your ex-girlfriend, you cannot just say "beautiful" or "gentle"; you need to digitize her most representative characteristics. Here is an example:

You will find that the ranges of the features differ greatly; monthly spending power alone may drown out the effect of every other feature on the result. Although that indicator really does matter in real life, it should not overwhelm everything else, so we also need to normalize the features. Common approaches are:

    • Standardization (scaling to a fixed range)

    • Mean-variance scaling

    • Mean removal
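A small sketch of the scaling problem described above (the numbers are my own toy data): the second column, monthly spending, is thousands of times larger than the first, a hypothetical 1-5 rating, so without scaling it dominates any distance-based model. MinMaxScaler squeezes every feature into [0, 1].

```python
import numpy as np
from sklearn import preprocessing

# Toy data: column 0 is a 1-5 rating, column 1 is monthly spending.
X = np.array([[3.0, 8000.0],
              [5.0, 12000.0],
              [1.0, 2000.0]])

scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # every column now lies in [0, 1]
```

For mean removal with unit variance, `preprocessing.StandardScaler` works the same way.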

Back to the XSS problem: we need to extract features from the Web logs. The following features are one example:

The sample code for feature extraction is as follows:

import re

def get_len(url):
    # feature 1: length of the url
    return len(url)

def get_url_count(url):
    # feature 2: whether the url parameter embeds another url
    if re.search('(http://)|(https://)', url, re.IGNORECASE):
        return 1
    else:
        return 0

def get_evil_char(url):
    # feature 3: count of suspicious characters
    return len(re.findall('[<>,\'"/]', url, re.IGNORECASE))

def get_evil_word(url):
    # feature 4: count of suspicious keywords
    return len(re.findall('(alert)|(script=)|(%3c)|(%3e)|(%20)|(onerror)|(onload)|(eval)|(src=)|(prompt)', url, re.IGNORECASE))
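Hypothetical usage of these four features (the helper name and sample URL are my own, for illustration): each logged URL becomes one numeric vector that can be fed to the model.

```python
import re

# A self-contained sketch combining the four features described above:
# length, embedded-URL flag, suspicious-character count, suspicious-keyword count.
def extract_features(url):
    return [
        len(url),                                                  # url length
        1 if re.search('(http://)|(https://)', url, re.IGNORECASE) else 0,
        len(re.findall('[<>,\'"/]', url)),                         # evil chars
        len(re.findall('(alert)|(script=)|(%3c)|(%3e)|(%20)|(onerror)|(onload)|(eval)|(src=)|(prompt)',
                       url, re.IGNORECASE)),                       # evil words
    ]

print(extract_features("/search?q=<script>alert(1)</script>"))
```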

Data normalization uses the following code:

from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
x_min_max = min_max_scaler.fit_transform(x)

Data marking

This step is very easy, because we already know which samples are XSS and which are not.
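An illustrative sketch of the labeling step (the sample URLs are my own): normal requests get label 0, XSS requests get label 1, and the two lists are concatenated in the same order as their labels.

```python
# Hypothetical white (normal) and black (XSS) samples.
normal_urls = ["/index.php?id=1", "/login", "/static/app.js"]
xss_urls = ["/q=<script>alert(1)</script>", "/p=<img src=x onerror=alert(1)>"]

x = normal_urls + xss_urls
y = [0] * len(normal_urls) + [1] * len(xss_urls)

print(y)  # one label per sample, in the same order as x
```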

Data splitting

This step randomly divides the data into a training set and a test set, usually with cross_validation.train_test_split, typically using 40% as test samples and 60% as training samples; the ratio can be adjusted to your own needs.

from sklearn import cross_validation  # sklearn.model_selection in newer versions

x_train, x_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.4, random_state=0)

Data training

We use Scikit-learn's SVM model; the SVM model for classification is called SVC, and we use the simplest kernel function, linear:

from sklearn.externals import joblib  # plain "import joblib" in newer versions

clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
joblib.dump(clf, "XSS-SVM-200000-MODULE.M")

Model validation

By loading the trained model, we can predict on the test set and compare the predicted results with the labels.

from sklearn import metrics

clf = joblib.load("XSS-SVM-200000-MODULE.M")

y_pred = clf.predict(x_test)
print(metrics.accuracy_score(y_test, y_pred))

We trained a model on 200,000 black and white samples and tested it on a set of 50,000 black and 50,000 white samples, counting any black or white prediction error as a mistake; the final accuracy was 80%. For machine learning relying only on model tuning, that is already quite high. By training on larger data sets (such as the logs of large CDN and cloud WAF clusters), further increasing the number of features, and adding automatic or semi-automatic verification in the subsequent stages, this ratio can be improved further; we ultimately achieved an accuracy of more than 90%.
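A single accuracy number can hide which class the model gets wrong. A small illustration with toy labels of my own: scikit-learn's metrics module also provides a confusion matrix, which separates false positives from false negatives, and that distinction matters when black samples are rare.

```python
from sklearn import metrics

# Toy ground-truth labels and predictions (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

print(metrics.accuracy_score(y_true, y_pred))    # overall accuracy
print(metrics.confusion_matrix(y_true, y_pred))  # rows = true class, cols = predicted
```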

Exception data

Through SVM we identified anomalous data. After manual confirmation, in addition to distorted XSS there were many other attacks: because the test environment only enabled the XSS signature rules, the other attacks were not intercepted and ended up among the white samples, for example:

/.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /windows/win.ini

/index.php?op=viewarticle&articleid=9999/**/union/**/select/**/1331908730,1,1,1,1,1,1,1--&blogid=1

/index.php?go=detail&id=-99999/**/union/**/select/**/0,1,concat (1331919200,0x3a,512612977), 3,4,5,6,7,8,9,10,11,12,13,14,15,16

/examples/jsp/num/.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /.|. /windows/win.ini

/cgi-bin/modules/tinymce/content_css.php?templateid=-1/**/union/**/select/**/1331923205,1,131076807--

/MANAGER/AJAX.PHP?RS=__EXP__GETFEEDCONTENT&RSARGS[]=-99 Union Select 1161517009,2,1674610116,4,5,6,7,8,9,0,1,2,3--

We speculate that the attack characteristics the machine learned from the XSS samples may partially cover the characteristics of other attacks with code-injection features, such as SQL injection.

Summary

Many people have asked me, "For what kinds of cases can algorithms be used?" There is no unified answer. My view: where rules and sandboxes have already been done well, you can try machine learning; for known vulnerability classes that are hard to defend against and detect, machine learning really is low cost. The automobile was not faster than the carriage when it first appeared; whether to build a faster carriage or choose to build the car of the future is a matter of opinion.

