Multi-Domain Information Fusion for Insider Threat Detection


Preface

To start, I would like to send New Year greetings to my friends at FreeBuf and wish everyone a happy Year of the Monkey!

As the first article of the Year of the Monkey, we continue last year's topic of insider threat detection and introduce information fusion techniques used in this field.

Directory

1. Why Information Fusion?

2. Experiment Data

3. Blend-in Attack Detection

4. Unusual-change Attack Detection

5. Experiment Results

6. Summary

7. References

I. Why information fusion?

The insider threat detection we discuss today can be traced back to the earliest "active intrusion detection" research. The difference is that the focus is on internal rather than external threats.

The datasets used for threat detection generally come from multiple sources, or, put another way, contain multiple types of data. For example, the HTTP data that records users' network behavior and the Logon data recorded by the system are two different types of data, i.e. two different domains. Today we discuss how to make use of such multi-domain data when building a classifier.

Suppose sensors have collected user behavior data across multiple domains, such as:

Logon Data + Device Data + File Data + HTTP Data ... ...

So how do we build the initial dataset from this data? A simple idea is to directly concatenate these different types of data, but such an approach has several problems.

First, the value ranges of features in different domains usually differ. This is natural, but it becomes a problem when the raw values are used as features in a classifier: some fields end up contributing almost nothing to the result. For example, in K-means clustering, features with large value ranges dominate the squared terms of the Euclidean distance, while features with small ranges have a negligible effect, which distorts the final clusters.
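The scale problem is easy to see with a toy computation (the feature values below are hypothetical):

```python
# Toy illustration: when two features live on very different scales, the
# large-scale feature dominates Euclidean distance, so K-means effectively
# ignores the small-scale one.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Feature 1: HTTP requests per day (hundreds); feature 2: logons per day (single digits).
u1 = (500.0, 2.0)
u2 = (510.0, 9.0)   # very different logon behavior
u3 = (560.0, 2.0)   # identical logon behavior, modestly different HTTP

d12 = euclidean(u1, u2)   # ~12.2 -- the logon difference barely registers
d13 = euclidean(u1, u3)   # 60.0  -- the HTTP scale dominates
print(d12 < d13)  # True: u2 looks "closer" despite its anomalous logons
```

Under plain Euclidean distance, the anomalous logon pattern of u2 is invisible next to ordinary fluctuations in HTTP volume.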

Second, we can normalize the data, but doing so implicitly assumes that all domains are equally important for behavior prediction. This human assumption may deviate from reality, so the classifier's effectiveness is again limited.

Finally, even setting the two problems above aside, naively concatenating all domains produces extremely high-dimensional feature vectors, which leads to overfitting or high computational cost.

Therefore, multi-domain information cannot simply be concatenated; we need fusion methods that respect the structure of the data.

II. Experiment Data

The experimental dataset for the detection method introduced today is taken from a subset of the U.S. ADAMS project data, jointly provided by CERT and Carnegie Mellon University. An overview:

 

For details on how the dataset was generated, see [4]. Datasets are organized by release version; some releases have multiple subsets, such as r4.1 and r4.2, and later versions are generally supersets of earlier ones. The readme file contains detailed information about each dataset, while answer.tar.bz2 identifies the malicious data segments in each dataset, which is convenient for training or testing.

The experiment uses two datasets. One consists of computer audit records of 1,000 users and is used to test Blend-in attack detection via multi-domain fusion; the other comes from the actual computer usage data of 4,600 users. The attack data in both datasets was produced by simulating attack scenarios derived from the analysis of real incidents and inserting the results into normal data. Each dataset contains five types of events:

1. Logon and logoff events;

2. Removable device usage, such as USB drives, recording the device name and type;

3. File access events, i.e. file creation, copying, moving, overwriting, renaming, and deletion. Each record contains the accessed file name, path, file type, and content;

4. HTTP access events, mainly recording the URL, domain name, activity code (upload or download), browser information (IE, Firefox, Chrome), and whether the page is encrypted;

5. Email sending and reading events, recording the email address, CC/forwarding addresses, subject, send time, body content, attachment information, and whether the email is encrypted.

Some original samples of the preceding dataset are as follows.

Readme file:

 

Logon/off file:

 

Device file:

 

HTTP file:

 

Email file:

 

Based on the five categories in the dataset, we perform further statistical analysis to obtain the statistical features shown in Figure 1 (note that a "category" here is the same thing as what we call a "data domain"):

 

Features are counted per data type. For example, in the Logon data, # Logons is the number of user logins within a given time window, # Logons on user's PC is the number of logins on the user's own PC within that window, and so on. In the end we aggregate features by day, i.e. the domain feature sets above are organized in a (User, Day) format.
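As a sketch of this aggregation (the record layout and event names below are hypothetical, not the dataset's actual schema), the per-(User, Day) counts can be built like this:

```python
# Minimal sketch: turn raw event logs into per-(user, day) count features,
# e.g. "# Logons" per day as in Figure 1. Record format is hypothetical.
from collections import Counter

# Hypothetical raw records: (user, date, event_type)
events = [
    ("alice", "2010-01-04", "logon"),
    ("alice", "2010-01-04", "logon"),
    ("alice", "2010-01-04", "usb_insert"),
    ("bob",   "2010-01-04", "logon"),
    ("alice", "2010-01-05", "logon"),
]

# Count each event type per (user, day) -- one feature value per key.
features = Counter((u, d, t) for u, d, t in events)

print(features[("alice", "2010-01-04", "logon")])  # 2
print(features[("bob", "2010-01-04", "logon")])    # 1
```

Each (User, Day) key then yields one row of the domain feature vector used in the detection below.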

In the attack detection that follows, these domain features are used to detect "Unusual-change" attacks; for "Blend-in" attacks we construct features separately.

III. Blend-in Attack Detection

First we introduce the detection of "Blend-in" attacks. Literally, this attack "mixes in": the attacker obtains logon credentials for the internal network (e.g. by stealing a valid user's account) and tries to pass as that legitimate user within the organization.

To detect such attacks, we propose the notion of consistency to describe how consistent a user's behavior is across different domains. Before continuing, we give two definitions:

Definition-1: Inter-domain behavioral inconsistency. Given users A and B, if A and B belong to the same group (Cluster) in domain S1 but to different clusters in domain S2, then A's inter-domain behavior is inconsistent.

This may not be as rigorous as a mathematical definition, but the idea behind it is simple: user behavior is reflected in the data of every domain, so users with similar jobs and roles should naturally look similar in each domain. This similarity manifests as membership in the same clusters. For example, if user A and the engineers belong to the same group in domains S1 and S2, we can expect A to also fall into the same group as the "engineer" users in a new domain S3. The "group" here is a cluster.

For intuition, consider the figure below. Users A, B, and C are in the same cluster in domain 1, but in domain 2, B and C remain consistent while A is inconsistent:

 

Definition-2: Inter-domain consistency. We say user A is consistent in domain Si with respect to Sj (j ≠ i) if, given the cluster that A belongs to in domain Sj, we can correctly predict the cluster that A belongs to in domain Si.

The second definition is mainly used to score behavioral anomalies. For example, each prediction error scores +1 (other scoring mechanisms are of course possible). The penalty coefficient can be adjusted dynamically based on the consistency of all users in domain Si: if overall consistency in Si is poor, the penalty is low; conversely, if consistency is high, a prediction error incurs a large penalty and thus a high anomaly score.

To understand a concept clearly, we must understand its purpose, so let us state it up front: inter-domain consistency is used for scoring. Next, how is this definition implemented in practice?

Here, we follow the steps to implement the method to gradually introduce the "consistency" in practice.

Blend-in detection steps:

1. Cluster the original multi-domain data: we apply K-means clustering to the data of each domain separately, obtaining the "user groups" (Clusters) within each domain;

2. Fit a GMM to the data of each domain, treating each cluster as one component, and estimate the GMM parameters of each domain via maximum likelihood estimation (MLE);

3. Using each domain's fitted GMM, compute the MAP (maximum a posteriori) cluster assignment for a given user's data: given a user's record in a domain, the GMM tells us the most likely cluster;

4. Build a cluster vector Cu = (Cu1, Cu2, ..., Cum) for each user u, where m is the number of domains and Cui is the cluster with the highest MAP probability for user u in domain i, i.e. the most likely cluster;

5. When using Cuj (j ≠ i) to predict Cui, a prediction error counts as an anomaly and contributes an anomaly score, with the penalty coefficient described above (the prediction is made by comparison with user u's group, i.e. the peer cluster);

6. In practice there are three modes of application. The first uses discrete features and discrete evaluation: scoring is based on Hamming distance, i.e. a correct prediction scores 0 and a prediction error scores +1. The second uses discrete features and continuous evaluation: scoring is essentially density estimation, and the score is 1 minus the likelihood of predicting the user's cluster correctly. The third uses continuous features and continuous evaluation: the features are the MAP probabilities rather than the MAP clusters, and the prediction result becomes a probability value.
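The steps above, in the discrete-features/discrete-evaluation mode, can be sketched as follows. This is a deliberate simplification: the cluster ids are made up, and the peer-based prediction below stands in for the paper's GMM/MAP machinery:

```python
# Sketch of steps 4-6 ("discrete features, discrete evaluation"): each user
# has a cluster vector (one cluster id per domain); a user's anomaly score
# gains +1 for each domain where the cluster predicted from peers disagrees
# with the observed cluster (Hamming-style scoring). Cluster ids are hypothetical.

# cluster_vectors[user] = (cluster in domain 0, domain 1, domain 2)
cluster_vectors = {
    "alice":   (0, 1, 2),
    "bob":     (0, 1, 2),
    "carol":   (0, 1, 2),
    "mallory": (0, 1, 5),   # blends in on domains 0-1, deviates in domain 2
}

def predict_cluster(user, domain, vectors):
    """Predict a user's cluster in `domain` by majority vote among peers
    that share the user's cluster in every *other* domain."""
    peers = [v[domain] for u, v in vectors.items()
             if u != user and all(v[d] == vectors[user][d]
                                  for d in range(len(v)) if d != domain)]
    if not peers:
        return vectors[user][domain]   # no evidence: no penalty
    return max(set(peers), key=peers.count)

def anomaly_score(user, vectors):
    # +1 per domain where the peer-based prediction misses.
    return sum(predict_cluster(user, d, vectors) != vectors[user][d]
               for d in range(len(vectors[user])))

print(anomaly_score("alice", cluster_vectors))    # 0
print(anomaly_score("mallory", cluster_vectors))  # 1
```

A masquerading account that matches its peers in some domains but not in others accumulates a nonzero score, which is exactly the inter-domain inconsistency of Definition-1.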

Score Integration

In the end we obtain a user's anomaly score in each domain, and these scores must be combined. The main method is a weighted sum. Here we borrow the TF/IDF (term frequency-inverse document frequency) framework from document retrieval, which compares how often a word occurs in one document against how often it occurs in the whole collection to measure the word's relative importance to that document. The pseudo code for computing the final weighted score is given in Figure 2:

 

As shown in Figure 2, rows 1-4 compute the weight of each domain, rows 5-7 compute the weighted score of each domain, and finally the weighted scores across all of a user's domains are summed. The resulting F is the final anomaly score set, one score per user.
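The fusion idea can be sketched as follows. This is not the exact pseudo code of Figure 2; the scores, the 0.5 threshold, and the IDF-style formula are illustrative assumptions:

```python
# Sketch of TF/IDF-style score fusion: a domain where almost everyone looks
# anomalous is down-weighted, like a word that appears in every document.
import math

# scores[user][domain] = per-domain anomaly score (hypothetical values)
scores = {
    "alice":   {"logon": 0.0, "device": 0.1, "http": 0.9},
    "bob":     {"logon": 0.1, "device": 0.0, "http": 0.8},
    "mallory": {"logon": 0.9, "device": 0.8, "http": 0.9},
}

domains = ["logon", "device", "http"]
n_users = len(scores)

def weight(domain, threshold=0.5):
    # IDF-like weight: domains where few users score highly are more
    # informative; here everyone scores highly on HTTP, so it gets weight 0.
    flagged = sum(1 for u in scores if scores[u][domain] > threshold)
    return math.log((1 + n_users) / (1 + flagged))

weights = {d: weight(d) for d in domains}
final = {u: sum(weights[d] * scores[u][d] for d in domains) for u in scores}

print(max(final, key=final.get))  # mallory
```

Note how the noisy HTTP domain, anomalous for every user, contributes nothing to the final ranking, while the domains where only mallory stands out dominate it.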

IV. Unusual-change Attack Detection

Not every attacker is caught by Blend-in attack detection, so we also need to analyze user behavior for unusual changes. The starting point is that user behavior changes normally over time, i.e. there is a reasonable drift. If a single user's own history were the only baseline, many reasonable changes would undoubtedly be flagged as anomalies. We therefore compare against the user's peer group (users with the same position, role, and work tasks). As before, we start with a definition:

Definition-3: Consistency of behavior change. Given user A and users B and C in A's peer group, if A and B exhibit changes between particular clusters in domain S1, then the changes of A and B across clusters in domain S2 should also be consistent.

The underlying assumption is similar to that of Definition-2: users with the same tasks and roles should also have similar patterns of change. The changes need not happen at the same time; what matters is the consistency of change patterns over a long period. For example, if user A and the peer "engineer" group both move between Cluster 2 and Cluster 4 in the Logon domain, and both move among Clusters 3, 4, and 5 in the Email domain, then A's behavior changes are consistent, and vice versa.

Again, a figure helps with intuition: users A, B, and C all move between state 1 and state 2 in domain 1, but in domain 2, user A moves between states 1 and 4, unlike users B and C:

 

The detection method is similar to Blend-in detection; we describe it briefly:

1. Cluster the original data: note that here we use the domain features described in "Experiment Data" (Figure 1) rather than a GMM. The clusters serve as the user's "states" in a domain, and we build a transition probability matrix Qd whose elements qd(Ck, Cm) give the probability that a user moves from state k to state m, estimated from the observed frequency of state transitions;

2. Model behavior change: two algorithms are used, the Markov model and the Rarest-change model. We will not cover the details here and give the model formulas directly:

Figure 3.1: the Markov model, where S is the user's anomaly score in domain d and Pd is the prior probability of the initial state C0:

 

Figure 3.2: the Rarest-change model:

 

3. Information fusion: this step is simple. From the set of S scores computed above, select the largest threat, i.e. the smallest value of S, as in Figure 3.3:

 
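The three steps above can be sketched as follows (all state ids, priors, and sequences below are hypothetical):

```python
# Hedged sketch of the Unusual-change pipeline: estimate the transition
# matrix Qd from observed daily states, score a sequence with the Markov
# model (prior of the initial state times the chained transition
# probabilities, in log space), and fuse by taking the smallest score
# across domains, i.e. the least explainable domain.
import math
from collections import defaultdict

def transition_matrix(states):
    """Empirical qd(Ck, Cm): frequency of moving from state k to state m."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(states, states[1:]):
        counts[prev][nxt] += 1
    return {k: {m: c / sum(row.values()) for m, c in row.items()}
            for k, row in counts.items()}

def markov_score(states, prior, Q):
    """log P(C0) + sum of log qd(C_{t-1}, C_t); smaller = more anomalous."""
    s = math.log(prior[states[0]])
    for prev, nxt in zip(states, states[1:]):
        s += math.log(Q[prev][nxt])
    return s

# Hypothetical history used to fit one domain's model ...
history = [1, 1, 2, 2, 1, 1, 2, 1]
Q = transition_matrix(history)
prior = {1: 0.75, 2: 0.25}

# ... and two observed sequences to score (reusing the model for brevity).
per_domain = {
    "logon":  markov_score([1, 1, 1], prior, Q),  # common transitions
    "device": markov_score([1, 2, 2], prior, Q),  # includes a rare 2->2 step
}

# Fusion: the domain with the smallest S is the threat indicator.
print(min(per_domain, key=per_domain.get))  # device
```

A real deployment would fit one Qd and prior per domain per peer group; the single shared model here only keeps the sketch short.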

V. Experiment Results

To verify the effectiveness of information fusion, we first conduct an experiment on the Blend-in attack. The experiment results are as follows:

Figure 4: data in the Device and File domains is better suited to anomaly detection, while the HTTP and Logon domains are somewhat less discriminative:

 

Figure 5: using the Device domain alone, the anomaly does not stand out:

 

Figure 6: with inter-domain fusion, the anomaly is obvious:

 

For the Unusual-change attack we experiment on the 4,600-user dataset, and the cost-effectiveness of analysis improves markedly: over seven to eight months of data, sampling only 50% of users is enough to detect all malicious insiders, and in some months analyzing just 13% of users finds all attackers, as shown in Figures 7 and 8:

 

 

Running the Markov model alone and plotting each user's change-likelihood curve, we can see that the likelihood of changes in the Device and Logon domains is very consistent across users, so inconsistent changes stand out clearly, i.e. anomalies in these two domains are easier to detect. In the Email Sent/Received domains, the likelihood of changes is less consistent, which is unfavorable for anomaly detection.

Likelihood curve of user behavior changes in the Device domain:

 

Likelihood curve of user behavior changes in the Logon domain:

 

Email Sent:

 

Email Received:

 

VI. Summary

Traditional detection methods focus on data from a single domain, or on a naive concatenation of multiple domains. The method introduced today uses TF/IDF and GMMs to combine information across domains effectively, improving detection performance and reducing the false positive rate. The key to the fusion method is to weight and sum the information of each domain automatically via the TF/IDF framework, and to use the definition of "consistency" to link the information skillfully, so that correlations across domains come into play and detection is optimized.

VII. References

1. M. Salem and S. Stolfo. Masquerade attack detection using a search-behavior modeling approach. Columbia University Computer Science Department, 2009.

2. M. Salem, S. Hershkop, and S. Stolfo. A survey of insider attack detection research. Insider Attack and Cyber Security: Beyond the Hacker, Springer, 2008.

3. Multi-Domain Information Fusion for Insider Threat Detection. IEEE Security and Privacy Workshops, 2013.
