Examples of anomaly detection methods and ideas based on big data analysis
1 Overview
As information technology penetrates ever deeper into society, the data produced by information systems is growing exponentially. In-depth analysis of this data can yield a great deal of valuable information. Because the data volume is large and the attributes are diverse, classic statistical analysis methods are no longer adequate, and big data analysis methods based on machine learning theory are used instead. Big data analysis is already widely used in business intelligence (BI), with satisfactory results. The same approach can be used in information security to detect anomalies in information systems (intrusions and attacks, data leakage, and so on). Using big data analysis to discover abnormal events requires several conditions: 1) The behavior logs must be detailed enough that normal and abnormal behavior can be told apart from the log content. The underlying assumption is that abnormal behavior, however normal it looks on the surface, always differs from normal behavior in its details. 2) An appropriate analysis algorithm must be selected for each analysis objective. 3) The behavior description must be modeled reasonably.
2. Botnet detection based on DNS log analysis
2.1 Format and description of DNS resolution request logs
The format of DNS resolution request logs varies with the DNS system and its configuration parameters. Only one log format is used here for illustration.
Jul 22 10:59:59 201307221059 GSLZ-PS-DNS-SV07-YanT '17 75 1374461999.999790 listen 53 218.203.199.90 5826 dns, 4692,0 | 4 | 5 1 www.baidu.com, www.baidu.com, 3, 111.11.184.114'
The flag fields in the DNS message header have the following meanings:
QR is one bit: 0 indicates a query message, and 1 indicates a response message.
Opcode is a four-bit field: 0 indicates a standard query, 1 an inverse query, and 2 a server status request.
AA is one bit, short for Authoritative Answer. It indicates that the name server is authoritative for the domain.
TC is one bit, short for Truncated. It indicates that the UDP response exceeded 512 bytes and only the first 512 bytes were returned.
RD is one bit, short for Recursion Desired. It asks the name server to resolve the query recursively rather than return a list of other name servers for iterative querying.
RA is one bit, short for Recursion Available. It is set to 1 if the name server supports recursive queries.
Zero is three bits and must be set to 0.
Rcode is a four-bit return code. 0 indicates no error, and 3 indicates a name error: the domain name specified in the query does not exist.
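The bit layout described above can be made concrete with a minimal sketch that splits a 16-bit DNS header flags word into its fields. It assumes the flags are available as a raw integer, which the log format shown earlier may not expose directly; the function name `decode_dns_flags` and the sample value `0x8183` are illustrative only. The status names are the response status names given in this section.

```python
# Response status names, keyed by Rcode value.
RCODE_NAMES = {
    0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN",
    4: "NOTIMPL", 5: "REFUSED", 6: "YXDOMAIN", 7: "YXRRSET",
    8: "NXRRSET", 9: "NOTAUTH", 10: "NOTZONE",
}


def decode_dns_flags(flags: int) -> dict:
    """Split the 16-bit DNS header flags word into its named fields."""
    rcode = flags & 0xF
    return {
        "qr": (flags >> 15) & 0x1,      # 0 = query, 1 = response
        "opcode": (flags >> 11) & 0xF,  # 0 = standard query
        "aa": (flags >> 10) & 0x1,      # authoritative answer
        "tc": (flags >> 9) & 0x1,       # truncated
        "rd": (flags >> 8) & 0x1,       # recursion desired
        "ra": (flags >> 7) & 0x1,       # recursion available
        "zero": (flags >> 4) & 0x7,     # reserved, expected to be 0
        "rcode": rcode,
        "status": RCODE_NAMES.get(rcode, "UNKNOWN"),
    }


# Hypothetical flags word: a recursive response reporting a name error.
fields = decode_dns_flags(0x8183)
print(fields)
```

A response whose status decodes to NXDOMAIN is one of the attributes that later sections use to separate abnormal queries from normal ones.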
Response status
"NOERROR" => 0. No error condition.
"FORMERR" => 1. The name server could not interpret the request because of a format error.
"SERVFAIL" => 2. The name server encountered an internal error while processing the request, for example an operating system error or a forwarding timeout.
"NXDOMAIN" => 3. A domain name that should exist does not exist.
"NOTIMPL" => 4. The name server does not support the specified Opcode.
"REFUSED" => 5. The name server refuses to perform the operation for policy or security reasons.
"YXDOMAIN" => 6. A domain name that should not exist does exist.
"YXRRSET" => 7. An RRset that should not exist does exist.
"NXRRSET" => 8. An RRset that should exist does not exist.
"NOTAUTH" => 9. The name server is not authoritative for the zone named.
"NOTZONE" => 10. A name in the prerequisite or update section is not within the zone named in the zone section.
2.2 Comparison and Analysis of normal and abnormal DNS resolution requests
Most malicious programs that first infect a botnet host are only downloaders. The malware that actually performs harmful operations must be downloaded from a malware distribution server. Therefore, once a downloader is installed on a zombie host, its first task is to issue a series of domain name resolution requests to obtain the IP address of a malware distribution host and download the malware body. After the malware body is installed, the zombie host sends further domain name queries to obtain the IP address of the control server, connects to it, and waits for instructions. To keep distribution servers and control servers from being discovered and taken down by network administrators, botnet controllers protect these two key server types with techniques such as dynamic domain names and Fast Flux. In another case, the DNS query requests are themselves an attack launched by the zombie hosts, and their features also differ markedly from those of normal queries. In short, botnets send a large number of domain name query requests, and these requests differ significantly from normal ones in many attributes.
Table 2-1 Comparison of abnormal query requests and normal query requests
2.3 General process of similarity analysis
Normal domain name query requests account for the vast majority of the logs and are clearly similar to one another, while the query logs of zombie hosts differ markedly from them, so similarity analysis is well suited to telling the two apart. The common steps are as follows: 1) Determine the object to be analyzed (source IP address or domain name). 2) Determine the attributes to analyze. 3) Quantize the attributes into numeric values. 4) Write the values into the descriptive matrix. 5) Use the descriptive matrix as input to the similarity formula to calculate the similarity between the analyzed objects.
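Steps 1 to 4 can be sketched as follows, assuming the logs have already been reduced to (source IP, domain name) pairs. The function name `build_descriptive_matrix` and the sample records are hypothetical; the domain name is the object and the per-IP query count is the attribute.

```python
from collections import Counter


def build_descriptive_matrix(records):
    """Return (domains, ips, matrix).

    matrix[i][j] is the number of times ips[j] queried domains[i].
    """
    counts = Counter(records)  # (ip, domain) -> number of queries
    ips = sorted({ip for ip, _ in records})
    domains = sorted({domain for _, domain in records})
    matrix = [[counts[(ip, domain)] for ip in ips] for domain in domains]
    return domains, ips, matrix


# Hypothetical, already-parsed log records: (source IP, queried domain name).
records = [
    ("10.0.0.1", "www.example.com"),
    ("10.0.0.1", "www.example.com"),
    ("10.0.0.2", "www.example.com"),
    ("10.0.0.2", "mail.example.com"),
    ("10.0.0.3", "x7k2q9.example.net"),
]

domains, ips, matrix = build_descriptive_matrix(records)
for name, row in zip(domains, matrix):
    print(name, row)
```

Swapping the roles of the two tuple elements gives the matrix with source IP addresses as the objects instead.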
Similarity analysis usually treats each data object as a point in a multi-dimensional space. The similarity between objects can be expressed by a similarity coefficient or by a distance. Objects whose similarity coefficient is close to 1, or whose distance is small, have similar properties; objects whose similarity coefficient is close to 0, or whose distance is large, differ significantly. Different data types call for different formulas. Commonly used similarity coefficient and distance formulas include:
(2-1)
(2-2)
$$J(A, B) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}} \qquad \text{(2-3)}$$
Formula (2-1) calculates the spatial distance between the variables Xi and Xj.
Formula (2-2) calculates the similarity coefficient.
Formula (2-3) calculates the Jaccard similarity coefficient, which is generally used for asymmetric binary variables. Assume that A and B are two n-dimensional vectors whose components are all 0 or 1. Asymmetric means that the two outcomes of a state are not equally important, for example the positive and negative results of a disease test. Where:
M11 is the number of dimensions in which A and B are both 1.
M10 is the number of dimensions in which A is 1 and B is 0.
M01 is the number of dimensions in which A is 0 and B is 1.
M00 is the number of dimensions in which A and B are both 0.
Typically, the more important outcome, which usually has the lower probability (for example, HIV positive), is encoded as 1, and the other outcome is encoded as 0. In some fields a positive match (M11) is more meaningful than a negative match (M00), so the number of negative matches M00 is considered unimportant and is ignored in the calculation.
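As a small illustration of the counts M11, M10, and M01, the Jaccard coefficient M11 / (M01 + M10 + M11) can be computed as below. The function name `jaccard` and the sample vectors are illustrative, and returning 0.0 when both vectors are all zeros is a convention chosen for this sketch, not something the text specifies.

```python
def jaccard(a, b):
    """Jaccard similarity of two equal-length 0/1 vectors.

    Negative matches (M00) are ignored, as described above.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    denominator = m01 + m10 + m11
    # Assumed convention: two all-zero vectors share nothing, so return 0.0.
    return m11 / denominator if denominator else 0.0


# Hypothetical example: 1 means "this host queried the domain at least once".
host_a = [1, 0, 1, 1, 0, 0]
host_b = [1, 1, 0, 1, 0, 0]
print(jaccard(host_a, host_b))  # M11 = 2, M10 = 1, M01 = 1 -> 0.5
```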
When analyzing domain name query logs, you can take as the object either the host IP address that sends the query or the domain name being queried, and you can analyze a single attribute or a group of attributes. Different combinations of objects and attributes therefore yield many different descriptive matrices. The following example walks through the similarity analysis process. It takes the domain name as the object and the number of times each IP address queries the domain name as the attribute, which gives a descriptive matrix such as Table 2-2.
Table 2-2 Descriptive matrix of domain name request behavior
For simplicity, substitute the values of the descriptive matrix into formula (2-2) to calculate the similarity between each pair of domain names and obtain the similarity matrix (such as Table 2-3). Domain name n turns out to have the lowest similarity with the other domain names, so the hosts that query domain name n can largely be judged to be zombie hosts.
Table 2-3 Similarity analysis results of domain name request behavior
The data for similarity analysis is a matrix with an object-object structure. You can use domain names or IP addresses as the objects on both axes, or build the matrix from IP addresses against domain names.
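The step from a descriptive matrix to a similarity matrix can be sketched as follows. Formula (2-2) is not reproduced in this text, so cosine similarity is used here purely as an assumed stand-in; the row values and the names `cosine`, `similarity_matrix`, and `least_similar` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(rows):
    """Object-object matrix of pairwise similarities."""
    return [[cosine(u, v) for v in rows] for u in rows]


def least_similar(names, rows):
    """Return the object with the lowest mean similarity to all others."""
    sim = similarity_matrix(rows)
    n = len(rows)
    mean_sim = [(sum(sim[i]) - sim[i][i]) / (n - 1) for i in range(n)]
    idx = min(range(n), key=mean_sim.__getitem__)
    return names[idx], mean_sim[idx]


# Hypothetical descriptive matrix: one row per domain name, one column per
# source IP, each value the number of queries from that IP.
names = ["domain-1", "domain-2", "domain-3", "domain-n"]
rows = [
    [5, 3, 4, 0],
    [4, 3, 5, 0],
    [6, 2, 4, 0],
    [0, 0, 0, 9],
]

print(least_similar(names, rows))
```

In this made-up data only one host ever queries domain-n, so its row is orthogonal to the others and it stands out as the least similar object.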
2.4 General process of cluster analysis
When each attribute of a domain name resolution request is treated as a variable, the attributes form a multidimensional vector, as in Table 2-4, where each row is one such vector. Clustering these vectors reveals the vectors that fall outside the clusters, and the corresponding domain names are abnormal. Resolution requests for these domain names may have been sent by zombie programs or webshells. Attributes worth considering include: domain name length, domain name similarity, TTL, domain name level, request interval, number of source IP addresses, response status, number of IP addresses the domain name resolves to, and query type.
Table 2-4 Multidimensional vectors of domain name attributes
You can use a partitioning method such as the K-means algorithm to cluster the multidimensional vectors of domain name attributes. Because the attribute values of abnormal domain names usually differ markedly from those of normal domain names, clustering usually achieves high cluster quality and separates the abnormal domain names from the clusters of normal ones.
The data for cluster analysis is a set of multidimensional vectors with an object-attribute structure: domain names are the objects, and query request attributes are the attributes.
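A minimal K-means sketch over domain name attribute vectors is shown below. The two attributes, the sample values, the choice of k = 2, and the rule of treating the smallest cluster as suspect are all assumptions made for illustration; real attributes would normally be scaled to comparable ranges first.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=20):
    """Plain K-means. Returns one cluster label per point.

    The first k points seed the centroids, which keeps the run deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical vectors: (domain name length, share of NXDOMAIN responses).
domains = ["domain-1", "domain-2", "domain-3", "domain-4", "domain-5", "domain-6"]
vectors = [(12, 0.02), (14, 0.01), (11, 0.03), (13, 0.02), (31, 0.90), (29, 0.85)]

labels = kmeans(vectors, k=2)
# Assumed judgment rule: the smallest cluster holds the suspect domain names.
smallest = min(Counter(labels).items(), key=lambda kv: kv[1])[0]
suspects = [d for d, lab in zip(domains, labels) if lab == smallest]
print(suspects)
```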
3. Internal abnormal behavior detection based on Big Data Analysis
3.1 Overview of internal information system behavior
In the industry, the behavior of internal information systems (hereinafter, internal behavior) is divided into host behavior (covering servers and terminals) and network behavior. Host behavior is behavior local to the host, for example account creation, file creation, registry modification, changes to memory attributes (read/write, execute), and process changes (start, stop). If multiple virtual hosts run on a physical host, host behavior should also include some behavior of the virtualization system. Network behavior is behavior related to network access, such as domain name resolution requests, HTTP requests, ARP broadcasts, sending and receiving email, instant messaging, file upload and download, and database access.
3.2 Basic Principles of big data analysis on internal Behaviors
Early information security measures focused on protection against external attacks, while internal abnormal behavior received too little attention and lacked detection methods. Extensive security practice has since produced a consensus on the importance of detecting internal abnormal behavior. In particular, in many of the APT attacks disclosed in recent years, the main attack process took place on internal networks and information systems. Internal abnormal behavior is generally well concealed, and attackers can hide what they do. A single behavior usually looks normal. However, when several behaviors are associated, and the resulting combination occurs very rarely while the behavior subject has no special status that would explain it, the behavior is probably abnormal. Likewise, once some internal behaviors have been identified as abnormal, the behaviors associated with them are much more likely to be abnormal as well.
3.3 general association analysis process
Similarity analysis also applies to the analysis of internal behavior. To avoid repetition, this section instead uses an association analysis algorithm to illustrate the approach in practice. The general steps of association analysis are as follows: 1) Taking the behavior subject (usually an IP address or an identity) as the analyzed object, parse the internal behavior logs and convert the heterogeneous logs that describe the various behaviors into behavior chains suitable for analysis and comparison (2-4). 2) Feed the behavior chain data into the association analysis algorithm to calculate the possible associations. 3) Apply judgment rules to identify the combinations of abnormal behaviors among the calculated associations.
Figure 3-1 build a behavior chain
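Step 1, building behavior chains, might look like the following sketch. It assumes that each heterogeneous log has already been parsed into (behavior subject, behavior type) pairs; the per-source parsers are omitted, and the event names and the function `build_behavior_chains` are hypothetical.

```python
from collections import defaultdict


def build_behavior_chains(events):
    """Map each behavior subject to the set of behaviors it exhibited."""
    chains = defaultdict(set)
    for subject, behavior in events:
        chains[subject].add(behavior)
    # Freeze the sets so a chain can be used as a record in later analysis.
    return {subject: frozenset(items) for subject, items in chains.items()}


# Hypothetical normalized events drawn from different log sources.
events = [
    ("10.0.0.5", "dns_query"),   # from a DNS request log
    ("10.0.0.5", "port_scan"),   # from a network traffic log
    ("10.0.0.6", "dns_query"),
    ("10.0.0.5", "dns_query"),   # repeated behaviors collapse into one item
    ("10.0.0.6", "file_create"), # from a host behavior log
]

chains = build_behavior_chains(events)
for subject, chain in sorted(chains.items()):
    print(subject, sorted(chain))
```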
The purpose of association analysis is to find association rules in data. An association rule is an implication of the form X → Y, meaning that Y can be inferred from X; X and Y are called the premise and the result of the rule, respectively. Only when both the minimum support and the minimum confidence are met is Y considered inferable from X. Before looking at the algorithm, several basic concepts need to be understood:
Support: the probability that event X and event Y occur together, that is, support = P(XY).
Confidence: the probability of event Y given event X, that is, confidence = P(Y | X) = P(XY) / P(X).
Item set: B = {B1, B2, ..., Bm} is a collection of items.
Behavior chain record library: D = {t1, t2, ..., tn}.
Behavior chain: a behavior chain t consists of multiple items, and t is a non-empty subset of B.
TID: the unique identifier of each behavior chain.
Frequent item set: an item set that meets the minimum support threshold.
To make these concepts easier to grasp, Figure 3-2 illustrates them. The rounded rectangle represents the set I of all items, the blue dots in the ellipse represent event X, and the green triangles in the diamond represent event Y.
Figure 3-2 Basic concepts of Association Analysis
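The definitions of support and confidence can be checked on a toy record library. The four behavior chains below are made up for illustration, and `support` and `confidence` are names chosen for this sketch. Support is computed here as a fraction of all chains, matching the probability definition above.

```python
def support(chains, items):
    """Fraction of behavior chains that contain every item in `items`."""
    items = set(items)
    return sum(1 for chain in chains if items <= set(chain)) / len(chains)


def confidence(chains, x, y):
    """P(Y | X) = support(X and Y together) / support(X)."""
    support_x = support(chains, x)
    if support_x == 0:
        return 0.0
    return support(chains, set(x) | set(y)) / support_x


# Hypothetical record library of four behavior chains.
chains = [
    {"B1", "B2"},
    {"B1", "B2", "B3"},
    {"B2", "B3"},
    {"B1", "B3"},
]

print(support(chains, {"B1", "B2"}))       # 2 of 4 chains -> 0.5
print(confidence(chains, {"B1"}, {"B2"}))  # 0.5 / 0.75
```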
Table 3-1 Internal behavior record library
For simplicity and without loss of generality, nine behavior subjects and five behaviors (denoted B1 to B5) are used here to illustrate the principle of the association analysis algorithm. As Table 3-1 shows, the behavior record library contains nine behavior chain records involving the five behaviors B1 to B5. A behavior that occurred is recorded as 1; otherwise it is recorded as 0. First, scan the record library to obtain the first-order candidate item set C1 (Table 3-2).
Table 3-2 First-order candidate item set C1
With a minimum support of 2, every item set in C1 qualifies and is kept as a first-order frequent item set. The pairwise combinations of B1 to B5 are then taken as second-order candidates, and the record library is scanned to obtain the second-order candidate item set C2 (Table 3-3).
Table 3-3 Second-order candidate item set C2
Remove the item sets whose support is less than 2 to obtain the second-order frequent item set L2 (Table 3-4).
Table 3-4 Second-order frequent item set L2
Combine the elements of L2 to generate the third-order candidate item sets. Because every subset of a frequent item set must itself be frequent, the combinations that have an infrequent subset are removed. In the end, only two maximum frequent item sets with support of at least 2 remain: {B1, B2, B3} and {B1, B2, B5} (Table 3-5).
Table 3-5 maximum frequent item set meeting minimum support
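The level-by-level search described above is the Apriori procedure, and a compact sketch follows. Table 3-1 is not reproduced in this text, so the nine records below are stand-in data chosen so that the result agrees with the two item sets named above; they should not be read as the original table. Support is counted as a number of records, matching the minimum support of 2.

```python
from itertools import combinations

# Stand-in record library: nine behavior chains over B1 to B5 (assumed data).
RECORDS = [
    {"B1", "B2", "B5"}, {"B2", "B4"}, {"B2", "B3"},
    {"B1", "B2", "B4"}, {"B1", "B3"}, {"B2", "B3"},
    {"B1", "B3"}, {"B1", "B2", "B3", "B5"}, {"B1", "B2", "B3"},
]


def count_support(records, itemset):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for record in records if itemset <= record)


def apriori(records, min_support):
    """Return {item set: support count} for all frequent item sets."""
    items = sorted({item for record in records for item in record})
    level = {frozenset([item]) for item in items}  # first-order candidates
    frequent = {}
    k = 1
    while level:
        counted = {c: count_support(records, c) for c in level}
        kept = {c: n for c, n in counted.items() if n >= min_support}
        frequent.update(kept)
        k += 1
        # Join step: unions of frequent sets that have exactly k items.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-item subset must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in kept for s in combinations(c, k - 1))
        }
    return frequent


frequent = apriori(RECORDS, min_support=2)
largest = max(len(itemset) for itemset in frequent)
largest_sets = sorted(sorted(s) for s in frequent if len(s) == largest)
print(largest_sets)
```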
For each frequent item set B (see Table 3-5), generate all of its non-empty proper subsets. Then, for each non-empty proper subset S, calculate the confidence of the corresponding association rule, that is, support(B) / support(S). If the ratio reaches the minimum confidence, output the rule S → (B - S), meaning that S is associated with (B - S).
Table 3-5 non-empty subsets of frequent item sets
Table 3-6 Association Rules
This yields three association rules: {B1, B5} → {B2}, {B2, B5} → {B1}, and {B1} → {B2, B5}. The other maximum frequent item set is processed in the same way to obtain another set of association rules.
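The rule-generation step, in which each non-empty proper subset S of a frequent item set B is tested with support(B) / support(S), can be sketched as below. To avoid suggesting values for the tables that are not reproduced here, the sketch uses different, made-up behavior names and support counts; the 0.7 minimum confidence and the function name `generate_rules` are also assumptions.

```python
from itertools import combinations


def generate_rules(support_count, itemset, min_confidence):
    """Return (S, B - S, confidence) for each qualifying proper subset S of B."""
    b = frozenset(itemset)
    rules = []
    for size in range(1, len(b)):
        for subset in combinations(sorted(b), size):
            s = frozenset(subset)
            conf = support_count[b] / support_count[s]
            if conf >= min_confidence:
                rules.append((s, b - s, conf))
    return rules


# Hypothetical support counts for one frequent item set and its subsets.
support_count = {
    frozenset({"scan"}): 4,
    frozenset({"dns_burst"}): 5,
    frozenset({"file_read"}): 2,
    frozenset({"scan", "dns_burst"}): 3,
    frozenset({"scan", "file_read"}): 2,
    frozenset({"dns_burst", "file_read"}): 2,
    frozenset({"scan", "dns_burst", "file_read"}): 2,
}

rules = generate_rules(
    support_count, {"scan", "dns_burst", "file_read"}, min_confidence=0.7
)
for premise, result, conf in rules:
    print(sorted(premise), "->", sorted(result), round(conf, 2))
```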
3.4 examples of application of Association Analysis Methods
In an information system, the vast majority of access behavior is normal, so abnormal behavior makes up a very small share. In association analysis for anomaly detection, therefore, we do not require support to exceed a threshold. Instead, support must be greater than 0 and smaller than a specified value.
For example, an ordinary client host that scans other IP addresses is clearly abnormal. If association analysis then shows that these scanning hosts all query the same or highly similar domain names, those domain name queries are also abnormal, and the client hosts are probably infected with a trojan.
As another example, suppose ordinary client hosts issue high-frequency concurrent domain name queries. If association analysis finds that the system function calls of these hosts have very similar characteristics, or that the hosts all access local sensitive files (password files, configuration files, and so on), then those system function calls and accesses to local sensitive files are also abnormal.
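The idea of keeping only associations whose support is greater than 0 but below an upper bound can be sketched for behavior pairs as follows. The behavior chains, the bound of 2, and the function name `rare_pairs` are hypothetical; support is counted as a number of chains.

```python
from collections import Counter
from itertools import combinations


def rare_pairs(chains, max_support):
    """Return behavior pairs whose support count is > 0 and < max_support."""
    counts = Counter()
    for chain in chains:
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(chain), 2))
    # Pairs that never co-occur are absent from `counts`, so support > 0 holds.
    return {pair: n for pair, n in counts.items() if 0 < n < max_support}


# Hypothetical behavior chains, one per client host.
chains = [
    {"dns_query", "http_get"},
    {"dns_query", "http_get"},
    {"dns_query", "http_get", "mail_send"},
    {"dns_query", "http_get", "mail_send"},
    {"dns_query", "port_scan"},
]

print(rare_pairs(chains, max_support=2))
```

Only the combination of domain name queries with port scanning survives the filter here, which matches the intuition that rare combinations deserve a closer look.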
4 Conclusion
Internet-based B/S information systems generate large volumes of logs of many types during operation, such as security device alarms, operating system logs, database logs, terminal behavior logs, network traffic logs, web access logs, DNS request logs, and Internet access logs. These logs are rich in information, and with appropriate analysis algorithms they yield valuable results. Beyond the two application scenarios described in this article, big data analysis can also be used for security analysis scenarios such as denial-of-service attack detection, security intelligence analysis, situation awareness, web page tampering detection, application-layer attack detection, and malicious file detection.
Successful big data analysis depends on data, ideas, and algorithms. Security-oriented big data analysis mainly draws on system logs and behavior logs. This article has tried to introduce the ideas behind security-oriented big data analysis through close-to-reality cases. Many algorithms are available for big data analysis, but not all of them suit security scenarios. The reason is that both system logs and behavior logs are low-dimensional data, so algorithms designed for high-dimensional data are largely inapplicable. The suitable algorithms are therefore mainly similarity analysis, association analysis, and clustering. Classification algorithms can also be used when enough training data is available, but such data is usually hard to obtain, which limits their use.
Besides analysis algorithms, visualization is an important and effective means of analysis. It can serve as an analysis tool that shows the relationships in the data directly as graphs and makes the data easier to read, and as a presentation tool that makes analysis results more intuitive. Owing to space limitations, this article does not cover visualization, and we hope to have the opportunity to add it in the future.