Malware Analysis Through Machine Learning: Basic Principles of the Clustering Algorithms in Deepviz
Over the past year, we have seen many antivirus companies turn to machine learning and artificial intelligence, hoping to find a fast and effective way to analyze and isolate new types of malware and to expand their malware libraries. There is a big problem here, however: many people treat machine learning as an omnipotent magic wand. When they start using it, they simply hand the algorithm as many samples as possible and let it compute. That can be done, but it is not the right approach.
0x00 Introduction
Deepviz is a powerful automated malware analysis platform and an effective threat intelligence platform. It can also analyze all the data extracted by the malware analyzer and use correlation algorithms to draw the best conclusions from it.
Machine learning is a key technology in Deepviz. It is used to identify new malware, to find similar samples and correlate them, and to expand the malware library. In practice, though, running an effective machine learning algorithm is not as simple as most people think. It is not a matter of "giving a smart machine as much detailed information as possible and letting it work out the tricks on its own".
Next we will explore the principles of machine learning step by step.
0x01 Clustering Algorithm
Cluster analysis is a statistical technique for identifying meaningful groups in a given dataset. We can use existing malware groups to discover new malware that is similar to, or shares characteristics with, one of those groups. To group a dataset, we need a way to express the similarity between the elements in it.
Each element is described by a feature set: a set of attributes that capture some important aspect of the element. When designing a feature set, there are two key points to consider:
First, you must understand how to select attributes. As mentioned above, many people think cluster analysis means letting the program extract millions of attributes and compute over them. They overlook one point: the more attributes are considered, the longer the calculation takes, and some attributes cannot accurately distinguish malware. For example, using only the entropy of the PE file as the feature set leads to incorrect clustering. At the same time, we need enough attributes to separate malware with distinctive behaviors into different malware groups. The attributes we want to extract are therefore those that are meaningful for malware analysis, and extracting them is the largest part of the Deepviz malware analyzer.
Second, find the most appropriate measure for comparing the attributes of malware. Each malware sample can be described by numeric attributes (such as information entropy) or by abstract attributes. In mathematics, a similarity measure describes how alike two objects are. Euclidean distance is a widely known way to compare numeric attributes, but how do we measure the similarity of abstract data? Suppose Deepviz gives us two malware samples together with the IPs and URLs related to each. How can we compare these two sets? How do we cluster them?
For example, the following figure shows the clustering, by MD5, of all samples that contacted the website complifies.ru:
Our clustering algorithm divides the samples into four different groups. This is the last step of the clustering process.
To form groups of similar elements, we need to know how similar or different each element is from every other element. These values can be expressed as a distance matrix.
Based on this, what we need is:
1. A feature set consisting of one or more attributes, used to characterize the elements of the original dataset.
2. A distance measure for calculating the distance between elements.
Then we carry out the following steps:
3. Calculate the distance between each element and every element in the dataset, including itself.
4. Apply a clustering algorithm to group similar elements. In our example, we group malware samples.
Similarity Between Elements
Euclidean distance is widely used to calculate the distance between numeric elements. Here I will introduce another distance measure, the Jaccard distance, which works for abstract values.
First, an example. Here are two samples, each with its list of related URLs, from which the similarity between the samples is calculated. The details are as follows:
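For comparison with what follows, here is a minimal sketch of the Euclidean case for numeric attributes. The function name and the feature values (entropy and file size in MB) are made up for illustration and are not Deepviz data.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical numeric features per sample: (entropy, size in MB)
sample_1 = (7.0, 1.0)
sample_2 = (4.0, 5.0)
print(euclidean_distance(sample_1, sample_2))  # 5.0
```

This works only when every attribute is a number, which is why a different measure is needed for sets of URLs or IPs.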
Samples:
26414a9d627606c4974d8c3f372b0797 and 27f72541c93e206dcd5b2d4171e66f9a
Result pages:
https://intel.deepviz.com/hash/26414a9d627606c4974d8c3f372b0797/
https://intel.deepviz.com/hash/27f72541c93e206dcd5b2d4171e66f9a/
Jaccard similarity is the most widely used measure for comparing abstract data. It is defined as the ratio of the number of elements in the intersection of two sets to the number of elements in their union. If the two sets have no elements in common, the Jaccard similarity is 0. If the two sets contain exactly the same elements, the similarity is 1. The similarity between two sets A and B is calculated as follows:
J(A, B) = |A ∩ B| / |A ∪ B|
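The definition translates directly into code. The following is a minimal sketch using Python sets; the function name and the URLs are placeholders invented for illustration, and treating two empty sets as identical is an assumption of this sketch.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)

# Hypothetical URL lists for two samples (placeholders, not real data)
urls_1 = {"http://a.example/x", "http://b.example/y", "http://c.example/z"}
urls_2 = {"http://b.example/y", "http://c.example/z", "http://d.example/w"}

print(jaccard_similarity(urls_1, urls_2))                 # 0.5 (2 shared / 4 in total)
print(jaccard_similarity(urls_1, urls_1))                 # 1.0
print(jaccard_similarity(urls_1, {"http://e.example/"}))  # 0.0
```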
Consider the following example:
1. Set A consists of 19 elements
2. Set B consists of 12 elements
3. The intersection of the two sets contains four elements
Using the Jaccard formula, the similarity between the two sets is 0.15. The similarity can also be converted into a distance, as shown below:
distance = 1 - similarity
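As a check on the numbers in this example, the size of the union follows from inclusion-exclusion, and the distance follows from the similarity. Here J denotes the Jaccard similarity and d the distance; both symbols are chosen for illustration.

```latex
\begin{align*}
|A \cup B| &= |A| + |B| - |A \cap B| = 19 + 12 - 4 = 27 \\
J(A, B) &= \frac{|A \cap B|}{|A \cup B|} = \frac{4}{27} \approx 0.15 \\
d(A, B) &= 1 - J(A, B) = \frac{23}{27} \approx 0.85
\end{align*}
```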
If we calculate the distance between all samples, we can obtain the following distance matrix:
In this example matrix, the distance between samples A and B is 0.15, and the distance between A and C is 0.8. The distance between an element and itself is 0, so the diagonal is all zeros. The distance between A and B is the same as the distance between B and A, so the matrix is symmetric about the diagonal. To reduce computation time, we can exploit this symmetry and calculate only half of the matrix.
Once the matrix has been calculated, it can be fed into the clustering algorithm. It follows that the more meaningful the attributes extracted from the data, the more accurate the clusters produced by the clustering algorithm.
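The symmetry optimization can be sketched as follows: only the entries above the diagonal are computed, and each one is mirrored. The function names and the IP sets are hypothetical, chosen only to illustrate the idea.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are identical
    return 1.0 - len(a & b) / len(union)

def distance_matrix(feature_sets):
    """Build a symmetric distance matrix, computing only the upper triangle."""
    n = len(feature_sets)
    matrix = [[0.0] * n for _ in range(n)]  # the diagonal stays 0
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(feature_sets[i], feature_sets[j])
            matrix[i][j] = d
            matrix[j][i] = d  # mirror across the diagonal
    return matrix

# Hypothetical samples, each described by the set of IPs it contacted
samples = [
    {"10.0.0.1", "10.0.0.2"},
    {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
    {"10.0.0.9"},
]
for row in distance_matrix(samples):
    print([round(d, 2) for d in row])
# [0.0, 0.33, 1.0]
# [0.33, 0.0, 1.0]
# [1.0, 1.0, 0.0]
```

For n samples this computes n(n-1)/2 distances instead of n squared.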
Clustering Algorithm (DBSCAN)
A clustering algorithm groups elements according to the distances between them. Put simply, elements that are close to one another, as measured over their feature sets, end up in the same group. So how can we apply this to malware analysis?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. It defines a cluster as a maximal set of density-connected points, divides sufficiently dense regions into clusters, and can discover clusters of arbitrary shape in spatial data that contains noise.
We have already calculated the distance matrix of the set, so we know the distance between each element and every other element.
This algorithm requires two inputs:
1. Minimum density (min_pts)
2. Scanning radius (eps)
As shown in the figure, the key concepts are:
1. Density: the density of a point is the number of points inside the region centered on that point with radius eps.
2. Core point: a point is a core point if the number of points inside the region of radius eps around it is at least the minimum number of points (min_pts). The points in such regions form a cluster.
3. Boundary point: a point that has fewer than min_pts points within radius eps, but has at least one core point in its neighborhood.
4. Noise point: a point that is neither a core point nor a boundary point.
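These definitions can be sketched as a small classifier over a distance matrix. The function name and the five-sample matrix are hypothetical. The sketch also assumes that a point counts itself as a neighbor and that "at least min_pts" is the threshold for a core point.

```python
def classify_points(dist, eps, min_pts):
    """Label each point 'core', 'boundary', or 'noise' from a distance matrix.

    Assumption: a point counts itself as a neighbor (distance 0), and a
    core point needs at least min_pts neighbors within eps.
    """
    n = len(dist)
    neighbors = [
        [j for j in range(n) if dist[i][j] <= eps] for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("boundary")  # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical distance matrix for five samples
dist = [
    [0.0, 0.2, 0.2, 0.6, 1.0],
    [0.2, 0.0, 0.2, 0.4, 1.0],
    [0.2, 0.2, 0.0, 0.6, 1.0],
    [0.6, 0.4, 0.6, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 0.0],
]
print(classify_points(dist, eps=0.4, min_pts=3))
# ['core', 'core', 'core', 'boundary', 'noise']
```

The fourth sample has only two points within eps (itself and the second sample), so it is not a core point, but the second sample is, which makes it a boundary point.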
From these definitions it follows that each cluster contains at least min_pts points. The larger the eps radius, the looser the restriction on cluster formation: samples at a moderate distance may be placed in the same cluster, and a former noise point may become a boundary point (or even a core point). The values of min_pts and eps may therefore need to be adjusted in our malware analysis.
On the other hand, note that a noise point does not belong to any cluster. How we treat noise points depends on whether we want to find new variants or extend existing malware clusters.
Now let's take a few more examples.
In the previous example, we have obtained the distance between elements through "URLs" as the feature attribute. Next we will use IPs for clustering.
From our web threat intelligence page, we found an interesting IP address: 1.234.83.146
Through our threat intelligence API, we found that the IP address 1.234.83.146 is associated with 353 of our samples.
Step 1: use the Jaccard distance to calculate the distance matrix from all the samples and their IP lists.
The Jaccard distance ranges from 0 (the two samples have the same IP list) to 1 (the two samples have no IPs in common). We then use the DBSCAN algorithm to compute the clusters. Here eps is 0.5 and min_pts is 1, which means that a noise point also becomes a cluster of its own.
Note that the figure is not the output of the DBSCAN algorithm but a spatial representation of the distance matrix; further computation on the matrix is needed to obtain the clusters. Even so, the figure clearly shows the two clusters that DBSCAN finds. Full information on the clusters: https://deepvizblog.files.wordpress.com/2016/01/cluster1.docx
We then change the parameter values to see how the result changes: eps is set to 0.1 and min_pts to 1.
As expected, the DBSCAN algorithm produces more clusters.
Details: https://deepvizblog.files.wordpress.com/2016/01/cluster2.docx
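The effect of eps can be reproduced on a toy example. The following is a minimal pure-Python DBSCAN sketch over a precomputed distance matrix, not the implementation Deepviz uses. The function name is our own, and the four-sample matrix is invented for illustration and unrelated to the 353 samples above.

```python
def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN over a precomputed distance matrix.

    Returns one label per point: 0, 1, 2, ... for clusters, -1 for noise.
    Assumption: a point's neighborhood includes the point itself.
    """
    n = len(dist)
    neighbors = [
        [j for j in range(n) if dist[i][j] <= eps] for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # noise for now; may later become a boundary point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a boundary point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

# Hypothetical distance matrix for four samples
dist = [
    [0.0, 0.05, 0.4, 1.0],
    [0.05, 0.0, 0.4, 1.0],
    [0.4, 0.4, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
for eps in (0.5, 0.1):
    labels = dbscan(dist, eps=eps, min_pts=1)
    print(eps, labels, "clusters:", len(set(labels) - {-1}))
# 0.5 [0, 0, 0, 1] clusters: 2
# 0.1 [0, 0, 1, 2] clusters: 3
```

With min_pts set to 1 every point is a core point, so no sample is ever labeled as noise. Shrinking eps splits the third sample off into a cluster of its own.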
These are only simple examples, but clustering algorithms can be used to identify many more new samples. Whether or not a sample belongs to a malware cluster, as long as you extract the right data, clustering can help you identify and isolate malware.
The following are some samples isolated by our threat intelligence platform:
Deepviz Threat Intel
290f3104a53cc5776d3ad8b562291680
4fa660009cba0b3401f71439b885e067
6368cc6d88c559bb27da31ef251a52a1
27bd99bf75491447fb3383d1f54f4e40
c7b19f8250b70ae5bd46590749bf9660
8f68c9a4a1769f57651a6a26b0ea2cf9