Summary:
To make similarity detection between source programs more efficient, the paper proposes a similar-code detection algorithm based on sequence clustering. The algorithm first splits the source code into fragments according to its structure, then applies a partial code transformation to each fragment, and finally clusters the resulting sequences using the weighted edit distance as the similarity measure. The clusters give the similar code fragments, which is how similar functions in the source programs are detected.
Application:
Detecting similar code within a single program helps simplify that program. The same technique can find similar functions across multiple programs and can be used for plagiarism detection.
Steps:
1. Extract the functional segments from the source code.
2. Use the weighted edit distance as the similarity metric.
3. Cluster the source code sequences to find similar function segments in the source program, and thereby detect programs with similar functionality.
1. Problem definition:
size :
number of characters :
edit distance between two sequences S1 and S2 : The minimum number of operations (insertions, deletions, and replacements) required to convert S1 into S2.
Signature :
eg
Signature Distance :
1.1 Weighted edit distance
The paper assigns different weights to different types of characters in a sequence, setting each weight according to how freely that type of character can be substituted. A character that represents a keyword receives a larger weight, while a character that represents a variable receives a smaller one, because a variable is far more likely to be substituted than a keyword.
To cluster code fragments, the first problem is how to define a distance between two fragments and how to decide that two pieces of code are similar. The paper defines a weighted edit distance (WED) to measure the distance between 2 sequences and hence their similarity.
eg
Take the symbol sequence s1 = asdfght and another symbol sequence s2 = abcfght. The 2 sequences differ in only 2 symbols (positions 2 and 3). Here b, c, and d are keyword-type symbols, while s is an operator-type symbol.
Pos:         1 2 3 4 5 6 7
Sequence s1: a s d f g h t
Sequence s2: a b c f g h t
The weighted edit distance between the 2 sequences S1 and S2 is 3 + 2 = 5.
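The example can be reproduced with a small dynamic-programming sketch. The notes do not state the per-type weights or the exact cost rule, so everything here is an assumption chosen only so that the total matches 3 + 2: the operator symbol s gets weight 3, the keyword symbols b, c, d get weight 2, all other symbols get weight 1, and every operation costs the weight of the symbol it removes (deletion, substitution) or adds (insertion). The function name is hypothetical.

```python
from typing import Mapping


def weighted_edit_distance(s1: str, s2: str,
                           weight: Mapping[str, int],
                           default: int = 1) -> int:
    """Cost of turning s1 into s2 with per-symbol operation weights."""
    def w(ch: str) -> int:
        return weight.get(ch, default)

    m, n = len(s1), len(s2)
    # dist[i][j] = cheapest way to turn s1[:i] into s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + w(s1[i - 1])   # delete s1[i-1]
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + w(s2[j - 1])   # insert s2[j-1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = s1[i - 1], s2[j - 1]
            # Assumption: a substitution costs the weight of the replaced symbol.
            substitute = 0 if a == b else w(a)
            dist[i][j] = min(dist[i - 1][j] + w(a),          # delete a
                             dist[i][j - 1] + w(b),          # insert b
                             dist[i - 1][j - 1] + substitute)
    return dist[m][n]


# Assumed weights, picked to match the example total (not taken from the paper).
WEIGHTS = {"s": 3, "b": 2, "c": 2, "d": 2}
print(weighted_edit_distance("asdfght", "abcfght", WEIGHTS))  # 5
```

With this cost rule the distance is not symmetric (converting s2 into s1 would cost 2 + 2 = 4), so a real implementation would need the paper's actual substitution rule.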
1.2 Similarity between sequences
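Similarity of the 2 sequences S1 and S2:
Sim(S1, S2) = 1 - 5/(7 + 7) = 9/14
where 5 is the weighted edit distance from the example above and 7 is the length of each sequence.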
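The arithmetic can be checked with exact fractions. This is a minimal sketch that assumes the general rule is the one suggested by the worked numbers, that is, one minus the distance divided by the sum of the two sequence lengths; the function name is hypothetical.

```python
from fractions import Fraction


def similarity(distance: int, len1: int, len2: int) -> Fraction:
    """1 - distance / (len1 + len2), kept exact with Fraction."""
    total = len1 + len2
    if total == 0:
        return Fraction(1)  # two empty sequences are treated as identical
    return 1 - Fraction(distance, total)


print(similarity(5, 7, 7))  # 9/14
```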
Smin is called the minimum similarity threshold between sequences.
To simplify the calculation, a maximum distance threshold Dmax is derived from Smin.
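One way to obtain such a distance threshold, assuming the similarity is normalized by the summed sequence lengths as in the example, is to rearrange the similarity condition. The notes do not give the paper's own derivation, so this is a sketch under that assumption:

```latex
\begin{aligned}
\mathrm{Sim}(S_1, S_2) = 1 - \frac{\mathrm{WED}(S_1, S_2)}{|S_1| + |S_2|} \ge S_{\min}
\;\iff\;
\mathrm{WED}(S_1, S_2) \le (1 - S_{\min})\,(|S_1| + |S_2|) = D_{\max}
\end{aligned}
```

For instance, with a hypothetical Smin = 0.6 and two sequences of length 7, Dmax = 0.4 × 14 = 5.6, so the example pair with distance 5 would count as similar, which agrees with 9/14 ≈ 0.643 ≥ 0.6.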
Property 1:
The signature distance reflects the difference in the letter composition of the sequences.
The weighted edit distance reflects the difference in the weights of the insert, delete, and replace operations between S1 and S2.
Computing the signature distance takes O(m + n) time, far less than the O(mn) time needed for the weighted edit distance. By Property 1, similarity can therefore be tested in two stages: an initial filter based on the signature distance between the sequences, followed by a final judgment based on the weighted edit distance.
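The two-stage test can be sketched as follows. The definition of the signature is missing from these notes, so the one used here is an assumption: the signature is the multiset of symbol counts, and the signature distance is the larger of the two one-sided count surpluses, which never exceeds the unit-cost edit distance. A plain unit-cost edit distance stands in for the weighted one, and all names are hypothetical.

```python
from collections import Counter
from typing import Callable


def signature_distance(a: str, b: str) -> int:
    """Cheap O(m + n) bound based only on symbol counts (assumed definition)."""
    ca, cb = Counter(a), Counter(b)
    surplus_a = sum((ca - cb).values())  # symbols a has that b lacks
    surplus_b = sum((cb - ca).values())  # symbols b has that a lacks
    return max(surplus_a, surplus_b)


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance in O(mn); stand-in for the weighted version."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # delete ch_a
                           cur[j - 1] + 1,                 # insert ch_b
                           prev[j - 1] + (ch_a != ch_b)))  # substitute
        prev = cur
    return prev[-1]


def is_similar(a: str, b: str, d_max: float,
               exact: Callable[[str, str], float] = edit_distance) -> bool:
    """Stage 1: reject on the cheap bound. Stage 2: confirm with the exact distance."""
    if signature_distance(a, b) > d_max:
        return False
    return exact(a, b) <= d_max


print(is_similar("asdfght", "abcfght", d_max=2))  # True
print(is_similar("aaaa", "bbbb", d_max=2))        # False
```

The second call never reaches the quadratic computation, because the count-based bound already exceeds the threshold.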
2 Similar-sequence clustering algorithm based on weighted edit distance
2.1 Fragment extraction of source program code
To detect similar code in a source program, the source code must first be split into fragments and its function segments extracted. A multilevel segmentation method divides the source code under different criteria (classes, functions, and statements), and the segments at the same level are then processed together. Because function segments make up the body of the source code, only the second-level function fragments need to be compared when looking for similar code; comparing at the class or statement level is of little value.
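As an illustration of function-level extraction, the sketch below uses Python's standard `ast` module on Python source. This is an assumption made for the example only: the notes do not say which language the paper targets or how its segmenter works, and `extract_functions` is a hypothetical name.

```python
import ast
from typing import List, Tuple


def extract_functions(source: str) -> List[Tuple[str, str]]:
    """Return (name, source text) for every function, in source order."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            text = ast.get_source_segment(source, node)
            found.append((node.lineno, node.name, text))
    found.sort(key=lambda item: item[0])  # ast.walk is not in source order
    return [(name, text) for _, name, text in found]


SAMPLE = (
    "class A:\n"
    "    def f(self):\n"
    "        return 1\n"
    "\n"
    "def g(x):\n"
    "    return x + 1\n"
)
for name, text in extract_functions(SAMPLE):
    print(name)
    print(text)
```

Methods inside classes are returned alongside top-level functions, so every function-level fragment ends up in one flat list ready for the later steps.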
2.2 Partial conversion of program code
Convert the keyword-type symbols in the code to numbers, then calculate the weighted edit distance.
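A minimal sketch of such a conversion, again using Python source and the standard `tokenize` and `keyword` modules. Only the keyword-to-number step comes from the notes; collapsing identifiers, numbers, and strings to the placeholders "ID", "NUM", and "STR" is an added assumption, as are the numeric codes and the function name.

```python
import io
import keyword
import tokenize

# Hypothetical numbering: each keyword's index in Python's keyword list.
KEYWORD_CODES = {kw: i for i, kw in enumerate(keyword.kwlist)}

# Layout tokens carry no information for the comparison.
SKIPPED = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}


def to_symbols(fragment: str) -> list:
    """Turn a code fragment into a sequence of comparable symbols."""
    symbols = []
    for tok in tokenize.generate_tokens(io.StringIO(fragment).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME:
            if keyword.iskeyword(tok.string):
                symbols.append(KEYWORD_CODES[tok.string])  # keyword -> number
            else:
                symbols.append("ID")                       # any identifier
        elif tok.type == tokenize.NUMBER:
            symbols.append("NUM")
        elif tok.type == tokenize.STRING:
            symbols.append("STR")
        else:
            symbols.append(tok.string)                     # operators as-is
    return symbols


print(to_symbols("def f(x):\n    return x + 1\n"))
```

Because identifiers collapse to one placeholder, two fragments that differ only in variable names or literal values map to the same symbol sequence.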
2.3 Clustering of symbol sequences
Determine whether a sequence is similar to another sequence
Determine whether a sequence is similar to a cluster
Clustering of symbolic sequences
3 Experiments and analysis
Functions make up a large proportion of a source program and determine what it does, so if the functions are similar, the source programs that contain them can generally be judged similar. As Table 4 shows, the algorithm judges code similarity correctly.
To summarize:
The first part of the paper presents two concepts:
1. The weighted edit distance and the signature distance
2. Sequence similarity, computed from the weighted edit distance and the signature distance: a threshold Smin is set, and two sequences are considered similar when their similarity exceeds Smin
The second part covers the similar-sequence clustering algorithm based on the weighted edit distance:
1. Fragment extraction of source code (the segmentation levels are classes, functions, and statements; the function level is the one mainly used, since function segments are the body of the source program)
2. After segmentation, the keywords in each fragment are converted to numbers for easier processing
3. Clustering of symbol sequences:
A. Determine whether a sequence is similar to another sequence (using the distances defined above)
B. Determine whether a sequence is similar to a cluster (a cluster consists of multiple sequences; the sequence is compared with each sequence in the cluster in turn, and if any of them is similar to it, the sequence is judged similar to the cluster)
C. Clustering of symbol sequences:
The paper uses an improved density-based clustering algorithm to cluster the sequences. First, a cluster list is created to hold the resulting clusters. Then, for each sequence Si stored in the map, the similarity between Si and the clusters in the list is evaluated, until every sequence Si has been processed. There are 3 possible cases:
1) If no cluster in the list is similar to Si, a new cluster containing Si is created and added to the list.
2) If exactly one cluster in the list is similar to Si, Si is added to that cluster and the cluster's feature values are updated.
3) If more than one cluster in the list is similar to Si, those clusters are merged into a new cluster, Si is added to it, the merged clusters are removed from the list, and the new cluster is added.
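The three cases can be sketched as a single pass over the sequences. This is a simplified illustration, not the paper's implementation: the per-cluster feature values are omitted, the similarity test is passed in as a function, and the toy predicate used in the demo (equal length, at most one differing position) is a hypothetical stand-in for the weighted edit distance test.

```python
from typing import Callable, Iterable, List


def cluster_sequences(sequences: Iterable[str],
                      is_similar: Callable[[str, str], bool]) -> List[List[str]]:
    clusters: List[List[str]] = []
    for seq in sequences:
        # A sequence is similar to a cluster if it is similar to any member.
        hits = [c for c in clusters
                if any(is_similar(seq, member) for member in c)]
        if not hits:
            clusters.append([seq])                  # case 1: start a new cluster
        elif len(hits) == 1:
            hits[0].append(seq)                     # case 2: join the one match
        else:
            merged = [m for c in hits for m in c]   # case 3: merge all matches
            merged.append(seq)
            clusters = [c for c in clusters
                        if not any(c is h for h in hits)]
            clusters.append(merged)
    return clusters


def toy_similar(a: str, b: str) -> bool:
    """Hypothetical stand-in: same length and at most one differing position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


print(cluster_sequences(["aaaa", "aabb", "aaab", "zzzz"], toy_similar))
# [['aaaa', 'aabb', 'aaab'], ['zzzz']]
```

In the demo, "aaaa" and "aabb" start out in separate clusters; "aaab" is similar to both, which triggers the merge in case 3.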
In short, the paper outlines how to find similar code among the code fragments of a source program.
"Code similarity paper Notes" A similarity code detection algorithm based on sequence clustering