Hustoj--sim Introduction

Source: Internet
Author: User

Speaking of SIM, I have to say I am truly impressed by Dick Grune, the veteran professor behind it. I had always thought my grasp of C and compiler principles was decent; only now do I realize how terrible ignorance is. A college course is like the introduction to a book, and few people can claim to know a book well after reading only its introduction. All the more so when the subject we are studying has been developing for decades or even centuries.

SIM is a utility program for detecting similarity between computer programs; in later development it was extended to detect similarity between texts as well. Its core idea is essentially the LCS (longest common subsequence) that we so often talk about.

It is useful for detecting plagiarism across large batches of programming assignments. The software was built as part of a project to develop tools that assist in teaching computer science.

1. Introduction

Computer science education began more than 30 years ago with the establishment of the first CS departments, yet its practitioners have lagged behind their colleagues in other CS subfields in developing software tools for their own trade. Today there is a wealth of excellent software systems for software designers, circuit designers, network administrators, numerical analysts, and digital artists. In contrast, computer science educators still rely on traditional tools and techniques: their primary medium for communicating ideas has only recently shifted to course web pages, delivery remains chalk and blackboard, and the main means of evaluation still depends on human labor.

We believe that this lack of software tools is not due to any lack of diligence among CS educators, but to the absence of a mature underlying theory. Research in computational learning theory, which investigates the limits and costs of different learning strategies, began only in the mid-1980s. Similarly, research in program checking, which determines the correctness of programs by measuring their behavior [3, 4, 5], dates from the late 1980s. The deep and surprising results from these developing areas will certainly help improve the practice of computer science education in the near future.

However, not all software development for CS education has to wait for further foundational work. Consider the task of grading assignments and exams in lower-level programming courses. Such courses tend to have high enrollment and a large number of programming assignments with simple solutions, all of which must be evaluated for correctness, style, and uniqueness. A software tool for this task needs to examine the structure of programs and can benefit from the mature theory of string algorithms.

In this article, we describe the design and implementation of a program called SIM, which measures the structural similarity between two C programs as a basis for evaluating correctness, style, and uniqueness. The program is a direct application of string alignment techniques, which have recently been used to detect similar substrings in DNA [6, 12] (see also [7, 13, 14]). Given two programs, SIM first uses a standard lexical analyzer to build their parse trees. Treating each parse tree as a string, SIM then aligns the two strings by inserting spaces so as to obtain a maximal common subsequence. The degree of similarity between the two programs is expressed as a number between 0.0 and 1.0. In its current implementation, SIM runs in O(s^2) time, where s is the maximum height of the parse trees. Most of SIM is implemented in C++, and its graphical user interface is implemented in Tcl/Tk.

The first application we built on SIM is plagiarism detection. Experiments on real data show that SIM can detect renamed identifiers, reordered statements and functions, and added, removed, or modified whitespace and comments. For small projects, SIM is fairly fast: a pairwise comparison of 56 programs with an average length of 3415 bytes has a reported run time of just 3.5.

The remainder of this article is organized as follows: Section 2 describes the string alignment algorithm underlying SIM, Section 3 describes the design and implementation of SIM, Section 4 describes an experiment and its results, and Section 5 discusses future improvements.

We close this section with a review of related work on similarity detection. The Unix diff command also uses string alignment to detect similarity between two files, but it works with lines as the basic unit of text, unlike SIM, which works with parse trees; diff cannot detect systematic renaming or reordered text. Baker's dup program finds maximal exact matches, longer than a given threshold, between all pairs of programs; it detects systematic (parameterized) renaming and the reordering of large blocks, but not inserted comments or small-scale statement reordering. Aiken's MOSS project, developed for plagiarism detection at UC Berkeley, also examines program structure, but its underlying comparison algorithm has not yet been made public. Other plagiarism-detection tools [10, 11] are based on heuristics that measure the frequencies of statistical variables and identifiers, and may therefore have a high false-positive rate.

2. String matching algorithm

In this section we introduce the string alignment algorithm that SIM uses to compare two programs. An alignment of two strings s and t (possibly of different lengths) is obtained by inserting spaces so that their lengths become equal. Note that many alignments are possible. For example, here are two alignments of the strings masters and stars, with a hyphen marking each inserted space:

masters    masters

sta--rs    --stars

Consider the following scoring for a pair of aligned characters: a match scores m, a mismatch scores d, and a gap scores g, where m, d, and g are chosen values. The score of a block is the sum of its individual character scores, and the score of an alignment is the score of its highest-scoring block. In the example above, with m = 1, d = -1, and g = -2, "rs/rs" and "sters/stars" are the highest-scoring blocks of the first and second alignments, so the alignment scores are 2 and 3, respectively. The alignment score measures the similarity between two nearly identical objects and is related to edit distance. String alignment is suited to inexact matching and is widely used in computational biology to detect relationships between DNA strands [6, 12].
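The block scoring above can be checked with a short Python sketch. It assumes the two alignments are masters over sta--rs and masters over --stars, with a hyphen marking each inserted space (these gap positions are inferred from the block scores quoted above), and it reads "block" as any contiguous run of aligned columns. The function names are illustrative and are not part of SIM.

```python
def column_scores(top, bottom, m=1, d=-1, g=-2, gap="-"):
    """Score each aligned column: m for a match, d for a mismatch, g for a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    scores = []
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            scores.append(g)
        elif a == b:
            scores.append(m)
        else:
            scores.append(d)
    return scores


def alignment_score(top, bottom, **weights):
    """Score of the highest-scoring contiguous block of columns.

    The running sum is reset to zero whenever it goes negative, so an
    alignment with no positive block scores 0 (an assumption of this sketch).
    """
    best = current = 0
    for score in column_scores(top, bottom, **weights):
        current = max(0, current + score)
        best = max(best, current)
    return best


print(alignment_score("masters", "sta--rs"))  # 2, from the block rs/rs
print(alignment_score("masters", "--stars"))  # 3, from the block sters/stars
```

The three leading mismatches and two gaps in the first alignment never contribute, because any block that includes them scores lower than the two trailing matches alone.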

The score of the best alignment between two strings is the highest score over all possible alignments. This value can be computed with dynamic programming. Formally, given two strings s and t, define D(i, j) as the score of the optimal alignment between the prefixes s[1..i] and t[1..j]. The value we want is max D(i, j) over 1 <= i <= |s| and 1 <= j <= |t|.

A recurrence relation over D, built from the match, mismatch, and gap scores m, d, and g, gives a way to compute the solution.

The boundary conditions are D(1, i) = i*g and D(j, 1) = j*g. The first row and first column of the matrix D are initialized from these boundary conditions, and the remaining entries are then computed from left to right and top to bottom. This works because D(i, j) depends only on D(i-1, j-1), D(i-1, j), and D(i, j-1). Computing the optimal alignment takes O(|s| |t|) time. Because only two rows are needed at any point in the computation, the space complexity is O(max(|s|, |t|)).
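A minimal Python sketch of the two-row computation follows. Since the recurrence is not written out in this text, the sketch assumes the textbook form D(i, j) = max(D(i-1, j-1) + score(s[i], t[j]), D(i-1, j) + g, D(i, j-1) + g), with score equal to m on a match and d otherwise, and it places the boundary at the empty prefix (row and column 0). SIM's actual recurrence may differ, and the function name is hypothetical.

```python
def alignment_scores(s, t, m=1, d=-1, g=-2):
    """Fill D row by row, keeping only the previous and current rows.

    Returns (whole, best): `whole` is D(|s|, |t|), the score of aligning the
    two complete strings; `best` is the largest D(i, j) with i, j >= 1, or
    None when either string is empty.
    """
    prev = [j * g for j in range(len(t) + 1)]      # boundary row: D(0, j) = j*g
    best = None
    for i in range(1, len(s) + 1):
        curr = [i * g] + [0] * len(t)              # boundary column: D(i, 0) = i*g
        for j in range(1, len(t) + 1):
            pair = m if s[i - 1] == t[j - 1] else d
            curr[j] = max(prev[j - 1] + pair,      # align s[i] with t[j]
                          prev[j] + g,             # s[i] against a gap
                          curr[j - 1] + g)         # t[j] against a gap
            if best is None or curr[j] > best:
                best = curr[j]
        prev = curr                                # the older row is no longer needed
    return prev[len(t)], best


print(alignment_scores("abc", "abc"))  # (3, 3)
print(alignment_scores("ab", "a"))     # (-1, 1)
```

In the second call the whole-string score is pulled down by the unmatched b, while the best prefix pair (a against a) still scores 1, which is why taking the maximum over all D(i, j) can differ from reading only the last cell.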

3. Design and implementation

SIM is implemented in 1780 lines of C++ code and 393 lines of Tcl/Tk code. The design of SIM is shown as a block diagram in Figure 1.

Each input C program is first passed through a lexical analyzer, which produces a compact representation in the form of a stream of integers called tokens. The lexical analyzer is generated automatically by the Unix tool flex from a proper subset of the C grammar, so SIM can easily be adapted to other languages. Each token represents an arithmetic or logical operator, a punctuation symbol, a C macro, a keyword, a numeric or string constant, a comment, or an identifier. For example, the statement for (i = 0; i < Max; i++) is replaced by the token stream tkn_for tkn_lparen tkn_id_i tkn_equals tkn_zero ...
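SIM's real lexer is generated by flex. Purely as an illustration, the same idea can be sketched in Python with a single regular expression. The token names follow the tkn_ pattern of the example above, but the exact set (tkn_semi, tkn_less, tkn_incr, and so on) and the very small slice of C covered here are assumptions of this sketch.

```python
import re

# Order matters: comments must be tried before the "/" operator.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/|//[^\n]*"),
    ("NUMBER", r"\d+"),
    ("WORD", r"[A-Za-z_]\w*"),
    ("OP", r"\+\+|--|<=|>=|==|!=|[-+*/<>=;(){},]"),
    ("SPACE", r"\s+"),
    ("OTHER", r"."),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)

KEYWORDS = {"for", "while", "if", "else", "return", "int"}   # partial, illustrative
OP_NAMES = {"(": "lparen", ")": "rparen", "=": "equals",
            ";": "semi", "<": "less", "++": "incr"}          # hypothetical names


def tokenize(source):
    """Turn C-like source text into a list of token names."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SPACE":
            continue                                  # whitespace is discarded
        if kind == "COMMENT":
            tokens.append("tkn_comment")              # one token per comment
        elif kind == "NUMBER":
            tokens.append("tkn_zero" if text == "0" else "tkn_number")
        elif kind == "WORD":
            tokens.append(f"tkn_{text}" if text in KEYWORDS else f"tkn_id_{text}")
        elif kind == "OP":
            tokens.append("tkn_" + OP_NAMES.get(text, "op"))
        else:
            tokens.append("tkn_other")
    return tokens


print(" ".join(tokenize("for (i = 0; i < Max; i++)")))
```

A real implementation would emit integers rather than names; strings are used here only to keep the output readable.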

Tokens for keywords and special symbols are predefined, while tokens for identifiers are assigned dynamically using a symbol table shared by all programs, so that each occurrence of an identifier is replaced by an integer. Each comment, no matter how long, is replaced by a single fixed token, and all whitespace is discarded. The purpose of tokenization is to reduce the code to a compact parse tree, which typically discards a large part of the data and removes irrelevant information, such as whitespace and comments, before the comparison is performed.
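One way to picture the shared symbol table is the Python sketch below. The numeric codes, the (kind, text) token format, and the class and function names are all hypothetical. The sketch also assumes that a given identifier name always maps to the same integer, which is the natural reading of a shared table but is not spelled out above.

```python
# Hypothetical fixed codes for keywords, punctuation, and the comment token.
PREDEFINED = {"for": 1, "(": 2, ")": 3, "=": 4, ";": 5, "COMMENT": 6}
FIRST_IDENTIFIER_CODE = 1000   # keeps identifier codes apart from predefined ones


class SymbolTable:
    """Hands out integer codes to identifier names on first sight."""

    def __init__(self):
        self._codes = {}

    def code_for(self, name):
        if name not in self._codes:
            self._codes[name] = FIRST_IDENTIFIER_CODE + len(self._codes)
        return self._codes[name]


def encode(tokens, table):
    """Convert (kind, text) pairs into an integer stream."""
    stream = []
    for kind, text in tokens:
        if kind == "id":
            stream.append(table.code_for(text))
        elif kind == "comment":
            stream.append(PREDEFINED["COMMENT"])   # length of the comment is ignored
        else:
            stream.append(PREDEFINED[text])
    return stream


table = SymbolTable()          # one table shared by every program in the batch
first = encode([("kw", "for"), ("punct", "("), ("id", "i"), ("punct", "="),
                ("id", "n"), ("comment", "/* loop bound */")], table)
second = encode([("id", "n"), ("id", "k")], table)
print(first)    # [1, 2, 1000, 4, 1001, 6]
print(second)   # [1001, 1002]
```

Because the table is shared, the name n receives the code 1001 in both programs, so an unchanged identifier can be recognized as an exact match during alignment.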

Once the two source files have been tokenized, the token stream of the second program is divided into sections, each representing a module of the original program. Each module is then aligned separately with the token stream of the first program. This technique allows SIM to detect similarity even when the modules of a program have been shuffled. The actual alignment uses the following scoring scheme:

• A match involving two identifier tokens scores 2; any other match scores 1;

• A gap scores -2;

• A mismatch involving two identifiers scores 0; any other mismatch scores -2.

The rationale for this scheme is straightforward: a match between identifiers is considered significant and scores highest, while a name mismatch between two identifiers is considered much less serious than a structural mismatch such as one between an identifier and an operator. The total alignment score is accumulated from the individual block scores and then normalized to a number between 0.0 and 1.0 by dividing by the average of the scores obtained when the first and the second program are each aligned with themselves. That is: S = (2 * score(P1, P2)) / (score(P1, P1) + score(P2, P2)).
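The scoring scheme and the normalization can be combined in a small Python sketch. It makes several simplifying assumptions: tokens are (kind, text) pairs, each program is treated as a single module, the alignment uses the textbook global recurrence, and a negative result is clamped to 0.0 so that S stays in the stated range. None of these details are confirmed by the text, and the function names are hypothetical.

```python
GAP = -2


def pair_score(a, b):
    """Score one aligned pair of (kind, text) tokens under the scheme above."""
    both_identifiers = a[0] == "id" and b[0] == "id"
    if a == b:
        return 2 if both_identifiers else 1
    return 0 if both_identifiers else -2


def score(p, q):
    """Global alignment score of two token lists, using two rows of the DP table."""
    prev = [j * GAP for j in range(len(q) + 1)]
    for i in range(1, len(p) + 1):
        curr = [i * GAP] + [0] * len(q)
        for j in range(1, len(q) + 1):
            curr[j] = max(prev[j - 1] + pair_score(p[i - 1], q[j - 1]),
                          prev[j] + GAP,
                          curr[j - 1] + GAP)
        prev = curr
    return prev[len(q)]


def similarity(p1, p2):
    """S = 2*score(P1, P2) / (score(P1, P1) + score(P2, P2)), clamped at 0.0."""
    denominator = score(p1, p1) + score(p2, p2)
    if denominator == 0:
        return 0.0
    return max(0.0, 2 * score(p1, p2) / denominator)


original = [("id", "i"), ("op", "="), ("num", "0"), ("op", ";")]   # i = 0;
renamed = [("id", "k"), ("op", "="), ("num", "0"), ("op", ";")]    # k = 0;
print(similarity(original, original))  # 1.0
print(similarity(original, renamed))   # 0.6
```

Renaming the variable costs only the 2 points of the lost identifier match (the mismatch itself scores 0), so the renamed statement still scores 3 out of a possible 5.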

Tcl/Tk provides the graphical user interface, which lets the user choose the reference file and the set of files to compare, and displays and prints the results as a bar chart. Figures 2 and 3 show a sample output screen and the file selection dialog. Red, yellow, and green bars indicate decreasing similarity to the reference program, according to user-specified thresholds. SIM can also use gnuplot to display the results in a separate window.

4. Experiment Settings and Results

In the following simple experiment, we took a simple C program and, wherever possible, changed all variable and function names, removed comments, and reversed the order of adjacent statements. The two versions are shown in Figures 4 and 5. In this example the SIM score is 0.4, which is generally high enough to flag a program in a lower-level course for closer inspection.

To interpret the scores, we found it more reliable to plot all SIM scores for each set of programs than to rely on a fixed set of thresholds. To illustrate this, we tested SIM on a set of 36 actual assignment submissions from a lower-level computer science course recently offered in our department. Each program was used in turn as the reference program against which the others were compared, and the scores were plotted as a bar chart. The results of one such comparison are shown in Figure 6.

In the chart, the leftmost bar is the reference program compared with itself, with a score of 1.0. The two rightmost programs (37 and 38) are versions of the reference program that the authors modified in order to lower the score, by deleting or moving comments, rearranging blocks, renaming variables, and adding redundant pairs of braces "{}". As expected, the five programs that the instructor had identified as modifications of the reference program produced markedly high scores.

Statistics for the entire set of 630 comparisons, covering file size, token array size, and run time, are shown in Figure 7. Program sizes are in bytes and times are in seconds. The platform was a 200 MHz Pentium workstation running Debian GNU/Linux 1.3.
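The figure of 630 is consistent with comparing each unordered pair among the 36 submissions exactly once, since 36 * 35 / 2 = 630. A quick check in Python, using made-up file names:

```python
from itertools import combinations
from math import comb

# Hypothetical file names standing in for the 36 submissions.
programs = [f"prog{k:02d}.c" for k in range(1, 37)]

pairs = list(combinations(programs, 2))   # every unordered pair, once
print(len(pairs))    # 630
print(comb(36, 2))   # 630
```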

We repeated the test on many sets of assignments collected from different lower-level courses, and in the end the instructors found no plagiarism. A typical comparison against one reference program is shown in Figure 8, and the results for an entire set are shown in Figure 9.

5. Discussion

We designed and implemented a software tool that measures the similarity between two C programs and can be used to detect plagiarism in lower-level computer science programming courses. We found the tool sufficient to detect common modifications such as renamed variables, reordered statements and functions, and added or removed comments and whitespace. Much work remains to be done in building tools that assist in evaluating assignments.


As mentioned earlier, SIM can be extended to evaluate not only style and uniqueness but also the correctness of programs, making it a tool that could be used for interactive testing in introductory programming courses.

A known technique can reduce the running time of the underlying string alignment algorithm by a factor of two, and we plan to implement it in a future version of SIM. However, it does not seem necessary to run SIM on every possible pair of programs in order to detect plagiarism. Because the number of cheating incidents is usually very small, it is enough for SIM to help us find the group of programs with the highest scores. If there is no evidence of cheating among those programs, the instructor does not need to check the others. We are currently working on sub-quadratic algorithms for finding the highest-scoring pairs.
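The idea of inspecting only the highest-scoring pairs can be sketched in Python as follows. This is the brute-force version that scores every pair, not the sub-quadratic method mentioned above. The similarity function is a stand-in based on difflib from the standard library, not SIM's own score, and the submission names and contents are invented.

```python
import heapq
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_similarity(x, y):
    """Placeholder score in [0, 1]; a real run would call SIM's measure instead."""
    return SequenceMatcher(None, x, y).ratio()


def top_pairs(submissions, k=3, similarity=stand_in_similarity):
    """Return the k highest-scoring (score, name_a, name_b) triples."""
    scored = ((similarity(submissions[a], submissions[b]), a, b)
              for a, b in combinations(sorted(submissions), 2))
    return heapq.nlargest(k, scored)


submissions = {            # invented token strings, one per student
    "alice": "abcdef",
    "bob": "abcdef",
    "carol": "uvwxyz",
}
for score, a, b in top_pairs(submissions, k=1):
    print(f"{a} vs {b}: {score:.2f}")   # alice vs bob: 1.00
```

An instructor would look at the returned pairs first and stop if none of them shows evidence of copying.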
