General clearing technology for computer viruses
Http://www.williamlong.info/archives/456.html
This is the second article I ever published in a magazine, ten years ago. It appeared in the third issue of Microcomputer World in 1997.
Back then I was fascinated by computer viruses. When I started college in 1992, I had no real idea what a computer virus was. I only found them mysterious and profound: it seemed incredible that a program could spread from one computer to another.
That really was a long time ago. The operating system we used was DOS 3.31, and the language we learned was True BASIC. The viruses of that era were rather entertaining. The ball virus, for example, showed a small moving dot that bounced back whenever it hit the edge of the screen. The raindrop virus made dots or characters fall down the screen like rain. The 64/Bloody virus displayed bloody text on the screen.
The university teachers were also very interested in viruses. I remember one teacher pointing at the assembly listing of the ball virus and telling us that no more than ten people in China could write a virus like it.
Viruses were not very destructive then. Seen from today, they were even cute. They were written by experts, probably to show off their programming skills. Almost all of them were written in assembly language, which is very close to binary machine language. Writing programs in it was a nightmare: I once wrote an assembly program of more than eight hundred lines and it left me dizzy. So I know that writing a virus in assembly is not easy, especially since some viruses carried their own encryption and mutation routines, and those authors really did have something to show off. By now I have almost forgotten assembly language. The reason is simple: programs written at the machine level are very hard to maintain, so naturally fewer people use it.
In the years after I graduated, I gradually lost interest in these things, mainly because the harm viruses did to the computer industry soured my mood. The CIH virus set a bad precedent: users' data and even their hardware were maliciously damaged. After that, each virus seemed more shameless than the last. Now that the Internet is widespread and scripting languages are popular, the threshold for writing Trojans and viruses is very low, and even a novice (cainiao) can write one. Today's Trojans and viruses are all malignant: they steal passwords, push advertising, or tamper with Internet Explorer. There was even a reported "success story" of hao123 relying on Trojans. I find the writing of such viruses disgusting. I can only pity those virus writers. Let them go on writing viruses for their poor ideals and aspirations. I have more important things to do.
The following is my paper. It has no practical value today, because I wrote it on the premise that "most viruses are not malignant", to the point of executing the virus code in order to restore the original program. That is no longer feasible. What kind of world is it now? Who would dare to run a virus!
General clearing technology for computer viruses
Abstract: Taking the prevalent file-infecting viruses as its starting point, this paper analyzes and introduces a technique for removing computer viruses that is based on the structural characteristics of executable files.
Keywords: virus; infection; load and execute (EXEC); program segment prefix (PSP); process
Chapter 1 Introduction
Computer viruses have a long history and have spread widely since the mid-to-late 1980s. According to statistics, there are now more than 5,000 kinds of computer viruses in the world, and dozens of new ones appear every month. The development of viruses drives, to some extent, the development of anti-virus products, and older anti-virus techniques are outdated and powerless against new viruses. Detection products identify specific viruses by their signatures, so variants and unknown viruses are very difficult for them to handle. Removal builds on detection, so at present only known viruses can be removed. This passive approach always lags behind virus technology. The role such products play in containing viruses should not be dismissed, but their weaknesses are increasingly exposed. A new generation of open anti-virus technology has emerged in response. It describes virus structure with a unified data structure, so users can add viruses they have analyzed themselves, which makes upgrading more flexible. The approach is flexible and efficient against the next generation of polymorphic viruses that use anti-tracing and encryption techniques, and such broad-spectrum systems will gradually become the trend for anti-virus products.
The following chapters introduce a general anti-virus technique based on the structural features of executable files.
Chapter 2 Mechanism of computer viruses
First, let's take a look at the structural characteristics and working principles of computer viruses.
The structure of computer viruses determines the characteristics of computer viruses, which are roughly summarized as follows:
(1) Computer viruses are executable programs.
Like a legitimate program, a computer virus can be stored and executed, directly or indirectly, but it is an illegitimate program. It can hide in executable programs and data files, where it is hard to detect. When a virus runs, it competes with legitimate programs for control of the system.
(2) Computer viruses are infectious.
Because the word "virus" comes from biology, infection is an essential feature of computer viruses. Infectiousness is the primary criterion for deciding whether a program is a virus. Infection is the virus's reproduction mechanism: once a virus has attached itself to a program in the system, it starts infecting other programs as soon as that program is run. In this way the virus soon spreads through the entire computer system.
(3) Computer viruses are latent.
Latency is a virus's ability to live parasitically on other media. A well-designed virus can hide in legitimate files for weeks, months, or even years, infecting other systems without being discovered. Latency complements infectiousness: the better a virus hides, the longer it survives in the system and the wider the infection spreads.
(4) Computer viruses can be triggered.
A computer virus generally has a trigger condition, which either starts its propagation or activates its display or damage routine under certain circumstances. A trigger is essentially a conditional control: the virus can activate and attack the system at a moment chosen by its designer.
(5) Computer viruses are targeted.
No computer virus infects every kind of computer system. Some target IBM PCs and compatibles, some the Apple Macintosh, and some UNIX. The vast majority target IBM PCs and compatibles running MS-DOS.
(6) Computer viruses are derivable.
A computer virus is an executable program that embodies its designer's design. It is also made up of several modules, such as installation, propagation, and destruction, and these modules are easily modified by the author or by imitators, producing a virus that differs from the original. [1]
Computer viruses can be divided into the following types by the way they attach to their host:
(1) source-code viruses; (2) intrusive viruses; (3) operating-system viruses; (4) shell viruses.
Types (1) and (2) attack source files and object files written in high-level languages and are rare on microcomputers. Type (3) is the boot-sector virus, which mainly attacks the boot sector of the machine; it is relatively simple to diagnose and treat, and can easily be removed with tools such as DEBUG or NU. The viruses discussed in this article are shell viruses, currently the most common type on PCs, which attack executable files.
A shell virus wraps itself around the host program without modifying the original program. Shell viruses are easy to write and common, but troublesome to diagnose and treat.
Shell viruses have the following features:
They copy themselves to the periphery of the target file (that is, the end of the file); they do not modify the original, normal file [2]; at run time the virus enters memory first, and after the virus has executed, control is transferred back to the entry point of the original file, which then runs normally (this is what makes the virus hard to notice).
On a DOS-based PC, shell viruses mainly attack two kinds of executable file: COM files and EXE files. The COM file structure is relatively simple, so COM files are easier to disinfect. EXE files are more complex but more flexible. They suit programs larger than 64 KB and are more compatible with future operating systems, so they are widely used.
Chapter 3 Removing COM viruses
Chapter 3 clear com viruses
I. Implementation Principle
A COM file is a DOS executable that contains a raw binary image. Its structure is simple and it loads very quickly. The whole program occupies a single segment, so all of its code must fit in less than 64 KB, and its entry point is CS:100h. When DOS loads a COM file, it first builds a 100h-byte program segment prefix (PSP, created by DOS as the interface between the user program and the command line) in memory, and then loads the entire file directly above the PSP without any relocation. The four segment registers DS (data segment), CS (code segment), SS (stack segment), and ES (extra segment) are initialized to the segment address of the PSP, and finally control is handed to CS:100h, as shown in Table 1.
Table 1: COM file loading and execution
Address      Content         Registers
XXXX:0000    PSP             ← CS, DS, ES, SS
XXXX:0100    Program code    ← IP
             Data
             Stack           ← SP
Most viruses that parasitize COM files save the first few bytes of the file and replace the first instruction with "JMP virus entry" so that the virus runs first; some viruses instead attach themselves to the front of the file. After the virus has run, it restores the host program to its original state and returns to CS:100h, for example with a JMP FAR instruction, so that the host program remains consistent with the PSP.
In other words, after the virus has run, it restores and runs the original file so that it can keep spreading unnoticed. Once it has restored all the parameters of the original file, it transfers control to CS:100h. The criterion for the true entry point of a COM file is therefore the last time execution reaches CS:100h (CS = current PSP segment address, IP = 100h).
We can therefore imagine a tracker that checks, after every instruction, whether this condition is met. When it is, the code at CS:100h is the image of the original file, and because a COM file has only one segment, the memory image is identical to the contents of the disk file. Writing the code at CS:100h back to the original file disables the virus. If the length of the virus is known, the now useless code at the end of the file can also be cut off, which removes the virus physically.
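The recovery idea above can be illustrated with a toy model. The sketch below is only an assumption-laden simulation in Python: it contains no real virus code, the "virus" is a run of NOP bytes, and the names toy_infect, toy_run_until_entry and recover are hypothetical. It assumes the simplest infection pattern described above (a 3-byte near JMP planted at the start, with the saved header bytes kept at the end of the appended body).

```python
JMP_NEAR = 0xE9        # opcode of the 3-byte near jump planted at the file start
LOAD_OFFSET = 0x100    # a COM image is loaded at CS:100h, right above the PSP

def toy_infect(host: bytes, body: bytes = b"\x90" * 16) -> bytes:
    """Append a fake body and patch the first 3 bytes with a JMP to it."""
    saved = host[:3]                        # header bytes the "virus" keeps
    rel = (len(host) - 3) & 0xFFFF          # rel16 is counted from the next instruction
    patched = bytes([JMP_NEAR]) + rel.to_bytes(2, "little") + host[3:]
    return patched + body + saved

def toy_run_until_entry(infected: bytes) -> bytearray:
    """Stand-in for the tracker: memory as it looks when control returns
    to CS:100h, i.e. after the saved header bytes have been put back."""
    segment = bytearray(LOAD_OFFSET) + bytearray(infected)   # fake PSP + image
    segment[LOAD_OFFSET:LOAD_OFFSET + 3] = infected[-3:]
    return segment

def recover(segment: bytearray, host_len: int) -> bytes:
    """Write-back step: take host_len bytes starting at offset 100h."""
    return bytes(segment[LOAD_OFFSET:LOAD_OFFSET + host_len])
```

Here host_len plays the role of "original length = infected length minus virus length"; in the toy model the virus length is len(body) + 3.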
II. Implementation Scheme
Building this conceptual tracker is the core problem and the main difficulty. In principle the single-step trap interrupt (INT 1) meets the tracker's needs exactly, but because current viruses widely use anti-tracing techniques, such a tracker is still hard to implement.
There is a convenient alternative: the DOS EXEC function (INT 21h function 4Bh, load and execute). This function has an interesting property: after the loaded program has finished, it restores all registers to their state before the call but does not clear the memory. The method is easy to implement and easy to use, but it places certain requirements and restrictions on the files to be processed.
The concrete implementation is as follows: save the interrupt vector table, allocate a block of memory, and call the DOS EXEC function to execute the infected COM file. After execution, rewrite the interrupt vector table to remove any virus left resident in memory, and then write the code at memory offset 100h back to the file, keeping the file length unchanged. Finally, if the length of the virus is known, cut the virus code off the end of the file.
This technique can deal with any file-infecting virus, but it places a requirement on the COM file: the file must not modify the contents of its code segment while it runs. Files that are not encrypted or compressed generally satisfy this condition.
III. Using the EXEC function of the DEBUG.COM debugger
An even simpler method is to use DEBUG. Load the file with the L command and run it with the G command. When the EXEC function completes, the registers are exactly as they were before the run, so the W command writes the image back to disk and the virus is removed. (The whole process takes three commands.)
Chapter 4 Removing EXE viruses
Chapter 4 cleanup of EXE Virus
I. Implementation Principle
EXE files are the most common and most flexible executable files under DOS and are widely used. Their structure, however, is much more complex than that of COM files. An EXE file consists of two parts: the file header and the load module. The file header consists of a format area and a relocation table. The load module is the program code and begins right after the file header. When DOS loads an EXE file, it first creates a program segment prefix (PSP) at the bottom of the memory block and then reads the load module into the designated area of memory (above the PSP). DS and ES are initialized to the PSP segment. CS, IP, SS, and SP are taken from the format area of the file header and adjusted by the relocation value. DOS then fixes up the code and data according to the relocation items, and finally passes control to the target program at CS:IP, as shown in Table 2.
Table 2: EXE file loading and execution
Address      Content    Registers
XXXX:0000    PSP        ← DS, ES
XXXX:0100    Data
             Code       ← CS:IP
             Stack      ← SS:SP
In an EXE file, a virus is usually attached to the end of the host file. Because it must gain control of the program first, it has to modify the file header. In general, restoring the correct file header is enough to disinfect the file.
When an EXE file is loaded, the system determines the first instruction to execute from the CS:IP values in the file header. A virus therefore only needs to change the CS:IP pointer to make sure it runs first. In fact, most viruses modify only the file header and leave the original file contents untouched, which makes it possible to restore the original program code completely.
From this analysis we can see that an infected EXE file has clearly separated layers, with the virus forming an outer layer at the end of the file and CS:IP pointing into the virus body. Whatever measures the virus takes, it eventually restores all the real parameters of the host program in memory and returns to the original program with a far jump. We can then extract the correct CS:IP and SS:SP values directly, use them to repair the file header, and strip off the outer virus code, which completely restores the original EXE file.
The problem is how to find the correct entry point of the EXE file, which is a complicated matter. Viruses for DOS, however, are almost always written in assembly language, which gives them some distinctive traits. After analyzing many samples, we see that in general, when an EXE virus reaches the real start of the host file, CS and DS have changed, DS holds the PSP segment address, and SS:SP has been initialized. For viruses that do not modify the relocation table, CS:IP should point into the relocated area.
We can therefore build another tracker that checks these conditions after every instruction. When they are met, the code at CS:IP is the image of the original file, and the EXE file header can be correctly restored from the contents of the CPU registers, which disinfects the file.
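The header repair step can be sketched as follows. This is a minimal Python illustration, not the program described in this paper. It assumes the standard MZ header layout (SS at offset 0Eh, SP at 10h, IP at 14h, CS at 16h, all little-endian words) and that the load module starts at PSP + 10h paragraphs; the function name and its parameters are hypothetical.

```python
import struct

def restore_exe_header(header: bytes, psp_seg: int,
                       cs: int, ip: int, ss: int, sp: int) -> bytes:
    """Write the captured entry registers back into an MZ header.

    The run-time CS and SS are absolute segments; the header stores them
    relative to the load segment, taken here as PSP + 10h paragraphs.
    """
    if header[:2] != b"MZ":
        raise ValueError("not an EXE header")
    load_seg = psp_seg + 0x10
    out = bytearray(header)
    struct.pack_into("<HH", out, 0x0E, (ss - load_seg) & 0xFFFF, sp)  # SS, SP
    struct.pack_into("<HH", out, 0x14, ip, (cs - load_seg) & 0xFFFF)  # IP, CS
    return bytes(out)
```

After the virus body is cut off, the file-size fields in the header (offsets 02h and 04h) would also need updating; that step is left out of this sketch.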
II. Implementation Scheme
As with COM files, this theoretical tracker is very hard to make work in practice, so we turn once more to our old friend, the EXEC function.
MS-DOS function 4Bh has two important subfunctions: 4B00h, load and execute, and 4B01h, load but do not execute (an undocumented function). 4B00h is used to run all executable programs, and 4B01h is used by the DEBUG debugger for loading (for the parameters of function 4B01h, see Appendix 1).
The key is to find the first instruction of the original program, that is, to interrupt execution at that instruction, which we can do by replacing the instruction with an interrupt instruction ourselves. To achieve this, we only need to emulate function 4B00h using function 4B01h.
Specifically, when function 4B00h (load and execute) is requested, the program is first loaded with function 4B01h, which sets up all the parameters. At that point the memory image should look like Table 3.
Table 3: Memory image of an infected EXE program
Address      Content
             Original code area
CS:IP →      Virus code area
Assuming that the first instruction of the virus is at the very front of the virus code, the memory image of the original program extends from PSP:100h to CS:IP (the first instruction of the virus). Fill this entire region with the byte CDh. Every instruction in the original program thereby becomes the interrupt instruction INT CDh (the breakpoint interrupt INT 3 is not used because most viruses are able to disable the single-step and breakpoint interrupts). In other words, wherever execution enters the original program, the first instruction it meets is INT CDh. Once the virus code has run and returns to the original program with its far jump, the software interrupt INT CDh is triggered, and the INT CDh service routine can obtain the real initial CS:IP and SS:SP values for the EXE file header.
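The reason the fill byte works from any entry address is that CDh is the opcode of INT imm8, so the byte pair CD CD decodes as INT CDh wherever decoding starts. The short Python sketch below only models this on a byte buffer; the function names are hypothetical and nothing is executed.

```python
def fill_with_int_cd(memory: bytearray, start: int, end: int) -> None:
    """Overwrite memory[start:end] with CDh bytes."""
    memory[start:end] = b"\xCD" * (end - start)

def decodes_as_int_cd(memory: bytes, offset: int) -> bool:
    """CDh is the INT imm8 opcode, so the pair CD CD means INT CDh."""
    return bytes(memory[offset:offset + 2]) == b"\xCD\xCD"
```

Note that the last filled byte is followed by the first virus byte, so only entry points up to end - 2 are guaranteed to decode this way in the model.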
In addition, the modified INT 21h handler must be reentrant, because some viruses (such as the New Century virus) return to the host by loading the original program a second time. In that case the memory has to be filled twice and 4B01h is called twice.
This method is far more efficient and accurate than manual disinfection by stepping through the code with DEBUG and similar tools. It can strip the virus shell from files infected with known or unknown viruses. Unlike RCOPY and other shell programs, it restores the EXE program completely and changes nothing in the original EXE file: the recovered code should be identical to the original. Because it works by peeling off shells, it can also clean files infected by several viruses at once, by stripping the layers one at a time from the outside in until the innermost host file is fully restored.
Chapter 5 Conclusion
The principle behind the virus removal method described in this article is quite distinctive. The implementation given here cannot remove every computer virus, but it offers an idea: where earlier anti-virus algorithms could each kill only one virus, a single algorithm can kill a whole class of viruses. Based on this idea, I have written a general anti-virus program in C and assembly language and tested it against a large number of viruses. Viruses vary widely, of course, so a general-purpose anti-virus program needs extensive testing, and its actual effectiveness still has to be verified from many angles. I only hope this idea can play a positive role in the anti-virus field.
References
1. Li Xiangyu, <Computer Virus Overview>, IDG International Data Group, 1990
2. Ray Duncan, <Advanced MS-DOS Programming>, Electronic Industry Press, 1988
Ray Duncan, Advanced MS-DOS Programming, Microsoft Press, 1988
3. Ray Duncan, <The MS-DOS Encyclopedia>, Electronic Industry Press, 1990
Ray Duncan, The MS-DOS Encyclopedia, Microsoft Press, 1990
Appendix: The MS-DOS EXEC function
Translated by William Long, 1996, from Ray Duncan, The MS-DOS Encyclopedia
The MS-DOS program loader, which brings COM and EXE files from disk into memory and executes them, can be invoked by any program through the MS-DOS EXEC function (INT 21h function 4Bh, load and execute). The DOS command interpreter COMMAND.COM uses EXEC to load its external commands, such as CHKDSK, and other applications. Many popular commercial programs, such as databases and word processors, use EXEC to run helper programs (such as spelling checkers) or to load another copy of COMMAND.COM, which lets the user run a helper or shell out to MS-DOS commands without losing the current working context.
When a program (the parent process) calls EXEC to load another program (the child process), the parent can pass information to the child in a set of strings called the environment block, in the command line, and in two file control blocks. The child also inherits the parent's handles for the MS-DOS standard devices and for any other files or devices the parent has opened (unless they were opened with the "noninheritance" option). Any operation the child performs on an inherited handle, such as a seek or file input and output, also affects the file pointer associated with the parent's handle. A child can in turn load another program, and so on, until system memory is exhausted.
Because MS-DOS is not a multitasking operating system, the child process has control of the system until it finishes, and the parent process is suspended in the meantime. This kind of operation is also called synchronous execution. When the child terminates, the parent regains control and can call another system function (INT 21h function 4Dh) to retrieve the child's return code and to find out whether the child terminated normally, because of a critical hardware error, or because the user pressed Ctrl-C.
Besides loading child processes, EXEC can also be used to load subprograms or overlays for applications written in assembly language or a high-level language whose run-time library does not include its own overlay manager. Such overlays cannot run on their own; most of them need "helper" routines or data in the application's main segment.
The EXEC function exists only in MS-DOS version 2.0 and later. In versions 1.x, the parent process can use INT 21h function 26h to create the program segment prefix for a child, but it must load, relocate, and execute the code itself, without help from the operating system.
How EXEC works
How exec works
When the EXEC function receives a request to execute a program, it first tries to locate and open the specified program file. If the file is not found, EXEC fails immediately and returns an error code to the caller.
If the file exists, EXEC opens it, determines its size, and examines the first block of the file. If the first two bytes of the block are the ASCII characters MZ, the file is treated as an EXE load module, and the sizes of the program's code, data, and stack segments are obtained from the file header. Otherwise the entire file is treated as an absolute load image (a COM program). The actual file extension (COM or EXE) is ignored in this test.
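The test described above depends only on the first two bytes of the file, not on its name. A minimal Python sketch of that check follows; the function name load_mode is hypothetical.

```python
def load_mode(first_block: bytes) -> str:
    """Classify a program file the way the text describes:
    a leading 'MZ' signature means EXE, anything else is a COM image."""
    return "EXE" if first_block[:2] == b"MZ" else "COM"
```

For example, a file named PROG.COM whose contents begin with MZ would still be classified as EXE by this check.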
At this point the amount of memory the program needs is known. If enough memory is available, EXEC allocates two blocks: one for the new program's environment block and one for the program's code, data, and stack segments. Different kinds of program receive different amounts. A COM program gets all the free memory in the system (unless the memory has already become fragmented), while the amount given to an EXE program is controlled by two fields in the file header, MINALLOC and MAXALLOC, which are set by LINK.
EXEC then copies the parent's environment block into the child's environment block and builds a program segment prefix (PSP) at the bottom of the child's memory block. The command line and the default file control blocks are copied into the PSP. The previous contents of the terminate address (INT 22h), Ctrl-C (INT 23h), and critical error (INT 24h) interrupt vectors are saved in the new PSP, and the terminate address vector is updated so that control returns to the parent when the child ends or fails.
Next, the child's code and data are read from the disk file into the program memory block above the new PSP. If the child is an EXE file, the relocation table in the file header is used to fix up the segment references in the program so that they reflect its actual load address.
Finally, EXEC sets up the CPU registers and stack for the program and passes control to it. The entry point of a COM file is offset 100h in the program memory block (the first byte after the PSP). The entry point of an EXE file is specified in the file header and can be anywhere in the program.
When EXEC is used to load an overlay rather than a child process, its work is simpler. For overlays, EXEC does not try to allocate memory or build a PSP or environment block. It simply loads the contents of the file at the address specified by the caller and performs any necessary relocation (if the overlay has an EXE file header), using a segment value that is also supplied by the caller. EXEC does not pass control to the newly loaded code but returns to the requesting program, which is responsible for calling the overlay at the appropriate location.
Using EXEC to load a program
Use exec to load the program
When a program loads and executes another program, it must perform the following steps:
1. Make sure there is enough free memory to hold the child's code, data, and stack.
2. Set up the information required by EXEC and by the child.
3. Call the MS-DOS EXEC function to run the child.
4. Recover from the child's termination and examine its termination and return codes.
Allocating memory
MS-DOS normally allocates all available memory to a COM or EXE program when it is loaded. The uncommon exceptions are an EXE program that was linked with the /CPARMAXALLOC switch or modified with EXEMOD, and a program area that has been broken into smaller blocks by previously installed resident data or code. Before a program can load another program, it must therefore release all the memory that its own code, data, and stack do not need.
The excess memory is released by calling the MS-DOS function that resizes a memory block (INT 21h, function 4Ah). For this call, the ES register holds the segment address of the parent's PSP and the BX register holds the number of paragraphs of memory the program itself needs. If the parent is a COM program and shrinks its allocation to less than 64 KB, it must first move its stack to a safe area.
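The value placed in BX is a count of 16-byte paragraphs, not bytes, so a program that knows its own size in bytes has to round up. A small Python sketch of that arithmetic, with a hypothetical function name and example sizes:

```python
PARAGRAPH = 16  # one paragraph is 16 bytes

def paragraphs_needed(size_in_bytes: int) -> int:
    """Round a byte count up to whole 16-byte paragraphs (the unit of BX)."""
    return (size_in_bytes + PARAGRAPH - 1) // PARAGRAPH
```

The byte count would include the 256-byte PSP as well as the program's code, data, and stack, since the block being resized starts at the PSP.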
Preparing the EXEC parameters
To load and execute a program, the caller must supply two parameters to the EXEC function:
1. The address of the child program's pathname.
2. The address of a parameter block.
The parameter block contains, in order, the addresses of the information to be passed to the child.
Program name
The pathname of the child program must be an unambiguous, zero-terminated (ASCIIZ) file specification (no wildcard characters). If no path is included, the program is searched for in the current directory. If no drive is given, the default drive is used.
Parameter block
The parameter block includes four data item addresses:
1. Environment Block
2. Command Line
3. Two default file control blocks (FCBs)
The space in the parameter block for the environment block pointer is only two bytes and holds a segment address. This is sufficient because the environment block is always paragraph-aligned (its address is always divisible by 16). A value of 0000h means that the child inherits the parent's environment unchanged. The remaining three addresses are double-word addresses in standard Intel format, with the offset in the low word and the segment in the high word.
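The layout described above (one segment word followed by three offset:segment pointers, 14 bytes in total) can be written out with Python's struct module. This is only an illustration of the byte layout; the function names and the example addresses are hypothetical.

```python
import struct

def far_ptr(segment: int, offset: int) -> bytes:
    """Intel far pointer: offset in the low word, segment in the high word."""
    return struct.pack("<HH", offset, segment)

def exec_param_block(env_seg: int, cmd_tail: tuple,
                     fcb1: tuple, fcb2: tuple) -> bytes:
    """Build the parameter block; each tuple is (segment, offset)."""
    return (struct.pack("<H", env_seg)   # 0000h = inherit parent's environment
            + far_ptr(*cmd_tail)
            + far_ptr(*fcb1)
            + far_ptr(*fcb2))
```

For instance, exec_param_block(0, (0x1234, 0x0080), (0x1234, 0x005C), (0x1234, 0x006C)) yields a 14-byte block whose first word is 0000h.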
Environment Block
An environment block always starts on a paragraph boundary and contains a series of zero-terminated (ASCIIZ) strings in the following form:
NAME=variable
The whole set of strings is terminated by one additional 0 byte.
If the environment block pointer in the parameter block passed to EXEC is 0, the child simply receives a copy of the parent's environment block. Alternatively, the parent can supply a segment pointer to a different or extended set of strings. In MS-DOS 3.0 and later, EXEC also appends the complete pathname of the child program to the child's environment block. The maximum size of an environment block is 32 KB, so a large amount of information can be passed to a program in this way.
The initial (or master) environment block is set up by the command processor (usually COMMAND.COM) when the system is started or restarted. COMMAND.COM writes the results of the PATH, SHELL, PROMPT, and SET commands into the master environment block; the first two usually have default values. For example, if an MS-DOS 3.2 system is started from drive C, the AUTOEXEC.BAT file contains no PATH command, and the CONFIG.SYS file contains no SHELL command, the master environment block contains the following two strings:
PATH=
COMSPEC=C:\COMMAND.COM
COMMAND.COM uses these strings to find "external" commands to run and to locate its own executable file on disk so that it can reload its transient part when necessary. When a PROMPT string is present (as the result of an earlier PROMPT or SET PROMPT command), COMMAND.COM uses it to tailor the prompt shown to the user.
Other strings in the environment block only provide information to particular programs and do not affect the operation of the operating system. For example, the Microsoft C compiler and the Microsoft object linker look for the INCLUDE, LIB, and TMP strings in the environment block to find the locations of header files, library files, and temporary files. Figure 2 shows a hexadecimal dump of a typical environment block.
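The environment block format described above is easy to reproduce byte for byte. The Python sketch below builds and parses such a block; it ignores the program pathname that MS-DOS 3.0 and later append after the final 0 byte, and the function names are hypothetical.

```python
def build_env_block(variables: dict) -> bytes:
    """Encode NAME=value strings, each ending in 0, plus one final 0 byte."""
    body = b"".join(f"{name}={value}".encode("ascii") + b"\0"
                    for name, value in variables.items())
    return body + b"\0"

def parse_env_block(block: bytes) -> dict:
    """Decode strings until the empty string that marks the end of the block."""
    result = {}
    pos = 0
    while pos < len(block) and block[pos] != 0:
        end = block.index(0, pos)                 # next terminating 0 byte
        name, _, value = block[pos:end].decode("ascii").partition("=")
        result[name] = value
        pos = end + 1
    return result
```

Building the block for the two-string example above gives the bytes PATH=, 0, COMSPEC=C:\COMMAND.COM, 0, 0.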
Command Line
The command line passed to the child consists of one byte giving the length of the rest of the command line, followed by an ASCII string that ends with an ASCII carriage return (0Dh). The carriage return is not counted in the length byte. The command line can contain any switches, filenames, and other parameters that the child can inspect to influence its operation. It is copied to offset 80h in the child's PSP.
When COMMAND.COM uses EXEC to run a program, the command line it passes contains everything the user typed except the program name and any redirection parameters. I/O redirection is handled inside COMMAND.COM and shows up in the behavior of the standard device handles that the child inherits. Other programs that use EXEC to run a child must perform any necessary redirection themselves and supply a suitable command line, so that the child behaves as if it had been loaded by COMMAND.COM.
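The command line format described above (length byte, text, carriage return that is not counted) can be shown in a few lines of Python. The function name is hypothetical, and the 126-character limit is an assumption based on the 128 bytes available from offset 80h to the end of the PSP.

```python
def command_tail(args: str) -> bytes:
    """Encode a command tail: length byte, ASCII text, then 0Dh (not counted)."""
    data = args.encode("ascii")
    if len(data) > 126:        # assumed limit: 128 bytes minus length byte and CR
        raise ValueError("command tail too long")
    return bytes([len(data)]) + data + b"\r"
```

For example, the tail " /V A:" is encoded as the byte 06h, the six characters, and a final 0Dh.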
Default file control blocks
The two default FCBs pointed to by the EXEC parameter block are copied to offsets 5Ch and 6Ch in the child's PSP.
Few applications today use FCBs for file and record access, because FCBs do not support the hierarchical directory structure. Some programs, however, inspect the default FCBs as a quick way to pick out the first two switches or other command line parameters. To be transparent to the child, the parent should therefore do what COMMAND.COM does, which is easy with the MS-DOS parse filename function (INT 21h, function 29h).
If the child does not need the two FCBs, the corresponding addresses in the parameter block should be initialized to point to two empty FCBs in the application's memory. An empty FCB consists of one 0 byte followed by 11 bytes of ASCII spaces (20h).
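As a concrete instance of the empty FCB just described, the 12 bytes can be written in Python as follows (the constant name is hypothetical):

```python
# One 0 byte (drive field) followed by 11 ASCII spaces (name and extension).
EMPTY_FCB = b"\x00" + b"\x20" * 11
```

Both FCB pointers in the parameter block may refer to the same empty FCB image, since EXEC only copies from it.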
Running a child process
Once the parent has set up the necessary parameters, it calls EXEC through INT 21h with the registers set as follows:
AH = 4Bh
AL = 00h (EXEC subfunction: load and execute the program)
DS:DX = segment:offset of the program pathname
ES:BX = segment:offset of the parameter block
After the software interrupt returns, the parent must test the carry flag to find out whether the child actually ran. If carry is clear, the child was loaded successfully and given control. If carry is set, the EXEC function failed and the error code returned in AX can be examined to find the cause. The usual causes are:
The specified file was not found.
The file was found, but there was not enough memory to load it.
Other, less common errors indicate problems that affect the whole system (for example, damaged MS-DOS disk files or memory). Under MS-DOS 3.0 and later, INT 21h function 59h (get extended error information) can be called to learn more about why EXEC failed.
In general, an invalid address inside the EXEC parameter block, or an invalid address for the parameter block itself, does not cause EXEC to return an error, but it can have unwanted consequences for the child.