- Determining which region of a file contains the code to disassemble is not as simple as it sounds. Instructions are often mixed with data, and the two must be told apart. In the most common case, disassembling an executable, the file conforms to a common executable file format, such as the Portable Executable (PE) format used by Windows or the Executable and Linking Format (ELF) used by many UNIX systems.
- Given the start address of an instruction, the next step is to read the value stored at that address (or file offset) and perform a table lookup to map the binary opcode to its assembly language mnemonic. Depending on the complexity of the instruction set, this can be trivial, or it may require additional steps, such as identifying any prefixes that modify the instruction's behavior and determining how many operands the instruction requires. For instruction sets with variable-length instructions, such as Intel x86, additional instruction bytes may need to be fetched to fully disassemble a single instruction.
- Once the instruction has been fetched and any required operands decoded, its assembly language equivalent must be formatted and written to the disassembly listing. Several assembly language output syntaxes exist. For x86, for example, the two main ones are Intel syntax and AT&T syntax.
- After outputting an instruction, move on to the next instruction and repeat the process until every instruction in the file has been disassembled.
- A number of algorithms exist for deciding where to start disassembly, how to choose the next instruction to disassemble, how to distinguish code from data, and how to tell when the last instruction has been disassembled. Linear sweep (also called linear scanning) and recursive descent are two of the most important disassembly algorithms.
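
The steps above can be sketched as a toy linear-sweep disassembler. This is only an illustration under stated assumptions, not how any particular tool is implemented: it understands a handful of real 32-bit x86 opcodes (`nop`, `ret`, `int3`, `push`/`pop` of a register, `mov` register, immediate) plus the `0x66` operand-size prefix, and it assumes the whole input buffer is code. The names `OPCODES`, `decode_one`, `format_insn`, and `linear_sweep` are hypothetical helpers introduced for this example.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

# Lookup table for a tiny subset of x86: opcode byte -> (mnemonic, operand kind).
OPCODES = {0x90: ("nop", None), 0xC3: ("ret", None), 0xCC: ("int3", None)}
for r in range(8):  # the low 3 bits of these opcodes select the register
    OPCODES[0x50 + r] = ("push", "reg")
    OPCODES[0x58 + r] = ("pop", "reg")
    OPCODES[0xB8 + r] = ("mov", "reg_imm")


def decode_one(code, offset):
    """Decode one instruction at offset; return (length, mnemonic, operands)."""
    start = offset
    regs, imm_fmt = REG32, "<I"
    # Operand-size prefix: switch to 16-bit registers and immediates.
    if code[offset] == 0x66 and offset + 1 < len(code):
        regs, imm_fmt = REG16, "<H"
        offset += 1
    entry = OPCODES.get(code[offset])
    # Unknown opcode, or a prefix this toy cannot apply: emit one data byte.
    if entry is None or (offset != start and entry[1] is None):
        return 1, "db", [("imm", code[start])]
    mnemonic, kind = entry
    operands = []
    if kind is not None:
        operands.append(("reg", regs[code[offset] & 7]))
    offset += 1
    if kind == "reg_imm":
        # Variable-length instruction: fetch the extra immediate bytes.
        size = struct.calcsize(imm_fmt)
        if offset + size > len(code):  # truncated instruction
            return 1, "db", [("imm", code[start])]
        operands.append(("imm", struct.unpack_from(imm_fmt, code, offset)[0]))
        offset += size
    return offset - start, mnemonic, operands


def format_insn(mnemonic, operands, syntax="intel"):
    """Render operands (stored destination-first) in Intel or AT&T syntax."""
    if mnemonic == "db":
        directive = "db" if syntax == "intel" else ".byte"
        return f"{directive} {operands[0][1]:#04x}"
    if syntax == "att":  # AT&T: source first, % on registers, $ on immediates
        parts = [f"%{v}" if k == "reg" else f"${v:#x}" for k, v in reversed(operands)]
    else:
        parts = [v if k == "reg" else f"{v:#x}" for k, v in operands]
    return mnemonic if not parts else f"{mnemonic} {', '.join(parts)}"


def linear_sweep(code, base=0, syntax="intel"):
    """Disassemble from the first byte to the last, one instruction after another."""
    offset, lines = 0, []
    while offset < len(code):
        length, mnemonic, operands = decode_one(code, offset)
        lines.append(f"{base + offset:08x}: {format_insn(mnemonic, operands, syntax)}")
        offset += length
    return lines


# push ebp; mov eax, 0x2a; mov cx, 0x1234; pop ebp; ret
sample = bytes.fromhex("55 b8 2a 00 00 00 66 b9 34 12 5d c3")
for line in linear_sweep(sample, syntax="intel"):
    print(line)
```

Passing `syntax="att"` prints the same instructions with the operand order reversed and the `%` and `$` sigils, for example `mov $0x2a, %eax`. Because the sketch treats every byte as code, any data embedded in the buffer would be decoded as instructions or emitted as `db` bytes, which is the code-versus-data problem the first step describes.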
Basic disassembly algorithm