Instruction Selection Survey (Part 7, final)


6. Simulation

The last category of instruction selectors comprises those that decide which instructions to select by analyzing and comparing the effects of instructions on the target machine. An instruction whose effect matches that of some part of the input program is compatible with it, and that part can therefore be implemented by the instruction (see Figure 6.1). We refer to these designs as simulation-based. A key difference between simulation and covering-based methods is that the latter usually cannot exploit a pattern with multiple outputs when not all of those outputs are observed in the input program (that is, when the pattern does not exactly match the input graph). Simulation-based approaches, in contrast, can simply ignore such unobserved effects. In fact, several simulation-based methods internally apply covering techniques to determine compatibility.


Figure 6.1: An example of how the effect of an addition instruction can be simulated in order to perform instruction selection on an input program.

The idea of simulation can be applied directly at the assembly level, that is, by checking whether a sequence of machine instructions is equivalent to a single, more complex machine instruction. If the cost of the single instruction is less than that of the longer sequence, replacing the sequence improves the program. Such optimizers, called peephole optimizers, were first used as a post-code-generation step to improve already-emitted code, but we will see how the technique can be adapted to perform instruction selection itself.

6.1. Peephole Optimization

An early but still widely applied approach to improving code quality is to perform local inspection of the generated code in an attempt to replace inefficient instruction sequences. This technique, advocated as early as 1965 by McKeeman [171], is known as peephole optimization because at any point during its analysis the optimizer is usually confined to a very narrow window of the program. The simplest peephole optimizers examine particular linear instruction sequences in the assembly code and replace them with cheaper equivalents. The optimizer thus runs after the code has been emitted but before the linker. The first implementations, like the earliest instruction selectors, were typically handwritten and designed for one particular target architecture.
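As a concrete illustration, a minimal window-based peephole pass might scan the instruction list for known inefficient patterns and substitute cheaper equivalents, backing up after each replacement to re-examine the neighborhood. The instruction tuples and rewrite rules below are hypothetical examples, not taken from any particular optimizer:

```python
# Minimal sketch of a window-based peephole optimizer.
# Instructions are tuples such as ("add", dst, src, imm);
# the mnemonics and rules are invented for illustration.

def peephole(program):
    """Scan a small window over the instruction list, replacing
    known inefficient patterns with cheaper equivalents."""
    out = list(program)
    i = 0
    while i < len(out):
        a = out[i]
        # Rule: adding the constant 0 to a register is a no-op.
        if a[0] == "add" and a[1] == a[2] and a[3] == "#0":
            del out[i]
            i = max(i - 1, 0)       # back up to re-check neighbors
            continue
        if i + 1 < len(out):
            b = out[i + 1]
            # Rule: a store immediately followed by a load of the
            # same address can reuse the stored register.
            if a[0] == "store" and b[0] == "load" and a[2] == b[2]:
                out[i + 1] = ("move", b[1], a[1])
                i = max(i - 1, 0)
                continue
        i += 1
    return out

prog = [("store", "r1", "@x"),
        ("load", "r2", "@x"),
        ("add", "r2", "r2", "#0")]
# → [("store", "r1", "@x"), ("move", "r2", "r1")]
print(peephole(prog))
```

Note that the pass backs up after every rewrite, so replacements can cascade, mirroring how early peephole optimizers achieved longer-range improvements from purely local rules.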

In 1979 Fraser [96] demonstrated the first technique for reducing the retargeting effort of such optimizers (also described at greater length by Davidson and Fraser [62]). Instead of hard-coding knowledge of the target's assembly instructions into the optimizer, Fraser developed a peephole optimizer, called PO, that derives this knowledge from a symbolic machine description. The description defines the semantics of each instruction by stating its effect on the machine's registers. Fraser called these effects register transfers; each instruction thus has a corresponding register transfer pattern, or register transfer list (RTL). We will use the latter term throughout this report. For example, the RTL of a three-address addition instruction of the form

add rd, rs, #imm

which also sets a zero flag Z, would be

rd ← rs + imm; Z ← (rs + imm = 0)

To handle jump instructions, PO assumes that the machine has a program counter that is set whenever a jump is taken. It further assumes that the registers are not affected by events external to the program.

PO improves a given assembly program by synthesizing the combined effect of adjacent instruction pairs and then checking, through a series of string comparisons, whether a single instruction produces the same result. If such an instruction is found, it replaces the pair. The implementation is straightforward. A first pass over the input program determines the effect of each instruction; during this pass, effects with no bearing on the program's observable behavior are discarded (for example, the value of a condition flag is dead if no subsequent instruction uses it before it is overwritten). A second pass then combines the effects of lexicographically adjacent instruction pairs and searches for a single instruction with a matching effect. After a replacement, the optimizer backs up one instruction to check the combination of the new instruction with its neighbors. In this way longer sequences can be merged through cascading replacements, but this requires that an instruction exist to implement the effect of each intermediate step.
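The combination step of the second pass can be sketched as follows. Effects are represented here as simple assignment strings, and combining substitutes the value defined by the first instruction into the second; the instruction table is hypothetical, standing in for what PO derives from a symbolic machine description:

```python
import re

# Hypothetical machine description: canonical effect -> instruction.
# PO derives such knowledge from a symbolic machine description.
INSTRUCTIONS = {
    "r2 <- $m[@a + 12]": "load r2, @a+12",
}

def combine(e1, e2):
    """Synthesize the joint effect of an adjacent pair by
    substituting the value defined by e1 into e2."""
    dst, src = [s.strip() for s in e1.split("<-")]
    return re.sub(r"\b%s\b" % re.escape(dst), src, e2)

e1 = "r1 <- 12"                   # first instruction's effect
e2 = "r2 <- $m[@a + r1]"          # second instruction's effect
joint = combine(e1, e2)           # "r2 <- $m[@a + 12]"
if joint in INSTRUCTIONS:         # string comparison against the table
    print(INSTRUCTIONS[joint])    # the pair is replaced by one load
```

A real implementation must also track dead effects and handle multi-effect RTLs, but the substitute-then-look-up structure is the essence of the approach.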

Although pioneering, PO has several limitations. Its main drawback is that it can only combine two instructions at a time, and these must be lexicographically adjacent in the input. The instruction pair is also not allowed to cross a label boundary (that is, span different basic blocks). Davidson and Fraser [60] later removed the adjacency restriction by building and operating on a data-flow graph instead of working directly on the text, and they extended the combined window from pairs to triples.

Many techniques have since been devoted to improving peephole optimization; we briefly mention some of the advances made over the years. Giegerich [109] developed a formal approach that eliminates the need for a fixed-size window. Kessler [144] presented a method in which the effect combination is precomputed when the compiler is built, thus reducing compilation time. Kessler [143] later extended this to instruction windows of length n, although at considerable cost. Massalin [170] developed the Superoptimizer, which exhaustively enumerates sequences of machine instructions to find the smallest program implementing the same behavior as a given assembly program. Granlund and Kenner [118] later adapted Massalin's work to remove branches from a given assembly program. Neither approach guarantees correctness, so every proposed optimization must be checked manually. More recently, Joshi et al. [134, 135] proposed a method to help automate the writing of such optimization routines. Unlike the earlier optimizers, their algorithm finds optimal implementations of a given program using automated theorem proving and SAT solving, though at a considerable computational cost.

6.2. Instruction Selection via Peephole Optimization

In 1984 Davidson and Fraser [60] applied the ideas of peephole optimization to perform instruction selection. The idea is to first emit code using simple macro expansion and then improve it with a peephole optimizer. This offered an alternative to the tree-matching techniques that were being developed at the same time. The scheme is now known as the Davidson-Fraser approach, shown in Figure 6.2.


Figure 6.2: Overview of the Davidson-Fraser instruction selection approach (from [60]).

This instruction selector consists of two parts: an expander and a combiner. The task of the expander is to implement the intermediate representation of the input program using register transfers. A crucial invariant, called the machine invariant, is maintained: every RTL must be implementable by at least one machine instruction. The task of the combiner is to merge multiple RTLs into larger RTLs that can still each be implemented by a single machine instruction. This step is performed using the peephole optimization techniques discussed earlier. Because the code will subsequently be improved by the combiner, the expander can be kept a simple one-to-many mapper that converts each IR operation into a sequence of RTLs. For example, a memory-to-memory move

$m[@a + 8] ← $m[@a + 12]

can be expanded into the following RTLs, each containing only a single operation:

r1 ← 12

r2 ← @a + r1

r3 ← $m[r2]

r4 ← 8

r5 ← @a + r4

$m[r5] ← r3

Such mappers are typically implemented with macro expansion, a technique we already know to be simple to implement and to retarget (see Chapter 2).
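The expander for the example above can be sketched as a plain one-to-many macro mapper. The method name, the temporary-naming scheme, and the ASCII `<-` notation for ← are invented for illustration:

```python
class Expander:
    """Toy one-to-many expander: each IR operation becomes a
    sequence of RTLs containing a single operation each."""

    def __init__(self):
        self.counter = 0

    def fresh(self):
        """Allocate a fresh temporary register name."""
        self.counter += 1
        return "r%d" % self.counter

    def expand_mem_move(self, base, dst_off, src_off):
        """Expand $m[base + dst_off] <- $m[base + src_off]
        into six single-operation RTLs."""
        t1, t2, t3 = self.fresh(), self.fresh(), self.fresh()
        t4, t5 = self.fresh(), self.fresh()
        return [
            "%s <- %d" % (t1, src_off),        # load source offset
            "%s <- %s + %s" % (t2, base, t1),  # compute source address
            "%s <- $m[%s]" % (t3, t2),         # load source value
            "%s <- %d" % (t4, dst_off),        # load destination offset
            "%s <- %s + %s" % (t5, base, t4),  # compute dest. address
            "$m[%s] <- %s" % (t5, t3),         # store value
        ]

for rtl in Expander().expand_mem_move("@a", 8, 12):
    print(rtl)
```

The expander makes no attempt at efficiency; upholding the machine invariant is its only obligation, and the combiner is trusted to clean up afterwards.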

Davidson and Fraser implemented this approach in their YC compiler, which also performs common subexpression elimination at the RTL level (see Davidson and Fraser [61]), but they were not the first to do so: the strategy of first emitting inefficient code and then improving it had earlier been adopted by Auslander and Hopkins [21] and by Harrison [123]. Davidson and Fraser, however, struck a better balance between retargetability and code quality, making their approach more successful than the earlier attempts. The technique was subsequently implemented in the Zephyr/VPO system [14] developed by Appel et al., in the Amsterdam Compiler Kit (ACK) developed by Tanenbaum et al., and, most famously, in GCC (discussed in [146, 214]).

The advantage of the Davidson-Fraser approach over the other approaches discussed in this report is its extensive machine instruction support. Compared with tree covering and DAG covering, for example, an instruction selector built with the Davidson-Fraser technique more easily handles jumps and multi-output machine instructions. This is possible because the combiner can consider any combination of RTLs, potentially spanning basic block boundaries. However, some instructions may require an impractically large instruction window before they become applicable. Moreover, the register transfer lists used to model machine instructions exclude loop-based machine instructions, since RTLs have no notion of loops within an instruction. Thus, while a peephole optimizer can in theory handle very complex machine instructions, the cost of doing so in practice is usually prohibitive. It is also unclear whether this kind of instruction selector can be made to produce optimal code.

Dias and Ramsey adopted a slightly different take on the Davidson-Fraser approach. In the original scheme, the machine invariant must be upheld by every optimization routine. Dias and Ramsey instead use a recognizer that checks whether an optimization step violates the invariant (see Figure 6.3); if it does, the optimization is rejected and the code is rolled back. This simplifies the optimization routines, since they no longer need to be written to preserve the invariant themselves. Dias and Ramsey [66] showed how the recognizer can be generated automatically from declarative machine descriptions written in λ-RTL. λ-RTL, developed by Ramsey and Davidson [192], is an ML-based, high-level, strongly typed, polymorphic, purely functional register transfer language that raises the level of abstraction at which register transfer lists are written. It is part of a framework called the Computer Systems Description Languages (CSDL), which in turn belongs to the Zephyr/VPO system aimed at automating the generation of useful compiler tools. The recognizer is generated by converting the RTLs of the machine description into linearized tree patterns, so checking whether an RTL satisfies the machine invariant reduces to a tree pattern matching problem, for which efficient algorithms exist (see Chapter 3). While this limits the recognizer to reasoning about RTLs syntactically, the need for efficiency precludes more expensive analyses (recall that the recognizer must be queried for every optimization step performed).
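A minimal sketch of such a recognizer, with RTLs as nested tuples and hypothetical machine patterns using `"?"` as a wildcard (the real recognizer is generated from λ-RTL descriptions; this only illustrates the reduction to tree pattern matching):

```python
# RTLs and machine patterns as nested tuples; "?" matches any subtree.
# The pattern set below is a toy stand-in for a machine description.
MACHINE_PATTERNS = [
    ("set", "?", ("add", "?", "?")),   # e.g. a three-address add
    ("set", "?", ("mem", "?")),        # e.g. a load from memory
]

def matches(pattern, rtl):
    """Structural tree match with wildcards."""
    if pattern == "?":
        return True
    if isinstance(pattern, tuple):
        return (isinstance(rtl, tuple)
                and len(pattern) == len(rtl)
                and all(matches(p, r) for p, r in zip(pattern, rtl)))
    return pattern == rtl

def recognize(rtl):
    """The machine invariant holds for this RTL iff some
    machine pattern matches it."""
    return any(matches(p, rtl) for p in MACHINE_PATTERNS)

print(recognize(("set", "r1", ("add", "r2", "r3"))))  # True
print(recognize(("set", "r1", ("div", "r2", "r3"))))  # False
```

An optimization step that would produce the second RTL is rejected and rolled back, since no pattern (and hence no machine instruction) implements it.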


Figure 6.3: Overview of Dias and Ramsey's adaptation of the Davidson-Fraser approach (from [66]).

Dias and Ramsey [65, 193] later developed a design that reduces the manual effort of retargeting to a new target machine, such as x86 or PowerPC, and even to a new architecture family, such as register-based versus stack-based machines. The idea is to define a set of tiles for each architecture family. A tile represents a simple operation that any machine belonging to that family must support; for example, stack-based machines require tiles for push and pop operations, which are not needed on register-based machines. Instead of expanding each IR node of the input program directly into a sequence of RTLs, the expander now expands it into a sequence of tiles. Because the tile set is the same for every machine in a family, the correctness of the expander need only be proven once per family. In their work, Dias and Ramsey implemented the expander using greedy (maximal munch) tree covering (see Chapter 3). Assuming that the combiner can adequately improve the emitted code, the instruction selection problem then reduces to selecting appropriate machine instructions to implement each tile. Fortunately, Dias and Ramsey found that this process can be automated: with the machine instructions described in λ-RTL, they developed a technique that combines the RTLs of machine instructions until their joint effect is equivalent to that of a tile. The intuition is to maintain a pool of RTLs that is grown iteratively, using sequencing and algebraic laws, until all tiles have been implemented or some termination condition is reached. The latter is necessary because Dias and Ramsey proved that finding tile implementations is undecidable in general (that is, no algorithm is guaranteed to terminate having found implementations for all possible tiles).
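The search for tile implementations can be caricatured as a bounded enumeration over instruction sequences that stops when a tile's effect is reached or a depth limit expires; the bound is essential precisely because the general problem is undecidable. The toy machine below tests effect equality on a few probe inputs, a crude stand-in for the semantic reasoning Dias and Ramsey perform; all names and semantics are invented:

```python
from itertools import product

# Toy machine: each instruction's effect is a function on a single
# integer state. The instruction set is invented for illustration.
INSTRS = {
    "inc": lambda x: x + 1,    # increment
    "dbl": lambda x: x * 2,    # double
}

def implement_tile(tile_effect, max_len=3, probe=(0, 1, 2, 3)):
    """Enumerate instruction sequences of growing length, returning
    the first whose composed effect matches the tile's effect on all
    probe inputs (a bounded approximation of equivalence checking)."""
    for n in range(1, max_len + 1):
        for seq in product(INSTRS, repeat=n):
            def run(x, seq=seq):
                for name in seq:
                    x = INSTRS[name](x)
                return x
            if all(run(x) == tile_effect(x) for x in probe):
                return list(seq)
    return None                # bound reached; tile unimplemented

# Tile with effect x -> 2x + 2: implemented by "inc" then "dbl".
print(implement_tile(lambda x: 2 * (x + 1)))   # ['inc', 'dbl']
```

Testing on probe inputs can of course accept wrong sequences; the real technique grows the pool using algebraic laws rather than sampling, but the bounded, terminating search structure is the same.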

The main advantage of Dias and Ramsey's method over Davidson and Fraser's, and over many other approaches, is that machine descriptions written in λ-RTL are shorter and simpler than those of, for example, GCC. Moreover, declarative descriptions are more precise and clearer, making automatic verification of instructions possible (see Fernández and Ramsey [89] and Bailey and Davidson [22]). However, since Dias and Ramsey's work focuses on improving retargetability, it does not necessarily improve code quality. In addition, λ-RTL cannot model loops within a machine instruction.

6.3. Other Simulation-Based Approaches

In a research paper, Ganapathi et al. [108] discuss a code generator called UGEN, which tracks the effect of each IR node on a virtual U-code machine [185]. Using a predefined set of schemata, the code generator emits code with the same effect on the target machine, so the retargeting effort is isolated to rewriting the schemata. Unfortunately, little more is known about UGEN, as the paper lacks references to the work and it is not cited in the related literature.

Fraser and Wendt [95] developed a method in which the instruction selector is generated in two steps. First, an initial instruction selector is produced, consisting of a series of switch and goto statements implementing a simple macro expander that replaces each node of the input DAG with a single machine instruction. This selector is then run on a carefully selected training set. A retargetable peephole optimizer, working in tandem with the macro expander, discovers optimization opportunities in the generated code and collects trace information and statistics through function calls embedded in the instruction selector. Based on these results, the instruction selector is augmented with conditional statements and additional goto statements that directly incorporate the selected optimizations, thereby eliminating the need for a separate peephole optimizer. Moreover, because only optimizations deemed useful are incorporated, code quality is improved with minimal overhead. Wendt [235] later improved the method with a specification language that expresses both IR operations and target machine instructions as RTLs; the system can then automatically derive how to map each IR operation to machine code, a step that previously had to be done manually. This subsequently evolved into a compact language for writing such code generators (see Fraser [94]).

Hoover and Zadeck [131] developed a system called TOAST (Tailored Optimization And Semantic Translation) that attempts to generate an entire compiler framework automatically from a given declarative machine description. For instruction selection they applied an algorithm that, for each IR node in the input, exhaustively enumerates combinations of machine instructions whose joint effect is equivalent to that of the node. The effects are deemed equal if the operations performed by the selected machine instructions cover the RTL graph generated from the IR node. A heuristic constrains the search space by checking that the remaining uncovered nodes can still be covered.

6.4. Summary

In this chapter we have discussed approaches that rely on some form of simulation. These approaches typically describe machine instructions using register transfer lists (RTLs), which capture the exact effect of each instruction. Instruction selection is performed by emitting instructions whose combined effect equals that of the input program. To simplify this task, code is first emitted in a straightforward, inefficient fashion and then improved by a peephole optimizer; this scheme is commonly known as the Davidson-Fraser approach. Because the peephole optimizer merges, compares, and replaces multiple instructions with single equivalents, it handles complex instructions more readily than tree-based or DAG-covering instruction selectors do: the peephole optimizer is not restricted to modeling machine instructions as patterns of limited shape. A further consequence is that simulation-based instruction selectors also appear easier to retarget. However, the extent of these capabilities is limited by the number of instructions the optimizer can consider at a time. Moreover, in current RTL-based approaches the register transfer lists used to model machine instructions cannot capture loop-based machine instructions.

Another important disadvantage of these approaches is that none attempts to generate optimal code; should this be required, it is not clear how, or even whether, they could be extended to achieve such a goal. In addition, since code quality in Davidson-Fraser-style methods depends on the power of the peephole optimizer, these methods are difficult to apply in highly constrained target environments such as DSPs and embedded systems.

7. Conclusion

In this report we have examined and assessed the existing and available literature on instruction selection. The work divides into five categories: macro expansion, tree covering, DAG covering, graph covering, and simulation, each with its own strengths and weaknesses. Appendix C contains a diagram showing how research in each category has progressed over time. We have seen how the field began with ad hoc, handwritten, monolithic programs and how these were gradually replaced, to varying degrees, by more formal techniques. The most recent instruction selectors not only produce better code and support more machine instructions; they also improve retargetability, since large parts of the instruction selector can be generated automatically from a declarative machine description. This increased capability and effectiveness, however, comes at the price of increased complexity. For example, as the input representation moves from trees to DAGs, instruction selection goes from a task solvable in linear time to an NP-complete problem. Because the problem is inherently hard, most methods use heuristics to control the complexity; many run in linear time and are reported to usually produce near-optimal code.

Despite these great advances, even the most sophisticated methods still have several significant flaws. The most significant is that, to the best of my knowledge, no method can model machine instructions with internal jumps or loops. Such instructions are typically handled instead through compiler intrinsics, which the compiler processes directly, and through calls to library functions implemented in assembly. While this lets the instruction selector partially extend its machine instruction support, the approach is far from ideal: extending the compiler with new intrinsics requires considerable manual effort, and the library functions must be rewritten for each target machine. Consequently, many machine instructions remain incompletely supported, forcing compiler developers to implement custom routines to assist the instruction selector. In addition, heuristic-based instruction selectors often rely on assumptions about the target machine, making it very difficult to retarget them and to make them generate efficient code for complex architectures such as DSPs and embedded systems.

Another recurring limitation is that instruction selection is usually confined to basic blocks. Approaches that consider whole functions, and thus perform global instruction selection, have begun to appear, but these are typically limited in other respects, such as their machine instruction support.

Finally, to achieve globally optimal code generation, all three aspects of code generation must be considered in concert; optimal instruction selection performed in isolation is essentially futile. For example, condition codes (also known as status flags) cannot be exploited effectively without considering scheduling, because it must be ensured that they are not prematurely overwritten by other instructions. The same applies to VLIW architectures, which execute multiple instructions in parallel. Another example is rematerialization, where selected instructions are deliberately duplicated so that a value is recomputed rather than kept alive; this is advantageous when recomputing a value is cheaper than storing it until its later use, but only if the reduced register pressure actually helps the register allocator. On target machines with multiple register classes, each requiring particular sets of instructions for access, the coupling between instruction selection and register allocation becomes even tighter. Nevertheless, most modern methods consider instruction selection only in isolation, making it difficult to integrate instruction scheduling and register allocation in a complete and efficient manner.

Despite these problems, several methods based on techniques from the field of operations research show great promise. In striving for optimality, these approaches can often also handle more complex machine instructions and can be extended to consider entire functions. In particular, emerging research on constraint-programming-based approaches (see Bashford and Leupers [24], Kuchcinski [155], and Castañeda Lozano et al. [39]) shows that the various aspects of code generation can be integrated into a single constraint model. Because arbitrary restrictions of the target machine can easily be expressed as additional constraints, these methods are extremely resilient and flexible with respect to retargeting. However, current implementations are typically an order of magnitude slower than their heuristic counterparts, which means the technology is still immature and requires further research.

Finally, although the field has progressed considerably since the 1960s, instruction selection remains, contrary to common belief, an incompletely understood problem. Moreover, the current trend toward tighter coupling of embedded systems, DSPs, and application-specific accelerators means that target machines will only become more complex, not simpler. In this light, a better understanding of instruction selection is more necessary than ever.

