5. Graph-based approaches

Instruction selectors based on graph covering are the most powerful code generators available today. By allowing both the input and the patterns to be arbitrary graphs, an instruction selector can accept an entire function as input (this is known as global instruction selection) and can potentially handle a wide range of machine instructions, including hardware loops and SIMD instructions. Most importantly, a global instruction selector can match and select patterns freely across multiple blocks, whereas DAG-covering techniques are confined to a single basic block. This increases the chance of applying complex patterns, which can lead to performance gains and reduced power consumption.

Since most graph covering methods handle the pattern matching problem by applying some subgraph isomorphism algorithm, we begin with a brief look at several such algorithms.

Figure 5.1: An example of a graph covering problem. A pattern instance is represented by a dashed line and a shaded area containing the pattern's matched nodes.

5.1. Subgraph isomorphism algorithms

The subgraph isomorphism problem is to decide whether an arbitrary graph GA is isomorphic to a subgraph of another graph GB. If so, GA is said to be subgraph-isomorphic to GB; deciding this is an NP-complete problem [55]. Because subgraph isomorphism appears in many other fields, a great deal of research has been devoted to it. In this section we take a brief look at two such graph matching algorithms.

An early and well-known subgraph isomorphism algorithm was developed by Ullmann [226]. In a 1976 paper, Ullmann formulated the problem of deciding whether a graph GA = (VA, EA) is isomorphic to a subgraph of GB = (VB, EB) as that of finding a Boolean |VA| x |VB| matrix M' such that

C = M'(M'B)^T

∀i, j, 1 ≤ i ≤ |VA|, 1 ≤ j ≤ |VB|: (a_ij = 1) ⇒ (c_ij = 1)

hold, where A and B are the adjacency matrices of GA and GB, respectively. In such an M', each row contains exactly one 1 and each column contains at most one 1. M' can be found by brute force: initialize every element m'_ij to 1 and then prune 1s until a solution is found. To reduce the search space, Ullmann developed a procedure that eliminates some 1s before the search starts. However, the worst-case time complexity of the algorithm is O(n! n^2) (according to [57]), and the use of adjacency matrices precludes handling the multi-edge graphs that may appear in program representations.
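To make the formulation concrete, here is a minimal Python sketch of the brute-force search described above: it tries every injective assignment of GA's nodes to GB's nodes (each such assignment corresponds to one candidate matrix M') and checks that every edge of GA is preserved. The function name and matrix encoding are our own illustration, not from Ullmann's paper.

```python
from itertools import permutations

def is_subgraph_isomorphic(adj_a, adj_b):
    """Decide whether G_A is isomorphic to a subgraph of G_B.

    adj_a, adj_b: adjacency matrices (lists of lists of 0/1).
    Brute force in the spirit of Ullmann's formulation: try every
    injective assignment of A-nodes to B-nodes (each one corresponds
    to a candidate matrix M') and check that every edge of G_A maps
    onto an edge of G_B.
    """
    na, nb = len(adj_a), len(adj_b)
    if na > nb:
        return False
    # mapping[i] = j means row i of M' has its single 1 in column j.
    for mapping in permutations(range(nb), na):
        if all(adj_b[mapping[i]][mapping[j]] == 1
               for i in range(na) for j in range(na) if adj_a[i][j] == 1):
            return True
    return False
```

A two-edge path is subgraph-isomorphic to a directed triangle, but not the other way around, since the triangle has more edges than the path can absorb.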

More recently, Cordella et al. [57] proposed an algorithm called VF2, which has since been used in several DAG- and graph-covering instruction selectors. In particular, it has been applied to instruction set extraction (ISE; see for example [10, 50, 75]). Broadly, the VF2 algorithm recursively constructs a mapping set of pairs (n, m), where n ∈ GA and m ∈ GB, adding a new pair to the set at each step. The core of the algorithm is a set of rules that forbid adding any pair that cannot lead to a valid subgraph isomorphism mapping from GA to GB. In the worst case the algorithm has O(n! n) time complexity, but since it runs in polynomial time in the best case, it has reportedly been applied successfully to graphs with more than 1,000 nodes. In addition, the VF2 algorithm can handle multi-edge graphs.
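The pair-growing idea behind VF2 can be sketched as follows. This is only a simplified illustration of the recursive scheme; it omits VF2's look-ahead feasibility rules and candidate ordering, which are what make the real algorithm fast in practice, and all names are our own.

```python
def vf2_like_match(adj_a, adj_b):
    """Search for a subgraph isomorphism from G_A into G_B by growing a
    set of (n, m) pairs, in the spirit of VF2 (without its full set of
    look-ahead feasibility rules). Returns a mapping dict or None."""
    na, nb = len(adj_a), len(adj_b)

    def consistent(mapping, n, m):
        # Every edge between n and an already-mapped node of G_A must
        # be mirrored between m and the corresponding node of G_B.
        for n2, m2 in mapping.items():
            if adj_a[n][n2] and not adj_b[m][m2]:
                return False
            if adj_a[n2][n] and not adj_b[m2][m]:
                return False
        return True

    def extend(mapping):
        if len(mapping) == na:
            return dict(mapping)
        n = len(mapping)              # map A-nodes in index order
        for m in range(nb):
            if m not in mapping.values() and consistent(mapping, n, m):
                mapping[n] = m
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[n]        # backtrack
        return None

    return extend({})
```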

Other papers on subgraph isomorphism found during this survey include: Guo et al. [119], Krissinel and Henrick [152], Sorlin and Solnon [213] (a global constraint for finding subgraph isomorphisms), Gallagher [102], Fan et al. [82], Fan et al. [83], and Hino et al. [129].

5.2. First approaches

In 1994, Liem et al. [165] demonstrated, in a heavily cited paper, a method that performs pattern matching and selection on CDFGs (control and data flow graphs). Implemented in a code generation system called CodeSyn, which is itself part of an embedded development environment called FlexWare, this was the first known technique to perform global instruction selection while handling data-flow patterns, control-flow patterns, and mixed control-data-flow patterns. Although the method in the paper is limited to tree patterns, Liem et al. claim it can easily be extended to DAG patterns. Their matcher runs in O(n^2) time in the worst case and, like Weingart's tree pattern matcher (see Section 3.1), arranges all patterns into a single tree-like structure. Starting from the root of this structure, the matcher compares the current node in the CDFG and descends along the branches. The patterns are arranged so that if a match fails at a higher level of the structure, whole subtrees can be pruned, reducing matching time. Pattern selection is then done with dynamic programming. However, Liem et al. do not provide a way to automatically order the patterns when building the tree, and it is doubtful whether patterns consisting of multiple disconnected graphs can be supported.

In the same year, Van Praet et al. [227, 228] developed another method, implemented in CHESS, a DSP-oriented compiler produced as part of a European project. In this approach, the patterns are derived automatically from a processor description written in nML, a domain-specific language conceived by Fauth et al. [85], which specifies a bipartite graph representing the data paths of the target processor. When generating code for an input CDFG, units in the processor description are grouped into bundles. These bundles are then converted into corresponding patterns (which can be of any shape), after which patterns are matched and selected. Patterns are allowed to overlap if this produces a better result. The paper does not discuss the matching process, but states that selection is done with a branch-and-bound algorithm. The algorithm also assumes that all patterns have non-negative, fixed costs, which allows the graph being processed to be decomposed into smaller, simpler instances that can be covered individually. However, this precludes the use of machine instructions that can execute in parallel, where selecting an additional instruction incurs no extra cost. In addition, it is unclear whether complex processors can be modeled correctly in their approach.

5.3. Unate and binate covering techniques

Another way to solve the pattern selection problem is to convert it into an equivalent unate or binate covering problem. Although techniques based on binate covering appeared first, we discuss unate covering first, since binate covering is an extension of it.

5.3.1. Unate covering

Clark et al. [51] published a paper describing a code generation method for acyclic computation accelerators. Such accelerators are programmable to the extent that parts of a program can be implemented on and executed by a custom accelerator, improving performance. Their algorithm first enumerates all possible subgraphs (that is, pattern instances) of the input DAG, using a modified Ullmann algorithm to prune candidates that are not subgraph-isomorphic. Once all pattern instances have been enumerated, the pattern selection task is cast as a unate covering problem. The problem is given by a Boolean matrix M, where columns represent pattern instances and rows represent graph nodes to be covered; M_ij = 1 means that node i is covered by pattern instance j. This is illustrated in Figure 5.2. The goal is then to select columns so that every node is covered by at least one pattern at minimal cost (the algorithm forbids overlap, which Clark et al. argue reduces the power consumption of the generated code). In other words, if N is the set of nodes in the input graph and P is the set of pattern instances, then

∀n ∈ N: ∃p ∈ P': p covers n

must hold, where P' ⊆ P is the set of selected pattern instances, while the total cost of P' is minimized. In the paper the authors note that effective heuristics exist for solving such covering problems (see [58, 113]), and their experiments show that the algorithm exhibits linear running time in practice. This method was later extended by Hormati et al. [132] to reduce the interconnect of the accelerator design and the delay of the data path.

Figure 5.2: Example of unate covering. A 1 marked with an asterisk in the matrix represents a potential but unselected cover, while an unmarked 1 indicates an optimal, selected cover. All patterns are assumed to have the same unit cost.
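As an illustration, a minimal exhaustive solver for this kind of unate covering instance can be written as follows. This is a sketch for small matrices only (real implementations use branch-and-bound or heuristics); the function name and the example matrix, which mirrors the clauses used in Section 5.3.2, are our own.

```python
from itertools import combinations

def unate_cover(matrix, costs):
    """Minimum-cost unate covering: pick columns (pattern instances) so
    that every row (graph node) has at least one 1 in a picked column.

    matrix[i][j] = 1 iff node i is covered by pattern instance j.
    Exhaustive over column subsets -- fine for small instances only.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    best = None
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            if all(any(matrix[i][j] for j in cols) for i in range(n_rows)):
                cost = sum(costs[j] for j in cols)
                if best is None or cost < best[0]:
                    best = (cost, cols)
    return best

# A matrix equivalent to f = (p1+p2)(p2+p3)(p3+p4)(p4+p5)p6, unit costs:
M = [[1, 1, 0, 0, 0, 0],
     [0, 1, 1, 0, 0, 0],
     [0, 0, 1, 1, 0, 0],
     [0, 0, 0, 1, 1, 0],
     [0, 0, 0, 0, 0, 1]]
cost, cols = unate_cover(M, [1] * 6)   # selects p2, p4, p6 at cost 3
```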

Unate covering was also used by Martin et al. [168], but unlike Clark et al. they expressed the problem as a constraint programming (CP) model. The CP model is then solved optimally, in combination with instruction scheduling. This approach was later extended by Floch et al. [91] to target VLIW architectures.

5.3.2. Binate covering

The matrix of a unate covering problem can be rewritten as a Boolean formula consisting of conjunctions of uncomplemented disjunctions. We write each disjunction as a clause. As an example, the Boolean matrix in Figure 5.2 (b) can be rewritten as

f = (p1 + p2)(p2 + p3)(p3 + p4)(p4 + p5) p6

The goal is then to satisfy f at minimal cost. However, the problem with unate covering is that it cannot capture all the constraints necessary for instruction selection. Most patterns assume that their inputs have a particular type or reside at a specific data location; in grammar terms, this is expressed with nonterminals. Using the same example, assume that node n1 can be covered by several single-node patterns p1, p'1, and p''1. Assume further that pattern p3, if selected, requires that node n1 be covered by pattern p1 and by that pattern only. Such constraints are common in instruction selection and cannot be expressed as part of a unate covering problem. This shortcoming is addressed by binate covering. Unlike unate covering, binate covering allows disjunctions to contain both uncomplemented and complemented literals (i.e., ¬p). The constraint we just discussed can now be expressed as ¬p3 + p1; that is, if we select p3, then we must also select p1 to make the clause true. Such clauses are therefore called implication clauses, since ¬p3 + p1 is equivalent to p3 ⇒ p1.
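A brute-force solver for small binate covering instances, extending the unate case with complemented literals, might look like this (our own sketch and encoding; literal +k stands for pk and -k for its complement):

```python
from itertools import product

def binate_cover(n_vars, clauses, costs):
    """Minimum-cost satisfying assignment for a binate covering problem.

    clauses: list of clauses; each clause is a list of literals, where
    literal +k means variable p_k and -k means its complement
    (variables are 1-indexed so that the sign is meaningful).
    """
    best = None
    for bits in product([0, 1], repeat=n_vars):
        sat = all(any(bits[abs(l) - 1] == (1 if l > 0 else 0) for l in clause)
                  for clause in clauses)
        if sat:
            cost = sum(c for b, c in zip(bits, costs) if b)
            if best is None or cost < best[0]:
                best = (cost, bits)
    return best

# The clauses of f plus the implication clause (-3, 1), i.e. ¬p3 + p1:
clauses = [[1, 2], [2, 3], [3, 4], [4, 5], [6], [-3, 1]]
best = binate_cover(6, clauses, [1] * 6)
```

With unit costs the cheapest satisfying assignment still selects p2, p4, and p6; since p3 is unselected, the implication clause is trivially satisfied.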

Rudell [197] pioneered the application of binate covering by using it to solve DAG covering for VLSI synthesis.[1] It was also used by Servít and Yi [206] to solve the more or less equivalent technology mapping problem.

This approach was later adapted by Liao et al. [163, 164] to handle instruction selection. In papers published in 1995 and 1998, Liao et al. describe a method that targets register machines and optimizes for code size. To keep the complexity manageable, the pattern selection problem is first solved while ignoring the costs of data transfers and register spilling. After the complex patterns have been selected, the covered nodes are collapsed into single nodes, and a second binate covering problem is constructed to minimize the cost of data transfers. Although the two problems could be solved simultaneously, Liao et al. chose not to do so because the number of required clauses would become very large.

More recently, Cong et al. [53] also applied binate covering to instruction selection as part of application-specific instruction generation for configurable processor architectures. However, the authors mainly built on previous work and did not contribute new knowledge that advances instruction selection itself.

Although the approaches discussed above confine themselves to DAGs, the idea of pattern selection via binate covering can be extended to arbitrary graph patterns. This, together with the large body of research on exact and approximate methods for solving binate covering problems, makes it a promising candidate for future instruction selection research.

5.4. PBQP-based approaches

In 2003, Eckstein et al. [73] developed the first technique that operates directly on SSA graphs. SSA stands for static single assignment, a program representation in which every variable or temporary is defined exactly once. As a consequence, the live range of each variable is contiguous, which simplifies many optimizations. SSA-based IRs are therefore used by many modern compilers, including LLVM and GCC. For more information about SSA, readers are referred to a compiler textbook such as [8] or [56].

Eckstein et al. recognized that for dedicated DSPs performing fixed-point arithmetic, restricting instruction selection to basic blocks leads to suboptimal code. For example, Figure 5.3 (a) shows a small piece of code that multiplies fixed-point values. A common feature of fixed-point multiplication units is that the result is shifted one bit to the left. To use such a multiplier effectively, the value of the variable s on line 4 should therefore be kept in left-shifted form throughout the loop and only shifted back to normal form before the function returns. This means that the value of s must be in different forms at different points in the function. In an instruction selector restricted to basic blocks, however, such information is hard to propagate, because s is defined and used in different trees (see Figure 5.3 (b)).

[1] Based on second-hand information from [53, 163, 164].

Figure 5.3: An example that benefits from global instruction selection (from [73]).

A global instruction selector, by contrast, can make these decisions as part of pattern selection. First, the input program is rewritten into SSA form, from which the corresponding SSA graph is derived; Figure 5.4 shows these data structures for the current example. Because an SSA variable may only be defined at one point, φ-functions are used to define loop variables, bypassing this restriction. Also, because the SSA graph contains no control-flow information, the code emitter must take the input program as an additional input in order to decide how the machine instructions should be arranged.

Assume that the patterns are expressed as a linear-form grammar (see the description on page 20); any ordinary grammar can easily be rewritten into linear form. For each node ni in the SSA graph, we define a Boolean vector xi whose length equals the number of base rules that match that node. Here, matching only means that the operator of a base rule matches the node; it does not necessarily mean that there exists a valid rule derivation for the child nodes such that the parent rule can actually be selected. Instead, this is taken care of by the chain costs, which we will discuss shortly. We can thus think of this as slightly relaxing the matching problem and then inferring the valid matches as part of the pattern selection problem. Assume further that the weighted costs of the applicable base rules are given by another vector ci of equal length. The weights are typically estimates of the relative execution frequency of the node's operation; instructions inside loops must be given higher priority for low costs, since they have a greater impact on performance. Using these definitions, the SSA graph is transformed into a partitioned Boolean quadratic problem (PBQP, also described by Scholz and Eckstein [203]), which is defined as finding an assignment to the xi such that

f = Σ_{1≤i<j≤n} xi C_ij xj^T + Σ_{1≤i≤n} xi ci^T

is minimal, where n is the number of nodes in the SSA graph. Solutions are also restricted to those in which each vector xi contains exactly one 1 (i.e., xi · 1^T = 1). PBQP is an extension of the quadratic assignment problem (QAP), a fundamental combinatorial optimization problem. Both QAP and PBQP are NP-complete, and Eckstein et al. developed their own heuristic solver, which is also described in the paper.
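A tiny brute-force solver makes the objective concrete. This is purely illustrative (Eckstein et al. use a heuristic solver, not enumeration), and the data-structure encoding is our own.

```python
from itertools import product

def solve_pbqp(c, C):
    """Brute-force PBQP: pick one alternative per node vector x_i so
    that sum_i c[i][x_i] + sum_{i<j} C[(i,j)][x_i][x_j] is minimal.

    c: list of per-node cost vectors (c[i][k] = cost of alternative k
       at node i).
    C: dict mapping pairs (i, j) with i < j to a cost matrix between
       the two nodes' alternatives; missing pairs contribute nothing.
    """
    n = len(c)
    best = None
    for choice in product(*(range(len(ci)) for ci in c)):
        cost = sum(c[i][choice[i]] for i in range(n))
        cost += sum(C[(i, j)][choice[i]][choice[j]] for (i, j) in C)
        if best is None or cost < best[0]:
            best = (cost, choice)
    return best

inf = float("inf")
# Two nodes with two rule alternatives each; the chain-cost matrix
# forbids mixing alternative 0 at one node with alternative 1 at the other.
c = [[1, 2], [3, 0]]
C = {(0, 1): [[0, inf], [inf, 0]]}
best = solve_pbqp(c, C)   # (2, (1, 1)): pick alternative 1 at both nodes
```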

Figure 5.4: The SSA form of the program in Figure 5.3 (from [73]).

Clearly, the objective function consists of two parts: the accumulated chain costs (the first term) and the accumulated base costs (the second term). The base costs are self-explanatory and will not be discussed further. The chain costs are the costs of transitioning from one base rule to another via chain rules. They are given by the cost matrices C_ij, each of which expresses the chain costs of transitioning from node j to node i in the SSA graph. Since the edges of the SSA graph are ordered, assume that node j is the m-th child of node i. An element c_kl of such a matrix then gives the minimum total cost of the chain rules needed to rewrite the nonterminal of rule l into the nonterminal expected as the m-th argument of rule k. If the nonterminals are identical, the cost is 0; if no such chain of rules exists, the cost is set to ∞, which forbids that combination of rules from being selected. The chain costs can be computed with the Floyd-Warshall algorithm [92] (among others) by computing the transitive closure over all chain rules. Later, Schäfer and Scholz [201] found a method for applying these chain rules optimally.
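The chain-cost computation can be sketched as a standard Floyd-Warshall pass over the chain rules. The nonterminal names and rule costs below are hypothetical, chosen only to illustrate the transitive closure.

```python
def chain_cost_closure(nonterminals, chain_rules):
    """Transitive closure of chain-rule costs via Floyd-Warshall.

    chain_rules: dict (src, dst) -> cost of a chain rule rewriting
    nonterminal src into dst. Returns the cheapest total cost of any
    chain-rule sequence between every pair of nonterminals (0 for
    equal nonterminals, inf when no sequence exists).
    """
    inf = float("inf")
    cost = {(a, b): (0 if a == b else chain_rules.get((a, b), inf))
            for a in nonterminals for b in nonterminals}
    for k in nonterminals:
        for a in nonterminals:
            for b in nonterminals:
                via = cost[(a, k)] + cost[(k, b)]
                if via < cost[(a, b)]:
                    cost[(a, b)] = via
    return cost

nts = ["reg", "addr", "imm"]
rules = {("imm", "reg"): 1, ("reg", "addr"): 1}   # hypothetical chain rules
cost = chain_cost_closure(nts, rules)
# cost[("imm", "addr")] is now 2: imm -> reg -> addr
```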

Attentive readers may notice that this scheme assumes that the SSA graph contains no multi-edges, which prevents expressions such as y = x + x from being modeled directly. Fortunately, multi-edges can be removed by introducing new temporary variables and connecting them with value copies.

Eckstein et al. tested their implementation on a set of selected problems. The results show that, compared to a traditional tree pattern matcher, this approach improves code quality by 40-60% (and up to 90% for one problem). The PBQP solver produced optimal results for almost all test cases. However, the approach is limited to tree patterns, which precludes the use of many complex machine instructions.

This limitation was later removed by Ebner et al. [72]. First, the grammar is extended to allow tuples of rules for a given pattern. For example, a combined division-and-modulo instruction, with a cost of 2, can be represented as the following complex pattern:

⟨ lo → div(x:reg1, y:reg2), hi → mod(x, y) ⟩ { emit "divmod R(reg1), R(reg2)" }

Here x and y are indices indicating that the div and mod operations share the same input. The base rules that belong to a complex pattern are called proxy rules. The PBQP problem is then modified to accommodate the selection of complex patterns. Essentially, this requires introducing new variables that represent whether a complex pattern is used, together with constraints that enforce the selection of all proxy rules belonging to a selected complex pattern. Constraints are also needed to prevent selections of complex patterns that would lead to cyclic data dependencies. Let us go into the details.

First, the vector xi of node i is extended with the proxy rules derived from the complex patterns matching node i. If two or more proxy rules from different complex patterns are identical, the length of the vector still grows by only one element.

Next, a complex pattern instance is created for every combination of distinct nodes whose matched proxy rules can be combined into that complex rule. Each instance l yields a decision vector xl that indicates whether instance l is selected. Let us put all the node vectors xi into one variable class X1 and all the decision vectors xl into another variable class X2. The connection between X1 and X2 is as follows: if xl is set to selected, where i is a node matched by a proxy rule belonging to instance l, then the corresponding proxy rule in xi must also be set to selected. This requires an additional cost matrix, in which an element is set to 0 if either:

· xl is set to not selected, or

· xi is set to a base rule or to a proxy rule not associated with the complex pattern instance l.

All other elements are set to ∞. However, if all proxy rules have cost 0, solutions are then permitted in which all the proxy rules of a complex pattern l are selected but the complex pattern itself is not. Ebner et al. solve this by assigning every proxy rule a high artificial cost M and changing the cost of each complex pattern to cost(l) − |l| · M, where |l| is the number of proxy rules in l. This offsets the artificial costs of the selected proxy rules, bringing the combined cost of the selected proxy rules and the complex pattern down to cost(l).
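A quick numeric check of this cost trick (the values are ours; M = 1000 stands in for the "high cost"): selecting the pattern together with all of its proxies costs exactly the pattern's real cost, while orphaned proxies remain prohibitively expensive.

```python
def pattern_costs(real_cost, n_proxies, big_m=1000):
    """Cost split used to tie proxy rules to their complex pattern:
    each proxy gets artificial cost big_m, and the pattern itself gets
    real_cost - n_proxies * big_m, so selecting the pattern together
    with all its proxies costs exactly real_cost."""
    proxy_cost = big_m
    pattern_cost = real_cost - n_proxies * big_m
    return proxy_cost, pattern_cost

# The divmod pattern from above: real cost 2, two proxy rules (div, mod).
proxy, pattern = pattern_costs(real_cost=2, n_proxies=2)
combined = 2 * proxy + pattern   # both proxies + the pattern itself
orphaned = 2 * proxy             # proxies selected without the pattern
```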

Finally, if two complex pattern instances u and v overlap, or if selecting both would cause a cyclic data dependence, a cost matrix is needed that prevents them from being selected simultaneously. This is done by setting the elements corresponding to a violation of either condition to ∞, and all other elements to 0.

Hence, if the cost vectors and matrices are extended to range over both the node vectors and the decision vectors, the new objective function becomes

f = Σ_{1≤i<j≤n+p} xi C_ij xj^T + Σ_{1≤i≤n+p} xi ci^T

where p is the number of complex pattern instances. Compared to LLVM 2.1 targeting an ARMv5 processor, the PBQP approach improved execution time by an average of 13% for a set of selected problems. The overall impact on compilation time was negligible.

Another technique extending the first PBQP approach to support general graph patterns was presented by Buchwald and Zwinkau [35]. They cast instruction selection as a graph transformation problem, for which formal foundations already exist, with machine instructions expressed as rewrite rules rather than grammar rules. Besides extending pattern support, this formal basis allows the resulting instruction selector to be verified to handle all possible inputs; if verification fails, the necessary missing rewrite rules can be deduced automatically. After all applicable rewrite rules have been found for the SSA graph (which corresponds to pattern matching),[1] a corresponding PBQP instance is constructed and solved as before. Buchwald and Zwinkau also identified and addressed situations in which the heuristic solver of Eckstein et al. may fail to find a solution because not enough information is propagated. In the paper, however, Buchwald and Zwinkau mention that their implementation currently scales poorly as the number of overlapping patterns increases.

Compared to other techniques, the PBQP-based approaches look very promising in terms of machine instruction support, code quality, and runtime. However, because PBQP-based instruction selectors are a fairly new invention, the number of references and applications in industrial-strength compilers is still small. It is also unclear whether they can handle all kinds of machine instructions, especially those with special constraints.

5.5. Other graph-based approaches

Yu and Hu [245] presented a pattern selection technique that relies on a recursive, heuristic search method based on means-end analysis.[2] It is claimed to be powerful enough to take whole functions as input, but the authors do not go into detail; the paper is mentioned here only for completeness.

A non-traditional technique was proposed by Visser [229], who applied the theory of simulated annealing [147] to code generation. Although interesting, the instruction selection problem is unfortunately reduced to a one-to-one mapping between IR nodes and selected machine instructions, making it useless in practice.

5.6. Summary

In this chapter we have considered several instruction selection techniques that rely on graph covering. Compared to code generators that operate on trees or DAGs, instruction selectors based on graph covering are the most powerful, since both the program input and the machine instructions can be arbitrary graphs. This makes it possible to take entire functions, including their control flow, as input, and to match and select more complex machine instructions that cannot be modeled as trees or DAGs.

However, because subgraph isomorphism is NP-complete, optimal graph covering requires solving two NP-complete problems instead of "just" the one required for DAG covering. Since this worsens an already immense challenge, we will most likely see this technique only in compilers that can afford very long compilation times in exchange for optimal or near-optimal code quality (for example, in embedded systems with extreme performance, code size, or power requirements).

[1] In fact, the SSA graph is first converted into a DAG by splitting nodes into two to break the cycles.

[2] A similar idea was previously used by Newell and Ernst [178] for tree-based instruction selection (see page 7).
