Instruction Selection Survey (5)


4. DAG Covering

As we saw in the previous chapter, an intrinsic disadvantage of relying solely on trees is that they cannot properly model common subexpressions: an expression must either be partitioned into a forest, or the shared subexpression must be repeated in every tree that uses it. Neither is a good solution, as both lead to suboptimal code. One remedy is to model the expression not as a tree but as a directed acyclic graph (DAG). DAGs allow nodes to have multiple outgoing edges, so the values of intermediate nodes can be shared and reused, and instruction selection can apply the same pattern matching and selection methods as for trees (see Figure 4.1). Consequently, the instruction selectors of most modern compilers work on expression DAGs rather than on trees. In particular, DAGs also make it possible to model multiple-output machine instructions, such as divmod instructions that produce both the quotient and the remainder.
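To make the contrast concrete, the following sketch (a hypothetical Node class, our own illustration rather than any compiler's representation) builds (a + b) * (a + b) once as a tree and once as a DAG, and counts the distinct nodes:

```python
class Node:
    """A minimal expression node; kids are the operand nodes."""
    def __init__(self, op, *kids):
        self.op, self.kids = op, list(kids)

def count_nodes(n, seen=None):
    """Count distinct nodes; a shared node is counted only once."""
    seen = set() if seen is None else seen
    if id(n) in seen:
        return 0
    seen.add(id(n))
    return 1 + sum(count_nodes(k, seen) for k in n.kids)

# Tree form of (a + b) * (a + b): the common subexpression is duplicated.
tree = Node("*", Node("+", Node("a"), Node("b")),
                 Node("+", Node("a"), Node("b")))

# DAG form: a single '+' node with two uses (two outgoing edges).
shared = Node("+", Node("a"), Node("b"))
dag = Node("*", shared, shared)

print(count_nodes(tree), count_nodes(dag))   # 7 4
```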

Figure 4.1: An example of the covering problem on a DAG. A pattern instance is indicated by a dashed outline and the shaded background of the nodes it contains.

Unfortunately, the price paid for this gain in generality and modelling power is a significant increase in complexity: while optimal pattern selection on trees can be done in linear time, the same task on DAGs is an NP-complete problem. This was shown in 1976 by Bruno and Sethi [34] and by Aho et al. [4], although their proofs are oriented more towards optimal instruction scheduling and register allocation than towards instruction selection. In 1995 Proebsting [186] gave a very concise proof for optimal instruction selection, which was later re-elaborated by Koes and Goldstein [150] in 2008. We give the gist of Koes and Goldstein's proof in this report. Note that the intractability does not stem from pattern matching: as we shall see, if the patterns are trees, matching can still be done in polynomial time.

Theorem 1. Optimal (that is, least-cost) instruction selection is an NP-complete problem.

Proof. The proof is based on reducing the SAT (Boolean satisfiability) problem to optimal instruction selection. SAT asks whether there exists a truth assignment to the variables of a Boolean formula in conjunctive normal form (CNF) such that the formula evaluates to T. A CNF formula has the form (x11 ∨ x12 ∨ ...) ∧ (x21 ∨ x22 ∨ ...) ∧ ..., where any variable x may also appear negated as ¬x. Since SAT has been proven NP-complete [59], any other problem P to which SAT can be reduced in polynomial time must also be NP-complete.

First we transform a SAT instance S into a DAG. Intuitively, if we can cover this DAG with unit-cost patterns such that the total cost equals the number of nodes (assuming every variable and operator gives rise to a new node), then there exists a truth assignment to the variables under which the formula evaluates to T. For this purpose, the nodes of the DAG may be of the types {∨, ∧, ¬, v, □, ▼}, where the last two are called box nodes and stop nodes, respectively. We define type(n) as the type of node n, and children(n) as the set of child nodes of n. For each Boolean variable x ∈ S we create two nodes v1 and v2 such that type(v1) = v and type(v2) = □, along with an edge (v1, v2) (meaning v1 → v2). Likewise, for each binary Boolean operation op ∈ S we create two nodes o1 and o2 such that type(o1) = op and type(o2) = □, along with an edge (o1, o2); we also create edges (a, o1) and (b, o1), where a and b are the box nodes of the operation's inputs (that is, type(a) = type(b) = □). Since Boolean ∨ and ∧ are both commutative, the order of these edges is irrelevant for operator nodes; for negation, obviously, we create only one such edge. By construction, only box nodes may have multiple parents (that is, multiple outgoing edges). The corresponding DAG can be built in linear time by a single traversal of the Boolean formula. An example of a SAT problem converted into an instruction selection problem is given in Figure 4.2 (b).
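A sketch of this construction under the conventions above (the Python representation is our own; node types follow the set defined in the proof):

```python
class DagNode:
    def __init__(self, ntype):
        self.ntype = ntype      # one of 'v', 'and', 'or', 'not', 'box', 'stop'
        self.succs = []         # outgoing edges, toward the consumers

def link(src, dst):
    src.succs.append(dst)

def var_node(boxes, name):
    """Create v -> box for a variable, memoized so reuses share one box."""
    if name not in boxes:
        v, box = DagNode("v"), DagNode("box")
        link(v, box)
        boxes[name] = box
    return boxes[name]

def op_node(op, *input_boxes):
    """Create op -> box, consuming the given operand box nodes."""
    o, box = DagNode(op), DagNode("box")
    for b in input_boxes:
        link(b, o)
    link(o, box)
    return box

# (x or not y) and x: the box node of x gains two outgoing edges (sharing).
boxes = {}
x, y = var_node(boxes, "x"), var_node(boxes, "y")
formula = op_node("and", op_node("or", x, op_node("not", y)), x)
link(formula, DagNode("stop"))   # only the stop pattern consumes this box
```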


Figure 4.2: Reduction from SAT to instruction selection (from [150]). (a) The SAT patterns; all patterns have unit cost. For brevity the ∧ patterns are omitted, as they are easily deduced from the ∨ patterns. (b) An example of a SAT problem converted into an instruction selection problem.

Next we construct the set P of tree patterns, from which we will be able to infer how the variables must be set in order to satisfy the formula. This pattern set is shown in Figure 4.2 (a). The patterns are formed such that if a value is assumed to be set to T, the pattern has a box node as a leaf, and if the result of an operation evaluates to F, the pattern has a box node as its root. One way of viewing this is that if an operation in a pattern consumes a box node, the corresponding value must be T, and if it produces a box node, the result must be F. To force the entire formula to evaluate to T, the only pattern containing a stop node consumes a box node.

In addition to the node types that may appear in the expression DAG, pattern nodes may also be of an extra type which we call anchor nodes (written ■). We now say that a tree pattern with root node pr matches a node v ∈ V, where V is the node set of the expression DAG D = (V, E), if and only if:

1. type(v) = type(pr),

2. |children(v)| = |children(pr)|, and

3. for every pair of corresponding children cv ∈ children(v), cp ∈ children(pr): (type(cp) = ■) ∨ (cp matches cv).

In other words, the structure of the tree pattern must correspond to the structure of the subgraph, except that an anchor node matches any node. We also introduce two further definitions: matchset(v), which returns the set of tree patterns in P that match v, and mp→v(vp), which, for a pattern p matched at node v, returns the DAG node corresponding to pattern node vp. Finally, we say that a DAG D = (V, E) is covered by a mapping function f: V → 2^P from DAG nodes to sets of patterns if and only if, for every v ∈ V:

1. p ∈ f(v) ⇒ p matches v and p ∈ P,

2. out-degree(v) = 0 ⇒ |f(v)| > 0, and

3. ∀p ∈ f(v), ∀vp ∈ p s.t. type(vp) = ■: |f(mp→v(vp))| > 0.

The first condition enforces that every selected pattern is a true pattern that matches; the second forces the stop node to be covered; and the third enforces matching coverage of the rest of the DAG. An optimal cover of D = (V, E) is a cover that minimizes

Σ_{v ∈ V} Σ_{p ∈ f(v)} cost(p),

where cost(p) returns the cost of pattern p.
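The matching and covering definitions translate almost directly into code. The sketch below (our own formulation, with assumed node classes) implements conditions 1 to 3 of the matching definition and the cost of a cover:

```python
ANCHOR = "anchor"

class PNode:
    """Tree-pattern node: a type plus ordered children ('anchor' matches anything)."""
    def __init__(self, ntype, *kids):
        self.ntype, self.kids = ntype, list(kids)

class DNode:
    """Expression-DAG node; kids point toward the operand nodes."""
    def __init__(self, ntype, *kids):
        self.ntype, self.kids = ntype, list(kids)

def matches(v, p):
    """Conditions 1-3 of the matching definition."""
    if p.ntype == ANCHOR:                      # anchors match any node
        return True
    if v.ntype != p.ntype:                     # condition 1
        return False
    if len(v.kids) != len(p.kids):             # condition 2
        return False
    return all(matches(cv, cp)                 # condition 3, pairwise
               for cv, cp in zip(v.kids, p.kids))

def cover_cost(f, cost):
    """Total cost of a cover f: DAG node -> set of selected patterns."""
    return sum(cost[p] for pats in f.values() for p in pats)

# A '¬' node over a box is matched by a pattern whose child is an anchor.
dag = DNode("not", DNode("box", DNode("v")))
pat = PNode("not", PNode(ANCHOR))
print(matches(dag, pat))   # True
```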

It now holds that if an optimal cover has a total cost equal to the number of non-box nodes in the DAG, then the corresponding SAT problem is satisfiable. This should be clear, since every tree pattern in P covers exactly one non-box node and all have equal unit cost: if every non-box node in the DAG is covered by exactly one pattern, we can simply look up which patterns were selected to cover these nodes and read off the truth values of the Boolean variables. Finding these patterns can be done in linear time by a simple depth-first traversal of the DAG.

We have thus shown that an instance of SAT can be solved by reducing it in polynomial time to an instance of the optimal instruction selection problem. Hence optimal instruction selection is NP-complete.

4.1. Tree Pattern Matching on DAGs

"4", such as Aho, provides a code generation method for working on a DAG in the first batch of literature. In their paper, published in 1976, Aho a few simple, greedy heuristics, as well as an optimal code generator that conforms to the Exchange law, a single register machine generating code. However, their approach assumes that there is a one by one correspondence between the program node and the machine instruction, so the problem of optimal instruction selection is actually ignored.

4.1.1. Undagging the DAG

Because instruction selection on DAGs is so much harder, the first approaches to handling such input simply undagged it into a forest of trees. This can be done in two ways. The first is to split the edges at shared nodes (that is, at points of reuse), which results in a set of disconnected trees. The implicit connection between the trees is maintained by forcing each shared node to write its value to a specific location (such as memory), from which subsequent trees read it back as input. This technique was used, for example, in DAGON, a technology binder developed by Keutzer [145]. The second method is to duplicate the subtree rooted at each reused node, which likewise yields a set of disconnected trees. Each tree can then be processed separately with traditional tree covering. Both ideas are illustrated in Figure 4.3.
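Both flavours of undagging can be sketched as follows, assuming a simple node class; the temporary-location naming is our own:

```python
class Node:
    def __init__(self, op, *kids):
        self.op, self.kids = op, list(kids)

def parents_count(root):
    """Count incoming references per node, to find the shared nodes."""
    counts = {}
    def walk(n):
        for k in n.kids:
            counts[id(k)] = counts.get(id(k), 0) + 1
            walk(k)
    walk(root)
    return counts

def undag_by_splitting(root):
    """Cut each shared node out into its own tree: the cut point becomes
    a 'load tmp_i' leaf, and the shared subtree is rooted in 'store tmp_i'."""
    counts, trees, tmp = parents_count(root), [], {}
    def rewrite(n):
        if counts.get(id(n), 0) > 1:           # shared node: split here
            if id(n) not in tmp:
                tmp[id(n)] = f"tmp{len(tmp)}"
                body = Node(n.op, *[rewrite(k) for k in n.kids])
                trees.append(Node(f"store {tmp[id(n)]}", body))
            return Node(f"load {tmp[id(n)]}")
        return Node(n.op, *[rewrite(k) for k in n.kids])
    trees.append(rewrite(root))
    return trees                               # a forest of disconnected trees

def undag_by_duplication(n):
    """Copy every node, so reused subtrees become separate duplicates."""
    return Node(n.op, *[undag_by_duplication(k) for k in n.kids])
```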

Although undagging makes it possible to apply the well-known tree-based techniques to instruction selection on DAGs, it has important drawbacks: overly aggressive edge splitting hinders the use of complex patterns because it produces many small trees, while overly aggressive duplication yields inefficient code because many operations in the final code are repeated unnecessarily. Both practices therefore lead to suboptimal code. In addition, the intermediate results of a split DAG must be stored temporarily, which is problematic for heterogeneous memory-register architectures. This issue was later investigated by Araujo et al. [17].


Figure 4.3: Examples of undagging. The square nodes and dashed edges in (b) indicate data dependencies only and are not actual parts of the graph.

Fauth and other "86,174" propose an attempt to mitigate this defect by balancing duplication and division. When implementing the CBC code generator, they apply a heuristic algorithm that favors replication until replication is considered too expensive. At this point, the algorithm turns to split. By comparing the cost of two solutions, choose the one with the least cost to decide whether to replicate or split. The cost is calculated by a weighted sum of the number of tree nodes and the number of nodes expected to execute on each execution path (a rough estimate of code size versus execution time, respectively). Once this is done, each tree of results is passed to an improved Iburg that supports extended matching criteria. However, the experimental data is limited at best, which makes it difficult to compare the efficiency of the algorithm with simple and straightforward de-dag.

4.1.2. Extending Tree Parsing to DAGs

In 1999 Ertl "80" published a paper that extends tree parsing ideas to dags without first decomposing the DAG into a tree.

The idea is to first run a bottom-up pass over the DAG as if it were a tree, using the traditional DP approach described in Chapter 3 to compute, for each node, the minimum cost of reducing it to each nonterminal. Each node is thus labeled with the same costs as if the DAG had first been duplicated into a tree. Ertl's insight, however, is that if several patterns reduce a child node to the same nonterminal, that nonterminal can be shared between those patterns: during code emission, code is then emitted only once for a given node-nonterminal combination. This reduces code size and improves performance, since fewer operations are repeated. Figure 4.4 gives an example. There we see that the addition will be implemented twice, since it is covered by two separate patterns that reduce it to different nonterminals; the Reg node, however, is reduced twice to the same nonterminal and can therefore be shared between the patterns.
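A condensed sketch of such a bottom-up labeling with per-(node, nonterminal) sharing; the rule encoding is our own simplification of a tree grammar, not Ertl's implementation:

```python
class Node:
    def __init__(self, op, *kids):
        self.op, self.kids = op, list(kids)

# Rules: (nonterminal, op, child nonterminals, cost of the instruction).
RULES = [
    ("reg",  "Reg",  [],             0),
    ("reg",  "Add",  ["reg", "reg"], 1),   # add r, r
    ("addr", "Add",  ["reg", "reg"], 0),   # the same add folded into an address
    ("reg",  "Load", ["addr"],       1),
]

def label(node, memo=None):
    """Return {nonterminal: min cost} for node, computed once per node
    and shared across all uses (this is the DAG-sharing step)."""
    memo = {} if memo is None else memo
    if id(node) in memo:
        return memo[id(node)]
    costs = {}
    for nt, op, kid_nts, cost in RULES:
        if op != node.op or len(kid_nts) != len(node.kids):
            continue
        total = cost
        for kid, kid_nt in zip(node.kids, kid_nts):
            kid_costs = label(kid, memo)
            if kid_nt not in kid_costs:
                total = None
                break
            total += kid_costs[kid_nt]
        if total is not None and (nt not in costs or total < costs[nt]):
            costs[nt] = total
    memo[id(node)] = costs
    return costs

shared = Node("Add", Node("Reg"), Node("Reg"))
root = Node("Add", Node("Load", shared), shared)  # the add is reused twice
print(label(root))                                # e.g. {'reg': 3, 'addr': 2}
```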

Ertl's method produces better code than straightforward undagging, and by maintaining appropriate visited sets it runs in linear time. However, despite the claim in the paper's title, it is optimal only for a certain class of tree grammars. The paper therefore also describes a method for detecting whether a grammar belongs to this class; intuitively, the check verifies that every locally optimal decision is also globally optimal. Building on the ideas behind burg, Ertl implemented a checker called DBURG, which performs an inductive proof over the set of all possible input DAGs.

Koes and Goldstein "150" have recently extended the idea of Ertl through a heuristic that splits the input dag on points that estimate that replication will be detrimental to code quality. Like Ertl, the algorithm first performs a bottom-up on the DAG, ignoring node sharing. It then calculates two costs at the shared node (that is, multiple pattern overlaps): an estimate of the overlap cost, the pattern overlap (i.e., replication), and the CSE cost, which disperses the shared nodes into an estimate of the different trees. If the CSE cost is lower than the overlap cost, the shared node is marked as fixed. Once all the shared nodes have been processed, the shared nodes that are marked as fixed are divided into different trees. The second bottom-up DP Pass is then performed on the entire input graph, followed by a top-down from the machine instruction. Koes and Goldstein also tested their implementations, a linear time algorithm called Noltis, which was implemented as a standard in a ILP (integer linear programming, an integer linear programming) and found in 99.7% of test cases, Noltis generates the optimal scheme. However, like Ertl, two practices are confined to the processing tree mode, which repels the choice of complex machine directives that must be modeled as DAG patterns.


Figure 4.4: A tree parse of a DAG with optimal patterns selected. The shading shows the extent of each pattern, and the labels on the edges give the nonterminal to which each pattern is reduced. Note that the Reg node is covered by two patterns (indicated by the doubled dashes), both of which reduce it to the same nonterminal; it is therefore shared.

4.1.3. Other Trees-on-DAG Approaches

The LLVM "219" compiler infrastructure uses a technique different from the one discussed above. There is very little documentation on how the instructions in LLVM are made; the only source found is Bendersky's blog "26". The directive selector is basically a greedy dag-to-Dag rewrite, where the basic block of machine-independent dags represents being rewritten as a machine-related representation. Jumps caused by conditional statements or loops are part of the basic block, which allows the jump pattern to be treated like other modes. patterns, which are limited to trees, are represented by a machine description format that allows the public attribute to be extracted as an abstract instruction. The machine description is then unfolded as a complete pattern by a tool called Tablegen. These patterns are then processed by a matching program generator, which first performs a dictionary ordering on the pattern-initially based on complexity (which is the same as the size of the pattern and certain constants), then increments by the cost, and then increments by the output mode size. Each pattern is then converted to a recursive matching program, which is essentially a small program that checks whether the pattern matches a given node in the input dag. These matching programs are then compiled into a byte-code form and assembled into a matching program table that is queried during the instruction selection. These tables are arranged so that the pattern is checked in this dictionary order. Once a match is found, the pattern is greedily selected, and the matching sub-graph in the input dag is normalized to the output of the matching pattern (usually a single node). Although powerful and widely used, LLVM's command selector has several drawbacks. The main drawback is that any pattern that cannot be handled by Tablegen must be handled manually by a custom C function. Because the pattern is limited to tree form, this excludes all multi-output modes. In addition, the greedy scheme compromises the quality of the code.

4.2. DAG Pattern Matching on DAGs

4.2.1. Splitting DAG Patterns into Trees

A common approach to matching DAG patterns on a DAG is to decompose each pattern into multiple trees, perform tree pattern matching, and then recombine the matched trees into instances of the original DAG patterns. This can usually be done in O(n²) time.

Leupers and Marwedel "156,161" developed a technique that partially solves the disadvantage of a dynamic programming approach that does not match patterns that contain multiple disjoint sub-patterns. Published in 1996, their paper describes a method of dealing with patterns that consist of multiple disjoint nodes that perform a single operation, which the author calls complex patterns. Such patterns are first decomposed into their respective components, and then the pattern set is converted to the associated tree pattern, which is exported from a complex pattern that contains a single node. Then pattern matching consists of two phases: in the first phase, the Iburg is applied to find all the matching trees of the input dag, and then the second stage passes an integer planning (IP) model "47,204", which takes the instruction selection and scheduling restrictions as linear inequalities (which are discussed later), The single-node mode is reorganized into a complex mode instance. However, because this IP problem is in the worst case resolution time is a number of levels, the solution generated by leupers and Marwedel is not necessarily optimal.

Leupers "158" expands this model to handle SIMD instructions. The model presented in this paper assumes that the SIMD directives cover two inputs, collectively a SIMD pair, but this idea can easily be extended to SIMD instructions with n inputs. For SIMD pairs, the linear inequalities are as follows:


(4.1)  Σ_{rj ∈ R(ni)} xij = 1,  for every node ni ∈ G

(4.2)  xij ≤ Σ_{rl ∈ Rm(nk)} xkl,  for every ni ∈ G, every rj ∈ R(ni), and every m-th child nk of ni

(4.3)  yij ≤ Σ_{r ∈ Rhi(ni)} xir  and  yij ≤ Σ_{r ∈ Rlo(nj)} xjr

(4.4)  Σ_{nj ∈ G} (yij + yji) ≤ 1,  for every ni ∈ G

Equation 4.1 enforces that every node in the input DAG is covered by exactly one rule (R(ni) denotes the set of rules, or patterns, that match node ni, and xij is a Boolean variable indicating whether node ni is covered by rule rj). Equation 4.2 enforces that the rules selected for the children of node ni reduce them to the required nonterminals (nk denotes the m-th child of node ni, and Rm(nk) denotes the set of rules matching nk that reduce it to the m-th nonterminal required by rule rj). Equations 4.3 and 4.4 enforce that a SIMD instruction is applied only where it actually covers a SIMD pair, and that any node is covered by at most one such instruction (yij is a Boolean variable indicating whether ni and nj are packed into a single SIMD instruction, and Rhi(ni) and Rlo(ni) denote the sets of rules in which node ni operates on the hi and lo part of a register, respectively). The yij and yji variables also appear in additional constraints that preserve schedulability, which we will not discuss here. The objective is to maximize the use of SIMD instructions, expressed by the following objective function:

max Σ_{ni ∈ G} Σ_{rj ∈ S(ni)} xij

where G is the set of nodes in the input DAG and S(ni) = Rhi(ni) ∪ Rlo(ni) ⊆ R(ni). The paper includes experimental data suggesting that exploiting SIMD instructions reduces code size by 75% for the selected test cases and target platforms. However, the approach assumes that each individual operation of a SIMD instruction can be expressed as a single node in the input DAG; it is therefore unclear whether the method generalizes to more complex SIMD instructions, and whether it scales to larger input programs. Tanaka et al. [216] later extended this work to take the cost of data transfers into account when selecting SIMD instructions; this is achieved by introducing transfer nodes and transfer patterns (which Tanaka et al. call packing and unpacking).

Bashford and Leupers "24" applied the same idea, but after pattern matching, they represented the pattern selection problem as a constrained programming model (constraint programming (CP) models) "196". For each node in the input DAG, a Decomposition register transfer (factorised register transfer (FRT)) is formed based on the pattern that the node matches. A formal definition of a frt is

(op, D, [U1, ..., Un], F, C, T, CS).

Here op is the operation of the node; D is the set of registers to which the result can be written; similarly, Ui is the set of registers from which the operation can read its i-th input. F, C and T constitute the extended resource information (ERI), which specifies on which functional unit (F) the operation will be executed, by which machine instruction (T), and at what cost (C). The last component, CS, is a set of constraints restricting how the patterns may be applied. For example, if a functional unit a appearing in U1 can only write to register r, this can be expressed as U1 = a ⇒ D = r. The constraints can be arbitrarily complex in order to guarantee that the generated code is correct. Since optimal instruction selection is NP-complete, Bashford and Leupers apply a heuristic that splits the input DAG into smaller pieces at the shared nodes and then performs instruction selection on each piece in isolation. This application of CP to instruction selection looks promising, since arbitrary restrictions of the target architecture can be expressed simply as additional constraints; indeed, the approach was aimed at irregular DSP architectures. However, it is unclear whether it scales to larger input DAGs and pattern sets.
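One possible rendering of an FRT as a data structure, with the implication constraint from the example encoded as a predicate (the field names and the encoding are our own, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class FRT:
    op: str     # operation performed by the node
    D: set      # registers the result may be written to
    U: list     # U[i-1]: registers input i may be read from
    F: set      # candidate functional units
    C: dict     # cost of each (unit, instruction) alternative
    T: set      # candidate machine instructions
    CS: list = field(default_factory=list)   # constraints on assignments

# Encoding of 'U1 = a  =>  D = r' as a predicate over one assignment:
frt = FRT(op="add", D={"r", "s"}, U=[{"a", "r"}, {"r"}],
          F={"alu1"}, C={("alu1", "ADD"): 1}, T={"ADD"},
          CS=[lambda asg: asg["U1"] != "a" or asg["D"] == "r"])

asg = {"U1": "a", "D": "r"}                    # one candidate assignment
print(all(c(asg) for c in frt.CS))             # True: constraint satisfied
```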

Influenced by these ideas, Scharwaechter et al. [202] developed another approach to handle multiple-output machine instructions. First, the instruction grammar for the target machine is allowed to contain rules with multiple nonterminals on the left-hand side. In their paper, Scharwaechter et al. distinguish between rules with a single nonterminal and rules with multiple nonterminals, calling them rules and complex rules, respectively. A nonterminal definition within a rule is called a simple rule, and a nonterminal definition within a complex rule is called a split rule. Schematically:

rule:          nt → pattern
complex rule:  nt1, nt2, ... → pattern1, pattern2, ...   (each nti → patterni is a split rule)


During pattern matching, the matcher works only on simple and split rules, maintaining a mapping between split rule instances and IR nodes. After all such matches have been found, the split rule instances are combined into instances of the appropriate complex rules, excluding combinations that would violate the data flow. Then, whenever a simple rule and part of a complex rule match the same node and reduce it to the same nonterminal, a cost comparison decides whether the simple or the complex rule is used. Selecting a complex pattern affects which patterns remain available for the rest of the input DAG: the intermediate results of nodes inside a complex pattern cannot be reused by other patterns, so duplication may be required, and duplication is known to incur overhead. The instruction selector therefore chooses a complex pattern only if replacing a set of simple patterns with it saves more than the cost of the duplication. After these cost comparisons, some of the remaining complex patterns may still cover the same nodes and must be pruned. This problem is cast as a maximum weighted independent set (MWIS) problem, in which a subset of the nodes of a weighted graph is selected such that no two selected nodes are adjacent and the sum of the weights of the selected nodes is maximal (we discuss this idea in more depth in Section 4.3). In the MWIS graph each complex pattern forms a node, and an edge is introduced between two nodes whenever the corresponding patterns overlap. The weight is computed as the negated sum of the split rule costs within the complex pattern (the paper is vague on how the split rule costs are calculated). Since this problem is known to be NP-complete, the authors apply a greedy heuristic by Sakai et al. [199] known as GWMIN2. Finally, split rules that were not merged into complex patterns are replaced by simple rules before code emission. Scharwaechter et al. implemented this in CBURG as an extension of OLIVE and ran experiments generating code for a MIPS architecture, comparing code that exploits complex instructions against code using only simple instructions. The results show that CBURG exhibits near-linear time complexity and that, in this setup, the generated code improved performance by 25% and reduced code size by 22%. This code generator, however, does not guarantee optimality. Ahn et al. [2] later extended the work by also including scheduling dependency conflicts between complex patterns, forming an interference graph of candidates over which the MWIS problem is then solved. Ahn et al. also incorporate a feedback loop with the register allocator to facilitate register coalescing.

In both of the approaches just described, by Scharwaechter et al. and Ahn et al., complex rules may only consist of disconnected simple rules (that is, there must be no shared nodes between the simple rules). In a 2011 paper that modifies and extends [202], Youn et al. [243] address this limitation by introducing index subscripts on the operands of complex rules. However, these subscripts are restricted to the input nodes of a pattern, so fully arbitrary DAG patterns are still not supported.

Arnold and Corporaal "18,19,20" propose another way to process the DAG pattern, decomposing the connected single-node mode into multiple partial-tree modes at the output node (see Figure 4.5). Using a proprietary O (N2) algorithm, match the tree pattern on the input dag to find all pattern instances. After the match, the algorithm attempts to combine the incomplete pattern instances into a complete complex pattern instance properly. Implement each pattern instance by maintaining a pattern node that maps to an array of overwritten nodes in the input DAG, and then checks whether two incomplete modes belong to the same original DAG mode and there is no conflict. In other words, in the original DAG mode, no two schema nodes that correspond to the same node will overwrite different nodes in the input dag. The choice is done with traditional dynamic planning, which is optimal if the original schema contains only trees, but produces suboptimal overrides in the case of DAG mode. This is because the best selection pattern may overlap, resulting in duplication of operations.


Farfeleder, such as "84", proposes a similar approach by applying an extended version of Lburg for tree-pattern matching, followed by a pass-through attempt to merge matching patterns from multiple output machine instructions. However, the second pass contains a specialized (Ad-hoc) function and is not automatically deduced from the machine description.

4.2.2. Direct DAG Matching on DAGs

During the work on this report, no method was found that matches DAG patterns directly on DAGs while being restricted to DAGs only. A plausible reason is that performing DAG matching is as complex, or nearly as complex, as subgraph isomorphism. Since the latter is strictly more powerful (it permits graphs with cyclic edges, for example), restricting such a matcher to DAGs gains nothing. Indeed, as we will see later in this chapter, several DAG-based methods apply a subgraph isomorphism algorithm for pattern matching.

4.3. Reducing Pattern Selection to the Maximum Independent Set Problem

Some approaches reformulate the pattern selection problem as a maximum independent set (MIS) problem; we already saw this idea applied by Scharwaechter et al. and Ahn et al. (see Section 4.2.1). In this section we discuss the technique in more detail.

Given the patterns covering one or more nodes of the input graph, a corresponding conflict (or interference) graph can be formed; Figure 4.6 shows an example. In the matched DAG shown in (a), pattern p1 overlaps with p2, and pattern p4 overlaps with p2 and p3. For each pattern we create a node in the conflict graph C, shown in (b), and we introduce an edge between every two overlapping patterns. By selecting a maximum set of vertices of C such that no two selected nodes are adjacent in C, we obtain a set of patterns in which every node of the input DAG is covered without overlap. As might be expected, this problem is NP-complete.


Figure 4.6: An example of a matched DAG and the corresponding conflict graph

To achieve optimal pattern selection, we can attach the pattern cost as a weight to each node of the conflict graph and extend the MIS problem to a maximum weighted independent set (MWIS) problem: select a maximum vertex set S that satisfies the MIS property and maximizes Σ_{s ∈ S} weight(s). Since MWIS has a maximization objective, we simply assign each weight the negated pattern cost.
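Both steps, building the conflict graph and heuristically solving MWIS, can be sketched as follows; the greedy rule here is a generic weight-to-degree heuristic of our own, not the GWMIN2 heuristic mentioned earlier, and it expects nonnegative weights (e.g., a constant minus the pattern cost as a stand-in for negated costs):

```python
from itertools import combinations

def conflict_graph(matches):
    """matches: {pattern instance: set of covered input-DAG nodes}.
    An edge joins every two instances whose covered node sets overlap."""
    adj = {p: set() for p in matches}
    for p, q in combinations(matches, 2):
        if matches[p] & matches[q]:
            adj[p].add(q)
            adj[q].add(p)
    return adj

def greedy_mwis(adj, weight):
    """Pick the vertex with the best weight-to-degree ratio, drop its
    neighbours, and repeat until no vertices remain."""
    alive, chosen = set(adj), set()
    while alive:
        p = max(alive, key=lambda v: weight(v) / (1 + len(adj[v] & alive)))
        chosen.add(p)
        alive -= {p} | adj[p]
    return chosen

# The situation of Figure 4.6: p1-p2 overlap; p4 overlaps p2 and p3.
matches = {"p1": {1, 2}, "p2": {2, 3}, "p3": {4}, "p4": {3, 4}}
adj = conflict_graph(matches)
print(greedy_mwis(adj, weight=lambda p: 10 - len(matches[p])))
# an independent set, e.g. {'p1', 'p3'}
```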

Kastner, such as "140", developed a method for using this technique, aiming at instruction set generation, with a hybrid reconfigurable system. Thus the pattern is not given as an input and is generated as part of the problem itself. Once a pattern set is generated, a generic sub-graph isomorphism algorithm from "71" is used to find all matches on the input dag (we will discuss the sub-graph isomorphism in the fifth chapter). Then with a corresponding MIS problem, inspired to solve. Because the matching algorithm does not require the input to be a DAG, the practice can be extended to any graph-based input. In an extended version of this paper, Kastner and other "139" improved the matching algorithm in the fast rejection of non-similar sub-graphs. However, the resulting code may be suboptimal, and it is unclear how to handle unconventional schemas with complex constraints.

Brisk and other "33" also used the idea of MIS to execute the command selection on the architecture with the echo instruction. The echo instruction allows you to perform a program that uses the LZ77 algorithm "247" compression. Basically, a string can be shortened by using a pointer to replace the common substring in the string (that is, the original string can be rebuilt simply by copying the paste). With the use of the echo command, this idea can be used for machine code. An echo instruction is a small token that invokes a partial execution of the program before it, which reduces the size of the code. Note, however, that this does not result in a jump or function call and is therefore more efficient. This also means that the pattern set is not fixed, but must be determined as part of the problem. Like Kastner and so on, brisk and so on are also finding these patterns through a sub-graph isomorphism, but another algorithm called VF2 is applied "57". Although this algorithm is O (nn!) in the worst case scenario, the author reports that for most of the input dags in the experiment, it runs efficiently. The pattern is formed by aggregating the adjacent nodes in the DAG. The pattern that minimizes the size of the code is selected, and the input dag is updated by replacing the modal instance of the loop with a new node that represents the use of the ECHO directive. This process is repeated until no new pattern is found that is superior to a user-defined value criterion.

4.4. Other DAG-Based Approaches

Hanono and Devadas "120,121" propose a technique similar to the wess of the grille diagram (refer to section 3.8). Implemented in a relocatable code generator Aviv, the algorithm takes an input DAG and multiplies each operation node by the number of functional units that can run the operation. Data flows over special split nodes converge. If a transfer operation is required between two functional units, the algorithm injects a transport section that is recorded in this overhead. Then the instruction selection is reduced to the original input dag, looking for a path from the root node to each leaf. In this case, Hanono and Devadas apply a greedy heuristic to find these paths. However, this technique focuses on the optimization of VLIW schema instruction scheduling, thus simplifying the command selection by assuming that the DAG node corresponds to machine instruction one by one.

Sarkar and other "200" developed a greedy method for command selection, which is designed to minimize register pressure (register pressure) to facilitate scheduling and register allocation. Therefore, the overhead of each machine instruction is not the number of execution cycles, but rather the register pressure caused by it (the paper does not explain in detail how these costs are expressed in a formula). The instruction selection is done by entering the Dag into the forest (which is extended to a graph with additional data dependencies) and then applying the traditional tree overlay method to each tree. When determining where to perform these splits, a heuristic is applied. Once selected, the nodes in the input graph that are covered by the complex pattern are attributed to the Super node and check whether the map contains any loops. If so, this overlay is illegal because it contains a circular data dependency. Sarkar and so on in Jalape?o--IBM developed a register-based Java Virtual machine to implement and test their approach-the display has a 10% performance improvement over the small problem set than the default command selector.

Bednarski and Kessler "25" consider an integrated approach to code generation using ILP. Unlike their previous paper ("141,142"), which mainly considers instruction scheduling and register allocation, this method applies novel techniques to command selection and is worth recalling; in particular, they integrate pattern-matching issues into their ILP models. We will only briefly describe how the model was built (the suggested readers are interested in referring to the paper, where there is a detailed explanation). Roughly speaking, the ILP model assumes that for a given input Dag G, a sufficient number of pattern instances are generated (this uses a heuristic that calculates the upper limit). For each pattern instance p, the model contains a calculated variable (solution variables):

· map each pattern node of p to an input node of G;

· map each pattern edge of p to an input edge of G; and

· determine whether p is used at all (recall that surplus pattern instances may exist, so not all of them can be selected).

Hence, in addition to the typical linear inequalities enforcing coverage, the model also includes equations ensuring that each used pattern instance constitutes a valid match. Bednarski and Kessler implemented this approach in their OPTIMIST framework, solving the ILP model with the IBM CPLEX Optimizer [220]. For evaluation they compared their implementation against an integrated DP approach (also developed by them; see [141]) and found that the ILP method significantly reduced code generation time while producing code of equal quality. However, for several test cases, in which the largest input DAG contained only 33 nodes, the ILP method failed to produce any code at all. Without further improvements, this technique therefore does not seem applicable to production-quality compilers. The same approach was later adopted by Eriksson et al. [79].
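For flavour, a much simplified covering-style ILP using the PuLP modelling library is sketched below; it encodes only exact coverage and cost minimization, not the matching, scheduling, and register allocation constraints of the OPTIMIST model, and the match data is hypothetical:

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

# Hypothetical matches: pattern instance -> (cost, covered DAG nodes).
matches = {
    "mul":    (1, {"mul"}),
    "add":    (1, {"add"}),
    "load":   (1, {"load"}),
    "muladd": (1, {"mul", "add"}),          # a 2-node complex pattern
}
nodes = {"mul", "add", "load"}

prob = LpProblem("pattern_selection", LpMinimize)
x = {p: LpVariable(f"x_{p}", cat=LpBinary) for p in matches}

# Objective: minimize the total cost of the selected pattern instances.
prob += lpSum(cost * x[p] for p, (cost, _) in matches.items())

# Coverage: every DAG node is covered by exactly one selected instance.
for n in nodes:
    prob += lpSum(x[p] for p, (_, cov) in matches.items() if n in cov) == 1

prob.solve(PULP_CBC_CMD(msg=False))
print([p for p in matches if x[p].value() == 1])   # ['load', 'muladd']
```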

4.5. Summary

In this chapter we have surveyed several methods that rely on DAGs in one way or another. Working on DAGs rather than trees has several benefits. Most importantly, common subexpressions and multiple-output machine instructions can be modeled directly, which has made DAG covering the most commonly applied technique in the instruction selectors of modern compilers.

However, the price of going from trees to DAGs is that optimal instruction selection can no longer be achieved in linear time, since the problem becomes NP-complete. At the same time, not all kinds of input and patterns can be expressed as DAGs. For example, loops in the program give rise to cyclic edges, which restricts DAG covering to the scope of basic blocks. This obviously excludes complex machine instructions that implement whole loop computations, but, more importantly, it also forfeits opportunities to improve performance by keeping variables and temporaries in different forms and at different locations within a function; we will see an example of this in the next chapter. Finally, although some methods handle patterns containing disconnected subpatterns, such as SIMD instructions, they typically restrict each subpattern to a very simple DAG (usually a single node).

