LLVM 3.0 Type System Rewrite

Original address: http://blog.llvm.org/2011/11/llvm-30-type-system-rewrite.html

One of the most visible IR (and therefore compiler API) changes in LLVM 3.0 is a complete reimplementation of the LLVM IR type system. This change was a long time coming (the original type system dated back to LLVM 1.0): it makes the compiler faster, greatly simplifies a key subsystem of VMCore, and eliminates some design points of the IR that were often confusing and inconvenient. This post explains why the changes were made and how the new system works.

Goals of the type system

The LLVM IR type subsystem is a fairly straightforward part of the IR. The type system consists of three main parts: primitive types (like double and the integer types), derived types (like structs, arrays, and vectors), and a mechanism for handling forward declarations of types (opaque).

The type system has several important requirements that constrain its design: we want to be able to use efficient pointer equality checks to determine structural type equality, we want to allow types to be refined late (for example, at link time one module should be able to complete a type declared in another module), we want it to be easy to express many different source languages, and we want a simple and predictable system.

The only really difficult part of an IR type system is handling forward declarations of types. To see why, note that types in the IR are represented by a graph of types, and that graph is frequently cyclic. For example, a simple singly linked list of integers might be declared like this:

%intlist = type { %intlist*, i32 }

In this case, the type graph contains a StructType that points to an IntegerType and a PointerType, and that PointerType refers back to the StructType. In many real-world programs this graph is large and highly cyclic, particularly for C++ applications.

In this context, let's start by discussing how LLVM 2.9 and earlier versions work.

The old type system

The LLVM 2.9 type system was built entirely out of simple and straightforward pieces, using instances of OpaqueType to represent forward-declared types. When such a type was later resolved (for example, at link time), a process called "type refinement" updated all pointers to the old OpaqueType to point to the new definition, mutating the type graph in place, and then deleted the original OpaqueType. For example, if a module contained:

%T1 = type opaque

%T2 = type %T1*

Then %T2 is a PointerType that points to an OpaqueType. If we later resolve %T1 to {} (an empty struct), %T2 is mutated to be a PointerType pointing to the empty StructType. For more information on this, see the Type Resolution section of the LLVM 2.9 Programmer's Manual.
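
For a concrete feel for what this looked like, here is a rough sketch of the 2.x-era refinement pattern for building %intlist, in the style of the examples from the old Programmer's Manual. The exact signatures (OpaqueType::get, refineAbstractTypeTo, PATypeHolder) are recalled from that era's documentation and varied somewhat between releases, so treat this as illustrative rather than exact:

// Sketch of the old LLVM 2.x "refinement" pattern for building %intlist.
// Signatures are recalled from the 2.x Programmer's Manual and may not match
// any particular release exactly.
PATypeHolder Abstract = OpaqueType::get(Context);        // forward declaration
std::vector<const Type*> Elts;
Elts.push_back(PointerType::getUnqual(Abstract.get()));  // will become %intlist*
Elts.push_back(Type::getInt32Ty(Context));               // i32
StructType *STy = StructType::get(Context, Elts);
// Resolve the opaque type to the struct.  This mutates the type graph and can
// invalidate raw Type* pointers, which is why the PATypeHolder is needed to
// track the canonical, uniqued result.
cast<OpaqueType>(Abstract.get())->refineAbstractTypeTo(STy);
STy = cast<StructType>(Abstract.get());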

Problems with refinement and type mutation

Unfortunately, despite its conceptual simplicity, this type system had several problems. To keep pointer equality checks working as structural equality checks, VMCore had to re-unique types whenever they were mutated during type resolution. This may not sound like a big deal, but refining a single type can cause hundreds of other types to change, and refining hundreds or thousands of types is quite common (for example, when linking an application with LTO). This performance was unacceptable, particularly because re-uniquing cyclic graphs requires a full graph isomorphism check, and our previous implementation used an inefficient algorithm.

Another problem is that it was not only the types that needed updating: anything holding a pointer to a type being updated also had to be updated, or it would be left with a dangling pointer. This showed up in various ways: for example, every Value has a pointer to a type. To keep the system efficient, Value::getType() actually performed a lazy "union-find" step to ensure it always returned a canonical, uniqued type. This made Value::getType() (an extremely common call) more expensive than it should have been.
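
To illustrate the idea (this is not the actual VMCore code, just a minimal sketch of lazy forwarding with hypothetical names): each type effectively carried a "forwarded to" pointer that was set when it was refined away, and looking up a type meant chasing and collapsing that chain:

// Minimal illustration of union-find-style lazy type forwarding; these are
// hypothetical names, not the real VMCore data structures.
struct Ty {
  Ty *ForwardedTo;                  // non-null once this type was refined away
  Ty() : ForwardedTo(0) {}
};

static Ty *getCanonicalType(Ty *T) {
  while (T->ForwardedTo) {          // chase the forwarding chain...
    if (T->ForwardedTo->ForwardedTo)
      T->ForwardedTo = T->ForwardedTo->ForwardedTo;  // ...compressing the path
    T = T->ForwardedTo;
  }
  return T;
}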

One of the worst aspects of this "type updating" problem showed up when building IR through the LLVM API. Because types could move, it was easy to end up with dangling pointers, causing a lot of confusion and a large number of corrupted LLVM API clients. The fact that type refinement was required even to construct a simple recursive type made the problem worse. We tried to make this easier with the PATypeHolder and PATypeHandle classes, but they only worked when used correctly, and they were frequently misunderstood.

Surprising type uniquing behavior

Many LLVM clients, once they got things working, soon ran into a surprising aspect of type uniquing: type names are not part of the type system; they live in a separate "side" data structure. Names were not considered during type uniquing, and a single type could end up with multiple names (which led to a lot of confusion). For example, consider:

%T1 = type opaque

@G1 = external global %T1*

%T2 = type {}

@G2 = external global %T2*

If %T1 is later resolved to {}, then %T1 and %T2 both become names for the same empty struct type, and the type formerly known as "%T1*" is unified with the type formerly known as "%T2*". The IR would then print as:

%T1 = type {}

@G1 = external global %T1*

%T2 = type {}

@G2 = external global %T1*

Note that @G2 now has type "%T1*"! This is because names in the type system are just a side hash table, so the AsmPrinter picks one of the (possibly many) names a type has when printing it. This is "correct", but highly confusing and unhelpful behavior for anyone who does not know how the type system works. It also made it hard to read the .ll output of a C++ compiler, because it is common for many structurally identical types to have different names.

Type up-references

The last problem (that I'll mention here, anyway) is that we ended up in a situation where types could have no name at all. While this is not a problem from the perspective of the type graph, it makes it impossible to print such types if they are cyclic and nameless. The solution to this problem was a mechanism known as type up-references.

Type up-references are an elegant solution that allows the AsmPrinter (and parser) to represent arbitrary recursive types in finite space without requiring names. For example, the %intlist example above can be written as "{ \2*, i32 }". They also allow some nice (but surprising) types, such as "\1*", which is a pointer to itself.

Despite a certain beauty and elegance, type up-references were never well understood by most people and led to a lot of confusion. While it is important to be able to strip names from an LLVM IR module (e.g., with -strip), it is equally important for compiler developers to be able to understand the system.

Preparing for the new type system

Given all these problems, I realized that LLVM needed a new, simpler type system. However, it was just as important that the new LLVM still be able to read old .bc and .ll files. To make this possible, the big rewrite was carefully staged: LLVM 2.9's AsmPrinter was enhanced to print anonymous types as numbered types instead of using up-references. So instead of printing the %intlist example with an up-reference, LLVM 2.9 prints it as:

%0 = type { %0*, i32 }

Since LLVM 3.0 was already planning to drop compatibility with LLVM 2.8 (and earlier) files, this made the "upgrade" logic required in LLVM 3.0 much simpler.

LLVM 3.0 Type System

For most users, the new LLVM 3.0 type system feels much like the 2.9 type system. For example, .bc and .ll files produced by LLVM 2.9 can be read by the bitcode reader and the .ll parser and are automatically upgraded to 3.0 (though LLVM 3.1 will drop compatibility with 2.9). This is because the type system keeps almost everything the same: the primitive and derived types are unchanged; only OpaqueType was removed, and StructType was enhanced.

In short, instead of a refinement-based type system that mutates types in memory (requiring re-uniquing and moving/updating pointers), LLVM 3.0 uses a type system very similar to C's, based on type completion. Basically, instead of creating an opaque type and then replacing it later, you now create a StructType with no body and specify its body later. To create the %intlist type, you now write:

StructType *IntList = StructType::create(SomeLLVMContext, "intlist");

Type *Elts[] = { PointerType::getUnqual(IntList), Int32Type };

IntList->setBody(Elts);

This is nice and simple, and far better than the 2.9 way of doing things. However, there are a few non-obvious consequences of this design.
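
As a small aside not in the original post: before setBody is called, an identified struct reports itself as opaque, which is how "forward declared" structs are modeled in 3.0. A minimal sketch, assuming the LLVM 3.0 StructType API and reusing the names from the example above:

// Continuation of the example above (SomeLLVMContext and Int32Type as before).
StructType *Node = StructType::create(SomeLLVMContext, "node");  // no body yet
assert(Node->isOpaque());                  // an identified struct with no body

Type *NodeElts[] = { PointerType::getUnqual(Node), Int32Type };
Node->setBody(NodeElts);
assert(!Node->isOpaque());                 // body has now been specified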

Only struct types can be recursive

In the old type system, an OpaqueType could be resolved to any arbitrary type, allowing oddities like "%T1 = type %T1*", a pointer to itself. In the new type system, only IR struct types can have a missing body, so it is not possible to create a recursive type that does not involve a struct.

Literal and identified struct types

In the new type system there are actually two different kinds of struct types: "literal" structs (such as "{ i32, i32 }") and "identified" structs (such as "%ty = type { i32, i32 }").

Identified structs are the kind we have been discussing: they can have names, and they can have their bodies specified after the type is created. Identified structs are not uniqued, which is why they are produced by StructType::create(...). Because identified types are potentially recursive, the AsmPrinter always prints them by name (or by a number like %42 if the identified struct has no name).

Literal struct types work like the old IR struct types: they never have names and are uniqued by structure. This means they must have their body elements available at construction time, and they can never be recursive. When printed by the AsmPrinter, they are always printed inline without a name. Literal struct types are created by the StructType::get(...) method, reflecting that they are uniqued (the call may or may not actually allocate a new StructType).

We expect identified structs to be the most common, with frontends producing literal struct types only in special cases. For example, it is reasonable to use literal struct types for tuples, complex numbers, and other simple cases where names would be arbitrary and would make the IR harder to read. The optimizer does not care which you use, so if you are a frontend author, use whatever you'd like to see in your IR dumps.
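
To make the contrast concrete, here is a short sketch (Ctx stands for an LLVMContext; this uses the LLVM 3.0 StructType API described above):

Type *Elts[] = { Type::getInt32Ty(Ctx), Type::getInt32Ty(Ctx) };

// Literal structs are uniqued by structure: asking for the same element list
// twice returns the same object, so pointer equality is structural equality.
StructType *LitA = StructType::get(Ctx, Elts);
StructType *LitB = StructType::get(Ctx, Elts);
assert(LitA == LitB && !LitA->hasName());

// Identified structs are never uniqued: each create() returns a distinct type,
// even with an identical body and requested name.
StructType *IdA = StructType::create(Ctx, Elts, "pair");
StructType *IdB = StructType::create(Ctx, Elts, "pair");
assert(IdA != IdB);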

Identified structs have a 1-1 mapping with their names

Type names used to be kept in a "side" hash table; now they are an intrinsic part of the type itself, and the only types that can be named are identified structs. This means LLVM 3.0 no longer exhibits the confusing old behavior where two seemingly different structs print with the same name. When type names are stripped from a module, identified structs simply become anonymous: they are still "identified", they just have no name. As with other anonymous entities in LLVM IR, the AsmPrinter prints them numerically.

Struct names are unique at the LLVMContext level

Because StructType::create always returns a new identified type, we have to do something when you create two types with the same name. The solution is that VMCore detects the conflict and automatically renames the later request by adding a suffix to the type: when you ask for a type named "foo", you may actually get one named "foo.42". This is consistent with how other IR objects such as instructions and functions are handled, except that the names are made unique at the LLVMContext level.
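
For example (a sketch, again with Ctx an LLVMContext; the exact suffix chosen is an implementation detail, and the ".42" above is just illustrative):

StructType *A = StructType::create(Ctx, "foo");
StructType *B = StructType::create(Ctx, "foo");   // requests the same name

// A keeps the name "foo"; B is automatically renamed with a numeric suffix
// (e.g. "foo.0" or similar) so names stay unique within the LLVMContext.
assert(A->getName() == "foo" && B->getName() != "foo");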

Linker "link" type and re-typed IR object

One interesting aspect of this design is that it makes the IR linker's job a bit more complicated. Consider what happens when you link these two IR modules:

x.ll:

%A = type { i32 }

@G = external global %A

y.ll:

%A = type { i32 }

@G = global %A zeroinitializer

The first thing the linker does is load the two modules into the same LLVMContext. Because the two identified types named "%A" are distinct types (identified structs are never uniqued), and because only one type in the context can be named %A, we actually end up with these two modules in memory:

x.ll module:

%A = type { i32 }

@G = external global %A

y.ll module:

%A.1 = type { i32 }

@G = global %A.1 zeroinitializer

It is now clear that the two @G objects have different types. When linking these two global variables, it is up to the linker to remap the types of the IR objects into a consistent set and rewrite everything into a consistent state. This requires the linker to compute the set of identical types and resolve them itself, solving the problem that VMCore used to handle everywhere (if you are interested, see the type remapping logic in lib/Linker/LinkModules.cpp).

Putting this logic in the IR linker rather than in VMCore is better than the old design on many levels: the cost of merging and uniquing types is now paid only by the IR linker, not by all bitcode-reading and other IR-creating code. The code is easier to understand and algorithmically better, because it merges two complete graphs at a time rather than one type at a time. Finally, removing this complicated logic shrinks VMCore.

Identifying magic IR types in the optimizer (or later)

As in LLVM 2.9, type names are not really intended to carry semantic information in the IR: if you use -strip to remove all the names from the IR, everything is expected to keep working. However, for research and other purposes, it is sometimes convenient to propagate information from the frontend into LLVM IR via type names.

In LLVM 3.0 this works reliably (as long as you don't run -strip or anything with the same effect) because identified types are not uniqued. Note, however, that names can have suffixes added to them, so take this into account when writing your code.

A more robust way to identify a particular type in the optimizer (or at some other point after the frontend has run) is to use a named metadata node to find the type. For example, if you want to find the type %foo, you can generate IR like this:

%foo = type {...}

...

!magic.types = !{ %foo zeroinitializer }

Then, to find the "foo" type, you look up the "magic.types" named metadata and take the type of its first element. Even if type names are stripped, or the type is automatically renamed, the type of that first element will always be correct and stable.
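
A minimal sketch of that lookup from a pass, assuming the LLVM 3.0-era API in which metadata operands are still Values (this changed in later releases); getMagicType is a hypothetical helper:

// Recover the frontend's "%foo" type from the named metadata node.
// Hypothetical helper; error handling kept minimal.
Type *getMagicType(Module &M) {
  NamedMDNode *NMD = M.getNamedMetadata("magic.types");
  if (!NMD || NMD->getNumOperands() == 0)
    return 0;
  MDNode *Node = NMD->getOperand(0);
  Value *V = Node->getOperand(0);     // the "%foo zeroinitializer" constant
  return V->getType();                // the identified struct type %foo
}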

There seems to be some confusion about named metadata: unlike instruction-level metadata, it is not discarded or invalidated by the optimizer (as long as it does not point to functions or other IR objects that the optimizer modifies). In general, named metadata is a much more robust way to pass information from the frontend to an optimizer or backend than playing games with type names.

Conclusion

Overall, the new type system fixes several problems that have existed in LLVM IR for a long time. If you are upgrading code from LLVM 2.x to 3.x, chances are you will run into some of these issues. Hopefully this helps answer some common questions about why we made this change and how it works!
