C# Lexical Analyzer (VII): Summary

In the previous six articles, I covered the algorithms behind the lexical analyzer in detail. Those articles concentrate on implementation details and may feel somewhat disorganized, so this article gives an overall description of how to define a lexical analyzer and how to implement your own.

The second section is a complete description of how to define a lexical analyzer and can serve as a usage guide. If you are not interested in how the lexical analyzer is actually implemented, you can read only the second section.

I. Changes to the Class Library

First, I need to explain some changes I made to the class library. The lexical analysis interface has changed a little since I wrote the earlier articles in the "C# Lexical Analyzer" series, so a description is in order.

1. Identifier of the lexical unit

The lexical unit (token) was originally defined as a Token structure that used an int property as the identifier of the lexical unit, which is common practice in many lexical analyzers.

When I later worked on the syntax analysis, however, this proved inconvenient. Because generating lexical analyzer code from a definition file is not yet supported, the lexical analyzer can only be defined inside the program. An int carries no meaning of its own, so using it as a lexical unit identifier is both inconvenient and error-prone.

I then tried using a string as the identifier. That solves the semantic problem, but it is still error-prone, and the implementation becomes more complex (a string dictionary has to be kept).

A simple solution that keeps the semantics is to use an enumeration. Enumeration names provide the semantics, enumeration values can be converted to integers, and the compiler checks them, which rules out spelling errors entirely. The lexical unit is therefore now defined as the Token<T> class, and many of the classes associated with it also take the generic parameter T.
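The trade-off can be seen in a minimal sketch. SimpleToken<T> and the Kind enumeration below are hypothetical stand-ins written for illustration; they are not the library's Token<T>, whose members may differ.

```csharp
using System;

// Hypothetical identifiers for illustration; not taken from the library.
enum Kind { Number, Plus }

// A simplified stand-in for a generic token type; the real Token<T> may differ.
sealed class SimpleToken<T> where T : struct, Enum
{
    public T Id { get; }
    public string Text { get; }

    public SimpleToken(T id, string text)
    {
        Id = id;
        Text = text;
    }

    // Enumeration values still convert to integers, e.g. for table lookups.
    public int Index => Convert.ToInt32(Id);

    public override string ToString() => $"{Id}(\"{Text}\")";
}

static class TokenDemo
{
    static void Main()
    {
        var token = new SimpleToken<Kind>(Kind.Plus, "+");
        Console.WriteLine(token);        // prints Plus("+")
        Console.WriteLine(token.Index);  // prints 1
        // Writing Kind.Pluss here would be rejected by the compiler,
        // whereas a misspelled string or a wrong int would not.
    }
}
```

The generic constraint `where T : struct, Enum` (available since C# 7.3) is one way to restrict T to enumeration types at compile time.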

2. Namespaces

The namespaces were previously Cyjb.Compiler and Cyjb.Compiler.Lexer; they are now Cyjb.Compilers and Cyjb.Compilers.Lexers, since plural names are more appropriate for namespaces.

3. Lexical Analyzer Context

Previously, the lexical analyzer context could be switched using the context's index, its label, or the LexerContext instance itself. Now it can only be switched by label, which makes the implementation simpler and should have little impact on usage.

4. Representation of the DFA

The representation of the DFA in the original LexerRule class was rather crude, and it was hard to understand for anyone unfamiliar with the specific implementation. The interfaces of the current LexerRule<T> class should be easier to understand.

II. Defining the Lexical Analyzer

This section is a guide to using the lexical analyzer in the Cyjb.Compilers class library, with complete documentation, examples, and related notes. The source of the class library can be found in the Cyjb.Compilers project; see the wiki for the class library documentation.

1. Define the identifier of the lexical unit

As mentioned earlier, an enumeration type is now used as the identifier of the lexical unit, and the members of this enumeration can be defined freely. However, for the convenience of the syntax analysis part, the enumeration values must be integers starting from 0, and they should preferably be contiguous, because gaps in the values cause the syntax analysis part to waste more space.

The special value -1 represents the end of the file (EndOfFile). It can be obtained from the Token<T>.EndOfFile field, and the Token<T>.IsEndOfFile property reports whether a lexical unit represents the end of the file.

The calculator is again used as the example here, with an enumeration serving as the identifier type.

In use, this is clearly more convenient than an integer.
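A minimal sketch of such an enumeration follows. The member names of Calc are assumptions chosen for a calculator, and the CalcIds helper is hypothetical; it only checks the "starting from 0 and contiguous" requirement and builds an end-of-file marker from -1, assuming that is the reserved value.

```csharp
using System;
using System.Linq;

// Hypothetical calculator identifiers; the member names are assumptions.
enum Calc { Id, Add, Sub, Mul, Div, Pow, LBrace, RBrace }

static class CalcIds
{
    // True when the values of T are exactly 0, 1, ..., n - 1.
    public static bool IsContiguousFromZero<T>() where T : struct, Enum
    {
        int[] values = Enum.GetValues(typeof(T)).Cast<T>()
            .Select(v => Convert.ToInt32(v))
            .OrderBy(v => v)
            .ToArray();
        return values.Select((v, i) => v == i).All(ok => ok);
    }

    // An end-of-file marker built from -1 (assumed), which cannot collide
    // with members numbered from 0 upward.
    public static T EndOfFile<T>() where T : struct, Enum
        => (T)Enum.ToObject(typeof(T), -1);
}

static class CalcDemo
{
    static void Main()
    {
        Console.WriteLine(CalcIds.IsContiguousFromZero<Calc>());  // prints True
        Console.WriteLine((int)Calc.Pow);                         // prints 5
        Calc eof = CalcIds.EndOfFile<Calc>();
        Console.WriteLine(Enum.IsDefined(typeof(Calc), eof));     // prints False
    }
}
```

Leaving the members unnumbered, as here, lets the compiler assign 0, 1, 2, and so on, which satisfies both requirements without extra effort.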

2. Define the context of the lexical analyzer

All definitions of the lexical analyzer start from the Cyjb.Compilers.Grammar<T> class, so the first step is to create an instance of Grammar<T>.

The context of the lexical analyzer controls whether a rule takes effect. There are two types of context: inclusive and exclusive.

If the current context is inclusive, all rules of the current context are activated, along with all rules that do not specify any context.

If the current context is exclusive, only the rules of the current context are activated, and no other rule is.
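The two activation rules can be sketched as a small filter. The Rule, ContextKind, and Activation types below are hypothetical and exist only to model the behavior described above; they are not part of the class library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model for illustration; names do not come from the library.
enum ContextKind { Inclusive, Exclusive }

sealed class Rule
{
    public string Name { get; }
    public HashSet<string> Contexts { get; }  // empty means no context was specified

    public Rule(string name, params string[] contexts)
    {
        Name = name;
        Contexts = new HashSet<string>(contexts);
    }
}

static class Activation
{
    // Rules bound to the current context are always active; rules without
    // any context are active only when the current context is inclusive.
    public static List<string> ActiveRules(
        IEnumerable<Rule> rules, string current, ContextKind kind)
    {
        return rules
            .Where(r => r.Contexts.Contains(current)
                     || (kind == ContextKind.Inclusive && r.Contexts.Count == 0))
            .Select(r => r.Name)
            .ToList();
    }
}

static class ActivationDemo
{
    static void Main()
    {
        var rules = new[]
        {
            new Rule("digit"),         // no context specified
            new Rule("quote", "Str"),  // bound to the context "Str"
        };
        Console.WriteLine(string.Join(",",
            Activation.ActiveRules(rules, "Str", ContextKind.Inclusive)));  // prints digit,quote
        Console.WriteLine(string.Join(",",
            Activation.ActiveRules(rules, "Str", ContextKind.Exclusive)));  // prints quote
    }
}
```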

Separate methods define exclusive and inclusive lexical analyzer contexts; in both cases the label parameter is the label of the context.

The default lexical analyzer context is "Initial"; switching to this label returns to the default context. Note that, for implementation reasons, context definitions must precede all non-terminal definitions.

For example, one might define an inclusive context named Inc and an exclusive context named Exc.
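As a rough sketch of how label-based contexts behave, the hypothetical ContextTable below keeps contexts keyed by label, starts in "Initial", switches only by label, and refuses new contexts once symbol definitions have begun. None of these type or method names are the library's API, and treating "Initial" as inclusive is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; these names are not the library's API.
sealed class ContextTable
{
    public const string InitialLabel = "Initial";

    // Maps each label to whether the context is exclusive.
    // "Initial" is assumed to be inclusive here.
    private readonly Dictionary<string, bool> exclusiveByLabel =
        new Dictionary<string, bool> { [InitialLabel] = false };
    private bool definitionsClosed;

    public string Current { get; private set; } = InitialLabel;

    public void DefineInclusive(string label) => Define(label, exclusive: false);
    public void DefineExclusive(string label) => Define(label, exclusive: true);

    private void Define(string label, bool exclusive)
    {
        if (definitionsClosed)
            throw new InvalidOperationException(
                "Contexts must be defined before any symbol definitions.");
        exclusiveByLabel.Add(label, exclusive);  // throws on a duplicate label
    }

    // Called when the first symbol is defined; no contexts may follow.
    public void BeginSymbolDefinitions() => definitionsClosed = true;

    // Contexts are switched by label only.
    public void SwitchTo(string label)
    {
        if (!exclusiveByLabel.ContainsKey(label))
            throw new ArgumentException($"Unknown context '{label}'.", nameof(label));
        Current = label;
    }

    public bool IsExclusive(string label) => exclusiveByLabel[label];
}

static class ContextDemo
{
    static void Main()
    {
        var contexts = new ContextTable();
        contexts.DefineInclusive("Inc");
        contexts.DefineExclusive("Exc");
        contexts.BeginSymbolDefinitions();

        contexts.SwitchTo("Exc");
        Console.WriteLine(contexts.Current);           // prints Exc
        contexts.SwitchTo(ContextTable.InitialLabel);
        Console.WriteLine(contexts.Current);           // prints Initial
    }
}
```

Keying everything by label keeps the lookup to a single dictionary, which matches the motivation given earlier for dropping index-based and instance-based switching.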