Parser Combinators

Source: Internet
Author: User

This article is quoted from: http://www.ibm.com/developerworks/cn/java/j-lo-compose/

Ward Cunningham once said that clean code clearly expresses what its author intended, while beautiful code goes a step further: it looks as if it exists specifically for the problem being solved. In this article, we walk through the design and implementation of a combinator-based parser. The final code is elegant and highly extensible, and it reads as though it exists solely to parse the grammar at hand. We then pick an example from the H.248 protocol and implement its parser with these parser combinators. Along the way, readers can appreciate the beauty of the code and also learn something about functional programming and about building a DSL.

DSL Design Basics

When we implement a feature in a programming language such as Java, we are instructing the computer how to carry it out. But a language gives us more than that. More importantly, it provides a framework for organizing our thinking and expressing computational processes. The core of this framework is how simple concepts can be combined into more complex ones while keeping the means of combination closed; that is, a compound concept is indistinguishable from a simple one as far as the means of combination are concerned. According to the book "Structure and Interpretation of Computer Programs" (see the reference resources), every powerful language achieves this through the following three mechanisms:

    • Atoms: the simplest and most basic entities in the language;
    • Means of combination: the methods for combining atoms into more complex entities;
    • Means of abstraction: the methods for naming complex entities, so that a named entity can be combined in the same way as an atom to form still more complex entities.

A general-purpose language like Java is concerned with solving problems in general, so its three mechanisms are general-purpose as well. When we solve a specific problem, there is a semantic gap between this general-purpose machinery and the concepts and rules of the problem domain, so concepts and rules that are perfectly clear in the domain can become much less clear once implemented. For programmers, implementing functionality with clean code is only the basic requirement. More important is to raise the level of the general-purpose language and build a language for the specific problem domain (a DSL). The key step in this process is to find and define the domain's atomic concepts, means of combination, and means of abstraction. A DSL need not be as complete as a general-purpose language; its goal is to express the concepts and rules of the problem domain clearly and intuitively. The result is that the general-purpose programming language is turned into a dedicated language for solving a particular kind of problem.

We demonstrated this idea in the article "Designing and implementing a DSL with a Java-based interface layout" (see the reference resources), where we built a DSL for interface layout. In this article, we take parser construction as the example and show how to build a DSL for string parsing. This DSL is highly extensible: readers can define their own combinators to suit their needs. Readers can also get a taste of the elegance of functional programming.

Parser Atoms

What is a parser? What is the simplest parser? A parser is generally understood as something that determines whether an input string satisfies given grammar rules and, if necessary, extracts the corresponding syntactic units. Conceptually there is nothing wrong with this understanding, but defining a DSL for parsing requires something more precise: we have to give the parser concept an exact type. In Java, we use an interface to define the parser type, as follows:

    interface Parser {
        public Result parse(String target);
    }

Here target is the string to parse and Result is the result of parsing. Any object that satisfies the semantics of this interface is called a parser instance. Result is defined as follows:

    class Result {
        private String recognized;
        private String remaining;
        private boolean succeeded;

        private Result(String recognized, String remaining,
                boolean succeeded) {
            this.recognized = recognized;
            this.remaining = remaining;
            this.succeeded = succeeded;
        }

        public boolean is_succeeded() {
            return succeeded;
        }

        public String get_recognized() {
            return recognized;
        }

        public String get_remaining() {
            return remaining;
        }

        public static Result succeed(String recognized,
                String remaining) {
            return new Result(recognized, remaining, true);
        }

        public static Result fail() {
            return new Result("", "", false);
        }
    }

The recognized field holds the part of the input that the parser recognized, remaining holds what is left of the input after parsing, and succeeded indicates whether parsing succeeded. Result is a value object. With an exact definition of a parser in hand, we can define the simplest parser. Obviously, the simplest parser is one that parses nothing and returns the target string unchanged. We call it Zero, and it is defined as follows:

    class Zero implements Parser {
        public Result parse(String target) {
            return Result.succeed("", target);
        }
    }

The Zero parser always succeeds: it recognizes no syntactic unit and returns the target string unchanged as the remainder. Now let's define another very simple parser, Item. If the target string is not empty, Item takes its first character as the recognized result and succeeds; if the target string is empty, it fails. Item is defined as follows:

    class Item implements Parser {
        public Result parse(String target) {
            if (target.length() > 0) {
                return Result.succeed(target.substring(0, 1),
                    target.substring(1));
            }
            return Result.fail();
        }
    }

Zero and Item are the only two atoms in our parser DSL. In the next section, we define the parser combinators.
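To see the two atoms at work, here is a minimal, self-contained sketch. It repeats Parser, Result, Zero, and Item from above as nested classes so that it runs on its own; the wrapper class AtomsDemo and the helper show are hypothetical names introduced only for this illustration.

```java
class AtomsDemo {
    interface Parser { Result parse(String target); }

    static final class Result {
        private final String recognized, remaining;
        private final boolean succeeded;
        private Result(String recognized, String remaining, boolean succeeded) {
            this.recognized = recognized;
            this.remaining = remaining;
            this.succeeded = succeeded;
        }
        boolean is_succeeded() { return succeeded; }
        String get_recognized() { return recognized; }
        String get_remaining() { return remaining; }
        static Result succeed(String recognized, String remaining) {
            return new Result(recognized, remaining, true);
        }
        static Result fail() { return new Result("", "", false); }
    }

    // Recognizes nothing and hands the whole input back as the remainder.
    static class Zero implements Parser {
        public Result parse(String target) { return Result.succeed("", target); }
    }

    // Recognizes exactly one character; fails on empty input.
    static class Item implements Parser {
        public Result parse(String target) {
            if (target.length() > 0) {
                return Result.succeed(target.substring(0, 1), target.substring(1));
            }
            return Result.fail();
        }
    }

    // Hypothetical helper: renders a Result as text for printing.
    static String show(Result r) {
        return r.is_succeeded()
            ? "ok [" + r.get_recognized() + "] rest [" + r.get_remaining() + "]"
            : "fail";
    }

    public static void main(String[] args) {
        System.out.println(show(new Zero().parse("abc"))); // ok [] rest [abc]
        System.out.println(show(new Item().parse("abc"))); // ok [a] rest [bc]
        System.out.println(show(new Item().parse("")));    // fail
    }
}
```

Note that Zero succeeds even on an empty string, whereas Item needs at least one character to consume.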

Parser Combinators

The Item parser defined in the previous section unconditionally recognizes the first character of the target string. To make parsing conditional, we can define a SAT combinator that takes a predicate and a parser and produces a composite parser. The composite parser succeeds only if the original parser succeeds and its recognized result satisfies the predicate. The predicate and SAT are defined as follows:

    interface Predicate {
        public boolean satisfy(String value);
    }

    class SAT implements Parser {
        private Predicate pre;
        private Parser parser;

        public SAT(Predicate predicate, Parser parser) {
            this.pre = predicate;
            this.parser = parser;
        }

        public Result parse(String target) {
            Result r = parser.parse(target);
            // Succeed only if the inner parser succeeded and its
            // recognized text passes the predicate.
            if (r.is_succeeded() && pre.satisfy(r.get_recognized())) {
                return r;
            }
            return Result.fail();
        }
    }

To define a parser that parses a single digit, we define an IsDigit predicate and combine it with Item through SAT, as follows:

    class IsDigit implements Predicate {
        public boolean satisfy(String value) {
            char c = value.charAt(0);
            return c >= '0' && c <= '9';
        }
    }

The parser digit, which parses a single digit, is defined as follows:

    Parser digit = new SAT(new IsDigit(), new Item());

In the same way, we can build parsers for a single letter, a single uppercase letter, a single lowercase letter, and so on.
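As an illustration of "the same way", the sketch below builds single-uppercase-letter and single-lowercase-letter parsers with SAT. The predicate classes IsUpper and IsLower, the parser names upper and lower, the wrapper class SatDemo, and the helper show are all hypothetical names chosen for this example; only Parser, Predicate, Result, Item, and SAT come from the article.

```java
class SatDemo {
    interface Parser { Result parse(String target); }
    interface Predicate { boolean satisfy(String value); }

    static final class Result {
        private final String recognized, remaining;
        private final boolean succeeded;
        private Result(String recognized, String remaining, boolean succeeded) {
            this.recognized = recognized;
            this.remaining = remaining;
            this.succeeded = succeeded;
        }
        boolean is_succeeded() { return succeeded; }
        String get_recognized() { return recognized; }
        String get_remaining() { return remaining; }
        static Result succeed(String recognized, String remaining) {
            return new Result(recognized, remaining, true);
        }
        static Result fail() { return new Result("", "", false); }
    }

    static class Item implements Parser {
        public Result parse(String target) {
            if (target.length() > 0) {
                return Result.succeed(target.substring(0, 1), target.substring(1));
            }
            return Result.fail();
        }
    }

    static class SAT implements Parser {
        private final Predicate pre;
        private final Parser parser;
        SAT(Predicate pre, Parser parser) { this.pre = pre; this.parser = parser; }
        public Result parse(String target) {
            Result r = parser.parse(target);
            return r.is_succeeded() && pre.satisfy(r.get_recognized()) ? r : Result.fail();
        }
    }

    // Hypothetical predicates, written in the same style as IsDigit.
    static class IsUpper implements Predicate {
        public boolean satisfy(String value) {
            char c = value.charAt(0);
            return c >= 'A' && c <= 'Z';
        }
    }
    static class IsLower implements Predicate {
        public boolean satisfy(String value) {
            char c = value.charAt(0);
            return c >= 'a' && c <= 'z';
        }
    }

    static final Parser upper = new SAT(new IsUpper(), new Item());
    static final Parser lower = new SAT(new IsLower(), new Item());

    // Hypothetical helper: renders a Result as text for printing.
    static String show(Result r) {
        return r.is_succeeded()
            ? "ok [" + r.get_recognized() + "] rest [" + r.get_remaining() + "]"
            : "fail";
    }

    public static void main(String[] args) {
        System.out.println(show(upper.parse("Hello"))); // ok [H] rest [ello]
        System.out.println(show(upper.parse("hello"))); // fail
        System.out.println(show(lower.parse("hello"))); // ok [h] rest [ello]
        System.out.println(show(lower.parse("")));      // fail
    }
}
```

On empty input Item fails first, so the predicate never sees an empty string and charAt(0) is safe.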

Next, we define the OR combinator. It takes two parsers and applies them in turn to the target string: if the first succeeds, its result is returned; otherwise the result of the second is returned. The composite parser therefore succeeds if either parser succeeds and fails only if both fail. The code is as follows:

    class OR implements Parser {
        private Parser p1;
        private Parser p2;

        public OR(Parser p1, Parser p2) {
            this.p1 = p1;
            this.p2 = p2;
        }

        public Result parse(String target) {
            Result r = p1.parse(target);
            return r.is_succeeded() ? r : p2.parse(target);
        }
    }

We can now define a new parser, digit_or_alpha, which succeeds if the first character of the target is a digit or a letter and fails otherwise. The code is as follows:

The predicate that tests whether a character is a letter:

    class IsAlpha implements Predicate {
        public boolean satisfy(String value) {
            char c = value.charAt(0);
            return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        }
    }

Parser for parsing a single letter:

    Parser alpha = new SAT(new IsAlpha(), new Item());

Definition of the digit_or_alpha parser:

    Parser digit_or_alpha = new OR(digit, alpha);
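The following self-contained sketch exercises digit_or_alpha on a few inputs. To keep it short, it assumes Java 8 or later and writes the two predicates as lambdas instead of the IsDigit and IsAlpha classes; the wrapper class OrDemo and the helper show are hypothetical names used only here.

```java
class OrDemo {
    interface Parser { Result parse(String target); }
    interface Predicate { boolean satisfy(String value); }

    static final class Result {
        private final String recognized, remaining;
        private final boolean succeeded;
        private Result(String recognized, String remaining, boolean succeeded) {
            this.recognized = recognized;
            this.remaining = remaining;
            this.succeeded = succeeded;
        }
        boolean is_succeeded() { return succeeded; }
        String get_recognized() { return recognized; }
        String get_remaining() { return remaining; }
        static Result succeed(String recognized, String remaining) {
            return new Result(recognized, remaining, true);
        }
        static Result fail() { return new Result("", "", false); }
    }

    static class Item implements Parser {
        public Result parse(String target) {
            if (target.length() > 0) {
                return Result.succeed(target.substring(0, 1), target.substring(1));
            }
            return Result.fail();
        }
    }

    static class SAT implements Parser {
        private final Predicate pre;
        private final Parser parser;
        SAT(Predicate pre, Parser parser) { this.pre = pre; this.parser = parser; }
        public Result parse(String target) {
            Result r = parser.parse(target);
            return r.is_succeeded() && pre.satisfy(r.get_recognized()) ? r : Result.fail();
        }
    }

    // Tries p1 first and falls back to p2 on the same, untouched input.
    static class OR implements Parser {
        private final Parser p1, p2;
        OR(Parser p1, Parser p2) { this.p1 = p1; this.p2 = p2; }
        public Result parse(String target) {
            Result r = p1.parse(target);
            return r.is_succeeded() ? r : p2.parse(target);
        }
    }

    // Lambdas stand in for the IsDigit and IsAlpha classes (an assumption for brevity).
    static final Parser digit = new SAT(v -> v.charAt(0) >= '0' && v.charAt(0) <= '9', new Item());
    static final Parser alpha = new SAT(v -> {
        char c = v.charAt(0);
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }, new Item());
    static final Parser digit_or_alpha = new OR(digit, alpha);

    // Hypothetical helper: renders a Result as text for printing.
    static String show(Result r) {
        return r.is_succeeded()
            ? "ok [" + r.get_recognized() + "] rest [" + r.get_remaining() + "]"
            : "fail";
    }

    public static void main(String[] args) {
        System.out.println(show(digit_or_alpha.parse("7up"))); // ok [7] rest [up]
        System.out.println(show(digit_or_alpha.parse("up")));  // ok [u] rest [p]
        System.out.println(show(digit_or_alpha.parse("_x")));  // fail
    }
}
```

Because a failed parser consumes nothing, OR can simply hand the original target to the second parser.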

Next we define the sequence combinator SEQ. It takes two parsers and applies the first to the target string. If that succeeds, the second parser is applied to the string left over by the first. The composite parser succeeds only if both parsers succeed; if either fails, it fails. On success, the recognized result is the concatenation of the two parsers' recognized results.

To concatenate the recognized parts of two Result objects, we add a static method, concat, to the Result class. It is defined as follows:

    public static Result concat(Result r1, Result r2) {
        return new Result(
            r1.get_recognized().concat(r2.get_recognized()),
            r2.get_remaining(), true);
    }

The sequence combinator SEQ is defined as follows:

    class SEQ implements Parser {
        private Parser p1;
        private Parser p2;

        public SEQ(Parser p1, Parser p2) {
            this.p1 = p1;
            this.p2 = p2;
        }

        public Result parse(String target) {
            Result r1 = p1.parse(target);
            if (r1.is_succeeded()) {
                // The second parser starts where the first one stopped.
                Result r2 = p2.parse(r1.get_remaining());
                if (r2.is_succeeded()) {
                    return Result.concat(r1, r2);
                }
            }
            return Result.fail();
        }
    }

Now, to define a parser that recognizes a letter followed by a digit, we can write:

    Parser alpha_before_digit = new SEQ(alpha, digit);
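A small runnable sketch of alpha_before_digit follows. As before, it assumes Java 8 or later and uses lambdas for the predicates; SeqDemo and show are hypothetical names introduced for this illustration only.

```java
class SeqDemo {
    interface Parser { Result parse(String target); }
    interface Predicate { boolean satisfy(String value); }

    static final class Result {
        private final String recognized, remaining;
        private final boolean succeeded;
        private Result(String recognized, String remaining, boolean succeeded) {
            this.recognized = recognized;
            this.remaining = remaining;
            this.succeeded = succeeded;
        }
        boolean is_succeeded() { return succeeded; }
        String get_recognized() { return recognized; }
        String get_remaining() { return remaining; }
        static Result succeed(String recognized, String remaining) {
            return new Result(recognized, remaining, true);
        }
        static Result fail() { return new Result("", "", false); }
        // Joins the recognized parts; the remainder is whatever r2 left over.
        static Result concat(Result r1, Result r2) {
            return new Result(r1.recognized.concat(r2.recognized), r2.remaining, true);
        }
    }

    static class Item implements Parser {
        public Result parse(String target) {
            if (target.length() > 0) {
                return Result.succeed(target.substring(0, 1), target.substring(1));
            }
            return Result.fail();
        }
    }

    static class SAT implements Parser {
        private final Predicate pre;
        private final Parser parser;
        SAT(Predicate pre, Parser parser) { this.pre = pre; this.parser = parser; }
        public Result parse(String target) {
            Result r = parser.parse(target);
            return r.is_succeeded() && pre.satisfy(r.get_recognized()) ? r : Result.fail();
        }
    }

    static class SEQ implements Parser {
        private final Parser p1, p2;
        SEQ(Parser p1, Parser p2) { this.p1 = p1; this.p2 = p2; }
        public Result parse(String target) {
            Result r1 = p1.parse(target);
            if (r1.is_succeeded()) {
                Result r2 = p2.parse(r1.get_remaining());
                if (r2.is_succeeded()) {
                    return Result.concat(r1, r2);
                }
            }
            return Result.fail();
        }
    }

    // Lambdas stand in for the IsDigit and IsAlpha classes (an assumption for brevity).
    static final Parser digit = new SAT(v -> v.charAt(0) >= '0' && v.charAt(0) <= '9', new Item());
    static final Parser alpha = new SAT(v -> {
        char c = v.charAt(0);
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }, new Item());
    static final Parser alpha_before_digit = new SEQ(alpha, digit);

    // Hypothetical helper: renders a Result as text for printing.
    static String show(Result r) {
        return r.is_succeeded()
            ? "ok [" + r.get_recognized() + "] rest [" + r.get_remaining() + "]"
            : "fail";
    }

    public static void main(String[] args) {
        System.out.println(show(alpha_before_digit.parse("a1b"))); // ok [a1] rest [b]
        System.out.println(show(alpha_before_digit.parse("ab")));  // fail
        System.out.println(show(alpha_before_digit.parse("1a")));  // fail
    }
}
```

When the second parser fails, the whole sequence fails and nothing is reported as recognized, even though the first parser had matched.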

Finally, we define the last combinator of this article: OneOrMany. It takes a parser and a positive integer and produces a composite parser that applies the original parser to the target string repeatedly. The input of each application is the string left over by the previous one, and the positive integer gives the maximum number of applications. If the first application fails, the composite parser fails. Otherwise parsing continues until the maximum count is reached or an application fails, and the recognized results of all successful applications are concatenated to form the composite parser's recognized result. OneOrMany is defined as follows:

    class OneOrMany implements Parser {
        private int max;
        private Parser parser;

        public OneOrMany(int max, Parser parser) {
            this.max = max;
            this.parser = parser;
        }

        public Result parse(String target) {
            Result r = parser.parse(target);
            return r.is_succeeded() ? parse2(r, 1) : Result.fail();
        }

        // pre is the result accumulated so far; count is the number
        // of successful applications it contains.
        private Result parse2(Result pre, int count) {
            if (count >= max) return pre;
            Result r = parser.parse(pre.get_remaining());
            return r.is_succeeded() ?
                parse2(Result.concat(pre, r), count + 1) : pre;
        }
    }

With this combinator, we can easily define a parser that recognizes a string of at least one and at most 10 letters, as follows:

    Parser one_to_ten_alpha = new OneOrMany(10, alpha);
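The sketch below shows how one_to_ten_alpha behaves when the input runs out of letters, starts with a non-letter, or contains more than 10 letters. It assumes Java 8 or later (a lambda replaces the IsAlpha class); ManyDemo and show are hypothetical names for this example.

```java
class ManyDemo {
    interface Parser { Result parse(String target); }
    interface Predicate { boolean satisfy(String value); }

    static final class Result {
        private final String recognized, remaining;
        private final boolean succeeded;
        private Result(String recognized, String remaining, boolean succeeded) {
            this.recognized = recognized;
            this.remaining = remaining;
            this.succeeded = succeeded;
        }
        boolean is_succeeded() { return succeeded; }
        String get_recognized() { return recognized; }
        String get_remaining() { return remaining; }
        static Result succeed(String recognized, String remaining) {
            return new Result(recognized, remaining, true);
        }
        static Result fail() { return new Result("", "", false); }
        static Result concat(Result r1, Result r2) {
            return new Result(r1.recognized.concat(r2.recognized), r2.remaining, true);
        }
    }

    static class Item implements Parser {
        public Result parse(String target) {
            if (target.length() > 0) {
                return Result.succeed(target.substring(0, 1), target.substring(1));
            }
            return Result.fail();
        }
    }

    static class SAT implements Parser {
        private final Predicate pre;
        private final Parser parser;
        SAT(Predicate pre, Parser parser) { this.pre = pre; this.parser = parser; }
        public Result parse(String target) {
            Result r = parser.parse(target);
            return r.is_succeeded() && pre.satisfy(r.get_recognized()) ? r : Result.fail();
        }
    }

    static class OneOrMany implements Parser {
        private final int max;
        private final Parser parser;
        OneOrMany(int max, Parser parser) { this.max = max; this.parser = parser; }
        public Result parse(String target) {
            Result r = parser.parse(target);
            return r.is_succeeded() ? parse2(r, 1) : Result.fail();
        }
        // Keeps applying the parser until max is reached or an application fails.
        private Result parse2(Result pre, int count) {
            if (count >= max) return pre;
            Result r = parser.parse(pre.get_remaining());
            return r.is_succeeded() ? parse2(Result.concat(pre, r), count + 1) : pre;
        }
    }

    // A lambda stands in for the IsAlpha class (an assumption for brevity).
    static final Parser alpha = new SAT(v -> {
        char c = v.charAt(0);
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }, new Item());
    static final Parser one_to_ten_alpha = new OneOrMany(10, alpha);

    // Hypothetical helper: renders a Result as text for printing.
    static String show(Result r) {
        return r.is_succeeded()
            ? "ok [" + r.get_recognized() + "] rest [" + r.get_remaining() + "]"
            : "fail";
    }

    public static void main(String[] args) {
        System.out.println(show(one_to_ten_alpha.parse("abc123")));       // ok [abc] rest [123]
        System.out.println(show(one_to_ten_alpha.parse("123")));          // fail
        System.out.println(show(one_to_ten_alpha.parse("abcdefghijkl"))); // ok [abcdefghij] rest [kl]
    }
}
```

The parser stops at the maximum count without failing, so any extra letters simply stay in the remainder.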

These are all the combinators defined in this article, but readers can use the same approach to define other combinators that suit their own needs.
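As one possible example of a reader-defined combinator, the sketch below adds OPT, which makes a parser optional: it returns the inner parser's result on success and otherwise succeeds without consuming anything. OPT is not part of the original article; it, the parser names minus and opt_minus, the wrapper class OptDemo, and the helper show are hypothetical. The predicate is a lambda, so Java 8 or later is assumed.

```java
class OptDemo {
    interface Parser { Result parse(String target); }
    interface Predicate { boolean satisfy(String value); }

    static final class Result {
        private final String recognized, remaining;
        private final boolean succeeded;
        private Result(String recognized, String remaining, boolean succeeded) {
            this.recognized = recognized;
            this.remaining = remaining;
            this.succeeded = succeeded;
        }
        boolean is_succeeded() { return succeeded; }
        String get_recognized() { return recognized; }
        String get_remaining() { return remaining; }
        static Result succeed(String recognized, String remaining) {
            return new Result(recognized, remaining, true);
        }
        static Result fail() { return new Result("", "", false); }
    }

    static class Item implements Parser {
        public Result parse(String target) {
            if (target.length() > 0) {
                return Result.succeed(target.substring(0, 1), target.substring(1));
            }
            return Result.fail();
        }
    }

    static class SAT implements Parser {
        private final Predicate pre;
        private final Parser parser;
        SAT(Predicate pre, Parser parser) { this.pre = pre; this.parser = parser; }
        public Result parse(String target) {
            Result r = parser.parse(target);
            return r.is_succeeded() && pre.satisfy(r.get_recognized()) ? r : Result.fail();
        }
    }

    // Hypothetical combinator: zero or one occurrence of the given parser.
    static class OPT implements Parser {
        private final Parser parser;
        OPT(Parser parser) { this.parser = parser; }
        public Result parse(String target) {
            Result r = parser.parse(target);
            // On failure, behave like Zero: recognize nothing, keep the input.
            return r.is_succeeded() ? r : Result.succeed("", target);
        }
    }

    static final Parser minus = new SAT(v -> v.equals("-"), new Item());
    static final Parser opt_minus = new OPT(minus);

    // Hypothetical helper: renders a Result as text for printing.
    static String show(Result r) {
        return r.is_succeeded()
            ? "ok [" + r.get_recognized() + "] rest [" + r.get_remaining() + "]"
            : "fail";
    }

    public static void main(String[] args) {
        System.out.println(show(opt_minus.parse("-5"))); // ok [-] rest [5]
        System.out.println(show(opt_minus.parse("5")));  // ok [] rest [5]
        System.out.println(show(opt_minus.parse("")));   // ok [] rest []
    }
}
```

The same behavior could also be obtained by composing existing pieces as new OR(parser, new Zero()); a dedicated class mainly gives the idea a name.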

Means of Abstraction

If a DSL offers only a few atoms and combinators, and the results of combination cannot be combined again, its extensibility and applicability are severely limited. Conversely, if we also provide a means of abstraction for naming the results of combination, so that named composite entities can take part in further combination just like atoms, the DSL becomes far more extensible and far more widely applicable. Means of abstraction are therefore crucial when building a DSL.

Attentive readers may have noticed that we have already been using abstraction in the preceding sections. For example, we named the composite parsers alpha, digit, digit_or_alpha, and alpha_before_digit, and then used those names in further combinations. Because our parsers are defined on top of Java's interface mechanism, the abstraction facilities that Java already provides for interfaces apply fully to our parsing DSL. We therefore do not need to define our own means of abstraction; we can simply use those of the Java language.

The examples in the previous section should have shown how powerful combination and abstraction are. In the next section, we give a more concrete example: building a parser for the NAME syntax of the H.248 protocol.

An H.248 Example

In this section, we build a parser that recognizes the NAME syntax of the H.248 protocol, using the parser atoms and combinators defined earlier.

H.248 is the communication protocol that a media gateway controller uses to control media gateways. It is a text-based protocol whose grammar is written in ABNF (Augmented BNF), which defines the components and concrete content of H.248 messages. The details of H.248 are beyond the scope of this article; interested readers can find more in the reference resources. We focus only on the definition of the NAME syntax, which is as follows:

    NAME  = ALPHA *63(ALPHA / DIGIT / "_")
    ALPHA = %x41-5A / %x61-7A   ; A-Z, a-z
    DIGIT = %x30-39             ; digits 0 through 9

Let us first explain some of these rules. *63 is an instance of the repetition rule n*m, which means at least n and at most m occurrences; when n is 0, it can be abbreviated as *m. So *63 means at least 0 and at most 63 occurrences. / denotes alternation: either the entity on its left or the one on its right. () groups the enclosed entities into a single unit. - denotes a range. Therefore DIGIT is a single digit, ALPHA is a single letter (uppercase or lowercase), and (ALPHA / DIGIT / "_") is a letter, a digit, or an underscore. *63(ALPHA / DIGIT / "_") means at least 0 and at most 63 letters, digits, or underscores. Writing two entities one after the other denotes sequence, so ALPHA *63(ALPHA / DIGIT / "_") means a letter followed by at least 0 and at most 63 letters, digits, or underscores. More rules can be found in the reference resources.

With what we have built so far, the parser for this grammar rule can be expressed almost directly, as follows:

    class H248Parsec {
        public static Parser alpha() {
            return new SAT(new IsAlpha(), new Item());
        }

        public static Parser digit() {
            return new SAT(new IsDigit(), new Item());
        }

        // IsUnderline is a Predicate that accepts "_"; its definition
        // is not shown in this article.
        public static Parser underline() {
            return new SAT(new IsUnderline(), new Item());
        }

        public static Parser digit_or_alpha_or_underline() {
            return new OR(alpha(), new OR(digit(), underline()));
        }

        public static Parser zero_or_many(int max, Parser parser) {
            return new OR(new OneOrMany(max, parser), new Zero());
        }

        public static Parser name() {
            return new SEQ(alpha(),
                zero_or_many(63,
                digit_or_alpha_or_underline()));
        }
    }

As you can see, our code matches the protocol's grammar almost exactly. By defining our own parsing-oriented DSL, we have turned the general-purpose Java language into a language dedicated to ABNF parsing, which fits Ward Cunningham's definition of beautiful code. Finally, we use this parser to run some recognition experiments on the NAME syntax, as shown in the following table:
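Experiments of this kind can be reproduced with a small harness. The sketch below is self-contained and assembles the NAME parser from the atoms and combinators of this article. The sample inputs are illustrative assumptions rather than rows of the original table, the predicates are written as lambdas (Java 8 or later is assumed), and NameDemo, sat, and run are hypothetical names.

```java
class NameDemo {
    interface Parser { Result parse(String target); }
    interface Predicate { boolean satisfy(String value); }

    static final class Result {
        private final String recognized, remaining;
        private final boolean succeeded;
        private Result(String recognized, String remaining, boolean succeeded) {
            this.recognized = recognized;
            this.remaining = remaining;
            this.succeeded = succeeded;
        }
        boolean is_succeeded() { return succeeded; }
        String get_recognized() { return recognized; }
        String get_remaining() { return remaining; }
        static Result succeed(String recognized, String remaining) {
            return new Result(recognized, remaining, true);
        }
        static Result fail() { return new Result("", "", false); }
        static Result concat(Result r1, Result r2) {
            return new Result(r1.recognized.concat(r2.recognized), r2.remaining, true);
        }
    }

    static class Zero implements Parser {
        public Result parse(String target) { return Result.succeed("", target); }
    }
    static class Item implements Parser {
        public Result parse(String target) {
            return target.isEmpty() ? Result.fail()
                : Result.succeed(target.substring(0, 1), target.substring(1));
        }
    }
    static class SAT implements Parser {
        private final Predicate pre;
        private final Parser parser;
        SAT(Predicate pre, Parser parser) { this.pre = pre; this.parser = parser; }
        public Result parse(String target) {
            Result r = parser.parse(target);
            return r.is_succeeded() && pre.satisfy(r.get_recognized()) ? r : Result.fail();
        }
    }
    static class OR implements Parser {
        private final Parser p1, p2;
        OR(Parser p1, Parser p2) { this.p1 = p1; this.p2 = p2; }
        public Result parse(String target) {
            Result r = p1.parse(target);
            return r.is_succeeded() ? r : p2.parse(target);
        }
    }
    static class SEQ implements Parser {
        private final Parser p1, p2;
        SEQ(Parser p1, Parser p2) { this.p1 = p1; this.p2 = p2; }
        public Result parse(String target) {
            Result r1 = p1.parse(target);
            if (!r1.is_succeeded()) return Result.fail();
            Result r2 = p2.parse(r1.get_remaining());
            return r2.is_succeeded() ? Result.concat(r1, r2) : Result.fail();
        }
    }
    static class OneOrMany implements Parser {
        private final int max;
        private final Parser parser;
        OneOrMany(int max, Parser parser) { this.max = max; this.parser = parser; }
        public Result parse(String target) {
            Result r = parser.parse(target);
            return r.is_succeeded() ? parse2(r, 1) : Result.fail();
        }
        private Result parse2(Result pre, int count) {
            if (count >= max) return pre;
            Result r = parser.parse(pre.get_remaining());
            return r.is_succeeded() ? parse2(Result.concat(pre, r), count + 1) : pre;
        }
    }

    // Hypothetical helper: a single-character parser guarded by a predicate.
    static Parser sat(Predicate p) { return new SAT(p, new Item()); }

    // NAME = ALPHA *63(ALPHA / DIGIT / "_")
    static Parser name() {
        Parser alpha = sat(v -> v.matches("[A-Za-z]"));
        Parser digit = sat(v -> v.matches("[0-9]"));
        Parser underline = sat(v -> v.equals("_"));
        Parser tail = new OR(alpha, new OR(digit, underline));
        Parser zeroOrMany = new OR(new OneOrMany(63, tail), new Zero());
        return new SEQ(alpha, zeroOrMany);
    }

    // Hypothetical helper: parses the input and renders the outcome as text.
    static String run(String input) {
        Result r = name().parse(input);
        return r.is_succeeded()
            ? "ok [" + r.get_recognized() + "] rest [" + r.get_remaining() + "]"
            : "fail";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc_123", "a", "1abc", "_x", "ab-cd", ""}) {
            System.out.println("\"" + s + "\" -> " + run(s));
        }
    }
}
```

With these inputs, "abc_123" and "a" are recognized in full, "1abc", "_x", and the empty string fail, and "ab-cd" is recognized only up to "ab" with "-cd" left over. The parser accepts the longest valid prefix, so a caller that needs the whole input to be a NAME should also check that the remainder is empty.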
