Using Java to implement a modular parser


http://www.ibm.com/developerworks/cn/java/j-lo-compose/index.html

Sun Yu, Deng

Introduction: Ward Cunningham once said that clean code clearly expresses what its author intended, and that beautiful code goes a step further: it makes the language look as if it were made for the problem. In this article we present the design and implementation of a modular parser built from small, composable parsers. The resulting code is elegant and extensible, and reads as if Java had been designed for parsing one particular grammar. We then pick an example from the H.248 protocol and implement its parser with the combinators described here. Along the way, readers can appreciate the beauty of the code and also learn something about functional programming and the construction of DSLs.

Release Date: June 24, 2010
Level: Advanced

The basics of DSL design

When we implement a feature in a programming language such as Java, we are directing the computer to carry out a task. But a language gives us more than that. More importantly, it provides a framework for organizing our thinking and expressing computational processes. The core of this framework is the ability to combine simple concepts into more complex ones while preserving closure under the means of combination; that is, a combined concept can be combined further in exactly the same way as a simple one. In the words of "Structure and Interpretation of Computer Programs" (see Resources), every powerful language achieves this through three mechanisms:

Atoms: the simplest, most basic entities of the language.
Means of combination: ways of putting atoms together to form more complex entities.
Means of abstraction: ways of naming complex entities so that, once named, they can be combined into still more complex entities, just like atoms.

A general-purpose programming language such as Java is concerned with general ways of solving problems, so the three mechanisms it provides are also general. When we solve a specific problem, there is a semantic gap between these general mechanisms and the concepts and rules of the problem domain, and concepts and rules that are clearly defined in the domain may become less clear once implemented. For a programmer, writing clean code that implements the functionality is only the basic requirement. More important is to raise the level of the general-purpose language and build a language for the specific problem domain (a DSL). The key to doing so is to find and define the atoms, means of combination, and means of abstraction that fit the problem domain. Such a DSL need not be as complete as a general-purpose language. Its goal is to express the concepts and rules of the problem domain clearly and intuitively, and the result is that a general-purpose programming language becomes a specialized language for solving a specific problem.

We presented these ideas in "Design and implementation of a Java-based interface layout DSL" (see Resources), where we built a DSL for interface layout. In this article we use the construction of a parser as an example of how to build a DSL for string parsing. The DSL is highly extensible, and readers can define their own combinators as needed. Readers can also get a taste of the elegance of functional programming.


Parser atoms

What is a parser, and what is the simplest parser? A parser is generally understood as something that determines whether an input string satisfies given grammar rules and, where necessary, extracts the matching syntactic units. Conceptually there is nothing wrong with this understanding, but defining a DSL for parsing requires a more precise definition: we need to give the parser concept an exact type. In Java, we define the parser type as an interface, Parser, with a single method that takes the target string and returns a Result.
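A minimal sketch of such a parser type, assuming the single method is named parse (the article fixes only the parameter, target, and the return type, Result). Result is a condensed copy of the value object the article defines, included so the block compiles on its own. ParserTypeDemo and ALWAYS_FAILS are hypothetical names used only for illustration.

```java
// Outcome of a parse: what was recognized, what is left, and a success flag.
final class Result {
    private final String recognized;
    private final String remaining;
    private final boolean succeeded;

    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }

    static Result succeed(String recognized, String remaining) {
        return new Result(recognized, remaining, true);
    }
    static Result fail() { return new Result("", "", false); }
}

// The parser type: anything that maps a target string to a Result.
// The method name `parse` is an assumption.
interface Parser {
    Result parse(String target);
}

final class ParserTypeDemo {
    // Hypothetical implementation, present only to exercise the interface.
    static final Parser ALWAYS_FAILS = new Parser() {
        public Result parse(String target) {
            return Result.fail();
        }
    };

    public static void main(String[] args) {
        System.out.println(ALWAYS_FAILS.parse("abc").is_succeeded()); // false
    }
}
```

Because the interface has a single method, any class, anonymous class, or (from Java 8 on) lambda with this shape can act as a parser.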





Here target is the string to parse and Result is the outcome of parsing. Any object that satisfies the semantics of this interface is a parser instance. Result is defined as follows:

class Result
{
    private String recognized;  // the part of the input the parser recognized
    private String remaining;   // the input left over after parsing
    private boolean succeeded;  // whether parsing succeeded

    // Instances are created only through succeed() and fail().
    private Result(String recognized, String remaining,
                   boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    public boolean is_succeeded() {
        return succeeded;
    }

    public String get_recognized() {
        return recognized;
    }

    public String get_remaining() {
        return remaining;
    }

    public static Result succeed(String recognized,
                                 String remaining) {
        return new Result(recognized, remaining, true);
    }

    public static Result fail() {
        return new Result("", "", false);
    }
}

The recognized field holds the part of the input the parser has recognized, remaining holds what is left of the input after parsing, and succeeded indicates whether parsing succeeded. Result is a value object. With the parser precisely defined, we can define the simplest parser. Clearly, the simplest parser is one that parses nothing and returns the target string unchanged. We call it Zero.
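One way Zero could be written is sketched below. It assumes a Parser interface whose method is named parse, and repeats a condensed Result so that the block is self-contained. ZeroDemo is a hypothetical name.

```java
final class Result {
    private final String recognized;
    private final String remaining;
    private final boolean succeeded;

    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }

    static Result succeed(String recognized, String remaining) {
        return new Result(recognized, remaining, true);
    }
    static Result fail() { return new Result("", "", false); }
}

interface Parser {
    Result parse(String target); // method name is an assumption
}

// Zero: recognizes nothing, consumes nothing, always succeeds.
final class Zero implements Parser {
    public Result parse(String target) {
        return Result.succeed("", target);
    }
}

final class ZeroDemo {
    public static void main(String[] args) {
        Result r = new Zero().parse("abc");
        System.out.println(r.is_succeeded());  // true
        System.out.println(r.get_remaining()); // abc
    }
}
```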







Zero always succeeds: it recognizes no syntactic unit and returns the target string as the remainder. Next we define another very simple parser, Item. If the target string is not empty, Item takes its first character as the recognized result and succeeds; if the target string is empty, it fails.
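A sketch of Item under the same assumptions (a Parser interface with a parse method, and a condensed Result repeated for self-containment). "First character" is taken here to mean the first Java char, so characters outside the Basic Multilingual Plane are not handled. ItemDemo is a hypothetical name.

```java
final class Result {
    private final String recognized;
    private final String remaining;
    private final boolean succeeded;

    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }

    static Result succeed(String recognized, String remaining) {
        return new Result(recognized, remaining, true);
    }
    static Result fail() { return new Result("", "", false); }
}

interface Parser {
    Result parse(String target); // method name is an assumption
}

// Item: consumes exactly one character, or fails on empty input.
final class Item implements Parser {
    public Result parse(String target) {
        if (target.isEmpty()) {
            return Result.fail();
        }
        return Result.succeed(target.substring(0, 1), target.substring(1));
    }
}

final class ItemDemo {
    public static void main(String[] args) {
        Result r = new Item().parse("abc");
        System.out.println(r.get_recognized()); // a
        System.out.println(r.get_remaining());  // bc
    }
}
```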











Zero and Item are the only two atoms in our parser DSL. In the next section we define parser combinators.


Parser combinators

The Item parser defined in the previous section unconditionally accepts the first character of the target string. To make parsing conditional, we define a combinator, SAT, that takes a predicate and a parser and produces a composite parser. The composite parser succeeds only if the original parser succeeds and its recognized result satisfies the predicate.
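The predicate type and SAT might look like the sketch below. The interface name Predicate, its method name satisfy, and the constructor argument order are assumptions, as is the parse method on Parser. The demo uses a lambda, so it needs Java 8 or later. SatDemo and onlyX are hypothetical names.

```java
final class Result {
    private final String recognized;
    private final String remaining;
    private final boolean succeeded;

    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }

    static Result succeed(String recognized, String remaining) {
        return new Result(recognized, remaining, true);
    }
    static Result fail() { return new Result("", "", false); }
}

interface Parser {
    Result parse(String target);
}

// A condition on the recognized text (names are assumptions).
interface Predicate {
    boolean satisfy(String value);
}

final class Item implements Parser {
    public Result parse(String target) {
        if (target.isEmpty()) {
            return Result.fail();
        }
        return Result.succeed(target.substring(0, 1), target.substring(1));
    }
}

// SAT: succeeds only if the wrapped parser succeeds and the predicate
// accepts what it recognized.
final class SAT implements Parser {
    private final Predicate predicate;
    private final Parser parser;

    SAT(Predicate predicate, Parser parser) {
        this.predicate = predicate;
        this.parser = parser;
    }

    public Result parse(String target) {
        Result r = parser.parse(target);
        if (r.is_succeeded() && predicate.satisfy(r.get_recognized())) {
            return r;
        }
        return Result.fail();
    }
}

final class SatDemo {
    public static void main(String[] args) {
        Parser onlyX = new SAT(v -> v.equals("x"), new Item());
        System.out.println(onlyX.parse("xy").is_succeeded()); // true
        System.out.println(onlyX.parse("yx").is_succeeded()); // false
    }
}
```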























To define a parser that recognizes a single digit, we define a predicate, isdigit, and combine it with Item through SAT. The resulting parser is named digit.
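A sketch of the digit predicate and the digit parser. The class name IsDigit (for the article's isdigit predicate), the holder class Digits, and the Predicate, SAT and parse signatures are all assumptions. Only the ASCII digits 0 through 9 are accepted.

```java
final class Result {
    private final String recognized;
    private final String remaining;
    private final boolean succeeded;

    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }

    static Result succeed(String recognized, String remaining) {
        return new Result(recognized, remaining, true);
    }
    static Result fail() { return new Result("", "", false); }
}

interface Parser { Result parse(String target); }
interface Predicate { boolean satisfy(String value); }

final class Item implements Parser {
    public Result parse(String target) {
        if (target.isEmpty()) {
            return Result.fail();
        }
        return Result.succeed(target.substring(0, 1), target.substring(1));
    }
}

final class SAT implements Parser {
    private final Predicate predicate;
    private final Parser parser;

    SAT(Predicate predicate, Parser parser) {
        this.predicate = predicate;
        this.parser = parser;
    }

    public Result parse(String target) {
        Result r = parser.parse(target);
        if (r.is_succeeded() && predicate.satisfy(r.get_recognized())) {
            return r;
        }
        return Result.fail();
    }
}

// True for a one-character string holding an ASCII digit.
final class IsDigit implements Predicate {
    public boolean satisfy(String value) {
        return value.length() == 1
                && value.charAt(0) >= '0' && value.charAt(0) <= '9';
    }
}

final class Digits {
    // digit = Item restricted by IsDigit
    static final Parser digit = new SAT(new IsDigit(), new Item());

    public static void main(String[] args) {
        System.out.println(digit.parse("7up").get_recognized()); // 7
        System.out.println(digit.parse("up7").is_succeeded());   // false
    }
}
```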


The same method gives us parsers for a single letter, a single uppercase letter, a single lowercase letter, and so on.

Next, we define the OR combinator. It takes two parsers and applies them to the target string. If either parser succeeds, the composite parser succeeds; if both fail, it fails.
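A sketch of OR, read here as ordered choice: the second parser is tried only when the first fails, and the first success is returned. That ordering is an assumption, since the text only says that one success is enough. The demo combines the two atoms, so OR(Item, Zero) accepts "one character or nothing". OrDemo and optionalItem are hypothetical names.

```java
final class Result {
    private final String recognized;
    private final String remaining;
    private final boolean succeeded;

    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }

    static Result succeed(String recognized, String remaining) {
        return new Result(recognized, remaining, true);
    }
    static Result fail() { return new Result("", "", false); }
}

interface Parser { Result parse(String target); }

final class Zero implements Parser {
    public Result parse(String target) {
        return Result.succeed("", target);
    }
}

final class Item implements Parser {
    public Result parse(String target) {
        if (target.isEmpty()) {
            return Result.fail();
        }
        return Result.succeed(target.substring(0, 1), target.substring(1));
    }
}

// OR: try the first parser; fall back to the second on the same input.
final class OR implements Parser {
    private final Parser first;
    private final Parser second;

    OR(Parser first, Parser second) {
        this.first = first;
        this.second = second;
    }

    public Result parse(String target) {
        Result r = first.parse(target);
        return r.is_succeeded() ? r : second.parse(target);
    }
}

final class OrDemo {
    public static void main(String[] args) {
        Parser optionalItem = new OR(new Item(), new Zero());
        System.out.println(optionalItem.parse("ab").get_recognized()); // a
        System.out.println(optionalItem.parse("").is_succeeded());     // true
    }
}
```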
















With OR we can define a new parser, digit_or_alpha, which succeeds if the target character is a digit or a letter and fails otherwise. It takes three pieces: a predicate that tests whether a character is a letter, a parser alpha that recognizes a single letter, and digit_or_alpha itself, which combines digit and alpha with OR.
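The three pieces could be sketched as follows. IsAlpha, IsDigit, the holder class CharParsers, and the Predicate, SAT, OR and parse signatures are assumed names; letters are restricted to ASCII A-Z and a-z.

```java
final class Result {
    private final String recognized;
    private final String remaining;
    private final boolean succeeded;

    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }

    static Result succeed(String recognized, String remaining) {
        return new Result(recognized, remaining, true);
    }
    static Result fail() { return new Result("", "", false); }
}

interface Parser { Result parse(String target); }
interface Predicate { boolean satisfy(String value); }

final class Item implements Parser {
    public Result parse(String target) {
        if (target.isEmpty()) {
            return Result.fail();
        }
        return Result.succeed(target.substring(0, 1), target.substring(1));
    }
}

final class SAT implements Parser {
    private final Predicate predicate;
    private final Parser parser;

    SAT(Predicate predicate, Parser parser) {
        this.predicate = predicate;
        this.parser = parser;
    }

    public Result parse(String target) {
        Result r = parser.parse(target);
        if (r.is_succeeded() && predicate.satisfy(r.get_recognized())) {
            return r;
        }
        return Result.fail();
    }
}

final class OR implements Parser {
    private final Parser first;
    private final Parser second;

    OR(Parser first, Parser second) {
        this.first = first;
        this.second = second;
    }

    public Result parse(String target) {
        Result r = first.parse(target);
        return r.is_succeeded() ? r : second.parse(target);
    }
}

final class IsDigit implements Predicate {
    public boolean satisfy(String value) {
        return value.length() == 1
                && value.charAt(0) >= '0' && value.charAt(0) <= '9';
    }
}

// True for a one-character string holding an ASCII letter.
final class IsAlpha implements Predicate {
    public boolean satisfy(String value) {
        if (value.length() != 1) {
            return false;
        }
        char c = value.charAt(0);
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }
}

final class CharParsers {
    static final Parser digit = new SAT(new IsDigit(), new Item());
    static final Parser alpha = new SAT(new IsAlpha(), new Item());
    static final Parser digit_or_alpha = new OR(digit, alpha);
}
```

Under these definitions, CharParsers.digit_or_alpha accepts the first character of "a1" and of "1a" but rejects "_a".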


Next we define the sequential combinator, SEQ. It takes two parsers and first applies the first parser to the target string. If that succeeds, it applies the second parser to the string remaining after the first. The composite parser succeeds only if both parsers succeed; if either fails, it fails. On success, its recognized result is the concatenation of the two parsers' recognized results.

To join the recognized parts of two Result objects, we add a static method, concat, to the Result class. SEQ uses it to build the result of the composite parser.
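A sketch of concat and SEQ. It assumes that concat keeps the remaining string of the second result, since that is what is left once both parsers have run; the signature concat(Result, Result) and the parse method name are also assumptions. SeqDemo and twoChars are hypothetical names.

```java
final class Result {
    private final String recognized;
    private final String remaining;
    private final boolean succeeded;

    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }

    static Result succeed(String recognized, String remaining) {
        return new Result(recognized, remaining, true);
    }
    static Result fail() { return new Result("", "", false); }

    // Joins two successive successful results: the recognized parts are
    // concatenated and the remainder is that of the later result.
    static Result concat(Result first, Result second) {
        return succeed(first.recognized + second.recognized, second.remaining);
    }
}

interface Parser { Result parse(String target); }

final class Item implements Parser {
    public Result parse(String target) {
        if (target.isEmpty()) {
            return Result.fail();
        }
        return Result.succeed(target.substring(0, 1), target.substring(1));
    }
}

// SEQ: run the first parser, then the second on what the first left over.
final class SEQ implements Parser {
    private final Parser first;
    private final Parser second;

    SEQ(Parser first, Parser second) {
        this.first = first;
        this.second = second;
    }

    public Result parse(String target) {
        Result r1 = first.parse(target);
        if (!r1.is_succeeded()) {
            return Result.fail();
        }
        Result r2 = second.parse(r1.get_remaining());
        if (!r2.is_succeeded()) {
            return Result.fail();
        }
        return Result.concat(r1, r2);
    }
}

final class SeqDemo {
    public static void main(String[] args) {
        Parser twoChars = new SEQ(new Item(), new Item());
        Result r = twoChars.parse("abc");
        System.out.println(r.get_recognized()); // ab
        System.out.println(r.get_remaining());  // c
    }
}
```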




















Now a parser that recognizes a letter followed by a digit, alpha_before_digit, can be defined by combining alpha and digit with SEQ.


Next we define the last combinator in this article, OneOrMany. It takes a parser and a positive integer. The composite parser applies the original parser to the target string repeatedly; the input of each application is the string remaining after the previous one, and the integer gives the maximum number of applications. If the first application fails, the composite parser fails. Otherwise it continues until the maximum is reached or an application fails, and the recognized results of all successful applications are concatenated to form its recognized result.
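A sketch of OneOrMany written as a loop. The constructor argument order (max, parser), the rejection of non-positive maxima, and the concat helper on Result are assumptions. OneOrManyDemo and upToThree are hypothetical names.

```java
final class Result {
    private final String recognized;
    private final String remaining;
    private final boolean succeeded;

    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }

    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }

    static Result succeed(String recognized, String remaining) {
        return new Result(recognized, remaining, true);
    }
    static Result fail() { return new Result("", "", false); }
    static Result concat(Result first, Result second) {
        return succeed(first.recognized + second.recognized, second.remaining);
    }
}

interface Parser { Result parse(String target); }

final class Item implements Parser {
    public Result parse(String target) {
        if (target.isEmpty()) {
            return Result.fail();
        }
        return Result.succeed(target.substring(0, 1), target.substring(1));
    }
}

// OneOrMany: apply the parser at least once and at most `max` times.
final class OneOrMany implements Parser {
    private final int max;
    private final Parser parser;

    OneOrMany(int max, Parser parser) {
        if (max < 1) {
            throw new IllegalArgumentException("max must be positive");
        }
        this.max = max;
        this.parser = parser;
    }

    public Result parse(String target) {
        Result acc = parser.parse(target);
        if (!acc.is_succeeded()) {
            return Result.fail(); // the first application must succeed
        }
        // One application is done; allow up to max - 1 more.
        for (int count = 1; count < max; count++) {
            Result next = parser.parse(acc.get_remaining());
            if (!next.is_succeeded()) {
                break;
            }
            acc = Result.concat(acc, next);
        }
        return acc;
    }
}

final class OneOrManyDemo {
    public static void main(String[] args) {
        Parser upToThree = new OneOrMany(3, new Item());
        Result r = upToThree.parse("abcde");
        System.out.println(r.get_recognized()); // abc
        System.out.println(r.get_remaining());  // de
    }
}
```

With a letter parser in place of Item and a maximum of 10, the same class would give the "one to ten letters" parser.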























Using this combinator, it is easy to define a parser that recognizes a string of at least one and at most 10 letters: apply OneOrMany to alpha with a maximum of 10.



These are all the combinators defined in this article, but readers can easily define others in the same way as their needs require.


The means of abstraction

If a DSL provides only atoms and combinators, and the result of a combination cannot itself be combined again, its extensibility and applicability are greatly reduced. Conversely, if we provide a means of abstraction for naming combined results, and a named composite entity can take part in further combinations just like an atom, the DSL becomes highly extensible and far more widely applicable. Means of abstraction are therefore essential when constructing a DSL.

Attentive readers may have noticed that we already used means of abstraction for our parsing DSL in the previous section. For example, we named the composite parsers alpha, digit, digit_or_alpha, and alpha_before_digit, and then used those names directly in further combinations. Because our parsers are defined through Java's interface mechanism, the abstraction support that Java already provides for interfaces applies fully to our parsing DSL. We therefore do not need to define any means of abstraction of our own; Java's are enough.

We trust that the examples in the previous section have shown the power of combination and abstraction. In the next section we give a more concrete example: constructing a parser for the NAME syntax in the H.248 protocol.


An H.248 example

In this section we build a parser for the NAME syntax in the H.248 protocol from the parser atoms and combinators defined earlier.

H.248 is a communication protocol that a media gateway controller uses to control media gateways. It is a text-based protocol whose messages are described in ABNF (Augmented BNF), which defines the components and content of an H.248 message. The details of H.248 are beyond the scope of this article; interested readers can find more in Resources. We focus only on the definition of the NAME syntax, which is as follows:

NAME  = ALPHA *63(ALPHA / DIGIT / "_")
ALPHA = %x41-5A / %x61-7A  ; A-Z, a-z
DIGIT = %x30-39            ; digits 0 through 9

Let us first explain these rules. *63 is an instance of the repetition rule n*m, which means at least n and at most m occurrences; when n is 0 it can be abbreviated to *m. So *63 means at least 0 and at most 63 occurrences. / denotes alternatives: either of the entities on its two sides may appear. () groups its contents into a single entity, so one of the alternatives inside must appear. - denotes a range. Therefore DIGIT is a single digit, ALPHA is a single letter (uppercase or lowercase), (ALPHA / DIGIT / "_") is a letter, a digit, or an underscore, and *63(ALPHA / DIGIT / "_") is at least 0 and at most 63 letters, digits, or underscores. Two entities written one after the other are in sequence, so ALPHA *63(ALPHA / DIGIT / "_") is a letter followed by at least 0 and at most 63 letters, digits, or underscores. For more rules, see Resources.

With the atoms and combinators from the previous sections, a parser for this grammar rule can be expressed almost directly.
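One possible assembly of the NAME parser is sketched below, written compactly so that all atoms and combinators fit in one block. The holder class H248Parsers, the helpers between, underline, nameChar and zeroOrMany, and all constructor signatures are assumptions. "Zero to max repetitions" is built here as OneOrMany falling back to Zero. The predicates are lambdas, so Java 8 or later is required.

```java
final class Result {
    private final String recognized, remaining;
    private final boolean succeeded;
    private Result(String recognized, String remaining, boolean succeeded) {
        this.recognized = recognized;
        this.remaining = remaining;
        this.succeeded = succeeded;
    }
    boolean is_succeeded() { return succeeded; }
    String get_recognized() { return recognized; }
    String get_remaining() { return remaining; }
    static Result succeed(String rec, String rem) { return new Result(rec, rem, true); }
    static Result fail() { return new Result("", "", false); }
    static Result concat(Result a, Result b) {
        return succeed(a.recognized + b.recognized, b.remaining);
    }
}
interface Parser { Result parse(String target); }
interface Predicate { boolean satisfy(String value); }

final class Zero implements Parser {
    public Result parse(String target) { return Result.succeed("", target); }
}
final class Item implements Parser {
    public Result parse(String target) {
        if (target.isEmpty()) return Result.fail();
        return Result.succeed(target.substring(0, 1), target.substring(1));
    }
}
final class SAT implements Parser {
    private final Predicate predicate;
    private final Parser parser;
    SAT(Predicate predicate, Parser parser) { this.predicate = predicate; this.parser = parser; }
    public Result parse(String target) {
        Result r = parser.parse(target);
        return r.is_succeeded() && predicate.satisfy(r.get_recognized()) ? r : Result.fail();
    }
}
final class OR implements Parser {
    private final Parser first, second;
    OR(Parser first, Parser second) { this.first = first; this.second = second; }
    public Result parse(String target) {
        Result r = first.parse(target);
        return r.is_succeeded() ? r : second.parse(target);
    }
}
final class SEQ implements Parser {
    private final Parser first, second;
    SEQ(Parser first, Parser second) { this.first = first; this.second = second; }
    public Result parse(String target) {
        Result r1 = first.parse(target);
        if (!r1.is_succeeded()) return Result.fail();
        Result r2 = second.parse(r1.get_remaining());
        return r2.is_succeeded() ? Result.concat(r1, r2) : Result.fail();
    }
}
final class OneOrMany implements Parser {
    private final int max;
    private final Parser parser;
    OneOrMany(int max, Parser parser) { this.max = max; this.parser = parser; }
    public Result parse(String target) {
        Result acc = parser.parse(target);
        if (!acc.is_succeeded()) return Result.fail();
        for (int count = 1; count < max; count++) {
            Result next = parser.parse(acc.get_remaining());
            if (!next.is_succeeded()) break;
            acc = Result.concat(acc, next);
        }
        return acc;
    }
}

final class H248Parsers {
    // True if v is a single character in the inclusive range lo..hi.
    static boolean between(String v, char lo, char hi) {
        return v.length() == 1 && v.charAt(0) >= lo && v.charAt(0) <= hi;
    }
    // ALPHA = %x41-5A / %x61-7A
    static Parser alpha() {
        return new SAT(v -> between(v, 'A', 'Z') || between(v, 'a', 'z'), new Item());
    }
    // DIGIT = %x30-39
    static Parser digit() { return new SAT(v -> between(v, '0', '9'), new Item()); }
    static Parser underline() { return new SAT(v -> v.equals("_"), new Item()); }
    // (ALPHA / DIGIT / "_")
    static Parser nameChar() { return new OR(alpha(), new OR(digit(), underline())); }
    // *max(p): one to max repetitions, or else nothing at all
    static Parser zeroOrMany(int max, Parser p) {
        return new OR(new OneOrMany(max, p), new Zero());
    }
    // NAME = ALPHA *63(ALPHA / DIGIT / "_")
    static Parser name() { return new SEQ(alpha(), zeroOrMany(63, nameChar())); }

    public static void main(String[] args) {
        Result r = name().parse("U123_{");
        System.out.println(r.get_recognized()); // U123_
        System.out.println(r.get_remaining());  // {
    }
}
```

Each factory method in H248Parsers lines up with one rule or sub-expression of the ABNF, which is the point of building the DSL.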





























The code mirrors the grammar description in the protocol almost exactly. By defining our own parsing DSL, we have turned Java, a general-purpose language, into a specialized language for ABNF parsing, which fits Ward Cunningham's definition of beautiful code. Finally, we ran the parser on a number of inputs to test NAME recognition. The results are shown in the following table:

Input string    Success   Recognized   Remaining
""              false     ""           ""
"_u"            false     ""           ""
"2U"            false     ""           ""
"U"             true      "U"          ""
"U{"            true      "U"          "{"
"U2{"           true      "U2"         "{"
"U_{"           true      "U_"         "{"
"U123_{"        true      "U123_"      "{"
"USER001"       true      "USER001"    ""
"USER001{"      true      "USER001"    "{"

One more input tests the length limit: "a" followed by "0123456789" repeated seven times (71 characters). Parsing succeeds. The recognized string is the first 64 characters, that is, "a", six repetitions of "0123456789", and "012"; the remaining string is "3456789".

Resources

Reference: "Structure and Interpretation of Computer Programs".
Reference, developerWorks article: "Design and implementation of a Java-based interface layout DSL".
Reference: "Monadic Parsing in Haskell".
Reference: "Megaco Protocol Version 1.0".
Reference: "Augmented BNF for Syntax Specifications: ABNF".
