Crazy Geeks: Designing a New Programming Language in the Shortest Possible Time

Source: Internet
Author: User

I plan to write a series of interesting and challenging columns called Crazy Geeks. Most of the articles are closely related to computers: building compilers, operating systems, circuit boards controlled from Android, and robots (controlled by devices running Android or iOS).


Before the Crazy Geeks series officially starts, let's warm up by designing a simple but fun programming language, CShell, in the shortest possible time. You don't need to worry about how the CShell parser is implemented, and for now you don't need to know compiler theory, although later articles will involve a lot of it. As the name suggests, CShell is a C-style language that is interpreted and executed like a shell, that is, a dynamic language. Of course, it cannot be as powerful as C or a shell. A C compiler is not especially complex (C is a structured language with no classes or interfaces, so it is much easier to implement than a Java compiler), but it still cannot be finished in a day or two.

Simple as it is, the CShell implementation in this article can still express some algorithms. CShell supports printing values, variables, conditional statements (if), for loops, increment and decrement (++ and --), the +, -, * and / operators, and functions (recursion is supported). Because CShell is a dynamic language, variables need no declaration, but both global and local variables are supported. Arrays (holding integers and strings) are supported as well, so CShell can easily implement algorithms such as bubble sort and factorial.

Before discussing the design and implementation of CShell, let's look at some programs written in it. The tasks themselves are simple, but this time is different: we implement them in our own new language. Recursive factorial and bubble sort, for example, suddenly look cool. Let's go!

    // Simple variable output
    xx = 45;
    _ok = 64;
    print(xx);
    a1 = 65;
    print(a1);

    // Array demonstration
    $arr = [1, 2, 3, 4, 5, "AA"]; // an array mixing numbers and a string; $ marks a global variable
    print($arr);     // print all elements of the array
    print($arr[2]);  // print the 3rd element of the array

    // Triple for loop
    $x = 0; // global variable
    // i, j, and z are all local variables and cannot be accessed outside the for loops
    for (i = 0; i < 10; i = i + 1) {
        for (j = 0; j < 10; j = j + 1) {
            for (z = 0; z < 10; z = z + 1) {
                $x = $x + 1;
            }
        }
    }
    print($x); // outputs 1000

    // Compute the factorial of 10 with a recursive function and if statements
    def jc(n) {
        if (n == 0) {
            return 1;
        } else if (n == 1) {
            return 1;
        } else {
            return jc(n - 1) * n;
        }
    }
    print("10!");
    print(jc(10)); // the factorial of 10 (3628800)

    // Bubble sort (descending)
    $arr = [..., -56, 12];
    $len = length($arr);
    // double-loop bubble sort
    for (i = 0; i < $len; i++) {
        for (j = 0; j < $len - 1; j++) {
            if ($arr[j] < $arr[j + 1]) {
                x = $arr[j + 1];
                $arr[j + 1] = $arr[j];
                $arr[j] = x;
            }
        }
    }
    print($arr); // outputs [12, 7, 5, 5, 4, 3, 1, -56]

How to design and implement programming languages

There are many ways to design a programming language. The textbook approach is to study compiler theory and then build each stage in turn: the lexical analyzer, the syntax analyzer, semantic analysis, intermediate code generation, intermediate code optimization, and target code generation. If the language needs a runtime, you also have to write a virtual machine that executes the target code (for example, the JVM is the virtual machine that executes Java bytecode). That is already dizzying, and many programmers with formal computer science training remain hazy about compiler theory. Even with a good grasp of the theory, it is hard to write a complex compiler or interpreter from it by hand, especially once object-oriented features are added. A complex compiler contains a great deal of code that is impractical to write manually. For example, with LL(*) syntax analysis, computing the large number of FIRST and FOLLOW sets by hand is painful. Even once the code is written, changing it to add or modify syntax is a nightmare. For this reason, many complex, industrial-strength programming languages are built in a semi-automated way.

Semi-automated means that the whole compiler cannot be generated automatically, but its core front end can be: the lexical analyzer and the syntax analyzer. The usual practice is to describe the lexical and syntactic structure, plus any other necessary information, in a DSL (domain-specific language). A compiler compiler (a program that generates compilers) then generates the lexical analyzer and the parser from that DSL. You can also embed semantic code in the DSL so that the generated program performs the semantic processing directly.

Many world-class companies, such as Google, Microsoft, Intel, and IBM, have their own CC (compiler compiler), but for individuals or small companies, developing a CC is very hard (harder than developing a compiler). We can therefore use a free, open-source one, such as JavaCC, Lex, Yacc, or ANTLR. JavaCC supports only Java. Lex generates lexical analyzers and Yacc generates syntax analyzers, and both target C. ANTLR can generate code for several languages, such as Java, C#, Ruby, C/C++, and JavaScript. This article therefore uses ANTLR to design and implement the CShell language.

How CShell is built

Although CShell relies on ANTLR, a lot of code still has to be written by hand. This article therefore covers only the core code and the principles behind it. For the complete code, refer to the source code.

Readers who have studied compiler theory will know that implementing a language starts with lexical analysis, followed by syntax analysis based on its results. Fortunately, ANTLR can generate both automatically.

Lexical analysis splits the source text into its smallest meaningful units, called tokens. For example, take the following CShell code.

 
    for (i = 0; i < $len; i++) {}

Lexical analysis breaks this code down into the following sequence of tokens:

"for", "(", "i", "=", "0", ";", "i", "<", "$len", ";", "i", "++", ")", "{", "}"

Of course, to implement this analysis by hand you would use a deterministic finite automaton (DFA). The program is not complex, but it is tedious. With ANTLR it is much easier: in general you only need to define the token rules. Some tokens are defined inline in the syntax rules, and others get separate lexical rules. For example, the code above contains two variables, i and $len, where i is a local variable and $len is a global variable. Both are identifiers, so we can define one lexical rule that recognizes identifiers.
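As a rough illustration of the hand-written alternative, the following Java sketch scans the for statement above character by character. It is not taken from the CShell source: the class name MiniLexer and its tokenize method are made up for this example, and it only knows identifiers, integers, and a few operators.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written scanner sketch; MiniLexer is a hypothetical name, not part of CShell. */
final class MiniLexer {

    /** Splits the source text into tokens, skipping whitespace. */
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (c == '$' || c == '_' || isLetter(c)) {
                // Identifier: optional '$', then a letter or '_', then letters or digits.
                i++;
                if (c == '$') {
                    if (i >= src.length() || !(src.charAt(i) == '_' || isLetter(src.charAt(i)))) {
                        throw new IllegalArgumentException("bad identifier at position " + start);
                    }
                    i++;
                }
                while (i < src.length() && (isLetter(src.charAt(i)) || isDigit(src.charAt(i)))) {
                    i++;
                }
            } else if (isDigit(c)) {
                // Integer literal: one or more digits.
                while (i < src.length() && isDigit(src.charAt(i))) {
                    i++;
                }
            } else if (src.startsWith("++", i) || src.startsWith("--", i) || src.startsWith("==", i)) {
                i += 2; // two-character operator
            } else {
                i++;    // any other single-character symbol
            }
            tokens.add(src.substring(start, i));
        }
        return tokens;
    }

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        System.out.println(tokenize("for (i = 0; i < $len; i++) {}"));
    }
}
```

Running main prints [for, (, i, =, 0, ;, i, <, $len, ;, i, ++, ), {, }]. With ANTLR, the identifier branch of this scanner shrinks to a single lexical rule.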

    ID: '$'? (LETTER | '_') (LETTER | '0'..'9')*;

ID is the name of the lexical rule; in ANTLR the name of a lexical rule must begin with an uppercase letter. LETTER matches the 26 lowercase and 26 uppercase letters. "?" means the preceding element is optional, and "*" is the Kleene star, meaning zero or more repetitions.

    LETTER: ('a'..'z' | 'A'..'Z');

The ID rule shows that an identifier may or may not start with "$". Either way, the next character must be a letter or an underscore, followed by any number of letters or digits. For example, ABC, _xyz123, and $_23 are all recognized as IDs. ANTLR automatically generates the Java code for this rule.
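To experiment with the ID rule without running ANTLR, it can be approximated by a regular expression. The Java sketch below is only an equivalent written for illustration (the class name IdRule is hypothetical); the code ANTLR generates looks quite different.

```java
import java.util.regex.Pattern;

/** Regular-expression approximation of the ID rule; IdRule is a hypothetical name. */
final class IdRule {

    // '$'? (LETTER | '_') (LETTER | '0'..'9')*
    static final Pattern ID = Pattern.compile("\\$?[A-Za-z_][A-Za-z0-9]*");

    /** Returns true if the whole string is a valid identifier under the rule. */
    static boolean isId(String s) {
        return ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ABC", "_xyz123", "$_23", "9abc", "$"}) {
            System.out.println(s + " -> " + isId(s));
        }
    }
}
```

The first three strings are accepted. "9abc" is rejected because it starts with a digit, and "$" alone is rejected because the rule requires a letter or underscore after the optional "$".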

Other tokens are handled in a similar way. For example, the following rule recognizes strings.

    STRING: '\"' .* '\"';

".*" matches any sequence of characters. In other words, a CShell string is any sequence of characters between two double quotation marks.

After lexical analysis comes syntax analysis. The result of lexical analysis is a token sequence, which is the input to syntax analysis. The two phases are similar: lexical analysis takes a sequence of characters and outputs a sequence of tokens, while syntax analysis takes a sequence of tokens and may produce several kinds of output, or none at all. The parser can execute actions (semantic processing) during the analysis, or it can build an AST (abstract syntax tree) for further processing and optimization. This article takes the AST approach: the CShell source code is converted into an AST by syntax analysis. The purpose is to strip out noise. In a for loop, for example, only identifiers and operators such as i, $len, and ++ are useful. Parentheses and similar auxiliary symbols only serve to distinguish a for statement from other statements and are not needed afterwards.

Here we look only at the syntax rule for the relatively simple if statement.

    statement: 'if' '(' expr ')' slist elseif_statement_all else_statement?

Here slist is another production. It describes the block between if and else if.

    slist // original content: ':' NL (statement)+ '.' NL
        : NL* '{' NL* (statement)* NL* '}' NL* -> ^(BLOCK statement*)
        ;

NL matches a line break. The ^(BLOCK statement*) part describes the AST, with BLOCK as the root node. As you can see, the AST filters out the left and right braces of slist and keeps only the statements, which carry the actual meaning.

From the definitions of statement and slist, an if statement must start with "if", which ANTLR returns to the syntax analyzer as a token. Next comes the left parenthesis, then the condition expression (expr, another production), the right parenthesis, and the body of the if statement (slist), followed by the else if part. The remaining parts are defined in much the same way as the if part; see the antlr/cshell.g file in the source code.

Once the DSL required by ANTLR is written, what comes next? We have to write the semantic part ourselves, and this part is quite involved. The basic idea is that syntax analysis hands back variables, keywords (for, if, and so on), and other elements, and the semantic part decides what to do with them. Variables, for example, are usually handled with a symbol table (a Map object) keyed by variable name. When a variable is encountered, it is first looked up in the Map. If it is not found, the variable is defined by storing its name and value in the Map; if it is found, its value is retrieved directly. Statements such as for and if are handled by walking the AST generated by syntax analysis.
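The symbol table idea can be sketched in a few lines of Java. This is an assumption-laden illustration, not the article's Interpreter code: the class name SymbolTable and the methods store and load are invented here, and a single map of locals is used for brevity, whereas a real interpreter would need a fresh local scope for each function call.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical symbol table; the real Interpreter may organize scopes differently. */
final class SymbolTable {
    private final Map<String, Object> globals = new HashMap<>();
    private final Map<String, Object> locals = new HashMap<>();

    // Names starting with '$' are global, everything else is local.
    private Map<String, Object> scopeOf(String name) {
        return name.startsWith("$") ? globals : locals;
    }

    /** Defines the variable if it is new, otherwise overwrites its value. */
    void store(String name, Object value) {
        scopeOf(name).put(name, value);
    }

    /** Returns the stored value, or fails if the variable was never assigned. */
    Object load(String name) {
        Map<String, Object> scope = scopeOf(name);
        if (!scope.containsKey(name)) {
            throw new IllegalStateException("undefined variable: " + name);
        }
        return scope.get(name);
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        table.store("$x", 0);                              // $x = 0;
        table.store("i", 5);                               // i = 5;
        table.store("$x", (Integer) table.load("$x") + 1); // $x = $x + 1;
        System.out.println(table.load("$x"));              // prints 1
        System.out.println(table.load("i"));               // prints 5
    }
}
```

Keeping globals and locals in separate maps means a "$" prefix alone decides where a name lives, which matches how the sample programs use $x and $arr across loops and functions.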

The Interpreter class is the core of the interpreter. It provides an exec method, which takes an AST node, starting with the root. In other words, executing CShell code means traversing the AST. The AST is a multi-way tree, and exec walks it recursively, depth first: each case handles one node type and calls exec again on the child nodes as needed. The exec method is as follows:

    // CShellAST is the AST node type, an ordinary Java class
    public Object exec(CShellAST ast) {
        try {
            switch (ast.getType()) {
                case CShellParser.BLOCK: // process a block
                    block(ast);
                    break;
                case CShellParser.ASSIGN:
                    assign(ast);
                    break;
                case CShellParser.LENGTH:
                    return length(ast);
                case CShellParser.ARRAY:
                    arraystat(ast);
                    break;
                case CShellParser.RETURN:
                    ret(ast);
                    break;
                case CShellParser.PRINT:
                    print(ast);
                    break;
                case CShellParser.IF: // process an if statement
                    ifstat(ast);
                    break;
                case CShellParser.FOR:
                    forloop(ast);
                    break;
                case CShellParser.CALL:
                    return call(ast);
                case CShellParser.ADD:
                    return add(ast);
                case CShellParser.PREV:
                case CShellParser.SUFFIX:
                    return incanddec(ast);
                case CShellParser.SUB:
                    return op(ast);
                case CShellParser.MUL:
                case CShellParser.DIV:
                    return op(ast);
                case CShellParser.EQ:
                    return eq(ast);
                case CShellParser.LT:
                    return lt(ast);
                case CShellParser.GT:
                    return gt(ast);
                case CShellParser.INT:
                    return Integer.parseInt(ast.getText());
                case CShellParser.CHAR:
                    return new Character(ast.getText().charAt(1));
                case CShellParser.FLOAT:
                    return Float.parseFloat(ast.getText());
                case CShellParser.STRING:
                    String s = ast.getText();
                    return s.substring(1, s.length() - 1);
                case CShellParser.ID:
                case CShellParser.ARRAY_ELEMENT:
                    return load(ast);
                default: // catch unhandled node types
                    throw new UnsupportedOperationException("unable to process" + ast.getText() + "<" + ast.getType() + ">");
            }
        } catch (Exception e) {
            listener.error("cause of exception:" + ast.toStringTree(), e);
        }
        return null;
    }
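The dispatch-and-recurse pattern in exec can be tried out without ANTLR. The sketch below is a toy written for this purpose: MiniAst, its Type enum, and the helper methods are hypothetical and cover only integers, addition, multiplication, and less-than, but the shape of exec is the same as above.

```java
import java.util.Arrays;
import java.util.List;

/** Toy AST node plus a recursive exec, for illustration only. */
final class MiniAst {
    enum Type { INT, ADD, MUL, LT }

    final Type type;
    final String text;            // literal text, used by INT nodes
    final List<MiniAst> children; // operands, in source order

    MiniAst(Type type, String text, MiniAst... children) {
        this.type = type;
        this.text = text;
        this.children = Arrays.asList(children);
    }

    static MiniAst num(int value) {
        return new MiniAst(Type.INT, Integer.toString(value));
    }

    static MiniAst op(Type type, MiniAst left, MiniAst right) {
        return new MiniAst(type, null, left, right);
    }

    // Evaluates the child at the given index and unboxes the result.
    private static int intChild(MiniAst ast, int index) {
        return (Integer) exec(ast.children.get(index));
    }

    /** Evaluates a node by dispatching on its type and recursing into its children. */
    static Object exec(MiniAst ast) {
        switch (ast.type) {
            case INT:
                return Integer.parseInt(ast.text);
            case ADD:
                return intChild(ast, 0) + intChild(ast, 1);
            case MUL:
                return intChild(ast, 0) * intChild(ast, 1);
            case LT:
                return intChild(ast, 0) < intChild(ast, 1);
            default:
                throw new UnsupportedOperationException("unable to process " + ast.type);
        }
    }

    public static void main(String[] args) {
        // (2 + 3) * 4
        MiniAst product = op(Type.MUL, op(Type.ADD, num(2), num(3)), num(4));
        System.out.println(exec(product));                        // prints 20
        System.out.println(exec(op(Type.LT, product, num(100)))); // prints true
    }
}
```

Because every operand is evaluated by a nested call to exec before the operator is applied, the tree is effectively visited from the leaves up.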

The following shows only the implementation of the ifstat method, which handles if statements.

    private void ifstat(CShellAST ast) {
        // The components of the if statement are read from the child nodes of
        // the current AST node (the root node of the if statement)

        // the condition expression inside the parentheses of the if statement
        CShellAST expr = (CShellAST) ast.getChild(0);
        // the code block to execute when the if condition is true
        CShellAST ifblock = (CShellAST) ast.getChild(1);
        // the else if part (condition expressions and the blocks to execute)
        CShellAST elseifall = (CShellAST) ast.getChild(2);
        // the code block to execute in the else part
        CShellAST elseblock = (CShellAST) ast.getChild(3);
        // call exec recursively to evaluate the expression and get its value
        Boolean c = (Boolean) exec(expr);
        // if the condition is true, execute the if block
        if (c.booleanValue()) {
            exec(ifblock); // recursively execute the if block
        } else {
            // check for else if parts; CShell supports any number of else if clauses
            if (elseifall.getChildCount() > 0) {
                List<CShellAST> children = elseifall.getChildren();
                // test the expression after each else if in turn
                for (CShellAST child : children) {
                    expr = (CShellAST) child.getChild(0);
                    ifblock = (CShellAST) child.getChild(1);
                    c = (Boolean) exec(expr);
                    // if the condition of an else if is true, execute its block and return
                    if (c.booleanValue()) {
                        exec(ifblock);
                        return;
                    }
                }
            }
            // finally the else part is executed (all conditions above were false);
            // check whether an else clause exists (at most one is allowed)
            if (elseblock.getChildCount() == 1) {
                exec((CShellAST) elseblock.getChild(0)); // execute the else block
            }
        }
    }

The entry class of the CShell interpreter is CShell. It calls the Interpreter.process method to read and run CShell source code. bubble.cs is a CShell source file and can be replaced with any other source file. When process is called, the operations described by the CShell code are carried out. For example, a print(...) statement outputs the corresponding value.

 
    public class CShell {
        public static void main(String[] args) throws Exception {
            InputStream input = null;
            input = new FileInputStream("source/bubble.cs");
            Interpreter interp = new Interpreter();
            interp.process(input);
        }
    }

It doesn't matter if you don't fully understand ANTLR yet. This article is only an overview, not a detailed explanation; I just hope it gives you a feel for ANTLR and for the process of designing a language. Later articles in the series will explore compiler theory and the use of ANTLR in depth. Designing your own special-purpose language can significantly improve productivity: by abstracting routine work into a few statements, work that would otherwise take days can be finished by a script in a few hours.
