Yesterday we began designing a new language and set up a basic project structure. Today we will first get to know two very useful tools for building a compiler front end: lex & yacc, known on Linux as Flex and Bison.
Lex is a tool for building lexical analyzers: we write the lexical rules, and it generates a scanner that automatically splits the input into tokens for us. Tokenizing has always been the foundational task of a compilation system.
Today, let's try to write a BNF parser that turns our BNF text into a data structure the code can work with. The BNF input looks roughly like this:
```
{{ do_init() }}

# definition list
<definelist> = <definelist> <define> | E ;
<define>     = <constdef> | <vardef> | <functiondef> ;

# variables and constants definitions
<constdef>   = "const" {{ const_def_run }} <vardef> {{ isconstdef = false }} ;
<vardef>     = "int" <iddeflist> ";" ;
<iddeflist>  = <iddeflist> "," <iddef> | <iddef> ;
```
This is our input file. We want to parse it into an in-memory tree structure for our compiler to use, so the current task is to read in text like this and turn it into that tree.
As you can see, this notation extends the BNF form by allowing scripts for semantic actions to be embedded. A {{ }} script can also be used on its own to do some initialization work.
Splitting tokens with a lexical analyzer
Let's take a look at how the basic Lex scanner is written:
```lex
/* scanner.l */
%{
#include <stdio.h>
#include "parser.hpp"
#define SAVE_TOKEN yylval.str = yytext
extern "C" int yywrap() { return 1; }
%}
/* track the line number */
%option yylineno
%%
"/*"([^\*]|(\*)*[^\*/])*(\*)*"*/" ;  /* ignore block comments */
#[^\n]*\n              ;             /* ignore line comments */
[ \t\v\n\f]            ;             /* ignore whitespace */
\"(\\.|[^\\"])*\"      SAVE_TOKEN; return STRING;
"{{"([^}]|\}[^}])*"}}" SAVE_TOKEN; return SCRIPT;
"E"                    return 'E';
":"                    return ':';
"<"                    return '<';
">"                    return '>';
"["                    return '[';
"]"                    return ']';
"="                    return '=';
"|"                    return '|';
";"                    return ';';
[A-Za-z_][A-Za-z0-9_]* SAVE_TOKEN; return ID;
%%
```
The scanner is driven by these regular-expression rules. Note that the rules are ordered: when two rules match text of the same length, the one listed earlier wins and the later ones are not tried. But if some rule produces a longer match, the longest match always wins.
Here is one particularly tricky pattern:
"/*"([^\*]|(\*)*[^\*/])*(\*)*"*/"
which matches block comments such as /* comment */.
\"(\\.|[^\\"])*\"
which matches C-style string literals.
"{{"([^}]|\}[^}])*"}}"
which matches embedded scripts such as {{ somefunction() }}.
If these regular expressions are hard to follow, take a look at an earlier article I wrote, which I believe will be of some help:
"Lex recognizes C-style strings and comments"
The lexical analyzer uses several built-in variables and functions; let me explain what they mean:
extern "C" int yywrap() { return 1; }
yywrap is the hook for handling multiple files: it is called whenever the current file reaches its end. If you have several files and want Lex to process them in succession, you can keep a file list and, inside yywrap, plug the next file into Lex's input stream and return 0; Lex then continues as if nothing happened. Once yywrap returns 1, it means all input has been consumed and scanning can end.
#define SAVE_TOKEN yylval.str = yytext
This SAVE_TOKEN macro uses yylval, a variable built into yacc. yylval has a union type; in a moment you will see its definition in the %union declaration of the yacc file. The concrete declaration can be found in parser.hpp, the header file generated by yacc.
%option yylineno
This option lets Lex's error mechanism report the exact line number of a token. It certainly costs a few resources, but it matters a lot for debugging. Once enabled, you can find the line number of the currently recognized position simply by referencing the variable yylineno externally:
extern int yylineno;
Using YACC to recognize grammar
To correctly recognize the whole BNF syntax and parse it into structure, we wrote the following yacc program:
```yacc
/* parser.y */
%{
#include <stdio.h>
extern int yylex();
extern int yylineno;
extern char* yytext;
void yyerror(const char* s);
%}
%union {
    char* str;
}
%token <str> ID STRING SCRIPT
%start list
%%
/* general list mixing BNF and scripts */
list: item | list item;
/* an item can be BNF or a script */
item: bnf_item | SCRIPT;
/* one line of BNF definition */
bnf_item: symbol '=' bnf_list ';';
/* the part behind the '=' */
bnf_list: symbol_list | bnf_list '|' symbol_list;
/* a list of BNF symbols */
symbol_list: symbol | symbol_list symbol;
/* the symbols available in BNF */
symbol: '<' name '>' | '[' name ']' | 'E' | STRING | SCRIPT;
/* a name, optionally with an instance name */
name: ID | ID ':' ID;
%%
void yyerror(const char* s) {
    fprintf(stderr, "%s\n", s);
    fprintf(stderr, "line %d: ", yylineno);
    fprintf(stderr, "error %s\n", yytext);
}
```
This program only defines the structure of the whole BNF grammar and the rules by which it is composed; no semantic actions have been added yet. We will add them next time to turn it into a usable parser.
Adding a main processing function
The source files we write for Yacc and Lex are translated into C++ code, but they only contain the basic handler functions. If you want to process a file, you have to write the file-opening code yourself and redirect the file into the yyin input stream.
```cpp
#include <stdio.h>
#include "parser.hpp"
#include "help_message.h"

extern FILE* yyin;
FILE* file_in;

int main(int argc, const char* argv[]) {
    printf("Welcome to use the xscript!\n");
    if (argc <= 1) {
        printf(HELP_MESSAGE);
    } else {
        /* Open the file and redirect the yyin stream. */
        const char* file_in_name = argv[1];
        if ((file_in = fopen(file_in_name, "r")) == NULL) {
            printf("error on open %s file!", file_in_name);
            getchar();
            return 1;
        }
        yyin = file_in;
        yyparse();
        /* You should close the file. */
        fclose(file_in);
    }
    return 0;
}
```
With that, the main functions are all in place. I also want to give this project a name: let's call it xscript for now, meaning a changeable script, in the hope that it becomes a translation language with customizable grammar.
You may well ask: now that it is all written, how do we build it? We use CMake to build the whole project. CMake now has quite convenient support for directly invoking the Linux versions of Lex and Yacc; we only need two CMake modules to set up the build:
```cmake
cmake_minimum_required(VERSION 2.8)
project(scanner)
set(CMAKE_CXX_COMPILER_ENV_VAR "CXX")
set(CMAKE_CXX_FLAGS "-std=c++11")
include_directories(include build src)

# bison and flex
find_package(BISON)
find_package(FLEX)
flex_target(SCANNER src/scanner.l ${CMAKE_CURRENT_BINARY_DIR}/scanner.cpp)
bison_target(PARSER src/parser.y ${CMAKE_CURRENT_BINARY_DIR}/parser.cpp)
add_flex_bison_dependency(SCANNER PARSER)

# src files and make exe
file(GLOB_RECURSE SOURCE_FILES "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cpp")
add_executable(scanner ${SOURCE_FILES} ${BISON_PARSER_OUTPUTS} ${FLEX_SCANNER_OUTPUTS})
```
The project's file layout now looks like this:
```
lr_scanner
|---- build
|---- src
|     |---- main.cpp
|     |---- help_message.h
|     |---- parser.y
|     |---- scanner.l
|---- CMakeLists.txt
```
OK, from the build directory run:
```shell
cmake ..
make
```
This compiles successfully, but running it appears to do nothing. That is only because no semantic actions are executed yet; once we add them, the effect will be quite different. Even now, whenever the input syntax is wrong, the parser reports an error at the corresponding location, which is already quite convenient.
Creating a new language (2): building a simple parser with Lex & Yacc