Using C++11 regex to easily implement a lexical analyzer: mini-lexer

Last Update:2017-10-15 Source: Internet

Author: User

Tags lexer

While recently reading C++ Primer, I found that the C++11 standard library includes regular expressions (I have lagged behind the compilers for years, and we are already at C++17). I happened to want to hack together a compiler, so I started with something simple: a lexical analyzer similar to flex. The project is small, fewer than 400 lines in total. Without a regular expression library, writing an NFA and DFA that handle the most basic regular grammar is easy enough, but supporting POSIX-standard syntax is tedious, for example handling curly braces {}, where \d{3} means three digits. So why not use flex directly? First, flex does not support C++ well; second, I wanted to practice my C++. Below is a brief introduction to the implementation:
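As a quick illustration of the point about brace quantifiers, here is a minimal sketch showing that std::regex, with its default ECMAScript grammar, accepts \d{3} directly. The helper name isThreeDigits is made up for this example and is not part of the project.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not from mini-lexer: true if s is exactly three digits.
bool isThreeDigits(const std::string& s) {
    // The default std::regex grammar (ECMAScript) understands \d and {n}.
    static const std::regex threeDigits(R"(\d{3})");
    return std::regex_match(s, threeDigits);  // the whole string must match
}
```

Calling isThreeDigits("123") succeeds, while "12" or "12a" are rejected, with no hand-written NFA or DFA involved.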

GitHub: https://github.com/yuandong-chen/mini-lexer

The project consists of three .cpp files and three corresponding .hpp header files. As flex users know, we can define our own regular expression macros, such as a [a-z], and then use macro a to build more complex regular expressions, such as {a}+(ok)\$, which define our tokens. An action can then be attached to a token, such as {a}+(ok)\$ {printf("ok!\n"); return 1;}, and the input text is split into a series of tokens according to rule priority (order of precedence). The external interface consists of yylex(), yytext, yyleng, yyline, yyin, yyout, and so on. We therefore split the implementation into three steps. The first file stores and looks up the regular expression macros, using a map<string, string>; its header file is as follows:

    #pragma once
    #include <map>
    #include <string>   // std::string
    #include <utility>  // std::pair

    namespace minilex {

        class MacroHandler
        {
        private:
            // macro name -> macro definition
            std::map<std::string, std::string> macrotab;

        public:
            MacroHandler() = default;
            ~MacroHandler() = default;
            std::string expandMacro(const std::string& macroname);
            void addMacro(std::pair<std::string, std::string> macro);
        };
    }

Here addMacro takes a macro definition as a pair, such as {std::string("a"), std::string("[a-z]")}. expandMacro returns the macro definition wrapped in an outermost pair of parentheses, or an empty string if the name is not found in the map. The implementation is not pasted here because it follows the description above exactly; see the project code for details.
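The behaviour just described could be sketched roughly as follows. This is an illustration written from the description, not the project's actual code: overwriting an existing macro on a second addMacro call is an assumption, as is making expandMacro a const member.

```cpp
#include <map>
#include <string>
#include <utility>

// Sketch only: a stand-alone version of the macro table described above.
class MacroHandler {
    std::map<std::string, std::string> macrotab;  // name -> definition
public:
    // Store a macro such as {"a", "[a-z]"}; a later definition of the
    // same name replaces the earlier one (assumption).
    void addMacro(std::pair<std::string, std::string> macro) {
        macrotab[macro.first] = macro.second;
    }

    // Return the definition wrapped in parentheses, or "" if unknown.
    std::string expandMacro(const std::string& macroname) const {
        auto it = macrotab.find(macroname);
        if (it == macrotab.end()) return "";
        return "(" + it->second + ")";
    }
};
```

Wrapping the definition in parentheses keeps a following quantifier attached to the whole macro: {a}+ becomes ([a-z])+ rather than [a-z]+ by coincidence of the definition's shape.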

The second file handles expressions such as {a}+(ok)\$: given a string s, it splits s into tokens according to the registered regular expression rules and their order. The header file is as follows:

    #pragma once
    #include <functional>
    #include <list>
    #include <memory>   // std::unique_ptr
    #include <regex>
    #include <string>
    #include "macrohandler.hpp"

    namespace minilex {

        class RegularExp
        {
        private:
            bool success = false;             // did the last eat() match?
            std::string currentmatch = "";    // rule text of the last match
            std::unique_ptr<MacroHandler> macrohp;
            // compiled regex paired with the rule text it was built from
            std::list<std::pair<std::unique_ptr<std::regex>, std::string> > regularexps;
        public:
            std::string expandRegularExp(const std::string& rexp);
            std::string extractMacroName(const std::string& rexp, int& index, int max);
            std::string expandMacro(const std::string& macroname);
        public:
            RegularExp(std::unique_ptr<MacroHandler>&& macrorp);
            ~RegularExp() = default;
            void addRegularExp(std::string rexp);
            void removeRegularExp(std::string rexp);
            std::string eat(std::string& txt);
            bool isEaten() { return success; };
            std::string matchPattern() { return currentmatch; };
        };

    }

Notice the addRegularExp function, which we can call as addRegularExp("{a}+(ok)\$");. The expandRegularExp function fully expands a regular expression by finding and replacing macro references. If a is defined as {b}, the function resolves the macro definitions recursively; if a macro is defined in terms of itself, such as a a{a}, the function recurses without bound and fails with a stack overflow. The removeRegularExp function removes rules dynamically: for example, after adding the expression "[a-z]+(ok)\$", we may find partway through the text that for some reason (an undef, say) the rule is no longer needed, and we can remove it. The eat function eats a string: the eaten portion is returned and the remainder is left in the argument. The matchPattern function tells us which regular expression (rule) was matched. Because it relies on C++11 regex, the implementation is very simple; see the project code for details.
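To make the expansion and eating steps concrete, here is a compact sketch of how they might work. It is not the project's code: the class name MiniRules and the methods addRule and removeRule are invented for this example, rules are tried in insertion order with the first non-empty prefix match winning (an assumption about the priority scheme), and a depth limit of 32 replaces the unbounded recursion with an exception.

```cpp
#include <list>
#include <map>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the project's classes; all names are illustrative.
class MiniRules {
    using Rule = std::pair<std::regex, std::string>;  // compiled rule + rule text
    std::map<std::string, std::string> macros;        // name -> definition
    std::list<Rule> rules;
    bool success = false;
    std::string currentMatch;

    // Replace each {name} that is a known macro by (definition), recursively.
    // Unknown names, such as the 3 in \d{3}, are copied through unchanged.
    std::string expand(const std::string& rexp, int depth = 0) const {
        if (depth > 32) throw std::runtime_error("macro recursion too deep");
        std::string out;
        for (std::size_t i = 0; i < rexp.size(); ++i) {
            if (rexp[i] == '\\' && i + 1 < rexp.size()) {  // keep escapes verbatim
                out += rexp[i];
                out += rexp[++i];
                continue;
            }
            std::size_t close =
                (rexp[i] == '{') ? rexp.find('}', i) : std::string::npos;
            if (close != std::string::npos) {
                auto it = macros.find(rexp.substr(i + 1, close - i - 1));
                if (it != macros.end()) {
                    out += "(" + expand(it->second, depth + 1) + ")";
                    i = close;  // skip past the closing brace
                    continue;
                }
            }
            out += rexp[i];
        }
        return out;
    }

public:
    void addMacro(const std::string& name, const std::string& def) {
        macros[name] = def;
    }
    void addRule(const std::string& rexp) {
        rules.emplace_back(std::regex(expand(rexp)), rexp);
    }
    void removeRule(const std::string& rexp) {
        rules.remove_if([&](const Rule& r) { return r.second == rexp; });
    }

    // Try rules in insertion order; the first non-empty prefix match wins.
    // The eaten prefix is returned and removed from txt.
    std::string eat(std::string& txt) {
        success = false;
        for (const Rule& rule : rules) {
            std::smatch m;
            if (std::regex_search(txt, m, rule.first,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > 0) {
                std::string eaten = m.str(0);  // copy before modifying txt
                txt.erase(0, eaten.size());
                currentMatch = rule.second;
                success = true;
                return eaten;
            }
        }
        return "";
    }
    bool isEaten() const { return success; }
    std::string matchPattern() const { return currentMatch; }
};
```

The match_continuous flag anchors the search at the start of the buffer, which is what makes regex_search behave like a prefix matcher here. The sketch does not treat a { inside a bracket expression specially, which a fuller implementation would need to do.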

The third file implements yylex() and the other external interfaces. The implementation of yylex is shown here:

    int minilex::yylex() {
        std::string eaten;
        std::string pattern;
        // Nothing left to read and nothing buffered: end of input.
        if (yyin.eof() && linebuffer.empty())
        {
            return 0;
        }

        // Refill the buffer one line at a time.
        if (linebuffer.empty())
        {
            std::getline(yyin, linebuffer);
            yyline++;
        }

        // Eat one token from the front of the buffer.
        eaten = reup->eat(linebuffer);
        yyleng = eaten.size();
        if (!reup->isEaten())
        {
            std::cerr << "STOP, cannot interpret STRING:" << linebuffer << std::endl;
            return 0;
        }

        pattern = reup->matchPattern();

        /* You are required to modify the following code for your own purposes */
        if (pattern == std::string("{digit}+")) {
            std::cerr << "recognize:" << yyleng << ' ' << eaten << std::endl;
            return 1;
        }
        else if (pattern == std::string("{alpha}+")) {
            std::cerr << "recognize:" << yyleng << ' ' << eaten << std::endl;
            return 2;
        }
        else if (pattern == std::string("{Equal}")) {
            std::cerr << "recognize:" << yyleng << ' ' << eaten << std::endl;
            return 3;
        }
        else if (pattern == std::string("{CAL}")) {
            std::cerr << "recognize:" << yyleng << ' ' << eaten << std::endl;
            return 4;
        }
        else if (pattern == std::string(".")) {
            std::cerr << "unrecognize:" << yyleng << ' ' << eaten << std::endl;
            return 5;
        }

        return 0;
    }

One small limitation is that a regular expression such as [a-z]"\n"[0-9] cannot be matched correctly, because the input is read one line at a time: each token is matched within the current line, and only then is the next line read. I doubt anyone would define such a strange token spanning lines. In addition, I did not go on to implement code generation from a configuration file; instead, the programmer modifies the source code directly. I think this gives more freedom: you can change anything in the source files rather than wrestle with a pile of odd configuration syntax.
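The line-at-a-time limitation can be reproduced in isolation. In the sketch below, the function name countLineMatches is hypothetical, and the check mirrors only the buffering strategy, not the project's code: std::getline strips the newline, so a pattern that contains '\n' never sees one.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Count how many getline-produced buffers begin with a match of `pattern`.
int countLineMatches(const std::string& text, const std::string& pattern) {
    std::istringstream in(text);
    std::regex re(pattern);
    std::string line;
    int n = 0;
    while (std::getline(in, line)) {  // getline drops the trailing '\n'
        if (std::regex_search(line, re,
                              std::regex_constants::match_continuous)) {
            ++n;
        }
    }
    return n;
}
```

For the text "a\n1\n", the pattern "[a-z]\n[0-9]" matches the text as a whole but matches none of the individual line buffers, whereas "[a-z]" matches the first line.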

This article is an English version of an article originally published in Chinese on aliyun.com.
