Lex and Yacc are the classic lexical analyzer and parser generators on UNIX. On Linux their counterparts are flex and bison, and they are often used with C++ as tools for building text-analysis programs.
This is not an introductory article, so we assume you already know the basic syntax of Lex and Yacc.
For an introduction, see IBM's "Yacc and Lex QuickStart".
Here we discuss some of their more interesting uses and a few points to watch out for.
Recognition of strings
Routine regular-expression matching problems should not trouble you, so here is a harder question: how do you recognize a string literal in C?
We know that a string usually looks like this:
"some \"string\" problem.\n"
Notice that it contains escape characters and escaped quotation marks. Suppose we simply write the regular expression as follows:
\"[^"]*\"
This cannot express quotation marks inside the string, so it does not meet the requirements of the C language.
So let us take the part between the quotes on its own. A bare quotation mark may not appear there, but the escaped form \" may, so we change the regular expression to:
\"(\\\"|[^"])*\"
OK, now \" can be used to escape a quotation mark. But stopping here would be hasty, because one important case remains: the [^"] branch can also match a \ on its own. A \ is the escape character and must be paired with the character that follows it; alone it is not valid. In "a\\", for example, the final quotation mark closes the string, yet the expression above can take the second \ and that quotation mark as \" and keep scanning. So we add a restriction that keeps \ from appearing on its own, and the regular expression becomes:
\"(\\.|[^"\\])*\"
This is the regular expression we use to recognize C string literals.
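The difference between the three expressions can be checked outside Lex. The sketch below is only an illustration: it assumes a POSIX system with <regex.h>, rewrites the expressions as POSIX extended regular expressions anchored with ^, and uses a hypothetical helper match_len that is not part of Lex or of the original article. POSIX syntax differs from Lex in small ways (for example, . also matches a line break here), so treat it as an approximation.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

// The three expressions from the text, as POSIX extended regular
// expressions, anchored with ^ so they must match at the start.
#define NAIVE_STRING  "^\"[^\"]*\""
#define QUOTE_ONLY    "^\"(\\\\\"|[^\"])*\""
#define FINAL_STRING  "^\"(\\\\.|[^\"\\\\])*\""

// Hypothetical helper: length of the match of `pattern` at the start of
// `text`, or -1 if the pattern does not compile or does not match.
long match_len(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    long len = -1;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0)
        len = (long)(m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}
```

For the input "a\\" "b", match_len(QUOTE_ONLY, ...) gives 7 because the match runs on into the next literal, while match_len(FINAL_STRING, ...) gives 5 and stops at the correct closing quotation mark.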
Recognition of comments
With the difficult problem of string recognition solved, another case comes up: the C language has two kinds of comments. How do we recognize them correctly?
// hello world
/**
 * hello world
 */
The first kind is easier to implement. Using a method similar to the one above, we only require that the comment contain no line break:
"//"[^\n]*
The second kind is more complex, although there is still a simple way to implement it:
"/*"([^\*]|(\*)*[^\*/])*(\*)*"*/"
This regular expression is quite complicated, so let us break it down and explain it:
"/*" ( [^\*] | (\*)* [^\*/] )* (\*)* "*/"
( [^\*] | (\*)* [^\*/] )*
This part matches either a character that is not *, or a run of * followed by a character that is neither * nor /; both are allowed inside a comment. Some will ask why the run of * may not be followed by another *. If it could be, one iteration could consume a * together with the * after it, and the next iteration would then be free to start with /, so the match could run past the closing */. The restriction avoids this. And because the comment may end with several consecutive *, as in **/, we append (\*)* before the closing "*/".
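To make the structure of the two comment expressions concrete, here is a minimal hand-written sketch of the same logic in C. The function names line_comment_len and block_comment_len are hypothetical and exist only for this illustration; Lex generates an automaton rather than code of this shape.

```c
#include <assert.h>
#include <stddef.h>

// Length of the // comment at the start of s, or 0 if there is none.
// Mirrors "//"[^\n]* : everything up to, but not including, the line break.
size_t line_comment_len(const char *s)
{
    size_t i = 2;

    assert(s != NULL);
    if (s[0] != '/' || s[1] != '/')
        return 0;
    while (s[i] != '\0' && s[i] != '\n')
        i++;
    return i;
}

// Length of the block comment at the start of s, or 0 if s does not
// start with a complete block comment.
size_t block_comment_len(const char *s)
{
    size_t i = 2;

    assert(s != NULL);
    if (s[0] != '/' || s[1] != '*')
        return 0;
    while (s[i] != '\0') {
        if (s[i] != '*') {      // first branch: a character that is not '*'
            i++;
            continue;
        }
        while (s[i] == '*')     // a run of '*'
            i++;
        if (s[i] == '/')        // the run is followed by '/': comment ends
            return i + 1;
        // Otherwise the run is followed by a character that is neither
        // '*' nor '/' (second branch), or by the end of the input.
    }
    return 0;                   // unterminated comment
}
```

Applied to "/* a */ b */", block_comment_len returns 7: the scan stops at the first closing delimiter, which is the behaviour the restriction on the second branch is meant to guarantee.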
With other regular-expression engines there are simpler solutions; for details, see the English blog post "Finding Comments in Source Code Using Regular Expressions".
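In practical use of Lex there is also a convenient alternative: hand the comment to a fixed piece of C code that reads past it, so that no token is produced for the comment. The method is as follows:
"/*"    { comment(); }
%%
void comment(void)
{
    int c, c1;

loop:
    /* Echo the comment text up to the next '*'. input() returns 0 at end of
       input in traditional lex; flex may return EOF instead, so test for
       that as well if needed. */
    while ((c = input()) != '*' && c != 0)
        putchar(c);

    /* A '*' that is not followed by '/' does not close the comment:
       push the character back and keep scanning. */
    if (c != 0 && (c1 = input()) != '/' && c1 != 0) {
        unput(c1);
        goto loop;
    }

    /* Echo the closing '/'. Drop the putchar calls to discard the comment silently. */
    if (c != 0 && c1 != 0)
        putchar(c1);
}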
Interesting questions in Lex