Flex is a lexical analyzer generator, one of the tools most commonly used by people who write compilers and interpreters. A Flex program consists mainly of a series of regular expressions, each paired with a piece of code called an action. Flex translates all of the regular expressions into a deterministic finite automaton (DFA), which is why the lexical analyzers it generates match input patterns very efficiently. Of course, some people accuse Flex of being inflexible and limited in functionality, and of being unable to handle many problems, such as the ambiguous syntax of JavaScript and C++; they also point out that the lexical analyzers of many programs (the Python interpreter, for example) are written by hand rather than generated by Flex. None of that is within the scope of this article. I only want to take a very rough look at how fast Flex matches regular expressions, by comparing a word count program (word counter) generated by Flex with one written by hand in C.
Test method: build one program from the Flex source and one from hand-written C, prepare three input files of different sizes, and use the shell's time command to measure how long each program runs.
The source code follows. First, the Flex source, flexwc.l:
/* A word count program implemented with Flex.
   flexwc.l
 */

%option noyywrap
%{
#include <stdio.h>
#include <string.h>   /* for strlen() */

/* Global counters: characters, words, and lines, respectively. */
int chars = 0;
int words = 0;
int lines = 0;
%}
%%
[a-zA-Z]+  { ++words; chars += strlen(yytext); /* match any word and add the length of the matched text to the character count */ }
\n         { ++chars; ++lines; /* a newline character ends the current line */ }
.          { ++chars; /* any other character */ }
%%

int main(int argc, char **argv)
{
    yylex();   /* call the scanner routine generated by Flex */
    printf("%8d%8d%8d\n", lines, words, chars);   /* print lines, words, characters */
    return 0;
}
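Compile commands:
$ flex flexwc.l
$ gcc -O3 -o flexwc lex.yy.c
(Flex writes the generated scanner to lex.yy.c. If flexwc.l does not set %option noyywrap, add -lfl so that the linker can find the default yywrap().)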
Here is the hand-written C code, manwc.c:
/* Word count program written by hand in C.
   manwc.c
 */

#include <stdio.h>
#include <ctype.h>   /* for isalpha() */

/* Global counters: characters, lines, and words read from the input. */
int i_char = 0, i_line = 0, i_word = 0;

int main(void)
{
    int inword = 0;
    /* inword == 1 means we are inside a word (the word has not ended yet);
       otherwise the previous word has ended */
    int ch;   /* int rather than char, so that EOF can be told apart from real characters */
    while ((ch = getchar()) != EOF)   /* until the end of the input */
    {
        ++i_char;                      /* count the character */
        if ('\n' == ch)                /* count the line */
            ++i_line;
        if (isalpha(ch) && !inword)    /* the beginning of a word */
        {
            inword = 1;
            ++i_word;
        }
        if (!isalpha(ch) && inword)    /* the end of a word */
        {
            inword = 0;
        }
    }
    printf("%8d%8d%8d\n", i_line, i_word, i_char);   /* print lines, words, characters */
    return 0;
}
Compile command:
$ gcc -O3 -o manwc manwc.c
The first test, with commands and results (autowc is the Flex-generated program):
$ time ./autowc < foo.txt
       4       0       7

real    0m0.014s
user    0m0.000s
sys     0m0.000s

$ time ./manwc < foo.txt
       4       0       7

real    0m0.024s
user    0m0.000s
sys     0m0.000s
Note that input redirection ('<') is used here because both the Flex-generated scanner and the hand-written C program read from stdin by default.
The second test, with commands and results:
$ time ./autowc < lex.yy.c
    1823    6705   45887

real    0m0.008s
user    0m0.004s
sys     0m0.000s

$ time ./manwc < lex.yy.c
    1823    6705   45887

real    0m0.008s
user    0m0.004s
sys     0m0.000s
The third test, with commands and results:
$ time ./autowc < maintainers
    9721   46364  269584

real    0m0.013s
user    0m0.012s
sys     0m0.000s

$ time ./manwc < maintainers
    9721   46364  269584

real    0m0.019s
user    0m0.020s
sys     0m0.000s
(These results depend heavily on the machine and should be taken only as a reference.)
Across the three tests, from the smallest file to the largest, the lexical analyzer generated by Flex was never slower than the hand-written C code, and in two of the three it was faster. One can expect that as the patterns become more complex, the code generated by Flex will pull further ahead of purely hand-written C. This is because Flex compiles the regular expressions into an internal form (the deterministic finite automaton) that does a fixed amount of work per input character, so matching time is almost independent of how complex the patterns are (that is, a more complex pattern does not mean a proportionally longer match, although there are exceptions).
Hand-written C code, in contrast, tends to analyze the character stream step by step. Take the '/' symbol in C: when the first '/' is read, it is not yet clear what it means, so the code has to read the next character. If that character is a digit, for example, the '/' is the division operator. If it is another '/', this is a line comment and the rest of the line can be ignored. If it is '*', the code has to find where the closing "*/" appears and ignore the comment in between, and if no matching "*/" is found it has to report an error. Flex lets you write an explicit regular expression for each of these three cases: "/", "//", and "/*" (the last case, matching a /* */ comment, also relies on start conditions, which are not described here). In short, remember one point: matching once is almost always faster than matching one character at a time over several steps, though of course there are exceptions.
Summary: Flex is a tool for processing formatted text. The code it generates is useful not only to people who write compilers and interpreters, but for any formatted file that needs to be parsed: quickly finding a particular pattern in a file, automatic typesetting, automatic code indentation, syntax coloring, and so on. It all depends on what you want to do with it!
Comparison of the regular expression matching speed of Flex with manual code