Comparison of the regular expression matching speed of Flex with manual code

Last Update:2015-04-30 Source: Internet

Author: User

Flex is a lexical analyzer generator, one of the tools most commonly used by people who write compilers and interpreters. A Flex program consists mainly of a list of regular expressions, each paired with a piece of C code called an action. When it generates a scanner, Flex translates all of the regular expressions into a single deterministic finite automaton (DFA), which is why lexical analyzers produced by generators such as Flex match input patterns very efficiently. Some people do criticize Flex for being inflexible and limited in functionality, and for being unable to handle many problems, such as the lexical ambiguities of JavaScript and C++. The lexical analyzers of many programs (the Python interpreter, for example) are also written by hand rather than generated by Flex. None of that is within the scope of this article. Here I only want to take a very rough look at how fast Flex matches regular expressions, by comparing a word count program generated by Flex with one written by hand in C.

Test method: build one program from the Flex-generated C code and one from the hand-written C code, prepare three input files of different sizes, and use the shell's time command to measure how long each program runs.

The source code follows. First, the Flex source, flexwc.l:

/* A word count program implemented with Flex.
   flexwc.l */

%{
#include <stdio.h>
#include <string.h>
/* Global counters for characters, words, and lines, respectively. */
int chars = 0;
int words = 0;
int lines = 0;
%}
%%
[a-zA-Z]+  { ++words; chars += strlen(yytext); }  /* match a word and add its length to the character count */
\n         { ++chars; ++lines; }                  /* a newline character starts a new line */
.          { ++chars; }                           /* any other character */
%%

int main(int argc, char **argv)
{
    yylex();  /* invoke the scanner routine generated by Flex */
    printf("%8d%8d%8d\n", lines, words, chars);  /* print the number of lines, words, and characters */
    return 0;
}

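Compile commands:

$ flex flexwc.l
$ gcc -o flexwc -O3 lex.yy.c -lfl

(Flex writes the generated scanner to lex.yy.c by default. The -lfl option links the Flex library, which supplies the default yywrap(); the library name can differ between systems.)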

Here is the hand-written C code, manwc.c:

/* Hand-written C word count program.
   manwc.c */

#include <stdio.h>
#include <ctype.h>  /* for isalpha() */

/* Global counters for the number of characters, lines, and words. */
int i_char = 0, i_line = 0, i_word = 0;

int main(void)
{
    int inword = 0;
    /* inword == 1 means we are inside a word (the word has not ended yet);
       otherwise the previous word has ended */
    int ch;  /* int rather than char, so that EOF can be told apart from data */
    while ((ch = getchar()) != EOF)  /* until end of input */
    {
        ++i_char;  /* count the character */
        if ('\n' == ch)  /* count the line */
            ++i_line;
        if (isalpha(ch) && !inword)  /* the beginning of a word */
        {
            inword = 1;
            ++i_word;
        }
        if (!isalpha(ch) && inword)  /* the end of a word */
        {
            inword = 0;
        }
    }
    printf("%8d%8d%8d\n", i_line, i_word, i_char);  /* print the number of lines, words, and characters */
    return 0;
}

Compile command:

$ gcc -o manwc -O3 manwc.c

The first test. The commands and results are as follows:

$ time ./autowc < foo.txt
       4       0       7

real    0m0.014s
user    0m0.000s
sys     0m0.000s

$ time ./manwc < foo.txt
       4       0       7

real    0m0.024s
user    0m0.000s
sys     0m0.000s

Note that input redirection ('<') is used here because both the Flex-generated scanner and the hand-written C code read from stdin by default.

The second test. The commands and results are as follows:

$ time ./autowc < lex.yy.c
    1823    6705   45887

real    0m0.008s
user    0m0.004s
sys     0m0.000s

$ time ./manwc < lex.yy.c
    1823    6705   45887

real    0m0.008s
user    0m0.004s
sys     0m0.000s

The third test. The commands and results are as follows:

$ time ./autowc < maintainers
    9721   46364  269584

real    0m0.013s
user    0m0.012s
sys     0m0.000s

$ time ./manwc < maintainers
    9721   46364  269584

real    0m0.019s
user    0m0.020s
sys     0m0.000s

(These results depend heavily on the machine and are for reference only.)

Across the three tests, from small input to large, the lexical analyzer generated by Flex was never slower than the hand-written C code, and it was faster in two of the three. It is reasonable to expect that, as the patterns become more complex, the code generated by Flex will run more efficiently than purely hand-written C code. This is because Flex compiles the regular expressions into an internal form (the deterministic finite automaton), so the time spent matching is almost independent of how complex the patterns are: a more complex pattern does not generally mean a longer match, although there are exceptions.

Hand-written C code, by contrast, tends to analyze the character stream step by step. Take the '/' symbol in C. After reading the first '/', the program cannot yet tell what it means, so it has to read the next character. If that character is a number, the '/' is the division operator. If it is another '/', this is a line comment and the rest of the line can be ignored. If it is '*', the program has to find where the matching "*/" appears and ignore the comment in between, and if no matching "*/" is found it has to report an error. Flex lets you write an explicit regular expression for each of these three cases: "/", "//", and "/*" (the last case, matching a /* */ comment, also relies on Flex start states, which are not described here).

In short, remember one point: matching a whole token in one step is almost always faster than matching it one character at a time over several steps, although of course there are exceptions.

Summary: Flex is a tool for processing formatted text. The code it generates is useful not only to people who write compilers and interpreters but also for any formatted file that needs to be parsed, for example quickly finding a particular pattern in a file, automatic typesetting, automatic code indentation, and syntax highlighting. It all depends on what you want to do with it!


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
