Finding Comments in Source Code Using Regular Expressions


Many text editors have advanced find (and replace) features. When I'm programming, I like to use an editor with regular expression search and replace. This feature allows one to find text based on complex patterns rather than based just on literals. Upon occasion I want to examine all of the comments in my source code and either edit them or remove them. I found that it was difficult to write a regular expression that would find C style comments (the comments that start with /* and end with */) because my text editor does not implement the "non-greedy matching" feature of regular expressions.

First Try

When first attempting this problem, most people consider the regular expression:

/\*.*\*/

This seems the natural way to do it. /\* finds the start of the comment (note that the literal * needs to be escaped because * has a special meaning in regular expressions), .* finds any number of any character, and \*/ finds the end of the expression.

The first problem with this approach is that .* does not match new lines.

/* First comment
 first comment, line two*/
/* Second comment */
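
The failure can be reproduced with a few lines of code. The following is a minimal sketch in Python's re module, chosen here only for illustration (the original text does not use Python); like the editors discussed, it does not let . match a newline by default. The names FIRST_TRY, source and matches are hypothetical.

```python
import re

# First-try pattern from the text. In Python's re module "." does not
# match "\n" unless the re.DOTALL flag is given.
FIRST_TRY = re.compile(r"/\*.*\*/")

source = (
    "/* First comment\n"
    " first comment, line two*/\n"
    "/* Second comment */\n"
)

matches = [m.group(0) for m in FIRST_TRY.finditer(source)]
print(matches)  # the two-line comment is missing from the result
```

Only the comment that fits on one line is found; the comment that spans two lines is skipped entirely.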

Second Try

This can be overcome easily by replacing the . with [^] (in some regular expression packages) or more generally with (.|[\r\n]):

/\*(.|[\r\n])*\*/

This reveals a second, more serious problem: the expression matches too much. Regular expressions are greedy; they take in as much as they can. Consider the case in which your file has two comments. This regular expression would match them both along with anything in between:

start_code();
/* First comment */
more_code();
/* Second comment */
end_code();
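
The over-matching is easy to observe. This is a small sketch, again in Python purely for illustration, that runs the second-try pattern over the sample above; SECOND_TRY, source and matched_text are hypothetical names.

```python
import re

# Second-try pattern from the text: greedy, and able to cross newlines.
SECOND_TRY = re.compile(r"/\*(.|[\r\n])*\*/")

source = (
    "start_code();\n"
    "/* First comment */\n"
    "more_code();\n"
    "/* Second comment */\n"
    "end_code();\n"
)

match = SECOND_TRY.search(source)
matched_text = match.group(0)
print(repr(matched_text))  # one match that swallows more_code(); as well
```

The single match runs from the start of the first comment to the end of the second one, so the code between the comments would be edited or deleted along with them.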

Third Try

To fix this, the regular expression must accept less. We cannot accept just any character with a . ; we need to limit the types of characters that can be in our expressions:

/\*([^*]|[\r\n])*\*/

This simplistic approach doesn't accept any comments with a * in them.

/*
 * Common multi-line comment style.
 */
/* Second comment */
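
A short check, sketched in Python for illustration only, shows which of the two sample comments the third-try pattern still finds. THIRD_TRY, source and matches are hypothetical names.

```python
import re

# Third-try pattern from the text: no "*" is allowed inside the comment body.
THIRD_TRY = re.compile(r"/\*([^*]|[\r\n])*\*/")

source = (
    "/*\n"
    " * Common multi-line comment style.\n"
    " */\n"
    "/* Second comment */\n"
)

matches = [m.group(0) for m in THIRD_TRY.finditer(source)]
print(matches)  # the comment with a leading " * " line is not found
```

The first comment is rejected because of the * that decorates its second line.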

Fourth Try

This is where it gets tricky. How do we accept a * without accepting the * that is part of the end comment? The solution is to still accept any character that is not *, but also accept a * and anything that follows it, provided that it isn't followed by a /:

/\*([^*]|[\r\n]|(\*([^/]|[\r\n])))*\*/

This works better but again accepts too much in some cases. It will accept any even number of *. It might even accept the * that is supposed to end the comment.

start_code();
/****
 * Common multi-line comment style.
 ****/
more_code();
/*
 * Another common multi-line comment style.
 */
end_code();
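
The even-number-of-stars problem is subtle, so it may help to watch it happen. The sketch below (Python, for illustration; FOURTH_TRY, source and matched_text are hypothetical names) runs the fourth-try pattern over the sample above. The four stars of ****/ are consumed two at a time as "a * and whatever follows it", after which the / is accepted as an ordinary character and the match keeps going.

```python
import re

# Fourth-try pattern from the text: a "*" may be followed by anything but "/".
FOURTH_TRY = re.compile(r"/\*([^*]|[\r\n]|(\*([^/]|[\r\n])))*\*/")

source = (
    "start_code();\n"
    "/****\n * Common multi-line comment style.\n ****/\n"
    "more_code();\n"
    "/*\n * Another common multi-line comment style.\n */\n"
    "end_code();\n"
)

match = FOURTH_TRY.search(source)
matched_text = match.group(0)
print(repr(matched_text))  # runs past ****/ and into the second comment
```

The match only stops at the */ of the second comment, so more_code(); is once again caught inside it.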

Fifth Try

What we tried before will work if we accept any number of * followed by anything other than a * or a /:

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*/

Now the regular expression does not accept enough again. It's working better than ever, but it still leaves one case. It does not accept comments that end in multiple *.

/****
 * Common multi-line comment style.
 ****/
/****
 * Another common multi-line comment style.
 */
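
The remaining gap can be checked the same way. In this Python sketch (illustrative only; FIFTH_TRY, source and matches are hypothetical names) the comment that closes with ****/ is missed, while the one that closes with a plain */ is found.

```python
import re

# Fifth-try pattern from the text: runs of "*" are allowed inside the body,
# but the terminator is still exactly one "*" followed by "/".
FIFTH_TRY = re.compile(r"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*/")

source = (
    "/****\n * Common multi-line comment style.\n ****/\n"
    "/****\n * Another common multi-line comment style.\n */\n"
)

matches = [m.group(0) for m in FIFTH_TRY.finditer(source)]
print(matches)  # only the comment ending in a single "*/" is reported
```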

Solution

Now we just need to modify the comment end to allow any number of * :

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/

We now have a regular expression that we can paste into text editors that support regular expressions. Finding our comments is a matter of pressing the find button. You might be able to simplify this expression somewhat for your particular editor. For example, in some regular expression implementations [^] already includes [\r\n], so every [\r\n] can be removed from the expression.

This is easy to augment so that it will also find // style comments:

(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)
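
As a sanity check, the combined expression can be run against the kinds of comments that defeated the earlier attempts. This is a minimal sketch in Python, used here only for illustration; COMMENT, source, comments and stripped are hypothetical names, and the sample text is made up for the test.

```python
import re

# Final expression from the text: block comments (any number of "*" before
# the closing "/") or "//" comments that run to the end of the line.
COMMENT = re.compile(r"(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)")

source = (
    "start_code(); // trailing comment\n"
    "/****\n * Common multi-line comment style.\n ****/\n"
    "more_code();\n"
    "/* Second comment */\n"
    "end_code();\n"
)

# finditer() is used instead of findall() because the pattern contains
# capturing groups and we want the whole match each time.
comments = [m.group(0) for m in COMMENT.finditer(source)]
stripped = COMMENT.sub("", source)

for comment in comments:
    print(repr(comment))
print(stripped)
```

Each comment is matched separately and the code between them is left alone, which is the behavior the earlier tries could not achieve.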

Tool: nedit
Expression: (/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)
Usage: Ctrl+F to find, put in the expression, check the Regular Expression check box.
Notes: [^] does not include new line.

Tool: grep
Expression: (/\*([^*]|(\*+[^*/]))*\*+/)|(//.*)
Usage: grep -E "(/\*([^*]|(\*+[^*/]))*\*+/)|(//.*)" <files>
Notes: Does not support multi-line comments; prints each line that completely contains a comment.

Tool: Perl
Expression: /((?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:\/\/.*))/
Usage: perl -e "$/=undef;print<>=~/(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:\/\/.*)/g;" < <file>
Notes: Prints out all the comments run together. The (?: notation must be used for non-capturing parentheses. Each / must be escaped because it delimits the expression. $/=undef; is used so that the file is not matched line by line like grep.

Tool: Java
Expression: "(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)"
Usage: System.out.println(sourcecode.replaceAll("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)",""));
Notes: Prints out the contents of the string sourcecode with the comments removed. The (?: notation must be used for non-capturing parentheses. Each \ must be escaped in a Java String.

An Easier Method

Non-greedy Matching

Most regular expression packages support non-greedy matching. This means that the pattern will only be matched if there is no other choice. We can modify our second try to use the non-greedy matcher *? instead of the greedy matcher *. With this new tool, the middle of our comment will only match if it doesn't match the end:

/\*(.|[\r\n])*?\*/
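
The non-greedy expression can be tried out in the same way as the longer one. The sketch below uses Python for illustration only; NON_GREEDY, NON_GREEDY_DOTALL, source, found and found_dotall are hypothetical names. The second pattern is a Python-specific shortcut that is not part of the original text: the re.DOTALL flag lets . match newlines, so (.|[\r\n]) is no longer needed.

```python
import re

# Non-greedy expression from the text.
NON_GREEDY = re.compile(r"/\*(.|[\r\n])*?\*/")

# Assumption beyond the original text: with re.DOTALL, "." matches "\n" too.
NON_GREEDY_DOTALL = re.compile(r"/\*.*?\*/", re.DOTALL)

source = (
    "start_code();\n"
    "/****\n * Common multi-line comment style.\n ****/\n"
    "more_code();\n"
    "/* Second comment */\n"
    "end_code();\n"
)

found = [m.group(0) for m in NON_GREEDY.finditer(source)]
found_dotall = [m.group(0) for m in NON_GREEDY_DOTALL.finditer(source)]
print(found)
print(found == found_dotall)
```

Both patterns stop at the first */ they can reach, so each comment is matched on its own.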

Tool: nedit
Expression: /\*(.|[\r\n])*?\*/
Usage: Ctrl+F to find, put in the expression, check the Regular Expression check box.
Notes: [^] does not include new line.

Tool: grep
Expression: /\*.*?\*/
Usage: grep -E '/\*.*?\*/' <file>
Notes: Does not support multi-line comments; prints each line that completely contains a comment.

Tool: Perl
Expression: /\*(?:.|[\r\n])*?\*/
Usage: perl -0777ne 'print m!/\*(?:.|[\r\n])*?\*/!g;' <file>
Notes: Prints out all the comments run together. The (?: notation must be used for non-capturing parentheses. / does not have to be escaped because ! delimits the expression. -0777 is used to enable slurp mode and -n enables automatic reading.

Tool: Java
Expression: "/\\*(?:.|[\\n\\r])*?\\*/"
Usage: System.out.println(sourcecode.replaceAll("/\\*(?:.|[\\n\\r])*?\\*/",""));
Notes: Prints out the contents of the string sourcecode with the comments removed. The (?: notation must be used for non-capturing parentheses. Each \ must be escaped in a Java String.

Caveats

Comments Inside Other Elements

Although our regular expression describes C-style comments very well, there are still problems when something appears to be a comment but is actually part of a larger element.

// /*
some_code();
// */

The solution to this is to write regular expressions that describe each of the possible larger elements, find these as well, decide what type of element each is, and discard the ones that are not comments. There are tools called lexers or tokenizers that can help with this task. A lexer accepts regular expressions as input, scans a stream, picks out tokens that match the regular expressions, and classifies the token based on which expression it matched. The greedy property of regular expressions is used to ensure the longest match. Although writing a full lexer for C is beyond the scope of this document, those interested should look at lexer generators such as Flex and JFlex.
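
To give a feel for the idea without a lexer generator, here is a deliberately small sketch in Python (used for illustration only; it is not how Flex or JFlex work internally). It describes string literals, character literals and both comment styles as alternatives of one expression, and keeps only the matches classified as comments. TOKEN, find_comments and sample are hypothetical names, the literal patterns are simplified assumptions, and things such as preprocessor directives and line continuations are ignored.

```python
import re

# One alternative per element type. Because a string literal is consumed as
# a whole, comment-like text inside it is never looked at separately.
TOKEN = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\\n])*")                 # double-quoted string
  | (?P<char>'(?:\\.|[^'\\\n])*')                   # character literal
  | (?P<block_comment>/\*(?:[^*]|\*+[^*/])*\*+/)    # solution pattern, [\r\n] dropped
  | (?P<line_comment>//[^\n]*)                      # comment to end of line
    """,
    re.VERBOSE,
)


def find_comments(source):
    """Return the text of each comment, skipping string and char literals."""
    return [
        m.group(0)
        for m in TOKEN.finditer(source)
        if m.lastgroup in ("block_comment", "line_comment")
    ]


sample = (
    'someString = "An example comment: /* example */";\n'
    "// The comment around this code has been commented out.\n"
    "// /*\n"
    "some_code();\n"
    "// */\n"
)

for comment in find_comments(sample):
    print(comment)
```

Here the /* example */ inside the string is not reported, and the commented-out /* and */ markers are reported as three // comments rather than as one block comment wrapped around some_code();.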
