Parsing C# files with regular expressions (updated)

Source: Internet
Author: User
Tags: constructor, contains, documentation, empty, reference, regular expression, thread
Many readers have probably written a program that colors program code according to its syntax. For a long time that was hard to do: you had to write a lot of syntax-analysis code, which is often the hardest part. Regular expressions free us from this heavy work. Regular expressions provide a range of methods (standards, patterns) that enable us to efficiently create, compare, and modify strings, as well as quickly parse large amounts of text and data to search for, remove, and replace text patterns [1]. The .NET Framework provides the System.Text.RegularExpressions namespace to implement this functionality.
1. Regular expressions [2]

First of all, I would like to briefly introduce the regular expression.

Regular expressions were first described by the mathematician Stephen Kleene in 1956, as part of his work on formal models of neural networks and finite automata. Regular expressions with a complete syntax were used for pattern matching on characters and were later applied in the field of information technology. Since then, regular expressions have evolved through several stages, and the current standards are approved by ISO (International Organization for Standardization) and specified by The Open Group.

Regular expressions are not a proprietary language; they can be used to find and replace text in a file or a string. The standard defines two flavors: basic regular expressions (BRE) and extended regular expressions (ERE). ERE includes the BRE functionality plus additional concepts.

Regular expressions are implemented by tools such as the shell (XSH), egrep, sed, and vi, and by other programs on UNIX platforms. They have also been adopted in many languages, such as HTML and XML, usually as only a subset of the full standard. As programming languages have carried regular expressions across platforms, their functionality has become more complete and their use more widespread.

2. Related expressions

I can only say this much about regular expressions in general: they are no small body of knowledge and cannot be explained in a few words. Here I introduce only the patterns related to C# syntax parsing. For more information, see the Regular Expression specification [The Open Group] collected on this blog site. If you already have a good understanding of regular expressions, you can skip the explanations below and finish the article more quickly.

i> String: "(\\?.)*?"

In a regular expression, every character other than . $ ^ { [ ( | ) * + ? \ matches itself. In the pattern above, the quotation marks at both ends match the quotation marks that delimit the string. "\\" denotes a single "\" character. The "?" that follows it means zero or one occurrence of that character. "." matches any character except \n.

"()" indicates that the matching substring is captured. Captures made with () are numbered automatically starting from 1, in the order of their opening parentheses. Capture number zero is the text matched by the entire regular expression pattern. The "*" after the parentheses means that zero or more such substrings may occur; that is, "*" applies to "(\\?.)".

The "?" after "*" makes the quantifier lazy: it repeats as few times as possible, so the match ends at the first closing quotation mark it can reach, and an empty string "" is matched as well.
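The capture numbering described above can be tried out with a small sketch. The class and method names here (StringLiteralDemo, WholeMatch, LastCapture) are hypothetical and only serve the illustration; the pattern is the string pattern from item i>, written as a verbatim C# string.

```csharp
using System.Text.RegularExpressions;

public static class StringLiteralDemo
{
    // Regex text: "(\\?.)*?"  (quotes doubled because this is a verbatim C# string)
    private static readonly Regex DoubleQuoted = new Regex(@"""(\\?.)*?""");

    // Group 0 is the text matched by the whole pattern.
    public static string WholeMatch(string line)
    {
        return DoubleQuoted.Match(line).Groups[0].Value;
    }

    // Group 1 is the parenthesised part; when it repeats,
    // Value holds the substring captured by the last repetition.
    public static string LastCapture(string line)
    {
        return DoubleQuoted.Match(line).Groups[1].Value;
    }
}
```

For a line containing the literal "a\"b", WholeMatch should return the literal including both quotation marks, and LastCapture the final character b; for an empty literal "", group 1 does not participate and its Value is the empty string.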

ii> Verbatim string: @"(""|.)*?"

Intended to match a verbatim string such as @"Hello""World""!".

The | (vertical bar) character separates alternatives, any one of which may match; for example, cat|dog|tiger. The leftmost alternative that succeeds is used.
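A quick way to check how the verbatim-string pattern behaves is to run it against the example literal. The sketch below uses hypothetical names (VerbatimStringDemo, MatchLazy, MatchGreedy). The second pattern is not from the article; it is a greedy variant added here as an assumption, for comparison only.

```csharp
using System.Text.RegularExpressions;

public static class VerbatimStringDemo
{
    // Pattern from item ii>: @"(""|.)*?"  (lazy quantifier)
    public static string MatchLazy(string line)
    {
        return Regex.Match(line, "@\"(\"\"|.)*?\"").Value;
    }

    // Hypothetical greedy variant, not from the article: @"(""|[^"])*"
    // It consumes doubled quotation marks before looking for the closing one.
    public static string MatchGreedy(string line)
    {
        return Regex.Match(line, "@\"(\"\"|[^\"])*\"").Value;
    }
}
```

On a line containing @"Hello""World""!", the lazy pattern stops at the first quotation mark after the opening one and yields only @"Hello", while the greedy variant yields the whole literal. For simple coloring this may not matter much, since the remainder tends to be picked up by the ordinary string pattern, but it is worth knowing when exact token boundaries are needed.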

iii> XML elements in C# documentation comments: ///\s*<.*>

Matches the XML elements of C# XML documentation comments. "\s" represents any whitespace character. Note that you must not change the case at will: regular expressions are case sensitive, and among the escape characters the uppercase and lowercase forms often have exactly opposite meanings. For example, "\S" represents any non-whitespace character. (The same applies to "\Z" below.)

iv> Content of C# documentation comments: ///\s?.*

v> Empty line: ^\s*\Z

"^" specifies that the match must occur at the beginning of the string or, in multiline mode, at the beginning of a line. "\Z" specifies that the match must occur at the end of the string or before a \n at the end of the string.
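As a minimal sketch of the empty-line test, assuming the input is handed over one line at a time: the names EmptyLineDemo and IsEmptyLine are hypothetical, and the pattern uses the uppercase \Z, which matches at the end of the string or before a final \n as described above.

```csharp
using System.Text.RegularExpressions;

public static class EmptyLineDemo
{
    // ^\s*\Z : only whitespace between the start and the end of the string
    private static readonly Regex Empty = new Regex(@"^\s*\Z");

    // True when the line is empty or contains nothing but whitespace.
    public static bool IsEmptyLine(string line)
    {
        return Empty.IsMatch(line);
    }
}
```

IsEmptyLine returns true for "" and for a line of spaces or tabs, and false as soon as the line contains any non-whitespace character.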

vi> C# comment: //.*

vii> C# keyword: (abstract|where|while|yield){1}(\.|(\s)+|;|,|\(|\[){1}

Space is limited, so only a few keywords are listed here (C# has at least 80 keywords ^_^). Note that the engine takes the first alternative, from the left, that succeeds. Keywords where one contains another must therefore be ordered with the containing word before the contained one. For example, (in|int) on its own would never find int, because in always matches first; it should be written (int|in).

In addition, all brackets: (\{|\[|\(|\}|\]|\))
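The effect of alternative order is easy to reproduce. The helper below is a hypothetical illustration (AlternationOrderDemo and FirstMatch are not names from the article); it returns the text of the first match of a pattern in an input string.

```csharp
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Returns the text of the first match, or "" when nothing matches.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }
}
```

FirstMatch("(in|int)", "int x") yields in, whereas FirstMatch("(int|in)", "int x") yields int. When the alternation is followed by a required delimiter, as in item vii>, backtracking can still reach the longer keyword, but putting the longer alternative first avoids relying on that.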


3. Related classes and their members [3]

[Serializable]

public class Regex : ISerializable

Represents an immutable regular expression.



The Regex class contains several static methods that allow you to use regular expressions without having to explicitly create a Regex object. Using a static method is equivalent to constructing a Regex object, using the object once and then destroying it.

The Regex class is immutable (read-only) and is inherently thread safe. You can create a Regex object on any thread and share it between threads.
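The two ways of using the class can be placed side by side. This is only a sketch with hypothetical names (RegexUsageDemo, HasCommentInstance, HasCommentStatic); it uses the comment pattern //.* from item vi> and assumes that one shared, precompiled instance is what the application wants.

```csharp
using System.Text.RegularExpressions;

public static class RegexUsageDemo
{
    // Built once and reused. Because a Regex instance is immutable,
    // the same object can be shared between threads.
    private static readonly Regex Comment =
        new Regex("//.*", RegexOptions.Compiled);

    // Instance method on the shared object.
    public static bool HasCommentInstance(string line)
    {
        return Comment.IsMatch(line);
    }

    // Static method: no Regex object is created explicitly by the caller.
    public static bool HasCommentStatic(string line)
    {
        return Regex.IsMatch(line, "//.*");
    }
}
```

Both methods give the same answer for the same input; they differ only in who constructs and keeps the Regex object.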

The above is excerpted from Microsoft's development documentation. We also need to use several of its members:

Searches the specified input string for a regular expression match specified in the Regex constructor.

public Match Match(
    string input
);


The Match class

[Serializable]

public class Match : Group

Represents the result of a single regular expression match. For more information about Group, see the Microsoft development documentation.


We will use the following members:

The zero-based start position of the captured substring found in the original string.

public int Index {get;}



The length of the captured substring.

public int Length {get;}



The actual substring captured by the match.

public string Value {get;}



Gets a value that indicates whether the match was successful.

public bool Success {get;}



Gets a collection of groups that are matched by regular expressions.

public virtual GroupCollection Groups {get;}



Returns a new Match with the results for the next match, starting at the position where the previous match ended (that is, at the character after the last matched character).

public Match NextMatch();


We will also use the corresponding members of the Group class. (Of the Match members listed above, the first four properties are inherited from the Group class, so they are not listed again.)
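The members listed above can be seen working together in a short sketch. MatchMembersDemo and Describe are hypothetical names, and the "index:length:value" text format is just an assumption chosen to make the result easy to inspect.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchMembersDemo
{
    // Line-comment pattern from item vi>: //.*
    private static readonly Regex LineComment = new Regex("//.*");

    // Describes every match in the line as "index:length:value".
    public static List<string> Describe(string line)
    {
        var result = new List<string>();
        Match m = LineComment.Match(line);
        while (m.Success)                 // Success is false once no match is left
        {
            result.Add(m.Index + ":" + m.Length + ":" + m.Value);
            m = m.NextMatch();            // continue after the previous match
        }
        return result;
    }
}
```

For the line "int x = 1; // counter" the list contains a single entry, 11:10:// counter, because the comment starts at zero-based position 11 and is 10 characters long. A coloring routine would use Index and Length to decide which range of the line to paint.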

The pattern must be specified when an instance of the Regex class is initialized. You can use the constructor to create an instance, use it, and then discard it, or you can call a static method directly, which is equivalent to creating an instance. In my tests, however, the static method was slightly slower than a compiled Regex object. Take a look at the following set of test data:




4. Writing code

We now need to analyze the C# language elements listed in section 2. I use line-by-line analysis (if you want to analyze several lines at once, the related expressions need to be modified [4]).

using System.Text.RegularExpressions;

// Some other code ...

First, create a Regex instance (using string matching as the example).

Regex doubleQuotedString = new Regex("\"(\\\\?.)*?\"");

Then match it against the code.

Match m;
for (m = doubleQuotedString.Match(strSomeCodes); m.Success; m = m.NextMatch()) {
    foreach (Group g in m.Groups) {
        // Do some drawing
    }
}


The only thing left is to write the coloring code.

5. Source code





Notes:

[1] The passage "enable us to ... text patterns" is quoted from "Regular Expression Language Elements" in the .NET Framework General Reference.

[2] The brief introduction to regular expressions given here draws on related content from ZDNet China, Technology and Development.

[3] The signatures and comments for the classes and functions that appear in this section are from the Microsoft documentation.

[4] For multiline analysis, see "Regular Expression Language Elements" in the .NET Framework General Reference.


