Four applications of regular expressions in Web page processing

Last Update:2017-01-18 Source: Internet

Author: User

Regular expressions provide an efficient and convenient method for string pattern matching. Almost all high-level languages either support regular expressions directly or provide a ready-made library for them. This article uses common processing tasks in the ASP environment as examples to introduce techniques for applying regular expressions.

First, verifying the format of passwords and email addresses

Our first example demonstrates one of the basic functions of a regular expression: the abstract description of arbitrarily complex strings. A regular expression gives the programmer a formal way to describe strings, so that any string pattern the application encounters can be described with very little code. For example, to a non-technical person, the required password format might be described as follows: the first character of the password must be a letter; the password is at least 4 and at most 15 characters long; and the password cannot contain characters other than letters, digits, and underscores.

As programmers, we must convert this natural-language description of the password format into a form that the ASP page can understand and apply to reject invalid passwords. The regular expression describing this password format is: ^[a-zA-Z]\w{3,14}$. In an ASP application, we can write the password check as a reusable function, as follows:

Function TestPassword(strPassword)
    Dim re
    Set re = New RegExp
    re.IgnoreCase = False
    re.Global = False
    re.Pattern = "^[a-zA-Z]\w{3,14}$"
    TestPassword = re.Test(strPassword)
End Function

Here we compare the regular expression with the natural-language description of the password format:
The first character of the password must be a letter: the regular expression is "^[a-zA-Z]", where "^" denotes the beginning of the string and the hyphen tells RegExp to match every character in the specified range.
The password is at least 4 and at most 15 characters long: the regular expression is "{3,14}".
The password cannot contain characters other than letters, digits, and underscores: the regular expression is "\w".

A few notes: {3,14} means that the preceding pattern matches at least 3 but no more than 14 characters (together with the first character, 4 to 15 characters). Note that the syntax inside the curly braces is extremely strict: spaces are not allowed on either side of the comma. Adding a space changes the meaning of the regular expression and causes the password check to give wrong results. Also note the "$" character at the end of the regular expression above. The "$" makes the regular expression match through to the end of the string, ensuring that no other characters follow a valid password.
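The same rule can be tried outside ASP. The following is a minimal Python sketch rather than the article's VBScript; the function name is_valid_password is an illustrative assumption. It uses re.ASCII so that \w means only letters, digits, and the underscore, and fullmatch in place of the "^" and "$" anchors.

```python
import re

# First character a letter, then 3 to 14 word characters (4 to 15 in total).
# re.ASCII restricts \w to [a-zA-Z0-9_], as the password rule requires.
PASSWORD_RE = re.compile(r"[a-zA-Z]\w{3,14}", re.ASCII)


def is_valid_password(password: str) -> bool:
    """Return True if the whole string satisfies the password rule."""
    # fullmatch anchors the pattern at both ends, like ^ and $ together.
    return PASSWORD_RE.fullmatch(password) is not None


if __name__ == "__main__":
    for candidate in ["abcd", "abc", "1abcd", "pass_word1", "pass word"]:
        print(repr(candidate), is_valid_password(candidate))
```

Strings that are too short, too long, start with a non-letter, or contain a character outside the allowed set are all rejected by the single pattern.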

Like the password check, verifying the validity of an email address is a very common problem. A simple email address check using a regular expression can be implemented as follows:

<%
Dim re
Set re = New RegExp
re.Pattern = "^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$"
Response.Write re.Test("aabb@yahoo.com")
%>
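To see what this simple pattern accepts and rejects, here is a small Python sketch of the same expression (again not the article's VBScript; the name looks_like_email is an assumption made for illustration).

```python
import re

# Same structure as the pattern above: word characters, "@", one domain
# label, a dot, and a 2 to 3 letter suffix. re.ASCII keeps \w to [a-zA-Z0-9_].
EMAIL_RE = re.compile(r"\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}", re.ASCII)


def looks_like_email(address: str) -> bool:
    """Return True if the whole string fits the simple email pattern."""
    return EMAIL_RE.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ["aabb@yahoo.com", "aabb@yahoo", "aabb@mail.yahoo.com"]:
        print(candidate, looks_like_email(candidate))
```

The pattern is deliberately simple: it rejects addresses with a dot in the local part, with subdomains, or with a suffix longer than three letters, so it should be treated as a basic format check only.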

Second, extracting specific parts of an HTML page

The main problem with extracting content from HTML pages is that we have to find a way to identify exactly what part of the content we want. For example, the following is a snippet of HTML code that displays a news headline:

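<table>
  <tr><td>other content ...</td></tr>
</table>
<table class="headline">
  <tr><td>Iraq War!</td></tr>
</table>
<table class="Someotherstory">
  <tr><td>other content ...</td></tr>
</table>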

Looking at the code above, it is easy to see that the news headline is displayed by the table in the middle, whose class attribute is set to headline. If the HTML page is very complex, an add-on feature available for Microsoft IE 5.0 and later lets you view the HTML code of only the selected part of the page; see http://www.microsoft.com/Windows/ie/Webaccess/default.asp for more information. For this example, we assume that this is the only table whose class attribute is set to headline. Now we will create a regular expression, use it to find the headline table, and include that table in our own page. The first step is to write the code that sets up regular expression support:

<%
Dim re, strHTML
Set re = New RegExp ' Create the regular expression object
re.IgnoreCase = True
re.Global = False ' Stop searching after the first match
%>

Next we decide what to extract. Here we want the entire <table> structure, including the closing tag and the text of the news headline. The search should therefore start at the <table> start tag: re.Pattern = "<table.*(?=headline)"

This regular expression matches the opening tag of the table and returns everything (except newlines) between the start of the tag and "headline". Here is how to return the matching HTML code:

' Put all matching HTML code into the Matches collection
Set Matches = re.Execute(strHTML)
' Show all matching HTML code
For Each Item In Matches
    Response.Write Item.Value
Next
' Show only the first match
Response.Write Matches.Item(0).Value

Run against the HTML fragment shown earlier, this code returns a single match: the text from the start of the headline table's opening tag up to, but not including, "headline".

The code to get the rest of the table is also fairly simple: re.Pattern = "<table.*(?=headline)(.|\n)*?</table>". Here, the "*" after "(.|\n)" matches zero or more arbitrary characters, and the "?" makes the "*" match minimally, that is, match as few characters as possible before the next part of the expression is found. "</table>" is the closing tag of the table.

The "?" qualifier is important because it prevents the expression from returning code from other tables. For example, for the HTML snippet given earlier, if you delete this "?", the returned content will be:

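<table class="headline">
  <tr><td>Iraq War!</td></tr>
</table>
<table class="Someotherstory">
  <tr><td>other content ...</td></tr>
</table>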

The returned content contains not only the headline table, starting from its <table> tag, but also the Someotherstory table, which shows that the "?" here is essential.
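The difference between the minimal and the greedy quantifier can be checked with a short Python sketch. The sample page, the names LAZY_RE, GREEDY_RE, and extract, and the table contents are assumptions made for illustration; only the class names headline and Someotherstory come from the text above.

```python
import re
from typing import Optional

# Hypothetical sample page: three tables, only the middle one is the headline.
SAMPLE_HTML = """<table>
  <tr><td>other content ...</td></tr>
</table>
<table class="headline">
  <tr><td>Iraq War!</td></tr>
</table>
<table class="Someotherstory">
  <tr><td>other content ...</td></tr>
</table>
"""

# (.|\n)*? is minimal: it stops at the first </table> after "headline".
LAZY_RE = re.compile(r"<table.*(?=headline)(.|\n)*?</table>", re.IGNORECASE)
# Without the "?" the quantifier is greedy and runs to the last </table>.
GREEDY_RE = re.compile(r"<table.*(?=headline)(.|\n)*</table>", re.IGNORECASE)


def extract(pattern, html: str) -> Optional[str]:
    """Return the first match, or None when nothing in the page matches."""
    match = pattern.search(html)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(extract(LAZY_RE, SAMPLE_HTML))
    print("---")
    print(extract(GREEDY_RE, SAMPLE_HTML))
```

Returning None when nothing matches gives the calling page a chance to fall back gracefully if the source HTML changes.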

This example rests on some rather idealized assumptions. In practice the situation is often much more complicated, and writing the ASP code is especially difficult when you have no influence over how the source HTML is written. The most effective approach is to spend more time analyzing the HTML around the content to be extracted, and to test frequently to make sure that the extracted content is what you need.

In addition, pay attention to and handle the case where the regular expression does not match anything in the source HTML page. Content can change very quickly; do not let your own page display a silly error just because someone else has changed the format of their content.

Third, parsing text data files
Data files come in many formats and kinds; XML documents, structured text, and even unstructured text are often used as data sources for ASP. The example we look at here is a structured text file that uses qualifiers. A qualifier, such as a quotation mark, indicates that the part of the string it encloses is indivisible, even if that part contains the delimiter that separates a record into fields. The following is a simple structured text file:

Surname, name, telephone, description
Sun, Goku, 312 555 5656, ASP very good
Pig, eight commandments, 847 555 5656, I'm a movie producer.

This file is very simple: its first line is the header, and the following two lines are comma-delimited records. Parsing it is also very simple: split the file into lines (at the newline characters), and then split each record into fields. However, suppose we add commas to the contents of a field:

Surname, name, telephone, description
Sun, Goku, 312 555 5656, I like ASP, as well as VB and SQL
Pig, eight commandments, 847 555 5656, I'm a movie producer.

A problem occurs when parsing the first record, because to a comma-delimited parser its last field appears to contain two fields. To avoid this kind of problem, fields that contain the delimiter must be enclosed in qualifiers. The single quote is a commonly used qualifier. After adding single-quote qualifiers to the text file above, its contents are as follows:

Surname, name, telephone, description
Sun, Goku, 312 555 5656, 'I like ASP, as well as VB and SQL'
Pig, eight commandments, 847 555 5656, 'I am a movie producer.'

Now we can tell which commas are delimiters and which are field content: any comma inside quotation marks is simply treated as part of the field. What we do next is implement a regular expression parser that decides when to split fields at a comma and when to treat a comma as field content.

The problem here is slightly different from those that most regular expressions deal with. Usually we look at a small part of the text to see whether it matches the regular expression. Here, however, we can reliably determine what is inside the quotation marks only after considering the entire line of text.

Here is an example that illustrates the problem. Take part of a line from a text file: 1, sandy beach, Black, 21, ', Dog, cat, duck, ',. Because there is other data to the left of the "1", parsing this fragment is extremely difficult. We do not know how many single quotes precede the fragment, so we cannot tell which characters are inside quotation marks (text inside quotation marks must not be split during parsing). If the fragment is preceded by an even number of single quotes (or none), then "', Dog, cat, duck, '" is a quoted, indivisible string. If the number of preceding quotes is odd, then "1, sandy beach, Black, 21, " is the end of a quoted string and is indivisible.

Therefore, the regular expression must analyze the entire line of text and count the quotes to determine whether a character is inside or outside quotation marks, that is: ,(?=([^']*'[^']*')*(?![^']*')). The regular expression first finds a comma and then looks ahead to make sure that the number of single quotes following the comma is even or zero. It relies on the following observation: if the number of single quotes after a comma is even, the comma is outside any quoted string. The following table gives a more detailed description:

,	Look for a comma
(?=	Look ahead to match the following pattern:
(	Start a new group
[^']*'	Zero or more non-quote characters, followed by a quotation mark
[^']*'	Zero or more non-quote characters, followed by a quotation mark. Combined with the preceding part, this matches a pair of quotes
)*	End the group and match the whole group (a pair of quotes) zero or more times
(?!	Negative lookahead: the following pattern must not match
[^']*'	Zero or more non-quote characters, followed by a quotation mark
)	End the negative lookahead
)	End the lookahead
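The rule behind the expression, that a comma is a delimiter exactly when an even number of single quotes follows it, can be checked directly. The Python sketch below compares the regular expression with plain quote counting; both function names are assumptions for illustration, and a non-capturing group (?:...) is used in place of the plain group.

```python
import re
from typing import List

# The lookahead from the table above, with a non-capturing group.
DELIMITER_RE = re.compile(r",(?=(?:[^']*'[^']*')*(?![^']*'))")


def delimiters_by_regex(line: str) -> List[int]:
    """Positions of the commas that the regular expression treats as delimiters."""
    return [match.start() for match in DELIMITER_RE.finditer(line)]


def delimiters_by_counting(line: str) -> List[int]:
    """Positions of the commas followed by an even number of single quotes."""
    return [
        index
        for index, char in enumerate(line)
        if char == "," and line.count("'", index + 1) % 2 == 0
    ]


if __name__ == "__main__":
    sample = "a, 'b, c', d"
    print(delimiters_by_regex(sample))
    print(delimiters_by_counting(sample))
```

For the sample line both functions report the commas at positions 1 and 9, and skip the comma at position 5 because it lies inside the quoted field.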

The following VBScript function takes a string argument and returns an array of the fields obtained by splitting the string on the comma delimiter while respecting the single-quote qualifier:

Function SplitAdv(strInput)
    Dim objRE
    Set objRE = New RegExp
    ' Set up the RegExp object
    objRE.IgnoreCase = True
    objRE.Global = True
    objRE.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"
    ' The Replace method replaces each delimiter comma with Chr(8), the
    ' backspace (\b) character, which is extremely unlikely to appear in the string.
    ' Then we split the string into an array on that character.
    SplitAdv = Split(objRE.Replace(strInput, Chr(8)), Chr(8))
End Function
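For comparison, here is a minimal Python sketch of the same idea. It is not a translation of the function above: Python's re.split can split on the pattern directly, so no placeholder character is needed. The name split_adv and the whitespace stripping are assumptions for illustration.

```python
import re
from typing import List

# A non-capturing group (?:...) is used because re.split would otherwise
# insert the text captured by a plain group into the result list.
FIELD_DELIMITER_RE = re.compile(r",(?=(?:[^']*'[^']*')*(?![^']*'))")


def split_adv(line: str) -> List[str]:
    """Split a record on commas that are outside single-quoted fields."""
    # Surrounding whitespace is stripped; the quote qualifiers are kept.
    return [field.strip() for field in FIELD_DELIMITER_RE.split(line)]


if __name__ == "__main__":
    record = "Sun, Goku, 312 555 5656, 'I like ASP, as well as VB and SQL'"
    for field in split_adv(record):
        print(field)
```

Because the expression only counts quotes, an apostrophe inside a field (as in "I'm") makes the count odd and prevents the line from being split, so the data must use the qualifier character only as a qualifier.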

In short, parsing text data files with regular expressions is efficient and keeps development time short, and it can save a great deal of the time otherwise spent analyzing files and extracting useful data according to complex conditions. Even in a rapidly evolving environment there will still be a lot of legacy data to work with, and knowing how to construct efficient data-parsing routines will be a valuable skill.

Fourth, string substitution

In the last example, we look at the replacement feature of VBScript regular expressions. ASP is often used to dynamically format text obtained from a variety of data sources. Using the power of VBScript regular expressions, ASP can dynamically modify text that matches complex patterns. Highlighting certain words by adding HTML tags is a common application, for example highlighting the search keywords in search results.
To illustrate the approach, let's look at an example that highlights every ".NET" in a string. The string can come from anywhere, such as a database or another Web site.

<%
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
' Regular expression pattern:
' look for any word or URL that ends with .NET.
regEx.Pattern = "(\b[a-zA-Z\._]+?\.net\b)"
' String used to test the replacement
strText = "Microsoft has established a new website www.ASP.NET."
' Call the Replace method of the regular expression;
' $1 inserts the matched text at that position
Response.Write regEx.Replace(strText, _
    "<b>$1</b>")
%>

There are several important points to note in this example. The entire regular expression is enclosed in a pair of parentheses, which captures all of the matched text for later use; the captured text is referenced in the replacement text by $1. Up to nine such captures can be used at a time, referenced by $1 through $9. The Replace method of the regular expression object differs from VBScript's own Replace function: it takes only two arguments, the text to be searched and the replacement text.
In this example, to highlight the ".NET" strings that are found, we enclose them in bold tags, to which other style attributes can be added. With this search-and-replace technique, we can easily add keyword highlighting to a site search program, or automatically turn keywords that appear on a page into links to other pages.
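The capture-and-replace step can be sketched in Python as follows. This is an illustrative equivalent rather than the article's code: the function name highlight_dot_net and the plain <b> tag are assumptions, and Python writes the back-reference as \1 where VBScript uses $1.

```python
import re

# Any run of letters, dots, or underscores that ends with ".net",
# matched case-insensitively and captured as group 1.
NET_RE = re.compile(r"(\b[a-zA-Z._]+?\.net\b)", re.IGNORECASE)


def highlight_dot_net(text: str) -> str:
    """Wrap every word or URL ending in .NET in bold tags."""
    # \1 inserts the captured text, as $1 does in the VBScript version.
    return NET_RE.sub(r"<b>\1</b>", text)


if __name__ == "__main__":
    print(highlight_dot_net("Microsoft has established a new website www.ASP.NET."))
```

Text that contains no match is returned unchanged, so the function can be applied to any string before it is written to the page.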

Conclusion

We hope that the regular expression techniques introduced in this article give you some ideas about when and how to apply regular expressions. Although the examples in this article are written in VBScript, regular expressions are also useful in ASP.NET, where they are one of the primary mechanisms for form validation in server-side controls and are exposed to the entire .NET Framework through the System.Text.RegularExpressions namespace.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
