Use regular expressions to write better SQL

Source: Internet
Author: User
Tags add end expression functions sql string version oracle database

The regular expression attribute in Oracle Database 10g is a powerful tool for working with text data

A new feature of Oracle database 10g greatly improves your ability to search for and process character data. This feature is a formal expression, a representation of a text pattern. For a long time it has appeared in many programming languages and a number of UNIX utilities.

Oracle's formal expression is implemented in the form of various SQL functions and a WHERE clause operator. If you are unfamiliar with regular expressions, this article will give you an idea of this new, incredibly powerful, yet seemingly mysterious feature. Readers who are already familiar with regular expressions can learn how to apply this functionality in an Oracle SQL language environment.
What is a regular expression?
A regular expression consists of one or more character literal and/or meta characters. In the simplest form, regular expressions consist only of character words, such as regular expression cat. It was read as the letter C, followed by the letters A and T, which matched strings such as cat, location and catalog. Metacharacters provides an algorithm to determine how Oracle handles characters that make up a regular expression. When you understand the meaning of various meta characters, you will realize that regular expressions are very powerful for finding and replacing specific text data.

Validating data, recognizing the presence of duplicate keywords, detecting unnecessary spaces, or parsing strings are just a subset of the many applications of regular expressions. You can use them to verify the format of phone numbers, postal codes, e-mail addresses, social Security numbers, IP addresses, file names, and path names. In addition, you can look for patterns such as HTML tags, numbers, dates, or anything that conforms to any pattern in any text data, replacing them with other patterns.
Use regular expressions with Oracle Database 10g
You can use the latest introduced Oracle SQL regexp_like operators and Regexp_instr, REGEXP_SUBSTR, and Regexp_replace functions to play the role of regular expressions. You will appreciate how this new feature complements the LIKE operator and the INSTR, SUBSTR, and REPLACE functions. In fact, they are similar to the existing operators, but now add a powerful pattern-matching function. The data being searched can be either a simple string or a large amount of text stored in the database character columns. Regular expressions allow you to search, replace, and validate data in a way that you have never thought of before, and provide a high degree of flexibility.
Basic examples of regular expressions
Before using this new feature, you need to understand the meaning of some meta characters. a period (.) matches any character in a regular expression (except for line breaks). For example, a regular expression a.b matches a string that first contains the letter A, followed by any of the other individual characters (except for line breaks), followed by the letter B. String Axb, Xaybx, and ABBA are all matched, because this pattern is hidden in the string. If you want to accurately match a three-letter string that starts with a and ends with B, you must position the regular expression. The caret (^) metacharacters indicate the beginning of a line, while the dollar sign ($) indicates the end of a line (see table 1). Therefore, the regular expression ^a.b$ matches the string aab, ABB, or AXB. This is compared to the similar pattern matching a_b provided by the LIKE operator, where (_) is the word wildcard wildcard.

By default, a single character or list of characters in a regular expression matches only once. To indicate a character that appears multiple times in a regular expression, you can use a quantifier, which is also called a repeating operator. If you want to get a matching pattern that starts with the letter A and ends with the letter B, your regular expression looks like this: ^a.*b$. * Metacharacters Repeat the preceding metacharacters (.) to indicate a match 0 times, one time, or more. The equivalent mode of the LIKE operator is A%b, which uses the percent sign (%) to indicate that any character appears 0 times, one time, or more than once.

Table 2 shows a complete list of repeat operators. Note that it contains special duplicate options that enable greater flexibility than the existing like wildcard characters. If you enclose an expression in parentheses, this effectively creates a subexpression that can repeat a certain number of times. For example, regular expression B (an) *a matches BA, Bana, Banana, yourbananasplit, etc.

Oracle's regular expression implementations support POSIX (Portable operating system Interface) character classes, as listed in table 3. This means that the type of character you are looking for can be very special. Suppose you want to write a like condition that finds only non-alphanumeric characters-the WHERE clause as a result might inadvertently become very complex.

The POSIX character class must be contained in a list of characters indicated by square brackets ([]). For example, the regular expression [[: Lower:]] matches a lowercase alphabetic character, while [[: Lower:]]{5} matches five consecutive lowercase alphabetic characters.

In addition to the POSIX character class, you can place individual characters in a list of characters. For example, regular expressions ^ab[cd]ef$ match strings ABCEF and Abdef. C or D must be selected.

Most metacharacters in a list of characters are considered text, except for the caret (^) and hyphen (-) characters. Regular expressions can look complex because some metacharacters have multiple meanings that vary with context. ^ is such a meta character. If you use it as the first character of a word characters, it represents the characters of a word. Therefore, [^[:d igit:]] looks for a pattern that contains any non-numeric characters, and ^[[:d Igit:] To find a matching pattern that starts with a number. A hyphen (-) indicates a range in which the regular expression [a-m] matches any letter between the letter A and the letter M. However, if it is the first character in a character line (as in [-AFG]), it represents a hyphen.

An example earlier describes the use of parentheses to create a subexpression that allows you to enter alternate options by entering replacement metacharacters, which are separated by vertical bars (|).

For example, regular expression T (a|e|i) n allows three possible characters to be replaced between letters T and N. Matching patterns include words such as Tan, ten, tin, and Pakistan, but do not include teen, mountain, or tune. As an alternative, regular expression T (a|e|i) n can also be represented as a list of characters t[aei]n. The meta characters are summarized in table 4. Although there are more meta characters, this concise overview is sufficient to understand the formal expressions used in this article.
Regexp_like operator
The regexp_like operator introduces you to regular expression functionality when used in an Oracle database. Table 5 lists the syntax for Regexp_like.

The following SQL query's WHERE clause shows the regexp_like operator, which searches the ZIP column for a pattern that satisfies the regular expression [^[:d igit:]. It retrieves those ZIP column values in the ZIPCODE table that contain any non-numeric character rows.

SELECT zip from ZipCode WHERE regexp_like (Zip, ' [^[:d igit:]] ') zip-----ab123 123xy 007ab abcxy

The example of this formal expression is composed only of metacharacters, and more specifically POSIX character class digit separated by colons and brackets. The second set of square brackets (as shown in [^[:d Igit:]) includes a list of character classes. As mentioned earlier, you need to do this because you can only use the POSIX character classes to build a list of characters.
Regexp_instr function
This function returns the starting position of a pattern, so its function is very similar to the INSTR function. The syntax for the new REGEXP_INSTR function is given in table 6. The main difference between the two functions is that regexp_instr lets you specify a pattern rather than a specific search string; So it provides more functionality. The following example uses REGEXP_INSTR to return a string to the starting position of the five-bit ZIP code mode in CA 91234, Joe Smith, 10045 Berry Lane, San Joseph. If the regular expression is written as [[:d igit:]]{5}, you will get the starting position of the number instead of the ZIP code, since 10045 is the first time that five consecutive digits appear. Therefore, you must position the expression at the end of the line, as shown in the $ metacharacters, which displays the starting position of the postal code, regardless of the number of numbers.

SELECT regexp_instr (' Joe Smith, 10045 Berry Lane, San Joseph, CA 91234 ', ' [[:d igit:]]{5}$ ') as Rx_instr from dual Rx_instr ----------45
Writing more complex patterns
Let's expand on the ZIP code pattern in the previous example to include an optional four-bit digital pattern. Your pattern may now look like this: [[:d igit:]]{5} (-[[:d igit:]]{4})? $. If your source string ends with a 5-bit postal code or a 5-bit + 4-bit postal code, you will be able to display the starting position of the pattern.

SELECT regexp_instr (' Joe Smith, 10045 Berry Lane, San Joseph, CA 91234-1234 ', ' [[:d igit:]]{5} (-[[:d igit:]]{4})? $ ') as STA Rts_at from dual Starts_at----------44

In this example, the subexpression in parentheses (-[[:d igit:]]{4}) is pressed? The repeat operator's instructions are repeated 0 times or once. In addition, attempting to achieve the same results with traditional SQL functions is even a challenge for SQL experts. To better illustrate the different components of this regular expression example, table 7 contains a description of a single literal and metacharacters.
Regexp_substr function
The REGEXP_SUBSTR function, similar to the SUBSTR function, is used to extract part of a string. Table 8 shows the syntax for this new function. In the following example, the matching pattern [^,]* string will be returned. The regular expression searches for a comma followed by a space, and then, as instructed by [^,]*, searches for 0 or more characters that are not commas, and finally finds another comma. This pattern looks a bit like a comma-delimited string of values.

SELECT regexp_substr (' field, second field, third field ', ', [^,]*, ') from dual Regexp_substr (' FIR----------------- -, second field,
Regexp_replace function
Let's first look at the traditional replace SQL function, which replaces a string with another string. Suppose your data has unnecessary spaces in the body, and you want to replace them with a single space. With the Replace function, you need to accurately list how many spaces you want to replace. However, the number of extra spaces may not be the same throughout the body. The following example has three spaces between Joe and Smith. The parameters of the Replace function specify that you want to replace two spaces with a single space. In this case, the result leaves an extra space between Joe and Smith in the original string.

SELECT REPLACE (' Joe Smith ', ', ') as replace from dual replace---------Joe Smith

The Regexp_replace function advances the substitution function a step further, and its syntax is listed in Table 9. The following query replaces any two or more spaces with a single space. () A subexpression contains a single space that can be repeated two or more times as instructed by {2.}

SELECT Regexp_replace (' Joe Smith ', ' () {2,} ', ') as Rx_replace from dual Rx_replace----------Joe Smith
Back reference
A useful feature of regular expressions is the ability to store the subexpression for later reuse, which is also referred to as a back reference (outlined in table 10). It allows for complex substitution functions, such as swapping patterns in new locations or displaying repeated occurrences of words or letters. The matching part of the subexpression is saved in the temporary buffer. The buffer is numbered from left to right and accessed using the \digit symbol, where digit is a number between 1 and 9 that matches the digit subexpression, which is displayed with a set of parentheses.

The following example shows the conversion of the name Ellen Hildi Smith to Smith, Ellen Hildi by reference to each subexpression by number.

SELECT regexp_replace (' Ellen hildi Smith ', ' (. *) (. *) (. *) ', ' \3, \1 \2 ') from dual Regexp_replace (' EL------------------ Smith, Ellen Hildi

The SQL statement shows the three separate subexpression enclosed in parentheses. Each individual subexpression contains a matching meta character (.) and follows the * metacharacters, indicating that any character (other than a newline character) must match 0 or more times. A space separates each subexpression, and a space must also match. Parentheses create a subexpression that gets the value, and can be referenced with \digit. The first subexpression is assigned to \1, the second \2, and so on. These back references are used in the last parameter of the function (\3, \1 \2), which effectively returns the replacement substring and arranges them in the desired format (including commas and spaces). Table 11 details the various components of the regular expression.

A back reference is useful for replacing, formatting, and substituting values, and you can use them to find adjacent values. The following example shows the use of the REGEP_SUBSTR function to find any repeated alphanumeric values that are separated by spaces. The displayed results give a substring that identifies the repeated occurrence of the word is.

SELECT Regexp_substr (' The final test is the implementation ', ' ([[: alnum:]]+) ([[: space:]]+) \1 '] as SUBSTR from dual SUB STR------IS is
Matching parameter options
You may have noticed that the regular expression operators and functions contain an optional matching parameter. This parameter controls whether case sensitive, line break matching, and retention of multiple lines of input.
The practical application of regular expressions
Not only can you use regular expressions in queues, you can also use regular expressions anywhere you use SQL operators or functions, such as in Pl/sql languages. You can write triggers that take advantage of regular expression functionality to validate, generate, or extract values.

The next example shows how you can apply the Regexp_like operator to data validation in a column check constraint. It verifies the correct social Security number format when inserting or updating. Social Security numbers in formats such as 123-45-6789 and 123456789 for this column constraint is an acceptable value. Valid data must start with three digits, followed by a hyphen, plus two digits and a hyphen, and finally four digits. The other expression allows only 9 consecutive digits. The vertical bar symbol (|) separates the options.

ALTER TABLE students ADD CONSTRAINT stud_ssn_ck CHECK (regexp_like ssn, ' ^ ([[:d igit:]]{3}-[[:d igit:]]{2}-[[:d igit:]]{4 }| [[:d Igit:]] {9}) $'))

The characters at the beginning or end of the ^ and $ indicate are not acceptable. Make sure that your regular expression is not divided into multiple lines or contains any unnecessary spaces unless you want the format to be so and match accordingly. Table 12 illustrates the various components of the regular expression sample.
Compare regular expressions to existing functionality
Regular expressions have several advantages over common like operators and INSTR, SUBSTR, and REPLACE functions. These traditional SQL functions are not convenient for pattern matching. Only the like operator matches by using the% and _ characters but like does not support the repetition of expressions, complex substitutions, character ranges, word characters and POSIX character classes, and so on. In addition, the new regular expression function allows you to detect repeated occurrences of word and pattern swapping. The examples here give you an overview of the regular expression areas and how you can use them in your applications.
To actually enrich your toolkit
Because regular expressions help solve complex problems, they are very powerful. Some of the features of regular expressions are difficult to emulate with traditional SQL functions. When you understand the underlying building blocks of this slightly cryptic language, regular expressions will become an integral part of your toolkit (not only in SQL but also in other programming locales). In order for your patterns to be correct, while attempts and errors are sometimes necessary, the simplicity and strength of regular expressions are unquestionable.

Alice Rischert (ar280@yahoo.com) is the chairman of the database application development and design direction of the computer Technology and application Department of Columbia University. She wrote the Oracle SQL Interactive manual version 2nd (Prentice hall,2002) and the upcoming Oracle SQL sample (Prentice hall,2003). Rischert has over 15 years of experience as a database designer, DBA, and project director in Fortune 100 companies, and she has been using Oracle products since Oracle version 5.

Table 1: Locating meta characters
Metacharacters description ^ Position an expression to the beginning of a line $ to position an expression at the end of a line


Table 2: quantifier or repeat operator quantifier description * Match 0 or more times? matches 0 or 1 times + matches 1 or more times {m} exactly matches m times {m,} matches at least m times {m, n} matches at least m times but not more than n times



Table 3: Predefined POSIX character class character class descriptions [: Alpha:] alphabetic characters [: lower:] lowercase characters [: Upper:] uppercase characters [:d igit:] numbers [: alnum:] alphanumeric characters [: space:] whitespace characters (prohibit printing), such as carriage returns, line breaks, vertical tabs, and page breaks [:p UNCT:] punctuation characters [: Cntrl:] control characters (prohibit printing) [:p rint:] printable characters



Table 4: Substitution matching and grouping meta character descriptions for expressions | Replacing the separator replacement option, a group operator () is typically used () to group a subexpression into a substitution unit, a quantifier unit, or a back reference unit (see the "Back Reference" section) [char] characters a list of characters ; Most of the metacharacters in a list of characters (except character classes, ^ and-metacharacters) are understood as text



Table 5:regexp_like operator Syntax Description regexp_like (source_string, pattern[, Match_parameter]) source_string supports character data types (char, VARCHAR2, CLOB, NCHAR, NVARCHAR2 and NCLOB, but not LONG). The pattern parameter is another name for the regular expression. Match_parameter allows optional parameters such as handling line breaks, retaining multiline formatting, and providing control over case sensitivity.



Table 6:regexp_instr function Syntax Description regexp_instr (source_string, pattern[, start_position[, occurrence[, return_option[, Match_ Parameter]]] The function finds pattern and returns the first position of the mode. You can optionally specify the start_position you want to start your search. The occurrence parameter defaults to 1 unless you specify that you want to find a pattern that appears next. The default value of Return_option is 0, which returns the starting position of the pattern, and a value of 1 returns the starting position of the next character that matches the criteria.



Table 7: A description of the 5 digit plus 4-bit postal code expression syntax description must match the blank [:d Igit:]posix number class] Word characters the end of {5} word characters exactly 5 times (beginning of subexpression-a literal hyphen, because it is not a character list within the range metacharacters [character columns The beginning of the table [:d igit:]posix [:d igit:] class [Beginning of Word characters] characters end of {4} word characters exactly 4 times) end parenthesis, ending subexpression?? Quantifiers match the grouped subexpression 0 or 1 times, making 4-bit code optional $ positional metacharacters, indicating end-of-line



Table 8:regexp_substr function Syntax Description regexp_substr (source_string, pattern[, Position [, occurrence[, Match_parameter]]) REGEXP_ SUBSTR function returns a substring of the matching pattern.



Table 9:regexp_replace function Syntax Description regexp_replace (source_string, pattern[, replace_string [, Position[,occurrence, [Match_ Parameter]]] This function replaces the matching pattern with a specified replace_string, allowing for complex "search and replace" operations.



Table 10: metacharacters character description \digit backslash followed by a number between 1 and 9, and the backslash matches the first digit subexpression enclosed in parentheses. (Note: A backslash has another meaning in a regular expression that, depending on the context, may also represent an Escape character.)



Table 11: Description of the pattern Exchange regular expression The regular expression item description (the beginning of the first subexpression). matches any single character * Repeat operator other than a newline character, matching the end of the first subexpression of the. metacharacters 0 to N times before, and the result is obtained in \1 (in this case the result is Ellen. Must exist (the beginning of the second subexpression). matches any single character * Repeat operator other than a newline character, matching the end of the second subexpression of the preceding. metacharacters 0 to n times; The result is obtained in \2 (in this case, the result is hildi.) Blank (the beginning of the third subexpression). matches any single character * Repeat operator other than a newline character, matching the end of the third subexpression of the preceding. metacharacters 0 to N times); The result is obtained in \3 (in this case, the result is Smith.) )



Table 12: Social Security numbers Description of regular expressions Regular expression item description ^ line start character (regular expression cannot have any leading characters before matching). (Starting the subexpression and listing the replaceable options that are separated by the | metacharacters [:d igit:]posix number class] character characters the beginning of the character characters. The end {3} character characters exactly 3 times-hyphen [the beginning of the characters [:d Igit:]posix number class] Word characters end { 2} Word characters just repeat 2 times-another hyphen [the beginning of the characters [:d Igit:]posix number class] characters the end of {4} word characters exactly 4 times | Replace the metacharacters; ends the first option and begins the next substitution expression [the beginning of the character characters [:d Igit: ]posix Number class] The end of the characters {9} character characters exactly 9 times) ends the parentheses, ends the subexpression for substitution $ positional metacharacters, indicating the end of the line; no extra characters to fit the pattern


Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.