Oracle Regular Expression practice
Introduction
Oracle 10g introduced support for regular expressions in SQL and PL/SQL with the following functions.
REGEXP_INSTR - Similar to INSTR, except that it uses a regular expression rather than a literal as the search string.
REGEXP_LIKE - Similar to LIKE, except that it uses a regular expression as the search string. REGEXP_LIKE is really an operator (a condition), not a function.
REGEXP_REPLACE - Similar to REPLACE, except that it uses a regular expression as the search string.
REGEXP_SUBSTR - Returns the string matching the regular expression. Not really similar to SUBSTR.
Oracle 11g introduced two new features related to regular expressions.
REGEXP_COUNT - Returns the number of occurrences of the regular expression in the string.
Sub-expression support - Added to all the regular expression functions through an extra parameter on each function that specifies which sub-expression of the pattern match to use.
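As a quick orientation, the functions listed above can be tried against DUAL before any tables are created. This is only a minimal sketch: the sample string 'FALL 2014' and the column aliases are made up for illustration, and REGEXP_COUNT assumes 11g or later.

```sql
-- Hypothetical one-row demo; the literal and the aliases are illustrative only.
SELECT REGEXP_INSTR('FALL 2014', '[0-9]+')           AS first_digit_pos, -- 6
       REGEXP_SUBSTR('FALL 2014', '[0-9]+')          AS digits,          -- 2014
       REGEXP_REPLACE('FALL 2014', '[0-9]+', 'YYYY') AS masked,          -- FALL YYYY
       REGEXP_COUNT('FALL 2014', '[0-9]')            AS digit_count      -- 4 (11g onward)
FROM   dual
WHERE  REGEXP_LIKE('FALL 2014', '[0-9]');  -- condition is true, so the row is returned
```

REGEXP_LIKE appears in the WHERE clause because it is a condition rather than a function that returns a value.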
Learning to write regular expressions takes a little time. If you don't do it regularly, it can be a voyage of discovery each time. The general rules for writing regular expressions are widely documented, and Oracle's regular expression support is described in the Oracle documentation.
Rather than trying to repeat the formal definitions, I'll present a number of problems I've been asked to look at over the years, where a solution using a regular expression has been appropriate.
Example 1: REGEXP_SUBSTR
The data in a column is free text, but may include a 4 digit year.
DROP TABLE t1;
CREATE TABLE t1 (
data VARCHAR2(50)
);
INSERT INTO t1 VALUES ('FALL 2014');
INSERT INTO t1 VALUES ('2014 CODE-B');
INSERT INTO t1 VALUES ('CODE-A 2014 CODE-D');
INSERT INTO t1 VALUES ('ADSHLHSALK');
INSERT INTO t1 VALUES ('FALL 2004');
INSERT INTO t1 VALUES ('FALL 2015');
COMMIT;
SELECT * FROM t1;
DATA
---------------------------------------------------------------------
FALL 2014
2014 CODE-B
CODE-A 2014 CODE-D
ADSHLHSALK
FALL 2004
FALL 2015
6 rows selected.
SQL>
If we needed to return rows containing a specific year we could use the LIKE operator (WHERE data LIKE '%2014%'), but how do we return rows using a comparison (<, <=, >, >=, <>)?
One way to approach this is to pull out the 4 digit year and convert it to a number, so we don't accidentally do an ASCII comparison. That's pretty easy using regular expressions.
We can identify digits using the "\d" or "[0-9]" operators. We want a group of four of them, which is represented by the "{4}" operator, so our regular expression will be "\d{4}" or "[0-9]{4}". The REGEXP_SUBSTR function returns the string matching the regular expression, so it can be used to extract the text of interest. We then just need to convert it to a number and perform our comparison.
SELECT *
FROM t1
WHERE TO_NUMBER(REGEXP_SUBSTR(data, '\d{4}')) >= 2014;
DATA
---------------------------------------------------------------------
FALL 2014
2014 CODE-B
CODE-A 2014 CODE-D
FALL 2015
4 rows selected.
SQL>
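One thing to keep in mind with the query above is that "\d{4}" matches the first four digits it finds, even when they are part of a longer number. The following is a sketch of one possible way to insist on an isolated 4 digit group. It assumes 11g or later for the sub-expression parameter, and the sample string is made up for illustration.

```sql
-- Hypothetical sample string containing a longer number before the year.
SELECT TO_NUMBER(REGEXP_SUBSTR('ID 123456 FALL 2014', '\d{4}')) AS first_four_digits, -- 1234
       -- Sub-expression 2 is the 4 digit group, which must be preceded by the
       -- start of the string or a non-digit, and followed by a non-digit or the end.
       TO_NUMBER(REGEXP_SUBSTR('ID 123456 FALL 2014',
                               '(^|[^0-9])([0-9]{4})([^0-9]|$)',
                               1, 1, NULL, 2)) AS isolated_year                      -- 2014
FROM   dual;
```

Rows with no digits at all are not a problem for the comparison: REGEXP_SUBSTR returns NULL, TO_NUMBER(NULL) is NULL, and the row simply fails the predicate.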
Example 2: REGEXP_SUBSTR
Given a source string, how do we split it up into separate columns, based on changes of case and changes from alpha to numeric, such that this:
ArtADB1234567e9876540
becomes this?
Art ADB 1234567 e 9876540
The source data is set up like this.
DROP TABLE t1;
CREATE TABLE t1 (
data VARCHAR2(50)
);
INSERT INTO t1 VALUES ('ArtADB1234567e9876540');
COMMIT;
The first part of the string is an initcap word, so it starts with a capital letter between "A" and "Z". We identify a single character using the "[]" operator, and ranges are represented using "-", like "A-Z", "a-z" or "0-9". So if we are looking for a single character that is a capital letter, we need to look for "[A-Z]". That needs to be followed by lower case letters, which we now know is "[a-z]", but we need 1 or more of them, which is signified by the "+" operator. So to find an initcap word, we need to search for "[A-Z][a-z]+". Since we want the first occurrence of this, we can use the following.
REGEXP_SUBSTR(data, '[A-Z][a-z]+', 1, 1)
The second part of the string is a group of 1 or more uppercase letters. We know we need to use the "[A-Z]+" pattern, but the first occurrence of that pattern is the capital letter at the start of the initcap word, so we look for the second occurrence.
REGEXP_SUBSTR(data, '[A-Z]+', 1, 2)
The next part is the first occurrence of a group of numbers.
REGEXP_SUBSTR(data, '[0-9]+', 1, 1)
The next part is a group of lower case letters. We don't want to pick up those from the initcap word, so we must look for the second occurrence of lower case letters.
REGEXP_SUBSTR(data, '[a-z]+', 1, 2)
Finally, we have a group of numbers, which is the second occurrence of this pattern.
REGEXP_SUBSTR(data, '[0-9]+', 1, 2)
Putting that all together, we have the following query, which splits the data into separate columns.
COLUMN col1 FORMAT A15
COLUMN col2 FORMAT A15
COLUMN col3 FORMAT A15
COLUMN col4 FORMAT A15
COLUMN col5 FORMAT A15
SELECT REGEXP_SUBSTR(data, '[A-Z][a-z]+', 1, 1) col1,
REGEXP_SUBSTR(data, '[A-Z]+', 1, 2) col2,
REGEXP_SUBSTR(data, '[0-9]+', 1, 1) col3,
REGEXP_SUBSTR(data, '[a-z]+', 1, 2) col4,
REGEXP_SUBSTR(data, '[0-9]+', 1, 2) col5
FROM t1;
COL1            COL2            COL3            COL4            COL5
--------------- --------------- --------------- --------------- ---------------
Art             ADB             1234567         e               9876540
1 row selected.
SQL>
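The occurrence-based approach above depends on knowing which occurrence of each character class belongs to which column. An alternative sketch, assuming 11g or later, describes the whole string with one pattern and picks out each part by sub-expression number. The 'c' match parameter is added here as an assumption, to force case-sensitive matching regardless of the session NLS settings.

```sql
-- One pattern, five sub-expressions; the last argument selects the sub-expression.
-- Runs against the t1 table created for this example.
SELECT REGEXP_SUBSTR(data, '^([A-Z][a-z]+)([A-Z]+)([0-9]+)([a-z]+)([0-9]+)$', 1, 1, 'c', 1) AS col1,
       REGEXP_SUBSTR(data, '^([A-Z][a-z]+)([A-Z]+)([0-9]+)([a-z]+)([0-9]+)$', 1, 1, 'c', 2) AS col2,
       REGEXP_SUBSTR(data, '^([A-Z][a-z]+)([A-Z]+)([0-9]+)([a-z]+)([0-9]+)$', 1, 1, 'c', 3) AS col3,
       REGEXP_SUBSTR(data, '^([A-Z][a-z]+)([A-Z]+)([0-9]+)([a-z]+)([0-9]+)$', 1, 1, 'c', 4) AS col4,
       REGEXP_SUBSTR(data, '^([A-Z][a-z]+)([A-Z]+)([0-9]+)([a-z]+)([0-9]+)$', 1, 1, 'c', 5) AS col5
FROM   t1;
```

Because the pattern is anchored with "^" and "$", a row that does not follow the expected layout returns NULL in every column instead of a partial split.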
Example 3: REGEXP_SUBSTR
We need to pull out a group of characters from a "/" delimited string, where the group is optionally enclosed by double quotes. The data looks like this.
DROP TABLE t1;
CREATE TABLE t1 (
data VARCHAR2(50)
);
INSERT INTO t1 VALUES ('978/955086/GZ120804/10-FEB-12');
INSERT INTO t1 VALUES ('97/95508/BANANA/10-FEB-12');
INSERT INTO t1 VALUES ('97/95508/"APPLE"/10-FEB-12');
COMMIT;
We are looking for 1 or more characters that are not "/", which we do using "[^/]+". The "^" in the brackets represents NOT and "+" means 1 or more. We also want to remove optional double quotes, so we add that as a character we don't want, giving us "[^/"]+". So if we want the data from the third column, we need the third occurrence of this pattern.
SELECT REGEXP_SUBSTR(data, '[^/"]+', 1, 3) AS element3
FROM t1;
ELEMENT3
---------------------------------------------------------------------
GZ120804
BANANA
APPLE
3 rows selected.
SQL>
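The "[^/"]+" pattern counts only non-empty groups, so an empty element shifts every later element up by one. The sketch below shows one commonly used alternative, assuming 11g or later for the sub-expression parameter. The sample string with an empty second element is made up for illustration, and this variant does not strip double quotes.

```sql
-- Hypothetical string with an empty second element.
SELECT REGEXP_SUBSTR('97//BANANA/10-FEB-12', '[^/"]+', 1, 3) AS skips_empty,      -- 10-FEB-12
       -- Each occurrence is "anything, as short as possible, up to a / or the end".
       -- Sub-expression 1 returns the text without the delimiter.
       REGEXP_SUBSTR('97//BANANA/10-FEB-12', '(.*?)(/|$)', 1, 3, NULL, 1) AS keeps_position -- BANANA
FROM   dual;
```

If the quotes need removing as well, the result could be wrapped in REPLACE(..., '"').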
Example 4: REGEXP_REPLACE
We need to take an initcap string and separate the words. The data looks like this.
DROP TABLE t1;
CREATE TABLE t1 (
data VARCHAR2(50)
);
INSERT INTO t1 VALUES ('SocialSecurityNumber');
INSERT INTO t1 VALUES ('HouseNumber');
COMMIT;
We need to find each uppercase character "[A-Z]". We want to keep the character we find, so we make that pattern a sub-expression "([A-Z])", allowing us to refer to it later. For each match, we want to replace it with a space, plus the matching character. The space is pretty obvious, but we need to use "\1" to signify the text matching the first sub-expression, so we replace the matching pattern with a space and itself, " \1". We don't want to put a space in front of the first letter of the string, so we start the search at the second character by passing 2 as the position parameter.
SELECT REGEXP_REPLACE(data, '([A-Z])', ' \1', 2) AS hyphen_text
FROM t1;
HYPHEN_TEXT
--------------------------------------------------------------------
Social Security Number
House Number
2 rows selected.
SQL>
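The position parameter works here because the only capital letter we want to skip is the very first character. A slightly different sketch, offered as an alternative rather than a correction, matches a lower case letter followed by an upper case letter and puts a space between them. The 'c' match parameter is an assumption added to force case-sensitive matching.

```sql
-- Insert a space wherever a lower case letter is followed by an upper case one.
-- Arguments: source, pattern, replacement, position 1, occurrence 0 (all), 'c'.
SELECT REGEXP_REPLACE(data, '([a-z])([A-Z])', '\1 \2', 1, 0, 'c') AS spaced_text
FROM   t1;
```

Against the two rows above this should give the same "Social Security Number" and "House Number" results, without depending on where the search starts.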
Example 5: REGEXP_INSTR
We have a specific pattern of digits (9 99:99:99) and we want to know the location of the pattern in our data.
DROP TABLE t1;
CREATE TABLE t1 (
data VARCHAR2(50)
);
INSERT INTO t1 VALUES ('1 01:01:01');
INSERT INTO t1 VALUES ('.2 02:02:02');
INSERT INTO t1 VALUES ('..3 03:03:03');
COMMIT;
We know we are looking for groups of numbers, so we can use "[0-9]" or "\d". We know the number of digits in each group, which we can indicate using the "{n}" operator, so we simply describe the pattern we are looking for.
SELECT REGEXP_INSTR(data, '[0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}') AS string_loc_1,
REGEXP_INSTR(data, '\d \d{2}:\d{2}:\d{2}') AS string_loc_2
FROM t1;
STRING_LOC_1 STRING_LOC_2
------------ ------------
1 1
2 2
3 3
3 rows selected.
SQL>
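REGEXP_INSTR can also report where the match ends. The sketch below uses the fifth parameter (the return option): 0, the default, returns the position of the first character of the match, and 1 returns the position of the character after the match. The column aliases are made up for illustration.

```sql
-- Runs against the t1 table created for this example.
SELECT data,
       REGEXP_INSTR(data, '[0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}', 1, 1, 0) AS match_start, -- 1, 2, 3
       REGEXP_INSTR(data, '[0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}', 1, 1, 1) AS after_match  -- 11, 12, 13
FROM   t1;
```

The pattern is 10 characters long, so the second column is always 10 more than the first. If the pattern is not found, REGEXP_INSTR returns 0.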
Example 6: REGEXP_LIKE and REGEXP_SUBSTR
We have strings containing parentheses. We want to return the text within the parentheses for those rows that contain parentheses.
DROP TABLE t1;
CREATE TABLE t1 (
data VARCHAR2(50)
);
INSERT INTO t1 VALUES ('This is some text (with parentheses) in it.');
INSERT INTO t1 VALUES ('This text has no parentheses.');
INSERT INTO t1 VALUES ('This text has (parentheses too).');
COMMIT;
The basic pattern for text between parentheses is "\(.*\)". The "\" characters are escapes for the parentheses, making them literals. Without the escapes they would be assumed to define a sub-expression. That pattern alone is fine to identify the rows of interest using the REGEXP_LIKE operator, but it is not appropriate in a REGEXP_SUBSTR, as it would return the parentheses also. To omit the parentheses we need to include a sub-expression inside the literal parentheses, "\((.*)\)". We can then call REGEXP_SUBSTR and ask for the first sub-expression.
COLUMN with_parentheses FORMAT A20
COLUMN without_parentheses FORMAT A20
SELECT data,
REGEXP_SUBSTR(data, '\(.*\)') AS with_parentheses,
REGEXP_SUBSTR(data, '\((.*)\)', 1, 1, 'i', 1) AS without_parentheses
FROM t1
WHERE REGEXP_LIKE(data, '\(.*\)');
DATA                                               WITH_PARENTHESES     WITHOUT_PARENTHESES
-------------------------------------------------- -------------------- --------------------
This is some text (with parentheses) in it.        (with parentheses)   with parentheses
This text has (parentheses too).                   (parentheses too)    parentheses too
2 rows selected.
SQL>
Note: In REGEXP_SUBSTR(data, '\((.*)\)', 1, 1, 'i', 1), the 'i' match parameter makes the match case-insensitive. The final "1" selects the sub-expression whose matched text is returned. Valid values are 0 to 9, where 0 returns the whole match.
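The ".*" in the pattern above is greedy, which is fine for the sample rows because each has only one pair of parentheses. As a sketch of what happens otherwise, and of one way around it, the following uses a made-up string with two pairs and a "[^)]*" sub-expression that cannot run past a closing parenthesis. It assumes 11g or later for the sub-expression parameter.

```sql
-- Hypothetical string with two parenthesised groups.
SELECT REGEXP_SUBSTR('one (a) two (b)', '\((.*)\)',     1, 1, NULL, 1) AS greedy,      -- a) two (b
       REGEXP_SUBSTR('one (a) two (b)', '\(([^)]*)\)',  1, 1, NULL, 1) AS first_only,  -- a
       REGEXP_SUBSTR('one (a) two (b)', '\(([^)]*)\)',  1, 2, NULL, 1) AS second_only  -- b
FROM   dual;
```

The occurrence parameter (the fourth argument) then picks which pair of parentheses to return.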
Example 7: REGEXP_COUNT
We need to know how many times a block of 4 digits appears in text. The data looks like this.
DROP TABLE t1;
CREATE TABLE t1 (
data VARCHAR2(50)
);
INSERT INTO t1 VALUES ('1234');
INSERT INTO t1 VALUES ('1234 1234');
INSERT INTO t1 VALUES ('1234 1234 1234');
COMMIT;
We can identify digits using "\d" or "[0-9]" and the "{4}" operator signifies 4 of them, so using "\d{4}" or "[0-9]{4}" with the REGEXP_COUNT function seems to be a valid option.
SELECT REGEXP_COUNT(data, '[0-9]{4}') AS pattern_count_1,
REGEXP_COUNT(data, '\d{4}') AS pattern_count_2
FROM t1;
PATTERN_COUNT_1 PATTERN_COUNT_2
--------------- ---------------
1 1
2 2
3 3
3 rows selected.
SQL>
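It is worth checking how this count behaves when the digits are not neatly separated by spaces. REGEXP_COUNT counts non-overlapping matches, resuming the search after the end of each match, so longer runs of digits are split into blocks of four. The sample strings below are made up for illustration.

```sql
-- Hypothetical strings with runs of digits longer than four.
SELECT REGEXP_COUNT('12345678', '[0-9]{4}') AS eight_digits, -- 2 (1234 and 5678)
       REGEXP_COUNT('12345',    '[0-9]{4}') AS five_digits   -- 1 (1234, the 5 is left over)
FROM   dual;
```

Whether that is the desired answer depends on what "a block of 4 digits" means for the data in question.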
Example 8: REGEXP_LIKE
We need to identify invalid email addresses. The data looks like this.
DROP TABLE t1;
CREATE TABLE t1 (
data VARCHAR2(50)
);
INSERT INTO t1 VALUES ('me@example.com');
INSERT INTO t1 VALUES ('me@example');
INSERT INTO t1 VALUES ('@example.com');
INSERT INTO t1 VALUES ('me.me@example.com');
INSERT INTO t1 VALUES ('me.me@ example.com');
INSERT INTO t1 VALUES ('me.me@example-example.com');
COMMIT;
The following test returns the rows that do not match an approximate pattern for a valid email address, which gives us a rough list of invalid addresses.
SELECT data
FROM t1
WHERE NOT REGEXP_LIKE(data, '[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', 'i');
DATA
--------------------------------------------------
me@example
@example.com
me.me@ example.com
3 rows selected.
SQL>
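The pattern above is not anchored, so a value passes the test as long as something that looks like an email address appears somewhere inside it. The following sketch shows the difference that the "^" and "$" anchors make. The sample value is made up for illustration, and the anchored pattern is still only an approximation of a valid address.

```sql
-- Hypothetical value with extra text around an otherwise plausible address.
SELECT CASE
         WHEN REGEXP_LIKE('see me@example.com now',
                          '[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', 'i')
         THEN 'match' ELSE 'no match'
       END AS unanchored, -- match
       CASE
         WHEN REGEXP_LIKE('see me@example.com now',
                          '^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$', 'i')
         THEN 'match' ELSE 'no match'
       END AS anchored    -- no match
FROM   dual;
```

With the anchors in place, the whole value has to fit the pattern, so the surrounding text makes it fail.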
-----------------------------
Dylan Presents.