Merlin's Magic: parsing a sequence of characters with a new regular expression library


Parsing text strings for patterns

Regular expressions are a way of matching text against patterns, similar to how a compiler generates class files. The compiler looks for various patterns in the source code in order to convert source expressions to bytecode. By recognizing these source code patterns, the compiler converts only valid source code into compiled class files.

What is a pattern?

In the context of a regular expression, a pattern is a textual representation of a sequence of characters. For example, if you want to know whether the word car exists in a sequence of characters, you use the pattern car, because that pattern represents the string exactly. For more complex patterns, you can use special characters as placeholders. If, instead of searching for car, you want to search for any text string that starts with the letter C and ends with the letter r, you use the pattern C.*r, where the period stands for any character and the * lets it repeat any number of times. The pattern C.*r matches any string that starts with C and ends with r, such as Cougar, Cavalier, or Chrysler.
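The two searches described above can be sketched with java.util.regex as follows. This is a minimal illustration, assuming the wildcard idea is written in regex syntax as C.*r (a period for any character, repeated by *); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class PatternBasics {
    // Literal pattern: every character matches itself.
    private static final Pattern CAR = Pattern.compile("car");

    // "C", then any run of characters ('.' repeated by '*'), then "r".
    private static final Pattern C_TO_R = Pattern.compile("C.*r");

    /** True if the literal text "car" occurs anywhere in the input. */
    static boolean containsCar(CharSequence input) {
        return CAR.matcher(input).find();
    }

    /** True if the whole input starts with C and ends with r. */
    static boolean startsWithCEndsWithR(CharSequence input) {
        return C_TO_R.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsCar("a used car lot"));
        for (String name : new String[] {"Cougar", "Cavalier", "Chrysler", "Camry"}) {
            System.out.println(name + " -> " + startsWithCEndsWithR(name));
        }
    }
}
```

Note the difference between find(), which looks for the pattern anywhere in the input, and matches(), which requires the entire input to fit the pattern.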

How to specify a pattern expression

The heart of pattern matching is deciding which expression to use. The Pattern class holds the expression you want to use and hands it to the Matcher class, which checks for matches within a character sequence. For example, to validate an e-mail address, you might check whether the user's input matches a pattern consisting of an alphanumeric sequence, followed by an @ sign, followed by two groups of characters separated by a period. This can be written as the expression \p{Alnum}+@\w+\.\p{Alpha}{2,3}. (Yes, this oversimplifies the structure of e-mail addresses and excludes some valid ones, but it is sufficient as an example.)
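As a rough sketch of the Pattern-to-Matcher flow described above, the e-mail check might look like this in Java. The class and method names are hypothetical. Two details are worth noting: java.util.regex spells the POSIX classes with a capital letter (\p{Alnum}, \p{Alpha}), and every backslash must be doubled inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailCheck {
    // Simplified e-mail pattern from the text; backslashes are doubled
    // because this is a Java string literal.
    private static final Pattern EMAIL =
            Pattern.compile("\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}");

    /** True if the entire input fits the simplified e-mail pattern. */
    static boolean looksLikeEmail(CharSequence input) {
        Matcher matcher = EMAIL.matcher(input);
        return matcher.matches(); // the whole input must match
    }

    public static void main(String[] args) {
        String[] samples = {
            "user@example.com",
            "user@example",
            "first.last@example.com",
            "user@example.info"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + looksLikeEmail(sample));
        }
    }
}
```

The last two samples are legitimate address shapes that this pattern rejects (a period before the @, and a four-letter final group), which illustrates the simplification the text admits to.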

Before discussing the specifics of the pattern language, let's take a closer look at \p{Alnum}+@\w+\.\p{Alpha}{2,3}. The \p{Alnum} sequence represents a single alphanumeric character (A to Z, a to z, or 0 to 9). The plus sign (+) after \p{Alnum} is called a quantifier. It applies to the preceding part of the expression and indicates that \p{Alnum} must appear one or more times. Use an asterisk (*) instead to allow zero or more occurrences. The @ is a literal: it must appear after at least one alphanumeric character for the whole pattern to match. \w+ is similar to \p{Alnum}+, but it also accepts the underscore (_). (Some sequences can be written in more than one way.) The backslash-period (\.) represents a literal period; without the backslash, a single period matches any character. The final \p{Alpha}{2,3} represents two or three alphabetic characters.
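The difference between an escaped and an unescaped period is easy to get wrong, so here is a small sketch of it. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class PeriodEscape {
    // Unescaped period: matches any single character between 'a' and 'c'.
    private static final Pattern ANY_CHAR = Pattern.compile("a.c");

    // Escaped period: matches only a literal '.' between 'a' and 'c'.
    private static final Pattern LITERAL_DOT = Pattern.compile("a\\.c");

    static boolean matchesAnyChar(CharSequence input) {
        return ANY_CHAR.matcher(input).matches();
    }

    static boolean matchesLiteralDot(CharSequence input) {
        return LITERAL_DOT.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String sample : new String[] {"a.c", "abc"}) {
            System.out.println(sample
                    + " any-char=" + matchesAnyChar(sample)
                    + " literal-dot=" + matchesLiteralDot(sample));
        }
    }
}
```

Both patterns accept "a.c", but only the unescaped version accepts "abc".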

Once you learn its syntax, the pattern language holds few secrets. Let's look at some of the more common types of expressions:

Literal: Any character in an expression that does not have a special meaning is treated as a literal and matches itself.

Quantifier: A character or expression that specifies how many times a literal or group may appear in a character sequence for the sequence to match the expression. A group is a set of characters enclosed in parentheses.

?--once or not at all

*--zero or more times

+--one or more times
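The three quantifiers, applied both to a single literal and to a parenthesized group, can be tried out with a short sketch like the one below. The class name, helper method, and sample strings are illustrative choices, not taken from the original article.

```java
import java.util.regex.Pattern;

class Quantifiers {
    /** True if the entire input matches the given expression. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // ? : the 'u' may appear once or not at all.
        System.out.println(matches("colou?r", "color"));
        System.out.println(matches("colou?r", "colour"));

        // * : the 'b' may appear zero or more times.
        System.out.println(matches("ab*c", "ac"));
        System.out.println(matches("ab*c", "abbbc"));

        // + : the 'b' must appear at least once.
        System.out.println(matches("ab+c", "ac"));
        System.out.println(matches("ab+c", "abc"));

        // A quantifier after parentheses applies to the whole group.
        System.out.println(matches("(ab)+", "ababab"));
        System.out.println(matches("(ab)+", "aba"));
    }
}
```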

Character class: A character class is a set of characters within square brackets, where a match can be any one of the characters inside the brackets. You can combine character classes with quantifiers; for example, [acegikmoqsuwy]* matches any sequence of characters made up only of the odd-numbered letters of the alphabet. Some character classes are predefined:

\d--digit (0 to 9)

\D--non-digit

\s--whitespace character, such as a tab or line break

\S--non-whitespace character

\w--word character (A to Z, a to z, 0 to 9, and underscore)

\W--non-word character (any other character)
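To see the bracketed class and the predefined classes in action, a small sketch along these lines can help. The class name, helper method, and sample inputs are hypothetical; as before, each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class CharacterClasses {
    /** True if the entire input matches the given expression. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Custom class: only the odd-numbered letters a, c, e, g, ...
        System.out.println(matches("[acegikmoqsuwy]*", "wise"));
        System.out.println(matches("[acegikmoqsuwy]*", "cab"));

        // Predefined classes and their uppercase negations.
        System.out.println(matches("\\d+", "2024"));      // digits only
        System.out.println(matches("\\D+", "a1"));        // contains a digit
        System.out.println(matches("\\s", "\t"));         // a tab is whitespace
        System.out.println(matches("\\S+", "has space")); // contains whitespace
        System.out.println(matches("\\w+", "snake_case1"));
        System.out.println(matches("\\W", "-"));          // hyphen is not a word character
    }
}
```

In each pair, the uppercase form matches exactly the characters that the lowercase form rejects.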
