Regular Expressions Overview
- What are Regular Expressions?
- Regular expressions (regex for short) are special text strings used for searching text. They can describe not only literal text but, more importantly, text patterns, and they are an indispensable tool in advanced text processing. A regex can be used to find text that conforms to a specific set of rules, that is, text that follows a pattern. Common text patterns found in everyday documents include social security and account numbers, emails, phone and credit card numbers, street addresses, dates, SKU codes and web addresses. An ordinary text search can find only fixed text strings, for example a specific phone number or email address. Regular expressions, by contrast, can find any valid phone number, only phone numbers from specific area codes, or phone numbers that start or end with specific digits. This is where the power of regex comes from.
- Many text processing applications use regular expressions for text search. It is a well-established language with a huge number of examples and tutorials freely available on the Internet.
- Introduction
- A regular expression is a pattern describing a certain amount of text. It is matched against a subject string from left to right. Most characters stand for themselves, and match the corresponding characters in the subject text. The simplest form of regular expression is actual literal text. The power of regular expressions comes from the ability to include alternatives, character classes and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in a different way.
- Meta Characters
- All alphabetic characters and digits match themselves literally in regular expressions. A number of punctuation characters have special meaning:
- ^ $ . * + ? = | \ / ( ) [ ] { }
- Some of these characters have special meaning only within certain contexts of a regular expression and are treated literally in other contexts. As a general rule, if you want to include any of these punctuation characters literally in a regular expression, you must precede it with a backslash (\). The most common mistake is using a period without a backslash to match a literal period character. A period without a backslash matches any character except a newline.
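- The period mistake described above can be illustrated with a short sketch. It assumes Python's standard re module; the sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing "3.14" and a look-alike "3514".
text = "Version 3.14 and build 3514"

# Unescaped period: matches any character except a newline,
# so both "3.14" and "3514" are found.
loose = re.findall(r"3.14", text)

# Escaped period: matches only a literal "." character.
strict = re.findall(r"3\.14", text)

# re.escape() inserts the backslashes automatically, which helps when
# the search text comes from a user and must be treated literally.
escaped = re.escape("3.14")

print(loose)    # ['3.14', '3514']
print(strict)   # ['3.14']
print(escaped)  # 3\.14
```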
- Repetitions
- The characters that specify repetition always follow the pattern to which they are applied. By using repetitions it is possible to match a specific number of occurrences of the same kind of character or pattern:
-
- + Matches the previous item one or more times. For example, A+ will match strings such as A, AA, AAA and so on.
- ? Matches the previous item zero or one time. It is used to match an optional part of the pattern.
- * Matches the previous item zero or more times. It is used to match optional parts of the pattern that may repeat.
- {n} Matches the previous item exactly n times. For example, A{2} will match strings such as AA.
- {n,m} Matches the previous item at least n times, but no more than m times. For example, A{2,5} will match strings such as AA, AAA, AAAA or AAAAA.
- {n,} Matches the previous item at least n times. For example, A{2,} will match strings such as AA, AAA, AAAA, AAAAA and so on.
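- As a rough check of the repetition operators listed above, the following sketch tests each one against a few candidate strings. It assumes Python's re module and uses re.fullmatch so that the whole string has to fit the pattern; the helper name accepted is hypothetical.

```python
import re

# Each pattern is paired with a few candidate strings to try.
cases = [
    (r"A+", ["", "A", "AAA"]),
    (r"A?", ["", "A", "AA"]),
    (r"A*", ["", "A", "AAA"]),
    (r"A{2}", ["A", "AA", "AAA"]),
    (r"A{2,5}", ["A", "AA", "AAAAA", "AAAAAA"]),
    (r"A{2,}", ["A", "AA", "AAAAAA"]),
]

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s) is not None]

results = {pattern: accepted(pattern, candidates) for pattern, candidates in cases}

for pattern, matched in results.items():
    print(pattern, matched)
```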
- Character Types
-
- \d Matches any decimal digit (0,1,2,3,4,5,6,7,8,9).
- \D Matches any character that is not a decimal digit.
- \s Matches any whitespace character such as space, tab and newline.
- \S Matches any character that is not a whitespace.
- \w Matches any "word" character (letter, digit or the underscore).
- \W Matches any character that is not a "word" character.
- [...] Matches any character that is listed inside the square brackets. For example, [abc] matches one character that is either a, b or c.
- [^...] Matches any character that is not listed inside square brackets. For example [^abc] matches one character that is NOT a, b or c.
- [0-9] Matches any character between 0 and 9. It is equivalent to using \d.
- [A-D] Matches any character between A and D (A, B, C, D).
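- The character types above can be tried out with a small sketch like the one below. It assumes Python's re module, and the sample string is invented for illustration.

```python
import re

# Hypothetical sample text mixing letters, digits, spaces and punctuation.
sample = "Box B7, row 42"

digits = re.findall(r"\d+", sample)           # runs of decimal digits
non_digits = re.findall(r"\D+", sample)       # runs of anything but digits
words = re.findall(r"\w+", sample)            # runs of "word" characters
spaces = re.findall(r"\s", sample)            # single whitespace characters
a_to_d = re.findall(r"[A-D]", sample)         # one character in the range A..D
punctuation = re.findall(r"[^\w\s]", sample)  # neither word nor whitespace

print(digits)       # ['7', '42']
print(non_digits)   # ['Box B', ', row ']
print(words)        # ['Box', 'B7', 'row', '42']
print(len(spaces))  # 3
print(a_to_d)       # ['B', 'B']
print(punctuation)  # [',']
```

- Note that in Python 3 the \d class also matches non-ASCII decimal digits in text patterns unless the re.ASCII flag is given, so there it is only equivalent to [0-9] for ASCII input.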
- Matching Alternatives
- Vertical bar characters are used to separate alternative patterns. For example, the pattern Configuration|Settings matches either "Configuration" or "Settings". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used.
- Examples:
-
- Arizona|Nevada Matches either Arizona or Nevada.
- Arizona|Nevada|California Matches Arizona, Nevada or California.
- \d{3}-\d{2}-(\d{4}|XXXX) Matches a social security number whose last 4 characters are either digits or the letters XXXX. For example, it will match both 507-55-1234 and 507-21-XXXX.
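- A minimal sketch of the alternatives above, assuming Python's re module. The sample text is made up, and the Set|Settings pattern is a hypothetical example added to show that the first alternative that succeeds is the one used.

```python
import re

# The social security pattern from the examples above.
ssn = re.compile(r"\d{3}-\d{2}-(\d{4}|XXXX)")

text = "On file: 507-55-1234 and 507-21-XXXX, but not 507-21-XX12."

# group(0) is the whole match; findall() would return only the
# contents of the parenthesized group here.
found = [m.group(0) for m in ssn.finditer(text)]
print(found)  # ['507-55-1234', '507-21-XXXX']

# Alternatives are tried left to right, so "Set" wins even though
# "Settings" would also match at the same position.
first = re.search(r"Set|Settings", "Settings").group(0)
longer_first = re.search(r"Settings|Set", "Settings").group(0)
print(first)         # Set
print(longer_first)  # Settings
```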
- Sub-Patterns
- Sub-patterns are delimited by parentheses (round brackets), which can be nested. They are used to group parts of the pattern together.
- Examples:
-
- State of (Arizona|Nevada|California) Matches "State of " followed by "Arizona", "Nevada" or "California".
- (541|503)-\d{3}-\d{4} Matches phone numbers that start with the 541 or 503 area code.
- Job (Site )?\d{5} Matches the word "Job" followed by a space, an optional "Site " and a 5-digit number. It will match the text strings "Job 89123" and "Job Site 12345".
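- The grouping examples above can be sketched as follows, assuming Python's re module. In Python, parentheses also capture the text they matched, which is read here with group(1); the sample strings are invented.

```python
import re

state = re.compile(r"State of (Arizona|Nevada|California)")
phone = re.compile(r"(541|503)-\d{3}-\d{4}")

# group(0) is the whole match, group(1) is the first sub-pattern.
m = state.search("Filed in the State of Nevada last year.")
whole, name = m.group(0), m.group(1)
print(whole)  # State of Nevada
print(name)   # Nevada

# Only numbers with a 541 or 503 area code are matched.
numbers = "541-555-0100, 206-555-0101, 503-555-0102"
matched = [m.group(0) for m in phone.finditer(numbers)]
codes = [m.group(1) for m in phone.finditer(numbers)]
print(matched)  # ['541-555-0100', '503-555-0102']
print(codes)    # ['541', '503']
```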
- Matching Whole Words
- Simple text patterns such as Alert will also match inside the words "Alerts", "Alerted" and so on. If you want your pattern to match only whole words, surround it with \b meta-characters. For example, use \bAlert\b to match only the word "Alert" and exclude all other words that contain it as a substring. Use \b anywhere you need to match a "word boundary". A word boundary is a position between a character that can be matched by \w and a character that cannot be matched by \w. The \b meta-character also matches at the start and/or end of the string if the first and/or last characters in the string are word characters.
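- A short sketch of the difference between a plain pattern and a whole-word pattern, assuming Python's re module and a made-up sample sentence:

```python
import re

# Hypothetical log line containing "Alert" alone and inside longer words.
text = "Alert: 3 Alerts raised, operator Alerted, final Alert"

# Plain pattern: also matches inside "Alerts" and "Alerted".
loose = re.findall(r"Alert", text)

# Whole-word pattern: \b requires a word boundary on both sides.
# The boundary also matches at the very start and end of the string.
whole = [m.start() for m in re.finditer(r"\bAlert\b", text)]

print(len(loose))  # 4
print(len(whole))  # 2
```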
- Using Anchors to Match Text Lines
- Anchors do not match any characters. They match only a particular position in the text. The meta-character ^ matches at the start of the string, and $ matches at the end of the string. The \b meta-character matches at a word boundary, and \B matches at every position where \b cannot match. For example, ^B matches only the first B in "B123-B923".
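- The anchor behavior described above can be sketched like this, assuming Python's re module. The re.MULTILINE flag used in the second half is Python's way of making ^ and $ apply to each line rather than to the whole string; other tools may expose this option differently.

```python
import re

code = "B123-B923"

# ^B matches only at the start of the string, plain B matches everywhere.
anchored = [m.start() for m in re.finditer(r"^B", code)]
unanchored = [m.start() for m in re.finditer(r"B", code)]
print(anchored)    # [0]
print(unanchored)  # [0, 5]

# $ requires the match to end at the end of the string.
ends_with_923 = re.search(r"923$", code) is not None
ends_with_123 = re.search(r"123$", code) is not None
print(ends_with_923, ends_with_123)  # True False

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
lines = "B123\nA456\nB789"
b_lines = re.findall(r"^B\d+$", lines, flags=re.MULTILINE)
print(b_lines)  # ['B123', 'B789']
```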
- Lookahead and Lookbehind Expressions
- It is often necessary to match certain text but include only a portion of it in the match. For example, you may want to match social security numbers but include only the first 5 digits in the match. Regular expression syntax provides special "lookahead" and "lookbehind" expressions to accomplish that.
-
- (?=p) Positive lookahead assertion. Requires that the following characters match the pattern p, but does not include those characters in the match. For example, \d{3}-\d{3}-(?=\d{4}) will match all phone numbers, but will not include the last 4 digits in the match. Another example: Supreme (?=Court) will match "Supreme" and the space after it, but only if it is followed by "Court".
- (?!p) Negative lookahead assertion. Requires that the following characters do not match the pattern p. For example, \d{3}-\d{3}-(?!5523) will match all phone numbers except those ending with 5523, but will not include the last 4 digits in the match.
- (?<=p) Positive lookbehind assertion. Requires that the preceding characters match the pattern p, but does not include those characters in the match. For example, (?<=541-)\d{3}-\d{4} will match all phone numbers from the 541 area code, but will not include the area code in the match.
- (?<!p) Negative lookbehind assertion. Requires that the preceding characters do not match the pattern p. For example, (?<!541-)\d{3}-\d{4} will match all phone numbers except those from the 541 area code, and will not include the area code in the match.
- The lookbehind expression needs to have a fixed length. For example, (?<=\d{3}) is a valid expression, while (?<=\d{3,5}) or (?<=\d+) are not.
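- The four assertions and the fixed-length rule can be tried with a sketch like the following. It assumes Python's re module, which enforces the same fixed-length restriction on lookbehind; the phone numbers are invented sample data.

```python
import re

text = "541-555-1234, 503-555-5523"

# Positive lookahead: the last 4 digits must follow but are not included.
ahead = re.findall(r"\d{3}-\d{3}-(?=\d{4})", text)

# Negative lookahead: skip numbers whose last 4 digits are 5523.
not_5523 = re.findall(r"\d{3}-\d{3}-(?!5523)", text)

# Positive lookbehind: "541-" must precede but is not included.
after_541 = re.findall(r"(?<=541-)\d{3}-\d{4}", text)

# Negative lookbehind: "541-" must not precede.
not_541 = re.findall(r"(?<!541-)\d{3}-\d{4}", text)

print(ahead)      # ['541-555-', '503-555-']
print(not_5523)   # ['541-555-']
print(after_541)  # ['555-1234']
print(not_541)    # ['555-5523']

def compiles(pattern):
    """Return True if the pattern is accepted by the re module."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-length lookbehind is accepted; variable-length ones are rejected.
fixed_ok = compiles(r"(?<=\d{3})x")
range_ok = compiles(r"(?<=\d{3,5})x")
plus_ok = compiles(r"(?<=\d+)x")
print(fixed_ok, range_ok, plus_ok)  # True False False
```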
- Keep Matched Text out of the Overall Match
- Use the \K keyword to keep the text matched so far out of the overall regex match. For example, \d+, \K\d+ matches only the second number in the following list of numbers: 24, 47. You can use \K almost anywhere in a regular expression; the only place to avoid it is inside a lookbehind. This keyword can be used in situations similar to those where lookbehind expressions are used, but without the fixed-length limitation that is imposed on lookbehind patterns. This flexibility does come at a cost. Lookbehind really goes backwards through the string, which allows it to check for a match before the start of the match attempt. When the match attempt starts at the end of the previous match, lookbehind can match text that was part of the previous match. \K cannot do this, precisely because it does not affect the way the regex engine goes through the matching process. Another limitation is that while lookbehind comes in positive and negative variants, \K does not provide a way to negate anything.
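- Not every regex engine implements \K. Python's standard re module, for instance, does not, so the sketch below is a stand-in rather than a use of \K itself: it gets the same result for the "24, 47" example by capturing the part to keep in a group. Unlike lookbehind, the text before the group may have variable length.

```python
import re

numbers = "24, 47"

# Stand-in for \d+, \K\d+ : match the whole "24, 47" but read only
# the captured second number from group 1.
m = re.search(r"\d+, (\d+)", numbers)
second = m.group(1)
whole = m.group(0)

print(whole)   # 24, 47
print(second)  # 47

# The prefix before the group can be variable-length, which a
# lookbehind such as (?<=\d+, ) would not allow in Python's re module.
seconds = [m.group(1) for m in re.finditer(r"\d+, (\d+)", "7, 1200; 315, 9")]
print(seconds)  # ['1200', '9']
```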