An Introduction to Regular Expressions and Their Use in .NET I...
Regular expressions may look a little strange at first, but they are very powerful and you will probably need to use them at some point.
Regular expressions pop up in several locations in .NET and its supporting technologies. A prime example is the RegularExpressionValidator server control, but the power of regular expressions is also used elsewhere, for example in XSD schemas through the pattern facet. This gives a good indication of what regular expressions are all about: pattern matching. In fact, the .NET Framework regular expression classes are part of the base class library and can be used with any language or tool that targets the common language runtime, including ASP.NET and Visual Studio .NET.
Regular expressions provide a powerful, flexible, and efficient method for processing text. The extensive pattern-matching notation of regular expressions allows you to quickly parse large amounts of text to find specific character patterns; to extract, edit, replace, or delete text substrings; or to add the extracted strings to a collection in order to generate a report. For many applications that deal with strings (such as HTML processing, log file parsing, and HTTP header parsing), regular expressions are an indispensable tool.
Microsoft .NET Framework regular expressions incorporate the most popular features of other regular expression implementations such as those in Perl and awk. Designed to be compatible with Perl 5 regular expressions, .NET Framework regular expressions include features not yet seen in other implementations, such as right-to-left matching and on-the-fly compilation.
This first article focuses on the regular expression language itself, with examples that illustrate its key features. We'll also take an initial look at the use of regular expressions in .NET, which we'll continue in a second article.
The regular expression language is designed and optimized to manipulate text. The language comprises two basic character types: literal (normal) text characters and metacharacters. It is the set of metacharacters that gives regular expressions their processing power.
No doubt you are familiar with the ? and * wildcards used in file-name searches to represent any single character or any group of characters. Regular expressions extend this basic idea, providing a large set of metacharacters that make it possible to describe very complex text-matching expressions with relatively few characters. It is this expressive capability that is central to the power of regular expressions.
So, for example:
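\d represents a digit (for ASCII text, a character from the group [0-9]),
\w represents a word character (for ASCII text, a character from the group [a-zA-Z0-9_]; note that the underscore is included) and
. matches any single character except a newline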
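To see these three metacharacters at work, here is a minimal sketch. It uses Python's `re` module purely as a convenient stand-in (an assumption on our part; this article targets .NET, whose engine can differ in details such as Unicode handling), and the sample strings are made up for illustration.

```python
import re

sample = "Room 42, floor_3!"  # hypothetical sample text

# \d matches one digit at a time
digits = re.findall(r"\d", sample)

# \w matches one word character: a letter, a digit or an underscore
word_chars = re.findall(r"\w", sample)

# . matches any single character except a newline (the default behaviour)
any_chars = re.findall(r".", "a\nb c")

print(digits)
print("".join(word_chars))
print(any_chars)
```

Note that the space, comma and exclamation mark are skipped by \w, while the underscore is not.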
Metacharacters aren't the only symbols used by regular expressions; there are also modifiers, quantifiers, anchors, escape characters, backreferences, alternation symbols and character class symbols. Let's run through these briefly in turn:
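Modifiers change how a match is performed. For example, 'i' indicates that case should be ignored when matching strings.
e.g.
/solution/i represents a case-insensitive search for the string 'solution'. The /.../i form is Perl notation; in .NET the same effect is obtained with the RegexOptions.IgnoreCase option or the inline (?i) construct.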
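As a rough illustration of the case-insensitive modifier, the sketch below again uses Python's `re` module as a stand-in (an assumption, not the .NET API). The `re.IGNORECASE` flag plays the role of the trailing i, and the sample sentence is invented.

```python
import re

text = "The Solution to the problem is a SOLUTION."  # hypothetical sample text

# Without the modifier, only the exact lower-case spelling would match.
case_sensitive = re.findall(r"solution", text)

# re.IGNORECASE plays the role of the trailing /i modifier.
case_insensitive = re.findall(r"solution", text, flags=re.IGNORECASE)

# The same modifier written inline at the start of the pattern.
inline = re.findall(r"(?i)solution", text)

print(case_sensitive)
print(case_insensitive)
print(inline)
```

The case-sensitive search finds nothing here because neither occurrence is written entirely in lower case.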
Quantifiers are a subset of metacharacters that specify how many times the preceding element should match. These include:
| ? | Matches the preceding element zero or one times. |
| * | Matches the preceding element zero or more times. |
| + | Matches the preceding element one or more times. |
| {num} | Matches the preceding element num times. |
| {min,max} | Matches the preceding element at least min times, but not more than max times. |
e.g.
apple[0-9]{3,5}
matches "apple" followed by at least three and no more than five digits in a row.
This matches "apple123", "apple4321" and "apple15243", but not "apple21".
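The claim about apple[0-9]{3,5} is easy to check. The following sketch, again in Python as a stand-in for the .NET engine (an assumption), runs the pattern against the four strings mentioned, plus one longer string of our own to show what happens when more than five digits follow.

```python
import re

pattern = re.compile(r"apple[0-9]{3,5}")
candidates = ["apple123", "apple4321", "apple15243", "apple21"]

# True where the pattern is found somewhere in the string
results = {s: bool(pattern.search(s)) for s in candidates}
print(results)

# With six digits, an unanchored search still succeeds,
# but it stops after the fifth digit.
longer = pattern.search("apple123456")
print(longer.group())

# Requiring the whole string to match rejects the six-digit case.
whole = pattern.fullmatch("apple123456")
print(whole)
```

The upper bound limits how much a single match consumes; it does not by itself reject longer inputs unless the pattern is anchored.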
Anchors specify the position where the pattern occurs. For example:
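| ^ | Matches at the start of the input, or at the start of each line when multiline matching is enabled. |
| $ | Matches at the end of the input, or at the end of each line when multiline matching is enabled. |
| \< | Matches at the beginning of a word (supported by some tools such as grep and vi, but not by .NET; use \b there). |
| \> | Matches at the end of a word (supported by some tools such as grep and vi, but not by .NET; use \b there). |
| \b | Matches at the beginning or the end of a word. |
| \B | Matches at any position that is not the beginning or end of a word. |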
e.g.
/\bliz/ matches lizard but not blizzard
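Here is a small sketch of the anchors in action, written in Python as a stand-in for .NET (an assumption; the multiline flag shown is Python's `re.MULTILINE`). The two-line sample text is made up.

```python
import re

# \b requires a word boundary immediately before "liz"
word_start = re.compile(r"\bliz")
in_lizard = bool(word_start.search("lizard"))
in_blizzard = bool(word_start.search("blizzard"))

# \B is the opposite: "liz" must NOT start at a word boundary
inside_word = bool(re.search(r"\Bliz", "blizzard"))

# ^ matches only at the very start unless multiline matching is on
lines = "first line\nsecond line"  # hypothetical sample text
default_starts = re.findall(r"^\w+", lines)
multiline_starts = re.findall(r"^\w+", lines, flags=re.MULTILINE)

print(in_lizard, in_blizzard, inside_word)
print(default_starts)
print(multiline_starts)
```

Anchors consume no characters themselves; they only constrain where the rest of the pattern is allowed to match.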
Escape characters allow you to search for asterisks, question marks, slashes and other special characters in a string. Since many non-alphanumeric characters are treated as special characters in regular expressions, place a backslash before such a character to make it match literally.
e.g. ".*" matches any character any number of times, but "\.*" matches runs of full stops of any length (including none). The backslash lets you search for a plain full stop: "\.".
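The difference between an escaped and an unescaped dot can be sketched as follows, using Python as a stand-in for .NET (an assumption). The helper `re.escape` is Python's way of escaping a whole string at once; .NET offers a similar Regex.Escape method. The sample sentence is invented.

```python
import re

text = "Wait... what? Done."  # hypothetical sample text

# Unescaped, the dot is a metacharacter: .* swallows the whole line.
greedy = re.search(r".*", text).group()

# Escaped, \. is a literal full stop. \.+ is used instead of \.* here
# so that empty matches are not reported.
dot_runs = re.findall(r"\.+", text)

# re.escape adds the backslashes for us, so "?" is treated literally.
literal = re.escape("what?")
found = re.search(literal, text).group()

print(greedy)
print(dot_runs)
print(found)
```

Escaping user-supplied text before embedding it in a pattern is a common use of such a helper.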
Backreferences allow you to capture the text matched by part of a pattern and then reuse it later in the expression or in a replacement string. This is what makes regular expressions so useful for search and replace.
For example,
s/\(apple\)/pies, \1 and cherry/g
finds every instance of "apple", captures it, and replaces it with "pies, apple and cherry". This is sed notation; in .NET the group is written (apple) and is referred to as $1 in the replacement string.
This technique handles strings of data that change slightly from instance to instance, such as page numbering schemes.
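The apple example can be tried out with a few lines of code. The sketch below uses Python's `re.sub` as a stand-in for a .NET replace (an assumption); Python happens to accept \1 in the replacement string. The second half shows a backreference used inside the pattern itself, with an invented sentence containing a doubled word.

```python
import re

text = "apple today, apple tomorrow"  # hypothetical sample text

# (apple) captures the match; \1 re-inserts it in the replacement.
replaced = re.sub(r"(apple)", r"pies, \1 and cherry", text)
print(replaced)

# A backreference inside the pattern: \1 must repeat whatever the
# first group matched, which finds an accidentally doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine")
print(doubled.group())
```

Because \1 refers to whatever the group actually matched, the same pattern works for any repeated word, not just a fixed one.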
Alternation allows a regular expression to express a logical OR. If you want to search for apple or fruit, you could use the following:
apple|fruit
Add parentheses to limit the scope of alternate matches. This is useful when you search for words with two different spellings. For example,
gr(a|e)y
searches for both gray and grey.
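A short sketch makes the effect of the parentheses visible. It is written in Python as a stand-in for .NET (an assumption), and the sample sentence is made up.

```python
import re

text = "The grey cat sat on a gray mat eating fruit, not an apple."

# Plain alternation: either word, in the order they occur in the text
either = re.findall(r"apple|fruit", text)

# Parentheses limit the alternation to the single differing letter.
# finditer is used so that we get the whole match, not just the group.
spellings = [m.group() for m in re.finditer(r"gr(a|e)y", text)]

# Without parentheses the alternation splits the entire pattern:
# gra|ey means "gra" OR "ey", which is rarely what is wanted.
unscoped = re.findall(r"gra|ey", "grey gray")

print(either)
print(spellings)
print(unscoped)
```

The last result shows why the scope matters: alternation has very low precedence, so it divides everything on either side of the bar unless parentheses say otherwise.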
Character classes match any one character listed inside the class; square brackets separate the class from the rest of the regular expression. For example,
apple[0123456789]
matches "apple" followed by a zero, a one, a two, a three, a four, a five, a six, a seven, an eight, or a nine. You can abbreviate this using a dash. For example,
apple[0-9]
means the same as the longer regular expression just above. To match "apple" followed by any digit or any uppercase or lowercase letter, we could write
apple[0-9A-Za-z]
If we separated those three ranges with a space, to make it easier to read,
apple[0-9 A-Z a-z]
the meaning would change, because a space inside a character class is itself a literal character: this pattern matches "apple" followed by any digit, any uppercase or lowercase letter, or a space.
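The pitfall with the extra spaces is easy to demonstrate. In this sketch (Python as a stand-in for .NET, an assumption; the variable names are ours) the two classes are compared against the same inputs.

```python
import re

strict = re.compile(r"apple[0-9A-Za-z]")    # digit or letter only
spaced = re.compile(r"apple[0-9 A-Z a-z]")  # the space is now a member too

# Both classes accept a digit after "apple".
strict_digit = bool(strict.search("apple7"))
spaced_digit = bool(spaced.search("apple7"))

# Only the spaced class accepts "apple" followed by a space.
strict_space = bool(strict.search("apple pie"))
spaced_space = bool(spaced.search("apple pie"))

print(strict_digit, spaced_digit)
print(strict_space, spaced_space)
```

If readability is the goal, it is safer to comment the pattern than to add whitespace inside the brackets.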
That's a quick look at the syntax. Let's now start examining the support for regular expressions in .NET.
As we've seen, regular expressions are likely to be useful wherever we have text-based data that we need to manipulate. This may be in support of other parts of the .NET Framework, e.g. the RegularExpressionValidator control, or in our own bespoke code.
The implementation of regular expression functionality is within the System.Text.RegularExpressions namespace, which contains the eight classes listed below:
Regex – represents an immutable (read-only) regular expression. It also contains static methods that allow use of the other regular expression classes without explicitly instantiating objects of those classes.
MatchCollection – represents the set of successful matches found by iteratively applying a regular expression pattern to the input string.
Match – represents the result of a single regular expression match.
GroupCollection – contains all the groups in a single match.
Group – contains the details of a single group in a group collection.
CaptureCollection – contains all the Capture objects for a single group.
Capture – represents the string matched by a single capture within a group.
RegexCompilationInfo – provides the details needed to compile a Regex into a standalone assembly.
The Regex class is the most powerful and commonly used class. We'll take a closer look at Regex in part II of this series of two articles.
In this first article we've introduced background information regarding regular expressions. In the next article we'll look in more detail at the support for the introduced concepts as provided by .NET.
Various Online
VB.NET Text Manipulation Handbook
Liger et al.
Wrox