TN19: Regular Expression Control

Tech Note 19: Regular Expression Control

November 30, 2006

Introduction

NS Basic/CE comes with a reasonably good set of functions for string manipulation. However, people who have used other programming languages such as Perl know that much more powerful pattern matching and replacement functions exist. The standard for these is Regular Expressions. Starting with NS Basic/CE 7.0, NS Basic/CE has a built-in regular expression object. It is available on all devices running Windows CE 4.0 and later. Regular Expressions were popularized by UNIX tools. They can be cryptic and difficult to learn, but they allow sophisticated pattern matching in strings.

Regular Expressions are quite involved (it is possible to buy whole books devoted to their use!). If you search the web for "Regular Expressions", you can get quite a bit more information about this powerful tool.

Using the RegEx Object

Support for Regular Expressions is implemented in the RegExp object. Let's do a simple example to show how it works. Suppose we have this string:

StringToSearch = "http://www.nsbasic.com"

The RegExp object can then be created:

AddObject "RegExp", "RegularExpressionObject"

Set RegularExpressionObject = New RegExp

RegEx Properties

This object has three properties:

Pattern specifies the Regular Expression that should be searched for. See the "List of all Pattern characters" below for more information.
IgnoreCase is True if the search should ignore case, or False if the search should be case sensitive (the default is False).
Global is True if the search should match all occurrences of the pattern, or False if just the first occurrence should be matched.

With RegularExpressionObject
  .Pattern = ".com"
  .IgnoreCase = True
  .Global = True
End With

RegEx Methods

RegExp can do 3 things to a string: Test, Execute and Replace. Here is how each of them works.

Test

This uses the Test method of the RegExp object to see if the Regular Expression is found in the StringToSearch string.

res = RegularExpressionObject.Test(StringToSearch)

The Test method will return True if the Regular Expression was found, and False if it was not found.

If res Then
  Print RegularExpressionObject.Pattern & " was found in " & StringToSearch
Else
  Print RegularExpressionObject.Pattern & " was not found in " & StringToSearch
End If
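
One detail worth checking with Test is how the period behaves: an unescaped "." matches any single character, while "\." matches only a literal period. The following is a minimal sketch of the difference; the variable name re and the sample strings are illustrative, not part of the original note.

```vbscript
' Sketch: compare an unescaped period with an escaped one.
Set re = New RegExp
re.IgnoreCase = True

re.Pattern = ".com"
res = re.Test("telecom")                  ' True: the "e" before "com" satisfies "."

re.Pattern = "\.com"
res = re.Test("telecom")                  ' False: there is no literal period before "com"
res = re.Test("http://www.nsbasic.com")   ' True: ".com" appears literally
```

When the pattern is meant to find a literal period, as in a domain suffix, the escaped form is usually the safer choice.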

Execute

The RegExp Execute method is a more sophisticated version of the Test method. As well as seeing if the Regular Expression is found within a string, it will also return the number of matches made within that string, and at which positions in the string the matches were found. It returns its result in an object, so the results can be enumerated in a For Each loop.

StringToSearch = "The answer to life, the universe and everything is 42."

Set RegularExpressionObject = New RegExp

With RegularExpressionObject
  .Pattern = "the"
  .IgnoreCase = True
  .Global = True
End With

Set res = RegularExpressionObject.Execute(StringToSearch)

If res.Count > 0 Then
  For Each item in res
    Print item.Value & " was matched at position " & item.FirstIndex
  Next
Else
  Print RegularExpressionObject.Pattern & " was not found in the string: " & StringToSearch
End If
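
Each item returned by Execute describes one match. As a sketch of what can be read from it, the example below prints the Value, FirstIndex and Length of every run of digits in a sample string. It assumes the match objects expose the same properties as desktop VBScript, where FirstIndex counts from zero; the names re, matches and m are illustrative.

```vbscript
' Sketch: list every run of digits with its position and length.
Set re = New RegExp
re.Pattern = "\d+"
re.Global = True

Set matches = re.Execute("Rooms 12, 7 and 304 are free.")
For Each m In matches
  ' FirstIndex is zero-based: "12" starts at 6, "7" at 10, "304" at 16
  Print m.Value & " at index " & m.FirstIndex & ", length " & m.Length
Next
```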

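The Global and IgnoreCase properties control what Execute returns. With Global set to False, only the first match is returned; with IgnoreCase set to True, both "The" and "the" are matched.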
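
To see the effect of these two properties, the sketch below counts the matches for "the" in the same sentence under three settings. The variable names re and sample are illustrative.

```vbscript
' Sketch: the same pattern under different Global and IgnoreCase settings.
sample = "The answer to life, the universe and everything is 42."
Set re = New RegExp
re.Pattern = "the"

re.IgnoreCase = True
re.Global = False
Print re.Execute(sample).Count   ' 1: Execute stops after the first match ("The")

re.Global = True
Print re.Execute(sample).Count   ' 2: "The" and "the"

re.IgnoreCase = False
Print re.Execute(sample).Count   ' 1: only the lowercase "the"
```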

Replace

This can be used to replace part of a string using Regular Expression matching. For example, in the script below, each occurrence of "a" is replaced by "o".

InitialString = "My name is Zaphod"

Set RegularExpressionObject = New RegExp

With RegularExpressionObject
  .Pattern = "a"
  .IgnoreCase = True
  .Global = True
End With

ReplacedString = RegularExpressionObject.Replace(InitialString, "o")

Print "Replaced " & InitialString & " with " & ReplacedString
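
The Global property matters for Replace as well. The sketch below repeats the replacement with Global set to False and then True. The final lines assume that the replacement text may refer to parenthesized groups as $1 and $2, as desktop VBScript 5.5 and later allow; this has not been verified for every NS Basic/CE runtime. The names re and original are illustrative.

```vbscript
' Sketch: Replace with and without Global, then with group references.
original = "My name is Zaphod"
Set re = New RegExp
re.Pattern = "a"
re.IgnoreCase = True

re.Global = False
Print re.Replace(original, "o")   ' My nome is Zaphod (first "a" only)

re.Global = True
Print re.Replace(original, "o")   ' My nome is Zophod (every "a")

' Assumption: $1 and $2 refer to the first and second (pattern) groups.
re.Pattern = "(\w+) (\w+)"
re.Global = False
Print re.Replace("Zaphod Beeblebrox", "$2 $1")   ' Beeblebrox Zaphod, if supported
```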

Real life Regular Expressions

So far, there is nothing here that couldn't already be done with other NS Basic/CE functions. The power of Regular Expressions only becomes apparent when more complex situations are encountered. For example, the function below will strip out all the HTML tags from a string:

Function stripHTMLtags(HTMLstring)
  Set RegularExpressionObject = New RegExp  
  With RegularExpressionObject
    .Pattern = "<[^>]+>"
    .IgnoreCase = True
    .Global = True
  End With
  stripHTMLtags = RegularExpressionObject.Replace(HTMLstring, "")
End Function

The function can then be called using something like:

Print stripHTMLtags("This is some HTML")

The function works because it replaces each HTML tag with an empty string. HTML tags are identified using the Regular Expression held in the Pattern property, "<[^>]+>", which reads as follows. A tag starts with a less than sign "<". It then contains one or more characters other than a greater than sign ">": the square brackets define a set of characters, the ^ symbol at the start of the set means that the listed characters should NOT appear, and the plus sign means match the preceding item one or more times. Finally, a greater than sign ">" closes the tag.
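
As a quick check of that pattern, the sketch below applies it directly to a string that contains tags. The sample markup and the variable name re are illustrative.

```vbscript
' Sketch: remove every <...> tag from a sample string.
Set re = New RegExp
re.Pattern = "<[^>]+>"
re.Global = True    ' without Global, only the first tag would be removed

Print re.Replace("<p>This is <b>some</b> HTML</p>", "")
' prints: This is some HTML
```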

The dollar sign is used to look for matches at the end of a string, so the following will look for .com at the end of a string. The backslash makes the period literal; an unescaped period would match any character.

.Pattern = "\.com$"

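Use a bar to specify that any one of several expressions may be matched. The alternatives are grouped in parentheses so that the dollar sign applies to both of them. The following will match .gov or .com at the end of a string:

.Pattern = "(\.gov|\.com)$"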
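
Because the bar separates whole alternatives, an anchor such as $ only applies to the alternative it sits in unless the alternatives are grouped. A minimal sketch of the difference follows; the host names and the variable name re are made up for illustration.

```vbscript
' Sketch: alternation with and without grouping.
Set re = New RegExp
re.IgnoreCase = True

re.Pattern = "\.gov|\.com$"
res = re.Test("www.gov.example.org")   ' True: ".gov" anywhere satisfies the first alternative

re.Pattern = "(\.gov|\.com)$"
res = re.Test("www.gov.example.org")   ' False: the string ends in ".org"
res = re.Test("www.nsbasic.com")       ' True: the string ends in ".com"
```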

List of all Pattern characters

Character	Description
\	Marks the next character as either a special character or a literal. For example, "n" matches the character "n". "\n" matches a newline character. The sequence "\\" matches "\" and "\(" matches "(".
^	Matches the beginning of input.
$	Matches the end of input.
*	Matches the preceding character zero or more times. For example, "zo*" matches either "z" or "zoo".
+	Matches the preceding character one or more times. For example, "zo+" matches "zoo" but not "z".
?	Matches the preceding character zero or one time. For example, "a?ve?" matches the "ve" in "never".
.	Matches any single character except a newline character.
(pattern)	Matches pattern and remembers the match. The matched substring can be retrieved from the resulting Matches collection, using Item [0]...[n]. To match parentheses characters ( ), use "\(" or "\)".
x|y	Matches either x or y. For example, "z|wood" matches "z" or "wood". "(z|w)oo" matches "zoo" or "woo".
{n}	n is a nonnegative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob," but matches the first two o's in "foooood".
{n,}	n is a nonnegative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" and matches all the o's in "foooood." "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
{n,m}	m and n are nonnegative integers. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood." "o{0,1}" is equivalent to "o?".
[xyz]	A character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^xyz]	A negative character set. Matches any character not enclosed. For example, "[^abc]" matches the "p" in "plain".
[a-z]	A range of characters. Matches any character in the specified range. For example, "[a-z]" matches any lowercase alphabetic character in the range "a" through "z".
[^m-z]	A negative range of characters. Matches any character not in the specified range. For example, "[^m-z]" matches any character not in the range "m" through "z".
\b	Matches a word boundary, that is, the position between a word and a space. For example, "er\b" matches the "er" in "never" but not the "er" in "verb".
\B	Matches a non-word boundary. "ea*r\B" matches the "ear" in "never early".
\d	Matches a digit character. Equivalent to [0-9].
\D	Matches a non-digit character. Equivalent to [^0-9].
\f	Matches a form-feed character.
\n	Matches a newline character.
\r	Matches a carriage return character.
\s	Matches any white space including space, tab, form-feed, etc. Equivalent to "[ \f\n\r\t\v]".
\S	Matches any nonwhite space character. Equivalent to "[^ \f\n\r\t\v]".
\t	Matches a tab character.
\v	Matches a vertical tab character.
\w	Matches any word character including underscore. Equivalent to "[A-Za-z0-9_]".
\W	Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
\num	Matches num, where num is a positive integer. A reference back to remembered matches. For example, "(.)\1" matches two consecutive identical characters.
\ n	Matches n, where n is an octal escape value. Octal escape values must be 1, 2, or 3 digits long. For example, "\11" and "\011" both match a tab character. "\0011" is the equivalent of "\001" & "1". Octal escape values must not exceed 256. If they do, only the first two digits comprise the expression. Allows ASCII codes to be used in regular expressions.
\xn	Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" & "1". Allows ASCII codes to be used in regular expressions.
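
Several of the characters in this table can be combined in one pattern. The sketch below uses \b, \d and {n} to find dates written as four digits, two digits and two digits, and then uses a back reference to find doubled letters. The sample strings and the names re and m are illustrative.

```vbscript
' Sketch: combine word boundaries, digit classes, counts and a back reference.
Set re = New RegExp
re.Global = True

re.Pattern = "\b\d{4}-\d{2}-\d{2}\b"
For Each m In re.Execute("Released 2006-11-30, updated 2007-1-5.")
  Print m.Value      ' 2006-11-30 (the second date lacks two-digit fields)
Next

re.Pattern = "(.)\1"
For Each m In re.Execute("bookkeeper")
  Print m.Value      ' oo, then kk, then ee
Next
```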