Regular expressions: complicated string matches

Problem

How do I match complicated strings in MATLAB using regular expressions?

Solution

Sometimes it's not enough to search for a simple series of characters within a string. You may have a task that requires you to search for something more abstract. For example, you may want to extract a ZIP code or a URL from a text file. Perhaps you want to extract data from an XML file? Regular expressions come to the rescue in instances such as these.

Regular expressions (regexes, for short) are rather like a miniature programming language for concisely defining a sequence of characters that you want to match. Before going further, it's worth pointing out that regexes can often look rather formidable (particularly at first). For example, a regular expression to check if a string is a valid colon-separated MAC address is: ^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$

It's beyond the scope of this recipe to cover regexes in detail (there are large books on the topic). Instead I will outline the MATLAB functions that implement regular expressions and provide a few examples. If you work through a complicated regex in a systematic way, it will eventually make sense. It's worth persisting because there's a lot you can do with regular expressions.

The MATLAB function regexp is used to match a regular expression to a string. Its sister function, regexpi, does this in a case-insensitive way. The regexprep function allows you to match part of a string with a regular expression and then replace the match with another string. For our purposes, we will focus on string matching with regexp. Here's a simple example:

testString='cat on the mat';
regex='at';

[s,e]=regexp(testString,regex)

s =
     2    13

e =
     3    14

Here we've made no real use of regular expressions at all. We've simply used regexp to match "at" in the string "cat on the mat". The sequence "at" appears twice and the output arguments tell us the start and end of each appearance. So the first appearance runs from the second to the third character, which is s(1) to e(1). If you look in the regexp help you will see that there are many output arguments and that you can ask regexp to return a specific one. Let's try that with an example that actually uses a regular expression:

testString='The aardvark eats crumpets and strumpets for dinner';

%Match all words that end with the string "pets"
regexp(testString,'\w+pets','match')
 ans = 

    'crumpets'    'strumpets'

The regular expression \w+pets is read as follows:

  1. \w match any alphabetic, numeric, or underscore character
  2. + one or more times
  3. followed by these characters: pets
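
The sister functions mentioned earlier accept the same pattern. The following is only a minimal sketch: the test strings and the replacement word 'toast' are made up for illustration, and the variable names are my own.

```matlab
% Hypothetical test string: the words of interest are written in upper or mixed case
testString = 'The aardvark eats CRUMPETS and strumPETS for dinner';

% regexp is case-sensitive, so lower-case "pets" is not found in this string
caseSensitive = regexp(testString, '\w+pets', 'match');

% regexpi ignores case, so both words are returned
caseInsensitive = regexpi(testString, '\w+pets', 'match');

% regexprep replaces every match of the pattern with the given string
replaced = regexprep('The aardvark eats crumpets and strumpets', '\w+pets', 'toast');
```

Here caseSensitive should come back empty, caseInsensitive should hold both words, and replaced should read 'The aardvark eats toast and toast'.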

\w is a shorthand for alphabetic, numeric, and underscore characters. There are other such shorthands, and these are covered on the MATLAB regular expression page. For example, \d means match a numeric digit. \w and \d are called shorthands because each has a long-form expression that does the same job but is written in a more basic way. The long form of \d is [0-9], which reads "match digits 0 to 9". For \w it's [a-zA-Z_0-9], which reads "match a-z (lower case) or A-Z (upper case) or an underscore or 0 to 9".
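
One way to convince yourself that a shorthand and its long form behave alike is to run both on the same text and compare the results. This is a small sketch on a made-up, plain-ASCII test string; I have not checked how the two forms compare on non-English characters.

```matlab
% Hypothetical test string containing digits, letters, and an underscore
testString = 'Order 66 shipped to unit_7 on 2024-01-15';

% \d+ versus its long form [0-9]+
shortDigits = regexp(testString, '\d+', 'match');
longDigits  = regexp(testString, '[0-9]+', 'match');
digitsAgree = isequal(shortDigits, longDigits);

% \w+ versus its long form [a-zA-Z_0-9]+
shortWords = regexp(testString, '\w+', 'match');
longWords  = regexp(testString, '[a-zA-Z_0-9]+', 'match');
wordsAgree = isequal(shortWords, longWords);
```

Both digitsAgree and wordsAgree should be true for this string. Note that 'unit_7' is returned as a single match by \w+ because the underscore counts as a word character.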

Now you know how \w is related to its long-form version. Hopefully you can also see how it's possible to make up any expression you want. For example, say you want to match first an upper case H, then one or more lower case characters:

testString='Siobhan the brachiosaurus says "Hello!"';

regexp(testString,'H[a-z]+','match')
ans = 

    'Hello'

This matches only the word "Hello" because nothing else contains an upper case "H". The match stops at the "!" character because "!" is not a lower case letter, so the regex engine ends the match there. The + is known as a quantifier. There are other quantifiers (e.g. {n,m} to match between n and m times) and these are discussed in the regular expression help page. If you read about quantifiers, and I also tell you that ^ anchors the match to the start of the string and $ to the end of the string (by default; MATLAB's 'lineanchors' option makes them match at line boundaries instead), then you should know enough to decipher the MAC address regex with which we opened this page.
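
To see the anchors and the {5} quantifier at work, the MAC address regex from the top of the page can be tried on a few strings. This is a sketch only: the candidate addresses below are invented for illustration, and the loop is just one of several ways to apply the test.

```matlab
% The MAC address regex from the start of this recipe
macRegex = '^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$';

% Hypothetical candidates: one valid address and three broken ones
candidates = {'00:1A:2b:3C:4d:5E', ...   % six hex pairs: valid
              '00:1A:2b:3C:4d', ...      % only five pairs
              '00:1A:2b:3C:4d:5G', ...   % G is not a hex digit
              'x00:1A:2b:3C:4d:5E'};     % extra leading character, rejected by ^

isValid = false(size(candidates));
for k = 1:numel(candidates)
    % With 'once', regexp returns the start index of the first match, or [] if there is none
    isValid(k) = ~isempty(regexp(candidates{k}, macRegex, 'once'));
end
```

Only the first candidate should be flagged as valid. Dropping the ^ and $ from the pattern would let the last candidate through, because the regex would then be free to match in the middle of the string.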

Discussion

For fear of making this recipe overly long, I will stop here. The main point I want to get across is that regular expressions are very useful and flexible. They take a little getting used to but are worth the effort. Carefully reading the MATLAB documentation pages should be enough to get you going. Raise questions below if you're still stuck.

 
