Get Smallest Matches Possible with Regular Expressions

I occasionally use regular expressions, and when I do, I don’t regret it. But because I only use them occasionally, it can be difficult to keep the syntax and concepts straight in my head.

The general purpose behind regular expressions is to take a piece of text and find smaller pieces of text within it that match a certain pattern. There are various syntaxes for defining the patterns, and they overlap with each other quite a bit. You can match specific characters, numerical digits, whitespace characters, etc. I can’t provide a full tutorial here, but plenty of tutorials are available elsewhere on the web.

One gotcha I often run into comes up when I am trying to match a pattern that starts and ends with specific characters. By default, regular expression engines are “greedy”: they give you the largest possible match for your pattern rather than the smallest, and typically I want the smallest possible match. For example, let’s say the text I am working with is:

abcd abcd abed

For demo purposes, let’s say the first pattern I want to match is “bcd”.

Below is some Python code that illustrates how to use regular expressions to find matches for this pattern. (To the Python purists, I will admit that this code is not as short/terse as it could be. I decided to make it a little more verbose so readers not super familiar with Python would not get lost, for example, if I used list comprehensions.)

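import re

text = "abcd abcd abed"
pattern = "bcd"

matches = []

# finditer yields one match object per non-overlapping match, left to right
for matchGroup in re.finditer(pattern, text):
    # group(0) is the full text of the match
    matchText = matchGroup.group(0)
    matches.append(matchText)

print(matches)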

The output of this will be:

['bcd', 'bcd']
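
For readers who are comfortable with terser Python, here is a minimal sketch of the same search using re.findall, which returns the matched strings directly when the pattern contains no capturing groups. The variable names are just the ones from the example above.

```python
import re

text = "abcd abcd abed"
pattern = "bcd"

# With no capturing groups in the pattern, findall returns
# each full match as a plain string, in left-to-right order.
matches = re.findall(pattern, text)

print(matches)
```

This prints the same list as the longer loop. I will stick with the loop in the rest of the post so each step stays visible.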

OK, that’s good. Now let’s say I wanted to match anything that started with “b” and ended with “d”. This is exactly the kind of task for which regular expressions are useful. I can use the same code but specify a “b” followed by a dot character followed by a plus followed by a “d”: “b.+d”. This means that I want to find anything that starts with a “b”, ends with a “d”, and has at least one character in between. When I run the Python code I get:

import re

text = "abcd abcd abed"
pattern = "b.+d"  # "b", then one or more of any character, then "d"

matches = []

for matchGroup in re.finditer(pattern, text):
    # group(0) is the full text of the match
    matchText = matchGroup.group(0)
    matches.append(matchText)

print(matches)
 
Output:
['bcd abcd abed']

It found the largest single match that started with a “b” and ended with a “d”: almost the entire text. But that’s not exactly what I was looking for. I wanted it to identify the smaller, individual matches: “bcd”, “bcd”, and “bed”.
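
To see just how far that single match stretches, here is a small sketch that asks the match object for its start and end positions. It assumes the same text as above; the names start and end are only for illustration.

```python
import re

text = "abcd abcd abed"

# search returns the first (leftmost) match. The greedy .+ grabs as much
# as it can, so the match runs on to the last "d" in the text.
match = re.search("b.+d", text)

# span() gives the start index and the (exclusive) end index of the match
start, end = match.span()

print(start, end)
print(len(text))
```

The match starts at index 1, the first “b”, and its end index is 14, which is also the length of the text. In other words, it swallowed everything except the leading “a”.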

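Well, there’s an easy way around this. Simply add a question mark right after the quantifier (here, the plus). This makes the quantifier “lazy” (non-greedy), so it matches as few characters as possible. In this pattern, that puts the question mark just before the final “d”. (Note that this syntax may vary depending on the programming language or regular expression implementation you are using.)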

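import re

text = "abcd abcd abed"
pattern = "b.+?d"  # the ? after the + makes it match as few characters as possible

matches = []

for matchGroup in re.finditer(pattern, text):
    # group(0) is the full text of the match
    matchText = matchGroup.group(0)
    matches.append(matchText)

print(matches)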
 
Output:
['bcd', 'bcd', 'bed']
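
As a side-by-side recap, the sketch below runs the greedy and lazy patterns against the same text. It also tries a third option that is not used above: a negated character class, “[^d]”, which matches any character except “d” and so cannot run past the first “d” it meets. That third pattern is my own addition for comparison, and it only works this neatly because the closing delimiter is a single character.

```python
import re

text = "abcd abcd abed"

# findall returns each full match as a string (no capturing groups here)
greedy = re.findall("b.+d", text)       # .+ takes as much as it can
lazy = re.findall("b.+?d", text)        # .+? takes as little as it can
negated = re.findall("b[^d]+d", text)   # [^d]+ can never cross a "d"

print(greedy)
print(lazy)
print(negated)
```

On this text, the lazy and negated patterns both produce the three small matches, while the greedy pattern produces the one large match.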
