Study Guide: Regular Expressions
Instructions
This is a study guide with links to past lectures, assignments, and handouts, as well as additional practice problems to assist you in learning the concepts.
Assignments
Important: For solutions to these assignments once they have been released, find the links in the front page calendar.
Lectures
Guides
What are Regular Expressions?
Consider the following scenarios:
- You've just written a 500-page book on how to dominate the game of Hog. Your book assumes that all players use six-sided dice. However, in 2035, the newest version of Hog is updated to use 25-sided dice. You now need to find all instances where you mention six-sided dice in your book and update them to refer to 25-sided dice.
- You are in charge of a registry of phone numbers for the SchemeCorp company. You're planning a company social for all employees in the city of Berkeley. To make your guest list, you want to find the phone numbers of all employees living in the Berkeley area codes of 415 or 314.
- You're the communications expert on an interplanetary voyage, and you've received a message from another starship captain with the locations of a large number of potentially habitable planets, represented as strings. You must determine which of these planets lie in your star system.
What do all of these scenarios have in common? They all involve searching for patterns within a larger piece of text. These can include extracting strings that begin with a certain set of characters, contain a certain set of characters, or follow a certain format.
Regular expressions are a powerful tool for solving these kinds of problems. With regular expression operators, we can write expressions to describe a set of strings that match a specified pattern.
For example, the following code defines a function that matches all words that start with the letter "h" (capitalized or lowercase) and end with the lowercase letter "y".
import re

def hy_finder(text):
    """
    >>> hy_finder("Hey! Hurray, I hope you have a lovely day full of harmony.")
    ['Hey', 'Hurray', 'harmony']
    """
    return re.findall(r"\b[Hh][a-z]*y\b", text)
Let's examine the above regular expression piece by piece.
- First, we use r"", which denotes a raw string in Python. Raw strings handle the backslash character \ differently than regular string literals. For example, the \b in this regular expression is treated as a sequence of two characters. If we were to use a string literal without the additional r, \b would be treated as a single character representing the ASCII backspace code.
- We then begin and end our regular expression with \b. This ensures that word boundaries exist before the "h" and after the "y" in the string we want to match.
- We use [Hh] to represent that we want our word to start with either a capital or lowercase "h" character.
- We want our word to contain 0 or more (denoted by the * character) lowercase letters between the "h" and "y". We use [a-z] to refer to this set.
- Finally, we use the character y to denote that our string should end with the lowercase letter "y".
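The difference between raw and regular string literals described above can be checked directly. The following is a minimal sketch; the variable names and the sample text are illustrative and not part of the original guide.

```python
import re

raw_pattern = r"\bhy\b"    # backslash and "b" stay as two separate characters
plain_pattern = "\bhy\b"   # each "\b" collapses into one backspace character (ASCII 8)

raw_length = len(r"\b")    # 2
plain_length = len("\b")   # 1

# Only the raw-string version reaches the regex engine as "word boundary".
raw_hits = re.findall(raw_pattern, "oh hy there")
# The plain version searches for literal backspace characters, which the text lacks.
plain_hits = re.findall(plain_pattern, "oh hy there")

print(raw_length, plain_length)
print(raw_hits, plain_hits)
```

Writing every pattern as a raw string avoids having to remember which escape sequences Python would otherwise interpret before the regex engine sees them.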
Regular Expression Operators
Regular expressions are most often constructed using combinations of operators. The following special characters represent operators in regular expressions: \, (, ), [, ], {, }, +, *, ?, |, $, ^, and the period (.).
We can still build regular expressions without using any of these special characters. However, these expressions would only be able to handle exact matches. For example, the expression potato would match every occurrence of the characters p, o, t, a, t, and o, in that order, within a string.
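As a small illustration of an operator-free pattern, the sketch below searches for the literal text potato. The sample sentence is made up for this example.

```python
import re

sentence = "One potato, two potatoes, but no tomato."

# With no special characters, the pattern only matches this exact run of letters.
literal_hits = re.findall(r"potato", sentence)

# Exact matching is also case-sensitive, so a capitalized word is not found.
capitalized_hits = re.findall(r"potato", "Potato")

print(literal_hits)
print(capitalized_hits)
```

The first search finds the pattern twice (once inside "potatoes"), while "tomato" and "Potato" are skipped because they are not character-for-character matches.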
Leveraging these operators enables us to build much more interesting expressions that can match a wide range of patterns. We'd recommend using interactive tools like regexr.com or regex101.com to practice using these.
Let's take a look at some common operators.
Pattern | Description | Example | Example Matches | Example Non-matches |
---|---|---|---|---|
[] | Denotes a character class. Matches characters in a set (including ranges of characters like 0-9). Use [^] to match characters outside a set. | [top] | t, o, p | s, march, 3 |
. | Matches any character other than the newline character. | 1. | 1a, 1?, 11 | 1, 1\n |
\d | Matches any digit character. Equivalent to [0-9]. \D is the complement and refers to all non-digit characters. | \d\d | 12, 42, 60 | 4, 890 |
\w | Matches any word character. Equivalent to [A-Za-z0-9_]. \W is the complement. | \d\w | 1a, 9_, 4Z | 1-, a5 |
\s | Matches any whitespace character: spaces, tabs, or line breaks. \S is the complement. | \d\s\w | 1 s, 9 _, 4 Z | 1s, 1  s (two spaces) |
* | Matches 0 or more of the previous pattern. | a* | (empty string), a, aa, aaaaa | schmorp, mlep |
+ | Matches 1 or more of the previous pattern. | lo+l | lol, lool, loool | ll, lal |
? | Matches 0 or 1 of the previous pattern. | lo?l | lol, ll | lool, lulz |
\| | Usage: Char1 \| Char2. Matches either Char1 or Char2. | a\|b | a, b | c, d |
() | Creates a group. Matches occurrences of all characters within a group. | (<3)+ | <3, <3<3, <3<3<3 | <<, 33 |
{} | Used like {Min,Max}. Matches a quantity between Min and Max of the previous pattern. | a{2,4} | aa, aaa, aaaa | a, aaaaa |
^ | Matches the beginning of a string. | ^aw+ | aw, aww, awww | wa, waaa |
$ | Matches the end of a string. | \w+y$ | hey, bay, stay | yes, aye |
\b | Matches a word boundary, the beginning or end of a word. | \w+e\b | bridge, smoothie | next, everlasting |
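Several rows of the table above can be checked mechanically. The sketch below is one way to do so, assuming that "matches" means the whole string matches the pattern (re.fullmatch) for most rows, and that the anchor and word-boundary rows are tested with re.search. The helper fully_matches and the variable names are hypothetical, introduced only for this example.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# (pattern, strings expected to match, strings expected not to match)
table_rows = [
    (r"[top]",  ["t", "o", "p"],          ["s", "march", "3"]),
    (r"1.",     ["1a", "1?", "11"],       ["1", "1\n"]),
    (r"\d\d",   ["12", "42", "60"],       ["4", "890"]),
    (r"\d\w",   ["1a", "9_", "4Z"],       ["1-", "a5"]),
    (r"lo+l",   ["lol", "lool", "loool"], ["ll", "lal"]),
    (r"lo?l",   ["lol", "ll"],            ["lool", "lulz"]),
    (r"a|b",    ["a", "b"],               ["c", "d"]),
    (r"(<3)+",  ["<3", "<3<3", "<3<3<3"], ["<<", "33"]),
    (r"a{2,4}", ["aa", "aaa", "aaaa"],    ["a", "aaaaa"]),
]

results = []
for pattern, matches, non_matches in table_rows:
    ok = (all(fully_matches(pattern, s) for s in matches)
          and not any(fully_matches(pattern, s) for s in non_matches))
    results.append((pattern, ok))
    print(pattern, "behaves as tabulated:", ok)

# Anchors and word boundaries are checked with re.search instead.
anchor_rows = [
    (r"^aw+",   ["aw", "aww", "awww"],  ["wa", "waaa"]),
    (r"\w+y$",  ["hey", "bay", "stay"], ["yes", "aye"]),
    (r"\w+e\b", ["bridge", "smoothie"], ["next", "everlasting"]),
]

anchor_results = []
for pattern, matches, non_matches in anchor_rows:
    ok = (all(re.search(pattern, s) for s in matches)
          and not any(re.search(pattern, s) for s in non_matches))
    anchor_results.append((pattern, ok))
    print(pattern, "behaves as tabulated:", ok)
```

Using re.fullmatch for the first group matters: a pattern such as \d\d would still be found inside 890 by re.search, so the "non-match" column only holds when the entire string has to match.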
Regular Expressions in Python
In Python, we use the re module (see the Python documentation for more information) to write regular expressions. The following are some useful functions in the re module:
- re.search(pattern, string) - returns a match object representing the first occurrence of pattern within string
- re.sub(pattern, repl, string) - substitutes all matches of pattern within string with repl
- re.fullmatch(pattern, string) - returns a match object, requiring that pattern matches the entirety of string
- re.match(pattern, string) - returns a match object, requiring that string starts with a substring that matches pattern
- re.findall(pattern, string) - returns a list of strings representing all matches of pattern within string, from left to right
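The functions listed above can be tried out side by side. The following is a minimal sketch that reuses the six-sided dice scenario from the start of this guide; the sample sentence, the phone-number string, and the variable names are made up for illustration.

```python
import re

text = "Roll two six-sided dice, then one more six-sided die."

# re.search: match object for the first occurrence, or None if there is none.
first = re.search(r"six-sided", text)
first_span = first.span()          # (start, end) indices of that first occurrence

# re.sub: replace every occurrence of the pattern.
updated = re.sub(r"six-sided", "25-sided", text)

# re.fullmatch: the pattern has to cover the entire string.
whole = re.fullmatch(r"\d{3}-\d{4}", "555-1234")
partial = re.fullmatch(r"\d{3}-\d{4}", "555-1234 ext. 9")   # None: extra text remains

# re.match: the pattern only has to match at the start of the string.
prefix = re.match(r"\d{3}", "555-1234 ext. 9")

# re.findall: list of all matching substrings, from left to right.
all_numbers = re.findall(r"\d+", "555-1234 ext. 9")

print(first_span)
print(updated)
print(whole is not None, partial is None)
print(prefix.group())
print(all_numbers)
```

Because re.search, re.match, and re.fullmatch return None when nothing matches, it is usually worth checking the result before calling methods such as group() or span() on it.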