Chapter 14 Regular Expressions

In the world of string matching, regular expressions are the best thing since sliced bread. Admittedly, in the somewhat odd world of string matching, they didn’t really have bread in the first place, but still.

XKCD 208: Regular Expression

Regular expressions (or “regexes”) are a declarative[1] language for describing strings of characters (sequences of letters, punctuation, words, sentences, etc.). For example, this is a relatively simple regular expression from http://regexlib.com/Search.aspx?k=email to match email addresses:

^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$
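To get a feel for what "matching" means in practice, here is a minimal sketch that tries the pattern above against two strings. It assumes Python and its standard `re` module, which the chapter itself does not prescribe. The helper name `looks_like_email` and the sample strings are made up for illustration.

```python
import re

# The pattern from the text, written as a raw string so that the
# backslashes reach the regex engine unchanged.
SIMPLE_EMAIL = re.compile(r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$")


def looks_like_email(text):
    """Return True if the string matches the simple pattern (hypothetical helper)."""
    return SIMPLE_EMAIL.match(text) is not None


print(looks_like_email("jane@example.com"))  # True
print(looks_like_email("not an email"))      # False
```

The pattern only describes what an acceptable string looks like. The regex engine does the actual searching.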

There is an excellent online app at https://regex101.com/ that helps you learn about and test regexes by breaking them down and explaining exactly what each part does. For this regular expression, it provides this explanation:

Regex result from RegEx101

Of course, this regular expression does not work for all email addresses. If you studied the broken-down explanation, you’ll have realised that, for one, this regular expression only matches email addresses whose domain names have a 2- or 3-letter top-level domain (TLD, the characters following the last period in a domain name). Many different regexes exist to match email addresses; for example, this one “works 99.99% of the time”:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

So, while the building blocks of regular expressions are relatively simple, things can get out of hand quickly.
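The limits of the shorter pattern are easy to see by running it against a few plausible addresses. The sketch below again assumes Python's standard `re` module. The sample addresses are invented for illustration, and the longer pattern is deliberately left out.

```python
import re

# The short pattern from earlier in the chapter.
SIMPLE_EMAIL = re.compile(r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$")

# Hypothetical sample addresses, all of them plausible in real life.
candidates = [
    "jane@example.com",       # 3-letter TLD
    "jane@example.info",      # 4-letter TLD
    "jane.doe@example.com",   # period in the part before the @
    "jane@mail.example.com",  # subdomain, so two periods after the @
]

# Map each address to whether the short pattern accepts it.
results = {address: SIMPLE_EMAIL.match(address) is not None for address in candidates}

for address, matched in results.items():
    print(f"{address}: {matched}")
```

Only the first address is accepted. The other three fail for different reasons: the TLD length limit, the `\w+` before the `@` (which does not allow a period), and the single `\.` allowed after it. Covering cases like these is what makes the longer patterns grow.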

XKCD 1171: Perl Problems


  1. Declarative languages describe something: HTML and regexes, for example, contain descriptions rather than commands, whereas imperative languages, like R or JavaScript, contain commands.