Chapter 14 Regular Expressions
In the world of string matching, regular expressions are the best thing since sliced bread. Admittedly, in the somewhat odd world of string matching, they didn’t really have bread in the first place, but still.
Regular expressions (or “regexes”) are a declarative² language for describing strings of characters (sequences of letters, punctuation, words, sentences, etc.). For example, this is a relatively simple regular expression from http://regexlib.com/Search.aspx?k=email to match email addresses:
^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$
There is an excellent online app at https://regex101.com/ that helps you learn about and test regexes by breaking them down and explaining exactly what each part does. For this regular expression, it provides a step-by-step explanation of each part.
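The regex101 output itself is not reproduced here. As a rough stand-in, the same pattern can be annotated by hand. The sketch below uses Python, which is an assumption made for illustration only, and the name `EMAIL_SIMPLE` and the sample addresses are hypothetical. Python's `re.VERBOSE` flag allows whitespace and comments inside a pattern without changing what it matches.

```python
import re

# Hand-annotated copy of the simple email pattern shown above.
# With re.VERBOSE, whitespace and "#" comments inside the pattern are ignored.
EMAIL_SIMPLE = re.compile(
    r"""
    ^                # start of the string
    \w+              # one or more word characters (letters, digits, underscore)
    @                # a literal at-sign
    [a-zA-Z_]+?      # one or more letters or underscores (lazy quantifier)
    \.               # a literal period
    [a-zA-Z]{2,3}    # two or three letters: the top-level domain
    $                # end of the string (or just before a trailing newline)
    """,
    re.VERBOSE,
)

# Made-up addresses, used only to exercise the pattern.
print(EMAIL_SIMPLE.match("jane@example.com") is not None)   # True
print(EMAIL_SIMPLE.match("jane@example.info") is not None)  # False: 4-letter TLD
```

Reading the pattern one piece per line makes it easier to see which part is responsible when an address is accepted or rejected.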
Of course, this regular expression does not work for all email addresses. If you study the broken-down explanation, you will realise that, for one, this regular expression only matches email addresses whose domain names have a 2- or 3-letter top-level domain (TLD, the characters following the last period in a domain name). Many different regexes exist to match email addresses; for example, this one “works 99.99% of the time”:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
So, while the building blocks of regular expressions are relatively simple, things can get out of hand quickly.
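To make the limits of the simple pattern concrete, here is a minimal sketch that runs it against a handful of addresses. Python is again an assumption made for illustration. The helper `matches_simple` and every sample address are hypothetical, invented only to probe the pattern.

```python
import re

# The simple email pattern from earlier in the chapter, unchanged.
SIMPLE_EMAIL = r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$"


def matches_simple(address: str) -> bool:
    """Return True if the address matches the simple pattern."""
    return re.match(SIMPLE_EMAIL, address) is not None


# Made-up sample addresses, each probing a different part of the pattern.
samples = [
    "jane@example.com",        # accepted: plain name, 3-letter TLD
    "jane@example.co",         # accepted: 2-letter TLD
    "jane@example.info",       # rejected: 4-letter TLD
    "jane.doe@example.com",    # rejected: period before the at-sign
    "jane@mail.example.com",   # rejected: more than one period in the domain
    "jane@example-site.org",   # rejected: hyphen in the domain
]

for address in samples:
    print(f"{address:<24} {matches_simple(address)}")
```

Each rejected address is perfectly plausible, which is why patterns that aim for better coverage grow so much longer.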
² Declarative languages describe something: HTML and regexes, for example, contain descriptions rather than commands, whereas imperative languages, like R or JavaScript, contain commands.