We interviewed Liam Wiltshire, who is one of our speakers at ConFoo Vancouver 2016. His presentation is titled “RegEx Is Your Friend”. Liam is a senior developer and business manager. He currently focuses on providing gCommerce platforms for multiplayer sandbox games. He lives in the United Kingdom.
What are regular expressions and how are they useful?
RegEx is a powerful way of defining a search pattern that can be used to match and replace parts of a string. Its power comes from its flexibility -- when you need to match parts of a string, but you don't know what the specifics are (for example, it's a number but you don't know which number, or a URL, but you don't know exactly what the URL will be), RegEx comes into its own. RegEx is used for a wide range of purposes, notably:
Data Validation: Checking user-provided data against a regular expression pattern that describes the format you are expecting is an effective way of validating the data you receive. If you are handling user data, you almost certainly want to check that the data matches what you are expecting -- that if you've asked for a phone number, then the user has entered a phone number, or that you're not about to add 'Little Bobby Tables' to your database. Or equally, checking that the data provided doesn't contain things you don't want (HTML, URLs, etc).
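As a rough illustration of this kind of validation, here is a minimal Python sketch. The phone format (ten digits in a 3-3-4 layout with optional separators) and the names `PHONE_PATTERN` and `is_valid_phone` are assumptions made for the example, not something taken from the talk.

```python
import re

# Hypothetical format: 3-3-4 digits, each group optionally separated
# by a space, a dot or a hyphen.
PHONE_PATTERN = re.compile(r"[0-9]{3}[ .-]?[0-9]{3}[ .-]?[0-9]{4}")


def is_valid_phone(value: str) -> bool:
    """Return True only if the whole string matches the expected format."""
    # fullmatch requires the pattern to cover the entire string,
    # so leading or trailing junk is rejected.
    return PHONE_PATTERN.fullmatch(value) is not None


print(is_valid_phone("604-555-0199"))
print(is_valid_phone("Robert'); DROP TABLE Students;--"))
```

Matching the whole string rather than searching inside it is the important design choice here: a search would accept any input that merely contains something phone-shaped.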
Text Replacement: Find and Replace is a tool that most of us use regularly. That could be in your OS (for example, sed), in your email client or word processor, amongst many other things. Regular expressions are often used to strip out data when you don't know the specifics of what you need to remove - for example if you want to strip all URLs out of a posted comment (where you know the format a URL follows, but not exactly what domain, what scheme etc.). RegEx allows you to replace based on what the text 'looks like', rather than requiring an exact match.
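A small Python sketch of the URL-stripping case might look like this. The URL definition used here (a scheme, then "://", then everything up to the next whitespace) is a deliberately simple assumption, and `strip_urls` and the placeholder text are hypothetical names chosen for the example.

```python
import re

# Assumed, simplified shape of a URL: scheme, "://", then any run of
# non-whitespace characters. Real-world URL matching is fuzzier than this.
URL_PATTERN = re.compile(r"\b[a-z][a-z0-9+.-]*://\S+", re.IGNORECASE)


def strip_urls(comment: str, placeholder: str = "[link removed]") -> str:
    """Replace anything that looks like a URL with a fixed placeholder."""
    return URL_PATTERN.sub(placeholder, comment)


print(strip_urls("Great post! See https://example.com/a?b=1 and ftp://files.example.org/x"))
```

The pattern never names a domain or a scheme, which is the point made above: the replacement is driven by what the text looks like, not by an exact string.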
Text Extraction: In our jobs, we regularly deal with data manipulation. We'd like to think that all the data we work with is well formatted, but everyone has to pull data from an HTML page at some time! RegEx can be employed to extract patterns of data (email addresses, URLs or anything else) from within a larger block of text.
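Extraction can be sketched in Python along these lines. The email pattern is a simplified assumption (it does not try to cover every valid address), and `extract_emails` and the sample HTML are made up for the example.

```python
import re

# Assumed, simplified email shape: local part, "@", then a domain
# containing at least one dot.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text: str) -> list[str]:
    """Return every substring of text that looks like an email address."""
    # The only group in the pattern is non-capturing, so findall
    # returns the full match for each hit.
    return EMAIL_PATTERN.findall(text)


html = '<p>Contact <a href="mailto:alice@example.com">Alice</a> or bob.smith@mail.example.org.</p>'
print(extract_emails(html))
```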
Pattern Matching/Configuration: If you've ever had to configure a web server, or modify a fail2ban rule, or configure almost anything else in a *nix environment, you've probably had to work with regular expressions.
Why are people afraid to try regular expressions?
Regular Expressions get a bad rap for being slow, hard to understand, and hard to maintain. I've seen developers go to amazing lengths to avoid using them -- chaining string splits 10 or more times, or using fixed string replaces to replace everything you don't need until you get to the bits you do. However, as with everything in development, it depends on how it's used. Just as you can have terribly written code (be that PHP, Perl, Python or anything else) which is slow, looks terrible, uses copy-paste everywhere, and is unmaintainable, you can also find code in those exact same languages that is practically a work of art. RegEx done right can be just as effective, and easy to use and understand.
How do you keep regular expressions readable?
On the whole, the rules are the same as they are with any programming language - whitespace, comments and references! Most flavours of RegEx include a modifier to tell the engine to ignore whitespace in the pattern (in PHP and Perl this is the 'x' modifier; in Python it is the re.X flag, which can also be written as re.VERBOSE). This then lets you add comments to your pattern and break it up over multiple lines, instantly making it much more readable.
Equally, one of the hardest things to read in RegEx is a back reference. When you start littering your pattern with \1, \2, \3 and more, and then trying to work out which capturing group each one relates to, things can get very confusing. Most flavours of RegEx allow you to give these capturing groups names, so that you can give each one an understandable reference, and then use that reference within the pattern to make the whole thing more readable. For example, both of these patterns match a Canadian postcode, but which is the easier to understand?
([ABCEGHJKLMNPRSTVXY] (?# Starts with any letter except D,F,I,O,Q,U,W or Z)
[0-9] (?# Exactly one number)
[ABCEGHJKLMNPRSTVWXYZ]) (?# Any letter except D,F,I,O,Q or U)
([0-9] (?# Exactly one number)
[ABCEGHJKLMNPRSTVWXYZ] (?# Any letter except D,F,I,O,Q or U)
[0-9]) (?# Exactly one number)
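To make the comparison concrete, here is a hedged Python sketch that puts a compact form next to a commented, named-group form. The character classes are the ones from the pattern above; the group names `area` and `unit`, the optional space between the two halves, and the variable names are assumptions added for the example.

```python
import re

# Compact form: correct, but the reader has to decode it character by character.
COMPACT = re.compile(
    r"([ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ]) ?"
    r"([0-9][ABCEGHJKLMNPRSTVWXYZ][0-9])"
)

# Verbose form: whitespace in the pattern is ignored and "#" starts a comment.
VERBOSE = re.compile(
    r"""
    (?P<area>                     # first half (group name is an assumption)
        [ABCEGHJKLMNPRSTVXY]      # any letter except D, F, I, O, Q, U, W or Z
        [0-9]                     # exactly one digit
        [ABCEGHJKLMNPRSTVWXYZ]    # any letter except D, F, I, O, Q or U
    )
    [ ]?                          # optional space, written as a class so it survives verbose mode
    (?P<unit>                     # second half (group name is an assumption)
        [0-9]                     # exactly one digit
        [ABCEGHJKLMNPRSTVWXYZ]    # any letter except D, F, I, O, Q or U
        [0-9]                     # exactly one digit
    )
    """,
    re.VERBOSE,
)

match = VERBOSE.fullmatch("K1A 0B1")
print(match["area"], match["unit"])
```

With named groups the calling code reads `match["area"]` instead of `match.group(1)`, so the pattern and the code that consumes it stay understandable together.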
What is your top RegEx performance advice?
That's a tough one. The key to it is -- much the same as with anything else -- understanding as much as possible how the engine processes your pattern. By knowing the order in which things are checked, what the steps are in testing a string against your pattern, and how many steps it will take to do that, you can really help improve your RegEx performance. As an example, consider the pattern:
If you tested those two patterns against the string "I really like PHP!", then the captured group will be the same in both instances, but the first pattern needs to go through 39 steps, whereas the second only requires 4! The negated character class allows for a much more optimised test in the RegEx engine, resulting in a much 'lighter' pattern for the same result.
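The same idea can be sketched in Python with a hypothetical pair of patterns. These two patterns are assumptions chosen to show a wildcard next to a negated character class; they are not necessarily the ones the step counts above refer to. Python's `re` module does not report step counts, so the sketch only confirms that both patterns capture the same text.

```python
import re

text = "I really like PHP!"

# Hypothetical wildcard version: the lazy ".*?" grows the capture one
# character at a time and tests for "!" after each one.
lazy = re.compile(r"(.*?)!")

# Hypothetical negated-class version: "[^!]*" consumes every character
# that is not "!" in a single run, then matches the "!" that follows.
negated = re.compile(r"([^!]*)!")

lazy_capture = lazy.match(text).group(1)
negated_capture = negated.match(text).group(1)

print(lazy_capture)
print(negated_capture)
```

To see the actual step counts for a pair like this, a debugger that exposes the engine's steps is needed, since the result alone looks identical.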
Which tools would you recommend for debugging regular expressions?
The tool I most commonly use is regex101.com. It's great for quickly knocking up patterns and testing them against text in real time, as well as being able to explain what a pattern is doing. It can 'beautify' patterns for you and show you how the RegEx engine will process your pattern and the number of steps required to do so. All of this helps you write better performing RegEx.