Confront your greatest fear and parse a string with a Regular Expression
In this blog we usually talk about tech management but let's have a refreshing tutorial on a less advanced topic!
Regular expressions are a scary thing and can take quite a while to digest, even for mid-level developers. Many useful tools exist, such as regex101, which decodes the syntax for you, or debuggex, which lets you visualize the expression as a finite state machine.
But there is nothing like getting your hands dirty to understand how something really works! Something that eluded me for years is how to match a string literal, in particular one that contains an escaped quote.
Let’s start at the beginning. We want to match a basic string. Regular expressions are written in JS syntax.
# To match
"hello"
# Regular expression
/".*"/
The structure is simple: first a quote, then any character any number of times, then another quote. But in real life you’re probably working on a parser. For example:
<something foo="bar" bar="foo" />
In this case the regular expression is going to get greedy and return "bar" bar="foo", which is not what we want.
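If you want to see the greedy behavior for yourself, here is a minimal sketch you can paste into a console. The variable names are mine and purely illustrative.

```javascript
// The tag from the example above.
const tag = '<something foo="bar" bar="foo" />';

// Greedy: .* runs to the end of the input, then backtracks to the LAST quote.
const greedyMatch = tag.match(/".*"/)[0];

console.log(greedyMatch); // "bar" bar="foo"
```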
The first trick is to tell the regular expression not to be greedy by adding the ? symbol after the quantifier.
# To match
<something foo="bar" bar="foo" />
# Regular expression
/".*?"/
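A quick way to check the non-greedy version is to collect every match with the g flag. This is only a sketch; the names `markup` and `values` are made up for the example.

```javascript
const markup = '<something foo="bar" bar="foo" />';

// Lazy: .*? stops at the FIRST closing quote it can reach.
// The g flag plus matchAll collects every attribute value.
const values = [...markup.matchAll(/".*?"/g)].map((m) => m[0]);

console.log(values); // [ '"bar"', '"foo"' ]
```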
That’s fine, but if, as in most cases, you want to allow your users to put quotes in the string by escaping them, you’ll be out of luck. This, for example, will not work:
const name = "Dwayne \"The Rock\" Johnson";
// Will match: "Dwayne \" and " Johnson"
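Here is a small sketch of that failure. I am using String.raw so that the backslashes reach the regular expression exactly as they would appear in a source file; the variable names are illustrative.

```javascript
// String.raw keeps the backslashes exactly as they appear in the source text.
const line = String.raw`const name = "Dwayne \"The Rock\" Johnson";`;

// The lazy pattern knows nothing about escapes, so it stops at the escaped quote.
const pieces = [...line.matchAll(/".*?"/g)].map((m) => m[0]);

console.log(pieces.length); // 2
console.log(pieces[0]);     // "Dwayne \"
console.log(pieces[1]);     // " Johnson"
```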
This part had me perplexed for the longest time. There are different ways to solve it; my personal favorite is to consider what we want to allow within our string. Namely:
Any character that isn’t an end quote is fine: [^"] in regex language (inside brackets, ^ means “not”)
Any escape sequence, aka something that starts with a backslash: \\. in regex
Since we don’t want the first pattern to eat up what the second should match (after all, a backslash is “not a quote”), we put the escape sequence first in the alternation, so that a backslash is consumed together with the character it escapes.
# To match
const name = "Dwayne \"The Rock\" Johnson";
# Regular Expression
/"(\\.|[^"])*"/
And that’s it! You are now matching an escaped string. Not that scary anymore?
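To convince yourself that the order of the alternatives matters, here is a small sketch that runs the pattern above next to the same pattern with its two alternatives swapped. The swapped pattern and all variable names are mine, added only for comparison.

```javascript
const snippet = String.raw`const name = "Dwayne \"The Rock\" Johnson";`;

// Escape sequence first, then "anything that is not a quote".
const escapedFirst = /"(\\.|[^"])*"/;
const good = snippet.match(escapedFirst)[0];
console.log(good); // "Dwayne \"The Rock\" Johnson"

// Same alternatives in the opposite order: [^"] consumes the backslash on
// its own, so the escaped quote ends the match early.
const quoteFirst = /"([^"]|\\.)*"/;
const bad = snippet.match(quoteFirst)[0];
console.log(bad); // "Dwayne \"
```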
Let’s study a second method, which I found inside Lark (an amazing package, by the way). It’s both simpler and more confusing, and it does not work with older JavaScript engines that lack lookbehind assertions (added in ES2018), but let’s go into it.
Essentially, if you say that “escaped quotes must not terminate the string” then it means that “the last quote of the string can’t be escaped”. That’s something we can easily check with a negative lookbehind assertion:
# To match
const name = "Dwayne \"The Rock\" Johnson";
# Regular Expression
/".*?(?<!\\)"/
The novelty here is that instead of just having a non-greedy match-all (.*?), we’re adding at the end a negative lookbehind assertion (?<!\\) to check that there is no backslash right before the closing quote. This has a drawback, however: you can’t escape a backslash right before the end of the string, because then the last quote would be preceded by a backslash (still with me?). In short, this doesn’t work:
const effect = "Domino \\";
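Both behaviors can be checked with a short sketch, assuming an engine that supports lookbehind. The variable names are illustrative.

```javascript
const lookbehind = /".*?(?<!\\)"/;

// Escaped quotes inside the string are handled correctly...
const rock = String.raw`const name = "Dwayne \"The Rock\" Johnson";`;
const rockMatch = rock.match(lookbehind);
console.log(rockMatch[0]); // "Dwayne \"The Rock\" Johnson"

// ...but a string ending with an escaped backslash is rejected, because the
// closing quote is preceded by a backslash.
const domino = String.raw`const effect = "Domino \\";`;
const dominoMatch = domino.match(lookbehind);
console.log(dominoMatch); // null
```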
But fortunately, we can allow the string to end with escaped backslashes, by optionally matching pairs of backslashes right before the closing quote!
# To match
const effect = "Domino \\";
# Regular Expression
/".*?(?<!\\)(\\\\)*?"/
And here we are! Matching strings another way.
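As a last sanity check, here is a sketch that runs this final pattern against both of the troublesome inputs from this article. Again, it assumes lookbehind support, and the names are only for illustration.

```javascript
const finalPattern = /".*?(?<!\\)(\\\\)*?"/;

// A string that ends with an escaped backslash now matches in full.
const dominoLine = String.raw`const effect = "Domino \\";`;
const dominoResult = dominoLine.match(finalPattern)[0];
console.log(dominoResult); // "Domino \\"

// Escaped quotes in the middle of the string still work.
const rockLine = String.raw`const name = "Dwayne \"The Rock\" Johnson";`;
const rockResult = rockLine.match(finalPattern)[0];
console.log(rockResult); // "Dwayne \"The Rock\" Johnson"
```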
Let’s hope that this problem-oriented walkthrough helped you understand relatively advanced thought patterns in regular expressions. Often you’ll run into problems that seem intractable without the proper knowledge but which can easily be unlocked if you master regular expressions or, better still, parsers! But that’s for another article.