How to understand the language of
<\/?[\w\s]*>|<.+[\W]>
Ever wondered how regex works under the hood? How does it understand an incantation like "<\/?[\w\s]*>|<.+[\W]>"
and magically produces a desired result? This series is going to describe exactly how it works and how to implement a feature-rich regex engine.
Now you might be wondering, why would you want to do that? Well, turns out, this is a fantastic learning opportunity. Parsers, compilers, finite automaton, graphs, trees, extended grapheme clusters - it has everything! Last but not least, you get a chance to learn regex.
Regular Expressions #
Regular expression patterns can be as simple as “https?
”, where “?
” is a Zero or One Quantifier and which matches either http
or https
. Or as complex as this, which validates an email address, I think:
((?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:a(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Implementing an engine capable of matching patterns like “https?
” is easy because you can cut corners. Building one that supports the majority of the features of a modern regex engine is not.
If you are not familiar with regex or need a refresher, I would highly recommend going through this Quick Reference or any other tutorial on the subject. There is going to be no introduction into regex in this series. You only need to know the basics to continue.
To make it more challenging, I also wanted it to have performance comparable to NSRegularExpression
which uses highly optimized ICU regex engine written in C.
Overview #
There are four main pieces of the puzzle that needs to be solved to make it all work.
Grammar #
Before writing a parser for a regular expression language, one needs to define the rules of the language, or grammar. This is what Regex, Prologue: Grammar is about. This article outlines some of the theory behind languages and grammars. If that’s not what you are interested in, you can skip it and jump straight into parsing. However, I would still recommend revisiting this article later.
Parser #
Regular expressions have complicated syntax with many constructs, including recursive ones, like Capture Groups. The pattern itself is just a raw string. To reason about it, you first need to parse it and create an abstract representation which you can easily manipulate – an abstract syntax tree (AST). In Regex, Part 1: Parser we will implement such parser using Parser Combinators (or monadic parsers) which are a fantastic example of functional programming used to bring practical benefits.
Compiler #
In Regex, Part 2: Compiler we will learn about Finite State Automaton and how you can use it to represent regular expressions. In this part, some of the theory from the part about grammars will come in handy.
Matcher #
And finally, in Regex, Part 3: Matcher I will introduce two very different matcher algorithms. The first one uses backtracking and is similar to the algorithm found in .NET and other platforms. The second one guarantees linear time complexity and is close to what is used by RE2 created by Google.
What’s Next #
I hope I got you excited! This is going to be a fun series packed with ideas and concepts. It will be useful for anyone, regardless of whether you have a CS background or not. You can find a complete implementation on GitHub and if you have any comments or suggestions, please, feel free to hit me up on Twitter.