# Binary Expression Parsing in Haskell

This post is a basic guide to building a binary expression parser with operator precedence entirely from scratch in Haskell. This information is surprisingly hard to find on the internet, so I’ve decided to make a blog post about it.

Why build an expression parser? If you want to build a compiler, this is usually the first place to start. Most programming languages support binary expressions like `1*2+3`

, and parsing is most of the work when compiling expressions.

This post assumes intermediate knowledge of Haskell. You should be somewhat familiar with functors, applicative functors, and monads. I recommend spending some time understanding these topics before continuing, otherwise the code later in the post won’t make much sense.

Before we begin, I highly recommend checking out JSON Parsing from Scratch in Haskell. The parser we’re going to build uses the same basic ideas, just for a different problem.

All the code in this post is available on my GitHub.

## Contents

- Overview
- Grammars
- Parser
- Setup
- Basic Parsers
- Functor
- Applicative
- Alternative
- Monad
- Parsing Expressions
- Testing

## Overview

The expression parser transforms a string representing an expression into a binary tree. Here are some examples:

```
+ + *
/ \ / \ / \
"1*2+3" -> * 3 "2+4*6" -> 2 * "3*6*9" -> * 9
/ \ / \ / \
1 2 4 6 3 6
```

The binary tree is structured in such a way that operators with higher precedence are lower in the tree. In these examples, the multiplication operator `*`

has higher precedence than the addition operator `+`

, so `*`

is always lower in the tree. In general, every path downwards in the tree must have operators with increasing precedence.

By structuring the tree this way, we’re able to correctly evaluate the entire expression. We just need to traverse the binary tree bottom-up, evaluating the operators with the highest precedence first. In the first example, we first calculate `1 * 2`

, which is `2`

, and then calculate `2 + 3`

, which returns the final result `5`

.

This post focuses on the hard part: parsing an expression into the correct tree. There are multiple algorithms that solve this problem. The algorithm we’re going to implement is called precedence-climbing, and is the best approach for the kind of parser we’re going to build. We’re going to work our way up towards it by first building simple parsers, then gradually building more complex parsers until we get our final expression parser.

# Specs

The expression parser will support unsigned integers and the following operators:

```
Operator Precedence Associativity
------------------------------------------
* / % 5 Left to right
+ - 4 Left to right
& 3 Left to right
^ 2 Left to right
| 1 Left to right
= 0 Right to left
```

These operators are a subset of the binary operators in C. They include multiplicative, additive, bit manipulation, and assignment operators. I’ve added the assignment operator since it’s right-associative, making our expression parser a bit more interesting.

Before we dive into building the parser, let’s quickly review how parsers work.

## Grammars

Parsers operate according to specific grammar, which defines the rules that the input stream is expected to follow. Here’s the grammar for the expression parser we’re going to build:

```
expression: int-constant (op int-constant)*
int-constant: ('0' | ... | '9')+
op: '*' | '/' | '%' | '+' | '-' | '&' | '^' | '|' | '='
```

Grammars can be in any particular form. Here, the left side defines the name of a specific rule, which are called *nonterminals*, and the right side defines the pattern that the input string is expected to follow.

Characters enclosed in `''`

are called *terminals*. Terminals are the actual characters that the parser will consume. The characters `+ * | ( )`

are called *qualifiers*. Qualifiers modify the behaviour of the parser. Here’s what they mean:

`*`

- Zero or more times

`+`

- One or more times

`|`

- Select alternatives

`()`

- Parse group

This is what the grammar defines: an expression must start with one integer, and may contain many groups of operators and integers afterwards. An integer must be one or more digits, and an operator must be one of these characters: `* / % + - & ^ | =`

. Spaces within expressions are not supported, though this is easy to implement later.

Now that we’ve defined our grammar, let’s review what a parser actually is.

## Parser

A parser transforms a stream of input into a data structure, according to a specific grammar. The expression parser that we’re going to build transforms a string into a binary tree.

There are several ways to build a parser. The simplest approach is to parse the input recursively by creating a function for every nonterminal in the grammar. This is called recursive descent. Each function attempts to match the pattern in the nonterminal with the start of the remaining input. If the match is successful, the function consumes that part of the input and outputs a subset of the final data structure. We then combine these functions to build the final data structure.

The expression parser combines the simpler integer and operator parse functions to build a tree.

```
expression: int-constant (op int-constant)*
```

Each function only outputs a single node in the tree. The expression parser then joins these nodes together to build increasingly large subtrees until we get the final expression tree.

There are a few ways we can combine parse functions. We can either call the functions directly, or use parser combinators. A parser combinator is a function that takes one or more parse functions as input, and returns another parse function as output. It lets us build more complicated parsers out of simpler ones through function composition. For example, if we want to parse a specific string, we can combine multiple single character parsers together in sequence.

There are other ways to build a parser. A parser generator automates most of the task for you, but requires giving up control to a generalist tool. For other approaches, check the wikipedia page on parsing.

Okay, enough theory. Let’s actually build this thing!

## Setup

This code compiles on ghc version `9.0.1`

, though any version should do. We’re also going to use the ghci interpreter to interactively test our parse functions. The only third party Haskell library we should have installed is QuickCheck, which we will use to test the properties of our parser. If you don’t want to run any tests, you can skip this step.

Let’s start with some imports:

Let’s implement our data types:

`Expr`

is the data type for the binary tree. The binary operator data type `Op`

contains the type, precedence and fixity for a specific operator.

Now that we have our basic data types, let’s define our parse functions.

# The Parser Type

Each parse function will consume part of the input string and return some intermediate output and the remaining string. With error handling, we can think of every parser as having this declaration:

This function takes a string as input and returns either an error, or a tuple containing some intermediate output `a`

and the remaining string. For simplicity, this error type is just a string:

In order to simplify building parser combinators, let’s wrap this function in a `newtype`

:

With `Parser`

defined as a specific type, we can make it an instance of a bunch of useful typeclasses that we’ll need to combine parse functions later. This will also simplify our function declarations and allow us to easily change the internals of the parser if we want to make changes.

We now have all the necessary data types to implement our first parse function. Let’s start with the simplest of them all.

## Basic Parsers

# Satisfy

The satisfy parser is the basic atom of our parser combinator, among which all other parsers will be built. This parser takes a boolean function that tests some condition on the first character of the input string, and returns that character if the condition is true.

Let’s run this in ghci:

With `satisfy`

, we can start to build slightly more interesting parsers. Let’s move on to parsing single characters and digits.

# Char

The char parser simply parses the input character.

Examples:

# Digit

The digit parser parses any digit character.

Examples:

With `char`

and `digit`

, we now have two very simple parsers that are able to parse a single character. This is a good first step, but we need parsers that output the correct types for our binary tree: `Op`

and `Int`

. Since we’re using parser combinators, we can wrap these parsers within other parsers to modify their output. Let’s implement our first parser combinator that does this.

## Functor

Functor defines a function called `fmap`

which is basically just `map`

but for more complex data types. With functors, we can apply any function to the output of a parser.

We can now create new parsers that convert the outputs of `char`

and `digit`

into `Op`

and `Int`

respectively. Let’s start with converting digit characters to integers.

# Int

The int parser runs the digit parser and uses `fmap`

to convert its output to an integer.

Examples:

Note that this only converts a single digit character to an integer. Later, we’re going to extend this parser to support multiple characters. Let’s move on to parsing operators.

# Operators

Starting with the multiply operator, this parser runs the char parser and overwrites its output with the operator data type `Op`

.

The `<$`

function overwrites the output of the parser on the right with the value on the left. In this case, we’re overwriting `'*'`

with `Op Mul 5 LFix`

. We get this function for free when implementing the `Functor`

instance.

Using `mulOp`

as a template, we can easily implement parsers for every other operator. We just need to make sure that the precedence and associativity for each operator are correct.

Examples:

Exciting stuff. But what if we want to parse *any* operator? We need some way to try multiple operator parsers, and return the output of the first one that succeeds. For the integer parser, we also need a way to parse more than a single digit. Let’s implement our next two parser combinators that solve these problems.

## Applicative

Applicative functors are the same as functors, except that the input function must reside in the *output* of a parser.

This instance is only necessary because the next typeclasses we’re going to implement require `Applicative`

as a dependency (for ghc versions `7.10`

and above).

We’re not going to be directly using these functions, so it’s not particularly important that we understand how they work. Let’s move on to the next parser combinator that we’re actually going to use.

## Alternative

Alternative lets us try a second parser if the first one fails.

Alternative also provides the `some`

and `many`

functions that can run a parser in a loop. The `some`

function runs a parser one or more times, while the `many`

function runs a parser zero or more times. Both loops terminate where there is a parse error and return a list containing each parsed element.

We can now select between different parsers and run parsers in a loop. Let’s use this functionality to build an operator parser that can select between different operators.

# Operator

The op parser simply tries a bunch of operator parsers and returns the output of the first match.

Examples:

Now that our parsers are getting more complex, let’s write a parser combinator that provides more helpful error messages. The error message in the last example isn’t particularly relevant to parsing operators.

# Custom Errors

The `<?>`

function runs a parser and overwrites its error message with the input string if it fails.

Since we always want to run this function last, we’ve given it an infix precedence of `1`

. Let’s now update our operator parser to provide a more helpful error message.

Examples:

Much better! Let’s now move on to parsing integers with multiple digits.

# Int

The updated int parser now converts a string of digits into an integer.

This parser runs the digit parser one or more times and returns a string as its output. This string is then converted to an integer with the built-in `read`

function.

Examples:

With `int`

and `op`

, we’re now able to parse unsigned integers and select between different operators. We’ll use these parsers to output the nodes in our tree. We now just need a way to sequence them together to build increasingly large subtrees. Let’s implement our final parser combinator.

## Monad

Monad allows us to sequence multiple parsers together, piping the output of one parser into another.

The `>>=`

function pipes the output of the first parser into a function-wrapped second parser that takes a single argument of the same type. This wrapped parser is free to do whatever it wants with the first parser’s output.

We will use monads in our final expression parser to join the outputs of `int`

and `op`

into a binary tree. Before we get there, we need just two more parsers.

# Option

The option parser attempts to run a specific parser. If it fails, it returns the input value.

Examples:

This parser will come in handy when parsing expressions that might contain a parse error. Instead of throwing an exception, we’ll use it to return the current subtree.

There is one last parser that we need. This one is not related to `Monad`

, but we’re going to need it to parse expressions.

# Lookahead

The lookahead parser runs a specific parser but does not consume any input if it succeeds.

Examples:

Lookahead allows us to peek at the next characters in the input. The expression parser will use lookahead extensively to determine whether or not to return from a recursive call.

We’ve now built every parser we need to parse binary expressions. Let’s finally implement our expression parser!

## Parsing Expressions

The expression parser that we’re going to build transforms a string representing an expression into a binary tree. Now that we have our `int`

and `op`

parsers, we can output single nodes in our binary tree. Using `>>=`

, we can then sequence them together to form increasingly large subtrees. Let’s now start with parsing the simplest expression, an integer constant.

This parser just wraps the output of the int parser with `IntConst`

and returns an expression. Let’s now move on to the more interesting part: parsing binary expressions. The algorithm I’m going to describe here is called precedence-climbing.

The algorithm takes a minimum precedence value and a left subtree as input. It peeks at the next operator, and recursively builds the right subtree if the operator has higher precedence or has the same precedence but is right-associative. Otherwise, it returns the left subtree. After building the right subtree, it joins the left and right subtrees to form the *next* left subtree, and repeats the entire process until the input string is empty or invalid.

Here’s an example. The numbers indicate the ordering of function calls.

```
> runParser expr "1*2+3"
1. expr call: Notes:
-------------------------------------------------------------------------------
input: "1*2+3" Parse integer, then pass result to opExpr with
-1 precedence. This allows any op to come next.
2. opExpr call: Subtree: Notes:
-------------------------------------------------------------------------------
input: "*2+3" * Does the next op have higher precedence? Yes, so
minPrec: -1 / \ initialise a new subtree (shown left) and
left subtree: 1 1 ... recursively compute the right subtree (3).
After completing the subtree, call opExpr again
with the same precedence and the completed
subtree as input (4).
3. Compute right: Subtree: Notes:
-------------------------------------------------------------------------------
input: "+3" 2 Does the next op have higher precedence? No, so
minPrec: 5 return the left subtree and go back to (2).
left subtree: 2
4. opExpr call: Subtree: Notes:
-------------------------------------------------------------------------------
input: "+3" + Does the next op have higher precedence? Yes, so
minPrec: -1 * / \ initialise a new subtree and recursively compute
left subtree: / \ * ... the right subtree (5).
1 2 / \
1 2 After completing the subtree, call opExpr again
(6).
5. Compute right: Subtree: Notes:
-------------------------------------------------------------------------------
input: "" 3 Empty input, so return the left subtree and go
minPrec: 4 back to (4).
left subtree: 3
6. opExpr call: Subtree: Notes:
-------------------------------------------------------------------------------
input: "" + Empty input, return left subtree. Done.
minPrec: -1 / \
left subtree: * 3
/ \
1 2
```

Somewhat of a long example, but hopefully you get the point. Every time we try to build a right subtree, we pass the current operator’s precedence `prec`

and the next node as input. During the recursive call, if the next operator has lower precedence (or has the same precedence but is left-associative), we return the single node as the right subtree.

Once the right subtree is computed, we join the left and right subtrees into the *next* left subtree, and make another recursive call. This time we’re passing the initial precedence `minPrec`

and the joined subtree as input. This makes the joined subtree the left subtree of the next operator, which we now know must have lower precedence. This process repeats until the input string is empty or invalid.

Let’s now implement our final parse function that ties everything together.

Examples:

We now have our final expression parser! Before we move on to testing, let’s write a pretty-printer so we can better visualise the output.

# Pretty-Printing

First, we need to remove the automatic derivation of `Show`

from the binary operator data type `OpType`

and write a custom one instead.

Now let’s implement the pretty-printer. This function indents a specific substring for each line of output.

Examples:

```
> putStrLn $ printExpr $ parseExpr "1*2"
*
├ 1
└ 2
> putStrLn $ printExpr $ parseExpr "1*2+3"
+
├ *
│ ├ 1
│ └ 2
└ 3
> putStrLn $ printExpr $ parseExpr "2+4*6"
+
├ 2
└ *
├ 4
└ 6
```

Not bad. I’ve chosen this format since it’s easier to code, but it’s also possible to print the tree similar to the other examples in this post. Feel free to mess around with this function in ghci, it’s quite satisfying to view the output this way.

Let’s finish by writing some tests to make sure everything works as expected.

## Testing

We’re going to use QuickCheck to test the properties of our expression parser. If you’re not familiar with QuickCheck, this post provides a pretty good explanation:

Basically, QuickCheck generates a bunch of random input and checks whether a function satisfies certain user-defined properties for every input. For our expression parser, there are at least two properties we want to test:

- Every path downwards in the tree has operators with increasing precedence.
- The inorder traversal of the tree is equivalent to the input string.

Before we test these properties, we first need to generate random binary expressions. Let’s write some simple generators.

# Generators

The binary expression generator `genExpr`

generates a random integer between 0-1000, and a random number of groups of operators and integers afterwards.

If any of these generators don’t make sense, make sure to read the post linked above. These generators follow almost exactly from the example generators in that post.

Here’s some sample output using a smaller integer range of 0-20:

Now that we can generate random binary expressions, let’s write our first property.

# Property 1: Increasing Precedence

The first property checks whether every path downwards in the tree has operators with increasing precedence.

For every random input, this property parses it into a binary tree and runs the local `isIncr`

function that recursively checks whether operators are in increasing order.

# Property 2: Inorder Traversal == Input

The second property checks whether the inorder traversal of the binary tree is equivalent to the input string.

This parses the input into a binary tree, converts the tree back into a string, and then checks whether the two strings match.

Let’s now run these two properties:

Nice! We’ve now built a simple binary expression parser with operator precedence entirely from scratch and tested it with randomized inputs. We’ve essentially built the frontend to a small compiler.

This parser provides a good foundation for building a parser for a more sophisticated grammar, like a programming language. In fact, this is the exact approach I’ve taken for a C compiler I’m currently working on.

Thanks for reading this basic guide on binary expression parsing. If you have any comments, please send me an email.

All the code in this post is available here.