Haskell, as implemented in GHC, has a very rich language for expressing computations in types. Thanks to the DataKinds extension, any inductively defined data type can be used not only at the term level, but also at the type level. A notable exception are strings, which provide the main theme for today’s blog post.
String type in Haskell is defined as a list of
the type-level equivalent,
Symbol, is defined as a primitive in GHC,
presumably for efficiency. After all, the type checker passes these
types around, and the simpler their structure, the less potential work
the constraint solver needs to do.
The problem is this: since
Symbol is defined as a primitive, there
is no way to pattern match on its structure, and the only way to
interact with them are by using the built-in primitive operations,
namely appending and (efficient, constant-time) comparison.
In this blog post, I will show how these primitives can be used to recover the ability to do arbitrary introspection of these type-level string literals, thereby enabling a whole range of applications where statically known information can be exploited.
The technique presented here was inspired by Daniel Winograd-Cort’s pull request for the generic-lens library.
All of this is packaged into the symbols library.
I have written about type-level symbol
parsing in PureScript to implement a type-safe
function. (There, I achieved symbol decomposition by patching the
compiler, but no such thing is required here.)
Reusing that example, we will be able to write
The implementation of the printf example using the technique described in this blog post can be found on github.
First, let’s have a look at the primitives GHC provides for
manipulating type of kind
These functions are implemented in the compiler, and exported from the GHC.TypeLits module:
Note that there is no
Uncons primitive that returns the head (first
character) and the tail of the symbol. It turns out that we can
Uncons using the two primitives above.
The fact that
AppendSymbol is a type family suggests a rather
straightforward semantics. It appends two symbols together resulting
in a third one:
That is to say, it should only go in one way, so to speak.
However, if we have a look at the
in GHC, we can see that there’s more going on. There are special rules
for the interaction of
AppendSymbol constraints with equality
constraints. In concrete terms, GHC will solve the following
That is, if we know a prefix of a symbol, we can decompose it to get
the matching suffix. Morally, the actual signature of
AppendSymbol would be closer to
But this can’t be expressed today in GHC (type family dependencies
only allow the inputs to be decided solely by the result, and no
such combination of inputs and outputs are allowed), so
really is a lot more powerful than what the type system would like to admit!
Even with the ability to decompose symbols, there is a problem, however. This decomposition only works if we know what the prefix is. And in general, we need to know two out of the three symbols involved in the constraint to get the third.
As a result, the following won’t work:
suffix is unsolved.
We might think that we can just try all possible characters as potential prefixes until one matches, but that would require backtracking in the constraint solver, and GHC’s constraint solver doesn’t backtrack.
That is, trying a prefix that doesn’t match results in an unsolvable constraint:
But since we can’t backtrack, there is no way to try a different character once we’ve committed to a particular prefix.
If we knew what the first character was, we could strip it off and get the remaining symbol this way, which would allow us to treat Symbols as a list of characters essentially.
It turns out that we can simply use alphabetical ordering to find out
what the first character of a string is.
CmpSymbol compares two symbols,
and returns one of
GT as a result.
Observe that for any string longer than one, it’s always true that the
string follows its first character alphabetically, and precedes any
character after its first one. As an example, consider the string
"hello world", whose first character is
h, and the letter after
i. Then we have
For strings of length one, they will simply return
EQ when compared
with their first character (themselves).
We now put the pieces together to implement an uncons function for
symbols. First, we need
Head, a function that returns the first
character of a symbol. Second, we will use
Head to interact with
AppendSymbol to retrieve the tail of the symbol. Doing this
repeatedly will allow us to turn a symbol into a list of characters,
which in turn can be consumed by ordinary type families.
So, to find out what the first character of a symbol is, we just need to find the last character in the ASCII table that precedes our symbol. To do this reasonably efficiently, we use binary search. Since indexing into a type-level list takes linear time, we use a balanced binary search tree instead. Recall that symbol comparisons are constant-time, so the whole operation is constant time (as we’re working with a fixed size alphabet), so this optimisation simply improves the constant factor by an order of magnitude.
The printable subset of the ASCII character set can be encoded as the following tree:
(I generated this structure with the help of other type families, but found that inlining the result into the source file results in must faster lookups.)
Note that each node contains two consecutive characters: this is so that we can easily decide when to stop: when the first element is less than, and the second element is greater than our input string.
Lookup type family (and
Lookup2, to make up for a lack of local
declarations in type families) implements a standard binary search.
Head is just a lookup in the binary tree.
Next, we need to interact the
AppendSymbol constraint with
Head. We now turn to a type class,
sym is our symbol,
h is the head, and
t is the tail. It would
be nice to have a functional dependency
sym -> h t, but
unfortunately we can’t make that pass, as recall that the backwards
AppendSymbol are essentially hidden from the type
We write a single instance, which sets up the right constraints:
First, we write
h ~ Head sym, which unifies
h with the first
element of the symbol using the binary lookup defined
previously. Then, the
AppendSymbol h t ~ sym constraint will trigger
the solution of
t, due to the now known prefix
uncons member is not necessary for things to work out, but it
helps illustrate the working of the type class in the REPL:
Finally, we can write the
Listify class to recursively break down a
symbol into a list of characters:
And with this, we can parse anything we’d like.
Of course all of the above could be done a lot more efficiently with compiler support, and there’s no reason for that not to happen at some point in the future. This post is just a proof of concept that something like this is already possible today, and the presented technique is suitable for some lightweight applications. For anything larger scale, Template Haskell is probably much better suited for the job today.