### Overview

Haskell, as implemented in GHC, has a very rich language for expressing computations in types. Thanks to the DataKinds extension, any inductively defined data type can be used not only at the term level, but also at the type level. A notable exception are strings, which provide the main theme for today’s blog post.

The `String`

type in Haskell is defined as a list of `Char`

s. However,
the type-level equivalent, `Symbol`

, is defined as a primitive in GHC,
presumably for efficiency. After all, the type checker passes these
types around, and the simpler their structure, the less potential work
the constraint solver needs to do.

The problem is this: since `Symbol`

is defined as a primitive, there
is no way to pattern match on its structure, and the only way to
interact with them are by using the built-in primitive operations,
namely appending and (efficient, constant-time) comparison.

In this blog post, I will show how these primitives can be used to recover the ability to do arbitrary introspection of these type-level string literals, thereby enabling a whole range of applications where statically known information can be exploited.

The technique presented here was inspired by Daniel Winograd-Cort’s pull request for the generic-lens library.

All of this is packaged into the symbols library.

# Motivation

I have written about type-level symbol
parsing in PureScript to implement a type-safe `printf`

function. (There, I achieved symbol decomposition by patching the
compiler, but no such thing is required here.)

Reusing that example, we will be able to write

The implementation of the printf example using the technique described in this blog post can be found on github.

# Primitives

First, let’s have a look at the primitives GHC provides for
manipulating type of kind `Symbol`

, namely `AppendSymbol`

and
`CmpSymbol`

.

These functions are implemented in the compiler, and exported from the GHC.TypeLits module:

Note that there is no `Uncons`

primitive that returns the head (first
character) and the tail of the symbol. It turns out that we can
implement `Uncons`

using the two primitives above.

## AppendSymbol

The fact that `AppendSymbol`

is a type family suggests a rather
straightforward semantics. It appends two symbols together resulting
in a third one:

That is to say, it should only go in one way, so to speak.

However, if we have a look at the
implementation
in GHC, we can see that there’s more going on. There are special rules
for the interaction of `AppendSymbol`

constraints with equality
constraints. In concrete terms, GHC will solve the following
constraint:

That is, if we know a prefix of a symbol, we can decompose it to get
the matching suffix. Morally, the actual signature of
`AppendSymbol`

would be closer to

But this can’t be expressed today in GHC (type family dependencies
only allow the inputs to be decided solely by the result, and no
such combination of inputs and outputs are allowed), so `AppendSymbol`

really is a lot more powerful than what the type system would like to admit!

Even with the ability to decompose symbols, there is a problem,
however. This decomposition only works if we *know* what the prefix
is. And in general, we need to know two out of the three symbols
involved in the constraint to get the third.

As a result, the following won’t work:

that is, `suffix`

is unsolved.

We might think that we can just try all possible characters as potential prefixes until one matches, but that would require backtracking in the constraint solver, and GHC’s constraint solver doesn’t backtrack.

That is, trying a prefix that doesn’t match results in an unsolvable constraint:

But since we can’t backtrack, there is no way to try a different character once we’ve committed to a particular prefix.

*If we knew* what the first character was, we could strip it off
and get the remaining symbol this way, which would allow us to
treat Symbols as a list of characters essentially.

## CmpSymbol

It turns out that we can simply use alphabetical ordering to find out
what the first character of a string is. `CmpSymbol`

compares two symbols,
and returns one of `LT`

, `EQ`

, or `GT`

as a result.

Observe that for any string longer than one, it’s always true that the
string follows its first character alphabetically, and precedes any
character after its first one. As an example, consider the string
`"hello world"`

, whose first character is `h`

, and the letter after
`h`

is `i`

. Then we have

For strings of length one, they will simply return `EQ`

when compared
with their first character (themselves).

# Decomposition

We now put the pieces together to implement an uncons function for
symbols. First, we need `Head`

, a function that returns the first
character of a symbol. Second, we will use `Head`

to interact with
`AppendSymbol`

to retrieve the tail of the symbol. Doing this
repeatedly will allow us to turn a symbol into a list of characters,
which in turn can be consumed by ordinary type families.

## Head

So, to find out what the first character of a symbol is, we just need to find the last character in the ASCII table that precedes our symbol. To do this reasonably efficiently, we use binary search. Since indexing into a type-level list takes linear time, we use a balanced binary search tree instead. Recall that symbol comparisons are constant-time, so the whole operation is constant time (as we’re working with a fixed size alphabet), so this optimisation simply improves the constant factor by an order of magnitude.

The printable subset of the ASCII character set can be encoded as the following tree:

(I generated this structure with the help of other type families, but found that inlining the result into the source file results in must faster lookups.)

Note that each node contains two consecutive characters: this is so that we can easily decide when to stop: when the first element is less than, and the second element is greater than our input string.

The `Lookup`

type family (and `Lookup2`

, to make up for a lack of local
declarations in type families) implements a standard binary search.

Finally, `Head`

is just a lookup in the binary tree.

## Uncons

Next, we need to interact the `AppendSymbol`

constraint with
`Head`

. We now turn to a type class, `Uncons`

:

`sym`

is our symbol, `h`

is the head, and `t`

is the tail. It would
be nice to have a functional dependency `sym -> h t`

, but
unfortunately we can’t make that pass, as recall that the backwards
dependencies of `AppendSymbol`

are essentially hidden from the type
system.

We write a single instance, which sets up the right constraints:

First, we write `h ~ Head sym`

, which unifies `h`

with the first
element of the symbol using the binary lookup defined
previously. Then, the `AppendSymbol h t ~ sym`

constraint will trigger
the solution of `t`

, due to the now known prefix `h`

.

The `uncons`

member is not necessary for things to work out, but it
helps illustrate the working of the type class in the REPL:

Finally, we can write the `Listify`

class to recursively break down a
symbol into a list of characters:

And with this, we can parse anything we’d like.

# Conclusion

Of course all of the above could be done a lot more efficiently with compiler support, and there’s no reason for that not to happen at some point in the future. This post is just a proof of concept that something like this is already possible today, and the presented technique is suitable for some lightweight applications. For anything larger scale, Template Haskell is probably much better suited for the job today.