Fast Unicode Character Lookups

In this post, I'll be going over a very useful application of tries (also called prefix trees): fast key-value lookups for small integers. In particular, these tries are specialized for key sets that are known ahead of time, where lookups need to be extremely fast and the whole structure should stay compact.

Some experience with low-level optimization and data structures is recommended for reading this post.

Motivation

The motivation to explore key-value data structures with the properties that I wanted (listed above) came from a problem in a project I was working on. Given a Unicode character, I wanted to quickly get its combining class, which is an 8-bit number that is assigned by the Unicode specification. For instance, the character ◌̃ (U+0303) has a combining class value of 230.1

Combining classes are always constant (i.e. they can't change at runtime), so I had the ability to create a data structure at compile time instead of computing it during the program's lifetime.

You might already be thinking of a data structure that is a classic, efficient, and interesting solution to this problem: perfect hash tables. For reference, a perfect hash table is a hash table that has no collisions, which is made possible by knowing the keys ahead of time. I used them in my initial implementation, and they were generally pretty fast!

This is the Python translation2 of my hash table lookup code:

def perfect_hash(x: int, salt: int, n: int) -> int:
    mask_32 = 0xFFFFFFFF
    # 2654435769 = 0x9E3779B9, the 32-bit Fibonacci hashing constant
    y = ((x + salt) * 2654435769) & mask_32
    y ^= (x * 0x31415926) & mask_32
    # Multiply-shift: scale the 32-bit hash down to a slot in [0, n)
    return (y * n) >> 32

Only a few instructions, right? Yes, but that was not the problem.
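For the curious, here's roughly how such a salt gets found at build time. This is a minimal sketch of my own (the hypothetical `find_salt` helper is not from the actual project): brute-force salts until every key lands in a distinct slot.

```python
def perfect_hash(x: int, salt: int, n: int) -> int:
    mask_32 = 0xFFFFFFFF
    y = ((x + salt) * 2654435769) & mask_32
    y ^= (x * 0x31415926) & mask_32
    return (y * n) >> 32

def find_salt(keys: list[int], n: int, max_tries: int = 1_000_000) -> int:
    # A salt is "perfect" when no two keys collide in the n-slot table
    for salt in range(max_tries):
        slots = {perfect_hash(k, salt, n) for k in keys}
        if len(slots) == len(keys):
            return salt
    raise ValueError("no perfect salt found; try a larger table")
```

Since the keys are fixed at compile time, this search runs once and the resulting salt is baked into the binary.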

Although they look great on paper, in reality, there are many performance downsides to hash tables, even ones I could fully compute at compile time! Here are a few I ran into:

  1. Every lookup must evaluate the hash function before the memory access can begin, so the multiplications sit on the critical path.
  2. Hashing deliberately scatters keys across the table, so lookups have poor cache locality.
  3. The 32-bit multiplications are awkward to vectorize when classifying many characters at once.3

This all led me to wanting an alternative data structure that didn't have these downsides. After doing some research, tries were the clear best choice.

What is a Trie?

A more descriptive name for a trie is "prefix tree". There are many ways to interpret a trie, but I will go over the perspective that is most useful for this problem.

Consider a naïve table lookup for a 16-bit integer. That is, we have a static array $\text{A}$ that contains $2^{16}$ 8-bit integers (the combining class values). If we have a Unicode character with code point value $x$, we can access its value by performing $\text{A}[x]$. This is theoretically very fast because it takes just one memory access, but the table ends up being 65,536 bytes large!

diagram of naïve implementation
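Sketched in Python, with a couple of hand-filled entries (the real table would be generated from the Unicode data files):

```python
LIMIT = 0x10000            # we only cover the 16-bit code point range here
table = bytearray(LIMIT)   # 65,536 bytes, all zero by default

# Values from the Unicode Character Database
table[0x0303] = 230        # U+0303 COMBINING TILDE
table[0x0316] = 220        # U+0316 COMBINING GRAVE ACCENT BELOW

def combining_class(x: int) -> int:
    return table[x]        # a single memory access
```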

The problem is that, in order to index to the value we want, we need 16 bits of information. What if we try indexing with fewer bits? Let's try only looking at the first 10 bits of $x$ (the prefix). So if we now have a table of size $2^{10}$, we can perform lookups by doing $\text{A}[x \gg 6]$. But since we only used the 10-bit prefix of $x$, there are $2^6$ code points with the same prefix. For example, the prefix of $x = \text{0x1DC0}$ ◌᷀ is $\text{0x77}$ ($x \gg 6$), but $x = \text{0x1DC2}$ ◌᷂ also has a prefix of $\text{0x77}$. These characters have different combining classes, though. Hence we have a problem: we cannot differentiate these two characters when trying to get their combining class values.

diagram of prefix implementation
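You can verify the collision with a couple of shifts:

```python
x1, x2 = 0x1DC0, 0x1DC2

# Both code points share the same 10-bit prefix...
assert x1 >> 6 == x2 >> 6 == 0x77

# ...and only differ in their low 6 bits
assert (x1 & 63, x2 & 63) == (0, 2)
```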

So what if we instead stored an index in $\text{A}$, so that each $i = \text{A}[x \gg 6]$ identifies a group of 64 code points with the same 10-bit prefix as $x$? We can then create a new table, $\text{B}_i$, that holds the values for those 64 code points. To differentiate code points within this small subset, we can index $\text{B}_i$ using the six bits we shifted out when computing $i$: $\text{B}_i[x \,\&\, 63]$.4 If we store the 8-bit combining class at this location, we now have a reliable key-value data structure! This, in essence, is a two-stage trie.

diagram of B_i sub-tables implementation

For simplicity, let's concatenate each $\text{B}_i$ into one big table, $\text{B}$. We index into it using $\text{B}[i + (x \,\&\, 63)]$ (assuming $i$ points to where the 64-byte block starts in the big $\text{B}$ table). You might be wondering how this is an improvement space-wise over our naïve data structure. It is not right now, but it has a lot of potential.

diagram of concatenated B table implementation
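Here is a toy version of the two-stage lookup with made-up values, shrunk to a 4-entry first-stage table so it fits on screen (the real A table would have 2^10 entries):

```python
SHIFT, MASK = 6, 63

# Toy tables: prefixes 0, 2, and 3 all point at the block starting at
# offset 0; prefix 1 gets its own block starting at offset 64.
A = [0, 64, 0, 0]
B = [0] * 128
B[64 + 2] = 230        # value for the key with prefix 1 and low bits 2

def lookup(x: int) -> int:
    i = A[x >> SHIFT]            # first stage: find the block's offset
    return B[i + (x & MASK)]     # second stage: index within the block

# lookup((1 << 6) | 2) == 230; every other key in range maps to 0
```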

Here's just one application of this new key-value lookup scheme. The vast majority of characters have a combining class value of zero, which means most $\text{B}_i$ blocks are "zero blocks" (i.e. filled with 64 zeroes). So in our $\text{A}$ table, we can change every $i$ that points to some zero block in $\text{B}$ to point to one shared zero block, and drop all the other zero blocks. This alone saves us a ton of space in our combining class example.

diagram of deduplicated B table implementation

Further Compaction

There are a few other compaction techniques that are similar to the zero block deduplication:

  1. General block deduplication
  2. Fast prefix-suffix merging
  3. Slow prefix-suffix merging

General block deduplication is just the generalized version of what was described before: keep track of every distinct 64-byte block we've encountered, and point all indices that reference a copy of a block at a single shared instance of it. This is the same thing we did for the zero blocks, but with arbitrary block values. This can save a lot of space on many real-world value distributions.
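A sketch of the idea in isolation (the appendix code does the same thing inside its compact() method, using a dict keyed on block contents):

```python
def dedup_blocks(blocks: list[list[int]]) -> tuple[list[int], list[int]]:
    # Returns (index, data): index[i] is the offset of block i's
    # (possibly shared) copy inside the flattened data table.
    seen: dict[tuple, int] = {}
    index: list[int] = []
    data: list[int] = []
    for block in blocks:
        key = tuple(block)
        if key not in seen:
            seen[key] = len(data)
            data.extend(block)
        index.append(seen[key])
    return index, data
```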

Prefix-suffix merging can also save space. Suppose we have two distinct blocks, $\text{B}_i$ and $\text{B}_j$, that are of this form:

diagram of overlapping blocks

Notice that the last 7 bytes of $\text{B}_i$ are the same as the first 7 bytes of $\text{B}_j$. This means that we can merge $\text{B}_i$ and $\text{B}_j$ in the $\text{B}$ table so that indices pointing to $\text{B}_j$ start at the last 7 bytes of $\text{B}_i$.

diagram of block merging
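With toy 8-byte blocks and a 3-byte overlap, the merge looks like this:

```python
# Toy blocks (8 bytes instead of 64); Bi's suffix matches Bj's prefix
Bi = [10, 11, 12, 13, 14, 1, 2, 3]
Bj = [1, 2, 3, 20, 21, 22, 23, 24]

merged = Bi + Bj[3:]   # Bj's copy starts 3 bytes before the end of Bi

assert merged[0:8] == Bi     # index for Bi points at offset 0
assert merged[5:13] == Bj    # index for Bj points at offset 5
```

Both blocks remain fully addressable, but we store 13 bytes instead of 16.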

How do we find blocks that we can perform this operation on? One possible way is to try to merge blocks as we are putting together $\text{B}$ from the many $\text{B}_i$ blocks. For example, we try to merge $\text{B}_{i-1}$ and $\text{B}_i$. This is pretty quick to check, but obviously isn't the optimal way to construct $\text{B}$ using prefix-suffix merging:

B = blocks[0].copy()
for i in range(1, len(blocks)):
    # Number of bytes shared between the suffix of blocks[i - 1]
    # and the prefix of blocks[i]
    n = longest_overlap(blocks[i - 1], blocks[i], len(blocks[i]))
    B.extend(blocks[i][n:])

Again, this only tries to merge adjacent blocks (by index).

We can reach the theoretically optimal solution using a complete directed graph. Each node in this graph is a block, and the weight of each edge is how many bytes the source block's suffix shares with the destination block's prefix. For correctness, edge weights should be less than half the block size (64 in our case).5

diagram of complete directed graph

We can then find the optimal way to arrange these blocks by finding the maximum-weight Hamiltonian path. A Hamiltonian path is a path through the graph that visits every node exactly once. Finding one of maximum weight is actually an NP-hard problem! If we wanted to find a decent solution, we could use an approximation algorithm.

Depending on the data, finding a near-optimal prefix-suffix deduplicated path can be helpful. But in the case of my combining class problem, doing prefix-suffix deduplication by only checking adjacent blocks was good enough.
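For the curious, here is one simple greedy heuristic (a sketch of my own, not what I actually shipped): always append the remaining block whose prefix overlaps most with the current tail. It is not optimal, but it is far cheaper than solving the Hamiltonian path problem exactly.

```python
def overlap(a: list[int], b: list[int]) -> int:
    # Longest n (capped below half the block size, per the correctness
    # note above) such that a's suffix equals b's prefix
    for n in range(len(b) // 2 - 1, 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_order(blocks: list[list[int]]) -> list[int]:
    # Nearest-neighbor heuristic for the maximum-weight Hamiltonian
    # path; starting from block 0 is an arbitrary choice
    order = [0]
    remaining = set(range(1, len(blocks)))
    while remaining:
        last = blocks[order[-1]]
        best = max(remaining, key=lambda j: overlap(last, blocks[j]))
        order.append(best)
        remaining.remove(best)
    return order
```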

Final Notes

I hope this post illustrates an aspect of tries that is very useful for low-level performance: compressing the indexing space while maintaining very fast lookups. The speed gains in my project were substantial when moving from perfect hash tables to tries. Note that this post highlights just one use case of tries. Tries pop up in many other areas in programming!

Credits are due to the ICU project for sparking my interest in using tries for low-level performance!

If you have any feedback, comments, or issues with this post, please let me know by submitting an issue!

Python Code

Here is reference Python code for the trie that I described in this post.

LIMIT = 0x10000

SHIFT = 6
BLOCK_LENGTH = 1 << SHIFT
MASK = BLOCK_LENGTH - 1

FLAG_UNSEEN = 0
FLAG_SEEN = 1


class Trie:
    def __init__(self) -> None:
        self.index: list[int] = [0] * (LIMIT >> SHIFT)
        self.data: list[int] = []
        # Keeps track of which 10-bit prefixes we've encountered
        self._flags: list[int] = [FLAG_UNSEEN] * (LIMIT >> SHIFT)

    def get(self, c: int) -> int:
        i = c >> SHIFT
        return self.data[self.index[i] + (c & MASK)]

    def set(self, c: int, value: int) -> None:
        assert c < LIMIT
        i = c >> SHIFT
        block: int
        # Already exists as an index
        if self._flags[i] == FLAG_SEEN:
            block = self.index[i]
        else:
            block = self._alloc_new_block(i)
        self.data[block + (c & MASK)] = value

    def compact(self) -> None:
        compressed = []

        # Keeps track of blocks that we've seen before
        blocks = {}
        for i in range(len(self.index)):
            block_index = self.index[i]
            block_slice = self.data[block_index : block_index + BLOCK_LENGTH]
            block = tuple(block_slice)
            # Check if we can fully deduplicate this block
            if block in blocks:
                self.index[i] = blocks[block]
                continue

            # If we can't deduplicate the block, then see if we can merge it
            # partially with the previous block
            overlap = longest_overlap(compressed, block_slice, BLOCK_LENGTH)
            if overlap > 0:
                index = len(compressed) - overlap
                compressed.extend(block_slice[overlap:])
                self.index[i] = index
                blocks[block] = index
                continue

            # If we can't do either of the above compaction operations, we
            # should allocate the new data
            self.index[i] = len(compressed)
            blocks[block] = len(compressed)
            compressed.extend(block)

        self.data = compressed

    def _alloc_new_block(self, i: int) -> int:
        new_block = len(self.data)
        self.data.extend([0] * BLOCK_LENGTH)
        self.index[i] = new_block
        self._flags[i] = FLAG_SEEN
        return new_block


def longest_overlap(lst1: list, lst2: list, max_overlap: int) -> int:
    # Largest n < max_overlap such that lst1 ends with the first n
    # elements of lst2 (a full overlap would be a duplicate block,
    # which is handled separately by the deduplication pass)
    overlap = max_overlap - 1
    if overlap > len(lst1):
        return 0
    while overlap > 0 and lst1[-overlap:] != lst2[:overlap]:
        overlap -= 1
    return overlap

Prefix-suffix merging is done by checking adjacent blocks (not using a complete graph) as described in the Further Compaction section.


  1. In the Unicode specification, combining class values are used to determine how combining characters should be sorted during normalization. For example, the string <U+0044, U+0300, U+0316> has combining classes <0, 230, 220>. Normalization would re-order this string to become <U+0044, U+0316, U+0300>. ↩︎

  2. The actual code was written in C. ↩︎

  3. Some vector instruction sets support 256-bit vectors (making 32x8 vectors possible), but for maximum compatibility with commonly-used instruction sets like ARM NEON, it is better to stick to 128-bit (16x8) operations. ↩︎

  4. 63 is represented with six 1s in binary, so using 63 as a mask will get the last 6 bits of the number. ↩︎

  5. This is so prefix merging with one block doesn't overwrite the possible suffix merges with other blocks. ↩︎