Prefix trees in action

Florian Beeres

Pareto Principle and Performance: A 20% understanding of profiling tools and data structures is often enough to unlock 80% of the optimizations

It’s often said that what you learn in university is only relevant up to the day you graduate. After all, most software developers don’t write their own sorting algorithms or create doubly linked lists from scratch. But this can also be a self-fulfilling prophecy — if you don’t expect to use more advanced data structures, you won’t go looking for opportunities to do so.

We recently ran into one such opportunity where replacing one data structure with another solved our performance issues with less than a hundred lines of code.

Inserting Millions of Links

One of the features our team built is processing HTML documents that we get from an external source and connecting them to our own medical content by adding links. For example, in the snippet below, we can replace the term “acute heart failure” with a link to our knowledge library.

<p>Something something about acute heart failure, and so on</p>
<p>Something something about <a href="https://www.amboss.com/us/knowledge/Acute_heart_failure">acute heart failure</a>, and so on</p>

But how do we know which words can be linked? This data comes from an internal service at AMBOSS. More precisely, we can send a list of strings to that service and get a list of destination objects back, where each object includes the exact term that can be linked and some metadata to construct the anchor element. In the above snippet, that service would have returned a destination object whose term is acute heart failure.
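As a rough sketch, a destination object might look something like this in Go; the field names here are assumptions for illustration:

type Destination struct {
 Term string // the exact term that can be linked, e.g. "acute heart failure"
 URL  string // metadata needed to construct the anchor element
}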

Our job is now to scan the documents for words (or sequences of words) that we’ve received destinations for, and replace them with anchors.

The first version of this program was a huge regular expression that we generated programmatically. But this was too slow, so we converted it to a custom loop.

Essentially, we go through the text content of the document character by character (although Go iterates over Unicode code points, called runes) and we add each character to a variable that tracks the current phrase. For the above snippet, that variable would evolve like this:

S → So → Som → … → Something → white space → check if we have a destination, insert a link, empty the variable, and move on.

This worked well at first but we quickly realized that it had a huge flaw — it wouldn’t work for destinations spanning multiple words, such as acute heart failure. Our algorithm would find the word heart and replace it with a link. Then it would process the word failure next. It would never get to heart failure though.

We solved this by first checking if the current phrase was a substring of any destination object. In that case, we would keep iterating and adding characters to the current phrase. This gave us the longest sequence of characters that we can link. That was essentially a nested loop:

outer:
for _, character := range documentContent {
 // ... do other stuff until we come across white space
 for _, dest := range destinations {
  if strings.HasPrefix(dest.Term, currentPhrase.String()) {
   // the phrase is still a prefix of some destination,
   // so append the whitespace and keep extending it
   currentPhrase.WriteRune(character)
   continue outer
  }
 }
 // ... otherwise, link the longest phrase found so far and reset
}

Why So Slow?

At this point you might be wondering why we would write a nested loop, when that’s obviously inefficient.

As I mentioned, the first iteration used regular expressions. That turned out to be a bottleneck, so we replaced the regex with the custom loop you saw above. Compared to the rest of the program, the code in question was then no longer an issue. But as we optimized the rest of the program, the link adder functionality once more became the slowest code path. So we went back to the drawing board.

This iterative approach to development can work well because it encourages you to limit the scope of your changes to only fixing the problem at hand. But I’ll also admit that we could have skipped one iteration had we realized that our first fix (from regex to loop) still had huge potential for improvement.

It’s important to keep in mind that code is always about tradeoffs. Many code bases are full of nested loops, and yet that doesn’t make these programs inherently slow. It always depends on the data you’re looping over. A nested loop that runs 10 times is not a problem; a single loop that has to process a billion elements might be.

Because of that, the first step on the road to fast programs is to measure! It’s tempting to jump right in and write some code, but all too often this results in well-meaning fixes that don’t help. Making an operation that accounts for 1% of the overall runtime 100% faster is far less effective than making something that accounts for 50% of the runtime 10% faster.

Luckily, profiling Go is easy. We have an environment variable that, when set, results in pprof.StartCPUProfile being called in main() with a file to write the profile to.
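Here’s a minimal sketch of that setup, assuming a hypothetical environment variable named CPU_PROFILE that holds the output path:

package main

import (
 "os"
 "runtime/pprof"
)

func main() {
 if path := os.Getenv("CPU_PROFILE"); path != "" {
  f, err := os.Create(path)
  if err != nil {
   panic(err)
  }
  defer f.Close()
  // StartCPUProfile samples the CPU until StopCPUProfile is called
  if err := pprof.StartCPUProfile(f); err != nil {
   panic(err)
  }
  defer pprof.StopCPUProfile()
 }
 // ... run the link adder
}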

We then analyzed the resulting .pprof file with go tool pprof -http=localhost:8080 ./cpu.pprof and found this:

Even if you’ve never seen Go profiling output before, the big red boxes seem like good candidates to investigate. runtime.mapiternext is a function call that took up almost 30% of the sampled runtime. memeqbody is even worse, at 30.61%. These two functions alone make up the majority of our runtime.

Who calls these functions? We do, in AddLinks — the function that does all the scanning and linking I described in the first section of this article.

Now we know which of our functions is slow, but we don’t yet know which lines are responsible. If we can figure out what mapiternext does, we can probably also figure out where exactly it’s being called. The documentation isn’t very helpful, but from looking at the source and considering the name and context of the code, it’s apparent that the function is involved in moving from one map key to the next. In other words, we were spending too much time doing for k, v := range someMap {}.

The other piece of the puzzle is memeqbody, which is involved in string comparisons. Since this function call is part of strings.HasPrefix and we’re calling that in the inner loop, it’s pretty clear that the nested loop is indeed responsible for how slow our code is.

Because math makes any blog post look fancy, we can also check the Big O complexity of our code.

We iterate over n words, and when we come across a word boundary, we do a nested loop over m link destinations. For each destination, we call HasPrefix, which has to traverse at least p (length of prefix) characters.

All in all, we’re looking at O(n * m * p) complexity.
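To make that concrete with some purely illustrative numbers: for a document of 10,000 words, 5,000 destinations, and terms around 10 characters long, that is on the order of 10,000 × 5,000 × 10 = 500 million character comparisons for a single document.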

What now?

Let’s forget about code and just think about what we’re trying to do here in normal, human language. We want to replace words with links and always favor the longest possible sequence. If we have destinations for both “heart” and “heart failure”, we want to favor the latter.

Therefore we need to know if the phrase we’re processing is a substring of any of the destinations. Actually, we’re interested specifically in whether the phrase is a prefix of any destination. 🤔

Making Things 🚀

This is the point where a tiny bit of textbook knowledge goes a long way. You don’t have to have every tree data structure memorized, as long as you are vaguely aware that there is an entire family of trees that’s good at dealing with prefix searches: tries.

A trie, also called a prefix tree, is a great fit for things like autocompletion and search suggestions — where the user enters the first part of a word and you need to quickly find more words that match that prefix.

It’s much easier to explain with diagrams than with words, so here’s a hypothetical representation of the “heart” and “heart failure” case.

In a trie, the nodes are typically strings and nodes are connected by individual characters. In the above example, the first non-empty node h and the node after that (he) are connected through the character e. I collapsed the two leaves (heart failure and heart burn), just so it’s easier to visualize. Also note that the children of any node share a prefix, meaning both children of heart share the heart prefix, which is perfect for our use case.

In this trie, three nodes would hold a value: heart, heart failure, and heart burn. The values could be anything, but in our case they’d be the destination objects. We can therefore not only use the trie to check if the currentPhrase is a prefix of a known destination, but we can also retrieve that destination object.

There is one thing to keep in mind though:

If we ask the trie for the destination object for heart, what do we expect to get back?

Obviously, we’re looking for the value associated with the node heart. But there are two more nodes which could be relevant, since they contain the prefix we asked for. In cases like this, it’s important to check the documentation of whatever library you’re using. It will probably differentiate between exact match (e.g., asking for heart failure) and ambiguous match (e.g., asking for heart).
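To make this concrete, here’s a hand-rolled sketch of such a trie. In practice we pulled in an existing library instead, and the Destination type below is the hypothetical struct from earlier:

type trieNode struct {
 children map[rune]*trieNode
 dest     *Destination // set only on nodes that complete an exact term
}

func newTrieNode() *trieNode {
 return &trieNode{children: map[rune]*trieNode{}}
}

func (t *trieNode) Insert(term string, d *Destination) {
 node := t
 for _, r := range term {
  child, ok := node.children[r]
  if !ok {
   child = newTrieNode()
   node.children[r] = child
  }
  node = child
 }
 node.dest = d
}

// Lookup walks one edge per character. It reports whether phrase is a
// prefix of any stored term, and returns a destination only on an exact
// match (dest stays nil for purely intermediate nodes).
func (t *trieNode) Lookup(phrase string) (isPrefix bool, dest *Destination) {
 node := t
 for _, r := range phrase {
  child, ok := node.children[r]
  if !ok {
   return false, nil
  }
  node = child
 }
 return true, node.dest
}

With this sketch, looking up heart failure returns the destination stored at that node (an exact match), while looking up hear returns isPrefix == true but no destination (an ambiguous match).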

Lastly, what kind of performance gains can we expect from using a trie? For any lookup, we can treat each character in our search string as a path segment. You can imagine searching for heart as starting at the root node, following the edge for h to the node h, then the edge for e to the node he, and so on.

The Big O complexity is therefore O(k), where k is the length of the search string, regardless of how many destination objects we have.

With the previous solution, we had O(n * m * p) complexity. We’ve replaced that with O(n * p). That’s a huge improvement because we have thousands of destinations (m), whereas the search string (p) will usually be in the 1–20 character range.
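With the same illustrative numbers as before, that works out to roughly 10,000 × 10 = 100,000 trie steps per document instead of 500 million character comparisons; the factor of m = 5,000 drops out entirely.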

Adding tries to our code was very straightforward. It essentially boiled down to replacing the inner loop with a lookup, as shown below.

outer:
for _, character := range documentContent {
 // ... do other stuff until we come across white space
 if destinationTrie.HasPrefix(currentPhrase.String()) {
  // the phrase is still a prefix of some destination, keep extending it
  currentPhrase.WriteRune(character)
  continue outer
 }
}

So instead of iterating over all entries in the map and comparing each term against the currentPhrase, we make just a single call to our trie, here called destinationTrie. The surrounding code can stay exactly the same, which makes for a fairly small commit!

Of course we still need to verify that our change has indeed made things noticeably faster. So we ran exactly the same benchmarks as before (otherwise it’s hard to really compare the data) and indeed memeqbody went from taking up 325.57s to just 11.43s. mapiternext doesn’t even show up in the top 10 function calls anymore.

Apart from profiling one entire run, we also used Go’s support for benchmarking to get a more detailed idea of just how much faster the new implementation of AddLinks is:

Before  367244  3231 ns/op  1728 B/op  38 allocs/op
After   775364  1344 ns/op   944 B/op  23 allocs/op

The new implementation takes 1344 nanoseconds per operation (ns/op) and is therefore about 2.4 times faster. It also allocates less (23 allocations per operation, down from 38), which wasn’t even a focus of ours.
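For reference, these numbers come from Go’s built-in benchmarking support (go test -bench). A benchmark for this could look roughly like the sketch below, where the fixture helper and the exact AddLinks signature are assumptions:

func BenchmarkAddLinks(b *testing.B) {
 doc, trie := loadBenchmarkFixtures() // hypothetical setup helper
 b.ReportAllocs()                     // reports B/op and allocs/op
 b.ResetTimer()                       // exclude the setup above from the measurement
 for i := 0; i < b.N; i++ {
  AddLinks(doc, trie)
 }
}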

The overall runtime of our program, as measured by a human sitting in front of the machine, also went down by about two thirds.

Code complexity is hard to measure, but I’d argue that the code is now simpler to reason about than before.

Earlier in this article, I wrote that we wanted to achieve the following:

Therefore we need to know if the phrase we’re processing is a substring of any of the destinations.

The prefix tree implementation is the exact equivalent in code of the above requirement. The for loop we had before was an implementation detail and just distracted us from the overall goal. In a sense, the code now works at a higher level of abstraction.

Lastly, code size didn’t increase because there are various open source implementations for these data structures already.

Key Takeaways

  • Always measure before you start optimizing, and measure again after your changes
  • Know your profiling tools
  • A little knowledge of data structures and algorithms goes a long way
  • Tries are prefix trees and they’re great for asking, “Do you have anything with this prefix?”