Exploring Data Structures in the Real World: DNS Denylists

UPDATED ON APR 27, 2023 : 569 words, 3 minute read β€” SOFTWARE DEVELOPMENT

As a long time user of Pi-hole (and now NextDNS ) to unclog my network from all the garbage ads the internet serves, I’ve always been amazed at the speed at which their systems operate. It has to take a target URL request and compare it against huge block lists of thousands of bad domains1, all in the span of a few ms. Otherwise, we would get frustrated and go back to browsing the web naked & afraid.

Potential Data Structures πŸ”—︎

With speed such an important factor, using an appropriate data structure would have been an important early decision for Pi-hole & NextDNS. What are some potential data structures that can provide a speedy answer to “is this URL in the long list of blocked domains”?

Dictionary πŸ”—︎

Each domain could have a dictionary entry - lookup is O(1), but you will have a pretty big space requirement - it also doesn’t handle regex blocking like *.optimizely.com very well in the naΓ―ve implementation - you would have to get creative with what keys you add and checking sub strings of the target URL.

Reverse trie/ Radix tree πŸ”—︎

A potential search strategy could be using a reverse trie - so starting from the end of the target URL to check, step through your trie to see if it matches by suffix .e.g. m-o-c-.-y-l-e-z-i-m-i-t-p-o for optimizely.com.

This search is then O(m) where m is the length of the target URL - and you can even make it faster by cutting off early if you have regex rules such as “all subdomains of X URL”.

Since we know our use case is URLs and not just any arbitrary strings, we likely would be better off storing chunks at each node, rather than individual letters - e.g. com -> google -> api for api.google.com - (I have now learned this is called a Radix tree!).

Bloom filters πŸ”—︎

Digging around a little, I found a suggestion I hadn’t heard of before - Bloom Filters ! This data structure is neat because it’s fast, space-efficient, and one of the outputs is probabilistic: possibly in set , or definitely NOT in set. Using a bloom filter would let you check the most common case (URL not in the blocklist) in O(k) complexity- which is very close to O(1) since k is only the number of hash functions used.

As far as I understand, we would still need to handle the possibly in set outcome with a deterministic data structure, but the filter significantly cuts down the frequency of a more time expensive operation - not as clear how it helps with space complexity if you still need a second data structure.

Open question: how does Pi-hole do it? πŸ”—︎

Pi-hole is open-source, so if we dig far enough, we can find out how a real world DNS sinkhole keeps it’s lookup operations speedy - if you find it before I do, let me know so I can link directly to it! Otherwise, here’s the link to the Github organization and the FTL repo which I believe is where it resides.

  1. While thousands of URLs isn’t a ton of data to search through for a computer, when operating at the DNS level and the insane speed it requires, they probably don’t want to waste any time. ↩︎

See Also