I thought I'd kick off this blog by posting a short note on an incredible paper I read several months ago - The Case for Learned Index Structures.

This paper is from Google; Jeff Dean (who co-created MapReduce) is one of the co-authors. It describes an application of neural networks that was (at least to me) unintuitive: building better indexes.

Conventional index structures like B-Trees, Hash Tables and Bloom Filters optimize for worst-case performance. That is, we design these data structures by asking what the worst-case computational cost (for reads and writes) and storage requirements would be, and optimizing for that.

As an example, consider fetching a record indexed by a B-Tree. The worst-case lookup cost is some constant multiple of log n, where n is the number of records indexed by the tree.
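To make that cost concrete, here's a minimal sketch (my own, not from the paper) that uses a sorted Python list and binary search as a stand-in for the tree traversal; the keys and record positions are made-up data:

```python
import bisect

# Toy stand-in for a B-Tree index: sorted keys plus the position of each record.
# A real B-Tree fans out across pages, but the lookup cost has the same
# O(log n) flavour.
keys = list(range(0, 1_000_000, 3))      # hypothetical sorted keys
positions = list(range(len(keys)))       # where each record lives

def lookup(key):
    """Binary-search the sorted keys: O(log n) comparisons per lookup."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return positions[i]
    return None

print(lookup(300))   # -> 100
print(lookup(301))   # -> None (key not present)
```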

This is where the core insight of the paper comes in. When we build the index structure, we already have the data we're going to index. Why assume that data is arranged in the worst possible way? If the data has structure we can learn, can't we build a better index for that particular dataset?
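To sketch what that could look like (this is my own toy version, not the paper's actual method), imagine fitting a simple linear model that maps a key to its position in the sorted data, remembering the model's worst-case error, and then searching only within that error window at lookup time:

```python
def fit_linear(keys):
    """Least-squares fit of position ~ a * key + b over the sorted keys."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2                       # positions are 0 .. n-1
    cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
    var = sum((x - mean_x) ** 2 for x in keys)
    a = cov / var
    return a, mean_y - a * mean_x

def build(keys):
    a, b = fit_linear(keys)
    # The model's worst prediction error bounds how far we ever need to search.
    max_err = max(abs(i - (a * k + b)) for i, k in enumerate(keys))
    return a, b, int(max_err) + 1

def lookup(keys, model, key):
    a, b, err = model
    guess = int(a * key + b)                   # predicted position
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    for i in range(lo, hi):                    # small local scan; a binary
        if keys[i] == key:                     # search over this window
            return i                           # would be faster
    return None

keys = sorted(k * k for k in range(1, 1000))   # made-up, mildly non-linear data
model = build(keys)
print(lookup(keys, model, 500 * 500))          # key 250000 lives at index 499
```

Keeping the worst-case error alongside the model is what makes this a usable index: the lookup is always correct, and the model only narrows down where to look.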

The authors give a simple example to illustrate:

"For example, if the goal is to build a highly-tuned system to store and query ranges of fixed-length records over a set of continuous integer keys (e.g., the keys 1 to 100M), one would not use a conventional B-Tree index over the keys since the key itself can be used as an offset, making it an O(1) rather than O(log n) operation to look-up any key or the beginning of a range of keys. Similarly, the index memory size would be reduced from O(n) to O(1)."

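Taking the quoted example literally, the "index" collapses to arithmetic: with dense keys and fixed-length records, the key determines the record's offset directly. A trivial sketch, with a made-up record size:

```python
RECORD_SIZE = 64                  # hypothetical fixed record length, in bytes

def record_offset(key):
    """O(1): byte offset of the record for `key` (keys start at 1)."""
    return (key - 1) * RECORD_SIZE

print(record_offset(1))           # -> 0
print(record_offset(100))         # -> 6336
```
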
The paper is readable, but long and dense - it took me a good chunk of a long flight to finish, and I still had to skim a few parts.