This is part 2 of a series on SQL Anti Patterns, by Bill Karwin:
Part 1: Jaywalking can be found here
Chapter 3. Naive Trees
How would you store and query a hierarchy? Usually, when we think of parent-child relationships, we know that that we need to join the parents to the children through a key. This works fine for 1-level relationships but breaks down the deeper you need to dive into the descendents.
When you have a table of rows that track immediate ancestors, when you join it to itself, you can get the children. Join it to itself again and you can also get the grandchildren. With each successive generation, you would need another join. This not only means you would need to know the depth up front (a user can’t ask for “all”), but your SQL will balloon in size and complexity quickly. This type of table is called an Adjacency List.
Not only are Adjacency Lists difficult to query, they are hard to maintain. Moving around non-leaf nodes or deleting descendants becomes a multi-query operation that is slow and error prone. Triggers and foreign keys with cascade actions can help, but may not fit every situation you need.
Karwin suggests 3 alternatives to Adjacency Lists, they may be more complex but they make some queries easier to execute.
Path Enumeration uses a string to store the ancestors of a node. When I first read this, I thought it violated AntiPattern #1 “Jaywalking.” You want me to store character separated ancestor values in a column!? I sputtered and sat dumbfounded. Then I thought about it some more. This is not storing multiple values, it’s one whole path. The path is the single value even if it is composed of parts. Karwin uses the UNIX file path as an example of a slash-delimited string of directories. A drawback of Path Enumeration is that you have to manage the paths yourself and the database is unable to enforce integrity.
Nested Sets, seems to combine double-linked lists and binary trees. Each node of the hierarchy gets two fields to denote its place in the tree. The first field notes the nodes placement as you descend the left branch of the tree. The second tracks its distance as it rises up the branches. Karwin provides a great illustration (Figure 3.3) that expresses this. When you trace back up to the root and descend into another branch, the left (older sibling) branch’s second (ascending) node list is the starting point of the first (descending) node list of the new branch. To select a set you only need to join the table on itself once and query the node fields using a BETWEEN clause to find the entire set of ancestors or children you are searching for.
There are some drawbacks of the Nested Set design worth considering. It can be harder to get immediate ancestors and children of a node than the Adjacency List model. Maintenance of the tree is a bit more complex, moving and inserting a new node may require the recalculation of the placement fields not only of all the children, but a node’s siblings as well. It is better to use the Nested Set model when you will be querying for sub-trees than individual nodes of the trees. I would use the Nested Sets model on OLAP tables where I won’t be having a barrage of inserts that would bring an OLTP application to a crawl.
The Closure Table approach separates the hierarchy from the elements themselves. A table is created to store the ancestors and descendents of elements using foreign keys and a row defining every ancestor/descendent relationship across all levels. Because the path information is no longer tied to the data itself it makes managing the sets easier and queries quicker to resolve.
The number of records stored in the table can explode with many deep relationships but these rows are often only two columns wide and consist of ids to your data. For every new item you insert a self-referencing row into the closure table, then you insert a row that references the new row and every ancestor to that row.
The pattern you use should always be based on the needs and constraints of the problem you are trying to solve. As a software developer the Adjacency List is intuitive and simple. Back when I had to create cross-tab reports across demographics I can appreciate the utility of Nested Sets. But as a DBA I like how the Closure Table models the relationship and keeps that relationship separate from the entities themselves.