SQL Anti Pattern Confessions #2

This is part 2 of a series on SQL Anti Patterns, by Bill Karwin:

Part 1: Jaywalking can be found here

Chapter 3. Naive Trees

How would you store and query a hierarchy? Usually, when we think of parent-child relationships, we know that that we need to join the parents to the children through a key. This works fine for 1-level relationships but breaks down the deeper you need to dive into the descendents.

Adjacency Lists

When you have a table of rows that track immediate ancestors, when you join it to itself, you can get the children. Join it to itself again and you can also get the grandchildren. With each successive generation, you would need another join. This not only means you would need to know the depth up front (a user can’t ask for “all”), but your SQL will balloon in size and complexity quickly. This type of table is called an Adjacency List.

Not only are Adjacency Lists difficult to query, they are hard to maintain. Moving around non-leaf nodes or deleting descendants becomes a multi-query operation that is slow and error prone. Triggers and foreign keys with cascade actions can help, but may not fit every situation you need.

Karwin suggests 3 alternatives to Adjacency Lists, they may be more complex but they make some queries easier to execute.

Path Enumeration

Path Enumeration uses a string to store the ancestors of a node. When I first read this, I thought it violated AntiPattern #1 “Jaywalking.” You want me to store character separated ancestor values in a column!? I sputtered and sat dumbfounded. Then I thought about it some more. This is not storing multiple values, it’s one whole path. The path is the single value even if it is composed of parts. Karwin uses the UNIX file path as an example of a slash-delimited string of directories. A drawback of Path Enumeration is that you have to manage the paths yourself and the database is unable to enforce integrity.

Nested Sets

Nested Sets, seems to combine double-linked lists and binary trees. Each node of the hierarchy gets two fields to denote its place in the tree. The first field notes the nodes placement as you descend the left branch of the tree. The second tracks its distance as it rises up the branches. Karwin provides a great illustration (Figure 3.3) that expresses this. When you trace back up to the root and descend into another branch, the left (older sibling) branch’s second (ascending) node list is the starting point of the first (descending) node list of the new branch. To select a set you only need to join the table on itself once and query the node fields using a BETWEEN clause to find the entire set of ancestors or children you are searching for.

There are some drawbacks of the Nested Set design worth considering. It can be harder to get immediate ancestors and children of a node than the Adjacency List model. Maintenance of the tree is a bit more complex, moving and inserting a new node may require the recalculation of the placement fields not only of all the children, but a node’s siblings as well. It is better to use the Nested Set model when you will be querying for sub-trees than individual nodes of the trees. I would use the Nested Sets model on OLAP tables where I won’t be having a barrage of inserts that would bring an OLTP application to a crawl.

Closure Tables

The Closure Table approach separates the hierarchy from the elements themselves. A table is created to store the ancestors and descendents of elements using foreign keys and a row defining every ancestor/descendent relationship across all levels. Because the path information is no longer tied to the data itself it makes managing the sets easier and queries quicker to resolve.

The number of records stored in the table can explode with many deep relationships but these rows are often only two columns wide and consist of ids to your data. For every new item you insert a self-referencing row into the closure table, then you insert a row that references the new row and every ancestor to that row.

The pattern you use should always be based on the needs and constraints of the problem you are trying to solve. As a software developer the Adjacency List is intuitive and simple. Back when I had to create cross-tab reports across demographics I can appreciate the utility of Nested Sets. But as a DBA I like how the Closure Table models the relationship and keeps that relationship separate from the entities themselves.

SQL Anti Pattern Confessions #1

SQL Anti Patterns, by Bill Karwin, is a book I wish I’d bought years ago when I really needed it. In these posts I’m briefly going to name the pattern, confess if I’ve been guilty of it in the past and communicate the suffering I’ve endured because of it. If you want solutions for these crimes against normalization I strongly suggest you pick up a copy of Mr. Karwin’s book.

Some days you are sitting around a conference room table, trying to solve a problem with you colleagues and someone says, “You know what we can do…”, and two years later you find yourself coding around that “quick and easy solution” that has plagued you six months after it went into production. Learn the lessons of this book well, it will not only make your queries run faster, but your applications will be flexible and provide you with room to grow and maneuver.

Chapter 2: Jaywalking

Plead: Guilty

Anytime you think it’s OK to store a list or comma separated values in a single field, STOP! This is not OK! You will pay severely!

Many developers confuse fields with columns when designing their database. They might think, “I’ll store the available colors for my umbrellas in field called available_colors and I’ll separate them with a comma.”

create table umbrellas
    umbrealla_id int not null auto_increment primary key
    available_colors text default '' 

| id   | available_colors |
|------|------------------|
|123456|red,blue,green    |
|123457|wine,navy,lime    |

You think, when it comes time to list the available colors I’ll just split the field on the comma and have my colors! While you have simplified your SQL, you have limited the number or colors you can offer. When someone asks what is the maximum number of colors you can store, you don’t know! The available_colors column is not a list of values but an arbitrary string! You can fit many colors with short names but not as many long-named ones.

Besides accounting for size, that column now holds a unique value for that row. If I wanted to discontinue a color, instead of deleting one color from a join table, I’d have to examine and update every row in my umbrella table, hoping I pluck out the correct sub-string without trimming or concatenating any of the other values in the column.

In the example above I’m using English names of colors, what if I wanted to localize my catalog for Spanish users? If I wanted to display “roja,azure,verde” I’d have to translate each sub-string after it was fetched!

Storing multi-value strings in a single column saves you from simple joins but forces you to write complicated queries that tie you in knots whenever you need to select, filter or aggregate data based on such values.

People who know how to properly design a database will assume that you do not when they see you abuse this cardinal rule without a good reason. There are limits to how many columns you can have in a row. There are byte-size limits on how much data you can cram into a single row. If you have a column that represents a list, better to make it a separate table. Indexing the value column can help you find particular values faster. After all a table is a list of rows.

Also, you are going to duplicate a lot of data values. When you store a list of values as a string you are essentially transforming information into an anonymous blob. Instead of your database organizing and managing your data, you’re transferring that responsibility to your application. Your business logic must now fetch, parse and provide context for your values.

Finally, despite your best estimate, someday, someone is going to find your BIG text field of multiple values just a bit too small for all their need. For the sake of a single “cell”, you’ll be required to alter the whole table to make it big enough. This not only wastes space, it could incur downtime while the table is copied into its new schema.

REMEMBER: One column holds one and ONLY one value at a time! Lists should never be stored as values but expressed as a relationship!