SQL Anti Pattern Confessions #2

This is part 2 of a series on SQL Anti Patterns, by Bill Karwin:

Part 1: Jaywalking can be found here

Chapter 3. Naive Trees

How would you store and query a hierarchy? Usually, when we think of parent-child relationships, we know that that we need to join the parents to the children through a key. This works fine for 1-level relationships but breaks down the deeper you need to dive into the descendents.

Adjacency Lists

When you have a table of rows that track immediate ancestors, when you join it to itself, you can get the children. Join it to itself again and you can also get the grandchildren. With each successive generation, you would need another join. This not only means you would need to know the depth up front (a user can’t ask for “all”), but your SQL will balloon in size and complexity quickly. This type of table is called an Adjacency List.

Not only are Adjacency Lists difficult to query, they are hard to maintain. Moving around non-leaf nodes or deleting descendants becomes a multi-query operation that is slow and error prone. Triggers and foreign keys with cascade actions can help, but may not fit every situation you need.

Karwin suggests 3 alternatives to Adjacency Lists, they may be more complex but they make some queries easier to execute.

Path Enumeration

Path Enumeration uses a string to store the ancestors of a node. When I first read this, I thought it violated AntiPattern #1 “Jaywalking.” You want me to store character separated ancestor values in a column!? I sputtered and sat dumbfounded. Then I thought about it some more. This is not storing multiple values, it’s one whole path. The path is the single value even if it is composed of parts. Karwin uses the UNIX file path as an example of a slash-delimited string of directories. A drawback of Path Enumeration is that you have to manage the paths yourself and the database is unable to enforce integrity.

Nested Sets

Nested Sets, seems to combine double-linked lists and binary trees. Each node of the hierarchy gets two fields to denote its place in the tree. The first field notes the nodes placement as you descend the left branch of the tree. The second tracks its distance as it rises up the branches. Karwin provides a great illustration (Figure 3.3) that expresses this. When you trace back up to the root and descend into another branch, the left (older sibling) branch’s second (ascending) node list is the starting point of the first (descending) node list of the new branch. To select a set you only need to join the table on itself once and query the node fields using a BETWEEN clause to find the entire set of ancestors or children you are searching for.

There are some drawbacks of the Nested Set design worth considering. It can be harder to get immediate ancestors and children of a node than the Adjacency List model. Maintenance of the tree is a bit more complex, moving and inserting a new node may require the recalculation of the placement fields not only of all the children, but a node’s siblings as well. It is better to use the Nested Set model when you will be querying for sub-trees than individual nodes of the trees. I would use the Nested Sets model on OLAP tables where I won’t be having a barrage of inserts that would bring an OLTP application to a crawl.

Closure Tables

The Closure Table approach separates the hierarchy from the elements themselves. A table is created to store the ancestors and descendents of elements using foreign keys and a row defining every ancestor/descendent relationship across all levels. Because the path information is no longer tied to the data itself it makes managing the sets easier and queries quicker to resolve.

The number of records stored in the table can explode with many deep relationships but these rows are often only two columns wide and consist of ids to your data. For every new item you insert a self-referencing row into the closure table, then you insert a row that references the new row and every ancestor to that row.

The pattern you use should always be based on the needs and constraints of the problem you are trying to solve. As a software developer the Adjacency List is intuitive and simple. Back when I had to create cross-tab reports across demographics I can appreciate the utility of Nested Sets. But as a DBA I like how the Closure Table models the relationship and keeps that relationship separate from the entities themselves.

SQL Anti Pattern Confessions #1

SQL Anti Patterns, by Bill Karwin, is a book I wish I’d bought years ago when I really needed it. In these posts I’m briefly going to name the pattern, confess if I’ve been guilty of it in the past and communicate the suffering I’ve endured because of it. If you want solutions for these crimes against normalization I strongly suggest you pick up a copy of Mr. Karwin’s book.

Some days you are sitting around a conference room table, trying to solve a problem with you colleagues and someone says, “You know what we can do…”, and two years later you find yourself coding around that “quick and easy solution” that has plagued you six months after it went into production. Learn the lessons of this book well, it will not only make your queries run faster, but your applications will be flexible and provide you with room to grow and maneuver.

Chapter 2: Jaywalking

Plead: Guilty

Anytime you think it’s OK to store a list or comma separated values in a single field, STOP! This is not OK! You will pay severely!

Many developers confuse fields with columns when designing their database. They might think, “I’ll store the available colors for my umbrellas in field called available_colors and I’ll separate them with a comma.”

create table umbrellas
    umbrealla_id int not null auto_increment primary key
    available_colors text default '' 

| id   | available_colors |
|------|------------------|
|123456|red,blue,green    |
|123457|wine,navy,lime    |

You think, when it comes time to list the available colors I’ll just split the field on the comma and have my colors! While you have simplified your SQL, you have limited the number or colors you can offer. When someone asks what is the maximum number of colors you can store, you don’t know! The available_colors column is not a list of values but an arbitrary string! You can fit many colors with short names but not as many long-named ones.

Besides accounting for size, that column now holds a unique value for that row. If I wanted to discontinue a color, instead of deleting one color from a join table, I’d have to examine and update every row in my umbrella table, hoping I pluck out the correct sub-string without trimming or concatenating any of the other values in the column.

In the example above I’m using English names of colors, what if I wanted to localize my catalog for Spanish users? If I wanted to display “roja,azure,verde” I’d have to translate each sub-string after it was fetched!

Storing multi-value strings in a single column saves you from simple joins but forces you to write complicated queries that tie you in knots whenever you need to select, filter or aggregate data based on such values.

People who know how to properly design a database will assume that you do not when they see you abuse this cardinal rule without a good reason. There are limits to how many columns you can have in a row. There are byte-size limits on how much data you can cram into a single row. If you have a column that represents a list, better to make it a separate table. Indexing the value column can help you find particular values faster. After all a table is a list of rows.

Also, you are going to duplicate a lot of data values. When you store a list of values as a string you are essentially transforming information into an anonymous blob. Instead of your database organizing and managing your data, you’re transferring that responsibility to your application. Your business logic must now fetch, parse and provide context for your values.

Finally, despite your best estimate, someday, someone is going to find your BIG text field of multiple values just a bit too small for all their need. For the sake of a single “cell”, you’ll be required to alter the whole table to make it big enough. This not only wastes space, it could incur downtime while the table is copied into its new schema.

REMEMBER: One column holds one and ONLY one value at a time! Lists should never be stored as values but expressed as a relationship!

Great Escape for HereDoc

Recently at work I was putting together a bash script that created RPM packages. One of the scripts requirements was that .spec files had to be created at run time and their body should be enclosed in a HEREDOC block.

A HEREDOC block allows you to redirect a long, multi-line string without having to insert escape sequences like “\n” at the end of each line. A heredoc block starts with << followed by a delimiter that will be repeated on it’s own line to end the block. For example if I wanted to write a short paragraph to a file I could do something like this

cat <<EOF 
This is a short paragraph. That
I want to write to a file. It's 
not a big paragraph, but will 
do for this example.
EOF
> paragraph.txt     

In the example above EOF is the delimiter and all the text between <<EOF and EOF will be written to paragraph.txt. To increase readability, many add the ‘-‘ modifier to the opening heredoc delimiter to suppress leading tabs. This allows me to indent the contents with my heredoc without the indentation being written to the file.

cat <<-EOF 
    This is a short paragraph. That
    I want to write to a file. It's 
    not a big paragraph, but will 
    do for this example.

    Adding the tabs to the text makes
    it more readable, and easier to 
    spot where my end delimiter is.
EOF
> paragraph.txt     

The problem I ran into with using the heredoc for my RPM spec file was that my spec file contained bash in it’s %post section. When I ran the script containing the heredoc the bash within the heredoc also ran creating errors! There was a lot of variables and commands within the spec so escaping each “$” or “`” by end was not only tedious but provided fertile ground for some future maintainer to encounter the same problem. In desperation I typed “heredoc escape” into Google and discovered that simply wrapping the opening delimiter in quotes (single or double) will escape the entire contents of the string.

cat <<EOF 
To print out your current working directory enter `pwd`
EOF

Results in the following with the command replaced with it’s value.

To print out your current working directory enter /home/joatis

Wrapping quotes around the delimiter though preserves my intent:

cat <<-'EOF' 
    To print out your current working directory enter `pwd`
EOF

To print out your current working directory enter `pwd`

Adding the quotes keeps your content free of ugly escapes, improves readability and should prevent future maintenance mistakes.

Dynamically bind_params for mysqli Prepared Statements

UPDATE 10/28/2018 – Thanks to Lars Krakhecke for spotting a typo in my for loop.

Personally, I avoid using mysqli whenever I can, but sometimes you find yourself working on a legacy application, or it’s the only library your team is familiar with. MySQLi, only has methods for connecting to a MySQL database and some of its methods have peculiar implementations that make them hard to use in large scale applications.

The Problem When calling bind_param() the first argument is string of ‘types’ where a character specifies the data type of a corresponding value. The number of types must match the number and order of values you’re binding. For example:

/* Connect to database */

$conn = mysqli_connect($db_server, $db_user, $db_pass, $db_name);
/* Prepare a statement */
$stmt = $conn->prepare("SELECT * accounts where type=? and balance > ?");

/* Bind the parameters to the statement
s = string
d = double (float)
*/

$stmt-bind_params('sd', $account_type, $min_balance);
$account_type = 'savings';
$min_balance = 100.00;
$stmt->execute();

/* The result of all savings accounts with a balance > 100.00 */
$result = $stmt->get_result();

Suppose I don’t want my code to be littered with hand-crafted unique bind_param calls, how could I create a function that will bind a variable list of parameters? Thankfully, I stumbled on

this clever solution that uses the  call_user_func_array function.

Solution The call_user_func_array function allows you to specify a callable function (or method) and pass an array of arguments to be fed to the callable. Since I want to use the mysqli_stmt bind_params function I pass as the first argument an array containing my stmt object and the method name I want to call as a string. The second argument is another array. This second array holds my ‘type string’ as its first element and then my values to bind as succeeding elements. The elements in this second array must be passed as references.

$typeString = 'sd';
$vals = array('savings', 100.00);
$valCount = count($vals);

/* Populate args with references to values */
$args = array(&$typeString);
for ($i=0; $i < $valCount; $i++){
    $args[] = &$vals[$i];
}

/* call $stmt->bind_params() using $args as its parameter list */
call_user_func_array( array($stmt, 'bind_param'), $args);

Using this technique I can write a function that will bind any number of arguments to a statement and have only one place that I need to manage when it comes to catching exceptions or handling other errors.