SQL Anti Pattern Confessions #2

This is part 2 of a series on SQL Anti Patterns, by Bill Karwin:

Part 1: Jaywalking can be found here

Chapter 3. Naive Trees

How would you store and query a hierarchy? Usually, when we think of parent-child relationships, we know that that we need to join the parents to the children through a key. This works fine for 1-level relationships but breaks down the deeper you need to dive into the descendents.

Adjacency Lists

When you have a table of rows that track immediate ancestors, when you join it to itself, you can get the children. Join it to itself again and you can also get the grandchildren. With each successive generation, you would need another join. This not only means you would need to know the depth up front (a user can’t ask for “all”), but your SQL will balloon in size and complexity quickly. This type of table is called an Adjacency List.

Not only are Adjacency Lists difficult to query, they are hard to maintain. Moving around non-leaf nodes or deleting descendants becomes a multi-query operation that is slow and error prone. Triggers and foreign keys with cascade actions can help, but may not fit every situation you need.

Karwin suggests 3 alternatives to Adjacency Lists, they may be more complex but they make some queries easier to execute.

Path Enumeration

Path Enumeration uses a string to store the ancestors of a node. When I first read this, I thought it violated AntiPattern #1 “Jaywalking.” You want me to store character separated ancestor values in a column!? I sputtered and sat dumbfounded. Then I thought about it some more. This is not storing multiple values, it’s one whole path. The path is the single value even if it is composed of parts. Karwin uses the UNIX file path as an example of a slash-delimited string of directories. A drawback of Path Enumeration is that you have to manage the paths yourself and the database is unable to enforce integrity.

Nested Sets

Nested Sets, seems to combine double-linked lists and binary trees. Each node of the hierarchy gets two fields to denote its place in the tree. The first field notes the nodes placement as you descend the left branch of the tree. The second tracks its distance as it rises up the branches. Karwin provides a great illustration (Figure 3.3) that expresses this. When you trace back up to the root and descend into another branch, the left (older sibling) branch’s second (ascending) node list is the starting point of the first (descending) node list of the new branch. To select a set you only need to join the table on itself once and query the node fields using a BETWEEN clause to find the entire set of ancestors or children you are searching for.

There are some drawbacks of the Nested Set design worth considering. It can be harder to get immediate ancestors and children of a node than the Adjacency List model. Maintenance of the tree is a bit more complex, moving and inserting a new node may require the recalculation of the placement fields not only of all the children, but a node’s siblings as well. It is better to use the Nested Set model when you will be querying for sub-trees than individual nodes of the trees. I would use the Nested Sets model on OLAP tables where I won’t be having a barrage of inserts that would bring an OLTP application to a crawl.

Closure Tables

The Closure Table approach separates the hierarchy from the elements themselves. A table is created to store the ancestors and descendents of elements using foreign keys and a row defining every ancestor/descendent relationship across all levels. Because the path information is no longer tied to the data itself it makes managing the sets easier and queries quicker to resolve.

The number of records stored in the table can explode with many deep relationships but these rows are often only two columns wide and consist of ids to your data. For every new item you insert a self-referencing row into the closure table, then you insert a row that references the new row and every ancestor to that row.

The pattern you use should always be based on the needs and constraints of the problem you are trying to solve. As a software developer the Adjacency List is intuitive and simple. Back when I had to create cross-tab reports across demographics I can appreciate the utility of Nested Sets. But as a DBA I like how the Closure Table models the relationship and keeps that relationship separate from the entities themselves.

SQL Anti Pattern Confessions #1

SQL Anti Patterns, by Bill Karwin, is a book I wish I’d bought years ago when I really needed it. In these posts I’m briefly going to name the pattern, confess if I’ve been guilty of it in the past and communicate the suffering I’ve endured because of it. If you want solutions for these crimes against normalization I strongly suggest you pick up a copy of Mr. Karwin’s book.

Some days you are sitting around a conference room table, trying to solve a problem with you colleagues and someone says, “You know what we can do…”, and two years later you find yourself coding around that “quick and easy solution” that has plagued you six months after it went into production. Learn the lessons of this book well, it will not only make your queries run faster, but your applications will be flexible and provide you with room to grow and maneuver.

Chapter 2: Jaywalking

Plead: Guilty

Anytime you think it’s OK to store a list or comma separated values in a single field, STOP! This is not OK! You will pay severely!

Many developers confuse fields with columns when designing their database. They might think, “I’ll store the available colors for my umbrellas in field called available_colors and I’ll separate them with a comma.”

create table umbrellas
    umbrealla_id int not null auto_increment primary key
    available_colors text default '' 

| id   | available_colors |
|123456|red,blue,green    |
|123457|wine,navy,lime    |

You think, when it comes time to list the available colors I’ll just split the field on the comma and have my colors! While you have simplified your SQL, you have limited the number or colors you can offer. When someone asks what is the maximum number of colors you can store, you don’t know! The available_colors column is not a list of values but an arbitrary string! You can fit many colors with short names but not as many long-named ones.

Besides accounting for size, that column now holds a unique value for that row. If I wanted to discontinue a color, instead of deleting one color from a join table, I’d have to examine and update every row in my umbrella table, hoping I pluck out the correct sub-string without trimming or concatenating any of the other values in the column.

In the example above I’m using English names of colors, what if I wanted to localize my catalog for Spanish users? If I wanted to display “roja,azure,verde” I’d have to translate each sub-string after it was fetched!

Storing multi-value strings in a single column saves you from simple joins but forces you to write complicated queries that tie you in knots whenever you need to select, filter or aggregate data based on such values.

People who know how to properly design a database will assume that you do not when they see you abuse this cardinal rule without a good reason. There are limits to how many columns you can have in a row. There are byte-size limits on how much data you can cram into a single row. If you have a column that represents a list, better to make it a separate table. Indexing the value column can help you find particular values faster. After all a table is a list of rows.

Also, you are going to duplicate a lot of data values. When you store a list of values as a string you are essentially transforming information into an anonymous blob. Instead of your database organizing and managing your data, you’re transferring that responsibility to your application. Your business logic must now fetch, parse and provide context for your values.

Finally, despite your best estimate, someday, someone is going to find your BIG text field of multiple values just a bit too small for all their need. For the sake of a single “cell”, you’ll be required to alter the whole table to make it big enough. This not only wastes space, it could incur downtime while the table is copied into its new schema.

REMEMBER: One column holds one and ONLY one value at a time! Lists should never be stored as values but expressed as a relationship!

Boston Code Camp: Part II

On March 25, 2017 I attended Boston Code Camp 27 at the Microsoft Technology Center in Burlington, MA. This is the second part of the notes I took from the sessions I attended.

After the Hololens talk I ended up taking two sessions on presentations.

The Cure for Boring Presentations

The first was The Cure for Boring Presentations by Rick Pollack “The Presentation Medic.” Rick started his talk explaining that he used to work for Motorola and in one meeting, while he was droning through a long list of product features, his customer actually fell asleep! After that he decided to take some speaking classes. He now offers lessons so that others will not suffer his same fate. As you’d expect from a good presenter, there was more than just the text of his slides.

From my notes:

  1. Give your audience a reason to care. Instead of mentioning features, describe the benefit (save money, save time, etc.) Diagnose your audience’s problem and prescribe them an idea.
  2. Your presentation should make a connection, establish a context, and your ideas should contrast where they are with how you can help them.
  3. Go through your slides and make sure there is a “you” rather than me focus. For example instead of saying “My software runs 33% faster,” shift the focus to “You’ll increase processing speed by 33%”
  4. Use hyperlinks within your power point slides around key topics in case they catch interest and you have to skip ahead. Clicking a link is much smoother than hitting TAB 36 times.
  5. Number your slides for easy navigation.
  6. Forget the Jargon and use simple language.
  7. Don’t use the slides as a teleprompter.
  8. Instead of hand outs make a PDF they can download later via URL or QR code.

Rick also advised in always framing your presentation in the form of a story. Our brains are hard wired for stories and the information they convey will last longer. Be a confident guide while you spin the tale of who, what, when, where and why they should care what you have to tell them.

Put your audience’s situation that makes them uncomfortable with the status quo. Describe the situation they would find themselves in before and after an upgrade or what they will be facing with or without your idea.

Filter and distill your content. Only include what data is necessary to support the story. If you were trying to sell a muscle car to a guy who loves engines don’t bore him brake-pad details, focus on the horse power!

Finally, your “Big Idea,” the big take away needs to be memorable, repeatable & tweet-able.

ScreenCasting 101

Betsy Weber (@betsyweber) from Microsoft gave a quick half-hour presentation on screen casting. Each of her slides consisted of a funny GIF from giphycat over a bullet point of advice she had learned the hard way.

Before you record:

  1. Write out your script or at least have an outline.
  2. Turn off background programs like Skype, Slack, Steam, Discord, etc.
  3. Gather and prepare all the materials you will need.
  4. Practice and do a dry run!
  5. Clean up/off your desktop! Minimize the distractions you have to compete with.
  6. Use the now blank real estate of your desktop to enforce your brand!
  7. Clean your browser history to avoid not only embarrassment but also eliminate another distraction.
  8. When moving your mouse SLOW it down! No one wants to have to concentrate on that tiny pointer zipping around.
  9. Pause your recording when you need a break.
  10. People wont tolerate bad audio. Try pulling a comforter over your head while recording audio to reduce ambient noise.
  11. Write the script, record the audio, record the video in that order! Then combine them in editing.
  12. For a microphone you can get a Blue microphone from Amazon for around $50.
  13. Spring the $6 for the pop-filter. It makes a big difference.
  14. When recording audio, stand up and make gestures. Sitting down constricts your diaphragm and you’ll sound better.
  15. Don’t drink soft drinks or milk before recording as it will make you sound Phlem-y.
  16. When recording audio use a clicker, tap the mic or just be silent for a moment to “mark” the audio histogram to make editing easier. Think of it like the clapboards they use in films to mark a scene.

Some screen capturing tools you can check out include:

  • Jing – Free but imposes a time limit.
  • Snagit – $50
  • Camtasia – $199

Boston Code Camp: Part 1

Microsoft Technology Center, Burlington, MA
On March 25, I attended Boston Code Camp 27 at the Microsoft Technology Center in Burlington, MA. Although I’ve never done any .NET programming, the schedule listed some open source and professional development tracks I thought looked interesting. Below is a list of the talks I attended along with my impressions of the ideas and presenters. Where possible I’ll link to any of the presentations.

Living Breathing Digital: Bending Reality with HoloLens

I’d seen the hype about Hololens online and really couldn’t pass up seeing this thing in real life. Jason Ioffe from BlueMetal did a great job handling a demonstration that wasn’t easy to pull off. Jason was wearing the Hololens, which was tethered to his Mac book, that was connected to the projector; allowing us to “see through his eyes.”

The Hololens is totally self contained and equipped with sensors and cameras that constantly scan the surroundings up to a distance of 20 meters. It’s not a bulky, or awkward looking headset like the VR visors we had in the 90’s. A HUD (which we couldn’t see) was super-imposed on the glass looking out at the real world. You use your finger as a mouse, fully extended to hover and then bent to click. I have to say, watching him scroll and click through phantasmal menus made me think of little Danny in “The Shining”.

Jason took a “snapshot” of the room and loaded it into the Unity 3D SDK where all the walls, desks and audience members appeared as a 3D, gray, video game scene. Into this scene he imported “artifacts” from the Unity library like a sphere, a revolving doughnut, apple tree and a cartoon robot that acted as an interface to a speech to text virtual assistant.

Loading these artifacts back into the Hololens was were the presentation bogged down. It’s hard to keep an audience enraptured while you all sit there watch the compile, deploy, debug ritual play out in real time. Once he was all set we watched him shake the Apple tree. Digital apples fell from the tree, landed on a front row desk and then rolled onto the floor. Looking at the blue cartoon robot, who sat under the tree, he asked it the weather. The robot replied with a text bubble. While the artifacts could “interact” with the scene, they cannot, currently deal with changes in the environment. For example if the digital apple had landed on the desk, then I moved the desk, the apple wouldn’t follow the desk or drop to the floor but remain suspended.

The Hololens is a cool piece of tech and its sleek design holds up as a testament to Moore’s Law but outside of video games or niche specialized presentations (medical, miniaturization design) I have a hard time thinking of what practical applications it would be put to.

Great Escape for HereDoc

Recently at work I was putting together a bash script that created RPM packages. One of the scripts requirements was that .spec files had to be created at run time and their body should be enclosed in a HEREDOC block.

A HEREDOC block allows you to redirect a long, multi-line string without having to insert escape sequences like “\n” at the end of each line. A heredoc block starts with << followed by a delimiter that will be repeated on it’s own line to end the block. For example if I wanted to write a short paragraph to a file I could do something like this

cat <<EOF 
This is a short paragraph. That
I want to write to a file. It's 
not a big paragraph, but will 
do for this example.
> paragraph.txt     

In the example above EOF is the delimiter and all the text between <<EOF and EOF will be written to paragraph.txt. To increase readability, many add the ‘-‘ modifier to the opening heredoc delimiter to suppress leading tabs. This allows me to indent the contents with my heredoc without the indentation being written to the file.

cat <<-EOF 
    This is a short paragraph. That
    I want to write to a file. It's 
    not a big paragraph, but will 
    do for this example.

    Adding the tabs to the text makes
    it more readable, and easier to 
    spot where my end delimiter is.
> paragraph.txt     

The problem I ran into with using the heredoc for my RPM spec file was that my spec file contained bash in it’s %post section. When I ran the script containing the heredoc the bash within the heredoc also ran creating errors! There was a lot of variables and commands within the spec so escaping each “$” or “`” by end was not only tedious but provided fertile ground for some future maintainer to encounter the same problem. In desperation I typed “heredoc escape” into Google and discovered that simply wrapping the opening delimiter in quotes (single or double) will escape the entire contents of the string.

cat <<EOF 
To print out your current working directory enter `pwd`

Results in the following with the command replaced with it’s value.

To print out your current working directory enter /home/joatis

Wrapping quotes around the delimiter though preserves my intent:

cat <<-'EOF' 
    To print out your current working directory enter `pwd`

To print out your current working directory enter `pwd`

Adding the quotes keeps your content free of ugly escapes, improves readability and should prevent future maintenance mistakes.

Dynamically bind_params for mysqli Prepared Statements

Personally, I avoid using mysqli whenever I can, but sometimes you find yourself working on a legacy application, or it’s the only library your team is familiar with. MySQLi, only has methods for connecting to a MySQL database and some of its methods have peculiar implementations that make them hard to use in large scale applications.

The Problem When calling bind_param() the first argument is string of ‘types’ where a character specifies the data type of a corresponding value. The number of types must match the number and order of values you’re binding. For example:

/* Connect to database */

$conn = mysqli_connect($db_server, $db_user, $db_pass, $db_name);
/* Prepare a statement */
$stmt = $conn->prepare("SELECT * accounts where type=? and balance > ?");

/* Bind the parameters to the statement
s = string
d = double (float)

$stmt-bind_params('sd', $account_type, $min_balance);
$account_type = 'savings';
$min_balance = 100.00;

/* The result of all savings accounts with a balance > 100.00 */
$result = $stmt->get_result();

Suppose I don’t want my code to be littered with hand-crafted unique bind_param calls, how could I create a function that will bind a variable list of parameters? Thankfully, I stumbled on

this clever solution that uses the  call_user_func_array function.

Solution The call_user_func_array function allows you to specify a callable function (or method) and pass an array of arguments to be fed to the callable. Since I want to use the mysqli_stmt bind_params function I pass as the first argument an array containing my stmt object and the method name I want to call as a string. The second argument is another array. This second array holds my ‘type string’ as its first element and then my values to bind as succeeding elements. The elements in this second array must be passed as references.

$typeString = 'sd';
$vals = array('savings', 100.00);
$valCount = count($vals);

/* Populate args with references to values */
$args = array(&$typeString);
for ($i=0; $i > $valCount; $i++){
    $args[] = &$vals[$i];

/* call $stmt->bind_params() using $args as its parameter list */
call_user_func_array( array($stmt, 'bind_param'), $args);

Using this technique I can write a function that will bind any number of arguments to a statement and have only one place that I need to manage when it comes to catching exceptions or handling other errors.