I’m going to come clean: for a large chunk of my adult software-writing life, I’ve actively ignored databases. That’s not to say I didn’t use them! Given the choice, I would implement application logic in the code I was writing; treating the relational database as a dumb collection of keys and their values.

In part this grew from a culture of ORM use. Rather than using icky SQL and getting tied to one platform or another, object relational mappers insulate us from underlying database engines and their — ahem — idiosyncrasies.

The other part was the desire to keep all business logic in one place: the application itself. Call it proper centralization if you will, but I’m going to call it what it was: developer hubris.

No more! I now embrace the power of the data model, picking a database engine the way I pick programming languages: decisively, fearlessly and without regret.

What’s a Variadic Type?

Defining a relational model involves two tasks: identifying the nouns (what, precisely, are you storing?) and enumerating their attributes (what about them?) Nouns become tables, their attributes become columns, and each column in a traditional relational table stores at most one unit of data.

Variadic attributes need a bit more finesse.

Easy now… easy…. (photo by Sigmund on Unsplash)

A Tag! A Tag! My Kingdom for a Tag!

Let’s consider a concrete example.

Suppose we’re building a blogging engine, and we are modeling the database. We’re going to want to start with a table of posts:

create table if not exists posts (
  id bigserial not null primary key,
  title text not null,
  url   text not null unique,
  body  text not null
);

Each post gets its own record and the four attributes–idtitleurl, and body–are all scalars. If we want all posts, we select all records. If we only need one post, we have limit and a where clauses.

Big deal, James. You know how to make a table.

What happens if we want to add tagging into the mix? We will define a tag as a simple English word, consisting of non-whitespace alphanumeric characters. Each post can have any number of tags (even zero!).

For now, let’s use the only hammer at our disposal, and make the list of tags a text column:

create table if not exists posts (
  id bigserial not null primary key,
  title text not null,
  url   text not null unique,
  body  text not null,
  tags  text not null default ''
);

Already, we have a problem: tags, like the other text fields, can only take a single value, but we have a list of values to insert! One solution is to serialize the list into a scalar. We know from our definition of tags that they can never include any punctuation, so let’s use a comma (,) to delimit the list in a (scalar) string!

insert into posts (title, url, body, tags) values (
  'A Tale of Two Databases',
  '/writings/with-apologies-to-dickens',
  'It was the best of blogs, it was the worst of blogs...',
  'classical,literature,parody'
)

Now we must remember whenever we deal with the posts table to deserialize the tags list after reading a post record, and to serialize it back before writing one.

(This works by the way, and I’m only marginally ashamed to admit that there are a few systems out in the wild written this way by my own hand)

Let’s Be Better At This

The flaws with this approach manifest as soon as we try to use tags for something more than just blog post render-candy. Search is a big deal, and it breaks in a big way.

We shall start with the following data:

insert into posts (title, url, body, tags) values
  ('A Tale of Two Databases',
   '/writings/with-apologies-to-dickens',
   'It was the best of blogs, it was the worst of blogs...',
   'parody,fundamentals,data'),
('Pride & Prejudice (& Processes)',
   '/writings/a-jane-austen-task-scheduler',
   'It is a truth universally acknowledged, that an idle CPU must be in want of a runnable task.',
   'fun,parody,schedulers'),
  ('Kubermorphosis',
   '/writings/a-chittering-k8s',
   'As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a containerized cluster',
   'parody,fun,schedulers,trends');

We are using tags as a shorthand for the topic(s) of each blog post. One day, a reader will wish to focus in on a particular topic-tag like schedulers — at which point they would presume to find such wonderful gems as Pride & Prejudice (& Processes) and Kubermorphosis.

To do that, we need to look inside of our tag list, and filter accordingly. We could employ the like predicate in a where clause, something akin to:

select * from posts where tags like '%schedulers%'

This suffers from partial negative matching. A query to find all posts tagged fun would inadvertently include A Tale of Two Databases, which while fundamental, is not in fact tagged as fun. To fix this, we can add the delimiters into the like query:

select * from posts where tags like '%,schedulers,%'

Now, however, we have to over-delimit the tags strings; the above query incorrectly omits Pride & Prejudice (& Processes), since its tags field has schedulers at the end of the serialized list, and does not append the trailing comma.

No, we need a better way. We should treat our tags list as a first-class data concept. Why not treat a tag like a noun and make a table for it? Here’s the SQL for setting up the two object tables and a join table:

create table if not exists posts (
  id bigserial not null primary key,
  title text not null,
  url   text not null unique,
  body  text not null
);
create table if not exists tags (
  id bigserial not null primary key,
  tag text not null unique
);
create table if not exists taggings (
  post_id integer references posts (id)
          on delete restrict on update restrict,
  tag_id  integer references tags (id)
          on delete restrict on update restrict,
  unique (post_id, tag_id)
);

Each tag is stored once, in the tags table, and joined zero or more times through taggings to posts. Here’s a data file for populating this new three-table structure with the same information we used in the scalar approach.

We can hide the multi-level join behind a view, if we wish:

create view tagged_titles as
select p.id, p.title, json_agg(t.tag) as tags
  from posts p inner join taggings j on j.post_id = p.id
               inner join tags     t on j.tag_id  = t.id
 group by p.id, p.title;

This will (a) omit any posts that are untagged, and (b) build a JSON array of tags for each post. Here’s the proof:

vector3=# select * from tagged_titles;
-[ RECORD 1 ]---------------------------------------
id    | 1
title | A Tale of Two Databases
tags  | ["parody", "fundamentals", "data"]
-[ RECORD 2 ]---------------------------------------
id    | 2
title | Pride & Prejudice (& Processes)
tags  | ["parody", "fun", "schedulers"]
-[ RECORD 3 ]---------------------------------------
id    | 3
title | Kubermorphosis
tags  | ["parody", "fun", "schedulers", "trends"]

(note that I’ve enabled ‘expanded display’ for legibility’s sake)

To search, we need to work from the other direction. Here is a view that maps each tag to all of the posts tagged with it:

create view titles_by_tag as
select t.tag, p.id, p.title
  from posts p inner join taggings j on j.post_id = p.id
               inner join tags     t on j.tag_id  = t.id;

Starting with a single tag, we can now find all tagged posts:

postgres=# select title from titles_by_tag
           where tag = 'schedulers';
-[ RECORD 1 ]--------------------------
title | Pride & Prejudice (& Processes)
-[ RECORD 2 ]--------------------------
title | Kubermorphosis

Using SQL’s INTERSECT operator, we can do a simple AND query, finding posts tagged both schedulers and trends :

postgres=# select title from titles_by_tag where tag = 'schedulers'
             intersect
           select title from titles_by_tag where tag = 'trends';
-[ RECORD 1 ]---------
title | Kubermorphosis

This is handy, but the over-normalization of data presents a few hurdles.

For starters, inserting a blog post goes from being a single SQL query to being somewhere between 1+n to 1+2n queries, where n = the number of tags on the post. The best case happens when you tag a post using only pre-existing tags. When every tag is new, we hit the the higher end.

Secondly, deletion becomes more difficult. To protect against accidentally corrupting the many-to-many relationship between posts and tags, we have opted for on delete restrict and on update restrict constraints on the foreign keys in the join table. This stops us from accidentally dropping a post before removing its associations in taggings, leaving a dangling reference:

postgres=# delete from posts where id = 3;
ERROR:  update or delete on table "posts" violates foreign key constraint "taggings_post_id_fkey" on table "taggings"
DETAIL:  Key (id)=(3) is still referenced from table "taggings".

To process the delete, we must first remove the reference from taggings:

postgres=# delete from taggings where post_id = 3;
DELETE 4
postgres=# delete from posts where id = 3;
DELETE 1

Deletion is now a two-query affair.

In the context of blogging, where reads happen more often than writes–one hopes!–these trade-offs may be acceptable. For the cases where they are not, let us consider the variadic ARRAY type that PostgreSQL has supported since time immemorial.

Arrays in Scalar Slots

Our original problem with the scalar tags column is that it needed to be multi-valued, and text columns are not multi-valued. PostgreSQL has an aggregate type, however, called text [] that is multi-valued.

It looks like this:

create table if not exists posts (
  id bigserial not null primary key,
  title text not null,
  url   text not null unique,
  body  text not null,
  tags  text [] not null default '{}'
);

This is strikingly similar to our first attempt: one table with a single column for tracking tags. No joins. No overly-aggressive normalization. So how does it fare under search? For that, allow me to introduce you to @>.

In the SQL expression where x @> y@> is pronounced “contains,” as in: where x contains y. Here are some fun queries you can try on your local neighborhood PostgreSQL instance:

select '{a,b,c}'::text[] @> '{a}'::text[];
select '{a,b,c}'::text[] @> '{a,c}'::text[];
select '{a,b,c}'::text[] @> '{d,e,f}'::text[];

The first two should return t; the last should return f.

Keep in mind that order and cardinality do not matter. Both of the following queries will return true (t):

select '{a,b,c}'::text[] @> '{a,a,a,a}'::text[];
select '{a,b,c}'::text[] @> '{c,b,a}'::text[];

Tagging is inherently about sets of tokens, where order and cardinality matter little if at all. Let’s find blog posts tagged schedulers:

variadic4=# select title, tags from posts
             where tags @> '{schedulers}';
-[ RECORD 1 ]--------------------------
title | Pride & Prejudice (& Processes)
tags  | {fun,parody,schedulers}
-[ RECORD 2 ]--------------------------
title | Kubermorphosis
tags  | {parody,fun,schedulers,trends}

We don’t even need to intersect multiple queries to handle a multiple tagging assertion:

postgres=# select title, tags from posts
            where tags @> '{schedulers,trends}';
-[ RECORD 1 ]-------------------------
title | Kubermorphosis
tags  | {parody,fun,schedulers,trends}

Disjoint taggings (OR semantics) can likewise be done in a single query, using SQL’s own OR keyword:

postgres=# select title, tags from posts
           where tags @> '{data}'
              or tags @> '{trends}';
-[ RECORD 1 ]-------------------------
title | A Tale of Two Databases
tags  | {parody,fundamentals,data}
-[ RECORD 2 ]-------------------------
title | Kubermorphosis
tags  | {parody,fun,schedulers,trends}

Beyond Search: Analytics in the Data Engine

PostgreSQL has a whole slew of additional, array-related tricks up its sleeves. My personal favorite is the unnest() function, which turns an array into a result set. This is one you’ve just got to see in action!

postgres=# select unnest(tags) as t from posts order by t asc;
 data
 fun
 fun
 fundamentals
 parody
 parody
 parody
 schedulers
 schedulers
 trends

Each array of tags has been un-nested into a row for each tag. At this point we can bring the rest of the power of SQL to bear on the data–here we’re ordering the query results alphabetically by tag. The duplicates show up because we aren’t suppressing them. We could slap a distinct qualifier on that query and use it to show what tags we’re actively using:

select distinct unnest(tags) as tag from posts;

Even better, we can find the arity of a tag–how many blog posts are tagged as such, and build up a leaderboard of sorts:

select unnest(tags) as tag,
       count(id)    as n
  from posts
 group by tag;

With the sample data we’ve been playing with, that query yields this:

postgres=# select unnest(tags) as tag,
                  count(id)    as n
             from posts
            group by tag;
     tag      | n
--------------+---
 fun          | 2
 trends       | 1
 data         | 1
 parody       | 3
 schedulers   | 2
 fundamentals | 1
(6 rows)

We’ve used 6 tags, half of which have only been used once, and one of which is used everywhere. Pretty typical for technical blogs, I’d wager.

My favorite part about the variadic approach to modeling tags is that it gives us the best of both worlds. Insertion and deletion are back to being single-query propositions. My application does not need to understand how tagging is implemented, only how to specify the tags on a post, and how to search (via the views we so helpfully created).

Photo by Thomas Rey on Unsplash

Don’t Underestimate Your Database Engine

If I can leave you with one parting thought it is this: don’t underestimate the power contained in relational database systems. Below the ORMs, the DAOs, the DALs, and the other don’t-worry-you-don’t-need-to-know-SQL layers, lies a sophisticated bit of technology that may be able to completely remove, or at least sidestep, a whole class of fiddly programming problems.

Use your database.

If You Liked That, You’ll Love These:

Building a new kind of consultancy

Building a new kind of consultancy

At Vivanti, we’re building a new type of cloud consultancy. One based on trust and empowerment. We are looking for savvy, technical people who want to forge their own path in the industry.