I’m going to come clean: for a large chunk of my adult software-writing life, I’ve actively ignored databases. That’s not to say I didn’t use them! Given the choice, I would implement application logic in the code I was writing; treating the relational database as a dumb collection of keys and their values.
In part this grew from a culture of ORM use. Rather than using icky SQL and getting tied to one platform or another, object relational mappers insulate us from underlying database engines and their — ahem — idiosyncrasies.
The other part was the desire to keep all business logic in one place: the application itself. Call it proper centralization if you will, but I’m going to call it what it was: developer hubris.
No more! I now embrace the power of the data model, picking a database engine the way I pick programming languages: decisively, fearlessly and without regret.
What’s a Variadic Type?
Defining a relational model involves two tasks: identifying the nouns (what, precisely, are you storing?) and enumerating their attributes (what about them?) Nouns become tables, their attributes become columns, and each column in a traditional relational table stores at most one unit of data.
Variadic attributes need a bit more finesse.

A Tag! A Tag! My Kingdom for a Tag!
Let’s consider a concrete example.
Suppose we’re building a blogging engine, and we are modeling the database. We’re going to want to start with a table of posts:
create table if not exists posts (
id bigserial not null primary key,
url text not null unique,
body text not null
);
Each post gets its own record and the four attributes–id
,
title
, url
, and body
–are all scalars. If we want all posts,
we select all records. If we only need one post, we have limit
and where
clauses.
Big deal, James. You know how to make a table.
What happens if we want to add tagging into the mix? We will define a tag as a simple English word, consisting of non-whitespace alphanumeric characters. Each post can have any number of tags (even zero!).
For now, let’s use the only hammer at our disposal, and make the
list of tags a text
column:
create table if not exists posts (
id bigserial not null primary key,
title text not null,
url text not null unique,
body text not null,
tags text not null default ''
);
Already, we have a problem: tags
, like the other text fields, can
only take a single value, but we have a list of values to insert!
One solution is to serialize the list into a scalar. We know from
our definition of tags that they can never include any
punctuation, so let’s use a comma (,
) to delimit the list in a
(scalar) string!
insert into posts (title, url, body, tags) values (
'A Tale of Two Databases',
'/writings/with-apologies-to-dickens',
'It was the best of blogs, it was the worst of blogs...',
'classical,literature,parody'
)
Now we must remember whenever we deal with the posts
table to
deserialize the tags
list after reading a post record, and to
serialize it back before writing one.
(This works by the way, and I’m only marginally ashamed to admit that there are a few systems out in the wild written this way by my own hand)
Let’s Be Better At This
The flaws with this approach manifest as soon as we try to use tags for something more than just blog post render-candy. Search is a big deal, and it breaks in a big way.
We shall start with the following data:
insert into posts (title, url, body, tags) values
('A Tale of Two Databases',
'/writings/with-apologies-to-dickens',
'It was the best of blogs, it was the worst of blogs...',
'parody,fundamentals,data'),
('Pride & Prejudice (& Processes)',
'/writings/a-jane-austen-task-scheduler',
'It is a truth universally acknowledged, that an idle CPU
must be in want of a runnable task.',
'fun,parody,schedulers'),
('Kubermorphosis',
'/writings/a-chittering-k8s',
'As Gregor Samsa awoke one morning from uneasy dreams he
found himself transformed in his bed into a containerized
cluster',
'parody,fun,schedulers,trends');
We are using tags as a shorthand for the topic(s) of each blog
post. One day, a reader will wish to focus in on a particular
topic-tag like schedulers
— at which point they would presume to
find such wonderful gems as Pride & Prejudice (& Processes)
and Kubermorphosis.
To do that, we need to look inside of our tag list, and filter
accordingly. We could employ the like
predicate in a where
clause,
something akin to:
select * from posts where tags like '%schedulers%'
This suffers from partial negative matching. A query to find all
posts tagged fun
would inadvertently include A Tale of Two
Databases, which while fundamental, is not in fact tagged as
fun
. To fix this, we can add the delimiters into the like
query:
select * from posts where tags like '%,schedulers,%'
Now, however, we have to over-delimit the tags strings; the above
query incorrectly omits Pride & Prejudice (& Processes),
since its tags
field has schedulers
at the end of the
serialized list, and does not append the trailing comma.
No, we need a better way. We should treat our tags list as a first-class data concept. Why not treat a tag like a noun and make a table for it? Here’s the SQL for setting up the two object tables and a join table:
create table if not exists posts (
id bigserial not null primary key,
title text not null,
url text not null unique,
body text not null
);
create table if not exists tags (
id bigserial not null primary key,
tag text not null unique
);
create table if not exists taggings (
post_id integer references posts (id)
on delete restrict on update restrict,
tag_id integer references tags (id)
on delete restrict on update restrict,
unique (post_id, tag_id)
);
Each tag is stored once, in the tags
table, and joined zero or
more times through taggings
to posts
. Here’s a data file for
populating this new three-table structure with the same
information we used in the scalar approach.
We can hide the multi-level join behind a view, if we wish:
create view tagged_titles as
select p.id, p.title, json_agg(t.tag) as tags
from posts p inner join taggings j on j.post_id = p.id
inner join tags t on j.tag_id = t.id
group by p.id, p.title;
This will (a) omit any posts that are untagged, and (b) build a JSON array of tags for each post. Here’s the proof:
vector3=# select * from tagged_titles;
-[ RECORD 1 ]---------------------------------------
id | 1
title | A Tale of Two Databases
tags | ["parody", "fundamentals", "data"]
-[ RECORD 2 ]---------------------------------------
id | 2
title | Pride & Prejudice (& Processes)
tags | ["parody", "fun", "schedulers"]
-[ RECORD 3 ]---------------------------------------
id | 3
title | Kubermorphosis
tags | ["parody", "fun", "schedulers", "trends"]
(note that I’ve enabled ‘expanded display’ for legibility’s sake)
To search, we need to work from the other direction. Here is a view that maps each tag to all of the posts tagged with it:
create view titles_by_tag as
select t.tag, p.id, p.title
from posts p inner join taggings j on j.post_id = p.id
inner join tags t on j.tag_id = t.id;
Starting with a single tag, we can now find all tagged posts:
postgres=# select title from titles_by_tag
where tag = 'schedulers';
-[ RECORD 1 ]--------------------------
title | Pride & Prejudice (& Processes)
-[ RECORD 2 ]--------------------------
title | Kubermorphosis
Using SQL’s INTERSECT
operator, we can do a simple AND
query,
finding posts tagged both schedulers
and trends
:
postgres=# select title from titles_by_tag where tag = 'schedulers'
intersect
select title from titles_by_tag where tag = 'trends';
-[ RECORD 1 ]---------
title | Kubermorphosis
This is handy, but the over-normalization of data presents a few hurdles.
For starters, inserting a blog post goes from being a single SQL query to being somewhere between 1+n to 1+2n queries, where n = the number of tags on the post. The best case happens when you tag a post using only pre-existing tags. When every tag is new, we hit the the higher end.
Secondly, deletion becomes more difficult. To protect against
accidentally corrupting the many-to-many relationship between
posts and tags, we have opted for on delete restrict
and on update restrict
constraints on the foreign keys in the join table. This
stops us from accidentally dropping a post before removing its
associations in taggings
, leaving a dangling reference:
postgres=# delete from posts where id = 3;
ERROR: update or delete on table "posts" violates foreign key
constraint "taggings_post_id_fkey" on table "taggings"
DETAIL: Key (id)=(3) is still referenced from table "taggings".
To process the delete, we must first remove the reference from
taggings
:
postgres=# delete from taggings where post_id = 3;
DELETE 4
postgres=# delete from posts where id = 3;
DELETE 1
Deletion is now a two-query affair.
In the context of blogging, where reads happen more often than
writes–one hopes!–these trade-offs may be acceptable. For the
cases where they are not, let us consider the variadic ARRAY
type
that PostgreSQL has supported since time immemorial.
Arrays in Scalar Slots
Our original problem with the scalar tags
column is that it needed
to be multi-valued, and text
columns are not multi-valued.
PostgreSQL has an aggregate type, called text []
, that is
multi-valued.
It looks like this:
create table if not exists posts (
id bigserial not null primary key,
title text not null,
url text not null unique,
body text not null,
tags text [] not null default '{}'
);
This is strikingly similar to our first attempt: one table with a
single column for tracking tags. No joins. No overly-aggressive
normalization. So how does it fare under search? For that, allow
me to introduce you to @>
.
In the SQL expression where x @> y
, @>
is pronounced “contains,”
as in: where x contains y. Here are some fun queries you can try
on your local neighborhood PostgreSQL instance:
select '{a,b,c}'::text[] @> '{a}'::text[];
select '{a,b,c}'::text[] @> '{a,c}'::text[];
select '{a,b,c}'::text[] @> '{d,e,f}'::text[];
The first two should return t
; the last should return f
.
Keep in mind that order and cardinality do not matter. Both of the
following queries will return true (t
):
select '{a,b,c}'::text[] @> '{a,a,a,a}'::text[];
select '{a,b,c}'::text[] @> '{c,b,a}'::text[];
Tagging is inherently about sets of tokens, where order and
cardinality matter little if at all. Let’s find blog posts tagged
schedulers
:
variadic4=# select title, tags from posts
where tags @> '{schedulers}';
-[ RECORD 1 ]--------------------------
title | Pride & Prejudice (& Processes)
tags | {fun,parody,schedulers}
-[ RECORD 2 ]--------------------------
title | Kubermorphosis
tags | {parody,fun,schedulers,trends}
We don’t even need to intersect
multiple queries to handle a
multiple tagging assertion:
postgres=# select title, tags from posts
where tags @> '{schedulers,trends}';
-[ RECORD 1 ]-------------------------
title | Kubermorphosis
tags | {parody,fun,schedulers,trends}
Disjoint taggings (OR
semantics) can likewise be done in a single
query, using SQL’s own OR
keyword:
postgres=# select title, tags from posts
where tags @> '{data}'
or tags @> '{trends}';
-[ RECORD 1 ]-------------------------
title | A Tale of Two Databases
tags | {parody,fundamentals,data}
-[ RECORD 2 ]-------------------------
title | Kubermorphosis
tags | {parody,fun,schedulers,trends}
Beyond Search: Analytics in the Data Engine
PostgreSQL has a whole slew of additional, array-related tricks up
its sleeves. My personal favorite is the unnest()
function, which
turns an array into a result set. This is one you’ve just got to
see in action!
postgres=# select unnest(tags) as t from posts order by t asc;
data
fun
fun
fundamentals
parody
parody
parody
schedulers
schedulers
trends
Each array of tags has been un-nested into a row for each tag. At
this point we can bring the rest of the power of SQL to bear on
the data–here we’re ordering the query results alphabetically by
tag. The duplicates show up because we aren’t suppressing them. We
could slap a distinct
qualifier on that query and use it to show
what tags we’re actively using:
select distinct unnest(tags) as tag from posts;
Even better, we can find the arity of a tag–how many blog posts are tagged as such, and build up a leaderboard of sorts:
select unnest(tags) as tag,
count(id) as n
from posts
group by tag;
With the sample data we’ve been playing with, that query yields this:
postgres=# select unnest(tags) as tag,
count(id) as n
from posts
group by tag;
tag | n
--------------+---
fun | 2
trends | 1
data | 1
parody | 3
schedulers | 2
fundamentals | 1
(6 rows)
We’ve used 6 tags, half of which have only been used once, and one of which is used everywhere. Pretty typical for technical blogs, I’d wager.
My favorite part about the variadic approach to modeling tags is that it gives us the best of both worlds. Insertion and deletion are back to being single-query propositions. My application does not need to understand how tagging is implemented, only how to specify the tags on a post, and how to search (via the views we so helpfully created).

Don’t Underestimate Your Database Engine
If I can leave you with one parting thought it is this: don’t underestimate the power contained in relational database systems. Below the ORMs, the DAOs, the DALs, and the other don’t-worry-you-don’t-need-to-know-SQL layers, lies a sophisticated bit of technology that may be able to completely remove, or at least sidestep, a whole class of fiddly programming problems.
Use your database.