Who doesn’t love a good interactive data session?
Usually the people who weren’t there to watch the ideas unfold. That includes future versions of our present selves, who often don’t remember the particulars of decisions made in the moment. Why is this view here? Why is it built on that query? If only there were some way to leave little notes for our future selves to find and read!
We could write comments, sure. Look, here’s one:
-- explode out the price column using PostgreSQL's UNNEST()
-- function, so that we get multiple records; one for
-- each price in the ad hoc list!
--
create or replace view product_variants as
select sku,
       unnest(string_to_array(prices, ';')) as price
from raw_product_variants
order by sku asc;
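To make that comment concrete, here’s a quick sketch of what the view produces; the sample row is invented purely for illustration:
-- Suppose raw_product_variants contains a row like:
--   sku = 'SKU-1', prices = '9.99;12.99;14.99'
-- Then the view fans the semicolon-separated list out into one row per price:
select * from product_variants where sku = 'SKU-1';
-- sku   | price
-- ------+-------
-- SKU-1 | 9.99
-- SKU-1 | 12.99
-- SKU-1 | 14.99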
Comments like these work surprisingly well after the fact, once the analysis has been done interactively through a REPL like pgAdmin or Sequel Ace. In the moment, however, it’s difficult to explore interactively while also keeping a record of the thoughts, trials, and tribulations encountered along the way.
Luckily, we already have a technology for both live programming and written records for posterity: Jupyter Notebooks!
Through some trickery, pip install incantations, and a little bit of Docker composition to tie it all together, I now have a serviceable PostgreSQL-backed, SQL-exploring data REPL that runs in the browser. In the rest of this essay, I’m going to show you how I did it, and then give you some ideas on what to do with it on your own data projects.
First, the components.
- The jupyter/minimal-notebook Docker image will serve as our base.
- On top of that, we’ll add sqlalchemy and psycopg2 for painless access to our PostgreSQL instance.
- Then, we’ll add the excellent ipython-sql library for some syntactic sugar (to taste).
- Finally, we’ll spin up our modified Jupyter image alongside the standard postgres image via Docker Compose.
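Before we write anything, it helps to picture the directory layout we’re aiming for. Here’s my sketch of it, inferred from the compose file we’ll get to shortly (the top-level name is just whatever you call the project directory):
jupyter-sql/
├── Dockerfile              # our Jupyter customizations
├── docker-compose.yml      # ties the notebook and database containers together
└── _/
    ├── jupyter/            # mounted as the notebook user's home directory
    │   └── data/           # also mounted, read-only, into the database container
    └── pg/                 # PostgreSQL's data directory, so the data survives restarts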
Here’s the Dockerfile for the Python customizations to jupyter/minimal-notebook:
FROM jupyter/minimal-notebook

# PostgreSQL bits, libpq headers, and a compiler so psycopg2 can build
USER root
RUN apt-get update \
 && apt-get install -y postgresql libpq-dev build-essential

# Switch back to the unprivileged notebook user to install the Python packages
USER jovyan
RUN pip install ipython-sql sqlalchemy psycopg2 pandas
(By the way, if you don’t want to type all of this out or copy/paste files into existence, feel free to clone my git repo.)
We could build this image ourselves, but since we’re not going to be distributing it, we’ll let Docker Compose do it for us. Speaking of which, here’s the compose YAML definition:
version: '3'
services:
  notebook:
    build: .
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_TOKEN=data
    volumes:
      - ./_/jupyter:/home/jovyan
  db:
    image: postgres
    environment:
      - POSTGRES_USER=jupe
      - POSTGRES_PASSWORD=sekrit
    volumes:
      - ./_/jupyter/data:/data:ro
      - ./_/pg:/var/lib/postgresql/data
We’re spinning up two containers here: notebook runs the interactive Jupyter part, and db is a stock PostgreSQL instance with a superuser of jupe and a password of sekrit. Thanks to the magic of Docker, the notebook container can perform a DNS lookup on the name db to (dynamically!) get the IP address of the PostgreSQL container. This comes in extra handy when we start connecting to the database once everything is up and spinning.
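If you want to see that lookup for yourself once everything is running, a throwaway notebook cell will do it; this is plain Python, nothing specific to this stack:
import socket

# The Compose service name resolves to the db container's current address.
socket.gethostbyname('db')   # returns something like '172.18.0.2'; yours may differ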
To get things started, we need to build our custom Jupyter image and then bring up all the containers:
$ docker-compose build
<a bunch of build output; apt updates, pip installs, etc.>
$ docker-compose up -d
[+] Running 3/3
⠿ Network jupyter-sql_default Created 0.0s
⠿ Container jupyter-sql-notebook-1 Started 0.4s
⠿ Container jupyter-sql-db-1 Started 0.4s
Now we can visit the Jupyter web interface. Our Docker Compose configuration forwards port 8888 on the host to port 8888 inside the container, so (assuming we’ve not done something overly clever) we can point a browser at http://localhost:8888.
Yours should look something like mine:

If you were observant, you may have noticed that we set a JUPYTER_TOKEN environment variable for the notebook container in our Compose recipe. This lets us force a known access token, so that we don’t have to dig the generated one out of the running container’s log output. I did this for convenience; you may not want to do the same. To each their own.
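(If you’d rather let Jupyter generate a random token instead, it still prints the tokenized login URL to its logs, so something along these lines should surface it:)
$ docker-compose logs notebook | grep token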
My token is data. Whatever your token is, you’re going to have to enter it in the text field to proceed.
Now we can get started for real. In a new Python-powered notebook, we can load the sql extension and connect to the PostgreSQL database:
%load_ext sql
%sql postgresql://jupe:sekrit@db/jupe
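A quick sanity check that the connection actually took; any trivial query will do, and the server version is as good as anything:
%sql select version();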
We can start adding Markdown and code cells, threading our thoughts (in prose) with our exploration (in SQL). The ipython-sql library that we added to the image gives us the %%sql cell magic, which executes arbitrary SQL through the Python interpreter, like this:
%%sql
select pid, usename as user, datname as db, client_addr as ip
from pg_stat_activity;
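Since we also baked pandas into the image, it’s worth knowing that ipython-sql result sets convert straight into DataFrames, which comes in handy once you want to chart or pivot something. A small sketch, reusing the query above (assigning a %sql result to a variable is ordinary IPython behavior):
result = %sql select pid, usename, datname from pg_stat_activity
df = result.DataFrame()   # ipython-sql result sets know how to become DataFrames
df.head()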
From here, the possibilities are limitless. Using this setup, I ran a two-hour session on data cleaning using views, COPY [table] FROM [file] queries, and some elbow grease. When it was all done, I gave out the input data set and the Jupyter notebook exports to all the attendees so that they could play on their own.
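That COPY step leans on something the compose file set up quietly: ./_/jupyter/data on the host is mounted into the db container at /data, so a CSV dropped there is readable by the server. As a rough sketch (the file name is invented for illustration, and the target table is assumed to exist already):
%%sql
copy raw_product_variants from '/data/variants.csv' with (format csv, header true);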

If you want to play along, feel free to check out the repository on GitHub. If you do something interesting with it, and the dataset isn’t too large or too proprietary, please reach out on Twitter and let me know all the juicy details.