Who doesn't love a good interactive data session?
Usually the people who weren't there to watch the ideas unfold. That includes future versions of our present selves, who often don't remember the particulars of decisions made in the moment. Why is this view here? Why is it built on that query? If only there was some way to leave little notes for future selves to find and read!
We could write comments, sure. Look, here's one:
```sql
-- explode out the price column using PostgreSQL's UNNEST()
-- function, so that we get multiple records; one for
-- each price in the ad hoc list!
--
create or replace view product_variants as
  select sku,
         unnest(string_to_array(prices, ';')) as price
    from raw_product_variants
   order by sku asc;
```
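To make the effect concrete, here's a tiny Python sketch of what that view computes. The SKUs and prices are invented for illustration; only the shape (a `;`-separated list in one column) matches the query above:

```python
# Hypothetical rows from raw_product_variants: (sku, prices),
# where prices is a ';'-separated ad hoc list.
raw_product_variants = [
    ("SKU-2", "4.50"),
    ("SKU-1", "9.99;14.99"),
]

# UNNEST(string_to_array(prices, ';')) expands each row into one
# record per price; ORDER BY sku ASC sorts the result.
product_variants = [
    (sku, price)
    for sku, prices in sorted(raw_product_variants)
    for price in prices.split(";")
]
print(product_variants)
# → [('SKU-1', '9.99'), ('SKU-1', '14.99'), ('SKU-2', '4.50')]
```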
This works surprisingly well after the fact, once the analysis has been done interactively, through a REPL like pgadmin or Sequel Ace. In the moment however, it's difficult to do this interactively while also keeping a record of the thoughts, trials, and tribulations encountered on the way.
Luckily, we already have a technology for both live programming and written records for posterity: Jupyter Notebooks!
Through some trickery, `pip install` incantations, and a little bit of Docker composition to tie it all together, I now have a serviceable PostgreSQL-backed, SQL-exploring data REPL that runs in the browser. In the rest of this essay, I'm going to show you how I did it, and then give you some ideas on what to do with it on your own data projects.
First, the components.
- The `jupyter/minimal-notebook` Docker image will serve as our base.
- On top of that, we'll add `sqlalchemy` and `psycopg2` for painless access to our PostgreSQL instance.
- Then, we'll add the excellent `ipython-sql` library for some syntactic sugar (to taste).
- Finally, we'll spin up our modified Jupyter image alongside the standard `postgres` image via Docker Compose.
Here's the Dockerfile for the Python customizations to our Jupyter image:
```dockerfile
FROM jupyter/minimal-notebook

USER root
RUN apt-get update \
 && apt-get install -y postgresql libpq-dev build-essential

USER jovyan
RUN pip install ipython-sql sqlalchemy psycopg2 pandas
```
(by the way, if you don't want to type all of this stuff out, or copy/paste files into existence, feel free to clone my git repo.)
We could build this image ourselves, but since we're not going to be distributing it, we'll let Docker Compose do it for us. Speaking of which, here's the compose YAML definition:
```yaml
version: '3'
services:
  notebook:
    build: .
    ports: [8888:8888]
    environment:
      - JUPYTER_TOKEN=data
    volumes:
      - ./_/jupyter:/home/jovyan
  db:
    image: postgres
    environment:
      - POSTGRES_USER=jupe
      - POSTGRES_PASSWORD=sekrit
    volumes:
      - ./_/jupyter/data:/data:ro
      - ./_/pg:/var/lib/postgresql/data
```
We're spinning up two containers here: `notebook` runs the interactive Jupyter part, and `db` is a stock PostgreSQL instance, with a root user of `jupe` and a password of `sekrit`. Thanks to the magic of Docker, the `notebook` container can perform a DNS lookup on the name `db` to (dynamically!) get the IP address of the PostgreSQL container. This comes in extra handy when we start connecting to the database once everything is up and spinning.
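You can see that name resolution in action with Python's standard `socket` module. From inside the notebook container, resolving `db` would return the database container's address, courtesy of Docker's embedded DNS; the `localhost` lookup below just demonstrates the same call somewhere it works unconditionally:

```python
import socket

# Inside the notebook container, this would be
# socket.gethostbyname("db"), answered by Docker's embedded DNS
# with the db container's IP on the compose network.
print(socket.gethostbyname("localhost"))  # → 127.0.0.1
```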
To get things started, we need to build our custom Jupyter image, and then bring up all the containers:
```console
$ docker-compose build
<a bunch of build output; apt updates, pip installs, etc.>
$ docker-compose up -d
[+] Running 3/3
 ⠿ Network jupyter-sql_default       Created  0.0s
 ⠿ Container jupyter-sql-notebook-1  Started  0.4s
 ⠿ Container jupyter-sql-db-1        Started  0.4s
```
Now we can visit the Jupyter web interface. Our Docker Compose configuration forwards port 8888 on the host to port 8888 inside the container, so (assuming we've not done something overly clever) we can visit http://localhost:8888 in our browser.
Yours should look something like mine:
If you were observant, you may have noticed that we set a `JUPYTER_TOKEN` environment variable for the notebook container in our Compose recipe. This lets us force a known access token, so that we don't have to fish it out of the running container's output. I did this for convenience; you may not want to do the same. To each their own.
My token is `data`. Whatever your token is, you're going to have to enter it in the text field to proceed.
Now we can get started for real. In a new Python-powered notebook, we can load the `sql` extension and connect to the PostgreSQL database:
```
%load_ext sql
%sql postgresql://jupe:sekrit@db/jupe
```
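If the shape of that connection URL is unfamiliar, it's the standard `user:password@host/database` form. A quick stdlib check (using Python's `urllib.parse`, nothing specific to ipython-sql) pulls it apart:

```python
from urllib.parse import urlsplit

# The DSN from the %sql line above: user jupe, password sekrit,
# host db (the compose service name), database jupe.
dsn = urlsplit("postgresql://jupe:sekrit@db/jupe")
print(dsn.username, dsn.password, dsn.hostname, dsn.path.lstrip("/"))
# → jupe sekrit db jupe
```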
We can start adding Markdown and code cells, threading our thoughts (in prose) with our exploration (in SQL). The `ipython-sql` library that we added to the image allows us to use the `%%sql` cell magic to execute arbitrary SQL through the Python interpreter, like this:
```sql
%%sql
select pid,
       usename as user,
       datname as db,
       client_addr as ip
  from pg_stat_activity;
```
From here, the possibilities are limitless. Using this setup, I did a two-hour session on data cleaning using views, `COPY [table] FROM [file]` queries, and some elbow grease. When it was all done, I gave out the input data set and the Jupyter notebook exports to all the attendees so that they could play on their own.
If you want to play along, feel free to check out the repository on GitHub. If you do something interesting with it, and the dataset isn't too large or too proprietary, please reach out on Twitter and let me know all the juicy details.