Running SQL in a Jupyter Notebook

James Hunt

Who doesn’t love a good interactive data session?

Usually the people who weren’t there to watch the ideas unfold. That includes future versions of our present selves, who often don’t remember the particulars of decisions made in the moment. Why is this view here? Why is it built on that query? If only there was some way to leave little notes for future selves to find and read!

We could write comments, sure. Look, here’s one:

-- explode out the price column using PostgreSQL's UNNEST()
-- function, so that we get multiple records; one for
-- each price in the ad hoc list!
create or replace view product_variants as
  select sku,
         unnest(string_to_array(prices, ';')) as price
    from raw_product_variants
   order by sku asc;

This works surprisingly well after the fact, once the analysis has been done interactively, through a REPL like pgadmin or Sequel Ace. In the moment however, it’s difficult to do this interactively while also keeping a record of the thoughts, trials, and tribulations encountered on the way.

Luckily, we already have a technology for both live programming and written records for posterity: Jupyter Notebooks!

Through some trickery, pip install incantations, and a little bit of Docker composition to tie it all together, I now have a serviceable PostgreSQL-backed, SQL-exploring data REPL that runs in the browser. In the rest of this essay, I’m going to show you how I did it, and then give you some ideas on what to do with it on your own data projects.

First, the components.

Here’s the Dockerfile for the Python customizations to jupyter/minimal-notebook:

FROM jupyter/minimal-notebook
USER root
RUN apt-get update \
 && apt-get install -y postgresql libpq-dev build-essential
USER jovyan
RUN pip install ipython-sql sqlalchemy psycopg2 pandas

(by the way, if you don’t want to type all of this stuff out, or copy/paste files into existence, feel free to clone my git repo.)

We could build this image ourselves, but since we’re not going to be distributing it, we’ll let Docker Compose do it for us. Speaking of which, here’s the compose YAML definition:

version: '3'
    build: .
    ports: [8888:8888]
      - JUPYTER_TOKEN=data
      - ./_/jupyter:/home/jovyan
    image: postgres
      - POSTGRES_USER=jupe
      - POSTGRES_PASSWORD=sekrit
      - ./_/jupyter/data:/data:ro
      - ./_/pg:/var/lib/postgresql/data

We’re spinning two containers here: notebook runs the interactive Jupyter part, and db is a stock PostgreSQL instance, with a root user of jupe and a password of sekrit. Thanks to the magic of Docker, the notebook container can perform a DNS lookup on the name db to (dynamically!) get the IP address of the PostgreSQL container. This comes in extra handy when we start connecting to the database once everything is up and spinning.

To get thing started, we need to build our custom Jupyter image, and then bring up all the containers:

$ docker-compose build
<a bunch of build output; apt updates, pip installs, etc.>
$ docker-compose up -d
[+] Running 3/3
 ⠿ Network jupyter-sql_default       Created       0.0s
 ⠿ Container jupyter-sql-notebook-1  Started       0.4s
 ⠿ Container jupyter-sql-db-1        Started       0.4s

Now we can visit the Jupyter web interface in our browser. Our Docker Compose configuration is forwarding port 8888 outside the container to port 8888 inside, so (assuming we’ve not done something overly clever) we can visit http://localhost:8888 in our browser.

Your should look something like mine:

If you were observant, you may have noticed that we set a JUPYTER_TOKEN environment variable for the notebook container in our Compose recipe. This lets us force an access token, so that we don’t have to reference the output of the spinning container. I did this for convenience; you may not want to do the same. To each their own.

My token is data. Whatever your token is, you’re going to have to enter it in the text field to proceed.

Now we can get started for real. In a new Python-powered notebook, we can load the sql extension and connect to the PostgreSQL database:

%load_ext sql
%sql postgresql://jupe:[email protected]/jupe

We can start adding Markdown and code cells, threading our thoughts (in prose) with our exploration (in SQL). The ipython-sqllibrary that we added to the image allows us to use the %%sql keyword to execute arbitrary SQL through the Python interpreter, like this:

select pid, usename as user, datname as db, client_addr as ip
  from pg_stat_activity;

From here, the possibilities are limitless. Using this set up, I did a two-hour session on data cleaning using views, COPY [table] FROM [file] queries and some elbow grease. When it was all done, I gave out the input data set and the Jupyter notebook exports to all the attendees so that they could play on their own.

James using the Jupyter SQL notebook to present on data engineering topics, data cleaning techniques, and proper podcaster mic usage. Seriously, who keeps giving me air time?
We do a lot of ad hoc presentations and learning sessions over at Vivanti.

If you want to play along, feel free to check out the repository on GitHub. If you do something interesting with it, and the dataset isn’t too large or too proprietary, please reach out on Twitter and let me know all the juicy details.