Flat Files Are Hard

James Hunt

File names are easy, right?

Over the years I’ve built up an impressive computing fabric of my own, consisting primarily of two boxes that sit on my desk. These boxes collectively host the family infrastructure, side projects, open source work, and all the harebrained schemes and crazy ideas I dream up.

They’re also a bit of a mess.

In an attempt to clean those up, I spent a few days last month writing a file cataloging system. I figured if I could get a full listing of all files and directories on each server into a central data store, I could start to run some analysis on them. Were there any duplicate files (by checksum)? Where were the largest files? How many copies of left-pad are there hanging out in node_modules/ directories?

To implement, I turned to find, a small Linux utility that is supremely suited to crawling hierarchical filesystem structures and doing something on each of the nodes in the tree. Using find to emit a CSV file goes a little something like this:

find / -printf "%p,%s" > all-files.csv

This is as simple as it gets. It’s also wrong. It runs alright, but the comma-delimited text file it pops out is only sometimes viable for import into a database.

Here’s the first problem: files with commas in their names.

On a stock Linux system this wouldn’t be a problem. Sadly, I don’t often find that I care about the files that come with the base operating system. Rather, I need to know what I’ve added to the system in the way of code, data, and other assets. For example, I keep all of my digital library on one of the hosts, and that includes PDFs with file names like this:

$ ls -1 books/
-rw-r--r-- 1 jhunt staff   7159012  Dec 21 18:50 'Additive Art: 3d Printing, Prototyping, and Manufacturing.epub'
-rw-r--r-- 1 jhunt staff  13230982  Dec 21 18:50 'Additive Art: 3d Printing, Prototyping, and Manufacturing.pdf'

The problem, as anyone who has dealt with CSV extracts for any appreciable amount of time knows, is that the comma is now doing double duty. It is both a delimiter (outside of the data being transported to the database) and a data item. Serializing data with delimiters in it calls for an escape character of some sort.

Simple enough, right? We handle this in CSVs by enclosing fields in double quotes (ASCII 34). So instead of this:

file_name,file_size
books/Additive Art: 3d Printing, Prototyping, and Manufacturing.epub,7159012
books/Additive Art: 3d Printing, Prototyping, and Manufacturing.pdf,13230982

We would instead write this:

file_name,file_size
"books/Additive Art: 3d Printing, Prototyping, and Manufacturing.epub",7159012
"books/Additive Art: 3d Printing, Prototyping, and Manufacturing.pdf",13230982

We’ve now successfully solved the confusion about commas, but we’ve done so by introducing a new ambiguity, vis-a-vis the double quote. What happens if we have a data element with embedded quotes?

$ ls -1 history/
-rw-r--r-- 1 jhunt staff 1081857  Dec 24 23:36  'The "Nutcracker": Rise of an Old World Ballet.pdf'

The best way of handling this is to introduce an escape sequence; one or more characters that signal to the reader that we are about to treat a delimiter character normally.

Following the UNIX tradition (which comes to us by way of C), let’s use the backslash (ASCII 92). Every time we see a single backslash, we know that the next character will be interpreted as a non-delimiter.

file_name,file_size
"a file name with \"quotes\", and commas.txt",12345

The first quote is a delimiter; it delimits a string that is going to contain literal commas. It’s a way of saying to the reader “hey, we’re heading into rich text territory; next time you see a comma, it’s a value-comma, not a delimiter-comma.” In that sense, it acts as a special-purpose escape character. More on that later.

The second quote is preceded by our backslash. This is akin to saying to the reader “normally, we’d consider the field closed at a quote, just not this quote. It’s a value-quote, not a delimiter-quote.”

With the introduction of these two complexities to our encoding scheme, we’ve built a means to separate the quote-delimiter and quote-value cases from one another. Similarly, we can differentiate comma-delimiter (outside of quotes) and comma-value (inside of quotes) scenarios quite nicely.

Materially, we’ve introduced the same flaw we did when we decided to enclose problematic fields in quotes. We have a new escape character (backslash, ASCII 92) and we don’t handle what we do when that delimiter is actually a value.

This goes on forever if we keep introducing new delimiters. Eventually, we run out of characters in the ASCII space. Maybe we switch to Unicode and start delimiting with emoji. Good luck getting sqlplus to export with a 😈 field separator.

The only way to stop this is to re-use an escape signifier. Escape the escape, so to speak. This is where the \\ sequence comes in. It handles the edge case of an escape-delimiter that needs to be interpreted as a value, not a meta-value.

Okay, so now we can handle awful file names like “Either \ Or: The “Math” Problems, and Solutions, of Combinatorial Psychology.pdf”.

file_name,file_size
"research/Either \\ Or: The \"Math\" Problems, and Solutions, of Combinatorial Psychology.pdf",101443

At this point, we could theoretically remove the first abstraction we tried. Instead of enclosing fields in quotes, we could rely on the escaping technique:

file_name,file_size
research/Either \\ Or: The "Math" Problems\, and Solutions\, of Combinatorial Psychology.pdf,101443

ed.: I am unsure if Snowflake will handle this. Let’s try it and find out. Also, try PG

There’s one more delimiter we haven’t considered. In part that’s because it doesn’t show up all that often in real file names found on sane systems. In part, that’s also because you cannot see it.

In CSV, the newline or line feed character (ASCII 10 or \n, depending on who you ask) delimits whole records.

New lines in file names? What kind of monster does that? Remember when I said that some of the things on these two boxes are crazy ideas and harebrained schemes? Yeah, it’s me. I’m the monster. Past me, experimenting with strange filesystem ideas, is anyway.

xkcd comic #981, about certain improper uses of hard links to directories in file systems
courtesy xkcd

Here’s what happens in the naïve case when you don’t handle line feeds (or carriage returns) in file names:

file_name,file_size
"A truly \"weird\" file
name.pdf",1234

Two things are wrong here.

The first record is incomplete. Not only does the quote-delimiter not terminate, but there is no file size field. The record ends halfway through the filename, but the reader doesn’t know that–they can’t know it because we have no way of communicating it to them.

The second problem is more insidious: the second line is a validly-formed record with incorrect data. There is no file called name.pdf", and it certainly isn’t 1,234 bytes long.

(Note: if you come from a programming background, you may have assumed that the second line contains an unterminated string. You would be correct. However, in most CSV implementations [ed.: verify], the quote region logic is only activated if the first character of a field is the double quote.)

Luckily, this is a case we can easily handle with our existing backslash-escape scheme: we just need to escape the line feed (ASCII 10) and carriage return (ASCII 13) characters like we did with quote.

file_name,file_size
"A truly \"weird\" file\
name.pdf",1234

Dealing with non-printing characters is weird, so let’s run that through our old friend od to see the individual ASCII code points:

$ cat file.csv | od -a -t d1
0000000    f   i   l   e   _   n   a   m   e   ,   f   i   l   e   _   s
          102 105 108 101  95 110  97 109 101  44 102 105 108 101  95 115
0000020    i   z   e  nl   "   A  sp   t   r   u   l   y  sp   \   "   w
          105 122 101  10  34  65  32 116 114 117 108 121  32  92  34 119
0000040    e   i   r   d   \   "  sp   f   i   l   e   \  nl   n   a   m
          101 105 114 100  92  34  32 102 105 108 101  92  10 110  97 109
0000060    e   .   p   d   f   "   ,   1   2   3   4  nl
          101  46 112 100 102  34  44  49  50  51  52  10
0000074

Our escaped line feed is represented in the byte stream as a backslash (ASCII 92) followed by the line feed (ASCII 10).

This often doesn’t sit well with people, especially people who are used to the C-style escape scheme we’ve employed. For starters, it makes it difficult to use stock tools like [wc][2], without modification. Indeed, wc -l counts three lines for our example CSV above.

Instead, we will want to assign non-white space representations to white space characters. Traditionally, that means \n is a new line, \r is a carriage return, and \t is a tab (which we haven’t had a problem with in this experiment).

If we adopt that scheme, our final CSV will look like this:

file_name,file_size
"A truly \"weird\" file\nname.pdf",1234

Replacing value-newlines with non-white space escaped representations not only yields more readable CSV, but it also lets us use tools like wc and split on our CSV output.

Our final CSV file, complete with all sorts of weirdness, looks like this:

file_name,file_size
"books/Additive Art: 3d Printing, Prototyping, and Manufacturing.epub",7159012
"books/Additive Art: 3d Printing, Prototyping, and Manufacturing.pdf",13230982
"test/A truly \"weird\" file\nname.pdf",1234
"research/Either \\ Or: The \"Math\" Problems, and Solutions, of Combinatorial Psychology.pdf",101443

Getting find to print this out is going to be tricky. At this point, I’d reach for Perl.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find; # kind of like find(1)
find(sub {
  my @s = stat($_);
  my $f = $File::Find::name;
  $f =~ s/\\/\\\\/g;
  $f =~ s/\n/\\n/g; $f =~ s/\r/\\r/g;
  $f =~ s/\t/\\t/g; $f =~ s/"/\\"/g;
  printf "\"%s\",%d\n", $f, $s[7];
}, @ARGV);

Flat files are hard.