When we talk about Fivetran, we normally focus on their SaaS connectors, like Zoho CRM and Microsoft Dynamics 365. But did you know that you can use the Fivetran FTP connector to pull data out of FTP drops and automatically sync the data into your data cloud?

In this essay, we’ll walk through the setup of a Fivetran FTP connector and discuss some of the pitfalls and things to watch out for. To pull this off, we’re going to want an easy FTP server setup. For that, we’re going to turn to our old friend, Docker Compose.

Here’s a Compose YAML file to get you started:

version: '3'
services:
  ftp:
    image: delfer/alpine-ftp-server
    ports:
      - "21:21"
      - "21000-21010:21000-21010"
    environment:
      - USERS=fivetran_ftp|Vaefietu7Reijaepheiv5shoiChai0ga
      - ADDRESS=127.0.0.1

A couple of things here. First, you’re definitely going to need to swap that 127.0.0.1 out for the public IP address of your Docker host; the ADDRESS value is what the server advertises to clients for passive-mode data connections. Second, Fivetran connects from the public internet, so your Docker host needs a publicly reachable IPv4 address.

If you get any weird errors about invalid port mappings for port numbers far above the highest legal TCP port (65535), you may need to upgrade to a 2.x version of the Docker Compose plugin. The older 1.x Python implementation seems to have issues with port ranges. YMMV.
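
Assuming you’ve saved the YAML above as docker-compose.yml on the Docker host, bringing the server up and keeping an eye on it looks like this:

docker compose up -d
docker compose logs -f ftp    # watch the server start, and (later) Fivetran's logins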

I used FileZilla to test out the connectivity and user authentication bits to make sure everything works before we get to Fivetran’s connector setup. I have one inviolable rule for dealing with technology: always check your assumptions. I encourage you to do the same.
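
If you’d rather stay in the terminal, curl speaks FTP well enough for a quick check of the same assumptions; YOUR_PUBLIC_IP below is a stand-in for your Docker host’s address:

# A clean directory listing (even an empty one) means both connectivity and
# authentication are working.
curl -v --user fivetran_ftp:Vaefietu7Reijaepheiv5shoiChai0ga ftp://YOUR_PUBLIC_IP/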

Before we can land data from our FTP drop, we need two things: some files on the FTP site, and a landing database in something like Snowflake. First, the landing area. If you don’t already have a landing database/schema and a role for Fivetran to use, here’s some SQL to get that set up:

create role fivetran_vendor;
create database explore1;
create schema explore1.land;

grant create schema, monitor, usage
  on database explore1
  to role fivetran_vendor;

grant all
  on schema explore1.land
  to role fivetran_vendor;

grant usage
  on warehouse a_virtual_warehouse
  to role fivetran_vendor;

create user fivetran_user
  login_name        = fivetran_user
  password          = 'a-stronger-password'
  default_warehouse = a_virtual_warehouse;

grant role fivetran_vendor
  to user fivetran_user;

(You can find this code in the GitLab repo for this blog post.)
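
It doesn’t hurt to verify that the grants landed where you expect before handing the credentials to Fivetran. Here’s a quick sketch using the SnowSQL CLI (a worksheet works just as well), assuming your SnowSQL connection parameters are already configured:

snowsql -q "show grants to role fivetran_vendor;"
snowsql -q "show grants to user fivetran_user;"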

As for the files, here’s an example CSV with only a handful of rows, so that we can experiment without racking up too much MAR (monthly active rows) usage.
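
The exact contents don’t matter much; the columns and rows below are invented purely for illustration, and any small CSV with a header row will do:

cat > example.csv <<'EOF'
id,first_name,last_name,signup_date
1,Ada,Lovelace,2022-07-01
2,Alan,Turing,2022-07-02
3,Grace,Hopper,2022-07-03
EOF

# Drop it onto the FTP server. With this image, the login lands in the user's home
# directory, which should be /ftp/fivetran_ftp (the folder path we'll give Fivetran below).
curl -T example.csv --user fivetran_ftp:Vaefietu7Reijaepheiv5shoiChai0ga ftp://YOUR_PUBLIC_IP/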

Now, onto the connector setup.

[Screenshot: the Fivetran FTP connector setup form]

I used the following details, based on the above Compose recipe:

  • Destination schema: land
  • Destination table: example
  • User: fivetran_ftp (this is the FTP user)
  • Password: Vaefietu7Reijaepheiv5shoiChai0ga
  • Host address: (your Docker host’s public IP or fully-qualified domain name)
  • Port: 21 (standard FTP port, unless you’ve changed yours)
  • Secure FTP: off (we didn’t set up TLS encryption)
  • Folder path: /ftp/fivetran_ftp
  • File pattern: .*\.csv$
  • Compression: uncompressed
  • Error handling: fail

Clicking the Save & Test button will cause Fivetran to persist the configuration details and attempt to validate them with a live connection to the dockerized FTP server. You can watch what happens on the FTP side of things with a docker logs ... command:

Changing password for fivetran_ftp
New password:
Retype password:
passwd: password for fivetran_ftp changed by root
Mon Aug  8 17:35:52 2022 [pid 38] CONNECT: Client "172.20.0.1"
Mon Aug  8 17:35:52 2022 [pid 37] [one] OK LOGIN: Client "172.20.0.1"
Mon Aug  8 17:36:18 2022 [pid 42] CONNECT: Client "172.20.0.1"
Mon Aug  8 17:36:18 2022 [pid 43] CONNECT: Client "172.20.0.1"
Mon Aug  8 17:36:18 2022 [pid 40] [one] OK LOGIN: Client "172.20.0.1"
Mon Aug  8 17:36:18 2022 [pid 41] [one] OK LOGIN: Client "172.20.0.1"
Mon Aug  8 17:36:18 2022 [pid 45] [one] OK UPLOAD: Client "172.20.0.1", "/ftp/fivetran_ftp/example.csv", 374 bytes, 332.83Kbyte/sec

If testing fails, check your connectivity (again, FileZilla is a huge help here), and tweak your Fivetran configuration until things start working.

Once Fivetran has verified connectivity to the FTP host, it will validate the downstream credentials for getting into Snowflake and populating the specified database, schema, and table with data.

Finally, Fivetran should perform an initial sync, and you should be able to interrogate the explore1.land.example table in a Snowflake worksheet.
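
Here’s a minimal smoke test from the SnowSQL CLI (the same query works fine in a worksheet), assuming your connection defaults are set and your role can see the land schema:

snowsql -w a_virtual_warehouse -q "select * from explore1.land.example limit 10;"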

A few things to be aware of

Make sure your files share a common schema

The FTP connector imports each file into a table based on the intrinsic schema of those files. For CSVs, this means that the landing table is created from the header line. It is very important that all files picked up by a single connector configuration are homogeneous. That’s why we recommend setting the optional folder path and file pattern configuration settings. Even if you are grabbing every file on the FTP site today, being explicit about where and what to pick up will help you down the road.
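
A cheap way to check that homogeneity before the files ever hit the FTP drop is to compare header rows. A sketch, assuming GNU coreutils and that the candidate files all sit in one directory:

# Print the first line of every CSV, de-duplicated; exactly one line of output
# means every file shares the same header.
head -q -n 1 *.csv | sort -u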

Be specific about compression settings

It’s also a good idea to explicitly set the compression configuration to something other than the default (infer). Unless you plan to mix compression algorithms (e.g., gzip for some files, bzip2 for others), or need to support both compressed and uncompressed files, it’s better to be explicit.

Different filenames are interpreted as different (and new!) files

We should also point out that the connector uses file names and UNIX mtime timestamps to determine what it has and has not yet landed in your warehouse. This has particularly nasty ramifications for the mixed mode of compressed and uncompressed files mentioned above.

Suppose you want to keep uncompressed TSV files around for a week before gzipping them in place to save on storage costs. Compressing each TSV that has exceeded this 7-day window will cause a new file (with a new inode, a different file name, and a fresh mtime) to spring into existence. Fivetran will dutifully pull down the archive, uncompress it, and re-land all that data a second time. Double the MAR, double the headache.
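
For the curious, the housekeeping job that triggers this looks something like the sketch below; the path and the 7-day window are just the ones from our example:

# Gzip, in place, any TSV older than seven days. Each foo.tsv becomes foo.tsv.gz:
# a different file name with a fresh mtime, which Fivetran treats as a brand-new file.
find /ftp/fivetran_ftp -name '*.tsv' -mtime +7 -exec gzip {} \;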

When you do find yourself in this situation, either by choice or by necessity, you’ll want to ignore the compressed files by supplying a file pattern setting that only matches uncompressed files; for example, a pattern like .*\.tsv$ matches the plain TSVs but skips their .tsv.gz counterparts.

Modification times can do bad things

While we’re on the topic of file freshness tracking, let’s talk about the consequences of an mtime bump. Let’s say you have a 20,000-record CSV file in your FTP drop, and that Fivetran has already ingested it. If you change the UNIX mtime, either via something like touch(1) or by actually modifying the contents, Fivetran will re-ingest the whole file. This will count against your monthly active rows (MAR) usage and could cost you more than you were expecting.
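
Concretely, even a metadata-only bump is enough (the file name here is hypothetical):

# No content change at all, but the mtime moves forward, so Fivetran
# re-ingests all 20,000 rows on its next sync.
touch /ftp/fivetran_ftp/big_export.csv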

Note: these same caveats apply to the FTPS and SFTP connectors, which are fundamentally file transfer protocols that use TLS and SSH (respectively) for transport security. You still have to worry about modification times and file compression shenanigans.

Be safe; stay sane

At the end of the day, understanding how Fivetran is detecting changes and re-synchronizing your landed data is key to successful use of the FTP, FTPS and SFTP connectors.
