When we talk about Fivetran, we normally focus on their SaaS connectors, like Zoho CRM and Microsoft Dynamics 365. But did you know that you can use the Fivetran FTP connector to pull data out of FTP drops and automatically sync the data into your data cloud?
In this essay, we’ll walk through the set up of an FTP Fivetran connector, and discuss some of the pitfalls and things to watch out for. To pull this off, we’re going to want an easy FTP server setup. For that, we’re going to turn to our old friend, Docker Compose.
Here’s a Compose YAML file to get you started:
version: '3' services: ftp: image: delfer/alpine-ftp-server ports: - 21:21 - 21000-21010:21000-21010 environment: - USERS=fivetran_ftp|Vaefietu7Reijaepheiv5shoiChai0ga - ADDRESS=127.0.0.1
A couple of things here. First, you’re definitely going to need to change that
127.0.0.1 bit out for the public IP address of your Docker host. Also, you’re going to need a public IPv4 address for Fivetran to connect to.
If you get any weird errors about invalid port mappings for port numbers that are far above the highest legal TCP port (65535) you may need to upgrade to a 2.x version of the Docker Compose plugin. The older 1.x Python implementation seems to have issues with port ranges. YMMV.
I used FileZilla to test out the connectivity and user authentication bits to make sure everything works before we get to Fivetran’s connector setup. I have one inviolable rule for dealing with technology: always check your assumptions. I encourage you to do the same.
Before we can land data from our FTP drop, we need two things: some files in the FTP site, and a landing database in something like Snowflake. First, the landing area. If you don’t already have a landing database/schema and a role for Fivetran to use, here’s some SQL to get that set up:
create role fivetran_vendor; create database explore1; grant create schema, monitor, usage on database explore1 to role fivetran_vendor; grant all on schema explore1.land to role fivetran_vendor; grant usage on warehouse a_virtual_warehouse to role fivetran_vendor; create user fivetran_user login_name = fivetran_user password = 'a-stronger-password' default_warehouse = a_virtual_warehouse; grant role fivetran_vendor to user fivetran_vendor;
(you can find this code in the Gitlab repo for this blog post.)
As for the files, here’s an example CSV with only a handful of rows so that we can experiment without incurring too much in MAR usage.
Now, onto the connector setup.
I used the following details, based on the above Compose recipe:
- Destination schema:
- Destination table:
fivetran_ftp(this is the FTP user)
- Host address: (your Docker host’s public IP or fully-qualified domain name)
21(standard FTP port, unless you’ve changed yours)
- Secure FTP:
off(we didn’t set up TLS encryption)
- Folder path:
- File pattern:
- Error handling:
Save & Test button will cause Fivetran to persist the configuration details and attempt to validate them with a live connection to the dockerized FTP server. You can watch what happens on the FTP side of things with a
docker logs ... command:
Changing password for fivetran_ftp New password: Retype password: passwd: password for fivetran_ftp changed by root Mon Aug 8 17:35:52 2022 [pid 38] CONNECT: Client "172.20.0.1" Mon Aug 8 17:35:52 2022 [pid 37] [one] OK LOGIN: Client "172.20.0.1" Mon Aug 8 17:36:18 2022 [pid 42] CONNECT: Client "172.20.0.1" Mon Aug 8 17:36:18 2022 [pid 43] CONNECT: Client "172.20.0.1" Mon Aug 8 17:36:18 2022 [pid 40] [one] OK LOGIN: Client "172.20.0.1" Mon Aug 8 17:36:18 2022 [pid 41] [one] OK LOGIN: Client "172.20.0.1" Mon Aug 8 17:36:18 2022 [pid 45] [one] OK UPLOAD: Client "172.20.0.1", "/ftp/fivetran_ftp/example.csv", 374 bytes, 332.83Kbyte/sec
If testing fails, check your connectivity (again, FileZilla is a huge help here), and tweak your Fivetran configuration until things start working.
Once Fivetran has verified connectivity to the FTP host, it will validate the downstream credentials for getting into Snowflake and populating the specified database, schema, and table with data.
Finally, Fivetran should perform an initial sync, and you should be able to interrogate the
explore1.land.example table in a Snowflake worksheet.
A few things to be aware of
Make sure your files share a common schema
The FTP connector imports each file into a table based on the intrinsic schema of those files. For CSVs, this means that the landing table is created based off of the header lines. It is very important that all files picked up by a single connector configuration are homogenous. That’s why we recommend setting the folder path and file pattern optional configuration settings. Even if you are grabbing every file on the FTP site today, being explicit about where and what to pick up will help you down the road.
Be specific about compression settings
It’s also a good idea to explicitly set the compression configuration to something other than the default (infer). Unless you plan to mix compression algorithms (i.e. gzip for some, bzip2 for others), or need to support both compressed and uncompressed files, it’s better to be explicit.
Different filenames are interpreted as different (and new!) files
We should also point out that the connector uses file names and UNIX
mtime timestamps to determine what it has and has not yet landed into your warehouse. This has particularly nasty ramifications for that mixed-mode of both compressed and uncompressed files.
Suppose you want to keep uncompressed TSV files around for a week before gzipping them up, in-place, to save on storage costs. Compressing each TSV that has exceeded this 7-day window will cause a new file (with a new inode, a different file name, and contemporaneous mtime) to spring into existence. Fivetran will dutifully pull down the archive, uncompress it, and re-land all that data a second time. Double the MAR, double the headache.
When you do find yourself in this situation, either by choice or by necessity, you’ll want to ignore the compressed files by supplying a file pattern setting that only matches uncompressed files.
Modification times can do bad things
While we’re on the topic of file freshness tracking, let’s talk about the consequences of an mtime bump. Let’s say you have a 20,000-record CSV file in your FTP drop, and that Fivetran has already ingested it. If you change the UNIX mtime, either via something like
touch(1) or by actually modifying the contents, Fivetran will re-ingest the whole file. This will count against your monthly active rows usage math, and could cost you more than you were expecting.
Note: these same caveats apply to the FTPS and SFTP connectors, which are fundamentally filesystem access protocols that uses the TLS and SSH protocols (respectively) for transport security. You still have to worry about modification times and file compression shenanigans.
Be safe; stay sane
At the end of the day, understanding how Fivetran is detecting changes and re-synchronizing your landed data is key to successful use of the FTP, FTPS and SFTP connectors.