NFS performance troubles

2026-03-18 tinkering nfs performance databases

Recently, I started having issues with self-hosted services running slow, despite ample memory reserves and load averages that were fractions of the number of available CPUs. That got pretty annoying, especially since I had recently set up Grafana alerts that send Telegram notifications when one of the services is slow or unreachable.

These alerts always came in waves spanning multiple services, so something that all the struggling machines had in common had to be the issue. The network turned out to be stable and fine, though.

When alerts about my NAS being unavailable joined the mix, things became a bit clearer.

Forgive me, great sysadmin in heaven, for I have sinned

When setting up some of my services, I took a shortcut that I now have to pay the price for. I have a solid backup strategy for data on my NAS, and was too lazy to set up a service-specific backup strategy for most of my services that generate persistent data. So, I thought I could piggy-back onto NAS backups by mounting NAS volumes via NFS on the machines that do the actual computation, so everything is “automatically” backed up.
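The setup looked roughly like this: an fstab entry on each compute node mounting a NAS export over NFS. The host name and paths below are placeholders, not my actual layout:

```
# /etc/fstab on a compute node (illustrative host and paths):
# nas.local:/export/services  /mnt/nas  nfs  defaults,_netdev  0  0
#
# Service data then lives under /mnt/nas/<service>, so the existing
# NAS backup job picks it up automatically.
```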

That’s fine for stuff like photos or videos - NFS can transfer large, infrequent stuff relatively efficiently.

The problem is that I also used these NFS volumes as storage backend for SQLite, Postgres, and VictoriaMetrics databases.

Databases typically make many small read and write operations. Operations that take microseconds on local storage take milliseconds on remote storage, and those delays pile up.
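A back-of-envelope calculation shows how fast that piles up. The latencies below are rough assumptions (~0.1 ms per synchronous operation on local flash, ~2 ms over NFS), not measurements from my setup:

```shell
ops=10000       # small I/O operations in one busy burst
local_us=100    # assumed ~0.1 ms per op on local storage
nfs_us=2000    # assumed ~2 ms per op over NFS

echo "local: $(( ops * local_us / 1000 )) ms total"
echo "nfs:   $(( ops * nfs_us / 1000 )) ms total"
```

The same burst that finishes in about a second locally takes twenty over the network.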

Adding to that, databases often lock files and flush changes quite aggressively. With NFS, locks are coordinated over the network by the server, and flush guarantees are weaker than on a local file system, which can lead to data consistency issues.
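The flush cost is easy to probe with dd: oflag=dsync forces every block to stable storage before the next one is written, which is roughly what a database flush does. This is a crude probe, not a benchmark; point the output file at a directory on the NFS mount to compare it against local disk:

```shell
# Write 100 x 4 KiB blocks, flushing each one before writing the next.
# Replace /tmp with a directory on the NFS mount to see the difference.
dd if=/dev/zero of=/tmp/flushtest bs=4k count=100 oflag=dsync
rm -f /tmp/flushtest
```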

Ignoring the warning shot

I should have known better: A few months back, the microSD card that provided the root file system for the Raspberry Pi 3 running VictoriaMetrics died and the OS crashed.

At first, it looked like this incident would validate my decision to put everything on NFS, because recovery would only take flashing a fresh microSD card with a basic OS and running docker compose up -d (that was before I dug up Ansible again).

That almost worked, but VictoriaMetrics refused to start. It turned out the data was corrupted, and I couldn’t un-corrupt it even with LLM assistance, so I had to drop months’ worth of historic data and start fresh.

The incident showed that databases rely on local file system semantics, such as fsync actually persisting data, to keep their files consistent, and NFS doesn’t provide the same guarantees. I should have listened to that warning.

Moving forward

I’ve now switched database-dependent services to databases backed by local storage, and set up backup scripts that regularly write and rotate backups on NFS volumes. This will increase the wear on the SD cards a little, but I’ll have to live with that, because replacing all the existing devices with something SSD-capable for better durability would be costly.
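For the SQLite-backed services, the backup script boils down to something like this sketch. The paths are placeholders (a scratch directory stands in for the real database and the NFS mount), and the seven-day retention is just my choice:

```shell
#!/bin/sh
set -eu

DB=/tmp/demo/app.db          # stand-in for the service's local database
DEST=/tmp/demo/backups       # stand-in for the NFS-mounted backup dir
STAMP=$(date +%Y%m%d-%H%M%S)

mkdir -p "$(dirname "$DB")" "$DEST"
# Create a tiny demo database so the sketch is self-contained.
sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS t(x); INSERT INTO t VALUES (1);'

# .backup takes a consistent snapshot even while the service is writing.
sqlite3 "$DB" ".backup '$DEST/app-$STAMP.db'"

# Rotate: drop snapshots older than seven days.
find "$DEST" -name 'app-*.db' -mtime +7 -delete
```

Postgres gets the same treatment with pg_dump instead of .backup, and a cron entry or systemd timer runs the script nightly.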

The effect of this improvement was striking: NFS latency dropped from tens of seconds to low milliseconds. The number of NFS IO operations per second decreased from thousands to below a hundred.

When the next microSD card fails, I’ll notice the things I missed in the backups, but that’s just an opportunity for better automation.