When Ghosts Fall
Around 6 months ago, I wrote a blog post describing my migration to Ghost, and how I set it up on Fly.io, a platform-as-a-service that allowed me to own my data.
While I'm still thrilled with Ghost, my Fly.io machines recently failed, which required me to rebuild my site from scratch. I wanted to document the process I had to go through, along with the important caveats for anyone considering self-hosting Ghost with Fly.
My Fault
This is mostly an own-goal: I did not have backups.
To be totally clear, Fly.io's documentation is explicit about the fact that if you are running single machines for a given app — aka, running without horizontal scaling — you are in for potential heartache. The Fly.io architecture, like a lot of cloud hosting, considers some amount of failure to be part of the cost of doing business, and if you have all your eggs in one basket (or all your bytes in one server), you're opening yourself up to significant risk.
It is easy enough to create an automated backup process either via a local CLI command running on a cron (my current/temporary approach) or through something like scheduled GitHub Actions. I would strongly advocate for setting up backups from day 1; Fly.io is not joking when they warn you to expect failures. If anything, I was lucky to make it 6 months.
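For illustration, here is a minimal sketch of what a cron-driven backup could look like. It is not the script I actually run. It assumes the MySQL port has already been forwarded to the local machine (for example with something like `fly proxy 3306 -a <your-db-app>`), that `mysqldump` is installed, and that credentials live in a client options file. The file path, backup directory, database name, and retention count are all hypothetical placeholders.

```python
import datetime as dt
import gzip
import subprocess
import sys
from pathlib import Path

# All of these values are hypothetical; adjust them to your own setup.
BACKUP_DIR = Path.home() / "ghost-backups"
CREDENTIALS = Path.home() / ".ghost-backup.cnf"  # [client] section with user= and password=
DATABASE = "ghost"
KEEP = 14  # number of most recent dumps to retain


def backup_name(now: dt.datetime) -> str:
    """Timestamped file name, so that names sort chronologically."""
    return f"ghost-{now:%Y%m%d-%H%M%S}.sql.gz"


def dump_command(defaults_file: Path, database: str,
                 host: str = "127.0.0.1", port: int = 3306) -> list[str]:
    """Build the mysqldump invocation against the locally forwarded port."""
    return [
        "mysqldump",
        f"--defaults-extra-file={defaults_file}",  # must be the first option
        f"--host={host}",
        f"--port={port}",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        database,
    ]


def prune(directory: Path, keep: int) -> list[Path]:
    """Delete all but the newest `keep` dumps and return the deleted paths."""
    backups = sorted(directory.glob("ghost-*.sql.gz"))
    stale = backups[:-keep] if keep > 0 else backups
    for path in stale:
        path.unlink()
    return stale


def run_backup() -> Path:
    """Dump the database, compress it, then drop old dumps."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    target = BACKUP_DIR / backup_name(dt.datetime.now())
    # The whole dump is buffered in memory, which is fine for a small blog.
    result = subprocess.run(dump_command(CREDENTIALS, DATABASE),
                            check=True, capture_output=True)
    with gzip.open(target, "wb") as fh:
        fh.write(result.stdout)
    prune(BACKUP_DIR, KEEP)
    return target


if __name__ == "__main__" and sys.argv[1:] == ["run"]:
    print(run_backup())
```

A crontab entry along the lines of `0 3 * * * python3 /path/to/ghost_backup.py run` (path hypothetical) would then take a nightly dump. Because `check=True` raises when `mysqldump` fails, a broken backup produces a non-zero exit and a traceback that cron can mail to you, rather than failing silently.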
For the lack of backups, I take full responsibility. Procrastination loses me this round.
Also Not My Fault
On the other hand, I don't want to take all the credit here (although certainly I deserve 90% of the credit). For starters, I had actually tried to set up recurring backups on my Raspberry Pi. Unfortunately, the Fly.io CLI is incompatible with ARM devices. I recently purchased a new x64 home server which should be able to run Fly's CLI, so I'll be doing the backups on that machine, but this also feels like a limitation on Fly's part.
Additionally, the server hosting my apps appears to have crashed and rebuilt about 5 days before I discovered it. Those rebuilds, in theory, should have happened quietly in the background, and resolved the issues. However, the rebuilds did not work, causing my app's MySQL database machine to enter a failed state.
The app entered a failed state, yet I received no notification, even though I had Failed Release emails enabled. That notification could have made the difference here: because I discovered the issue 5 days after the failure, all of Fly.io's volume snapshots from before the failure had already expired.
Yes, Fly's snapshots expire after 5 days — and as far as I can tell, there's no way to extend this window even at an extra cost. Of course a "real" application would have had some sort of monitoring set up, but I think Fly could do a better job here of offering additional flexibility around snapshot retention (Fly.io competitor Platform.sh allows snapshots to be maintained for up to a year, depending on your plan and configuration), along with better alerting around issues within the platform.
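The kind of basic monitoring I was missing does not need to be elaborate. As a rough sketch (not something Fly.io provides, and not what I had running), a scheduled script could simply request the site and exit with an error when it does not answer. The function name and the URL passed on the command line are hypothetical.

```python
import sys
import urllib.error
import urllib.request


def check_site(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Return (healthy, detail) for a single HTTP GET against `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx or 5xx status.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"unreachable: {err}"


if __name__ == "__main__" and len(sys.argv) == 2:
    healthy, detail = check_site(sys.argv[1])
    print(f"{sys.argv[1]}: {detail}")
    # A non-zero exit lets cron (or a CI scheduler) flag the failure.
    sys.exit(0 if healthy else 1)
```

Run every few minutes from a machine outside Fly.io, a check like this would have surfaced the outage well inside the 5-day snapshot window.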
Better Database Support, Please?
This issue also could have been avoided if Fly.io's platform supported more managed databases beyond Postgres. The managed solution they offer enables automatic scaling, backups, etc. — but if you're using MySQL, you're out of luck. MySQL is difficult to horizontally scale (there's a reason RDS is so expensive), but in the end it is still incredibly popular, and the lack of a managed MySQL solution in Fly's platform feels like a real limitation.
That means as a Ghost user, you're going to need to bring your own infrastructure and be your own DBA. Again, getting regular backups going is not too painful, but you will need to spend the time working out a solution.
Rebuilding and the Future
Ultimately the failures were so catastrophic that I ended up needing to completely delete my applications and start from scratch. I was unable to get my machines running, even just to shell in and debug, because the filesystems were corrupted and unmountable.
Fortunately I did not have a ton of content, and all of it was backed up on the Internet Archive's Wayback Machine. The only real loss was an hour or so of my time and a draft blog post, which is not nothing, but it's not everything.
In spite of all this, I do plan to stick with Fly.io. I will email their support with some questions and suggestions (where were my failure emails? can I pay you to extend volume retention? can you give us native managed MySQL?), but at the end of the day this was my mistake. For all its issues, Fly.io's platform is fast, and it made quick work of getting the Ghost application back to a default state.
I am publishing my migration script to my website's GitHub repo, as a template for others to follow. My advice? Set up your database, then set up your backups immediately. Avoid my mistake, so you don't come crashing back down to earth.