As for ZFS stopping the corruption altogether, you don't know that. The corruption could have been caused by a software glitch in the VM software, git, or the VM guest OS, and ZFS wouldn't have detected it at all, since the data could have been bit-perfect yet still invalid. It might have been data corruption, but perhaps not filesystem-level corruption.
So while ZFS may or may not have helped here, it's still no substitute for backups.
There is too much churn in the data, which makes it difficult, if not impossible, to take a snapshot or backup of their main server at any given time and be guaranteed it's consistent.
Typically you would take the server offline for maintenance in these kinds of situations, but KDE believed they had a better backup solution in git --mirror.
Unfortunately it appears git --mirror is a lot like rsync --delete where if things get corrupted at the source, that corruption will be mirrored on sync. KDE claims this isn't properly documented behavior of --mirror, and they're probably right.
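To make that failure mode concrete, here is roughly what a mirror sync looks like (the URL and paths are made up; the commands and flags are standard git). Because a --mirror clone's fetch refspec forces every local ref to match the source, refs that were deleted or rewritten upstream simply replace the previously good copies on the next sync:

    # one-time setup of the mirror (hypothetical URL and path)
    git clone --mirror git://anongit.example.org/kdelibs.git /srv/mirror/kdelibs.git

    # periodic sync, e.g. from cron; --prune drops refs that have vanished
    # upstream, so an emptied or damaged source repo drags the mirror down with it
    cd /srv/mirror/kdelibs.git && git remote update --prune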
Yes, KDE had tarballs, but they didn't have tarballs for backup purposes. The tarballs were individual tarballs of each repository, meant to make it easier to download repository contents; they did not contain everything needed to restore all the content that was on the git server. git bundle seems to suffer from the same problem in that it just doesn't copy everything. The only way to copy everything and keep the main server online 24/7, AFAIK, is with --mirror.
So KDE was right that git --mirror was probably the best way to go if they didn't want to take the server down for maintenance. What they failed to do was:
1. Take a mirror server offline and run a git fsck to make sure that what it mirrored was sane.
1a. AFAIK, all of KDE's mirrors were always online, and KDE claims that since the servers were all online, the data churn on them made it impossible to get a sane snapshot for tarballs, ZFS, or any other backup purpose. It should therefore have been obvious to KDE, IMO, that they needed to take either the main server or one of the mirrors offline to do proper backups. Since everything mirrors off of the main server, and git fsck is known to take forever on 1500 repositories (as KDE claims), it makes sense to take one of the mirrors down and use it to do the backups.
1b. KDE claims the corruption might have started months ago. If they ran git fsck on any of the mirrors after doing a mirror and taking it offline, they would have discovered the problem months ago.
2. After running git fsck on the mirror, they should then take a snapshot / backup of it. They'd be guaranteed that all their snapshots / backups were sane, because the mirror is offline so nobody is pushing to it, and it's also git fsck clean. (A rough sketch of this fsck-then-snapshot cycle is below this list.)
3. Keep backups that date back a solid year. I'm shocked that people still don't do this, but it should be obvious. Also, make sure those backups are stored at least a couple hundred miles away from your nearest online server, because I've known plenty of companies that spent a lot of money doing backups to tape, only to have them wiped out in the same natural disaster.
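A minimal sketch of that fsck-then-snapshot cycle, assuming the mirror keeps its bare repositories under /srv/mirror on a ZFS dataset named tank/git-mirror (both names invented here):

    # take the mirror out of rotation first (stop the git daemon, pull it from
    # DNS, whatever fits the setup) so nothing can push to it during the check

    failed=0
    for repo in /srv/mirror/*.git; do
        git --git-dir="$repo" fsck --full || { echo "BROKEN: $repo"; failed=1; }
    done

    # only snapshot the dataset if every repository came back clean
    [ "$failed" -eq 0 ] && zfs snapshot tank/git-mirror@verified-$(date +%F)

Run that on whatever schedule the hardware can stomach; even a weekly pass would have surfaced this kind of problem far earlier.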
In any case, ZFS is still not a substitute for backups, so a slightly different failure mode could have been just as catastrophic.
With that said, the only people claiming ZFS is a substitute for backups are those trying to refute the idea, such as yourself. Nobody who actually uses ZFS makes such claims.
ZFS (and BTRFS) are a *possible backup* solution, but that has not much to do with checksumming, bit-perfect copies, or being bullet-proof in any other way.
Shit happens. Today it was a bad fsck.ext4, tomorrow it might be something completely different (hardware failure, a bug in git itself, whatever)...
No matter how much you argue that problem XyZ would have been prevented by that special feature #1322 of filesystem AbC, one day some unpredicted problem will come along and fry your setup. It's not a matter of robustness, it's a matter of *TIME*. All things fail eventually, for one reason or another.
A proper backup solution isn't a mirror. A proper backup solution is the one that can answer the case: "Oh shit! I don't have a proper copy of file Foo.Bar; I need to go back a few months to when I still had the correct version around".
ZFS and BTRFS happen to also be good solutions for that. Not because of robustness, but because they feature "copy-on-write", data deduplication and so on, so snapshotting comes for "free".
If your precious data is periodically duplicated (simply mirrored) to a backup server which runs BTRFS and does periodic snapshots (daily for the last 2 weeks, weekly for the last 2 months, monthly for the last year, and yearly since the beginning of the project), you would be safe even if the data got silently corrupted, with nobody noticing, and the corrupted files got mirrored/duplicated along with everything else.
Should you realise that garbage has appeared in your files due to some broken fsck you ran last week, just grab the copy from the day prior to the problem from the backup.
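A minimal version of that rotation, assuming the mirrored data lives in a btrfs subvolume at /backup/data (paths and names invented; expiring old snapshots is left to a separate cron job or a tool like snapper):

    # daily, from cron: read-only snapshot of the backup subvolume
    btrfs subvolume snapshot -r /backup/data /backup/snapshots/data-$(date +%F)

    # recovery is just copying files back out of a snapshot taken before the damage
    cp -a /backup/snapshots/data-YYYY-MM-DD/repos/foo.git /restore/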
BTRFS and ZFS are just more modern alternatives to what was previously achieved with regular POSIX-compliant filesystems and a combo of rsync + hardlinks + cron. rsync isn't a magical backup bullet because it can make bit-perfect copies by checksumming everything; it's a good backup solution because, when configured correctly, the hardlinks can keep older versions around without eating too much space.
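For reference, that trio looks roughly like this (host and paths are made up): every day gets what looks like a full copy, but unchanged files are hardlinked against the previous day's copy, so only the changes actually consume space.

    today=$(date +%F)
    yesterday=$(date -d yesterday +%F)   # GNU date
    rsync -a --delete \
        --link-dest=/backup/$yesterday/ \
        gitserver:/srv/git/ /backup/$today/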
So yes, ZFS will be a good solution. Not because it has some magical immunity to problems making it a *substitute* for classic backups (yes, it is resilient to some types of problems, and that comes as a plus), but because it can be configured to work as an *actual backup*: something which helps you move back in time to before the problem happened. (As does BTRFS, and as did the rsync/hardlink/cronjob trio before CoW filesystems became the latest craze.)
Or in their specific case, KDE could have a bunch of servers whose "git --mirror" lags behind on purpose, so that they contain older versions of the git repositories.
Compared to the other solutions (btrfs/zfs, rsync+hardlinks+cronjob, etc.) it is a bit more expensive, as it requires more resources (this setup doesn't use any form of copy-on-write to reduce the duplication).
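One crude way to get that lag, sketched here with made-up paths and a single repository per mirror for brevity, is simply to sync the different mirrors on different cron schedules:

    # crontab on the backup host
    0 *  * * *  cd /srv/mirror-hourly/repo.git  && git remote update --prune
    0 3  * * 0  cd /srv/mirror-weekly/repo.git  && git remote update --prune
    0 3  1 * *  cd /srv/mirror-monthly/repo.git && git remote update --prune

The weekly and monthly mirrors then act as coarse "snapshots" of the repository at older points in time, at the cost of a full extra copy each.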
The key take-home that everybody seems to be missing is VIRTUAL MACHINE. It doesn't matter what filesystem they were using if the filesystem itself isn't the cause of its own corruption.
Now, as for data protection in this kind of problem, there are multiple kinds of protection that can and should be implemented here. To start with, it was a git server. Mirroring over git does build in a degree of safety, because you can step back in time to before the corruption began. If instead the git servers were being mirrored at the filesystem level, that would just replicate the corruption and make it impossible to recover.
Backups are a second mechanism for ensuring data safety. Incremental backups using a snapshot-capable filesystem like btrfs, for example, or just brutal permanent backups to which you can roll back.
The people at KDE *are* competent. They know their way around administration. But they ran into 2 sets of problems:
- Technical constraints. It's not easy to deploy a perfect backup solution with limited resources.
- A corner case: they hit a situation where GIT doesn't behave as they thought it would, and corruption got passed around.
Speaking of filesystems, that's also a problem they pointed out. One of the problems they had is that there is no way to make sure the files on the filesystem are consistent while git is live (well, that's normal in a way). The only way would be to force git to check its own consistency before making the tarball, rsync'ing, etc.
But that is computationally expensive (and again, they have limited resources). Again, they know their way around administration. Any good sysadmin can in theory set up a perfect backup system; the problem is doing it on the scale of KDE while on a shoestring budget.
Either you freeze the whole GIT for a long time and wait until everything settles down in order to have consistent files to back up (but that blocks users).
Or you need a dedicated machine, which pulls an (imperfect) mirror of the repos, then does the consistency checks, and only then performs a classic filesystem backup (but in this case, this requires CPU cycles, which aren't cheap in a data center).
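Something along these lines on that dedicated box (all paths hypothetical), with the expensive git fsck gating the classic backup so that garbage never makes it into the archive:

    for repo in /srv/staging/*.git; do
        # refresh the (possibly imperfect) mirror of this repository
        git --git-dir="$repo" remote update --prune
        # skip the classic backup for anything that fails the consistency check
        git --git-dir="$repo" fsck --full || { echo "skipping broken repo: $repo"; continue; }
        rsync -a "$repo" /backup/git/
    done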
The alternative they have is to keep whole, older git mirrors. Now their problem is that with ~25GB per complete copy of the repositories, keeping enough of them adds up to around 1TB of backup data. And, although a simple 1TB drive is cheap in a parts shop, it's going to be quite expensive again in a data center.
In the end they decided to:
- update their current backup strategy so it doesn't hit the corner case that caused all this mess, and put a better strategy in place to detect when this kind of situation manages to happen (intelligently monitor the repos-list and trigger a warning if anything weird happens to it; see the sketch after this list).
- keep a couple of older mirrors (within what they can afford; it will only be a couple of backups, not a few dozen spanning the full range of daily/weekly/monthly backups)
- set up a new dedicated server which does the mirror + consistency check + classic file-backup cycle.
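As a sketch of what that monitoring could look like (paths, file names and the alert address are all invented), even something as dumb as alerting when the number of mirrored repositories shrinks goes in the same direction:

    current=$(find /srv/mirror -maxdepth 1 -name '*.git' | wc -l)
    previous=$(cat /var/lib/mirror-check/repo-count 2>/dev/null || echo "$current")
    if [ "$current" -lt "$previous" ]; then
        echo "repo count dropped from $previous to $current" \
            | mail -s "git mirror sanity check" admin@example.org
    fi
    echo "$current" > /var/lib/mirror-check/repo-count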