On Windows, we’ve had the defrag tool and others that happily work on a drive even while it is in use, even the OS disk.
On Linux, I know of the fsck command, but that requires the drive in question to be unmounted. Not great when you want to check a running server. I do not want to stop my server and boot it from USB just to run a disk check. I can’t imagine that’s what the data centers are doing, either!
Surely some Linux tool exists that can do some basic checks on a running system?
Then what are they doing? It seems very cumbersome to have to take a drive offline for routine maintenance.
They don’t do anything.
They have lots and lots of redundancy, and when enough drives fail, they decommission the entire server and/or rack.
The big players operate at a very different scale than the rest of us.
Hardware-backed RAID, with error monitoring and patrol read. iSCSI or similar to present that to a virtualization layer. VMFS or similar atop that. Files atop that to represent virtual drives. Virtual machines atop that.
Patrol read starts catching errors long before SMART will. Those drives get replicated to (and replaced by) hot spares, online. Failing drives then get replaced with new hot spares.
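To make the hot-spare cycle concrete, here is a toy Python sketch of it. Real controllers do this in firmware; the function name, the drive labels, and the error threshold below are all made up for illustration, and the "rebuild" is reduced to swapping a slot in a list.

```python
def patrol_and_replace(array, spares, error_counts, threshold=1):
    """Swap member drives that reach the error threshold for hot spares.

    array        -- list of drive labels currently in the RAID set
    spares       -- list of idle hot-spare labels (consumed in order)
    error_counts -- dict of drive label -> errors found by patrol read
    threshold    -- hypothetical cutoff; real controllers use their own policy
    """
    replaced = []
    for slot, drive in enumerate(array):
        if error_counts.get(drive, 0) >= threshold and spares:
            spare = spares.pop(0)
            # Stands in for the online rebuild onto the spare.
            array[slot] = spare
            replaced.append((drive, spare))
    return replaced


if __name__ == "__main__":
    array = ["d0", "d1", "d2"]
    spares = ["s0"]
    print(patrol_and_replace(array, spares, {"d1": 3}))
    print(array, spares)
```

The drives listed in the return value are the ones someone would physically pull and replace with fresh spares, all without taking the array offline.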
But all of that is irrelevant, because at the enterprise level, they are scaling their applications horizontally, with distributed containers. So even if they needed to do fsck at the guest filesystem level (or even if they weren’t using virtualization) they would just redeploy the containers to a different node and then direct traffic away from the one that needs the maintenance.
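As a rough illustration of "redeploy the containers and direct traffic away", here is a minimal Python sketch. It is not how any particular orchestrator works; `Node`, `drain`, and `route` are hypothetical names, and placement is simply "least-loaded node wins".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    containers: list = field(default_factory=list)
    draining: bool = False


def drain(cluster, target_name):
    """Move every container off the target node and mark it as draining."""
    target = next(n for n in cluster if n.name == target_name)
    healthy = [n for n in cluster if n is not target and not n.draining]
    if not healthy:
        raise RuntimeError("no healthy node left to take the workload")
    target.draining = True
    while target.containers:
        container = target.containers.pop()
        # Place each container on the least-loaded healthy node.
        dest = min(healthy, key=lambda n: len(n.containers))
        dest.containers.append(container)
    return target


def route(cluster):
    """Return the names of nodes that may receive traffic."""
    return [n.name for n in cluster if not n.draining]


if __name__ == "__main__":
    cluster = [Node("a", ["web1", "web2"]), Node("b", ["web3"]), Node("c")]
    drain(cluster, "a")
    print(route(cluster))
    for node in cluster:
        print(node.name, node.containers)
```

Once node "a" is empty and out of the routing list, it can be unmounted, checked, or reimaged without anyone noticing.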
We don’t do maintenance. We just have redundancy and backups, then replace failed components.