In the world of cloud computing and hardware virtual machines, we are told not to worry about hardware failures. Disks come and go beneath our virtual machine volumes, and the NIC in our VM is virtual and might correspond to some sliver of time in a set of aggregated links.
But what if you operate your own datacenter? These things must fail, right? A disk is, after all, a stack of tiny pizza pans spinning at 10k RPM with metal wands shooting electrical currents onto its surface. That can't be safe. There's also a bunch of other garbage between your application and its data - cables, SAS expanders, HBAs, and backplanes! Can those fail too?
It turns out everything can fail, and everything does fail. Sometimes catastrophically and all at once!
This session is about real-life data path hardware failures, from ambient checksum errors to motherboard-eating power surges.
Kody is a software engineer at Joyent, where he enjoys analyzing and developing models of complex software systems. At Joyent, Kody has created software to monitor thousands of application instances, helped recover hundreds of terabytes of user data from cloud storage, fixed numerous low-level performance problems, and witnessed the failure of every hardware component in the IO path.