Circuit breakers broke bad, workload moved but array flipped out under heavy load

Salesforce.com has revealed that a bug in the firmware of its storage arrays was behind last week's data loss incident.

That mess took the company's NA14 instance offline, so Salesforce took steps to move the instance into a Chicago data centre.

Once timeout conditions began, a single database write could not complete, which caused a file discrepancy in the database.

The data loss came about because, although Salesforce's internal backup processes are designed to be near real-time, the local copy of the database had not yet completed.

Salesforce says the circuit breakers that started the mess passed March 2016 tests, but have been replaced anyway.

We do know, thanks to a 2013 post by site reliability engineer Claude Johnson, that Salesforce has in the past used ZFS and Solaris-powered servers for storage.
