According to a recent blog post, Google and Seagate are using machine learning to improve fault tolerance for disks. Google knows full well the impact of disk failures, which can cause major outages in infrastructure.

Google has teamed up with Seagate to improve the whole setup. Hard disks are still being installed in huge quantities in large data centers, and most likely another EB or two of disks will be sold by 2024. This vast amount of storage comes at a high cost: the cost per PB also has to account for servers, buildings, and so on.

Millions of disks are deployed in operation, and together they generate terabytes (TBs) of raw telemetry data. This includes billions of rows of hourly SMART (Self-Monitoring, Analysis and Reporting Technology) data and host metadata, such as repair logs, Online Vendor Diagnostics (OVD) or Field Accessible Reliability Metrics (FARM) logs, and manufacturing data about each disk drive.
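
To make the shape of that telemetry more concrete, here is a minimal sketch of how hourly SMART rows might be collapsed into one feature record per drive and joined with per-drive metadata. It is not taken from the Google and Seagate work: the field names (`serial`, `hour`, `reallocated_sectors`, `model`), the `build_features` helper, and the sample rows are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def build_features(smart_rows, drive_metadata):
    """Collapse hourly SMART rows into one feature record per drive.

    Field names are assumptions made for this sketch, not a real schema.
    """
    # Group the hourly rows by drive serial number.
    rows_by_drive = defaultdict(list)
    for row in smart_rows:
        rows_by_drive[row["serial"]].append(row)

    features = {}
    for serial, rows in rows_by_drive.items():
        # Order each drive's rows by time so first/last are meaningful.
        rows.sort(key=lambda r: r["hour"])
        first, last = rows[0], rows[-1]
        meta = drive_metadata.get(serial, {})
        features[serial] = {
            "model": meta.get("model"),  # None when metadata is missing
            "samples": len(rows),
            "reallocated_sectors": last["reallocated_sectors"],
            # Growth of the counter over the observed window.
            "reallocated_delta": last["reallocated_sectors"]
            - first["reallocated_sectors"],
        }
    return features


# Made-up illustrative rows, not real telemetry.
smart_rows = [
    {"serial": "A1", "hour": 0, "reallocated_sectors": 0},
    {"serial": "A1", "hour": 1, "reallocated_sectors": 8},
    {"serial": "B2", "hour": 0, "reallocated_sectors": 0},
]
drive_metadata = {"A1": {"model": "model-x"}}
features = build_features(smart_rows, drive_metadata)
```

A record like this, one per drive, is the kind of input a failure-prediction model could be trained on. At the scale described above the same idea would need a distributed data pipeline rather than in-memory dictionaries.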

The Google Cloud AI Services team (Professional Services), along with Accenture, helped Seagate build a proof of concept based on the two most common drive types. 

In the past, when a disk was flagged for a problem, the main fix was to repair it on site using software. This procedure was expensive and time-consuming: it required draining the data from the drive, isolating the drive, running diagnostics, and then re-introducing it to traffic. This is the same process used at Hardcore Games when a disk is flagged.
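
The repair procedure described above can be pictured as a simple state machine. The sketch below is only an illustration of the four steps in order. The state names and the `repair_path` helper are hypothetical, and nothing here reflects the actual tooling used by Google or Hardcore Games.

```python
from enum import Enum, auto


class DiskState(Enum):
    FLAGGED = auto()     # a problem was detected on the disk
    DRAINED = auto()     # data has been moved off the drive
    ISOLATED = auto()    # drive no longer receives traffic
    DIAGNOSED = auto()   # diagnostics have been run
    IN_SERVICE = auto()  # drive is re-introduced to traffic


# Each state maps to the single step that follows it in the procedure.
NEXT_STATE = {
    DiskState.FLAGGED: DiskState.DRAINED,
    DiskState.DRAINED: DiskState.ISOLATED,
    DiskState.ISOLATED: DiskState.DIAGNOSED,
    DiskState.DIAGNOSED: DiskState.IN_SERVICE,
}


def repair_path(start=DiskState.FLAGGED):
    """Return the ordered list of states a disk passes through from `start`."""
    path = [start]
    state = start
    while state in NEXT_STATE:
        state = NEXT_STATE[state]
        path.append(state)
    return path


steps = [state.name for state in repair_path()]
```

Every transition takes the drive out of useful service for some time, which is why the procedure is costly when it is triggered for disks that turn out to be healthy.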