MYSQL DATABASE ERROR

MySQL 8 crashed with the database at 95GB. Ordinarily it is about 10MB. In a larger server environment this might be considered silent data corruption, given there were no systemic crashes or other obvious failures.

Manual checking did not show any issues, and ordinarily the SSD controller handles TRIM transparently, so how the database became so large is eye opening. The du command can report the amount of storage used by everything on the file system.

sudo du -a | sort -n
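
A more targeted check along the same lines, assuming the default Debian/Ubuntu data directory of /var/lib/mysql, would be:

sudo du -h --max-depth=1 /var/lib/mysql | sort -h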

The MySQL files were 95GB, which is insane. There was no evidence of a SQL injection attack or any of the other usual suspects. In short, this should not be happening at any time. SQL's OPTIMIZE TABLE can normally clean up a bloated table easily, so this is a really bizarre problem.
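
As a sketch, and assuming a standard WordPress schema with a database named wordpress and a table such as wp_posts, the optimize pass can be run from the shell like this:

mysql -u root -p wordpress -e "OPTIMIZE TABLE wp_posts;"

Alternatively, mysqlcheck -u root -p --optimize --all-databases will sweep every table at once.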

More problems will take time to correct. It took 48 hours to recover the site files and install Linux fresh. Day one was spent recovering as much as possible and freeing enough space for mysqldump; the site files downloaded fine over FTP. Day two was spent installing Linux fresh and preparing it for WordPress, which is a complex task in its own right.
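
For reference, a dump of the kind used here can be produced with mysqldump; the database name wordpress is an assumption and should be replaced with the actual schema:

mysqldump -u root -p --single-transaction wordpress | gzip > wordpress-backup.sql.gz

Piping through gzip keeps the dump small, which mattered here because free space was the whole problem.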

The SSD was unwritable because there was no free space. A 240GB SSD should be more than adequate, but when checking the system for what was eating up the space, the MySQL files dominated the storage. I have inquired on several forums and will continue to investigate.
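
A quick way to confirm the drive is genuinely full, rather than out of inodes, is to check both views:

df -h
df -i

df -h shows space used per file system and df -i shows inode usage; here it was the space that was exhausted.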

For now the site will use plain URLs until the settings and .htaccess are cleaned up enough for the site to function without errors or JSON problems.

Media is also being cleaned up. For the time being, pre-generated thumbnails are giving way to using whatever image is available and resizing it dynamically.

URLs are presently plain until I discover what is preventing pretty URLs from working without HTTP 404 errors. The plain URLs were all that worked.

Editing /etc/apache2/apache2.conf and changing AllowOverride to All fixes the URL rewriting:

<Directory /var/www/>
        Options Indexes FollowSymLinks
        AllowOverride All
        Require all granted
</Directory>
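
WordPress permalinks also depend on the rewrite module; if it is not already enabled, it can be switched on with:

sudo a2enmod rewrite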

Now you need to restart Apache2 to enable the change:

sudo systemctl restart apache2

Now the permalink URLs will work properly.
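
For completeness, the stock rewrite rules that WordPress writes into .htaccess when pretty permalinks are enabled look like this (WordPress normally maintains this block itself):

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress

With AllowOverride All in place, Apache reads this file and the pretty URLs resolve instead of returning 404.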

A new feature showing guests and bots gives a better view of the site's traffic over the day. There is an administrator view, but exposing the counts to users via a widget seems viable too. This tool shows anyone who visits a clear picture of the site's activity. Nobody else seems to do this, so it is an innovative option. The new tool was immediately eye opening with the number of crawlers mauling the site.

While working to fix the site, your webmaster hardly ate anything. Lots of very strong coffee was needed to work through the problems and get the site back in business.

FACEBOOK AND DATA ERRORS

Facebook obviously has a far larger database than Hardcore Games. Silent data corruption can cause extensive problems, and it is especially devastating at Facebook scale, but engineering teams at the social giant have developed strategies to keep a local problem from going global. A single hardware-rooted error can cascade into a massive problem when multiplied at hyperscale, and for Facebook, keeping this at bay takes a combination of hardware resiliency, production detection mechanisms, and a broader fault-tolerant software architecture.

Engineers found that many of the cascading errors originate in CPUs in production, and not always from the “soft errors” caused by radiation or reproduced by synthetic fault injection. Rather, they find these errors can occur on individual CPUs in repeatable ways. Although ECC is useful, it is focused on problems in SRAM, while other elements remain susceptible.

Facebook used a few reference applications to highlight the impact of silent data corruption at scale, including a Spark workflow that runs millions of wordcount computations per day, along with FB’s compression application, which similarly performs millions of compression/decompression operations daily.

Facebook has frequently posted questions for database vendors to consider in their offerings.