blog
Automate Database Health Check
Ensuring the database is healthy is one of the critical and important things that a database administrator must do. If we neglect the maintenance of the database, there is a higher probability that we run into a problem; for example, a database performance issue caused by a workload that has changed over time, or misconfiguration that leads to a data breach.
Regular checking of the configuration, resource utilization, backup and restore procedures, data security, query performance can help avoid database issues. We need to have a standard database check for our database environment, so we can monitor if the database is still under control.
What is a Database Health Check
A database health check consists of a number of tasks to check the condition of our database. Why do we need to run a health check? We need to identify and fix any issues or anomalies in our database environment, whether it is a performance issue, configuration issue, or something that can cause an outage.
We can breakdown the health check into a few categories:
-
Security check, to lock down access to the database and ensure the traffic comes from a trusted network with the right privileges.
-
Configuration check, to ensure the configuration meets the standard criteria that have been defined by the organization.
-
Performance check, to ensure that the database is making use of the hardware resources and is responsive to applications.
-
Backup and Restore procedures, to ensure that the backup we took from the database can be restored.
From those categories, we can make a breakdown of what we need to check in the database. This is very important, so we can have a detailed check on every aspect. For example:
-
Database Security
-
Compare the user and privileges on the database with the user matrix access we have
-
Check the whitelist IP address on the database, whether traffic is coming from a trusted network
-
Ensure audit logging of the database is enabled
-
Configuration Check
-
Check the SSL already in place
-
Ensure database configuration is correct (both permission and ownership)
-
Performance check
-
Check the cache hit ratio of the database
-
Ensure the database connection still enough to handle the traffic
-
Backup and Restore procedure
-
Proper backup schedule, which delivers the agreed RPO
-
Ensure we test backups so we know the data is restorable
Based on the above list, we can create a script to check those items on each type of database (eg. MySQL, PostgreSQL, MongoDB). Each database type will obviously have different commands.
Automating Database Healthchecks
We do not want to run a repetitive task on a weekly or monthly basis, the database health check is a time-consuming task. We run the script on every database node, so automation of the health checks saves us quite some time.
Based on the list of health check script, we can create bash script to run the tasks and schedule it with cron. Below is just sample of a simple database health check report:
#!/bin/sh
# Simple database check report
username = "audit_user"
password = "pwd001"
hostname = "db01-payment"
mycnf = "/etc/mysql/my.cnf"
dt=$(date '+%d/%m/%Y %H:%M:%S');
audit_name = "MySQL_Healthcheck_audit_report_"$dt
# check the queries
/bin/mysql -h $hostname -u $username -p $password --skip-column-names -B -e "SHOW STATUS LIKE 'Queries'" > $audit_name
# check open table cache hit ratio
/bin/mysql -h $hostname -u $username -p $password --skip-column-names -B -e "SHOW STATUS LIKE 'Table_open_cache_hits'" >> $audit_name
# check the ssl session mode
/bin/mysql -h $hostname -u $username -p $password --skip-column-names -B -e "SHOW STATUS LIKE 'Ssl_session_cache_mode'" >> $audit_name
# check the buffer pool size
cat $mycnf | grep "innodb_buffer_pool_size" >> $audit_name
#check ssl key in my.cnf
cat $mycnf | grep "ssl_key" >> $audit_name
# check permission of my.conf
ls -ltr $mycnf >> $audit_name
Health checks can also be automated by configuration management tools such as Ansible, Salt,Chef, or Puppet.
Automating Database Healthchecks with ClusterControl
ClusterControl is an ops platform for databases, it shows server health, performance, or availability problems within their database environment, all from a central console. It supports automating database health checks through Operational Reports. You can generate or schedule the operational reports, as well as incident reports. There are few types of report as shown below:
The Daily System Report gives you insight about your current database cluster with consists of various information such as node service status, backup status, the uptime of nodes, top queries summary.
Package Upgrade Report gives you a summary of available packages for the upgrade from the repository manager.
Schema Changes Report compares database changes in a table structure that happened between two different generated reports.
Backup Reports gives you information about backup summary and details e.g., last backup created, the status of the backup, backup verification status, and backup retention period.
Apart from the Operational Reports, there are also the Advisors which gives you insight about the CPU, disk, database connections, etc., as below:
The Notifications via email and alert through configured third-party channels give insight on things that go wrong (e.g. failed backups, unrestorable backups, node failures).
Schema Analyzer gives information about duplicate/redundant Indexes, tables without Primary Keys, and tables using MyISAM storage engine. Redundant indexes can be particularly good to know about, as they bloat the size of the database (and backups), and can slow down updates on the table.