How to Manage Large Databases Effectively

Paul Namuag

One of the biggest concerns when dealing with and managing databases is its data and size complexity. Often, organizations get concerned about how to deal with growth and manage growth impact because the database management fails off. Complexity comes with concerns that were not addressed initially and were not seen, or could be overlooked because the technology being currently used shall be able to handle by itself. Managing a complex and large database has to be planned accordingly especially when the type of data you are managing or handling is expected to grow massively either anticipated or in an unpredictable manner. The main goal of planning is to avoid unwanted disasters, or shall we say keep out of going up in smokes! In this blog we will cover how to efficiently manage large databases. 

Data Size does Matter

The size of the database matters as it has an impact on performance and its management methodology. How the data is processed and stored will contribute to how the database will be managed, which applies to both in transit and at rest data. For many large organisations, data is gold, and growth in data could have a drastic change in the process. Therefore, it’s vital to have prior plans to handle growing data in a database.

In my experience working with databases, I've witnessed customers having issues dealing with performance penalties and managing extreme data growth. Questions arise whether to normalize the tables vs denormalizing the tables. 

Normalizing Tables

Normalizing tables maintains data integrity, reduces redundancy, and makes it easy to organize the data into a more efficient way to manage, analyze, and extract. Working with normalized tables yields efficiency, especially when analyzing the data flow and retrieving data either by SQL statements or working with programming languages such as C/C++, Java, Go, Ruby, PHP,  or Python interfaces with the MySQL Connectors. 

Although concerns with normalized tables possess performance penalty and can slow the queries due to series of joins when retrieving the data. Whereas denormalized tables, all you have to consider for optimization relies on the index or the primary key to store data into the buffer for quicker retrieval than performing multiple disks seeks. Denormalized tables require no joins, but it sacrifices data integrity, and database size tends to get bigger and bigger.

When your database is large consider having a DDL (Data Definition Language) for your database table in MySQL/MariaDB. Adding a primary or unique key for your table requires a table rebuild. Changing a column data type also requires a table rebuild as the algorithm applicable to be applied is only ALGORITHM=COPY. 

If you're doing this in your production environment, it can be challenging. Double the challenge if your table is huge. Imagine a million or a billion numbers of rows. You cannot apply an ALTER TABLE statement directly to your table. That can block all incoming traffic that shall need to access the table currently you are applying the DDL. However, this can be mitigated by using pt-online-schema-change or the great gh-ost. Nevertheless, it requires monitoring and maintenance while doing the process of DDL.

Sharding and Partitioning

With sharding and partitioning, it helps segregate or segment the data according to their logical identity. For example, by segregating based on date, alphabetical order, country, state, or primary key based on the given range. This helps your database size to be manageable. Keep your database size up to its limit that it's manageable to your organization and your team. Easy to scale if necessary or easy to manage, especially when a disaster occurs.

When we say manageable, also consider the capacity resources of your server and also your engineering team. You cannot work with large and big data with few engineers. Working with big data such as 1000 databases with large numbers of data sets requires a huge demand of time. Skill wise and expertise is a must. If cost is an issue, that's the time that you can leverage third party services that offer managed services or paid consultation or support for any such engineering work to be catered.

Character Sets and Collation

Character sets and collations affect data storage and performance, especially on the given character set and collations selected. Each character set and collations has its purpose and mostly requires different lengths. If you have tables requiring other characters sets and collations due to character encoding, the data to be stored and processed for your database and tables or even with columns.

This affects how to manage your database effectively. It impacts your data storage and as well performance as stated earlier. If you have understood the kinds of characters to be processed by your application, take note of the character set and collations to be used. LATIN types of characters sets shall suffice mostly for the alphanumeric type of characters to be stored and processed. 

If it's inevitable, sharding and partitioning helps to at least mitigate and limit the data to avoid bloating too much data in your database server. Managing very large data on a single database server can affect efficiency, especially for backup purposes, disaster and recovery, or data recovery as well in case of data corruption or data lost.

Database Complexity Affects Performance 

A large and complex database tends to have a factor when it comes to performance penalty. Complex, in this case, means that the content of your database consists of mathematical equations, coordinates, or numerical and financial records. Now mixed these records with queries that are aggressively using the mathematical functions native to its database. Take a look at the example SQL (MySQL/MariaDB compatible) query below,

SELECT
    ATAN2( PI(),
		SQRT( 
			pow(`a`.`col1`-`a`.`col2`,`a`.`powcol`) + 
			pow(`b`.`col1`-`b`.`col2`,`b`.`powcol`) + 
			pow(`c`.`col1`-`c`.`col2`,`c`.`powcol`) 
		)
	) a,
    ATAN2( PI(),
		SQRT( 
			pow(`b`.`col1`-`b`.`col2`,`b`.`powcol`) - 
			pow(`c`.`col1`-`c`.`col2`,`c`.`powcol`) - 
			pow(`a`.`col1`-`a`.`col2`,`a`.`powcol`) 
		)
	) b,
    ATAN2( PI(),
		SQRT( 
			pow(`c`.`col1`-`c`.`col2`,`c`.`powcol`) * 
			pow(`b`.`col1`-`b`.`col2`,`b`.`powcol`) / 
			pow(`a`.`col1`-`a`.`col2`,`a`.`powcol`) 
		)
	) c
FROM
    a
LEFT JOIN `a`.`pk`=`b`.`pk`
LEFT JOIN `a`.`pk`=`c`.`pk`
WHERE
    ((`a`.`col1` * `c`.`col1` + `a`.`col1` * `b`.`col1`)/ (`a`.`col2`)) 
    between 0 and 100
AND
    SQRT(((
		(0 + (
			(((`a`.`col3` * `a`.`col4` + `b`.`col3` *  `b`.`col4` + `c`.`col3` + `c`.`col4`)-(PI()))/(`a`.`col2`)) * 
			`b`.`col2`)) -
		`c`.`col2) * 
		((0 + (
			((( `a`.`col5`* `b`.`col3`+ `b`.`col4` * `b`.`col5` + `c`.`col2` `c`.`col3`)-(0))/( `c`.`col5`)) * 
			 `b`.`col3`)) - 
		`a`.`col5`)) +
		((
			(0 + (((( `a`.`col5`* `b`.`col3` + `b`.`col5` * PI() + `c`.`col2` / `c`.`col3`)-(0))/( `c`.`col5`)) * `b`.`col5`)) - 
			`b`.`col5` ) * 
			((0 + (((( `a`.`col5`* `b`.`col3` + `b`.`col5` * `c`.`col2` + `b`.`col2`  / `c`.`col3`)-(0))/( `c`.`col5`)) * -20.90625)) - `b`.`col5`)) +
		(((0 + (((( `a`.`col5`* `b`.`col3` + `b`.`col5` * `b`.`col2` +`a`.`col2`  / `c`.`col3`)-(0))/( `c`.`col5`)) *  `c`.`col3`)) - `b`.`col5`) * 
		((0 + (((( `a`.`col5`* `b`.`col3` + `b`.`col5` * `b`.`col2`5 + `c`.`col3`  / `c`.`col2`)-(0))/( `c`.`col5`)) *  `c`.`col3`)) - `b`.`col5`
	))) <=600
ORDER BY
    ATAN2( PI(),
		SQRT( 
			pow(`a`.`col1`-`a`.`col2`,`a`.`powcol`) + 
			pow(`b`.`col1`-`b`.`col2`,`b`.`powcol`) + 
			pow(`c`.`col1`-`c`.`col2`,`c`.`powcol`) 
		)
	) DESC
 

Consider that this query is applied on a table ranging from a million rows. There is a huge possibility that this can stall the server, and it could be resource intensive causing danger to the stability of your production database cluster. Involved columns tend to be indexed to optimise and make this query performant. However, adding indexes to the referenced columns for optimal performance doesn't guarantee the efficiency of managing your large databases. 

When handling complexity, the more efficient way is to avoid rigorous usage of complex mathematical equations and aggressive usage of this built-in complex computational capability. This can be operated and transported through complex computations using backend programming languages instead of using the database. If you have complex computations, then why not store these equations in the database, retrieve the queries, organize it into a more easy to analyze or debug when needed. 

Are You Using the Right Database Engine?

A data structure affects the performance of the database server based on the combination of the query given and the records that are read or retrieved from the table. The database engines within MySQL/MariaDB support InnoDB and MyISAM which use B-Trees, while NDB or Memory database engines use Hash Mapping. These data structures have its asymptotic notation which the latter express the performance of the algorithms used by these data structures. We call these in Computer Science as Big O notation which describes the performance or complexity of an algorithm. Given that InnoDB and MyISAM use B-Trees, it uses O(log n) for search. Whereas, Hash Tables or Hash Maps uses O(n). Both share the average and worst case for its performance with its notation. 

Now back on the specific engine, given the data structure of the engine, the query to be applied based on the target data to be retrieved of course affects the performance of your database server. Hash tables cannot do range retrieval, whereas B-Trees is very efficient for doing these types of searches and also it can handle large amounts of data. 

Using the right engine for the data you store, you need to identify what type of query you apply for these specific data you store. What type of logic that these data shall formulate when it transforms into a business logic. 

Dealing with 1000's or thousands of databases, using the right engine in combination of your queries and data that you want to retrieve and store shall deliver good performance. Given that you have predetermined and analyzed your requirements for its purpose for the right database environment.

Right Tools to Manage Large Databases

It is very hard and difficult to manage a very large database without a solid platform that you can rely upon. Even with good and skilled database engineers, technically the database server you are using is prone for human error. One mistake of any changes to your configuration parameters and variables might result in a drastic change causing to degrade the server's performance.

Performing backup to your database on a very large database could be challenging at times. There are occurrences that backup might fail for some strange reasons. Commonly, queries that could stall the server where the backup is running cause to fail. Otherwise, you have to investigate the cause of it.

Using automation such as Chef, Puppet, Ansible, Terraform, or SaltStack can be used as your IaC to provide quicker tasks to perform. While using other third-party tools as well to help you from monitoring and providing high quality graph images. Alert and alarm notification systems are also very important to notify you from issues that can occur from warning to critical status level. This is where ClusterControl is very useful in this kind of situation.

ClusterControl offers ease to manage a large number of databases or even with sharded types of environments. It has been tested and installed a thousand times and has been running into productions providing alarms and notifications to the DBAs, engineers, or DevOps operating the database environment. Ranging from staging or development, QAs, to production environment.

ClusterControl also can perform a backup and restore. Even with large databases, it can be efficient and easy to manage since the UI provides scheduling and also has options to upload it to the cloud (AWS, Google Cloud, and Azure). 

There's also a backup verification and a lot of options such as encryption and compression. See the screenshot below for example (creating a Backup for MySQL using Xtrabackup):

Conclusion

Managing large databases such as a thousand or more can be efficiently done, but it must be determined and prepared beforehand. Using the right tools such as automation or even subscribing to managed services helps drastically. Although it incurs cost, the turnaround of the service and budget to be poured to acquire skilled engineers can be reduced as long as the right tools are available.

 
ClusterControl
The only management system you’ll ever need to take control of your open source database infrastructure.