
Manage Engine Applications Manager Alternatives - ClusterControl Database Monitoring


If you're looking for a monitoring system, you have probably read about many different options with different features and different costs based on those features.

Manage Engine Applications Manager is an application performance management solution that proactively monitors business applications and helps businesses ensure their revenue-critical applications meet end-user expectations. 

ClusterControl is an agentless management and automation software for database clusters. It helps deploy, monitor, manage, and scale your database server/cluster directly from the ClusterControl UI or using the ClusterControl CLI.

In this blog, we'll take a look at some of the features of these products to give you an overview that helps you choose the right one based on your requirements.

Database Monitoring Features Comparison

Manage Engine Applications Manager

There are three different versions of the product:

  • Free: Supports monitoring up to 5 apps or servers
  • Professional: Supports integrated performance monitoring for a heterogeneous set of applications
  • Enterprise: Supports large deployments with its distributed monitoring capability

It can be installed on both Windows and Linux operating systems, and it can monitor not only Databases but also Applications, Mail Servers, Virtualization, and more.

ClusterControl

Like the previous one, there are three different versions of the product:

  • Free Community: Great for deployment & monitoring. No limit on the number of servers but there is a limit on the available features
  • Advanced: For high availability and scalability requirements
  • Enterprise: With enterprise-grade and security features

It can be installed only on Linux operating systems, and it’s only for Database and Load Balancer servers.

The Installation Process

Manage Engine Applications Manager Installation Process

The installation process can be hard for a standard user, as the documentation doesn’t have a step-by-step guide and it’s not clear about the packages required.

Let’s see an example of this installation on CentOS 8.

It's not mentioned in the documentation (at least I didn't find it), but the installer requires the following packages: tar, unzip, and hostname. You need to install them yourself; otherwise, as the installer won't install them, you'll receive an error message like:

/opt/ManageEngine_ApplicationsManager_64bit.bin: line 686: tar: command not found
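
On CentOS 8 you can pull these dependencies in with the standard package manager before launching the installer. A quick sketch (the package names come from the errors above, and dnf is assumed to be available):

$ sudo dnf install -y tar unzip hostname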

Then, you need to run the installer with the "-i console" flag as a privileged (non-root) user:

$ sudo /opt/ManageEngine_ApplicationsManager_64bit.bin -i console

During the installation process, you can choose the Professional or Enterprise edition for your trial period. After your 30-day free trial ends, your installation will automatically convert to the free edition unless you have a commercial license:

===============================================================================

Edition Selection

-----------------

  ->1- Professional Edition

    2- Enterprise Edition(Distributed Setup)

    3- Free Edition

ENTER THE NUMBER FOR YOUR CHOICE, OR PRESS <ENTER> TO ACCEPT THE DEFAULT::

It supports different languages that you can choose here:

===============================================================================

Language Selection

------------------

  ->1- English

    2- Simplified Chinese

    3- Japanese

    4- Vietnamese

    5- French

    6- German

    7- European Spanish

    8- korean

    9- Hungarian

   10- Traditional Chinese

ENTER THE NUMBER FOR YOUR CHOICE, OR PRESS <ENTER> TO ACCEPT THE DEFAULT::

You can also add a license (if you have one), specify the web server and SSL port, local database (for this it supports PostgreSQL or Microsoft SQL Server), installation path, and if you want to register for technical support. You’ll see a summary before starting the installation process:

===============================================================================

Pre-Installation Summary

------------------------

Please Review the Following Before Continuing:

Product Name:

    ManageEngine Applications Manager14

Install Folder:

    /opt/ManageEngine/AppManager14

Link Folder:

    /root

Type Of Installation:

    PROFESSIONAL EDITION

DB Back-end :

    pgsql

Web Server Port :

    "9090"

Disk Space Information (for Installation Target):

    Required:  549,437,924 Bytes

    Available: 13,418,307,584 Bytes

PRESS <ENTER> TO CONTINUE:

When you receive the “Installation Complete” message, you'll be ready to start it by running the “startApplicationsManager.sh” script located in the installation path:

$ cd /opt/ManageEngine/AppManager14

$ sudo ./startApplicationsManager.sh

##########################################################################

 Note:It is recommended to start the product in nohup mode.

 Usage : nohup sh startApplicationsManager.sh &

##########################################################################

AppManager Info: Temporary image files are removed

This evaluation copy is valid for 29 days

[Tue May 05 01:28:31 UTC 2020] Starting Applications Manager "Primary" Server Modules, please wait ...

[Tue May 05 01:28:34 UTC 2020] Process : Site24x7IntegrationProcess [ Started ]

[Tue May 05 01:28:34 UTC 2020] Process : AMScriptProcess [ Started ]

[Tue May 05 01:28:35 UTC 2020] Process : AMExtProdIntegrationProcess [ Started ]

[Tue May 05 01:28:35 UTC 2020] Process : AuthMgr [ Started ]

[Tue May 05 01:28:35 UTC 2020] Process : AMDataCleanupProcess [ Started ]

[Tue May 05 01:28:35 UTC 2020] Process : DBUserStorageServer [ Started ]

[Tue May 05 01:28:35 UTC 2020] Process : NmsPolicyMgr [ Started ]

[Tue May 05 01:28:36 UTC 2020] Process : StartRelatedServices [ Started ]

[Tue May 05 01:28:36 UTC 2020] Process : AMUrlMonitorProcess [ Started ]

[Tue May 05 01:28:36 UTC 2020] Process : NMSMServer [ Started ]

[Tue May 05 01:28:36 UTC 2020] Process : NmsAuthManager [ Started ]

[Tue May 05 01:28:36 UTC 2020] Process : WSMProcess [ Started ]

[Tue May 05 01:28:38 UTC 2020] Process : APMTracker [ Started ]

[Tue May 05 01:28:38 UTC 2020] Process : RunJSPModule [ Started ]

[Tue May 05 01:28:38 UTC 2020] Process : StandaloneApplnProcess [ Started ]

[Tue May 05 01:28:38 UTC 2020] Process : AMRBMProcess [ Started ]

[Tue May 05 01:28:38 UTC 2020] Process : ApplnStandaloneBE [ Started ]

[Tue May 05 01:28:38 UTC 2020] Process : AMDistributionProcess [ Started ]

[Tue May 05 01:28:38 UTC 2020] Process : OAuthRefreshAccessToken [ Started ]

[Tue May 05 01:28:38 UTC 2020] Process : DiscoveryProcess [ Started ]

[Tue May 05 01:28:38 UTC 2020] Process : AMCAMProcess [ Started ]

[Tue May 05 01:28:39 UTC 2020] Process : NMSSAServer [ Started ]

[Tue May 05 01:28:39 UTC 2020] Process : AMServerStartUp [ Started ]

[Tue May 05 01:28:42 UTC 2020] Process : Collector [ Started ]

[Tue May 05 01:28:42 UTC 2020] Process : DBServer [ Started ]

[Tue May 05 01:28:43 UTC 2020] Process : MapServerBE [ Started ]

[Tue May 05 01:28:43 UTC 2020] Process : NmsConfigurationServer [ Started ]

[Tue May 05 01:28:44 UTC 2020] Process : AMFaultProcess [ Started ]

[Tue May 05 01:28:44 UTC 2020] Process : AMEventProcess [ Started ]

[Tue May 05 01:28:56 UTC 2020] Process : AMServerFramework [ Started ]

[Tue May 05 01:29:07 UTC 2020] Process : AMDataArchiverProcess [ Started ]

[Tue May 05 01:29:08 UTC 2020] Process : MonitorsAdder [ Started ]

[Tue May 05 01:29:11 UTC 2020] Process : EventFE [ Started ]

[Tue May 05 01:29:11 UTC 2020] Process : AlertFE [ Started ]

[Tue May 05 01:29:11 UTC 2020] Process : NmsMainFE [ Started ]

Verifying connection with web server... verified

Applications Manager started successfully.

Please connect your client to the web server on port: 9090

Now you can access the UI using the default user and password (admin/admin).
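
If you want to confirm that the web server is answering before opening a browser, a quick check from the shell also works. This is just a sketch; replace <server-ip> with your own server address:

$ curl -I http://<server-ip>:9090/

A 200 or a redirect response means the Applications Manager UI is up.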

ClusterControl Installation Process

There are different installation methods, as mentioned in the documentation. In the case of a manual installation, the required packages are specified in the same documentation, and there is a step-by-step guide for the whole process.

Let’s see an example of this installation on CentOS 8 using the automatic installation script.

$ wget http://www.severalnines.com/downloads/cmon/install-cc

$ chmod +x install-cc

$ sudo ./install-cc   # omit sudo if you run as root

The installation script will attempt to automate the following tasks:

  • Install and configure a local MySQL server (used by ClusterControl to store monitoring data)
  • Install and configure the ClusterControl controller package via package manager
  • Install ClusterControl dependencies via package manager
  • Configure Apache and SSL
  • Configure ClusterControl API URL and token
  • Configure ClusterControl Controller with minimal configuration options
  • Enable the CMON service on boot and start it up

Running the mentioned script, you’ll receive a question about sending diagnostic data:

$ sudo ./install-cc

!!

Only RHEL/Centos 6.x|7.x|8.x, Debian 7.x|8.x|9.x|10.x, Ubuntu 14.04.x|16.04.x|18.04.x LTS versions are supported

Minimum system requirements: 2GB+ RAM, 2+ CPU cores

Server Memory: 1024M total, 922M free

MySQL innodb_buffer_pool_size set to 512M



Severalnines would like your help improving our installation process.

Information such as OS, memory and install success helps us improve how we onboard our users.

None of the collected information identifies you personally.

!!

=> Would you like to help us by sending diagnostics data for the installation? (Y/n):

Then, it’ll start installing the required packages. The next question is about the hostname that will be used:

=> The Controller hostname will be set to 192.168.100.131. Do you want to change it? (y/N):

When the local database is installed, the installer will secure it, prompting for a root password that you must enter:

=> Starting database. This may take a couple of minutes. Do NOT press any key.

Redirecting to /bin/systemctl start mariadb.service

=> Securing the MySQL Server ...

=> !! In order to complete the installation you need to set a MySQL root password !!

=> Supported special password characters: ~!@#$%^&*()_+{}<>?

=> Press any key to proceed ...

And a CMON user password, which will be used by ClusterControl:

=> Set a password for ClusterControl's MySQL user (cmon) [cmon]

=> Supported special characters: ~!@#$%^&*()_+{}<>?

=> Enter a CMON user password:

That's it. In this way, you'll have everything in place without installing or configuring anything manually.

=> ClusterControl installation completed!

Open your web browser to http://192.168.100.131/clustercontrol and

enter an email address and new password for the default Admin User.

Determining network interfaces. This may take a couple of minutes. Do NOT press any key.

Public/external IP => http://10.10.10.10/clustercontrol

Installation successful. If you want to uninstall ClusterControl then run install-cc --uninstall.
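
At this point the controller service should already be enabled and running, since the script takes care of that. If you want to double-check it (assuming a systemd-based host; the service name is cmon), a quick status check looks like this:

$ sudo systemctl status cmon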

The first time you access the UI, you will need to register for the 30-day free trial period.

After your 30-day free trial ends, your installation will automatically convert to the community edition unless you have a commercial license.

Database Monitoring Usage Comparison

Manage Engine Applications Manager

To start using it, you need to add a new monitor in the corresponding section, where you can choose from different options. As we mentioned in the features section, it allows you to monitor different things like Applications, Databases, Virtualization, and more.

Let's say you want to monitor a MySQL instance. For this, you'll need to add a MySQL Java Connector to the Applications Manager directory, in "/opt/ManageEngine/AppManager14/working/mysql/MMMySQLDriver", and then restart the Applications Manager software. Then, you need to create the user to access the database.
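
The exact privileges depend on what you want Applications Manager to collect, but a read-only monitoring user along these lines is a common starting point. Treat the user name, host pattern, password, and grants below as illustrative assumptions rather than the product's required set:

CREATE USER 'appmgr'@'%' IDENTIFIED BY 'StrongPassword1!';
GRANT SELECT, PROCESS, REPLICATION CLIENT ON *.* TO 'appmgr'@'%';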

To add this monitor, you must specify the display name, hostname/IP address, database port, credentials, and database to be monitored. The database service must be running, and the database and user must have been created beforehand.

ClusterControl

To add the first database node to be monitored, you must go to the deploy/import section. ClusterControl requires SSH access to the remote node for both deploy and import actions.
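
In practice this means passwordless (key-based) SSH from the ClusterControl host to the database nodes. A minimal sketch, assuming you connect as root and using 192.168.100.132 as a placeholder for your database node:

$ ssh-keygen -t rsa   # press Enter to accept the defaults

$ ssh-copy-id root@192.168.100.132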

For the import process, you'll need to use a database admin user and specify the vendor, version, database port, and hostname/IP address of the node(s).

For deployment, you just need to specify the user that will be created during the installation process. ClusterControl will also install the database software and required packages in this process, so you don’t need to perform any manual configuration or installation.

You can also choose between different database vendors and versions, and a basic configuration like database port and datadir.

Finally, you must define the topology to be deployed. The available topologies depend on the selected technology.

Monitoring Your Database

Database Monitoring with Manage Engine Applications Manager

Let's see an example of monitoring a MySQL database. In this case, you can first see an overview of the database node, with some basic metrics.

You can go to the Database tab, to see specific information about the database that you’re monitoring:

If you take a look at the Replication section, in this case, it says "Replication is not enabled":

But there is actually a master-slave replication up and running... There is nothing related to this issue in the documentation, so, as it's not working, let's continue to the next section, "Performance", where you'll find a list of the top queries.

Then, the “Session” section, where you’ll have the current sessions:

And finally, information about the database configuration:

Database Monitoring with ClusterControl

Like the previous case, let's see an example of monitoring a MySQL database. Here you can first see an overview of the database node, with some basic metrics.

You have different dashboards here that you can customize based on your requirements. Then, in the "Node" section, you can see host/database metrics, top processes, and the configuration for each node.

If you go to the “Dashboards” section, you’ll have more detailed information about your database, load balancer, or host, with more useful metrics.

You can also check the "Topology View" section, where you can see the status of the whole environment, or even perform actions on the nodes.

In the “Query Monitor” section, you can see the Top Queries, Running Queries, and Query Outliers.

Then, in the “Performance” section, you have information about your database performance, configuration variables, schema analyzer, transaction log, and even more.

In the same section, you can check the database growth, which shows the Data Size and Index Size for each database.

You can check the "Log" section to monitor not only the ClusterControl log but also the Operating System and Database logs, so you don't need to access the server to check them.

Database Alarms & Notifications

Manage Engine Applications Manager Notifications

A good monitoring system requires alarms to alert you in case of failure. This system has its own alarm feature where you must configure actions to be run when an alarm is generated.

You can integrate it with another Manage Engine product called AlarmsOne, to centralize it. This is a separate product, so it has its own price/plan.

ClusterControl Notifications

It also has an alarm system based on Advisors. ClusterControl ships with predefined advisors that can be modified if needed, but in general this isn't necessary, so no manual work is required. You can also use the Developer Studio tool to manage or create new scripts.

It has integration with 3rd party tools like Slack or PagerDuty, so you can receive notifications there too.

Conclusion

Based on the features mentioned above, we can say Applications Manager is a good option to monitor both applications and databases in a basic way. It supports different languages, and it supports not only Linux but also Windows as the operating system. The installation process, however, can be very challenging for inexperienced users, as it requires too many manual actions and configurations, the documentation is not well written, and the monitoring options and metrics are basic.

On the other hand, we can say ClusterControl is an all-in-one management system with a lot of features, but only for database and load balancer servers, and only available for Linux operating systems. In this case, the installation is really easy using the automatic installation script (it doesn't require extra manual configuration or installation), the documentation has step-by-step guides, and it's a complete monitoring system with dashboards and several metrics that could be useful for you.

You can perform not only monitoring tasks but also deployment, scaling, management, and more. The monitoring features of ClusterControl are also free as part of the Community Edition.


MySQL Workbench Alternatives - ClusterControl Database User Management


MySQL user and privilege management is very critical for authentication, authorization and accounting purposes. Since MySQL 8.0, there are now two types of database user privileges:

  1. Static privileges - The common global, schema and administrative privileges like SELECT, ALTER, SUPER and USAGE, built into the server.
  2. Dynamic privileges - New in MySQL 8.0. A component that can be registered and unregistered at runtime, which provides better control over global privileges. For example, instead of assigning the SUPER privilege just for configuration management purposes, that particular user is better granted only the SYSTEM_VARIABLES_ADMIN privilege, as shown in the sketch just below.
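
As a quick illustration of the difference, this is what the narrower grant looks like in MySQL 8.0; the user name, host, and password are placeholders:

CREATE USER 'config_admin'@'localhost' IDENTIFIED BY 'StrongPassword1!';
GRANT SYSTEM_VARIABLES_ADMIN ON *.* TO 'config_admin'@'localhost';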

Creating a database schema with its respective user is the very first step to start using MySQL as your database server. Most applications that use MySQL as the datastore require this task to be done before the application can work as intended. For use with an application, a MySQL user is commonly configured to have full privileges (ALL PRIVILEGES) at the schema level, meaning the database user used by the application is free to perform any action on the assigned database.
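
Put together, a typical application user setup looks roughly like the following; the schema name, user name, host pattern, and password are illustrative:

CREATE DATABASE appdb;
CREATE USER 'appuser'@'10.0.0.%' IDENTIFIED BY 'StrongPassword1!';
GRANT ALL PRIVILEGES ON appdb.* TO 'appuser'@'10.0.0.%';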

In this blog post, we are going to compare and contrast MySQL database user management features between MySQL Workbench and ClusterControl.

MySQL Workbench - Database User Management

For MySQL Workbench, you can find all the user management functionality under Administration -> Management -> User and Privileges. You should see a list of existing users on the left side, while the right side shows the authentication and authorization configuration for the selected user:

MySQL supports over 30 static privileges, and it is not easy to understand and remember them all. MySQL Workbench has a number of preset administrative roles, which is very helpful when assigning sufficient privileges to a database user. For example, if you would like to create a MySQL user specifically to perform backup activities using mysqldump, you may pick the BackupAdmin role and the related global privileges will be assigned to the user accordingly:

To create a new database user, click on the "Add Account" button and supply necessary information under the "Login" tab. You may add some more resource restrictions under the "Account Limit" tab. If the user is only for a database schema and not intended for any administrative role (strictly for application usage), you may skip the "Administrative Roles" tab and just configure the "Schema Privileges". 

Under the "Schema Privileges" section, one can pick a database schema (or define the matching pattern) by clicking "Add Entry". Then, press the "Select ALL" button to allow all rights (except GRANT OPTION) which is similar to "ALL PRIVILEGES" option statement:

A database user will not be created in the MySQL server until you have applied the changes, by clicking the "Apply" button.

ClusterControl - Database and Proxy User Management

ClusterControl database and user management is a bit more straightforward than what MySQL Workbench offers. While MySQL Workbench is more developer-friendly, ClusterControl focuses more on what sysadmins and DBAs are interested in: common administration tasks for those who are already familiar with MySQL roles and privileges.

To create a database user, go to Manage -> Schemas and Users -> Users -> Create New User. You will be presented with the following user creation wizard:

Creating a user in ClusterControl requires you to fill in all the necessary fields on one page, unlike MySQL Workbench, which involves a number of clicks to achieve a similar result. ClusterControl also supports creating a user with the "REQUIRE SSL" syntax, to force that user to connect only over an SSL-encrypted channel.
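
Under the hood this maps to the standard MySQL REQUIRE SSL clause, roughly as below; the user name, host, and password are placeholders rather than the exact statement ClusterControl generates:

CREATE USER 'secure_app'@'10.0.0.%' IDENTIFIED BY 'StrongPassword1!' REQUIRE SSL;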

ClusterControl provides an aggregated view of all database users in a cluster, eliminating the need to log in to every individual server to look for a particular user:

A simple rollover on the privileges box reveals all privileges that have been assigned to this user. ClusterControl also provides a list of inactive users, user accounts that have not been used since the last server restart:

The above list gives us a clear summary of which users are worth keeping, allowing us to manage users more efficiently. DBAs can then ask the developers whether an inactive database user is still needed; otherwise, the user account can be locked or dropped.

If you have a ProxySQL load balancer in between, you might know that ProxySQL has its own MySQL user management to allow connections to be passed through it. There are a number of different settings and variables compared to the common MySQL user configuration, e.g., default hostgroup, default schema, transaction persistence, fast forward, and more. ClusterControl provides a graphical user interface for managing ProxySQL database users, improving the experience and efficiency of managing your proxy and database users at once:

When creating a new database user via ProxySQL management page, ClusterControl will automatically create the corresponding user on both ProxySQL and MySQL. However, when dropping a MySQL user from ProxySQL, the corresponding database user will remain on the MySQL server.

Advantages & Disadvantages

ClusterControl supports multiple database vendors, so you get a similar user experience when dealing with other database servers. ClusterControl also supports creating a database user on multiple hosts at once, making sure the created user exists on all database servers in the cluster. ClusterControl has a cleaner way of listing user accounts, where you can see all the necessary information right on the listing page. However, user management requires an active subscription and is not available in the Community Edition. It also does not support all platforms that MySQL can run on, only certain Linux distributions like CentOS, RHEL, Debian, and Ubuntu.

The strongest advantage of MySQL Workbench is that it is free and can also be used for schema management and administration. It's built to be friendly to developers and DBAs and has the advantage of being built and backed by the Oracle team, which owns and maintains MySQL Server. It also provides much clearer guidance, with descriptions on most of the input fields, especially in critical parts like authentication and privilege management. The preset administrative roles are a neat way of granting a set of privileges to a user based on the work the user must carry out on the server. On the downside, MySQL Workbench is not a cluster-friendly tool, since every management connection is tailored to one endpoint MySQL server. Thus, it doesn't provide a centralized view of all users in the cluster. It also doesn't support creating users with SSL enforcement.

Neither of these tools supports the new MySQL 8.0 dynamic privileges syntax, e.g., BACKUP_ADMIN, BINLOG_ADMIN, SYSTEM_VARIABLES_ADMIN, etc.

The following table highlights notable features for both tools for easy comparison:

  • Supported OS for MySQL server: MySQL Workbench supports Linux, Windows, FreeBSD, Open Solaris, and Mac OS; ClusterControl supports Linux only (Debian, Ubuntu, RHEL, CentOS)
  • MySQL vendor: MySQL Workbench supports Oracle and Percona; ClusterControl supports Oracle, Percona, MariaDB, and Codership
  • User management for other software: MySQL Workbench supports none; ClusterControl supports ProxySQL
  • Multi-host user management: MySQL Workbench no; ClusterControl yes
  • Aggregated view of users in a database cluster: MySQL Workbench no; ClusterControl yes
  • Show inactive users: MySQL Workbench no; ClusterControl yes
  • Create user with SSL: MySQL Workbench no; ClusterControl yes
  • Privilege and role description: MySQL Workbench yes; ClusterControl no
  • Preset administrative role: MySQL Workbench yes; ClusterControl no
  • MySQL 8.0 dynamic privileges: not supported by either tool
  • Cost: MySQL Workbench is free; ClusterControl requires a subscription for management features


We hope that this blog post helps you determine which tool suits you best to manage your MySQL databases and users.

Comparing Amazon RDS Point-in-Time Recovery to ClusterControl


The Amazon Relational Database Service (AWS RDS) is a fully-managed database service which can support multiple database engines. Among those supported are PostgreSQL, MySQL, and MariaDB. ClusterControl, on the other hand, is a database management and automation software which also supports backup handling for PostgreSQL, MySQL, and MariaDB open source databases. 

While RDS has been widely embraced by many companies, some might not be familiar with how its Point-in-Time Recovery (PITR) works and how it can be used. 

Several of the database engines used by Amazon RDS have special considerations when restoring to a specific point in time, and in this blog we'll cover how it works for PostgreSQL, MySQL, and MariaDB. We'll also compare how it differs from the PITR function in ClusterControl.

What is Point-in-Time Recovery (PITR)

If you are not yet familiar with Disaster Recovery Planning (DRP) or Business Continuity Planning (BCP), you should know that PITR is one of the important standard practices for database management. As mentioned in our previous blog, Point-in-Time Recovery (PITR) involves restoring the database to any given moment in the past. To be able to do this, we need to restore a full backup, and then PITR takes place by applying all the changes that happened up to the specific point in time you want to recover to. 

Point-in-time Recovery (PITR) with AWS RDS

AWS RDS handles PITR differently than the traditional way common to an on-prem database. The end result follows the same concept, but with AWS RDS the full backup is a snapshot; RDS then applies the transaction logs (which are stored in S3) up to the requested time and launches a new (different) database instance. 

The common way requires you to use either a logical backup (pg_dump, mysqldump, mydumper) or a physical backup (Percona XtraBackup, Mariabackup, pg_basebackup, pgBackRest) as your full backup before you apply the PITR. 

AWS RDS requires you to launch a new DB instance, whereas the traditional approach gives you the flexibility to apply the PITR on the same database node where the backup was taken, on a different (existing) DB instance that needs recovery, or on a fresh DB instance.

Upon creation of your AWS RDS instance, automated backups are turned on. Amazon RDS automatically performs a full daily snapshot of your data. Snapshot schedules can be set during creation at your preferred backup window. While automated backups are turned on, AWS also captures transaction logs to Amazon S3 every 5 minutes, recording all your DB updates. Once you initiate a point-in-time recovery, transaction logs are applied to the most appropriate daily backup in order to restore your DB instance to the specific requested time.

How To Apply a PITR with AWS RDS

Applying PITR can be done in three different ways: you can use the AWS Management Console, the AWS CLI, or the Amazon RDS API once the DB instance is available. You must also take into consideration that the transaction logs are captured every five minutes and stored in AWS S3.

Once you restore a DB instance, the default DB security group (SG) is applied to the new DB instance. If you need a custom DB SG, you can explicitly assign it using the AWS Management Console, the AWS CLI modify-db-instance command, or the Amazon RDS API ModifyDBInstance operation after the DB instance is available.
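
A minimal AWS CLI sketch of that step; the instance identifier and security group ID below are placeholders:

[root@ccnode ~]# aws rds modify-db-instance \
>     --db-instance-identifier database-s9s-mysql-pitr \
>     --vpc-security-group-ids sg-0123456789abcdef0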

PITR requires you to identify the latest restorable time for a DB instance. To do this, you can use the AWS CLI describe-db-instances command and look at the value returned in the LatestRestorableTime field for the DB instance. For example,

[root@ccnode ~]# aws rds describe-db-instances --db-instance-identifier database-s9s-mysql|grep LatestRestorableTime

            "LatestRestorableTime": "2020-05-08T07:25:00+00:00",

Applying PITR with AWS Console

To apply PITR in the AWS Console, log in to the AWS Console → go to Amazon RDS → Databases → select (or click) your desired DB instance, then click Actions. See below,

Once you attempt to restore via PITR, the console UI will show you the latest restorable time you can set. You can use the latest restorable time or specify your desired target date and time. See below:

It's quite easy to follow but it requires you to pay attention and fill in the desired specifications you need for the new instance to be launched.

Applying PITR with AWS CLI

Using the AWS CLI can be quite handy especially if you need to incorporate this with your automation tools for your CI/CD pipeline. To do this, you can start simply with,

[root@ccnode ~]# aws rds restore-db-instance-to-point-in-time \

>     --source-db-instance-identifier  database-s9s-mysql \

>     --target-db-instance-identifier  database-s9s-mysql-pitr \

>     --restore-time 2020-05-08T07:30:00+00:00

{

    "DBInstance": {

        "DBInstanceIdentifier": "database-s9s-mysql-pitr",

        "DBInstanceClass": "db.t2.micro",

        "Engine": "mysql",

        "DBInstanceStatus": "creating",

        "MasterUsername": "admin",

        "DBName": "s9s",

        "AllocatedStorage": 18,

        "PreferredBackupWindow": "00:00-00:30",

        "BackupRetentionPeriod": 7,

        "DBSecurityGroups": [],

        "VpcSecurityGroups": [

            {

                "VpcSecurityGroupId": "sg-xxxxx",

                "Status": "active"

            }

        ],

        "DBParameterGroups": [

            {

                "DBParameterGroupName": "default.mysql5.7",

                "ParameterApplyStatus": "in-sync"

            }

        ],

        "DBSubnetGroup": {

            "DBSubnetGroupName": "default",

            "DBSubnetGroupDescription": "default",

            "VpcId": "vpc-f91bdf90",

            "SubnetGroupStatus": "Complete",

            "Subnets": [

                {

                    "SubnetIdentifier": "subnet-exxxxx",

                    "SubnetAvailabilityZone": {

                        "Name": "us-east-2a"

                    },

                    "SubnetStatus": "Active"

                },

                {

                    "SubnetIdentifier": "subnet-xxxxx",

                    "SubnetAvailabilityZone": {

                        "Name": "us-east-2c"

                    },

                    "SubnetStatus": "Active"

                },

                {

                    "SubnetIdentifier": "subnet-xxxxxx",

                    "SubnetAvailabilityZone": {

                        "Name": "us-east-2b"

                    },

                    "SubnetStatus": "Active"

                }

            ]

        },

        "PreferredMaintenanceWindow": "fri:06:01-fri:06:31",

        "PendingModifiedValues": {},

        "MultiAZ": false,

        "EngineVersion": "5.7.22",

        "AutoMinorVersionUpgrade": true,

        "ReadReplicaDBInstanceIdentifiers": [],

        "LicenseModel": "general-public-license",

        "OptionGroupMemberships": [

            {

                "OptionGroupName": "default:mysql-5-7",

                "Status": "pending-apply"

            }

        ],

        "PubliclyAccessible": true,

        "StorageType": "gp2",

        "DbInstancePort": 0,

        "StorageEncrypted": false,

        "DbiResourceId": "db-XXXXXXXXXXXXXXXXX",

        "CACertificateIdentifier": "rds-ca-2019",

        "DomainMemberships": [],

        "CopyTagsToSnapshot": false,

        "MonitoringInterval": 0,

        "DBInstanceArn": "arn:aws:rds:us-east-2:042171833148:db:database-s9s-mysql-pitr",

        "IAMDatabaseAuthenticationEnabled": false,

        "PerformanceInsightsEnabled": false,

        "DeletionProtection": false,

        "AssociatedRoles": []

    }

}

Both of these approaches take time to create or prepare the database instance, until it becomes available and visible in the list of database instances in your AWS RDS console.

AWS RDS PITR Limitations

When using AWS RDS you are tied to them as a vendor, and moving your operations out of their system can be troublesome. Here are some things you have to consider:

  • The level of vendor lock-in when using AWS RDS
  • Your only option to recover via PITR is to launch a new instance running on RDS
  • There is no way to use the PITR process to recover to an external node outside of RDS
  • It requires you to learn and be familiar with their tools and security framework.

How To Apply A PITR with ClusterControl

ClusterControl performs PITR in a simple yet straightforward fashion (but it requires you to enable or set the prerequisites so PITR can be used). As discussed earlier, PITR in ClusterControl works differently than in AWS RDS. Here is a list of where PITR can be applied using ClusterControl (as of version 1.7.6):

  • PITR applies after a full backup, based on the backup method solutions we support for PostgreSQL, MySQL, and MariaDB databases. 
    • For PostgreSQL, only the pg_basebackup backup method is supported and compatible with PITR
    • For MySQL or MariaDB, only the xtrabackup/mariabackup backup method is supported and compatible with PITR
  • For MySQL or MariaDB databases, PITR applies only if the source node of the full backup is the target node to be recovered.
  • MySQL or MariaDB databases require binary logging to be enabled
  • For PostgreSQL databases, PITR applies only to the active master/primary and requires WAL archiving to be enabled.
  • PITR can only be applied when restoring an existing full backup

Backup management in ClusterControl applies to environments where the databases are not fully managed and requires SSH access, which is totally different from AWS RDS. Although they share the same goal, which is to recover data, the backup solutions present in ClusterControl cannot be applied to AWS RDS. ClusterControl also does not support RDS for management and monitoring.

Using ClusterControl for PITR in PostgreSQL

As mentioned earlier in the prerequisites, to leverage PITR you must enable WAL archiving. This can be achieved by clicking the gear icon, as shown below:
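
ClusterControl applies this for you when you enable the option, but if you are curious what it boils down to on the PostgreSQL side, WAL archiving is controlled by settings roughly like the following in postgresql.conf (the archive destination is illustrative, and changing archive_mode requires a restart):

wal_level = replica
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/wal_archive/%f'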

Since PITR can only be applied right after a full backup, you'll find this feature under the Backup list, where you can attempt to restore an existing backup. The following sequence of screenshots shows how to do it:

Then restore it on the same host where the backup was taken:

Then just specify the date and time,

Once you have specified the date and time, ClusterControl will restore the backup and then apply the PITR once the restore is done. You can verify this by inspecting the job activity logs, as shown below:

Using ClusterControl for PITR in MySQL/MariaDB

PITR for MySQL or MariaDB does not differ from the approach we described above for PostgreSQL. However, there's no WAL archiving equivalent, nor a button or option you need to set to enable the PITR functionality. Since MySQL and MariaDB require binary logs for PITR, in ClusterControl this can be handled under the Manage tab. See below:

Then specify the log_bin variable with the corresponding boolean value. For example,

Alternatively, you can just edit the configuration file (/etc/my.cnf or /etc/mysql/my.cnf) and set log_bin under the [mysqld] section. Once log_bin is set on the node, ensure that the full backup is taken on the same node where you will apply the PITR process, as stated earlier in the prerequisites.
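
A minimal sketch of what that configuration section might look like; the binlog base name and server_id values are illustrative, and the server needs a restart for the change to take effect:

[mysqld]
server_id = 1
log_bin = /var/lib/mysql/binlog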

When binary logs are enabled and a full backup is available, you can then run the PITR process the same way as in the PostgreSQL UI, but with different fields to fill in. You can specify the date and time, or specify a binlog file and position. See below:

ClusterControl PITR Limitations

In case you’re wondering what you can and cannot do for PITR in ClusterControl, here's the list below:

  • The s9s CLI tool does not currently support the PITR process, so it's not possible to automate it or integrate it into your CI/CD pipeline.
  • No PITR support for external nodes
  • No PITR support when the source of the backup is different from the target node
  • There is no notification of the latest point in time you can use for PITR

Conclusion

Both tools have different approaches and different solutions for the target environment. The key takeaway is that AWS RDS has its own PITR, which is faster, but it's applicable only if your database is hosted in RDS, and you are tied to vendor lock-in.

ClusterControl allows you to freely apply the PITR process in whatever data center or on-premises environment you like, as long as the prerequisites are taken into consideration. Its goal is to recover the data. Regardless of its limitations, the right choice comes down to how you will use the solution within the architectural environment you have.

pt-query-digest Alternatives - MySQL Query Management & Monitoring with ClusterControl


When your database workload is overstressed, you first want to look at what queries are running to see the query pattern. Is it write-heavy? Read-heavy? Where is the bottleneck? 

Identifying Query Issues

To figure it out, you can enable the general log or the slow log to capture the running queries and write them to a file. You can also read from the binary log (as the binary log captures all the changes in the database) and look at the running processlist in the database. You can even capture queries from the network perspective using tcpdump.

What to do next? You can analyze the queries written to the general log file, slow log file, or binary log to check if there is something interesting going on (e.g., a bottleneck in a query). 

Percona has a tool to analyze these types of queries, named pt-query-digest. It is included when you install Percona Toolkit, a collection of utilities that help DBAs manage their databases. In this blog we will take a look at this tool and how it compares to the Query Management features of ClusterControl.

Installation Procedure

Percona repositories support two Linux distribution package families, Debian-based and RPM-based. The installation is simple, as shown below:

Debian-based package (Ubuntu, Debian)

Configure the Percona package repositories by downloading the package

wget https://repo.percona.com/apt/percona-release_latest.generic_all.deb

And then install the downloaded package using dpkg

sudo dpkg -i percona-release_latest.generic_all.deb

After that, just run the installation from the package manager

sudo apt-get install percona-toolkit

RPM-based package (RHEL, CentOS, Oracle Enterprise Linux, Amazon AMI)

Configure Percona package repositories by installing the rpm package directly.

sudo yum install https://repo.percona.com/yum/percona-release-latest.noarch.rpm 

After that, just run the installation from the package manager

sudo yum install percona-toolkit

The Percona utilities will be installed on your machine, ready to use.

Query Workload Analysis

There are several ways to generate statistics from the query workload using pt-query-digest. Below are the commands for doing it from the database processlist, from a slow or general query log file, and from the binary log.

Generate from the database processlist

pt-query-digest --processlist h=localhost,D=sbt,u=sbtest,p=12qwaszx --output slowlog > /tmp/slow_query.log

Generate from the slow query log file / general query log file

pt-query-digest mysql-slow.log > /tmp/slow_query.log

Generate from the binary log. Before you run pt-query-digest, you need to extract the binary log into a readable format using mysqlbinlog. Don't forget to add the --type option with binlog as the value.
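
For example, a quick sketch of that extraction step (the binlog file name is illustrative and should match your server's binlog files):

mysqlbinlog mysql-bin.000001 > mysql-bin.000001.txt

Then run pt-query-digest against the extracted text file: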

pt-query-digest --type binlog mysql-bin.000001.txt > slow_query.log

After the file is generated, you will see a report as shown below:

# 12s user time, 170ms system time, 27.44M rss, 221.79M vsz

# Current date: Sun May 10 21:40:47 2020

# Hostname: n2

# Files: mysql-1

# Overall: 94.92k total, 47 unique, 2.79k QPS, 27.90x concurrency ________

# Time range: 2020-05-10 21:39:37 to 21:40:11

# Attribute          total     min     max     avg     95%  stddev  median

# ============     ======= ======= ======= ======= ======= ======= =======

# Exec time           949s     6us      1s    10ms    42ms    42ms     2ms

# Lock time            31s       0      1s   327us    80us    11ms    22us

# Rows sent         69.36k       0     490    0.75    0.99   11.30       0

# Rows examine     196.34k       0     490    2.12    0.99   21.03    0.99

# Rows affecte      55.28k       0      15    0.60    0.99    1.26       0

# Bytes sent        13.12M      11   6.08k  144.93  299.03  219.02   51.63

# Query size        15.11M       5     922  166.86  258.32   83.13  174.84



# Profile

# Rank Query ID                      Response time  Calls R/Call V/M   Ite

# ==== ============================= ============== ===== ====== ===== ===

#    1 0xCE367F5CFFCAF46E816F682E... 162.6485 17.1%   199 0.8173  0.03 SELECT order_line? stock?

#    2 0x360F872745C81781F8F75EDE... 107.4898 11.3% 14837 0.0072  0.02 SELECT stock?

#    3 0xE0CE1933D0392DA3A42FAA7C... 102.2281 10.8% 14866 0.0069  0.03 SELECT item?

#    4 0x492B86BCB2B1AE1278147F95...  98.7658 10.4% 14854 0.0066  0.04 INSERT order_line?

#    5 0x9D086C2B787DC3A308043A0F...  93.8240  9.9% 14865 0.0063  0.02 UPDATE stock?

#    6 0x5812BF2C6ED2B9DAACA5D21B...  53.9681  5.7%  1289 0.0419  0.05 UPDATE customer?

#    7 0x51C0DD7AF0A6D908579C28C0...  44.3869  4.7%   864 0.0514  0.03 SELECT customer?

#    8 0xFFFCA4D67EA0A788813031B8...  41.2123  4.3%  3250 0.0127  0.01 COMMIT

#    9 0xFDDEE3813C59881488D9C47F...  36.0707  3.8%  1180 0.0306  0.02 UPDATE customer?

#   10 0x8FBBE0AFA061755CCC1C27AB...  31.6417  3.3%  1305 0.0242  0.03 UPDATE orders?

#   11 0x8AA6EB56551923DB9A49E40A...  23.3281  2.5%  1522 0.0153  0.04 SELECT customer? warehouse?

#   12 0xF34C10B3DA8DB048A630D4C7...  21.1662  2.2%  1305 0.0162  0.03 UPDATE order_line?

#   13 0x59DBA67188951C532AFC2598...  20.8006  2.2%  1503 0.0138  0.33 INSERT new_orders?

#   14 0xDADBEB0FBFA537F5D8722F42...  17.2802  1.8%  1290 0.0134  0.03 SELECT customer?

#   15 0x597A805ADA793440507F3334...  16.4394  1.7%  1516 0.0108  0.03 INSERT orders?

#   16 0x1B1EA568857A6AAC6544B44A...  13.9560  1.5%  1309 0.0107  0.05 SELECT new_orders?

#   17 0xCE3EDD98779478DE17154DCE...  12.1470  1.3%  1322 0.0092  0.05 INSERT history?

#   18 0x9DFD75E88091AA333A777668...  11.6842  1.2%  1311 0.0089  0.05 SELECT orders?

# MISC 0xMISC                         39.6393  4.2% 16334 0.0024   0.0 <29 ITEMS>



# Query 1: 6.03 QPS, 4.93x concurrency, ID 0xCE367F5CFFCAF46E816F682E53C0CF03 at byte 30449473

# This item is included in the report because it matches --limit.

# Scores: V/M = 0.03

# Time range: 2020-05-10 21:39:37 to 21:40:10

# Attribute    pct   total     min     max     avg     95%  stddev  median

# ============ === ======= ======= ======= ======= ======= ======= =======

# Count          0     199

# Exec time     17    163s   302ms      1s   817ms   992ms   164ms   816ms

# Lock time      0     9ms    30us   114us    44us    84us    18us    36us

# Rows sent      0     199       1       1       1       1       0       1

# Rows examine  39  76.91k     306     468  395.75  441.81   27.41  381.65

# Rows affecte   0       0       0       0       0       0       0       0

# Bytes sent     0  15.54k      79      80   79.96   76.28       0   76.28

# Query size     0  74.30k     382     384  382.35  381.65       0  381.65

# String:

# Databases    sbt

# Hosts        localhost

# Last errno   0

# Users        sbtest

# Query_time distribution

#   1us

#  10us

# 100us

#   1ms

#  10ms

# 100ms  ################################################################

#    1s  ####

#  10s+

# Tables

#    SHOW TABLE STATUS FROM `sbt` LIKE 'order_line6'\G

#    SHOW CREATE TABLE `sbt`.`order_line6`\G

#    SHOW TABLE STATUS FROM `sbt` LIKE 'stock6'\G

#    SHOW CREATE TABLE `sbt`.`stock6`\G

# EXPLAIN /*!50100 PARTITIONS*/

SELECT COUNT(DISTINCT (s_i_id))

                        FROM order_line6, stock6

                       WHERE ol_w_id = 1

                         AND ol_d_id = 1

                         AND ol_o_id < 3050

                         AND ol_o_id >= 3030

                         AND s_w_id= 1

                         AND s_i_id=ol_i_id

                         AND s_quantity < 18\G



# Query 2: 436.38 QPS, 3.16x concurrency, ID 0x360F872745C81781F8F75EDE9DD44246 at byte 30021546

# This item is included in the report because it matches --limit.

# Scores: V/M = 0.02

# Time range: 2020-05-10 21:39:37 to 21:40:11

# Attribute    pct   total     min     max     avg     95%  stddev  median

# ============ === ======= ======= ======= ======= ======= ======= =======

# Count         15   14837

# Exec time     11    107s    44us   233ms     7ms    33ms    13ms     3ms

# Lock time      1   522ms    15us   496us    35us    84us    28us    23us

# Rows sent     20  14.49k       1       1       1       1       0       1

# Rows examine   7  14.49k       1       1       1       1       0       1

# Rows affecte   0       0       0       0       0       0       0       0

# Bytes sent    28   3.74M     252     282  264.46  271.23    8.65  258.32

# Query size    19   3.01M     209     215  213.05  212.52    2.85  212.52

# String:

# Databases    sbt

# Hosts        localhost

# Last errno   0

# Users        sbtest

# Query_time distribution

#   1us

#  10us  #

# 100us  ##

#   1ms  ################################################################

#  10ms  #############

# 100ms  #

#    1s

#  10s+

# Tables

#    SHOW TABLE STATUS FROM `sbt` LIKE 'stock9'\G

#    SHOW CREATE TABLE `sbt`.`stock9`\G

# EXPLAIN /*!50100 PARTITIONS*/

SELECT s_quantity, s_data, s_dist_01 s_dist

                                                      FROM stock9

                                                    WHERE s_i_id = 60407 AND s_w_id= 2 FOR UPDATE\G

As you can see, the above pt-query-digest report can be divided into three parts.

Summary Report 

There is a lot of information in the summary report, starting with the server hostname, the date you executed the command, information about how the queries were logged, QPS, and the capture time frame. Besides that, you can also see timing statistics for each attribute. 

# 12s user time, 170ms system time, 27.44M rss, 221.79M vsz

# Current date: Sun May 10 21:40:47 2020

# Hostname: n2

# Files: mysql-1

# Overall: 94.92k total, 47 unique, 2.79k QPS, 27.90x concurrency ________

# Time range: 2020-05-10 21:39:37 to 21:40:11

# Attribute          total     min     max     avg     95%  stddev  median

# ============     ======= ======= ======= ======= ======= ======= =======

# Exec time           949s     6us      1s    10ms    42ms    42ms     2ms

# Lock time            31s       0      1s   327us    80us    11ms    22us

# Rows sent         69.36k       0     490    0.75    0.99   11.30       0

# Rows examine     196.34k       0     490    2.12    0.99   21.03    0.99

# Rows affecte      55.28k       0      15    0.60    0.99    1.26       0

# Bytes sent        13.12M      11   6.08k  144.93  299.03  219.02   51.63

# Query size        15.11M       5     922  166.86  258.32   83.13  174.84

Query Profiling Based on Rank 

You can see useful information in the query profile.

# Profile

# Rank Query ID                      Response time  Calls R/Call V/M   Ite

# ==== ============================= ============== ===== ====== ===== ===

#    1 0xCE367F5CFFCAF46E816F682E... 162.6485 17.1%   199 0.8173  0.03 SELECT order_line? stock?

#    2 0x360F872745C81781F8F75EDE... 107.4898 11.3% 14837 0.0072  0.02 SELECT stock?

There is a lot of information, such as the queries that ran, the response time of each query (including its percentage of the total), how many calls were made, and the response time per call (R/Call).

Query Distribution

The query distribution statistics give detailed information for each query in the profile ranking; you can see the QPS, the concurrency, and statistics for each query attribute. 

# Query 1: 6.03 QPS, 4.93x concurrency, ID 0xCE367F5CFFCAF46E816F682E53C0CF03 at byte 30449473

# This item is included in the report because it matches --limit.

# Scores: V/M = 0.03

# Time range: 2020-05-10 21:39:37 to 21:40:10

# Attribute    pct   total     min     max     avg     95%  stddev  median

# ============ === ======= ======= ======= ======= ======= ======= =======

# Count          0     199

# Exec time     17    163s   302ms      1s   817ms   992ms   164ms   816ms

# Lock time      0     9ms    30us   114us    44us    84us    18us    36us

# Rows sent      0     199       1       1       1       1       0       1

# Rows examine  39  76.91k     306     468  395.75  441.81   27.41  381.65

# Rows affecte   0       0       0       0       0       0       0       0

# Bytes sent     0  15.54k      79      80   79.96   76.28       0   76.28

# Query size     0  74.30k     382     384  382.35  381.65       0  381.65

# String:

# Databases    sbt

# Hosts        localhost

# Last errno   0

# Users        sbtest

# Query_time distribution

#   1us

#  10us

# 100us

#   1ms

#  10ms

# 100ms  ################################################################

#    1s  ####

#  10s+

# Tables

#    SHOW TABLE STATUS FROM `sbt` LIKE 'order_line6'\G

#    SHOW CREATE TABLE `sbt`.`order_line6`\G

#    SHOW TABLE STATUS FROM `sbt` LIKE 'stock6'\G

#    SHOW CREATE TABLE `sbt`.`stock6`\G

# EXPLAIN /*!50100 PARTITIONS*/

SELECT COUNT(DISTINCT (s_i_id))

                        FROM order_line6, stock6

                       WHERE ol_w_id = 1

                         AND ol_d_id = 1

                         AND ol_o_id < 3050

                         AND ol_o_id >= 3030

                         AND s_w_id= 1

                         AND s_i_id=ol_i_id

                         AND s_quantity < 18\G

There is also information regarding query time distribution, host, user, and database.

Query Monitoring with ClusterControl

ClusterControl has a Query Monitoring feature you can find in the Query Monitor tab as shown below.

You can see information related to the queries executed in the database, including statistics and execution time. You can also configure the Query Monitor settings on the same page. There is an option to enable logging of slow queries and queries not using indexes by clicking on Settings.

You just need to set the Long Query Time, which is the execution-time threshold above which a query is categorized as long. There is also an option to log queries that are not using indexes.
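
These settings roughly correspond to MySQL's slow-log variables, which ClusterControl manages for you. If you want to inspect or set them by hand, a rough equivalent looks like this (the threshold value is illustrative):

SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0.5;
SET GLOBAL log_queries_not_using_indexes = 1;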

Conclusion

Monitoring and analyzing the query workload helps you understand your database workload; both pt-query-digest and the ClusterControl Query Monitor provide information about the queries running in the database to help you achieve that understanding.

Progress Reporting Enhancements in PostgreSQL 12


In PostgreSQL, many DDL commands can take a very long time to execute. PostgreSQL has the ability to report the progress of DDL commands during command execution. Since PostgreSQL 9.6, it has been possible to monitor the progress of running manual VACUUM and autovacuum using a dedicated system catalog (called pg_stat_progress_vacuum).

PostgreSQL 12 has added support for monitoring the progress of a few more commands: CLUSTER, VACUUM FULL, CREATE INDEX, and REINDEX.

Currently, the progress reporting facility is available only for the commands below.

  • VACUUM command
  • CLUSTER command
  • VACUUM FULL command
  • CREATE INDEX command
  • REINDEX command

Why is the Progress Reporting Feature in PostgreSQL Important?

This feature is very important for operators running long operations, because it means you don't have to blindly wait for an operation to finish. 

This is a very useful feature to get some insight like:

  • How much total work there is
  • How much work has already been done

The progress reporting feature is also useful when doing performance workload analysis; it is also proving useful for evaluating VACUUM job processing when tuning system-level or relation-level parameters depending on the load pattern. 

Supported Commands and system catalog

Each DDL command, its system catalog view, and the PostgreSQL version where support was added:

  • VACUUM: pg_stat_progress_vacuum (supported since PostgreSQL 9.6)
  • VACUUM FULL: pg_stat_progress_cluster (PostgreSQL 12)
  • CLUSTER: pg_stat_progress_cluster (PostgreSQL 12)
  • CREATE INDEX: pg_stat_progress_create_index (PostgreSQL 12)
  • REINDEX: pg_stat_progress_create_index (PostgreSQL 12)

How to Monitor the Progress of the VACUUM Command

Whenever the VACUUM command is running, the pg_stat_progress_vacuum view will contain one row for each backend (including autovacuum worker processes) that is currently vacuuming. The views for checking the progress of running VACUUM and VACUUM FULL commands are different because the operation phases of the two commands are different.

Operation Phases of the VACUUM Command

  1. Initializing
  2. Scanning heap
  3. Vacuuming indexes
  4. Vacuuming heap
  5. Cleaning up indexes
  6. Truncating heap
  7. Performing final cleanup

This view has been available since PostgreSQL 9.6 and gives the following information:

postgres=# \d pg_stat_progress_vacuum ;

           View "pg_catalog.pg_stat_progress_vacuum"

       Column       |  Type   | Collation | Nullable | Default

--------------------+---------+-----------+----------+---------

 pid                | integer |           |          |

 datid              | oid     |           |          |

 datname            | name    |           |          |

 relid              | oid     |           |          |

 phase              | text    |           |          |

 heap_blks_total    | bigint  |           |          |

 heap_blks_scanned  | bigint  |           |          |

 heap_blks_vacuumed | bigint  |           |          |

 index_vacuum_count | bigint  |           |          |

 max_dead_tuples    | bigint  |           |          |

 num_dead_tuples    | bigint  |           |          |

Example:

postgres=# create table test ( a int, b varchar(40), c timestamp );

CREATE TABLE

postgres=# insert into test ( a, b, c ) select aa, bb, cc from generate_series(1,10000000) aa, md5(aa::varchar) bb, now() cc;

INSERT 0 10000000

postgres=# DELETE FROM test WHERE mod(a,6) = 0;

DELETE 1666666

Session 1:

postgres=# vacuum verbose test;

[. . . waits for completion . . .]

Session 2:

postgres=# select * from pg_stat_progress_vacuum;

-[ RECORD 1 ]------+--------------

pid                | 22800

datid              | 14187

datname            | postgres

relid              | 16388

phase              | scanning heap

heap_blks_total    | 93458

heap_blks_scanned  | 80068

heap_blks_vacuumed | 80067

index_vacuum_count | 0

max_dead_tuples    | 291

num_dead_tuples    | 18

Progress Reporting for CLUSTER and VACUUM FULL

The CLUSTER and VACUUM FULL commands use the same code path for the relation rewrite, so you can check the progress of both commands using the pg_stat_progress_cluster view.

This view is available in PostgreSQL 12 and it shows the following information: 

postgres=# \d pg_stat_progress_cluster

           View "pg_catalog.pg_stat_progress_cluster"

       Column        |  Type   | Collation | Nullable | Default

---------------------+---------+-----------+----------+---------

 pid                 | integer |           |          | 

 datid               | oid     |           |          | 

 datname             | name    |           |          | 

 relid               | oid     |           |          | 

 command             | text    |           |          | 

 phase               | text    |           |          | 

 cluster_index_relid | bigint  |           |          | 

 heap_tuples_scanned | bigint  |           |          | 

 heap_tuples_written | bigint  |           |          | 

 heap_blks_total     | bigint  |           |          | 

 heap_blks_scanned   | bigint  |           |          | 

 index_rebuild_count | bigint  |           |          | 

Operation Phases of CLUSTER Command

  1. Initializing
  2. Seq scanning heap
  3. Index scanning heap
  4. Sorting tuples
  5. Writing new heap
  6. Swapping relation files
  7. Rebuilding index
  8. Performing final cleanup

Example:

postgres=# create table test as select a,md5(a::text) as txt, now() as date from generate_series(1,3000000) a;

SELECT 3000000

postgres=# create index idx1 on test(a);

CREATE INDEX

postgres=# create index idx2 on test(txt);

CREATE INDEX

postgres=# create index idx3 on test(date);

CREATE INDEX

Now execute the CLUSTER table command and see the progress in pg_stat_progress_cluster. 

Session 1:

postgres=# cluster verbose test using idx1;

[. . . waits for completion . . .]

Session 2:

postgres=# select * from pg_stat_progress_cluster;

 pid  | datid | datname  | relid | command |      phase       | cluster_index_relid | heap_tuples_scanned | heap_tuples_written | heap_blks_total | heap_blks_scanned | index_rebuild_count 

------+-------+----------+-------+---------+------------------+---------------------+---------------------+---------------------+-----------------+-------------------+---------------------

 1273 | 13586 | postgres | 15672 | CLUSTER | rebuilding index |               15680 |             3000000 |             3000000 |               0 |                 0 |                   2

(1 row)
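A similar hedged helper query works here; note that for CLUSTER the block counters are only meaningful during the heap-scanning phases, while later phases report tuples:

postgres=# SELECT pid, command, phase,
                  heap_tuples_scanned, heap_tuples_written,
                  round(100.0 * heap_blks_scanned / nullif(heap_blks_total, 0), 1) AS pct_heap_scanned
           FROM pg_stat_progress_cluster;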

Progress Reporting for CREATE INDEX and REINDEX

Whenever the CREATE INDEX or REINDEX command is running, the pg_stat_progress_create_index view will contain one row for each backend that is currently creating indexes. The progress reporting feature also allows you to track the CONCURRENTLY variants of CREATE INDEX and REINDEX. The internal execution phases of the CREATE INDEX and REINDEX commands are the same, so you can check the progress of both commands using the same view.

postgres=# \d pg_stat_progress_create_index 

        View "pg_catalog.pg_stat_progress_create_index"

       Column       |  Type   | Collation | Nullable | Default

--------------------+---------+-----------+----------+---------

 pid                | integer |           |          | 

 datid              | oid     |           |          | 

 datname            | name    |           |          | 

 relid              | oid     |           |          | 

 index_relid        | oid     |           |          | 

 command            | text    |           |          | 

 phase              | text    |           |          | 

 lockers_total      | bigint  |           |          | 

 lockers_done       | bigint  |           |          | 

 current_locker_pid | bigint  |           |          | 

 blocks_total       | bigint  |           |          | 

 blocks_done        | bigint  |           |          | 

 tuples_total       | bigint  |           |          | 

 tuples_done        | bigint  |           |          | 

 partitions_total   | bigint  |           |          | 

 partitions_done    | bigint  |           |          | 

Operation Phases of CREATE INDEX / REINDEX

  1. Initializing
  2. Waiting for writers before build
  3. Building index
  4. Waiting for writers before validation
  5. Index validation: scanning index
  6. Index validation: sorting tuples
  7. Index validation: scanning table
  8. Waiting for old snapshots
  9. Waiting for readers before marking dead
  10. Waiting for readers before dropping

Example:

postgres=# create table test ( a int, b varchar(40), c timestamp );

CREATE TABLE



postgres=# insert into test ( a, b, c ) select aa, bb, cc from generate_series(1,10000000) aa, md5(aa::varchar) bb, now() cc;

INSERT 0 10000000




Session 1:

postgres=# CREATE INDEX idx ON test (b);

[. . . waits for completion . . .]

Session 2:

postgres=# SELECT * FROM pg_stat_progress_create_index;

-[ RECORD 1 ]------+-------------------------------

pid                | 19432

datid              | 14187

datname            | postgres

relid              | 16405

index_relid        | 0

command            | CREATE INDEX

phase              | building index: scanning table

lockers_total      | 0

lockers_done       | 0

current_locker_pid | 0

blocks_total       | 93458

blocks_done        | 46047

tuples_total       | 0

tuples_done        | 0

partitions_total   | 0

partitions_done    | 0



postgres=# SELECT * FROM pg_stat_progress_create_index;

-[ RECORD 1 ]------+---------------------------------------

pid                | 19432

datid              | 14187

datname            | postgres

relid              | 16405

index_relid        | 0

command            | CREATE INDEX

phase              | building index: loading tuples in tree

lockers_total      | 0

lockers_done       | 0

current_locker_pid | 0

blocks_total       | 0

blocks_done        | 0

tuples_total       | 10000000

tuples_done        | 4346240

partitions_total   | 0

partitions_done    | 0
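As with the other progress views, joining with pg_stat_activity shows the exact statement being tracked; which counter is meaningful depends on the current phase (a sketch, not part of the original example):

postgres=# SELECT p.pid, a.query, p.phase,
                  round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 1) AS pct_blocks_done,
                  round(100.0 * p.tuples_done / nullif(p.tuples_total, 0), 1) AS pct_tuples_done
           FROM pg_stat_progress_create_index p
           JOIN pg_stat_activity a USING (pid);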

Conclusion

PostgreSQL 9.6 onwards has the ability to report the progress of certain commands during execution. This is a really nice feature for DBAs, developers, and users who want to check the progress of long-running commands. This reporting capability may be extended to other commands in the future. You can read more about this feature in the PostgreSQL documentation.

Manage Engine HAProxy Monitoring Alternatives - ClusterControl HAProxy Monitoring


In a previous blog, we looked at the differences between ManageEngine Applications Manager and ClusterControl, examining the main features of each and comparing them. In this blog we will focus on HAProxy monitoring: how to monitor an HAProxy node and how the specific monitoring features of these two tools compare.

For this blog, we’ll assume you already have Applications Manager or ClusterControl installed.

HAProxy Monitoring Usage Comparison

Manage Engine Applications Manager

To start monitoring your HAProxy node, it must already be installed, as Applications Manager can only import an existing one. In your Applications Manager server, go to New Monitor -> Add New Monitor. You'll see all the available options to monitor, so you need to choose the HAProxy option under the Web Server/Services section.

Now you must specify the following information about your HAProxy node:

  • Display Name: It’ll be used to identify the node.
  • Hostname/IP Address: Of the existing HAProxy node.
  • Admin Port: It’s specified in the HAProxy configuration file.
  • Credentials: Admin credentials if needed.
  • Stats URL: URL to access the HAProxy stats.

Before pressing “Add Monitor”, you can test the credentials to confirm that it’s working correctly.

After you have your HAProxy monitored by Application Manager, you can access it from the Home section.

ClusterControl

In this case, it’s not necessary to have the HAProxy node installed, as you can deploy it using ClusterControl. We’ll assume you have a Database Cluster added into ClusterControl, so if you go to the cluster actions, you’ll see the Add Load Balancer option.

In this step, you can choose if you want to Deploy or Import it.

For the deployment, you must specify the following information:

  • Server Address: IP Address for your HAProxy server.
  • Listen Port (Read/Write): Port for read/write traffic.
  • Listen Port (Read-Only): Port for read-only traffic.
  • Policy: It can be:
    • leastconn: The server with the lowest number of connections receives the connection
    • roundrobin: Each server is used in turns, according to their weights
    • source: The source IP address is hashed and divided by the total weight of the running servers to designate which server will receive the request
  • Build from Source: You can choose Install from a package manager or build from source.
  • Servers: Select which servers you want to add to the HAProxy configuration, along with some additional information for each one:
    • Role: It can be Active or Backup.
    • Include: Yes or No.
    • Connection address information.

Also, you can configure Advanced Settings like Admin User, Backend Name, Timeouts, and more.

For the import action, you must specify the current HAProxy information, like:

  • Server Address: IP Address for your HAProxy server.
  • Port: HAProxy admin port.
  • Admin User/Admin Password: HAProxy admin credentials.
  • HAProxy Config: HAProxy configuration file location.
  • Stats Socket: HAProxy stats socket.

Most of these values are auto-filled with the default values, so if you’re using a default HAProxy configuration, you shouldn’t change anything.

When you finish the configuration and confirm the deploy or import process, you can follow the progress in the Activity section on the ClusterControl UI.

Monitoring Your HAProxy Node

HAProxy Monitoring with Manage Engine Applications Manager

If you go into the HAProxy node, you’ll see an Availability History section.

In the Performance tab, you’ll have useful information about the HAProxy performance per hour plus a graph showing the Response Time.

If you click on the HAProxy link under the Healthy History section, you'll access more detailed information about it, with different metrics and graphs.

In the Monitor Information tab, you’ll see the data added during the import process.

Then, you have the Listener, Frontend, Backend, and Server tabs, where you have metrics about each section, like Session Utilization, Transaction Details, Response Times, and even more.

Finally, in the Configuration tab, you’ll see some HAProxy configuration values like max connections and version.

HAProxy Monitoring with ClusterControl

When you have your HAProxy node added into ClusterControl, you can go to ClusterControl -> Select Cluster -> Nodes -> HAProxy node, and check the current status.

You can also check the Topology section, to have a complete overview of the environment.

But if you want to see more detailed information about your HAProxy node, you have the Dashboards section.

Here, you can’t only see all the necessary metrics to monitor the HAProxy node, but also to monitor all the environment using the different Dashboards.

Alarms & Notifications

Manage Engine Applications Manager Notifications

As we mentioned in the previous related blog, this system has its own alarm system where you must configure actions to be run when the alarm is generated.

You can configure alarms and actions, and you can also integrate it with their own Alarm System called AlarmsOne (a different product).

ClusterControl Notifications

It also has an alarm system based on advisors. ClusterControl comes with some predefined advisors, but you can modify them or even create new ones using the integrated Developer Studio tool.

It has integration with 3rd party tools like Slack or PagerDuty, so you can receive notifications there too.

Command Line Monitoring

Applications Manager CLI

Unfortunately, this system doesn’t have a command-line tool that allows you to monitor applications or databases from the command line.

ClusterControl CLI (s9s)

For scripting and automating tasks, or even if you just prefer the command line, ClusterControl has the s9s tool. It's a command-line tool for managing your database cluster.

$ s9s node --cluster-id=8 --list --long

STAT VERSION    CID CLUSTER HOST            PORT COMMENT

coC- 1.7.6.3910   8 My1     192.168.100.131 9500 Up and running.

?o-- 2.12.0       8 My1     192.168.100.131 9090 Process 'prometheus' is running.

soM- 8.0.19       8 My1     192.168.100.132 3306 Up and running.

soS- 8.0.19       8 My1     192.168.100.133 3306 Up and running.

ho-- 1.8.15       8 My1     192.168.100.134 9600 Process 'haproxy' is running.

Total: 5

With this tool, you can perform all the tasks available in the ClusterControl UI, and even more. You can check the documentation for more examples and information about the usage of this powerful tool.
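For example, other s9s subcommands follow the same pattern; the commands below are a minimal sketch (the cluster ID 8 is simply the one used in this example):

$ s9s cluster --list --long                 # list all clusters and their state

$ s9s backup --list --cluster-id=8 --long   # list the backups taken for this cluster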

Conclusion

As you can see, both systems are useful for monitoring an HAProxy node. They have graphs, metrics, and alarms to help you know the current status of your HAProxy node. The main differences between them are ClusterControl's ability to deploy the HAProxy node itself, avoiding manual tasks, and the ClusterControl CLI, which allows you to import/deploy, manage, or monitor everything from the command line.

Apart from that, both solutions are a good way to keep your systems monitored all the time.

proxysql-admin Alternatives - ClusterControl ProxySQL GUI


ProxySQL is a very popular proxy in MySQL environments. It comes with a nice set of features, including read/write splitting, query caching, and query rewriting. ProxySQL stores its configuration in an SQLite database; configuration changes can be applied at runtime and are performed through SQL commands. This increases the learning curve and could be a blocker for some people who would like to just install it and get it running.

This is why a couple of tools exist that can help you manage ProxySQL. Let's take a look at one of them, proxysql-admin, and compare it with the features available for ProxySQL in ClusterControl.

proxysql-admin

Proxysql-admin is a tool that comes included with ProxySQL when it is installed from the Percona repositories. It is dedicated to making the setup of Percona XtraDB Cluster in ProxySQL easier. You can define the setup in the configuration file (/etc/proxysql-admin.cnf) or through arguments to the proxysql-admin command. It is possible to:

  1. Configure hostgroups (reader, writer, backup writer, offline) for PXC
  2. Create monitoring user in ProxySQL and PXC
  3. Create application user in ProxySQL and PXC
  4. Configure ProxySQL (maximum running connections, maximum transactions behind)
  5. Synchronize users between PXC and ProxySQL
  6. Synchronize nodes between PXC and ProxySQL
  7. Create predefined (R/W split) query rules for users imported from PXC
  8. Configure SSL for connections from ProxySQL to the backend databases
  9. Define a single writer or round robin access to the PXC

As you can see, this is by no means a complex tool; it focuses on the initial setup. Let's take a look at a couple of examples.

root@vagrant:~# proxysql-admin --enable



This script will assist with configuring ProxySQL for use with

Percona XtraDB Cluster (currently only PXC in combination

with ProxySQL is supported)



ProxySQL read/write configuration mode is singlewrite



Configuring the ProxySQL monitoring user.

ProxySQL monitor user name as per command line/config-file is proxysql-monitor



The monitoring user is already present in Percona XtraDB Cluster.



Would you like to enter a new password [y/n] ? n



Monitoring user 'proxysql-monitor'@'10.%' has been setup in the ProxySQL database.



Configuring the Percona XtraDB Cluster application user to connect through ProxySQL

Percona XtraDB Cluster application user name as per command line/config-file is proxysql_user



Application user 'proxysql_user'@'10.%' already present in PXC.



Adding the Percona XtraDB Cluster server nodes to ProxySQL



Write node info

+------------+--------------+------+--------+

| hostname   | hostgroup_id | port | weight |

+------------+--------------+------+--------+

| 10.0.0.152 | 10           | 3306 | 1000   |

+------------+--------------+------+--------+



ProxySQL configuration completed!



ProxySQL has been successfully configured to use with Percona XtraDB Cluster



You can use the following login credentials to connect your application through ProxySQL



mysql --user=proxysql_user -p --host=localhost --port=6033 --protocol=tcp

The above shows the initial setup. As you can see, the singlewrite (default) mode was used, the monitoring and application users have been configured, and the whole server configuration was prepared.

root@vagrant:~# proxysql-admin --status



mysql_galera_hostgroups row for writer-hostgroup: 10

+--------+--------+---------------+---------+--------+-------------+-----------------------+------------------+

| writer | reader | backup-writer | offline | active | max_writers | writer_is_also_reader | max_trans_behind |

+--------+--------+---------------+---------+--------+-------------+-----------------------+------------------+

| 10     | 11     | 12            | 13      | 1      | 1           | 2                     | 100              |

+--------+--------+---------------+---------+--------+-------------+-----------------------+------------------+



mysql_servers rows for this configuration

+---------------+-------+------------+------+--------+--------+----------+---------+-----------+

| hostgroup     | hg_id | hostname   | port | status | weight | max_conn | use_ssl | gtid_port |

+---------------+-------+------------+------+--------+--------+----------+---------+-----------+

| writer        | 10    | 10.0.0.153 | 3306 | ONLINE | 1000   | 1000     | 0       | 0         |

| reader        | 11    | 10.0.0.151 | 3306 | ONLINE | 1000   | 1000     | 0       | 0         |

| reader        | 11    | 10.0.0.152 | 3306 | ONLINE | 1000   | 1000     | 0       | 0         |

| backup-writer | 12    | 10.0.0.151 | 3306 | ONLINE | 1000   | 1000     | 0       | 0         |

| backup-writer | 12    | 10.0.0.152 | 3306 | ONLINE | 1000   | 1000     | 0       | 0         |

+---------------+-------+------------+------+--------+--------+----------+---------+-----------+

Here is the output of the default configuration of the PXC nodes in ProxySQL.
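Beyond --enable and --status, the capabilities listed earlier map to additional switches; for instance, synchronizing MySQL users from Percona XtraDB Cluster into ProxySQL can be done with the following command (a sketch, output omitted):

root@vagrant:~# proxysql-admin --syncusers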

ClusterControl

ClusterControl is, in comparison to proxysql-admin, a far more comprehensive solution. It can deploy a ProxySQL load balancer and preconfigure it according to the user's requirements.

When deploying, you can define the administrator user and password and the monitoring user, and you can also import one of the existing MySQL users (or create a new one if that is what you need) for the application to use. It is also possible to import the ProxySQL configuration from another ProxySQL instance that you already have in the cluster. This makes the deployment faster and more efficient.

What is also important to mention is that ClusterControl can deploy ProxySQL in both MySQL and Galera clusters. It can be used with the MySQL, Percona, and MariaDB flavours of MySQL.

Once deployed, ClusterControl gives you options to fully manage ProxySQL via an easy to use GUI.

You can monitor your ProxySQL instance.

You can check the heavier queries executed through ProxySQL. It is also possible to create a query rule based on the exact query.

ClusterControl configures ProxySQL for a read/write split. It is also possible to add custom query rules based on your requirements and application configuration. 

Compared to proxysql-admin, ClusterControl gives you full control over the server configuration. You can add new servers and move them between hostgroups as you want. You can create new hostgroups (and then, for example, create new query rules for them).

It is also possible to manage users in ProxySQL. You can edit existing users or import users that exist in the backend database.

Bulk import is also possible. You can also create new users on both ProxySQL and the backend databases.

ClusterControl can also be used to reconfigure ProxySQL. You can modify all of the variables through a simple UI with a search option.

As you can see, ClusterControl comes with in-depth management features for ProxySQL. It allows you to deploy and manage ProxySQL instances with ease.

Press Release: Backup Ninja Provides the Simplest and Most Cost-Effective Solution Against Ransomware


The low-cost, cloud-based solution to backing up open source databases enables startups, small businesses and individual professionals to enjoy enterprise-level peace of mind.

PRESS RELEASE  UPDATED: MAY 20, 2020 10:00 EDT

STOCKHOLM, May 20, 2020 (Newswire.com) - Among the countless number of malware threats affecting businesses, ransomware is the biggest offender, costing organizations over $7.5 billion in 2019 alone. Cyber-attacks affect more than just large companies, and Backup Ninja, a product of Severalnines, is the most simple, secure and cost-effective solution for small businesses to combat these threats. The software enables users to backup the world’s most popular open source databases locally or in the cloud, providing a safe and secure backup to minimize the impact caused by ransomware.

Ransomware attacks are costly, and small businesses suffer disproportionately due to fewer resources or a lack of sophisticated security tools and management. As the prevalence of ransomware increases, so too have the ransom demands; in 2019, ransom dollar amounts increased 37%. The downtime, often caused by a company’s unwillingness or inability to pay, can cost up to 23 times more than the ransom amount itself, and in 2019 those costs soared over 200%.

“Small businesses are attractive targets because they have information that cybercriminals want, and they typically lack the security infrastructure of larger businesses,” said the U.S. Small Business Administration (sba.gov). “According to a recent SBA survey, 88% of small business owners felt their business was vulnerable to a cyberattack. Yet many businesses can’t afford professional IT solutions; they have limited time to devote to cybersecurity, or they don’t know where to begin.”

For cybercriminals, there’s no more cost-effective option to hold a business hostage. It’s highly transmissible, as many clients often unknowingly spread the malware onto other devices, and many of the vulnerabilities that are preyed upon are of the company’s own doing. Weak passwords, poor access management, a lack of security training, and aggressive phishing email campaigns are all avenues of opportunity that cybercriminals regularly exploit for monetary gain.

Backup Ninja aims to provide an equally cost-effective but far simpler and more sophisticated solution to combat ransomware through its safe and secure backup software. Whether stored locally or in the cloud, database integrity is always preserved to ensure a seamless restoration of data. If the threat of troublesome malware should arise, businesses won’t have to pay any ransoms, and downtime is kept to an absolute minimum.

“Backup Ninja uses advanced TLS encryption for operations and encrypts stored databases using AES-256 encryption,” said Vinay Joosery, CEO of Severalnines. “Backup Ninja is simple enough for the smallest website database and feature-rich enough to support enterprise-grade requirements. Any application can take advantage of our service, as long as we offer support for the database you are using in your application.”

Businesses can protect themselves, maintain productivity, as well as profitability, with Backup Ninja, and never have to pay to retrieve their own data again.

For more information, visit Backup Ninja.

About Severalnines

Severalnines provides automation and management software for open source database clusters. We help companies deploy their databases in any environment and manage all the operational aspects to achieve high-scale availability. Severalnines' products are used by System Administrators, Developers, and DBAs of all skill levels to provide a fully complete database lifecycle; freeing them from the complexity and learning curves that are typically associated with highly-available database setups. 

The company has enabled tens of thousands of deployments to date via its popular product ClusterControl for customers like ABSA Bank, BT, Cisco, HP, IBM Research, NHS, Orange, Ping Identity, Technicolor, and VodafoneZiggo. Severalnines is a private company headquartered in Stockholm, Sweden with employees operating remotely around the world. To see who is using Severalnines’ products, visit https://severalnines.com/about-us/customers

Contact:

Forrest Lymburner
forrest@severalnines.com
+1 347-809-3407


pg_restore Alternatives - PostgreSQL Backup and Automatic Recovery with ClusterControl


While there are various ways to recover your PostgreSQL database, one of the most convenient approaches is to restore your data from a logical backup. Logical backups play a significant role in Disaster Recovery Planning (DRP). Logical backups are backups taken, for example, using pg_dump or pg_dumpall, which generate the SQL statements needed to reconstruct all table data, written to a plain-text or binary file.

It is also recommended to run periodic logical backups in case your physical backups fail or are unavailable. For PostgreSQL, restoring can be problematic if you are unsure which tools to use. The backup tool pg_dump is commonly paired with the restoration tool pg_restore.

pg_dump and pg_restore act in tandem if disaster occurs and you need to recover your data. While they serve the primary purpose of dump and restore, they do require you to perform some extra tasks when you need to recover your cluster and do a failover (if your active primary or master dies due to hardware failure or VM system corruption). You'll end up having to find and use third-party tools that can handle failover or automatic cluster recovery.

In this blog, we'll take a look at how pg_restore works and compare it to how ClusterControl handles backup and restore of your data in case disaster happens.

Mechanisms of pg_restore

pg_restore is useful for the following tasks:

  • It is paired with pg_dump, which generates archives containing data, access roles, and database and table definitions.
  • It restores a PostgreSQL database from an archive created by pg_dump in one of the non-plain-text formats.
  • It issues the commands necessary to reconstruct the database to the state it was in at the time it was saved.
  • It can be selective, or even reorder the items, prior to restoring them, based on the archive file.
  • The archive files are designed to be portable across architectures.
  • pg_restore can operate in two modes:
    • If a database name is specified, pg_restore connects to that database and restores the archive contents directly into the database.
    • Otherwise, a script containing the SQL commands necessary to rebuild the database is created and written to a file or standard output. This script output is equivalent to the plain-text format generated by pg_dump.
  • Some of the options controlling the output are therefore analogous to pg_dump options.

Once you have restored the data, it is advisable to run ANALYZE on each restored table so the optimizer has useful statistics. Although ANALYZE only takes a lightweight lock that does not block normal reads and writes, you might still prefer to run it during a low-traffic or maintenance period.
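As a minimal sketch of this post-restore step (the database and table names below simply follow the examples later in this blog), you can analyze a single table from psql or the whole database with vacuumdb:

postgres=# ANALYZE VERBOSE d;

$ vacuumdb --analyze-only -U postgres -d paultest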

Advantages of pg_restore

pg_dump and pg_restore in tandem have capabilities that are convenient for a DBA to utilize.

  • pg_dump and pg_restore can run in parallel by specifying the -j option. Using -j/--jobs <number-of-jobs> allows you to specify how many jobs can run concurrently, especially for loading data, creating indexes, or creating constraints (note that parallel pg_dump requires the directory output format).
  • It's quite handy to use; you can selectively dump or load a specific database or table.
  • It gives the user flexibility over which database or schema to restore, and lets you reorder the procedures to be executed based on a list. You can even generate and load the SQL loosely, for example skipping ACLs or privileges according to your needs. There are plenty of options to suit your needs.
  • It provides the capability to generate SQL files, just like pg_dump, from an archive. This is very convenient if you want to load another database or host to provision a separate environment.
  • It's easy to understand based on the generated sequence of SQL procedures.
  • It's a convenient way to load data in a replication environment. Your replicas don't need to be restaged, since the statements are SQL that is replicated down to the standby and recovery nodes.

Limitations of pg_restore

For logical backups, the obvious limitation of pg_restore along with pg_dump is performance and speed. The tools might be handy when you want to provision a test or development database environment and load your data, but they are not practical when your data set is huge. PostgreSQL has to dump the data object by object, and the engine has to apply the SQL sequentially when loading. Although you can speed this up somewhat, for example by specifying -j or using --single-transaction to limit the impact on your database, loading via SQL still has to be parsed by the engine.
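For example, a directory-format dump can be both dumped and restored in parallel; the commands below are a sketch (the target database paultest_copy and the job count are illustrative):

[root@testnode14 ~]# pg_dump --format=d -j 4 -U dbapgadmin -W -d paultest -f pgdump_dir

[root@testnode14 ~]# pg_restore -U dbapgadmin -W -j 4 -d paultest_copy pgdump_dir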

Additionally, the PostgreSQL documentation states the following limitations, with our additions as we observed these tools (pg_dump and pg_restore):

  • When restoring data to a pre-existing table and the option --disable-triggers is used, pg_restore emits commands to disable triggers on user tables before inserting the data, then emits commands to re-enable them after the data has been inserted. If the restore is stopped in the middle, the system catalogs might be left in the wrong state.
  • pg_restore cannot restore large objects selectively; for instance, only those for a specific table. If an archive contains large objects, then all large objects will be restored, or none of them if they are excluded via -L, -t, or other options.
  • Both tools are expected to generate a large amount of output (files, a directory, or a tar archive), especially for a huge database.
  • For pg_dump, when dumping a single table or as plain text, pg_dump does not handle large objects. Large objects must be dumped with the entire database using one of the non-text archive formats.
  • If you have tar archives generated by these tools, take note that tar archives are limited to a size less than 8 GB. This is an inherent limitation of the tar file format. Therefore this format cannot be used if the textual representation of a table exceeds that size. The total size of a tar archive and any of the other output formats is not limited, except possibly by the operating system.

Using pg_restore

Using pg_restore is quite handy and easy. Since it is paired in tandem with pg_dump, both tools work well as long as the output format of one suits the other. For example, the following pg_dump output will not be usable by pg_restore:

[root@testnode14 ~]# pg_dump --format=p --create  -U dbapgadmin -W -d paultest -f plain.sql

Password: 

The result is a plain SQL file, compatible with psql, which looks as follows:

[root@testnode14 ~]# less plain.sql 

--

-- PostgreSQL database dump

--



-- Dumped from database version 12.2

-- Dumped by pg_dump version 12.2



SET statement_timeout = 0;

SET lock_timeout = 0;

SET idle_in_transaction_session_timeout = 0;

SET client_encoding = 'UTF8';

SET standard_conforming_strings = on;

SELECT pg_catalog.set_config('search_path', '', false);

SET check_function_bodies = false;

SET xmloption = content;

SET client_min_messages = warning;

SET row_security = off;



--

-- Name: paultest; Type: DATABASE; Schema: -; Owner: postgres

--



CREATE DATABASE paultest WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8';




ALTER DATABASE paultest OWNER TO postgres;

But this will fail with pg_restore, as it does not accept the plain-text format:

[root@testnode14 ~]# pg_restore -U dbapgadmin --format=p -C -W -d postgres plain.sql 

pg_restore: error: unrecognized archive format "p"; please specify "c", "d", or "t"

[root@testnode14 ~]# pg_restore -U dbapgadmin --format=c -C -W -d postgres plain.sql 

pg_restore: error: did not find magic string in file header

Now, let's go to more useful terms for pg_restore.

pg_restore: Drop and Restore

Consider a simple use of pg_restore after you have dropped a database, e.g.

postgres=# drop database maxtest;

DROP DATABASE

postgres=# \l+

                                                                    List of databases

   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   |  Size   | Tablespace |                Description                 

-----------+----------+----------+-------------+-------------+-----------------------+---------+------------+--------------------------------------------

 paultest  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 83 MB   | pg_default | 

 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 8209 kB | pg_default | default administrative connection database

 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +| 8049 kB | pg_default | unmodifiable empty database

           |          |          |             |             | postgres=CTc/postgres |         |            | 

 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | postgres=CTc/postgres+| 8193 kB | pg_default | default template for new databases

           |          |          |             |             | =c/postgres           |         |            | 

(4 rows)

Restoring it with pg_restore is very simple:

[root@testnode14 ~]# sudo -iu postgres pg_restore  -C  -d postgres /opt/pg-files/dump/f.dump 

The -C/--create option here means the database will be created once it is encountered in the archive header. The -d postgres option points to the postgres database, but this does not mean the tables will be created in the postgres database; it only requires an existing database to connect to. If -C is not specified, the table(s) and records are restored into the database referenced by the -d argument.

Restoring Selectively By Table

Restoring a table with pg_restore is simple. For example, say you have two tables, namely "b" and "d". Let's say you run the following pg_dump command:

[root@testnode14 ~]# pg_dump --format=d --create  -U dbapgadmin -W -d paultest -f pgdump_inserts

Password:

The contents of this directory will look as follows:

[root@testnode14 ~]# ls -alth pgdump_inserts/

total 16M

-rw-r--r--. 1 root root  14M May 15 20:27 3696.dat.gz

drwx------. 2 root root   59 May 15 20:27 .

-rw-r--r--. 1 root root 2.5M May 15 20:27 3694.dat.gz

-rw-r--r--. 1 root root 4.0K May 15 20:27 toc.dat

dr-xr-x---. 5 root root  275 May 15 20:27 ..

If you want to restore a table (namely "d" in this example),

[root@testnode14 ~]# pg_restore -U postgres -Fd  -d paultest -t d pgdump_inserts/

You should then have:

paultest=# \dt+

                   List of relations

 Schema | Name | Type  |  Owner   | Size  | Description

--------+------+-------+----------+-------+-------------

 public | d    | table | postgres | 51 MB |

(1 row)

pg_restore: Copying Database Tables to a Different Database

You may even copy the contents of an existing database into a target database. For example, I have the following databases:

paultest=# \l+ (paultest|maxtest)

                                                  List of databases

   Name   |  Owner   | Encoding |   Collate   |    Ctype    | Access privileges |  Size   | Tablespace | Description 

----------+----------+----------+-------------+-------------+-------------------+---------+------------+-------------

 maxtest  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                   | 84 MB   | pg_default | 

 paultest | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                   | 8273 kB | pg_default | 

(2 rows)

The paultest database is an empty database, while maxtest is the database whose contents we are going to copy:

maxtest=# \dt+

                   List of relations

 Schema | Name | Type  |  Owner   | Size  | Description

--------+------+-------+----------+-------+-------------

 public | d    | table | postgres | 51 MB |

(1 row)



maxtest=# \dt+

                   List of relations

 Schema | Name | Type  |  Owner   | Size  | Description 

--------+------+-------+----------+-------+-------------

 public | b    | table | postgres | 69 MB | 

 public | d    | table | postgres | 51 MB | 

(2 rows)

To copy it, we need to dump the data from maxtest database as follows,

[root@testnode14 ~]# pg_dump --format=t --create  -U dbapgadmin -W -d maxtest -f pgdump_data.tar

Password: 

Then load or restore it as follows,
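A command along these lines does the job (a sketch; adjust the user and connection options to your environment, and note that without -C the contents are restored into the database given with -d):

[root@testnode14 ~]# pg_restore -U postgres --format=t -d paultest pgdump_data.tar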

Now we have the data in the paultest database, and the tables have been stored accordingly.

postgres=# \l+ (paultest|maxtest)

                                                 List of databases

   Name   |  Owner   | Encoding |   Collate   |    Ctype    | Access privileges |  Size  | Tablespace | Description 

----------+----------+----------+-------------+-------------+-------------------+--------+------------+-------------

 maxtest  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                   | 153 MB | pg_default | 

 paultest | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                   | 154 MB | pg_default | 

(2 rows)

paultest=# \dt+

                   List of relations

 Schema | Name | Type  |  Owner   | Size  | Description 

--------+------+-------+----------+-------+-------------

 public | b    | table | postgres | 69 MB | 

 public | d    | table | postgres | 51 MB | 

(2 rows)

Generate a SQL file With Re-ordering

I have seen a lot of pg_restore usage, but it seems this feature is not usually showcased. I found this approach very interesting, as it allows you to comment out what you do not want to include and then generate a SQL file in the order you want to proceed.

For example, we'll use the sample pgdump_data.tar we have generated earlier and create a list. To do this, run the following command:

[root@testnode14 ~]# pg_restore  -l pgdump_data.tar  > my.list

This will generate a file as shown below:

[root@testnode14 ~]# cat my.list 

;

; Archive created at 2020-05-15 20:48:24 UTC

;     dbname: maxtest

;     TOC Entries: 13

;     Compression: 0

;     Dump Version: 1.14-0

;     Format: TAR

;     Integer: 4 bytes

;     Offset: 8 bytes

;     Dumped from database version: 12.2

;     Dumped by pg_dump version: 12.2

;

;

; Selected TOC Entries:

;

204; 1259 24811 TABLE public b postgres

202; 1259 24757 TABLE public d postgres

203; 1259 24760 SEQUENCE public d_id_seq postgres

3698; 0 0 SEQUENCE OWNED BY public d_id_seq postgres

3560; 2604 24762 DEFAULT public d id postgres

3691; 0 24811 TABLE DATA public b postgres

3689; 0 24757 TABLE DATA public d postgres

3699; 0 0 SEQUENCE SET public d_id_seq postgres

3562; 2606 24764 CONSTRAINT public d d_pkey postgres

Now, let's re-order it; or, rather, I have commented out the creation of the SEQUENCE and also the creation of the constraint. It looks as follows:

TL;DR

...

;203; 1259 24760 SEQUENCE public d_id_seq postgres

;3698; 0 0 SEQUENCE OWNED BY public d_id_seq postgres

TL;DR

….

;3562; 2606 24764 CONSTRAINT public d d_pkey postgres

To generate the file in SQL format, just do the following:

[root@testnode14 ~]# pg_restore -L my.list --file /tmp/selective_data.out pgdump_data.tar 

Now the file /tmp/selective_data.out will be a generated SQL file, readable with psql but not with pg_restore. What's great about this is that you can generate a SQL file according to your own template, restoring only the data you want from an existing archive or backup taken with pg_dump, with the help of pg_restore.
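Loading the generated file is then a plain psql run (a sketch; the target database is illustrative):

[root@testnode14 ~]# psql -U postgres -d paultest -f /tmp/selective_data.out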

PostgreSQL Restore with ClusterControl 

ClusterControl does not utilize pg_restore or pg_dump as part of its feature set. We use pg_dumpall to generate logical backups and, unfortunately, the output is not compatible with pg_restore.

There are several other ways to generate a backup in PostgreSQL as seen below.

There is no mechanism to selectively restore a single table or database, or to copy from one database to another.

ClusterControl does support Point-in-Time Recovery (PITR), but this does not let you manage data restores as flexibly as pg_restore does. Of all the listed backup methods, only pg_basebackup and pgbackrest are PITR-capable.

How ClusterControl handles restore is that it has the capability to recover a failed cluster as long as Auto Recovery is enabled as shown below.

Once the master fails, the slave can automatically recover the cluster as ClusterControl performs the failover (which is done automatically). For the data recovery part, your only option is a cluster-wide recovery, which means it comes from a full backup. There is no capability to selectively restore only the target database or table you want. If you need that, restore the full backup; it is easy to do with ClusterControl. You can go to the Backup tab, just as shown below.

You'll have a full list of successful and failed backups. Restoring is then done by choosing the target backup and clicking the "Restore" button. This allows you to restore to an existing node registered within ClusterControl, verify the backup on a standalone node, or create a cluster from the backup.

Conclusion

Using pg_dump and pg_restore simplifies the backup/dump and restore approach. However, for a large-scale database environment, this might not be an ideal component for disaster recovery. For a minimal selection-and-restore procedure, the combination of pg_dump and pg_restore gives you the power to dump and load your data according to your needs.

For production environments (especially for enterprise architectures), you might use the ClusterControl approach to create a backup and restore with automatic recovery.

A combination of approaches is also worth considering. This helps you lower your RTO and RPO and, at the same time, leverage the most flexible way to restore your data when needed.

pgAdmin Alternatives - PostgreSQL Database Management GUI ClusterControl


There are many tools used in database administration that help simplify the management of open source databases. The advantage of using these types of applications is the availability of menus for the various objects in the database (such as tables, indexes, sequences, procedures, views, and triggers), so that you do not have to use the command line of a native database client. You simply browse the menu, and the object immediately appears on the screen.

In this blog, we will review one of the third-party database management applications for PostgreSQL called pgAdmin. It is an open source database management tool that is useful for database administration, ranging from creating tables, indexes, and views to triggers and stored procedures. Besides that, pgAdmin can also monitor the database for information related to sessions, transactions per second, and locking.

pgAdmin Monitoring

There are some metrics in pgAdmin that provide valuable insight into the current state of the database. Here are the metrics displayed in pgAdmin.

In the Dashboard, you can monitor information related to incoming connections to the database through Server Sessions. Information related to committed transactions, rollbacks, and total transactions per second can be seen in the Transactions per Second graph. Tuples In contains information on the total tuples inserted, updated, and deleted in the database, while Tuples Out contains information on tuples returned to the client from the database (tuples is the PostgreSQL term for rows). The Block I/O metric contains disk-related information: both the total blocks read and the blocks fetched from the database cache.

Server Activity contains information related to running sessions, locking that occurs in the database, prepared statements, and database configuration, as shown in the picture below.

In Properties, you can see information related to the PostgreSQL database being accessed, such as the database name, server type, database version, IP address, and the username used.

The SQL contains information related to the generated SQL script created from a selected object as follows:

The information in the highlighted object is displayed in great detail, as it contains a script to reconstruct an object.

In the Statistics tab, the information related to statistics collected from each object running in the database are displayed on the menu.

As an example, the above table contains information regarding tuples (inserted, updated, deleted, live, dead). There is also information related to vacuum and auto-analyze.

Vacuum runs to clean dead tuples in the database and reclaim the disk storage used by them, while auto-analyze generates statistics on objects so the optimizer can accurately determine the execution plan of a query.

ClusterControl PostgreSQL Monitoring

ClusterControl has various metrics related to the PostgreSQL database which can be found on the Overview, Nodes, Dashboard, Query Monitor, and Performance tabs. The following metrics display in ClusterControl.

The Overview section contains information related to server load metrics, ranging from connections to the number of inserts, deletes, updates, commits, and rollbacks. In addition, there is information such as node health, the replication status of the PostgreSQL database, and server utilization, as shown in the figure below.

The Nodes tab provides server-side graphs covering CPU utilization, memory, disk usage, network, and swap usage.

The Dashboard has several metric options such as System Overview, Cluster Overview, and PostgreSQL Overview. Each option offers various metrics related to the running system. For example, the PostgreSQL Overview dashboard shows information ranging from database load averages to available memory and network transmit/receive rates, as shown below.

The Query Monitor contains information about queries running in the database. You can find out which queries are running, how long they have been executing, the source client address, and the state of the session. Besides that, there is a Kill Session feature with which you can terminate a session that is causing the database to experience delays. The following is the display from the Query Monitor:

In addition to running queries, you can also view Query Statistics, such as access by sequential or index scan, table I/O statistics, index I/O statistics, database size, and the top 10 largest tables.

The Performance tab contains information related to database variables and their current values; besides that, there are Advisors that provide recommendations for following up on the warnings that occur.

The growth of databases and tables can also be monitored in the DB Growth menu; by analyzing these growth metrics you can predict storage needs and plan any follow-up actions.

PostgreSQL Administration Tasks with pgAdmin

pgAdmin has various features for administering the database and the objects in it, ranging from creating tables, indexes, and users to tablespaces. The various features of pgAdmin are very useful for both developers and DBAs, because they make it very easy to manage database objects. The following is the appearance of the menu tree in pgAdmin.

Just right-click on the object you want to work with, and you will see the actions that can be performed on it. For example, after highlighting Databases, you can create a new database like this:

A dialog box appears where you fill in the database name, the owner of the database to be created, the encoding to be used, the tablespace the database will use, and the security access to the database: which users have the right to access it, and what privileges they will be granted.

PostgreSQL Administration Tasks with ClusterControl

ClusterControl can also create users and assign privileges through its User Management section, as shown in the following figure.

With ClusterControl you can deploy highly available PostgreSQL databases. You can also change the database parameters and the ACL of IP addresses allowed to access the database in the Configuration menu.

 

Dealing with Slow Queries in MongoDB


In production, an application should provide timely responses to users. At times, however, database queries may start to lag, resulting in longer latency before a response reaches the user, or in operations being terminated for exceeding the configured timeout.

In this blog we are going to learn how you can identify these problems in MongoDB, ways to fix them whenever they arise, and possible strategies to keep them from happening again.

More often than not, what leads to slow query responses is degraded CPU capacity that is unable to handle the underlying working set. The working set in this case is the amount of data and indexes that is active at a given moment. This is especially important in capacity planning, when you expect the amount of data involved and the number of users engaging with your platform to increase over time.

Identifying a Slow Query Problem

There are two ways you can identify slow queries in MongoDB.

  1. Using the Profiler
  2. Using db.currentOp() helper

Using the MongoDB Profiler

The database profiler in MongoDB is a mechanism for collecting detailed information about database commands executed against a running mongod instance: throughput operations (create, read, update, and delete) as well as configuration and administration commands.

The profiler writes all its data to a capped collection named system.profile. This means that when the collection reaches its maximum size, older documents are deleted to make room for new data.

The profiler is off by default, but it can be enabled per database or per instance at one of the following profiling levels:

  • 0 - the profiler is off and does not collect any data.
  • 1 - the profiler collects data for operations that take longer than the value of slowms.
  • 2 - the profiler collects data for all operations.

However, enabling profiling has a performance impact on the database and on disk usage, especially when the profiling level is set to 2. You should consider any performance implications before enabling and configuring the profiler on a production deployment.

To set the profiling level, we use the db.setProfilingLevel() helper, for example:

db.setProfilingLevel(2)

The helper returns the previous profiling settings, for example:

{ "was" : 0, "slowms" : 100, "sampleRate" : 1.0, "ok" : 1 }

The "ok" : 1 key-value pair indicates that the operation succeeded, "was" is the previous profiling level, and slowms is the threshold time in milliseconds an operation should take before being considered slow, 100 ms by default.

To change this value

db.setProfilingLevel(1, { slowms: 50 })

To query for data against the system.profile collection run:

db.system.profile.find().pretty()
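For instance, to focus on the slowest profiled operations you can filter and sort on the millis field; the 100 ms threshold below is just an example:

db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 }).limit(10).pretty()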

Using the db.currentOp() Helper

This function lists the currently running operations with very detailed information, such as how long they have been running. In a running mongo shell, you run the command, for example:

db.currentOp({"secs_running": {$gte: 5}})

Here secs_running is the filtering criterion, so that only operations that have taken more than 5 seconds are returned, reducing the output. This is often used when CPU utilization sits at 100% because of the adverse performance impact on the database. By changing the value, you will learn which queries are taking long to execute.

The returned documents have the following as the keys of interest:

  • query: what the query entails
  • active: whether the query is still in progress
  • ns: the collection name against which the query is executed
  • secs_running: the duration the query has taken so far, in seconds

By highlighting which queries are taking long, you have identified what is overloading the CPU.

Interpreting Results and Fixing the Issues

As we described above, query latency is highly dependent on the amount of data involved and on inefficient execution plans. For example, if you don't use indexes on your collection and want to update certain records, the operation has to go through all the documents rather than filtering only those that match the query specification. Logically, this takes longer and leads to a slow query. You can examine an inefficient execution plan by running explain('executionStats'), which provides statistics about the performance of the query. From this point you can learn how the query is utilizing the index, and get a clue as to whether the index is optimal.
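A minimal sketch of running it, using a hypothetical inventory collection and the quantity field referenced later in this post:

db.inventory.find( { quantity: { $gte: 5 } } ).explain("executionStats")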

If the explain helper returns

{

   "queryPlanner" : {

         "plannerVersion" : 1,

         ...

         "winningPlan" : {

            "stage" : "COLLSCAN",

            ...

         }

   },

   "executionStats" : {

      "executionSuccess" : true,

      "nReturned" : 3,

      "executionTimeMillis" : 0,

      "totalKeysExamined" : 0,

      "totalDocsExamined" : 10,

      "executionStages" : {

         "stage" : "COLLSCAN",

         ...

      },

      ...

   },

   ...

}

The queryPlanner.winningPlan.stage: COLLSCAN key-value pair indicates that mongod had to scan the entire collection to identify the results, which makes it an expensive operation and leads to slow queries.

executionStats.totalKeysExamined: 0 means the collection is not using an index.

For a given query, the number of documents examined should be close to the number of documents returned. If the number of examined documents is much larger, there are two possibilities:

  1. Not using indexing with the collection
  2. Using an index which is not optimal.

To create an index for a collection run the command: 

db.collection.createIndex( { quantity: 1 } )

Where quantity is an example field you have selected to be optimal for the indexing strategy.

If you want to learn more about indexing and which indexing strategy to use, check on this blog

Conclusion

Database performance degradation is most visibly manifested by slow queries, which is the last thing we want platform users to encounter. You can identify slow queries in MongoDB by enabling the profiler and configuring it to your specifications, or by executing db.currentOp() on a running mongod instance.

By looking at the time parameters in the returned results, we can identify which queries are lagging. After identifying these queries, we use the explain helper on them to get more details, for example whether the query is using any index.

Without indexing, operations become expensive since a lot of documents need to be scanned before the changes are applied. The CPU will be overworked, resulting in slow queries and rising CPU spikes.

The major mistake that leads to slow queries is an inefficient execution plan, which can often be resolved simply by creating an appropriate index for the involved collection.

 

pgDash Diagnostics Alternatives - PostgreSQL Query Management with ClusterControl


Databases are all about queries. You store your data in them and then you have to be able to retrieve it in some way. This is where queries come in: you write them in some language, structured or not, to define what data you want to retrieve. Ideally, those queries would be fast; after all, we don't want to wait for our data. There are many tools that let you understand how your queries behave and how they perform. In this blog post we will compare pgDash and ClusterControl. In both cases query performance is just a part of the functionality. Without further ado, let's take a look at them.

What is pgDash?

pgDash is a tool dedicated to monitoring PostgreSQL, and monitoring query performance is one of the available functionalities.

pgDash requires pg_stat_statements to get the data. It is possible to show queries on a per-database basis. You can define which columns should be visible (by default some of them are not shown, to make the data easier to read). You can see multiple types of data, like execution time (average, max, min, total), but also information about temporary blocks, rows accessed, disk access, and buffer hits. This creates a nice insight into how a given query performs and what could be the reason why it does not perform efficiently. You can sort the data using any column, looking for queries that, for example, are the slowest ones or write the most temporary blocks.
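For reference, this is roughly the raw data pg_stat_statements exposes; a direct query for the slowest statements could look like the sketch below (column names vary by PostgreSQL version: total_exec_time/mean_exec_time in 13 and later, total_time/mean_time before that):

SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;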

If needed, you can look up queries executed in a defined time window.

The granularity here is one minute.

For every query on the list you can click and see more detailed statistics.

You can see the exact query, some data on it (disk access, shared buffer access, temporary blocks access). It is also possible to enable testing and storing the execution plan for the queries. Finally you can see the graphs showing how the performance of the query changed in time.

Overall, pgDash presents a nice insight into the query performance metrics in PostgreSQL.

ClusterControl PostgreSQL Query Monitoring & Management

ClusterControl comes with Query Monitor which is intended to give users insight into the performance of their queries. Query Monitor can be used for PostgreSQL but also for MySQL and Galera Cluster.

ClusterControl PostgreSQL Query Management

ClusterControl shows data aggregated across all databases and hosts in the cluster. The list of queries contains performance-related metrics: number of occurrences, examined rows, temporary tables, and maximum, average and total execution time. The list can be sorted by several of the columns (occurrences, max, average, standard deviation and total execution time).

PostgreSQL Query Management ClusterControl

Each query can be clicked on to show the full query text, some additional details and general optimization hints.

ClusterControl also comes with the Query Outliers module.

PostgreSQL Query Outliers - ClusterControl

If there are any queries that deviate from the average performance of that particular query type, they will be shown in this section, allowing the user to better understand which queries behave inconsistently and try to find the root cause for this.

PostgreSQL Table and Index Metrics

On top of data directly related to the query performance, both tools provide information about other internals that may affect query performance. 

pgDash has a “Tools” section in which you can collect information about indexes, table size and bloat:

Similar data is available in ClusterControl, under Query Statistics:

It is possible to check the I/O statistics for tables and indexes, table and index bloat, and unused or duplicated indexes. You can also check which tables are more likely to be accessed using index or sequential scans, as well as the size of the largest tables and databases.

Conclusion

We hope this short blog gives you insight into how ClusterControl compares with pgDash in features related to query performance. Please keep in mind that ClusterControl is intended not only to assist you with performance monitoring but also to build and deploy HA stacks for multiple open source databases, perform configuration management, define and execute backup schedules, and much more. If you are interested in ClusterControl, you can download it for free.

MySQL Workbench Alternatives - ClusterControl’s Point-and-Click GUI


Many would agree that having a graphical user interface is more efficient and less prone to human error when managing or administering a system. A graphical user interface (GUI) greatly helps reduce the steep learning curve required to get up to speed, especially if the software or system is new and complex to the end-user. For MySQL, the installer or packages come only with a command line interface (CLI) out-of-the-box. However, there is a handful of software available in the market that provides a GUI, including the one created by the MySQL team themselves called MySQL Workbench.

In this blog post, we are going to look into the graphical user interface aspects of MySQL Workbench and ClusterControl. Both tools have their own advantages and strengths, where some feature sets are overlapping since both tools support management, monitoring, and administration features to certain degrees.

MySQL Workbench GUI

MySQL Workbench is one of the most popular free graphical user interface (GUI) tools to manage and administer a MySQL server. It is a unified visual tool built for database architects, developers, and DBAs. MySQL Workbench provides SQL development tools and data modeling, with comprehensive administration tools for server configuration, user administration, backup, and much more. It's written in C++ and supports Windows, macOS and Linux (Ubuntu, RHEL, Fedora), and the source code is also available if you want to compile it yourself.

MySQL Workbench assumes you have an already running MySQL server, and it is used as the graphical user interface to manage that server. You can perform most of the database management and administration tasks with Workbench, like service control, configuration/user/session/connection/data management, as well as SQL development and data modelling. The management features have been covered in the previous blog posts of this series, Database User Management and Configuration Management.

In terms of monitoring, the Performance Dashboard provides quick views of MySQL performance on key server, network, and InnoDB metrics:

You can mouse over the various graphs and visuals to get more information about the sampled values, refreshed every 3 seconds. Note that Workbench does not store the sampling data anywhere, so the graphs are populated only with the monitoring data collected from the moment you open the dashboard until it is closed.

One of the MySQL Workbench strengths is its data modeling and design feature. It enables you to create models of your database schema graphically, reverse and forward engineer between a schema and a live database, and edit all aspects of your database using the comprehensive editor. The following screenshot shows the entity-relationship (ER) diagram built and visualized with Workbench of Sakila sample database:

Another notable feature is the database migration wizard, which allows you to migrate tables and data from a supported database system like Microsoft SQL Server, Microsoft Access, PostgreSQL, Sybase ASE, Sybase SQL Anywhere and SQLite to MySQL:

This tool can save DBA and developer time with its visual, point and click ease of use around all phases of configuring and managing a complex migration process. This migration wizard can also be used to copy databases from one MySQL server to another and also to upgrade to the latest version of MySQL using logical upgrade.

ClusterControl GUI

ClusterControl comes with two user interfaces - GUI and CLI. The graphical user interface, also known as ClusterControl UI is built on top of LAMP stack technologies. Thus, it requires extra steps to prepare, install and configure all the dependencies for a MySQL database server, Apache web server and PHP. To make sure all dependencies are met and configured correctly, it's recommended to install ClusterControl on a clean fresh host using the installer script available on the website. 

Once installed, open your preferred web browser and go to http://ClusterControl_server_IP_address/clustercontrol and start creating the admin user and password. The next step is to either deploy a new database cluster or import an existing database cluster into it.

ClusterControl groups database servers per cluster, even for standalone database nodes. It focuses more on the low-level system administration responsibilities: automation, management, monitoring and scaling of your database servers and clusters. One of the cool GUI features is the cluster topology visualization, which gives a high-level view of how the current database architecture looks, including the load-balancer tier:

The Topology view provides a real-time summary of the cluster/node state, replication data flow and the relationship among members in the cluster. As you might know, for MySQL replication the database role and replication flow are very critical, especially after a topology change event like a master failure, slave promotion or switchover.

ClusterControl provides many step-by-step wizards to help users deploy, manage and configure their database servers. Most of the difficult and complex tasks are configurable via these wizards, like deploying a cluster, importing a cluster, adding a new database node, deploying a load balancer, scheduling a backup, restoring a backup and performing backup verification. For example, if you would like to schedule a backup, there are different steps involved depending on the chosen backup method, the chosen backup destination and many other variables. The UI will dynamically get updated according to the chosen options, as highlighted by the following schedule backup screenshot:

In the above screenshot, we can tell that there are 4 major steps to schedule this kind of backup based on the inputs specified in the first (pick whether to create or schedule a backup) and the second step (this page). The third step is about configuring xtrabackup (the chosen backup method on this page), the last step is about configuring the backup destination to cloud (the chosen backup destination on this page). Configuring advanced settings is really not an obstacle using ClusterControl. If you are unsure about all of the advanced options, just accept the default values which commonly suit general purpose backups.

Although the graphical interface is a web-based application, all monitoring and trending components like graphs, histograms, status and variable grids are updated in real-time with customizable range and refresh rate settings to suit your monitoring needs:

Advantages & Disadvantages

MySQL Workbench is relatively easy to install, running as a standalone application with no dependencies. It has all the necessary features to manage and administer the database objects required for your application. It is free and open source and backed by the team who maintains the MySQL server itself. New MySQL features are usually supported by MySQL Workbench before the masses adopt them.

On the downside, MySQL Workbench does not support mobile or tablet versions; however, there are other comparable tools available on the respective app stores. The performance monitoring features of MySQL Workbench are useful (albeit simple), highlighting only the common metrics, and the monitoring data is not stored for future reference.

The ClusterControl GUI is a web-based application which is accessible from all devices that can run the supported web browsers, whether it's a normal PC, laptop, smartphone or tablet. It supports managing multiple database vendors, systems and versions, and it stores all monitoring data in its own database, which can be used to track past events, with proactive alerting capabilities. In terms of management, ClusterControl offers basic schema and user management, but it is far superior for advanced management features like configuration, automatic recovery, switchover, replication, node scaling, and load balancer management.

On the downside, ClusterControl depends on a number of software components to work smoothly. These include a properly tuned MySQL server, an Apache web server, and PHP modules. It also requires regular software updates to keep up with all the changes introduced by the many vendors it supports. ClusterControl targets sysadmins and DevOps engineers, therefore it does not have many GUI features to manage database objects (tables, views, routines, etc) or for SQL development, like an SQL editor, highlighter and formatter.

The following table compares some of the notable graphical user interface features on both tools:

Aspect                       MySQL Workbench                             ClusterControl
---------------------------  ------------------------------------------  ---------------------------------
Monitoring
  Alerting                   No                                          Yes
Management
  Deployment                 No                                          Yes
  Data modelling and design  Yes                                         No
  SQL development            Yes                                         No
  Database migration tool    Yes                                         No
  Step-by-step wizards       Yes                                         Yes
  Topology view              No                                          Yes
Cost                         Community edition (free),                   Community edition (free),
                             Standard/Enterprise editions (commercial)   Enterprise edition (subscription)

To summarize this MySQL Workbench Alternatives blog series: MySQL Workbench is a better tool to administer your database objects like schemas, tables and users, while ClusterControl is a better tool to manage your database system and infrastructure. We hope this comparison will help you decide which tool is the best for your MySQL graphical user interface client.
 

pghoard Alternatives - PostgreSQL Backup Management with ClusterControl


Managing backups manually can be a complex and risky task. You must know that your backups are working according to your backup policy, as you don’t want to discover that a backup doesn’t work, or doesn’t exist, at the moment you need it. That would be a big problem for sure. So, the best approach is to use a battle-tested backup management application, to avoid any issue in case of failure.

PGHoard is a PostgreSQL backup daemon and restore system that stores backup data in cloud object stores. It supports PostgreSQL 9.3 up to PostgreSQL 11, the latest version supported at the time of writing. The current PGHoard version is 2.1.0, released in May 2019 (one year ago).

ClusterControl is an agentless management and automation software for database clusters. It helps deploy, monitor, manage, and scale your database server/cluster directly from the ClusterControl UI or using the ClusterControl CLI. It includes backup management features and supports PostgreSQL 9.6, 10, 11, and 12 versions. The current ClusterControl version is 1.7.6, released last month, in April 2020.

In this blog, we’ll compare PGHoard with the ClusterControl Backup Management feature and we’ll see how to install and use both systems. For this, we’ll use an Ubuntu 18.04 server and PostgreSQL 11 (as it’s the latest version supported by PGHoard). We’ll install PGHoard on the same database server and import it into ClusterControl.

Backups Management Features Comparison

PGHoard

Some of the most important PGHoard features are:

  • Automatic periodic base backups
  • Automatic transaction log backups
  • Standalone Hot Backup support
  • Cloud object storage support (AWS S3, Google Cloud, OpenStack Swift, Azure, Ceph)
  • Backup restoration directly from object storage, compressed and encrypted
  • Point-in-time-recovery (PITR)
  • Initialize a new standby from object storage backups, automatically configured as a replicating hot-standby
  • Parallel compression and encryption

One of the ways to use it is to have a separate backup machine, so PGHoard can connect with pg_receivexlog to receive WAL files from the database. Another mode is to use pghoard_postgres_command as a PostgreSQL archive_command. In both cases, PGHoard creates periodic base backups using pg_basebackup.

ClusterControl

Let’s see also some of the most important features of this system:

  • User-friendly UI
  • Backup and Restore (in the same node or in a separate one)
  • Schedule Backups
  • Create a cluster from Backup
  • Automatic Backup Verification
  • Compression
  • Encryption
  • Automatic Cloud Upload
  • Point-in-time-recovery (PITR)
  • Different backup methods (Logical, Physical, Full, Incremental, etc)
  • Backup Operational Reports

As this is not only a backup management system, we’ll also mention other important features, not just the backup-related ones:

  • Deploy/Import databases: Standalone, Cluster/Replication, Load Balancers
  • Scaling: Add/Remove Nodes, Read Replicas, Cluster Cloning, Cluster-to-Cluster Replication
  • Monitoring: Custom Dashboards, Fault Detection, Query Monitor, Performance Advisors, Alarms and Notifications, Develop Custom Advisors
  • Automatic Recovery: Node and Cluster Recovery, Failover, High Availability Environments
  • Management: Configuration Management, Database Patch Upgrades, Database User Management, Cloud Integration, Ops Reports, ProxySQL Management
  • Security: Key Management, Role-Based Access Control, Authentication using LDAP/Active Directory, SSL Encryption

The recommended topology is to have a separate node to run ClusterControl, to make sure that, in case of failure, you can take advantage of the auto-recovery and failover ClusterControl features (among other useful features).

System Requirements

PGHoard

According to the documentation, PGHoard can backup and restore PostgreSQL versions 9.3 and above. The daemon is implemented in Python and works with CPython version 3.5 or newer. The following Python modules may be required, depending on your requirements:

  • psycopg2 to look up transaction log metadata
  • requests for the internal client-server architecture
  • azure for Microsoft Azure object storage
  • botocore for AWS S3 (or Ceph-S3) object storage
  • google-api-client for Google Cloud object storage
  • cryptography for backup encryption and decryption (version 0.8 or newer required)
  • snappy for Snappy compression and decompression
  • zstandard for Zstandard (zstd) compression and decompression
  • systemd for systemd integration
  • swiftclient for OpenStack Swift object storage
  • paramiko for sftp object storage

There is no explicit mention of supported operating systems, but it was tested on modern Linux x86-64 systems and should work on other platforms that provide the required modules.

ClusterControl

The following software is required by the ClusterControl server:

  • MySQL server/client
  • Apache web server (or nginx)
  • mod_rewrite
  • mod_ssl
  • allow .htaccess override
  • PHP (5.4 or later)
  • RHEL: php, php-mysql, php-gd, php-ldap, php-curl
  • Debian: php5-common, php5-mysql, php5-gd, php5-ldap, php5-curl, php5-json
  • Linux Kernel Security (SElinux or AppArmor) - must be disabled or set to permissive mode
  • OpenSSH server/client
  • BASH (recommended: version 4 or later)
  • NTP server - All servers’ time must be synced under one time zone
  • socat or netcat - for streaming backups

And it supports different operating systems:

  • Red Hat Enterprise Linux 6.x/7.x/8.x
  • CentOS 6.x/7.x/8.x
  • Ubuntu 12.04/14.04/16.04/18.04 LTS
  • Debian 7.x/8.x/9.x/10.x

If ClusterControl is installed via installation script (install-cc) or package manager (yum/apt), all dependencies will be automatically satisfied.

For PostgreSQL, it supports 9.6/10.x/11.x/12.x versions. You can find a complete list of the supported databases in the documentation.

It just requires passwordless SSH access to the database nodes (using private and public keys) and a privileged OS user (which could be root or a sudo user).
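
As a minimal sketch of this prerequisite (the IP address below is just an example database node), generating a key pair on the ClusterControl host and copying the public key to each database node is usually enough:

$ ssh-keygen -t rsa

$ ssh-copy-id root@192.168.100.125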

The Installation Process

PGHoard Installation Process

We’ll assume you have your PostgreSQL database up and running, so let’s install the remaining packages. PGHoard is a Python package, so after you have the required packages installed, you can install it using the pip command:

$ apt install postgresql-server-dev-11 python3 python3-pip python3-snappy

$ pip3 install pghoard

As part of this installation process, you need to prepare the PostgreSQL instance to work with this tool. For this, you’ll need to edit postgresql.conf to enable WAL archiving and increase max_wal_senders:

wal_level = logical

max_wal_senders = 4

archive_mode = on

archive_command = pghoard_postgres_command --mode archive --site default --xlog %f

This change will require a database restart:

$ service postgresql restart

Now, let’s create a database user for PGHoard:

$ psql

CREATE USER pghoard PASSWORD 'Password' REPLICATION;

And add the following line in the pg_hba.conf file:

host    replication  pghoard  127.0.0.1/32  md5

Reload the database service:

$ service postgresql reload

To make it work, you’ll need to create a JSON configuration file for PGHoard. We’ll see this in the next “Usage” section.

ClusterControl Installation Process

There are different installation methods, as mentioned in the documentation. In the case of manual installation, the required packages are specified in the same documentation, and there is a step-by-step guide for the whole process.

Let’s see an example using the automatic installation script.

$ wget http://www.severalnines.com/downloads/cmon/install-cc

$ chmod +x install-cc

$ sudo ./install-cc   # omit sudo if you run as root

The installation script will attempt to automate the following tasks:

  • Install and configure a local MySQL server (used by ClusterControl to store monitoring data)
  • Install and configure the ClusterControl controller package via package manager
  • Install ClusterControl dependencies via package manager
  • Configure Apache and SSL
  • Configure ClusterControl API URL and token
  • Configure ClusterControl Controller with minimal configuration options
  • Enable the CMON service on boot and start it up

Running the mentioned script, you’ll receive a question about sending diagnostic data:

$ sudo ./install-cc

!!

Only RHEL/Centos 6.x|7.x|8.x, Debian 7.x|8.x|9.x|10.x, Ubuntu 14.04.x|16.04.x|18.04.x LTS versions are supported

Minimum system requirements: 2GB+ RAM, 2+ CPU cores

Server Memory: 1024M total, 922M free

MySQL innodb_buffer_pool_size set to 512M

Severalnines would like your help improving our installation process.

Information such as OS, memory and install success helps us improve how we onboard our users.

None of the collected information identifies you personally.

!!

=> Would you like to help us by sending diagnostics data for the installation? (Y/n):

Then, it’ll start installing the required packages. The next question is about the hostname that will be used:

=> The Controller hostname will be set to 192.168.100.116. Do you want to change it? (y/N):

When the local database is installed, the installer will secure it by creating a root password that you must enter:

=> Starting database. This may take a couple of minutes. Do NOT press any key.

Redirecting to /bin/systemctl start mariadb.service

=> Securing the MySQL Server ...

=> !! In order to complete the installation you need to set a MySQL root password !!

=> Supported special password characters: ~!@#$%^&*()_+{}<>?

=> Press any key to proceed ...

And a CMON user password, which will be used by ClusterControl:

=> Set a password for ClusterControl's MySQL user (cmon) [cmon]

=> Supported special characters: ~!@#$%^&*()_+{}<>?

=> Enter a CMON user password:

That’s it. In this way, you’ll have everything in place without installing or configuring anything manually.

=> ClusterControl installation completed!

Open your web browser to http://192.168.100.116/clustercontrol and enter an email address and new password for the default Admin User.

Determining network interfaces. This may take a couple of minutes. Do NOT press any key.

Public/external IP => http://10.10.10.10/clustercontrol

Installation successful. If you want to uninstall ClusterControl then run install-cc --uninstall.

The first time you access the UI, you will need to register for the 30-day free trial period.

After your 30-day free trial ends, your installation will automatically convert to the community edition unless you have a commercial license.

Backups Management Usage

PGHoard Usage

After this tool is installed, you need to create a JSON file (pghoard.json) with the PGHoard configuration. This is an example:

{
   "backup_location": "/var/lib/pghoard",
   "backup_sites": {
      "default": {
         "nodes": [
            {
               "host": "127.0.0.1",
               "password": "Password",
               "port": 5432,
               "user": "pghoard"
            }
         ],
         "object_storage": {
            "storage_type": "local",
            "directory": "./backups"
         },
         "pg_data_directory": "/var/lib/postgresql/11/main/"
      }
   }
}

In this example, we’ll take a backup and store it locally, but you can also configure a cloud account and store it there:

"object_storage": {
   "aws_access_key_id": "AKIAQTUN************",
   "aws_secret_access_key": "La8YZBvN********************************",
   "bucket_name": "pghoard",
   "region": "us-east-1",
   "storage_type": "s3"
},

You can find more details about the configuration in the documentation.

Now, let’s run the backup using this JSON file:

$ pghoard --short-log --config pghoard.json

INFO pghoard initialized, own_hostname: 'pg1', cwd: '/root'

INFO Creating a new basebackup for 'default' because there are currently none

INFO Started: ['/usr/lib/postgresql/11/bin/pg_receivewal', '--status-interval', '1', '--verbose', '--directory', '/var/lib/pghoard/default/xlog_incoming', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='pghoard'"], running as PID: 19057

INFO Started: ['/usr/lib/postgresql/11/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--verbose', '--pgdata', '/var/lib/pghoard/default/basebackup_incoming/2020-05-21_13-13_0', '--wal-method=none', '--progress', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='pghoard'"], running as PID: 19059, basebackup_location: '/var/lib/pghoard/default/basebackup_incoming/2020-05-21_13-13_0/base.tar'

INFO Compressed 83 byte open file '/var/lib/pghoard/default/xlog_incoming/00000003.history' to 76 bytes (92%), took: 0.001s

INFO 'UPLOAD' transfer of key: 'default/timeline/00000003.history', size: 76, origin: 'pg1' took 0.001s

INFO Compressed 16777216 byte open file '/var/lib/postgresql/11/main/pg_wal/000000030000000000000009' to 799625 bytes (5%), took: 0.175s

INFO 'UPLOAD' transfer of key: 'default/xlog/000000030000000000000009', size: 799625, origin: 'pg1' took 0.002s

127.0.0.1 - - [21/May/2020 13:13:31] "PUT /default/archive/000000030000000000000009 HTTP/1.1" 201 -

INFO Compressed 16777216 byte open file '/var/lib/pghoard/default/xlog_incoming/000000030000000000000009' to 799625 bytes (5%), took: 0.190s

INFO 'UPLOAD' transfer of key: 'default/xlog/000000030000000000000009', size: 799625, origin: 'pg1' took 0.028s

INFO Compressed 16777216 byte open file '/var/lib/pghoard/default/xlog_incoming/00000003000000000000000A' to 789927 bytes (5%), took: 0.109s

INFO 'UPLOAD' transfer of key: 'default/xlog/00000003000000000000000A', size: 789927, origin: 'pg1' took 0.002s

INFO Compressed 16777216 byte open file '/var/lib/postgresql/11/main/pg_wal/00000003000000000000000A' to 789927 bytes (5%), took: 0.114s

INFO 'UPLOAD' transfer of key: 'default/xlog/00000003000000000000000A', size: 789927, origin: 'pg1' took 0.002s

127.0.0.1 - - [21/May/2020 13:13:32] "PUT /default/archive/00000003000000000000000A HTTP/1.1" 201 -

INFO Ran: ['/usr/lib/postgresql/11/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--verbose', '--pgdata', '/var/lib/pghoard/default/basebackup_incoming/2020-05-21_13-13_0', '--wal-method=none', '--progress', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='pghoard'"], took: 1.940s to run, returncode: 0

INFO Compressed 24337408 byte open file '/var/lib/pghoard/default/basebackup_incoming/2020-05-21_13-13_0/base.tar' to 4892408 bytes (20%), took: 0.117s

INFO 'UPLOAD' transfer of key: 'default/basebackup/2020-05-21_13-13_0', size: 4892408, origin: 'pg1' took 0.008s

In the “backup_location” directory (in this case /var/lib/pghoard), you’ll find a pghoard_state.json file with the current state:

$ ls -l /var/lib/pghoard

total 48

drwxr-xr-x 6 root root  4096 May 21 13:13 default

-rw------- 1 root root 42385 May 21 15:25 pghoard_state.json

And a site directory (in this case called “default/”) with the backup:

$ ls -l /var/lib/pghoard/default/

total 16

drwxr-xr-x 2 root root 4096 May 21 13:13 basebackup

drwxr-xr-x 3 root root 4096 May 21 13:13 basebackup_incoming

drwxr-xr-x 2 root root 4096 May 21 13:13 xlog

drwxr-xr-x 2 root root 4096 May 21 13:13 xlog_incoming

You can check the backup list using the following command:

$ pghoard_restore list-basebackups --config pghoard.json

Available 'default' basebackups:

Basebackup                                Backup size    Orig size  Start time

----------------------------------------  -----------  -----------  --------------------

default/basebackup/2020-05-21_13-13_0            4 MB        23 MB  2020-05-21T13:13:31Z
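
To restore one of these base backups, PGHoard provides the get-basebackup subcommand. The sketch below assumes the flags documented for this PGHoard version and an example target directory, so double-check them against your installation:

$ pghoard_restore get-basebackup --config pghoard.json --target-dir /var/lib/postgresql/11/restore --restore-to-master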

ClusterControl Usage

For this, we’ll assume you have your PostgreSQL database cluster imported in ClusterControl or you deployed it using this system.

In ClusterControl, select your cluster and go to the "Backup" section, then, select “Create Backup”.

For this example, we’ll use the “Schedule Backup” option. When scheduling a backup, in addition to selecting the common options like method or storage, you also need to specify schedule/frequency.

You must choose one method, the server from which the backup will be taken, and where you want to store it. You can also upload your backup to the cloud (AWS, Google, or Azure) by enabling the corresponding button.

Then you need to specify the use of compression, encryption, and the retention of your backup. In this step, you can also enable the “Verify Backup” feature which allows you to confirm that the backup is usable by restoring it in a different node.

If you enable the “Upload backup to the cloud” option, you will see a section to specify the cloud provider and the credentials. If you haven’t integrated your cloud account with ClusterControl, you must go to ClusterControl -> Integrations -> Cloud Providers to add it.

In the backup section, you can see the progress of the backup, and information like the method, size, location, and more.

ClusterControl Command Line (s9s)

For scripting and automating tasks, or even if you just prefer the command line, ClusterControl has the s9s tool. It's a command-line tool for managing your database cluster. Let’s see an example of how to create and list backups using this tool:

$ s9s backup --list --cluster-id=40 --long --human-readable
$ s9s backup --create --backup-method=pg_basebackup --cluster-id=40 --nodes=192.168.100.125 --backup-directory=/tmp --wait
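
Restoring from the command line follows the same pattern; this is a sketch assuming backup ID 1 from a listing like the one above:

$ s9s backup --restore --cluster-id=40 --backup-id=1 --wait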

You can find more examples and information in the ClusterControl CLI documentation section.

Conclusion

As a conclusion of comparing these backup management systems, we can say that PGHoard is a free but complex solution for this task. You’ll need some time to understand how it works and how to configure it, as the official documentation is a bit sparse on that. It also looks a bit out of date, as the latest release was a year ago. ClusterControl, on the other hand, is an all-in-one management system with many features beyond backup management, and a user-friendly, easy-to-use UI. It has a community edition (with limited available features) and paid versions with a 30-day free trial period. The documentation is clear and complete, with examples and detailed information.

We hope this blog helps you to make the best decision to keep your data safe.

The Battle of the NoSQL Databases - Comparing MongoDB & Cassandra


Introduction to MongoDB

MongoDB was introduced back in 2009 by a company named 10gen. 10gen was later renamed MongoDB Inc., the company responsible for the development of the software, which sells the enterprise version of this database. MongoDB Inc. handles all the support with its excellent enterprise-grade support team around the clock. They are committed to providing lifetime support, which means customers can choose to use any version of MongoDB and, if they wish to upgrade, it will be supported at any time. It also gives them the opportunity to stay in sync with all the security fixes that the company offers around the clock.

MongoDB is a well-known NoSQL database that has proliferated widely over the last decade or so, fueled by the explosive growth of the web and mobile applications running in the cloud. This new breed of internet-connected applications demands fast, fault-tolerant and scalable schema-less data storage, which NoSQL databases can offer. MongoDB stores data as JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB is designed for high availability and scalability with auto-sharding. It is one of the popular open-source databases in the NoSQL category and is used for high-volume data storage. MongoDB's rows, called documents, don't require a schema to be defined, because fields are created on the fly. The data model available within MongoDB allows hierarchical relationships to be represented and arrays and other more complex structures to be stored efficiently.

Introduction to Cassandra

Apache Cassandra is another well-known, free and open-source, distributed, wide-column store. Cassandra was introduced back in 2008 by a couple of developers from Facebook and was later released as an open-source project. It is currently supported by the Apache Software Foundation, and Apache is presently maintaining this project for any further enhancements.

Cassandra is a NoSQL database management system designed to handle large amounts of data across many commodity servers and provide high availability with no single point of failure. Cassandra offers very robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. Cassandra supports the distribution design of Amazon Dynamo with the data model of Google's Bigtable.

Similarities between MongoDB and Cassandra

With the brief introduction of these two NoSQL databases, let us review some of the similarities between these two databases:

  • Both MongoDB and Cassandra are NoSQL databases with open-source distributions.
  • Neither of these databases is a replacement for the traditional RDBMS database types.
  • Neither of these databases is compliant with ACID (Atomicity, Consistency, Isolation, Durability), which refers to properties of database transactions that guarantee database transactions are processed reliably.
  • Both of these databases support sharding (horizontal partitioning).
  • Consistency and normalization are two concepts that these two database types do not satisfy (as these lean more towards the RDBMS database types).

MongoDB vs. Cassandra: Features

Both technologies play a vital role in their fields. The similarities between MongoDB and Cassandra show their common features, while their differences show the uniqueness of these technologies.

Figure 1: MongoDB vs. Cassandra – 8 Major Factors of Difference

Expressive Data Model

MongoDB provides a rich and expressive data model that is known as 'object-oriented' or 'data-oriented.' This data model can easily support and represent any data structure in the domain of the user. The data can have properties and can be nested in each other for multiple levels. Cassandra has a more traditional data model with table structure, rows, and specific data type columns, defined during the creation of the table. When we compare both models, MongoDB tends to provide the richer data model. The figure below describes the typical high-level architectures of both databases in terms of their storage and replication levels.

Figure 2: Architecture diagram, MongoDB vs. Cassandra

High Availability Master Node

MongoDB supports one master node in a cluster, which controls a set of slave nodes. If the master node goes down, one of the slaves is elected as master, which takes about 20-30 seconds. During this delay the cluster is down and cannot accept any input. Cassandra supports multiple master nodes in a cluster, and if one of the master nodes goes offline, its place is taken by another master node. In comparison, Cassandra offers higher availability than MongoDB, because a single node failure does not affect the cluster and it is always available.

Secondary Indexes

MongoDB has an advantage over Cassandra if an application requires secondary indexes along with flexibility in the data model. MongoDB makes it easy to index any property of the data stored in the database, which in turn makes it easy to query. Cassandra has limited support for secondary indexes, which are restricted to single columns and equality comparisons.

Write Scalability

MongoDB supports only one master node. Only the master node accepts writes, while the rest of the nodes serve reads; therefore, all data written to the slave nodes has to pass through the master node. Cassandra supports multiple master nodes in a cluster, which makes it more suitable when write scalability is required.

Query Language Support

Currently, MongoDB does not support a dedicated query language; queries in MongoDB are structured as JSON fragments. In contrast, Cassandra has a user-friendly query language known as CQL (Cassandra Query Language), which is easily adopted by developers who have prior knowledge of SQL. How are their queries different?

Selecting records from the customer table:

 Cassandra:

SELECT * FROM customer;

 MongoDB:

db.customer.find()

Inserting records into the customer table:

 Cassandra:

INSERT INTO customer (custid, branch, status) VALUES('appl01', 'headquarters', 'A');

 MongoDB:

db.customer.insert({ cust_id: 'appl01', branch: 'headquarters', status: 'A' })

Updating records in the customer table:

Cassandra:

UPDATE Customer SET branch = 'headquarters' WHERE custage > 2;

MongoDB:

db.customer.update( { custage: { $gt: 2 } }, { $set: { branch: 'headquarters' } }, { multi: true } )

Native Aggregation

MongoDB has a built-in aggregation framework, which can be used to run an ETL pipeline to transform the data stored in the database, and it supports both small and medium data traffic. As complexity increases, the framework also gets more difficult to debug. Cassandra does not have an integrated aggregation framework; it relies on external tools such as Hadoop, Apache Spark, etc. Therefore, MongoDB is better than Cassandra when it comes to a built-in aggregation framework.
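
As a small illustration (reusing the example customer collection from the query samples above), a MongoDB aggregation pipeline that counts active customers per branch looks like this:

db.customer.aggregate([
   { $match: { status: 'A' } },
   { $group: { _id: "$branch", totalCustomers: { $sum: 1 } } }
])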

Schema-less Model

MongoDB allows the user to alter or avoid enforcing any schema on the database. Each document can have a different structure; it all depends on the program or application to interpret the data. Cassandra, on the other hand, doesn't offer the facility to alter schemas, but provides static typing, where the user is required to define the type of each column at the beginning.

Performance Benchmark

Cassandra is considered to perform better in applications that require a heavy data load, since it can support multiple master nodes in a cluster, whereas MongoDB may not be ideal for applications with a heavy data load as it can't scale with the performance. Based on the industry-standard benchmark created by Yahoo! called YCSB, MongoDB provides greater performance than Cassandra in all the tests they executed, in some use cases by as much as 25x. When optimized for a balance of throughput and durability, MongoDB provides over 50% greater throughput in mixed workloads, and 2.5x greater throughput in read-dominant workloads, compared to Cassandra.

MongoDB provides the most flexibility for ensuring durability for specific operations: users can opt for the durability optimized configuration for specific operations that are deemed critical but for which the additional latency is acceptable. For Cassandra, this change requires editing a server config file and a full restart of the database.
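
For example, in MongoDB the durability trade-off can be chosen per operation through the write concern option. The sketch below (reusing the example customer document, with illustrative field values) requests acknowledgement from a majority of replica set members and journal persistence for this single insert only:

db.customer.insertOne(
   { cust_id: 'appl02', branch: 'headquarters', status: 'A' },
   { writeConcern: { w: "majority", j: true } }
)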

Conclusion

MongoDB is known best for workloads with lots of highly unstructured data. If the scale and types of data you will be working with demand flexible data structures, MongoDB will suit you better than Cassandra. To use MongoDB effectively, you have to be able to cope with the possibility of some downtime if the master node fails, as well as with limited write speeds. And don't forget, you will also have to learn a new query language. In MongoDB, complex data can be easily managed using its JSON format support capabilities; this is a key differentiator for MongoDB when you compare it with Cassandra. In some situations, Cassandra can be considered the better database to implement when large amounts of data, speed optimization, and query execution are involved. Comparing Cassandra and MongoDB, we find that each has its respective advantages depending on the implementation requirements and the volume of data to be dealt with.


Preparing a MongoDB Server for Production


After developing your application and database model, when it is time to move the environment into production, there are a couple of things that need to be done first. Oftentimes developers fail to take into consideration additional important MongoDB steps before deploying the database into production. Consequently, it is in production mode that they end up encountering underlying setbacks that were not present in development mode. Sometimes it may be too late, or rather a lot of data would be lost, if disaster strikes. Besides, some of the steps discussed here will enable you to gauge the database’s health and hence plan for the necessary measures before disaster strikes.

Use the Current Version and Latest Drivers

Generally, the latest versions of any technology come with improved features in regard to the underlying functionality compared to their predecessors. MongoDB’s latest versions are more robust and improved than their predecessors in terms of performance, scalability and memory capacity. The same applies to the related drivers, since they are developed by the core database engineers and get updated even more frequently than the database itself.

Native extensions installed for your language can easily lay a platform for quick and standard procedures for testing, approving and upgrading the new drivers. There is also automation software such as Ansible, Puppet, SaltStack and Chef that can be used for easy upgrades of MongoDB on all your nodes without incurring professional expenses and time.

Also consider using the WiredTiger storage engine, as it is the most developed, with modern features that suit modern database expectations.

Subscribe to a MongoDB mailing list to get the latest information regarding new versions, drivers and bug fixes, and hence stay up to date.

Use a 64-bit System to Run MongoDB

In 32-bit systems, MongoDB processes are limited to about 2.5GB of data because the database uses memory-mapped files for performance. This becomes a limitation for processes that might surpass the boundary, leading to a crash. The core impact is that, in case of an error, you will not be able to restart the server until you remove your data or migrate your database to a 64-bit system, hence a longer downtime for your application.

If you have to keep using a 32-bit system,  your coding must be very simple to reduce the number of bugs and latency for throughput operations.

However, for complex workloads such as aggregation pipelines and geodata, it is advisable to use a 64-bit system.

Ensure Documents are Bounded to 16MB Size

MongoDB documents are limited to 16MB in size, but you should not get close to this limit, as it will cause some performance degradation. In practice, documents are mostly a few KB or less in size. Document size depends on the data modelling strategy chosen between embedding and referencing. Embedding is preferred where the document size is not expected to grow much. For instance, if you have a social media application where users post and posts have comments, the best practice is to have two collections, one to hold post information:

  {

   _id:1,

   post: 'What is in your mind?',

   datePosted: '12-06-2019',

   postedBy:'xyz',

   likes: 10,

   comments: 30

}

and the other to hold comments for that post.

     {

   _id: ObjectId('2434k23k4'),

   postId: 1,

   dateCommented: '12-06-2019',

   commentedBy:'ABCD',

   comment: 'When will we get better again',

}

By having such data models, comments are stored in a different collection from posts. This prevents documents in the post collection from growing out of bounds when there are many comments. Ensure you avoid application patterns that would allow documents to grow unbounded.
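
With the referencing pattern above, comments are typically fetched by their parent post, so indexing the reference field keeps those lookups cheap. This is a sketch that assumes the second collection is called comments:

db.comments.createIndex( { postId: 1 } )

db.comments.find( { postId: 1 } )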

Ensure Working Set Fits in Memory

The database may fail to read data from virtual memory (RAM), leading to page faults. Page faults force the database to read data from the physical disk, leading to increased latency and, consequently, a lag in overall application performance. Page faults happen when working with a large data set that does not fit in memory, which may be the result of some documents having an unbounded size or of a poor sharding strategy. Remedies for page faults include (see the commands after this list for how to inspect the related metrics):

  • Ensuring documents are bounded to the 16MB size.
  • Ensuring a good sharding strategy by selecting an optimal sharding key that will limit the number of documents a throughput operation will be subjected to.
  • Increasing the size (RAM) of the MongoDB instance to accommodate a larger working set.
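
To inspect these metrics on a running instance, the serverStatus output exposes both the page fault counter and the WiredTiger cache usage. The field names below follow the standard serverStatus document, but verify them on your MongoDB version:

db.serverStatus().extra_info.page_faults

db.serverStatus().wiredTiger.cache["bytes currently in the cache"]

db.serverStatus().wiredTiger.cache["maximum bytes configured"]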

Ensure you Have Replica Sets in Place

In the database world, it is not ideal to rely on a single database, due to the fact that catastrophe may strike. Besides, you would expect an increase in the number of database users, hence the need to ensure high availability of data. Replication is a crucial approach for ensuring high availability in case of failover. MongoDB has the capability of serving data geographically, which means users from different locations will be served by the nearest cloud host, as one way of reducing latency for requests.

In case the primary node fails, the secondary nodes can elect a new one to keep up with write operations, rather than the application having downtime during the failover. Actually, some cloud hosting platforms that take replication seriously don’t support non-replicated MongoDB for production environments.

Enable Journaling

As much as journaling implies some performance degradation, it is important as well. Journaling enables write-ahead operations, which means that if the database fails in the middle of an update, the update will have been saved somewhere, and when the database comes back up the process can be completed. Journaling can easily facilitate crash recovery, hence it should be turned on by default.
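
In the MongoDB configuration file this is a one-line setting; the snippet below is a minimal mongod.conf fragment (note that journaling is already on by default with WiredTiger on 64-bit builds):

storage:
  journal:
    enabled: true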

Ensure you Setup a Backup Strategy

Many businesses fail to continue after data loss due to no or poor backup systems. Before deploying your database into production ensure you have used either of these backup strategies:

  • Mongodump: optimal for small deployments and for producing backups filtered on specific needs (see the example command after this list).
  • Copying the underlying data files: optimal for large deployments and an efficient approach for taking full backups and restoring them.
  • MongoDB Management Service (MMS): provides continuous online backup for MongoDB as a fully managed service. Optimal for sharded clusters and replica sets.

Backup files should also not be stored with the same hosting provider as the database. Backup Ninja is a service that can be used for this.
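
As an illustration of the mongodump option mentioned above, a simple full logical backup written to a dated directory could look like this (host, port and path are examples):

$ mongodump --host 127.0.0.1 --port 27017 --out /backups/$(date +%F)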

Be Prepared for Slow Queries

One can hardly notice slow queries in the development environment, because little data is involved. However, this may not be the case in production, considering that you will have many users and a lot of data. Slow queries may arise if you failed to use indexes or used an indexing key that is not optimal. Nevertheless, we need a way to expose the reason for slow queries.

We therefore resolve to enable the MongoDB Query Profiler. As much as this can lead to some performance degradation, the profiler will help in exposing performance issues. Before deploying your database, you need to enable the profiler for the collections you suspect might have slow queries, especially ones that involve documents with a lot of embedding.
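
Enabling the profiler is done per database from the mongo shell. The sketch below enables profiling only for operations slower than 100 milliseconds (the threshold is an example value) and then reads back the most recent profiled operations:

db.setProfilingLevel(1, 100)

db.system.profile.find().sort( { ts: -1 } ).limit(5).pretty()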

Connect to a Monitoring Tool

Capacity planning is a very essential undertaking in MongoDB. You will also need to know the health of your db at any given time. For convenience, connecting your database to a monitoring tool will save you some time in realizing what you need to improve on your database with time. For instance, a graphical representation that indicates CPU slow performance as a result of increased queries will direct you to add more hardware resources to your system. 

Monitoring tools also have an alerting system, through email or text messages, that conveniently updates you on issues before they escalate into a catastrophe. Therefore, in production, ensure your database is connected to a monitoring tool.

ClusterControl provides free MongoDB monitoring in the Community Edition.

Implement Security Measures

Database security is another important feature that needs to be taken into account strictly. You need to protect the MongoDB installation in production by ensuring some pre-production security checklists are adhered to. Some of the considerations are:

  • Configuring Role-Based Access Control
  • Enabling Access Control and Enforce Authentication
  • Encrypting incoming and outgoing connections (TLS/SSL)
  • Limiting network exposure
  • Encrypting and protecting data
  • Have a track plan on access and changes to database configurations

Avoid external injections by running MongoDB with secure configuration options. For example, disable server-side scripting if you are not using JavaScript server-side operations such as mapReduce and $where. Use a JSON validator for your collection data, through modules like mongoose, to ensure that all stored documents are in a valid BSON format.
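
If you prefer not to rely on an application-side module, MongoDB itself can enforce a document shape with a $jsonSchema validator. The collection name and fields below reuse the earlier customer example and are illustrative only:

db.createCollection("customer", {
   validator: {
      $jsonSchema: {
         bsonType: "object",
         required: [ "cust_id", "branch", "status" ],
         properties: {
            cust_id: { bsonType: "string" },
            branch: { bsonType: "string" },
            status: { bsonType: "string" }
         }
      }
   }
})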

Hardware and Software Considerations 

MongoDB has few hardware prerequisites, since it is explicitly designed to run well on commodity hardware. The following are the main hardware considerations for MongoDB you need to take into account before deployment into production.

  • Assign adequate  RAM and CPU
  • Use the WiredTiger storage engine. It is designed to use the filesystem cache and the WiredTiger internal cache, hence increased performance. For instance, on a system with 4GB of RAM the WiredTiger cache uses 1.5GB (0.5 * (4GB - 1GB) = 1.5GB), while on a system with 1.2GB of RAM the WiredTiger cache uses only the 256MB minimum.
  • NUMA hardware. There are numerous operational issues, including slow performance and high system process usage, so you should consider configuring a memory interleave policy.
  • Disk and storage system: use solid state disks (SSDs); MongoDB shows a better price-performance ratio with SATA SSDs.

Conclusion

Databases in production are crucial for ensuring the smooth running of a business and hence should be treated with a lot of care. One should lay down procedures that can help reduce errors, or rather provide an easy way of finding them. Besides, it is advisable to set up an alerting system that shows the database’s health over time for capacity planning, and that detects issues before they escalate into a catastrophe.

 

What is a Multi-Cloud Database?


During the recent 24-hour Percona Live, multi-cloud was regularly one of the key topics. This is just another key indicator of the trend of many organizations who are switching their architectures or expanding their businesses to include multi-cloud database deployments. 

Multi-cloud is considered by many as the new normal, but it has been growing largely over the last two years and the adoption of this approach is showing significant growth. 

There are, however, certain risks and concerns which explain why some organizations have not switched their architecture or are not interested in expanding their infrastructure this way. These are often due to concerns about security and data autonomy.

What is Multi-Cloud?

Multi-cloud is a strategy of utilizing the cloud infrastructure of two or more cloud vendors (public or private) instead of relying on a single vendor. This is not limited to what the public clouds (such as Amazon, Google, and Microsoft) are offering. It can also mean using a private cloud, which offers its computing services to select users over a private internal network, or which provides the additional control and customization available from dedicated resources hosted on-premises. Being on multi-cloud helps organizations avoid vendor lock-in and gives them the ability to take a cloud-agnostic platform approach.

A common example of multi-cloud in this scenario is an enterprise application whose backup and restore strategy requires data to be stored in multiple locations. Another example is when customers demand that your application support a particular service on a different vendor once a restore process is done.

Our products ClusterControl and Backup Ninja are two of the few applications that are designed for the multi-cloud approach, not only for the infrastructure but also for multi-cloud database deployments.

Multi-cloud should not be confused with hybrid cloud. The latter refers to the presence of one or multiple cloud deployments associated with deployments running on-prem or some form of integration or orchestration between them, such as Kubernetes, for example.

What is a Multi-Cloud Database?

While multi-cloud focuses on cloud vendors (private or public, the latter being the more common place to host a database), a multi-cloud database applies the term specifically to database operations.

It's a strategy to utilize multiple database deployments dispersed over different cloud vendors, where the service offering is not limited to storing or processing data, but also involves data management such as backup and restore mechanisms, data recovery, data migration, etc., with added database enhancements and efficiency features such as query optimization or performance tuning.

According to Percona's recent survey report, organizations globally reveal that their preference is to have multiple databases placed in multiple locations across multiple platforms. Many of these platforms offer SaaS and cloud solutions to startups as well as established enterprises. The emergence of DBaaS offerings sprouting globally from various companies can entice organizations to adopt and capitalize on this trend.

Do You Need a Multi-cloud Database?

Multi-Cloud databases are popular, largely due to the following reasons...

  • It’s cost-effective
  • It’s more efficient
  • Higher performance
  • Less skills required and less focus for database management and performance efficiency workloads
  • Better automation options
  • Features high-availability & scalability to avoid unwanted outages
  • Smarter tool options
  • Better security and faster phase enhancements
  • Maximum protection for your mission critical systems
  • Fully-managed services available (means less worry, less work and focus on the business application logic)

These are just some of the key aspects behind the sudden adoption of this technology. According to Flexera (which acquired RightScale), cost savings is among the top initiatives for capitalizing on the cloud, as shown in the graph below...

Top Cloud Initiatives for 2020 - Flexera

It also shows that AWS emerges as the top choice for running infrastructure in the cloud, with Azure and Google Cloud following in the list of public cloud adoption for enterprises.

Public Cloud Adoption for Enterprises - Flexera

With the boom in DBaaS, other organizations are offering managed services for databases; especially the open-source databases. 

  • MariaDB just recently delivered their SkySQL MariaDB Cloud Database
  • MongoDB Atlas is a cloud-hosted MongoDB service on AWS, Azure, and Google Cloud in just a few clicks. 
  • For PostgreSQL enthusiasts, there's ElephantSQL to start with. 
  • Another provider which offers various database deployment services for your data infrastructure is Aiven. Aiven does a great job offering various database services (PostgreSQL, MySQL, Kafka, InfluxDB, etc.), making it very easy to deploy your new database infrastructure by interacting with the UI.

Each of these approaches, at their core, is fundamentally the same. They remove the hassle of creating automation for CI/CD deployments, allowing you to have your database infrastructure completely set up and ready for production use in just a few minutes. It's very cost effective, as you do not need to invest in hardware infrastructure.

Disadvantages of Multi-Cloud Databases

Sounds enticing, right? While multi-cloud databases might sound interesting, there are a number of concerns with the technology. Security and data autonomy are still concerns, especially for FinTech applications or those handling EHR or EMR systems.

The long-term costs of running a multi-cloud database have yet to be fully studied. Wild predictions are not welcome, as making such large investments could impact the business.

Exploring the various offerings from different vendors is a great idea, but you also have to consider staying cloud-agnostic and avoiding vendor lock-in, should you decide to explore other possibilities in the future. 

Technology is always evolving, and when major changes occur it's often hard to switch to a different platform, as doing so costs money and resources.

Multi-Cloud: The Right Tools are the Path to Success

Regardless of the advantages and benefits of multi-cloud approaches, there will be complexity and finding the right tool for your specific needs is paramount. 

You need to choose software that offers a seamless approach to deploying your multi-cloud databases, and that also gives you the capacity to oversee and observe your database performance and efficiency. 

Databases are very flexible and customizable in accordance with your desired performance tuning, and it is these features that will ultimately enrich your application and improve the delivery to your end users.

ClusterControl is a Multi-Cloud Database Deployment Platform

If you decide to leverage multi-cloud deployments of your databases, then you must leverage smart tools with wide coverage of multi-cloud platforms. ClusterControl is one of these tools. ClusterControl supports deploying your databases to AWS, Google Cloud, or Azure, in any combination.

Multi-cloud databases in ClusterControl are quite straightforward. All you need to do is set up your cloud provider credentials as shown below...

Multi-Cloud Database Deployments with ClusterControl

Then select the provider and provide the credentials to connect via API, for example:

Multi-Cloud Database Deployments with ClusterControl
Multi-Cloud Database Deployments with ClusterControl

Then, ClusterControl has a “Deploy in the Cloud” option.

Multi-Cloud Database Deployments with ClusterControl

Then select the desired database vendor on which you would like to deploy as seen below:

Multi-Cloud Database Deployments with ClusterControl

ClusterControl also offers a way to manually setup your desired multi-cloud infrastructure. This blog post, Deploying Secure Multicloud MySQL Replication on AWS and GCP with VPN, can show you how.

Conclusion

Multi-cloud databases are now part of the new normal. Many enterprises and large organizations have revealed their interest in adopting multi-cloud and deploying databases not only with one provider, but across different providers. 

Avoiding vendor lock-in and increasing cost efficiency remain top priorities for this type of deployment. Still, a key takeaway for leveraging multi-cloud deployments of your databases is the kind of tools you use and how efficient they are in helping you manage your databases once they are deployed.

 

PostgreSQL Multi-Cloud Cluster Deployment


A multi-cloud environment is a good option for a Disaster Recovery Plan (DRP), but setting it up can be time-consuming, as you need to configure the connectivity between the different cloud providers and then deploy and manage your database cluster in two different places.

In this blog, we will show how to perform a multi-cloud deployment for PostgreSQL in two of the most popular cloud providers at the moment, AWS and Google Cloud. For this task, we will use some of the features that ClusterControl can offer you, like Scaling, and Cluster-to-Cluster Replication.

We will assume you have a ClusterControl installation running and have already created two different cloud provider accounts.

Preparing Your Cloud Environment

First, you need to create your environment in your main Cloud Provider. In this case, we will use AWS with 2 PostgreSQL nodes:

Make sure you have the SSH and PostgreSQL traffic allowed from your ClusterControl server by editing your Security Group:
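If you prefer the AWS CLI over the console, the equivalent rules can be added roughly as follows (a sketch; the security group ID and the ClusterControl server IP are placeholders):

# Allow SSH (22) and PostgreSQL (5432) only from the ClusterControl server
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 10.10.10.10/32
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 5432 --cidr 10.10.10.10/32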

Then, go to the secondary Cloud Provider and create at least one virtual machine that will be the slave node. We will use the Google Cloud Platform with 1 PostgreSQL node.

And again, make sure you are allowing SSH and PostgreSQL traffic from your ClusterControl server:

In this case, we are allowing the traffic without any restriction on the source, but this is just an example and is not recommended in production.
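For reference, a properly restricted rule on Google Cloud could be created like this (a sketch; the rule name, network, and source IP are examples):

# Allow SSH and PostgreSQL only from the ClusterControl server
$ gcloud compute firewall-rules create allow-clustercontrol \
    --network=default --direction=INGRESS --action=ALLOW \
    --rules=tcp:22,tcp:5432 --source-ranges=10.10.10.10/32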

Deploy a PostgreSQL Cluster in the Cloud

We will use ClusterControl for this task, so we are assuming you have it installed.

Go to your ClusterControl server, and select the option “Deploy”. If you already have a PostgreSQL instance running, then you need to select the “Import Existing Server/Database” instead.

When selecting PostgreSQL, you must specify User, Key or Password, and port to connect by SSH to your PostgreSQL nodes. You also need the name for your new cluster and if you want ClusterControl to install the corresponding software and configurations for you.

Please check the ClusterControl user requirements for more information about this step.

After setting up the SSH access information, you must define the database user, version, and datadir (optional). You can also specify which repository to use. In the next step, you need to add your servers to the cluster you are going to create.

When adding your servers, you can enter an IP address or hostname. In this step, you could also add the node placed in the secondary Cloud Provider, as ClusterControl doesn’t impose any limitations on the network to be used, but to make it clearer we will add it in the next section. The only requirement here is to have SSH access to the node.

In the last step, you can choose if your replication will be Synchronous or Asynchronous.

If you are adding your remote node here, it is important to use Asynchronous replication; otherwise, your cluster could be affected by latency or network issues.
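For context, this choice maps to standard PostgreSQL settings on the primary; a minimal sketch (the standby application name is just an example):

# postgresql.conf on the primary
# Asynchronous replication (recommended across cloud providers) - the default:
synchronous_standby_names = ''
# Synchronous replication (only advisable on low-latency links):
# synchronous_standby_names = 'gcp_slave_node'
# synchronous_commit = on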

You can monitor the creation status in the ClusterControl activity monitor.

Once the task is finished, you can see your new PostgreSQL cluster in the main ClusterControl screen.
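If you prefer the command line, a similar deployment can be scripted with the s9s CLI. This is only a rough sketch, assuming the s9s client is installed and registered with your ClusterControl server; the IP addresses, key path, and cluster name are examples, and the exact flags may vary between versions:

$ s9s cluster --create \
    --cluster-type=postgresql \
    --nodes="10.0.1.10?master;10.0.1.11?slave" \
    --db-admin="postgres" \
    --db-admin-passwd="MySecretPassword" \
    --provider-version=12 \
    --os-user=ubuntu \
    --os-key-file=/home/ubuntu/.ssh/id_rsa \
    --cluster-name="PostgreSQL Multi-Cloud" \
    --wait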

Adding a Remote Slave Node in the Cloud

Once you have your cluster created, you can perform several tasks on it, like deploy/import a load balancer or a replication slave node.

Go to cluster actions and select “Add Replication Slave”:

Let’s use the “Add new Replication slave” option, as we are assuming that the remote node is a fresh installation; if not, you can use the “Import existing Replication Slave” option instead.

Here, you only need to choose your Master server, enter the IP address for your new slave server, and the database port. Then, you can choose if you want ClusterControl to install the software and whether the replication slave should be Synchronous or Asynchronous. Again, if you are adding a node in a different datacenter, you should use Asynchronous replication to avoid issues related to network performance. 

In this way, you can add as many replicas as you want and spread read traffic between them using a load balancer, which you can also implement with ClusterControl.

You can monitor the replication slave creation in the ClusterControl activity monitor.

And check your final topology in the Topology View Section.
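Apart from the Topology View, you can also verify the setup from the database side; on the master, each slave (including the remote one) should show up in pg_stat_replication:

postgres=# SELECT client_addr, state, sync_state FROM pg_stat_replication;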

Cluster-to-Cluster Replication in the Cloud

Instead of using the “Add Replication Slave” option to build a multi-cloud environment, you can use the ClusterControl Cluster-to-Cluster Replication feature to add a remote cluster. At the moment, this feature has a limitation for PostgreSQL that allows you to have only one remote node, so it is pretty similar to the previous approach, but we are working to remove that limitation in a future release.

To create a new Slave Cluster, go to ClusterControl -> Select Cluster -> Cluster Actions -> Create Slave Cluster.

The Slave Cluster will be created by streaming data from the current Master Cluster.

In this section, you must choose the master node of the current cluster from which the data will be replicated.

When you go to the next step, you must specify User, Key or Password, and port to connect by SSH to your servers. You also need a name for your Slave Cluster and if you want ClusterControl to install the corresponding software and configurations for you.

After setting up the SSH access information, you must define the database version, datadir, port, and admin credentials. As it will use streaming replication, make sure you use the same database version and credentials used in the Master Cluster. You can also specify which repository to use.

In this step, you need to add the server for the new Slave Cluster. For this task, you can enter both the IP Address or Hostname of the database node.

You can monitor the Slave Cluster creation in the ClusterControl activity monitor. Once the task is finished, you can see the cluster in the main ClusterControl screen.

Conclusion

These ClusterControl features allow you to quickly set up replication between different cloud providers for a PostgreSQL database (and for other technologies), and to manage the setup in an easy and friendly way. Regarding communication between the cloud providers, for security reasons you should restrict traffic to known sources only, i.e., only from Cloud Provider 1 to Cloud Provider 2 and vice versa.

Multi-Cloud Full Database Cluster Failover Options for MariaDB Cluster


With high availability being paramount in today’s business reality, one of the most common scenarios for users to deal with is how to ensure that the database will always be available for the application. 

Every service provider comes with an inherent risk of service disruption, so one of the steps that can be taken is to rely on multiple providers to mitigate the risk and add redundancy. 

Cloud service providers are no different - they can fail, and you should plan for this in advance. What options are available for MariaDB Cluster? Let’s take a look in this blog post.

MariaDB Database Clustering in Multi-Cloud Environments

If the SLA proposed by one cloud service provider is not enough, there’s always the option to create a disaster recovery site outside of that provider. Thanks to this, whenever one of the cloud providers experiences some service degradation, you can always switch to the other provider and keep your database up and available.

One of the problems typical of multi-cloud setups is network latency, which is unavoidable if we are talking about larger distances or, in general, multiple geographically separated locations. The speed of light is quite high, but it is finite, and every hop and every router adds some latency to the network infrastructure. 

MariaDB Cluster works great on low-latency networks. It is a quorum-based cluster where prompt communication between all nodes is required to keep operations smooth. An increase in network latency will impact cluster operations, especially write performance. There are several ways this problem can be addressed. 

First, we have the option to use separate clusters connected using asynchronous replication links. This allows us to almost forget about latency, because asynchronous replication is significantly better suited to high-latency environments. 

Another option is that, given low-latency networks between datacenters, you might still be perfectly fine running a MariaDB Cluster that spans several data centers. After all, multiple datacenters don’t always mean vast geographic distances - you can just as well use multiple providers located within the same metropolitan area, connected with fast, low-latency networks. In that case we are talking about a latency increase of tens of milliseconds at most, definitely not hundreds. It all depends on the application, but such an increase may be acceptable.

Asynchronous Replication Between MariaDB Clusters

Let’s take a quick look at the asynchronous approach. The idea is simple - two clusters connected with each other using asynchronous replication. 

Asynchronous Replication Between MariaDB Clusters

This comes with several limitations. For starters, you have to decide whether you want to use multi-master or send all traffic to one datacenter only. We would recommend staying away from writing to both datacenters and using master-master replication, as this may lead to serious issues if you do not exercise caution.

If you decide to use the active-passive setup, you would probably want to implement some sort of DNS-based routing for writes, to make sure that your application servers always connect to the set of proxies located in the active datacenter. This can be achieved either with a DNS entry that is changed when failover is required, or through a service discovery solution like Consul or etcd.
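As an illustration of the service-discovery approach (this is a generic Consul sketch rather than anything ClusterControl generates; the service name, address, and port are examples), the proxy in the active datacenter could be registered like this, and the application would then connect to mariadb-writer.service.consul:

# /etc/consul.d/mariadb-writer.json - registers the active writer endpoint
{
  "service": {
    "name": "mariadb-writer",
    "address": "10.0.1.10",
    "port": 3306,
    "check": {
      "tcp": "10.0.1.10:3306",
      "interval": "10s",
      "timeout": "2s"
    }
  }
}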

The main downside of an environment built using asynchronous replication is its inability to deal with network splits between datacenters. This is inherited from replication - no matter what you want to link with replication (single nodes, MariaDB Clusters), there is no way around the fact that replication is not quorum-aware. There is no mechanism to track the state of the nodes and understand the high-level picture of the whole topology. As a result, whenever the link between two datacenters goes down, you end up with two separate MariaDB Clusters that are not connected and that are both ready to accept traffic. It will be up to the user to define what to do in such a case. It is possible to implement additional tools that would monitor the state of the databases from outside (i.e. from a third datacenter) and then take actions (or not take actions) based on that information. It is also possible to collocate tools that share the infrastructure with the databases but are cluster-aware, tracking the state of datacenter connectivity and acting as the source of truth for the scripts that manage the environment. For example, ClusterControl can be deployed in a three-node cluster, one node per datacenter, that uses the RAFT protocol to ensure quorum. If a node loses connectivity with the rest of the cluster, it can be assumed that its datacenter has experienced network partitioning.

Multi-DC MariaDB Clusters

Alternative to the asynchronous replication could be an all-MariaDB Cluster solution that spans across multiple datacenters.

Multi-DC MariaDB Clusters

As stated at the beginning of this blog, MariaDB Cluster, just like every Galera-based cluster, will be impacted by high latency. Having said that, it is perfectly acceptable to run it in “not-so-high” latency environments and expect it to behave properly, delivering acceptable performance. It all depends on the network throughput and design, the distance between datacenters, and the application requirements. Such an approach works especially well if we use segments to differentiate separate data centers. This allows MariaDB Cluster to optimize its intra-cluster connectivity and reduce cross-DC traffic to a minimum.
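As a rough illustration of what this means at the configuration level (example values only; ClusterControl can set the segment for you, as shown later in this post), nodes in the second datacenter would carry a different Galera segment, and the group communication timeouts are typically relaxed for higher-latency links:

# my.cnf fragment on the nodes located in the second datacenter - example values only
wsrep_provider_options = "gmcast.segment=1;evs.keepalive_period=PT3S;evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M"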

The main advantage of this setup is that it relies on MariaDB Cluster to handle failures. If you use three data centers, you are pretty much covered against split-brain situations - as long as there is a majority, the cluster will continue to operate. It is not required to have a full-blown node in the third datacenter - you can just as well use Galera Arbitrator (garbd), a daemon that acts as a part of the cluster but does not have to handle any database operations. It connects to the nodes, takes part in the quorum calculation, and may be used to relay traffic should the direct connection between the two data centers not work. 
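A minimal sketch of running the arbitrator in the third location (the cluster name, node addresses, and log path are examples):

# Run Galera Arbitrator as a quorum member without storing any data
$ garbd --group=my_mariadb_cluster \
        --address="gcomm://10.0.1.10:4567,10.0.2.10:4567" \
        --log=/var/log/garbd.log --daemon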

In that case, the whole failover process can be described as follows: define all nodes in the load balancers (all of them if the data centers are close to each other; otherwise you may want to give priority to the nodes located closer to the load balancer) and that’s pretty much it. The MariaDB Cluster nodes that form the majority will be reachable through any proxy.

Deploying a Multi-Cloud MariaDB Cluster Using ClusterControl

Let’s take a look at two options you can use to deploy multi-cloud MariaDB Clusters using ClusterControl. Please keep in mind that ClusterControl requires SSH connectivity to all of the nodes it will manage, so it is up to you to ensure network connectivity across the multiple datacenters or cloud providers. As long as the connectivity is there, we can proceed with either of two methods.

Deploying MariaDB Clusters Using Asynchronous Replication

ClusterControl can help you deploy two clusters connected using asynchronous replication. When you have a single MariaDB Cluster deployed, you want to ensure that one of the nodes has binary logs enabled. This will allow you to use that node as a master for the second cluster that we will create shortly.
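ClusterControl can enable binary logging for you from the node’s action menu; under the hood this roughly corresponds to settings like the following (a sketch, with an example server_id):

# my.cnf fragment on the node that will act as the async master - example values
[mysqld]
server_id          = 100
log_bin            = binlog
log_slave_updates  = ON
binlog_format      = ROW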

Deploying MariaDB Clusters Using Asynchronous Replication
Deploying MariaDB Clusters Using Asynchronous Replication

Once the binary log has been enabled, we can use Create Slave Cluster job to start the deployment wizard.

Deploying MariaDB Clusters Using Asynchronous Replication
Deploying MariaDB Clusters Using Asynchronous Replication

We can either stream the data directly from the master or use one of the backups to provision the data.

Deploying MariaDB Clusters Using Asynchronous Replication

Then you are presented with a standard cluster deployment wizard where you have to pass SSH connectivity details.

Deploying MariaDB Clusters Using Asynchronous Replication

You will be asked to pick the vendor and version of the databases as well as asked for the password for the root user.

Deploying MariaDB Clusters Using Asynchronous Replication

Finally, you are asked to define nodes you would like to add to the cluster and you are all set.

Deploying MariaDB Clusters Using Asynchronous Replication

When deployed, you will see it on the list of the clusters in the ClusterControl UI.

Deploying Multi-Cloud MariaDB Cluster

As we mentioned earlier, another option to deploy MariaDB Cluster would be to use separate segments when adding nodes to the cluster. In the ClusterControl UI you will find an option to “Add Node”:

Deploying Multi-Cloud MariaDB Cluster

When you use it, you will be presented with the following screen:

Deploying Multi-Cloud MariaDB Cluster

The default segment is 0, so you will want to change it to a different value for nodes located in another datacenter.

After nodes have been added you can check in which segment they are located by looking at the Overview tab:

Deploying Multi-Cloud MariaDB Cluster

Conclusion

We hope this short blog gave you a better understanding of the options you have for multi-cloud MariaDB Cluster deployments and how they can be used to ensure high availability of your database infrastructure.

A Guide to Database Backup Archiving in the Cloud


Having a backup plan is a must when running a database in production. Running a backup every day, however, can eventually consume an excessive amount of backup storage space, especially when running on premises. 

One popular option is to store the backup files in the cloud. With cloud storage, we don’t need to worry about the disk being exhausted, as cloud object storage is virtually unlimited. Disaster recovery best practices also recommend that backups be stored offsite. While cloud storage is unlimited, there are still concerns about cost, as pricing is based on the size of the stored backup files.

In this blog, we will discuss backup archiving in the cloud and how to implement a proper backup policy and ultimately save costs.

What is Object Storage in the Cloud?

Object storage is a data storage architecture that stores data as objects. This differs from other storage systems that manage data as a file system, or block storage, which manages data as evenly sized blocks. There are several tiers of storage based on how often users access their data, which are...

  • Hot storage: data that needs to be accessible instantaneously.
  • Cool storage: data that is accessed infrequently.
  • Cold storage: archival data that is rarely accessed.

AWS has an object storage service called S3 (Simple Storage Service). It is a platform for storing object files in a highly scalable way; data is durable, access is relatively fast, and you can store and retrieve any kind of data, with storage classes available for data that requires infrequent access. Another offering from AWS is S3 Glacier, which provides cold storage and is ideal for storing older database backups.

GCP (Google Cloud Platform) also provides an object storage service called GCS (Google Cloud Storage). There are several storage classes based on how often the data is accessed: Standard (for highly frequent access), Nearline (for data accessed less than once a month), Coldline (for data accessed less than once a quarter), and Archive (for data accessed less than once a year).

Azure Blob Storage provides three different access tiers. Hot storage is always readily available and accessible, Cool storage is for infrequently accessed data, and Archive storage is used for rarely accessed data.

The colder the storage, the lower the cost.

Creating a Backup Archival Policy

ClusterControl supports uploading backups to the cloud and currently supports three cloud providers (AWS, Google Cloud Platform, and Azure). For more cloud provider options, we also have our Backup Ninja tool. 

ClusterControl also supports a backup retention policy in the cloud. This allows you to determine how long you want to keep the database backups stored in object storage. You can configure the retention policy in Backup Settings as shown below.

Once the retention period is reached, the backup stored in object storage is removed. This backup retention policy can be combined with archiving of the database backups stored in object storage at each cloud provider. 

AWS has lifecycle management for archiving database backups from S3 to Glacier. To enable the archiving policy, you need to add lifecycle rules under Management > Lifecycle for your S3 bucket. 

Fill in the rule name and add a prefix or tag filter, then click Next and choose the object creation transition and the number of days after creation. 

The expiration configuration is used to expire and delete the object N days after its creation.

The last step is to review your lifecycle rules and, if everything is correct, save them.

You now have lifecycle policy rules to archive your AWS S3 bucket to Glacier.
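The same kind of rule can also be defined from the AWS CLI; a sketch, assuming the bucket name, prefix, and day counts below are just examples:

$ cat lifecycle.json
{
  "Rules": [{
    "ID": "archive-db-backups",
    "Filter": { "Prefix": "backups/" },
    "Status": "Enabled",
    "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
    "Expiration": { "Days": 365 }
  }]
}
$ aws s3api put-bucket-lifecycle-configuration \
    --bucket my-backup-bucket \
    --lifecycle-configuration file://lifecycle.json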

Google Cloud Platform has “Object Lifecycle Management” for enabling lifecycle rules. Go to the bucket:

Choose the Lifecycle tab, and the lifecycle rules page will appear as shown below...

You can click “Add A Rule” on that page, and the configuration page for the action and object conditions will be displayed. There are four actions (as already mentioned): set the storage class to Nearline, Coldline, or Archive, or delete the object.

Choose the object conditions you want to configure based on your requirements. You can choose conditions based on Age, Created on or before, Storage class matches, Number of newer versions, or Live state.

The new rule will then be created in Object Lifecycle Management. This rule may take up to 24 hours to take effect.
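An equivalent rule can also be applied with gsutil; a sketch where the bucket name and the 30/365-day thresholds are examples:

$ cat gcs-lifecycle.json
{
  "rule": [
    { "action": { "type": "SetStorageClass", "storageClass": "COLDLINE" },
      "condition": { "age": 30 } },
    { "action": { "type": "Delete" }, "condition": { "age": 365 } }
  ]
}
$ gsutil lifecycle set gcs-lifecycle.json gs://my-backup-bucket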

Azure has features for managing the Azure Blob Storage lifecycle. Go to Storage Accounts and choose your storage account as shown below...

Then click Lifecycle Management, and you will be taken to the Lifecycle Management page.

Add a new rule to define your archiving rule in the Storage Account.

Fill in your rule name (it must be alphanumeric), enable the status, choose the action to take, and fill in the days after last modification. There are three options: move the blob data to cool storage, move the blob data to archive storage, or delete the blob data. After that, click Next: Filter Set.

In the Filter Set, you can define the path for your virtual folder prefix. Then click Next: Review + Add.

This page contains the information you defined previously, the Action Set and the Filter Set. You just need to click the Add button at the bottom, and the new rule will be added to your Lifecycle Management.
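The same policy can also be pushed with the Azure CLI; a sketch in which the account name, resource group, prefix, and day thresholds are placeholders:

$ cat azure-policy.json
{
  "rules": [{
    "name": "archive-db-backups",
    "enabled": true,
    "type": "Lifecycle",
    "definition": {
      "filters": { "blobTypes": ["blockBlob"], "prefixMatch": ["backups/"] },
      "actions": {
        "baseBlob": {
          "tierToCool":    { "daysAfterModificationGreaterThan": 30 },
          "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
          "delete":        { "daysAfterModificationGreaterThan": 365 }
        }
      }
    }
  }]
}
$ az storage account management-policy create \
    --account-name mystorageaccount \
    --resource-group my-resource-group \
    --policy @azure-policy.json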

The lifecycle management policy in your cloud will let you transition your database backups to a cooler storage tier, and delete them from object storage at the end of their life cycle.

Conclusion

Combining a retention policy with archiving rules in S3/object storage is essential for your backup strategy. It reduces your cloud storage costs while still allowing you to keep your historical backups.

 