Integrating ClusterControl with SNMP : Part Two

Ashraf Sharif

This blog post is a continuation of the previous part 1, where we have covered the basics of SNMP integration with ClusterControl.

In this blog post, we are going to focus on SNMP traps and alerting. SNMP traps are the most frequently used alert messages sent from a remote SNMP-enabled device (an agent) to a central collector, the “SNMP manager.” In the case of ClusterControl, a trap could be an alert after the critical alarm for a cluster is not 0, indicating something bad is happening. 

As shown in the previous blog post, for the purpose of this proof-of-concept, we have two SNMP trap notifications definition:

criticalAlarmNotification NOTIFICATION-TYPE
    OBJECTS { totalCritical, clusterId }
    STATUS current
    DESCRIPTION
        "Notification if critical alarm is not 0"
    ::= { alarmNotification 1 }

criticalAlarmNotificationEnded NOTIFICATION-TYPE
    OBJECTS { totalCritical, clusterId }
    STATUS  current
    DESCRIPTION
        "Notification ended - Critical alarm is 0"
    ::= { alarmNotification 2 }

The notifications (or traps) are criticalAlarmNotification and criticalAlarmNotificationEnded. Both notification events can be used to signal our Nagios service, whether the cluster is actively having critical alarms or not. In Nagios, the term for this is passive check, whereby Nagios does not attempt to determine whether or host/service is DOWN or UNREACHABLE. We will also configure the active checks, where checks are initiated by the check logic in the Nagios daemon by using the service definition to also monitor the critical/warning alarms reported by our cluster.

Take note that this blog post requires the Severalnines MIB and SNMP agent configured correctly as shown in the first part of this blog series.

Installing Nagios Core

Nagios Core is the free version of the Nagios monitoring suite. First and foremost, we have to install it and all necessary packages, followed by the Nagios plugins, snmptrapd and snmptt. Note that instructions in this blog post are assuming all nodes are running on CentOS 7.

Install the necessary packages to run Nagios:

$ yum -y install httpd php gcc glibc glibc-common wget perl gd gd-devel unzip zip sendmail net-snmp-utils net-snmp-perl

Create a nagios user and nagcmd group for allowing the external commands to be executed through the web interface, add the nagios and apache user to be a part of the nagcmd group:

$ useradd nagios
$ groupadd nagcmd
$ usermod -a -G nagcmd nagios
$ usermod -a -G nagcmd apache

Download the latest version of Nagios Core from here, compile and install it:

$ cd ~
$ wget https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.4.6.tar.gz
$ tar -zxvf nagios-4.4.6.tar.gz
$ cd nagios-4.4.6
$ ./configure --with-nagios-group=nagios --with-command-group=nagcmd
$ make all
$ make install
$ make install-init
$ make install-config
$ make install-commandmode

Install the Nagios web configuration:

$ make install-webconf

Optionally, install the Nagios exfoliation theme (or you could stick to the default theme):

$ make install-exfoliation

Create a user account (nagiosadmin) for logging into the Nagios web interface. Remember the password that you assign to this user:

$ htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Restart Apache webserver to make the new settings take effect:

$ systemctl restart httpd
$ systemctl enable httpd

Download the Nagios Plugins from here, compile and install it:

$ cd ~ 
$ wget https://nagios-plugins.org/download/nagios-plugins-2.3.3.tar.gz
$ tar -zxvf nagios-plugins-2.3.3.tar.gz
$ cd nagios-plugins-2.3.3
$ ./configure --with-nagios-user=nagios --with-nagios-group=nagios
$ make
$ make install

Verify the default Nagios configuration files:

$ /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Nagios Core 4.4.6
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2020-04-28
License: GPL
Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...
Running pre-flight check on configuration data...
Checking objects...
Checked 8 services.
Checked 1 hosts.
Checked 1 host groups.
Checked 0 service groups.
Checked 1 contacts.
Checked 1 contact groups.
Checked 24 commands.
Checked 5 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 1 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 5 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors:   0
Things look okay - No serious problems were detected during the pre-flight check
If everything looks okay, start Nagios and configure it to start on boot:
$ systemctl start nagios
$ systemctl enable nagios

Open the browser and go to http://{IPaddress}/nagios and you should see an HTTP basic authentication pops up where you need to specify the username as nagiosadmin with your chosen password created earlier.

Adding ClusterControl server into Nagios

Create a Nagios host definition file for ClusterControl:

$ vim /usr/local/nagios/etc/objects/clustercontrol.cfg

And add the following lines:

define host {
        use                     linux-server
        host_name               clustercontrol.local
        alias                   clustercontrol.mydomain.org
        address                 192.168.10.50
}

define service {
        use                     generic-service
        host_name               clustercontrol.local
        service_description     Critical alarms - ClusterID 23
        check_command           check_snmp! -H 192.168.10.50 -P 2c -C private -o .1.3.6.1.4.1.57397.1.1.1.2 -c0
}

define service {
        use                     generic-service
        host_name               clustercontrol.local
        service_description     Warning alarms - ClusterID 23
        check_command           check_snmp! -H 192.168.10.50 -P 2c -C private -o .1.3.6.1.4.1.57397.1.1.1.3 -w0
}


define service {
        use                     snmp_trap_template
        host_name               clustercontrol.local
        service_description     Critical alarm traps
        check_interval          60 ; Don't clear for 1 hour
}

Some explanations:

  • In the first section, we define our host, with the hostname and address of the ClusterControl server.

  • The service sections where we put our service definitions to be monitored by the Nagios. The first two is basically telling the service to check the SNMP output for a particular object ID. The first service is about the critical alarm, therefore we add -c0 in the check_snmp command to indicate that it should be a critical alert in Nagios if the value goes beyond 0. While for the warning alarms, we will indicate it with a warning if the value is 1 and higher.

  • The last service definition is about the SNMP traps that we would expect coming from the ClusterControl server if the critical alarm raised is higher than 0. This section will use the snmp_trap_template definition, as shown in the next step.

Configure the snmp_trap_template by adding the following lines into /usr/local/nagios/etc/objects/templates.cfg:

define service {
        name                            snmp_trap_template
        service_description             SNMP Trap Template
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized
        process_perf_data               0
        obsess_over_service             0       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        check_command                   check-host-alive      ; This will be used to reset the service to "OK"
        is_volatile                     1
        check_period                    24x7
        max_check_attempts              1
        normal_check_interval           1
        retry_check_interval            1
        notification_interval           60
        notification_period             24x7
        notification_options            w,u,c,r
        contact_groups                  admins       ; Modify this to match your Nagios contactgroup definitions
        register                        0
}

Include the ClusterControl configuration file into Nagios, by adding the following line inside  

/usr/local/nagios/etc/nagios.cfg:
cfg_file=/usr/local/nagios/etc/objects/clustercontrol.cfg

Run a pre-flight configuration check:

$ /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Make sure you get the following line at the end of the output:

"Things look okay - No serious problems were detected during the pre-flight check"

Restart Nagios to load the change:

$ systemctl restart nagios

Now if we look at the Nagios page under the Service section (left-side menu), we would see something like this:

Notice the "Critical alarms - ClusterID 1" row turns red if the critical alarm value reported by ClusterControl is bigger than 0, while the "Warning alarms - ClusterID 1" is yellow, indicating that there is a warning alarm raised. In case if nothing interesting happens, you would see everything is green for clustercontrol.local.

Configuring Nagios to receive a trap

Traps are sent by remote devices to the Nagios server, this is called a passive check. Ideally, we don't know when a trap will be sent as it depends on the sending device decides it will send a trap. For example with a UPS (battery backup), as soon as the device loses power, it will send a trap to say "hey, I lost power". This way Nagios is informed immediately.

In order to receive SNMP traps, we need to configure the Nagios server with the following things:

  • snmptrapd (SNMP trap receiver daemon)

  • snmptt (SNMP Trap Translator, the trap handler daemon)

After the snmptrapd receives a trap, it will pass it to snmptt where we will configure it to update the Nagios system and then Nagios will send out the alert according to the contact group configuration.

Install EPEL repository, followed by the necessary packages:

$ yum -y install epel-release
$ yum -y install net-snmp snmptt net-snmp-perl perl-Sys-Syslog

Configure the SNMP trap daemon at /etc/snmp/snmptrapd.conf and set the following lines:

disableAuthorization yes
traphandle default /usr/sbin/snmptthandler

The above simply means traps received by the snmptrapd daemon will be passed over to /usr/sbin/snmptthandler.

Add the SEVERALNINES-CLUSTERCONTROL-MIB.txt into /usr/share/snmp/mibs by creating /usr/share/snmp/mibs/SEVERALNINES-CLUSTERCONTROL-MIB.txt:

$ ll /usr/share/snmp/mibs/SEVERALNINES-CLUSTERCONTROL-MIB.txt
-rw-r--r-- 1 root root 4029 May 30 20:08 /usr/share/snmp/mibs/SEVERALNINES-CLUSTERCONTROL-MIB.txt

Create /etc/snmp/snmp.conf (notice without the "d") and add our custom MIB there:

mibs +SEVERALNINES-CLUSTERCONTROL-MIB

Start the snmptrapd service:

$ systemctl start snmptrapd
$ systemctl enable snmptrapd

Next, we need to configure the following configuration lines inside /etc/snmp/snmptt.ini:

net_snmp_perl_enable = 1
snmptt_conf_files = <<END
/etc/snmp/snmptt.conf
/etc/snmp/snmptt-cc.conf
END

Note that we enabled the net_snmp_perl module and have added another configuration path, /etc/snmp/snmptt-cc.conf inside snmptt.ini. We need to define ClusterControl snmptt events here so they can be passed to Nagios. Create a new file at /etc/snmp/snmptt-cc.conf and add the following lines:

MIB: SEVERALNINES-CLUSTERCONTROL-MIB (file:/usr/share/snmp/mibs/SEVERALNINES-CLUSTERCONTROL-MIB.txt) converted on Sun May 30 19:17:33 2021 using snmpttconvertmib v1.4.2

EVENT criticalAlarmNotification .1.3.6.1.4.1.57397.1.1.3.1 "Status Events" Critical
FORMAT Notification if the critical alarm is not 0
EXEC /usr/local/nagios/share/eventhandlers/submit_check_result $aA "Critical alarm traps" 2 "Critical - Critical alarm is $1 for cluster ID $2"
SDESC
Notification if critical alarm is not 0
Variables:
  1: totalCritical
  2: clusterId
EDESC

EVENT criticalAlarmNotificationEnded .1.3.6.1.4.1.57397.1.1.3.2 "Status Events" Normal
FORMAT Notification if the critical alarm is not 0
EXEC /usr/local/nagios/share/eventhandlers/submit_check_result $aA "Critical alarm traps" 0 "Normal - Critical alarm is $1 for cluster ID $2"
SDESC
Notification ended - critical alarm is 0
Variables:
  1: totalCritical
  2: clusterId
EDESC

Some explanations:

  • We have two traps defined - criticalAlarmNotification and criticalAlarmNotificationEnded.

  • The criticalAlarmNotification simply raises a critical alert and passes it to the "Critical alarm traps" service defined in Nagios. The $aA means to return the trap agent IP address. The value 2 is the check result value which in this case is critical (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).

  • The criticalAlarmNotificationEnded simply raises an OK alert and passes it to the "Critical alarm traps" service, to cancel the previous trap after everything backs to normal. The $aA means to return the trap agent IP address. The value 0  is the check result value which in this case is OK. For more details on string substitutions recognized by snmptt, check out this article under the "FORMAT" section.

  • You may use snmpttconvertmib to generate snmptt event handler file for a particular MIB.

Note that by default, the eventhandlers path is not provided by the Nagios Core. Therefore, we have to copy that eventhandlers directory from the Nagios source under the contrib directory, as shown below:

$ cp -Rf nagios-4.4.6/contrib/eventhandlers /usr/local/nagios/share/
$ chown -Rf nagios:nagios /usr/local/nagios/share/eventhandlers

We also need to assign snmptt group as part of the nagcmd group, so it can execute the nagios.cmd inside the submit_check_result script:

$ usermod -a -G nagcmd snmptt

Start the snmptt service:

$ systemctl start snmptt
$ systemctl enable snmptt

The SNMP Manager (Nagios server) is now ready to accept and process our incoming SNMP traps.

Sending a trap from the ClusterControl server

Suppose one wants to send an SNMP trap to the SNMP manager, 192.168.10.11 (Nagios server) because the total number of critical alarms has reached 2 for cluster ID 1, one would run the following command on the ClusterControl server (client-side), 192.168.10.50:

$ snmptrap -v2c -c private 192.168.10.11 '' SEVERALNINES-CLUSTERCONTROL-MIB::criticalAlarmNotification \
SEVERALNINES-CLUSTERCONTROL-MIB::totalCritical i 2 \
SEVERALNINES-CLUSTERCONTROL-MIB::clusterId i 1

Or, in OID format (recommended):

$ snmptrap -v2c -c private 192.168.10.11 '' .1.3.6.1.4.1.57397.1.1.3.1 \
.1.3.6.1.4.1.57397.1.1.1.2 i 2 \
.1.3.6.1.4.1.57397.1.1.1.4 i 1

Where, .1.3.6.1.4.1.57397.1.1.3.1 is equal to criticalAlarmNotification trap event, and the subsequent OIDs are representations of the total number of the current critical alarms and the cluster ID, respectively.

On the Nagios server, you should notice the trap service has turned red:

You can also see it in the /var/log/messages of the following line:

May 30 23:52:39 ip-10-15-2-148 snmptrapd[27080]: 2021-05-30 23:52:39 UDP: [192.168.10.50]:33151->[192.168.10.11]:162 [UDP: [192.168.10.50]:33151->[192.168.10.11]:162]:#012DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (2423020) 6:43:50.20#011SNMPv2-MIB::snmpTrapOID.0 = OID: SEVERALNINES-CLUSTERCONTROL-MIB::criticalAlarmNotification#011SEVERALNINES-CLUSTERCONTROL-MIB::totalCritical = INTEGER: 2#011SEVERALNINES-CLUSTERCONTROL-MIB::clusterId = INTEGER: 1
May 30 23:52:42 nagios.local snmptt[29557]: .1.3.6.1.4.1.57397.1.1.3.1 Critical "Status Events" UDP192.168.10.5033151-192.168.10.11162 - Notification if critical alarm is not 0
May 30 23:52:42 nagios.local nagios: EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;192.168.10.50;Critical alarm traps;2;Critical - Critical alarm is 2 for cluster ID 1
May 30 23:52:42 nagios.local nagios: PASSIVE SERVICE CHECK: clustercontrol.local;Critical alarm traps;0;PING OK - Packet loss = 0%, RTA = 22.16 ms
May 30 23:52:42 nagios.local nagios: SERVICE NOTIFICATION: nagiosadmin;clustercontrol.local;Critical alarm traps;CRITICAL;notify-service-by-email;Critical - Critical alarm is 2 for cluster ID 1
May 30 23:52:42 nagios.local nagios: SERVICE ALERT: clustercontrol.local;Critical alarm traps;CRITICAL;HARD;1;Critical - Critical alarm is 2 for cluster ID 1

Once the alarm has resolved, to send a normal trap, we may execute the following command:

$ snmptrap -c private -v2c 192.168.10.11 '' .1.3.6.1.4.1.57397.1.1.3.2 \ 
.1.3.6.1.4.1.57397.1.1.1.2 i 0 \
.1.3.6.1.4.1.57397.1.1.1.4 i 1

Where, .1.3.6.1.4.1.57397.1.1.3.2 is equal to criticalAlarmNotificationEnded event, and the subsequent OIDs are representations of the total number of the current critical alarms (should be 0 for this case) and the cluster ID, respectively.

On the Nagios server, you should notice the trap service is back to green:

The above can be automated with a simple bash script:

#!/bin/bash
# alarmtrapper.bash - SNMP trapper for ClusterControl alarms

CLUSTER_ID=1
SNMP_MANAGER=192.168.10.11
INTERVAL=10

send_critical_snmp_trap() {
        # send critical trap
        local val=$1
        snmptrap -v2c -c private ${SNMP_MANAGER} '' .1.3.6.1.4.1.57397.1.1.3.1 .1.3.6.1.4.1.57397.1.1.1.1 i ${val} .1.3.6.1.4.1.57397.1.1.1.4 i ${CLUSTER_ID}
}

send_zero_critical_snmp_trap() {
        # send OK trap
        snmptrap -v2c -c private ${SNMP_MANAGER} '' .1.3.6.1.4.1.57397.1.1.3.2 .1.3.6.1.4.1.57397.1.1.1.1 i 0 .1.3.6.1.4.1.57397.1.1.1.4 i ${CLUSTER_ID}
}

while true; do
        count=$(s9s alarm --list --long --cluster-id=${CLUSTER_ID} --batch | grep CRITICAL | wc -l)
        [ $count -ne 0 ] && send_critical_snmp_trap $count || send_zero_critical_snmp_trap
        sleep $INTERVAL
done

To run the script in the background, simply do:

$ bash alarmtrapper.bash &

At this point, we should be able to see Nagios's "Critical alarm traps" service in action if there is a failure in our cluster automatically.

Final Thoughts

In this blog series, we have shown a proof-of-concept on how ClusterControl can be configured for monitoring, generating/processing traps and alerting using SNMP protocol. This also marks the beginning of our journey to incorporate SNMP in our future releases. Stay tuned as we will bring more updates on this exciting feature.

ClusterControl
The only management system you’ll ever need to take control of your open source database infrastructure.