How to use Nagios to monitor Puppet

One thing that is critical for any Puppet deployment is monitoring. Experience has shown that while the Puppetmaster server process is usually quite stable, the Puppet client daemons will, on occasion, die unexpectedly. This article gives an overview of how to set up monitoring of both the Puppetmaster server itself and all of the Puppet clients.

First, let's get the prerequisites out of the way. You need to do the following:

  • Install Nagios on a server, if you haven't already (sudo yum -y install nagios nagios-nrpe nagios-plugins).
  • Install NRPE on the Nagios server (sudo yum -y install nrpe).
  • Install NRPE and the Nagios plugins on the Puppetmaster server (sudo yum -y install nrpe nagios-plugins).

Note that we're not installing anything on the Puppet clients themselves; all of the monitoring takes place on the Puppetmaster server.

Configure NRPE to start on boot on both the Nagios and Puppetmaster servers (but don't start it yet):

    $ sudo chkconfig nrpe on

Next, edit the /etc/nagios/nrpe.cfg file on the Puppetmaster server. Make the following changes to it:

  • Add the IP address of your Nagios server to the allowed_hosts line.
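  • Set the dont_blame_nrpe line to 1, so that NRPE accepts the command argument (the Puppet client's hostname) sent by the Nagios server.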
  • Uncomment the command_prefix=/usr/bin/sudo line.
  • Add the following two lines to the bottom of the file:

    command[check_procs_puppetmasterd]=/usr/lib64/nagios/plugins/check_procs -w 1:1 -c 1:1 -C puppetmasterd
    command[check_puppet_client]=/usr/lib64/nagios/plugins/check_puppet_client $ARG1$

If you're not on 64-bit hardware, then your path will read /usr/lib/nagios... instead. Adjust the above as needed.
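
Before moving on, it can help to confirm that the edits actually landed in the file. The following is a minimal sketch, not part of NRPE: check_nrpe_cfg is a hypothetical helper name, and it only looks for the settings listed above (it does not check allowed_hosts, since the IP address is specific to your site).

```shell
# Report which of the nrpe.cfg settings described above are present.
# check_nrpe_cfg is a hypothetical helper, not something shipped with NRPE.
check_nrpe_cfg() {
    local cfg="${1:-/etc/nagios/nrpe.cfg}"
    local missing=0
    local pattern
    for pattern in \
        '^dont_blame_nrpe=1' \
        '^command_prefix=/usr/bin/sudo' \
        '^command\[check_procs_puppetmasterd\]=' \
        '^command\[check_puppet_client\]='
    do
        if grep -Eq "$pattern" "$cfg"; then
            echo "ok:      $pattern"
        else
            echo "missing: $pattern"
            missing=1
        fi
    done
    # Return non-zero if any expected setting was not found
    return $missing
}
```

Running check_nrpe_cfg /etc/nagios/nrpe.cfg on the Puppetmaster server prints one line per setting and returns non-zero if any of them is missing or still commented out.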

Now edit the /etc/sudoers file on the Puppetmaster server and add the following line to the bottom of the file:

    nrpe ALL=(ALL) NOPASSWD:/usr/lib64/nagios/plugins/

Again, the caveat about 64-bit hardware applies.

Now create the /usr/lib64/nagios/plugins/check_puppet_client file with the following contents:

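    #!/bin/bash
    # Nagios plugin: report how long ago a Puppet client last checked in.

    CLIENT=$1

    # Exit code 3 is the Nagios UNKNOWN state
    if [ -z "$CLIENT" ]; then
        echo "PUPPET CLIENT UNKNOWN - No client hostname given; check the plugin configuration"
        exit 3
    fi

    NOW=$(date "+%s")
    LOGFILE=/var/log/puppet/masterhttp.log
    # Timestamp of the client's most recent log entry, with the brackets stripped
    LASTRUN=$(grep -F -- "$CLIENT" "$LOGFILE" | tail -1 | awk '{ print $1 " " $2 }' | sed 's/\[//' | sed 's/\]//')

    # No log entry for this client: report it rather than computing a bogus age
    if [ -z "$LASTRUN" ]; then
        echo "PUPPET CLIENT CRITICAL - No checkin found for $CLIENT in $LOGFILE"
        exit 2
    fi

    LASTRUN=$(date "+%s" -d "$LASTRUN")
    TIMEDIFF=$((NOW - LASTRUN))

    if [ "$TIMEDIFF" -gt 3600 ]; then
        echo "PUPPET CLIENT CRITICAL - Last checkin was $TIMEDIFF seconds ago"
        exit 2
    elif [ "$TIMEDIFF" -gt 1800 ]; then
        echo "PUPPET CLIENT WARNING - Last checkin was $TIMEDIFF seconds ago"
        exit 1
    else
        echo "PUPPET CLIENT OK - Last checkin was $TIMEDIFF seconds ago"
        exit 0
    fi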

This Nagios plugin searches the Puppetmaster's log file for the most recent time the specified Puppet client checked in. If that was more than one hour ago, it raises a critical alert; if it was more than 30 minutes ago, it raises a warning.
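
To make the two steps concrete, here is a small sketch that isolates the timestamp extraction and the threshold decision. The function names extract_timestamp and classify_checkin_age are hypothetical, and the sample log line is an assumption inferred from the awk and sed pipeline in the plugin (a bracketed date and time at the start of the line); check it against your own masterhttp.log.

```shell
# Pull the leading "[date time]" timestamp out of a log line,
# using the same awk/sed steps as the plugin.
extract_timestamp() {
    printf '%s\n' "$1" | awk '{ print $1 " " $2 }' | sed 's/\[//' | sed 's/\]//'
}

# Map an age in seconds to a Nagios state name and exit code,
# using the plugin's thresholds (1800s warning, 3600s critical).
classify_checkin_age() {
    local age="$1"
    if [ "$age" -gt 3600 ]; then
        echo "CRITICAL"
        return 2
    elif [ "$age" -gt 1800 ]; then
        echo "WARNING"
        return 1
    else
        echo "OK"
        return 0
    fi
}

# Hypothetical log line; the real format may differ on your system.
extract_timestamp '[2010-03-15 10:11:12] web01.example.com checked in'
```

The thresholds are strict comparisons, so a client that checked in exactly 1800 seconds ago is still OK, and one at exactly 3600 seconds is a warning rather than critical.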

Edit your Nagios commands.cfg file and change the NRPE definition to read as follows:

    define command {
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $ARG1$ -c $ARG2$ -a $ARG3$
    }

Now, for each host that you are monitoring via Nagios and that is a Puppet client, add the following service definition:

    define service {
        use                             generic-service
        host_name                       PUPPET_CLIENT_HOSTNAME
        service_description             nrpe_puppet_client
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              4
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  pager
        notification_options            w,u,c,r
        notification_interval           10
        notification_period             24x7
        check_command                   check_nrpe!PUPPETMASTER_HOSTNAME!check_puppet_client!PUPPET_CLIENT_HOSTNAME
    }

Replace PUPPETMASTER_HOSTNAME with the hostname of your Puppetmaster server, and PUPPET_CLIENT_HOSTNAME with the hostname of your Puppet client.
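
The "!" characters in check_command separate the command name from its arguments, which Nagios substitutes for $ARG1$, $ARG2$ and $ARG3$ in the check_nrpe definition. The sketch below mimics that substitution so you can see the command line that results. expand_check_command is a hypothetical helper, the hostnames are placeholders, and the plugin directory stands in for $USER1$ (assumed here to be /usr/lib64/nagios/plugins).

```shell
# Show the check_nrpe command line that a check_command value expands to,
# given the command definition used in this article.
# expand_check_command is a hypothetical helper for illustration only.
expand_check_command() {
    local check_command="$1"
    local plugin_dir="${2:-/usr/lib64/nagios/plugins}"   # assumed value of $USER1$
    local host nrpe_command nrpe_arg
    # Field 1 is the command name (check_nrpe); fields 2-4 are $ARG1$..$ARG3$
    host=$(printf '%s\n' "$check_command" | cut -d'!' -f2)
    nrpe_command=$(printf '%s\n' "$check_command" | cut -d'!' -f3)
    nrpe_arg=$(printf '%s\n' "$check_command" | cut -d'!' -f4)
    echo "$plugin_dir/check_nrpe -H $host -c $nrpe_command -a $nrpe_arg"
}

expand_check_command 'check_nrpe!puppetmaster.example.com!check_puppet_client!web01.example.com'
```

Running the printed command by hand on the Nagios server is a quick way to test a single client before reloading Nagios.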

Finally, to monitor the Puppetmaster itself, add the following to the same file:

    define service {
        use                             generic-service
        host_name                       PUPPETMASTER_HOSTNAME
        service_description             nrpe_puppetmaster
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              4
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  pager
        notification_options            w,u,c,r
        notification_interval           10
        notification_period             24x7
        check_command                   check_nrpe!PUPPETMASTER_HOSTNAME!check_procs_puppetmasterd!1
    }

Note that the !1 on the end of the check_command is entirely extraneous, but it is required to satisfy the change we made to the check_nrpe command definition, which now always passes -a $ARG3$.

This last entry corresponds to the following entry in the Puppetmaster server's /etc/nagios/nrpe.cfg file:

    command[check_procs_puppetmasterd]=/usr/lib64/nagios/plugins/check_procs -w 1:1 -c 1:1 -C puppetmasterd
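
The -w 1:1 -c 1:1 thresholds tell check_procs to alert unless exactly one puppetmasterd process is running. As a rough illustration of that range rule (not of how check_procs itself is implemented), the sketch below applies a MIN:MAX range to a process count. procs_status is a hypothetical helper name.

```shell
# Apply a check_procs-style MIN:MAX range to a process count:
# anything outside the range is critical.
# procs_status is a hypothetical helper, not part of the Nagios plugins.
procs_status() {
    local count="$1"
    local min="$2"
    local max="$3"
    if [ "$count" -lt "$min" ] || [ "$count" -gt "$max" ]; then
        echo "PROCS CRITICAL: $count processes"
        return 2
    fi
    echo "PROCS OK: $count processes"
    return 0
}
```

With the range 1:1, a count of 0 (the Puppetmaster has died) and a count of 2 (a stray second process) are both reported as critical; only a count of exactly 1 is OK.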

On the Puppetmaster server, start NRPE by running:

    $ sudo service nrpe start

On the Nagios server, check your configuration files and then restart Nagios by running:

    $ sudo nagios -v /etc/nagios/nagios.cfg && sudo service nagios restart

That's it! Nagios will now alert if the Puppetmaster process dies, or if a given Puppet client doesn't check in at least every hour.