Monitoring is critical for any Puppet deployment. Experience has shown that while the Puppetmaster server process is usually quite stable, the Puppet client daemons will, on occasion, die unexpectedly. This article gives an overview of how to set up monitoring of both the Puppetmaster server itself and all of the Puppet clients.
First, let's get the prerequisites out of the way. You need to do the following:
- Install Nagios on a server, if you haven't already (sudo yum -y install nagios nagios-nrpe nagios-plugins).
- Install NRPE on the Nagios server (sudo yum -y install nrpe).
- Install NRPE on the Puppetmaster server (sudo yum -y install nrpe nagios-plugins).
Note that we're not installing anything on the Puppet clients themselves; all of the monitoring takes place on the Puppetmaster server.
Configure NRPE to start on boot on both the Nagios and Puppetmaster servers (but don't start it yet):
$ sudo chkconfig nrpe on
Next, edit the /etc/nagios/nrpe.cfg file on the Puppetmaster server. Make the following changes to it:
- Add the IP address of your Nagios server to the allowed_hosts line.
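- Set dont_blame_nrpe to 1 so that NRPE accepts command arguments (the check_puppet_client command takes the client hostname as $ARG1$).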
- Uncomment the command_prefix=/usr/bin/sudo line.
- Add the following two lines to the bottom of the file:
command[check_procs_puppetmasterd]=/usr/lib64/nagios/plugins/check_procs -w 1:1 -c 1:1 -C puppetmasterd
command[check_puppet_client]=/usr/lib64/nagios/plugins/check_puppet_client $ARG1$
If you're not on 64-bit hardware, then your path will read /usr/lib/nagios... instead. Adjust the above as needed.
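Before starting NRPE, it can help to confirm that these edits actually landed in the file. The following is a minimal sketch rather than part of the original setup: verify_nrpe_cfg is a hypothetical helper name, and it only greps for the settings listed above. It does not check allowed_hosts, since that value depends on your Nagios server's IP address.

```shell
# verify_nrpe_cfg FILE
# Hypothetical helper: prints each expected setting that is missing from FILE.
# Returns 0 when all of them are present, 1 otherwise.
verify_nrpe_cfg() {
    cfg=$1
    missing=0
    for pattern in \
        '^dont_blame_nrpe=1' \
        '^command_prefix=/usr/bin/sudo' \
        '^command\[check_procs_puppetmasterd]=' \
        '^command\[check_puppet_client]='
    do
        if ! grep -q "$pattern" "$cfg"; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    return $missing
}
```

Running verify_nrpe_cfg /etc/nagios/nrpe.cfg on the Puppetmaster server should print nothing if every change was made.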
Now edit the /etc/sudoers file on the Puppetmaster server and add the following line to the bottom of the file:
nrpe ALL=(ALL) NOPASSWD:/usr/lib64/nagios/plugins/
Again, the caveat about 64-bit hardware applies.
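Now create the /usr/lib64/nagios/plugins/check_puppet_client file with the following contents, and make it executable (sudo chmod 755 on that file):
#!/bin/bash
# Nagios plugin: report how long ago a Puppet client last checked in.
CLIENT=$1
if [ -z "$CLIENT" ]; then
    echo "Your check_puppet_client plugin configuration is broken"
    exit 1
fi
NOW=$(date "+%s")
LOGFILE=/var/log/puppet/masterhttp.log
# Timestamp of the client's most recent log entry, with the brackets stripped
LASTRUN=$(grep "$CLIENT" "$LOGFILE" | tail -1 | awk '{ print $1 " " $2 }' | sed 's/\[//' | sed 's/\]//')
# Without this guard, an empty timestamp makes date -d "" return midnight today
if [ -z "$LASTRUN" ]; then
    echo "PUPPET CLIENT UNKNOWN - No checkin found for $CLIENT in $LOGFILE"
    exit 3
fi
LASTRUN=$(date "+%s" -d "$LASTRUN")
TIMEDIFF=$((NOW - LASTRUN))
if [ "$TIMEDIFF" -gt 3600 ]; then
    echo "PUPPET CLIENT CRITICAL - Last checkin was $TIMEDIFF seconds ago"
    exit 2
elif [ "$TIMEDIFF" -gt 1800 ]; then
    echo "PUPPET CLIENT WARNING - Last checkin was $TIMEDIFF seconds ago"
    exit 1
else
    echo "PUPPET CLIENT OK - Last checkin was $TIMEDIFF seconds ago"
    exit 0
fi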
This Nagios plugin checks the Puppetmaster's log file for the last time the specified Puppet client checked in, and if the last time was over 1 hour ago, throws a critical alert. If the last checkin time was over 30 minutes ago, it throws a warning.
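To make the plugin's logic easier to follow, here is a minimal sketch of its two main steps as standalone functions. The function names and the sample log line are illustrative assumptions: the line only mimics the leading [date time] stamp that the plugin's awk and sed pipeline expects, so compare it against your own masterhttp.log before relying on it.

```shell
# Pull "YYYY-MM-DD HH:MM:SS" out of a log line that starts with "[date time]".
# Uses the same awk/sed pipeline as the plugin.
extract_timestamp() {
    echo "$1" | awk '{ print $1 " " $2 }' | sed 's/\[//' | sed 's/\]//'
}

# Map the age of the last checkin (in seconds) to a Nagios state name.
classify_age() {
    if [ "$1" -gt 3600 ]; then
        echo CRITICAL    # the plugin exits 2 here
    elif [ "$1" -gt 1800 ]; then
        echo WARNING     # the plugin exits 1 here
    else
        echo OK          # the plugin exits 0 here
    fi
}

# Hypothetical log line, for illustration only.
sample='[2009-06-01 12:34:56] client.example.com - - "POST /RPC2 HTTP/1.1" 200'
extract_timestamp "$sample"
classify_age 2400
```

Note that the thresholds are strict comparisons, so a checkin exactly 3600 seconds old is still only a warning.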
Edit your Nagios commands.cfg file and change the NRPE definition to read as follows:
define command {
command_name check_nrpe
command_line $USER1$/check_nrpe -H $ARG1$ -c $ARG2$ -a $ARG3$
}
Now for each host that you are monitoring via Nagios and that is a Puppet client, add the following service definition:
define service {
use generic-service
host_name PUPPET_CLIENT_HOSTNAME
service_description nrpe_puppet_client
is_volatile 0
check_period 24x7
max_check_attempts 4
normal_check_interval 5
retry_check_interval 1
contact_groups pager
notification_options w,u,c,r
notification_interval 10
notification_period 24x7
check_command check_nrpe!PUPPETMASTER_HOSTNAME!check_puppet_client!PUPPET_CLIENT_HOSTNAME
}
Replace PUPPETMASTER_HOSTNAME with the hostname of your Puppetmaster server, and PUPPET_CLIENT_HOSTNAME with the hostname of your Puppet client.
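It may help to see how Nagios turns that check_command into an actual command line. Nagios splits the value on "!" and assigns the pieces after the command name to $ARG1$, $ARG2$, and $ARG3$. The sketch below imitates that substitution in shell; expand_check_command is a hypothetical name, the hostnames are placeholders, and the plugin directory is an assumed value for $USER1$ (normally set in your Nagios resource file).

```shell
# Imitate how Nagios fills in the check_nrpe command_line from a check_command.
# Runs in a subshell so the IFS and globbing changes do not leak out.
expand_check_command() (
    user1=/usr/lib64/nagios/plugins   # assumed value of $USER1$
    IFS='!'
    set -f                            # no filename expansion while splitting
    set -- $1                         # $1=command name, $2..$4 = ARG1..ARG3
    echo "$user1/check_nrpe -H $2 -c $3 -a $4"
)

expand_check_command 'check_nrpe!puppet.example.com!check_puppet_client!web01.example.com'
```

The -H argument is the Puppetmaster, because that is where NRPE and the plugin run; the client hostname only travels along as the argument after -a.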
Finally, to monitor the Puppetmaster itself, add the following to the same file:
define service{
use generic-service
host_name PUPPETMASTER_HOST
service_description nrpe_puppetmaster
is_volatile 0
check_period 24x7
max_check_attempts 4
normal_check_interval 5
retry_check_interval 1
contact_groups pager
notification_options w,u,c,r
notification_interval 10
notification_period 24x7
check_command check_nrpe!PUPPETMASTER_HOST!check_procs_puppetmasterd!1
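}
Note that the !1 on the end of the check_command is entirely extraneous, but it is required to satisfy the changes we made to the check_nrpe command definition: that definition now always passes -a $ARG3$, so every service that uses it must supply a third argument.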
This last entry corresponds to the following entry in the Puppetmaster server's /etc/nagios/nrpe.cfg file:
command[check_procs_puppetmasterd]=/usr/lib64/nagios/plugins/check_procs -w 1:1 -c 1:1 -C puppetmasterd
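The 1:1 range passed to both -w and -c means that exactly one puppetmasterd process is expected. Because the warning and critical ranges are identical, any other count is reported as critical. The following is a minimal sketch of that range test only, not a reimplementation of check_procs; procs_state is a hypothetical name and it handles just the plain min:max form used here.

```shell
# procs_state COUNT RANGE
# RANGE is "min:max", as given to check_procs via -c.
procs_state() (
    count=$1
    min=${2%%:*}    # text before the colon
    max=${2##*:}    # text after the colon
    if [ "$count" -lt "$min" ] || [ "$count" -gt "$max" ]; then
        echo CRITICAL
    else
        echo OK
    fi
)

procs_state 1 1:1
procs_state 0 1:1
```

So the check fires both when the Puppetmaster has died (zero processes) and when more than one puppetmasterd process is running.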
On the Puppetmaster server, start NRPE by running:
$ sudo service nrpe start
On the Nagios server, check your configuration files and then restart Nagios by running:
$ sudo nagios -v /etc/nagios/nagios.cfg && sudo service nagios restart
That's it! Nagios will now alert if the Puppetmaster process dies, or if a given Puppet client doesn't check in at least every hour.
