The SYSSTAT package offers a bunch of utilities that are great for troubleshooting a slow or intermittently unresponsive server. The individual commands also seem to come up somewhat frequently during technical interviews, although if I were ever asked for specific parameters to a given command I'd probably respond with "read the man page." Nevertheless, I think a few examples are well worth seeing.
First, make sure the SYSSTAT package is installed on your server:
rpm -qa | grep sysstat
Next, make sure that SYSSTAT is collecting statistics:
sudo cat /etc/cron.d/sysstat # run system activity accounting tool every 10 minutes */10 * * * * root /usr/lib64/sa/sa1 1 1 # generate a daily summary of process accounting at 23:53 53 23 * * * root /usr/lib64/sa/sa2 -A
sa1 is a shell script that wraps around the sadc program, and in the above example is collecting information about the system every ten minutes. The parameters of 1 and 1 tell it to collect information for one second one time (the order is interval and count.) The data collected by sadc is stored in binary format in /var/log/sa/sadd where dd is the current day.
sa2 does the same thing as sa1, except it instructs sar to write a daily report to the /var/log/sa/sardd file. Unlike sa1, the output file is in a human-readable format.
That's the backend for SYSSTAT. There's a few other utilities that I'll mention here before moving on to the meat of the discussion, the commands that actually allow you to examine the information collected by SYSSTAT.
The sadf command lets you export the data collected by sar in a variety of formats, like CSV or XML. See http://pagesperso-orange.fr/sebastien.godard/man_sadf.html for more information.
Let's say an Apache web server currently seems sluggish and unresponsive. Why?
We'll begin by examining the OS itself, and then move to examining the application, namely, Apache.
First, what is the CPU currently doing?
iostat -c avg-cpu: %user %nice %system %iowait %steal %idle 3.34 0.00 1.34 0.01 0.00 95.31
This says that 3.34% of the CPU for that second in time was being spent on a user application (not to venture too far out into the weeds here, but the kernel classifies everything into two kinds of processes: system processes and user processes. Anything you run is a user process. Even init(8) is a user process. The kernel swapper, on the other hand, is a system process.)
1.34% of that second was spent by the kernel executing in system level (the kernel has two states: system and user. It flip-flops back and forth, known as context switching, depending on what it is doing.)
It spent 0.01% sitting idle, doing nothing, while it waited for a disk I/O request to complete. Conversely, this means it spent 95.31% of that second sitting idle waiting on nothing.
If you're running a multiprocessor system, use mpstat:
mpstat 07:50:51 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 07:50:51 PM all 3.34 0.00 1.24 0.01 0.02 0.08 0.00 95.31 1276.16
What's happening with the disk?
iostat -dmx Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sda 0.08 13.82 0.30 11.06 0.02 0.10 21.77 0.01 0.86 0.15 0.17
The interesting fields are rMB/s (megabytes read per second), wMB/s (megabytes written per second), await (average time, in milliseconds, that the disk took to service a request), and %util (percentage of CPU time spent waiting for the disk to service a request.) %util strongly correlates to the %iowait field in the output from iostat -c. (When %util approaches 100 your server will burst into flames.)
If you suspect a specific process is causing the slowness, for example Apache, you can look at its memory and swap information:
pidstat -r -p ALL -C httpd 02:34:03 PM PID minflt/s majflt/s VSZ RSS %MEM Command 02:34:03 PM 9744 0.05 0.00 287784 24580 1.19 httpd
Or its CPU usage:
pidstat -u -p ALL -C httpd 02:43:25 PM PID %usr %system %guest %CPU CPU Command 02:43:25 PM 9744 0.00 0.00 0.00 0.00 1 httpd
The nice thing with all of the above commands is that you can pass two additional parameters to them: the number of seconds to pause, and the number of times to repeat the command. For example, to see the CPU usage of Apache eight times while pausing for 10 seconds between looking, you could run:
pidstat -u -p ALL -C httpd 10 8
Or to watch the CPU as a whole every second for the next minute, you could run:
iostat -c 1 60
This is all fine and well if you happen to be logged in while the server is having a problem; but as every person who has ever been on call knows, shit usually goes down between the hours of two and four AM. The good news is that you can "retroactively" troubleshoot by using the sar command.
What did the CPU's run queue look like?
sar -q 12:00:01 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 12:10:01 AM 6 361 0.01 0.04 0.01
The larger the number in the runq-sz column, the more processes are waiting in the run queue. That in and of itself isn't a problem, for the real question is "why are they waiting?" Perhaps because of...
Were we hitting the swap file heavily?
sar -B 12:00:01 AM pgpgin/s pgpgout/s fault/s majflt/s 12:10:01 AM 0.00 26.11 995.36 0.02
These are strongly correlated to disk I/O since the swap area is on a disk. pgpgins/s is the number of kilobytes swapped into memory from disk per second. pgpgout/s is the same, only in a reverse direction. fault/s is the total number of both minor and major page faults per second, and majflt/s is just major faults per second:
- A minor page fault occurs when a memory page exists in RAM but an attempt to access it fails for some reason, such as the process lacking the correct permissions to read the page. A good example of this is a first write to a second reference to a shared page. The kernel must duplicate the page before allowing the second reference to write to it; a minor page fault occurs because the write is blocked while the kernel performs the memory duplication. It is very fast because no disk I/O is involved.
- A major page fault occurs when a memory page does not exist in RAM, and must be swapped in from the disk. This involves a high degree of latency, particularly if a page that is in memory must be swapped out to make room for the demanded page. Since all this takes time, the program that is requesting the memory page is left in sleep state while the kernel completes the swap.
Was the disk getting slammed or otherwise responding slowly?
sar -b 12:00:01 AM tps rtps wtps bread/s bwrtn/s 12:10:01 AM 9.65 0.00 9.65 0.00 104.46
This reports disk I/O and transfer rate statistics. tps is I/O requests per second to a physical device. I'm sure you can derive the meaning of rtps and wtps by now. Also, the b in bread/s and bwrtn/s stands for blocks.
Okay, so let's say the disk looks fine, we're not thrashing the swap file, and the CPU run queue looks good.
Was the network connection experiencing faults?
sar -n EDEV 12:00:01 AM IFACE rxerr/s txerr/s coll/s rxdrop/s txdrop/s txcarr/s rxfram/s rxfifo/s txfifo/s 12:10:01 AM eth0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
That's how it should look. There are no line errors or excessive dropped packets.
Was the network connection suffering from something else, like a routing problem?
sar -n DEV 12:00:01 AM IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s 12:10:01 AM eth0 70.76 103.18 13474.33 102870.32 0.00 0.00 1.04
This is basically just a sanity check. If the values look all out of whack, then we need to dig deeper into the network layer, which is beyond the scope of this discussion.
Had the server run out of network resources?
sar -n SOCK 12:00:01 AM totsck tcpsck udpsck rawsck ip-frag 12:10:01 AM 217 85 16 0 0
Remember, your system only has 65,536 possible sockets, so if the totsck count is extremely high, it could be indicative of a problem (like, say, a DoS attack.) We should also check on file handles and inodes:
sar -v 12:00:01 AM dentunusd file-sz inode-sz super-sz %super-sz dquot-sz %dquot-sz rtsig-sz %rtsig-sz 12:10:01 AM 751551 1020 714796 0 0.00 0 0.00 0 0.00
file-sz is the number of open file handles in use by the system, and inode-sz is the same, but for inodes. If these counts approach or reach the limits set by sysctl then you've got a real problem.
If you're running as an NFS client or server, then this will probably be useful:
sar -n NFS 12:00:01 AM call/s retrans/s read/s write/s access/s getatt/s 12:10:01 AM 0.00 0.00 0.00 0.00 0.00 0.00
