So, I was recently asked to look at an EC2 instance that had been spitting out Nagios plugin errors for a few days for no apparent reason.
Basically, almost all NRPE checks would time out randomly. There was no load on the server and no disk I/O that could explain it. Several commands were also noticeably slow on the command line, most notably anything run with sudo, especially since the same commands mostly worked fine when run directly as root.
Initially, I checked dmesg for any file system (or disk?) issues. I found none of that, but I did find several traces of OOM kills. It turned out that the application running on that instance had eaten all the memory and crashed a few days earlier.
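For reference, this is roughly the kind of check I ran, just automated: a small Python sketch that shells out to dmesg and pulls out OOM-killer traces (depending on the kernel's dmesg_restrict setting you may need root to read the ring buffer):

```python
#!/usr/bin/env python3
"""Rough sketch: scan the kernel ring buffer for OOM-killer activity."""
import subprocess

def find_oom_events():
    # dmesg prints the kernel ring buffer; OOM kills leave lines
    # mentioning "oom-killer" or "Out of memory".
    output = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return [line for line in output.splitlines()
            if "oom-killer" in line.lower() or "out of memory" in line.lower()]

if __name__ == "__main__":
    for line in find_oom_events():
        print(line)
```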
I then checked the system logs for errors and found that all of them had their last entries around the time the Nagios problems started, which was also when the application on the instance crashed.
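That "when did this log last see a write?" question is easy to script too. A minimal sketch, assuming a glob of /var/log/*.log and an arbitrary 30-minute staleness threshold (on Debian/Ubuntu you would also want files like /var/log/syslog and /var/log/auth.log that don't end in .log):

```python
#!/usr/bin/env python3
"""Sketch: report log files that have not been written to recently."""
import glob
import os
import time

STALE_AFTER = 30 * 60  # assumption: quiet for 30 minutes is suspicious

def stale_logs(pattern="/var/log/*.log"):
    now = time.time()
    stale = []
    for path in glob.glob(pattern):
        age = now - os.stat(path).st_mtime  # seconds since last write
        if age > STALE_AFTER:
            stale.append((path, int(age // 60)))
    return stale

if __name__ == "__main__":
    for path, minutes in stale_logs():
        print(f"{path} last written {minutes} minutes ago")
```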
So, rsyslog was dead. There was a pid file and everything, but no process running. I went back to dmesg and found that it had been killed off by the OOM killer. By now I had a pretty good idea of what the problem was.
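Spotting a stale pid file like that is simple to automate. A minimal sketch, assuming the pid file lives at /var/run/rsyslogd.pid (the path varies by distribution):

```python
#!/usr/bin/env python3
"""Sketch: check whether the PID in a pid file belongs to a live process."""
import os

PIDFILE = "/var/run/rsyslogd.pid"  # assumption: path differs per distro

def pid_is_running(pidfile=PIDFILE):
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # no pid file, or unreadable contents
    # Signal 0 only checks that the process exists; nothing is delivered.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # stale pid file: no such process
    except PermissionError:
        return True    # process exists but belongs to another user
    return True

if __name__ == "__main__":
    print("running" if pid_is_running() else "stale pid file or not running")
```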
The theory is: most applications write to /dev/log (a UNIX socket) to send syslog messages to rsyslog. If rsyslog is dead, nothing reads from that socket, so its buffer fills up pretty quickly. Once that happens, any process trying to write to that socket has to block until buffer space is freed or the write times out.
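You can reproduce the effect without touching the real /dev/log. The sketch below builds a private UNIX datagram socket, never reads from it, and shows the sender stalling once the buffer fills; the socket path and the 2-second timeout are just stand-ins for what a real writer would experience:

```python
#!/usr/bin/env python3
"""Toy demo of why a dead syslog reader stalls its writers."""
import os
import socket
import tempfile

sock_path = os.path.join(tempfile.mkdtemp(), "fake-dev-log")

# "rsyslog" side: the socket exists on disk but nothing ever calls recv(),
# just like a daemon that died and left its socket behind.
receiver = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
receiver.bind(sock_path)

# client side: what an application (or PAM via syslog) does when it logs.
sender = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
sender.connect(sock_path)
sender.settimeout(2.0)  # stand-in for however long a real writer would wait

sent = 0
try:
    while True:
        sender.send(b"<13>demo: pretend syslog message")
        sent += 1
except socket.timeout:
    print(f"send() blocked after {sent} messages: the buffer is full "
          f"and nothing is draining it")
finally:
    sender.close()
    receiver.close()
    os.unlink(sock_path)
```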
sudo was particularly sensitive to this because PAM writes to auth.log every time sudo is used. With rsyslog dead, sudo had to wait until the log write attempt timed out.
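A cheap way to spot this condition is to time a syslog write from the same path the applications use: normally it returns more or less instantly, while a wedged /dev/log shows up as a multi-second stall. A hedged sketch (the 1-second threshold is an arbitrary assumption, and note that if the socket is completely wedged the probe itself can hang):

```python
#!/usr/bin/env python3
"""Sketch: time a single syslog write to detect a stalled /dev/log."""
import syslog
import time

start = time.monotonic()
syslog.openlog("latency-probe", syslog.LOG_PID, syslog.LOG_AUTH)
syslog.syslog(syslog.LOG_INFO, "probe message")  # goes through /dev/log
syslog.closelog()
elapsed = time.monotonic() - start

# Assumption: anything over a second means the logging path is in trouble.
if elapsed > 1.0:
    print(f"WARNING: syslog write took {elapsed:.1f}s")
else:
    print(f"OK: syslog write took {elapsed * 1000:.1f}ms")
```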
Simply restarting rsyslogd fixed the problem and everything went back to normal. Note to self: standard services such as rsyslogd should be monitored on any system under our management to avoid situations like this.
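Even something as simple as the following NRPE-style plugin would have caught this days earlier. It's only a sketch: it just verifies that a process named rsyslogd exists by walking /proc, and uses the usual Nagios exit codes (0 = OK, 2 = CRITICAL):

```python
#!/usr/bin/env python3
"""Sketch of a minimal Nagios/NRPE-style check that rsyslogd is running."""
import os
import sys

def process_running(name="rsyslogd"):
    # Walk /proc and compare each process's comm field against the name.
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    return True
        except OSError:
            continue  # process exited while we were looking
    return False

if __name__ == "__main__":
    if process_running():
        print("OK: rsyslogd is running")
        sys.exit(0)
    print("CRITICAL: rsyslogd is not running")
    sys.exit(2)
```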