The Mysteriously Slow sudo

So, I was recently asked to check on an EC2 instance that had been spitting out Nagios plugin errors for a few days for no apparent reason.

Basically, almost all NRPE checks would time out at random. There was no load on the server and no disk I/O that could explain it. Several commands were also pretty slow on the command line, most notably those run with sudo, which was odd because the same commands mostly worked fine when run as root.

Initially I checked dmesg for any file system (or disk?) issues and found none. However, I did find several traces of OOM kills. It turned out that the application running on that instance had eaten all the memory and crashed a few days earlier.

I then checked the system logs for errors and found that all logs had their last entries around the time the Nagios problem started, which was also when the application on the instance crashed.

So, rsyslog was dead. There was a pid file and everything, but no process running. I went back to dmesg and found that it had been killed by the OOM killer. By now I had a pretty good idea of what the problem was.

The theory is this: most applications write to /dev/log (a UNIX socket) to send syslog messages to rsyslog. If rsyslog is dead, no one reads from that socket, so its buffer fills up pretty quickly. Once that happens, any process trying to write to that socket has to block until buffer space is freed or the write times out.
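To make the theory a bit more concrete, here is a minimal sketch of the same effect. It does not touch /dev/log at all: it assumes a UNIX datagram socket pair is a fair stand-in, with one end playing the application and the other end playing a log daemon that never reads. The function name and its parameters are made up for illustration, and it is written for Python 3.

```python
import socket


def count_sends_until_blocked(message=b"x" * 512, timeout=0.2, max_messages=100000):
    """Send datagrams to a peer that never reads; return how many were accepted."""
    # writer plays the application, silent_reader plays the dead log daemon.
    writer, silent_reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    writer.settimeout(timeout)  # give up after a while instead of blocking forever
    sent = 0
    try:
        while sent < max_messages:
            try:
                writer.send(message)
            except OSError:
                # Timed out (or the OS reported no buffer space): the queue is full.
                break
            sent += 1
    finally:
        writer.close()
        silent_reader.close()
    return sent
```

Calling count_sends_until_blocked() returns after a finite number of messages, and the last attempt is the one that stalls for the full timeout. That stall is the kind of delay every sudo call would run into if the theory holds.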

sudo was particularly sensitive to this because PAM writes to auth.log any time sudo is used. With rsyslog dead, sudo had to wait until the log write attempt timed out.

Simply restarting rsyslogd fixed the problem and everything was back to normal. I made a note to self that standard services such as rsyslogd should be monitored on every system under our management to avoid situations like this.
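As a rough idea of what such monitoring could look like, here is a small sketch of a process check in the style of a Nagios plugin. It is not an existing plugin: the function names are hypothetical, it assumes a Linux /proc file system with a comm file per process, and it only follows the usual plugin convention of exit code 0 for OK and 2 for CRITICAL.

```python
import os


def is_process_running(name, proc_root="/proc"):
    """Return True if any process under proc_root has the given command name."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    return True
        except OSError:
            continue  # the process exited while we were looking
    return False


def check(name="rsyslogd", proc_root="/proc"):
    """Return an (exit_code, message) pair following the Nagios convention."""
    if is_process_running(name, proc_root):
        return 0, "OK - %s is running" % name
    return 2, "CRITICAL - %s is not running" % name
```

A wrapper script would print the message and exit with the returned code. Looking at the process table rather than the pid file matters here, since in this incident the pid file was still around while the process was gone.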

Address reuse in Python’s SocketServer

Hello, World! This is my first technical post. I hope it’s useful to someone out there!

I am working on a very small tool that I need for a proof of concept. It’s basically a small TCP server in Python.
After creating a small skeleton using SocketServer, I found that the server itself works fine.

However, if I stop and start the server again to test a modification, I sometimes get a “socket.error: [Errno 98] Address already in use” error. This happens only if a client has already connected to the server.

Checking with netstat and ps, I found that although the process itself is no longer running, a connection on the port still shows up in the “TIME_WAIT” state. Basically, the OS keeps the closed connection around for a while to make sure no packets belonging to it are still on the way.
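If netstat is not at hand, the same information can be read from /proc/net/tcp on Linux, where the state column uses the hex code 06 for TIME_WAIT. The sketch below is only an illustration of that idea: the function name is made up, it assumes the usual Linux layout of that file (IPv4 only), and it is written for Python 3.

```python
TIME_WAIT = "06"  # state code used in /proc/net/tcp on Linux


def time_wait_ports(proc_net_tcp_text):
    """Return the local ports of all IPv4 TCP sockets in TIME_WAIT."""
    ports = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        if state == TIME_WAIT:
            # local_address looks like "0100007F:2710" (hex address, hex port)
            ports.append(int(local_address.split(":")[1], 16))
    return ports
```

Passing it the contents of /proc/net/tcp right after stopping the server should list the server's port for as long as the old connection lingers.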

My good friend mux mentioned that I should probably set the socket option “SO_REUSEADDR” to avoid this issue.

The socket(7) man page says this about it:

SO_REUSEADDR

Indicates that the rules used in validating addresses supplied in a bind(2) call should allow reuse of local addresses.  For AF_INET sockets this means that a socket may bind, except when there is an active listening socket bound to the address.  When the listening socket is bound to INADDR_ANY with a specific port then it is not possible to bind to this port for any local address.  Argument is an integer boolean flag.

When using the plain socket module, you can simply set this option using:

import socket

s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
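To see the option at work in a stop-and-start scenario, here is a slightly longer sketch. It is my own illustration rather than anything from the man page: the helper names are hypothetical, it uses Python 3, and it lets the OS pick a free port instead of hard-coding one. The first listener serves one client and closes its side first, which is what leaves the server's end of the connection in TIME_WAIT; the second listener then binds to the same port straight away.

```python
import socket


def bind_listener(host, port, reuse=True):
    """Create a listening TCP socket, optionally with SO_REUSEADDR set."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuse:
        # The option has to be set before bind() to have any effect.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(1)
    return s


def restart_after_client(host="127.0.0.1"):
    """Serve one client, shut down, then rebind the same port right away."""
    first = bind_listener(host, 0)  # port 0: let the OS pick a free port
    port = first.getsockname()[1]

    client = socket.create_connection((host, port), timeout=5)
    conn, _ = first.accept()
    conn.close()      # server closes first, so its side ends up in TIME_WAIT
    client.recv(1)    # returns b"" once the server's close has arrived
    client.close()
    first.close()

    second = bind_listener(host, port)  # the "restarted" server
    rebound_port = second.getsockname()[1]
    second.close()
    return port, rebound_port
```

Both listeners set the option here, which matches the real situation where the same program is simply started again.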

 

However, you don’t need to do this in SocketServer. SocketServer.TCPServer lets you enable it by setting a simple class variable, allow_reuse_address. Here is a working example:

#! /usr/bin/env python
import SocketServer

class MyServer(SocketServer.StreamRequestHandler):

    def handle(self):
        self.wfile.write("Hi! I am echo!\n")
        while True:
            # An empty line (or a closed connection) ends the session.
            data = self.rfile.readline().strip()
            if not data:
                break
            self.wfile.write(">>%s\n" % data)

if __name__ == "__main__":
    host, port = "localhost", 10000
    # Setting allow_reuse_address to True.
    SocketServer.TCPServer.allow_reuse_address = True
    server = SocketServer.TCPServer((host, port), MyServer)
    server.serve_forever()

 

This is just a simple example of how to use it. You can check the last link in the resources for a more complex example with threading support.
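For a rough idea of what a threaded variant might look like, here is a minimal sketch. It is not the example from the linked page: the class and function names are my own, and it assumes Python 3, where the module is called socketserver and the streams carry bytes. Setting allow_reuse_address on a subclass keeps the change local instead of modifying TCPServer for the whole program.

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.StreamRequestHandler):
    def handle(self):
        self.wfile.write(b"Hi! I am echo!\n")
        for line in self.rfile:
            data = line.strip()
            if not data:
                break  # an empty line ends the session
            self.wfile.write(b">>" + data + b"\n")


class ReusableThreadingServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True  # sets SO_REUSEADDR before bind()
    daemon_threads = True       # handler threads do not block interpreter exit


def start_server(host="localhost", port=0):
    """Start the server in a background thread and return the server object."""
    server = ReusableThreadingServer((host, port), EchoHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Each client connection gets its own thread, so one slow client does not hold up the others. Calling server.shutdown() followed by server.server_close() stops it cleanly.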

Resources:

http://linux.die.net/man/7/socket
http://docs.python.org/library/socket.html
http://docs.python.org/library/socketserver.html
http://code.activestate.com/lists/python-list/222584/
http://www.technovelty.org/code/python/socketserver.html