Lost ALRM signals

Sat, 10 Aug 2002 17:26:21 -0700 (PDT)

One of my (more annoying :-) cohorts noticed the timestamps were
missing for one of the log files he was looking at.  After poking
around, I found about 1/3 of the child processes looked "normal":

  # psig 12550 | grep ^ALRM
  12550: ALRM     caught  0

while the others looked "odd":

  # psig 12555 | grep ^ALRM
  12555: ALRM     caught  RESETHAND,NODEFER

and there was a direct correlation between "normal"/"odd" and whether
the timestamps were working.  Sending an ALRM signal to an "odd" process
by hand caused it to die, although at least the master restarted it.
However, the restarted process still had the "wrong" flags.

Looking at the code I noticed some sleep() and usleep() calls that could
potentially disturb the ALRM handler and clearly needed to be protected.

However, a debugging session showed the real culprit to be, of all things,
TCP wrappers.  It messes with the ALRM signal and makes no attempt
whatsoever to save/restore it (grrrr).  So every time a connection
was made and hosts_access() was called, our ALRM handler was clobbered
(of course, it only took one).

The following code adds a small function that must be called from any
place that might mess up the handler.  Future coding should be sure to
call it if any sleep() or usleep() calls are added (maybe sleep/usleep
should be wrapped with our own code and never be called directly?).
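
To illustrate the wrapping idea in the parenthetical, here is a rough
sketch of a re-install function plus a sleep() wrapper that always calls
it.  This is not the code from the patch; on_alarm(),
install_alrm_handler() and sleep_keeping_alrm() are hypothetical names,
and the handler body is a placeholder.  It also assumes that only the
handler needs repairing and does nothing about a pending alarm() timer:

```c
#define _POSIX_C_SOURCE 200809L  /* expose sigaction() under strict compilers */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Placeholder handler; the real one would drive the log timestamps. */
volatile sig_atomic_t alarm_seen = 0;

void on_alarm(int sig)
{
    (void)sig;
    alarm_seen = 1;
}

/* (Re)install handler for SIGALRM with explicit, persistent flags. */
int install_alrm_handler(void (*handler)(int))
{
    struct sigaction sa;

    assert(handler != NULL);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_RESETHAND, no SA_NODEFER */
    return sigaction(SIGALRM, &sa, NULL);
}

/* sleep() wrapper: re-install the handler afterwards, whatever sleep() did. */
unsigned int sleep_keeping_alrm(unsigned int seconds, void (*handler)(int))
{
    unsigned int left = sleep(seconds);

    (void)install_alrm_handler(handler);
    return left;                /* unslept seconds, as sleep() reports them */
}
```

A usleep() wrapper would follow the same pattern.  Calling
sleep_keeping_alrm(5, on_alarm) in place of sleep(5) keeps the
re-install step from being forgotten in new code.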

This was all tested on Solaris 2.8 with conserver 7.2.2, although the
problem first showed up on my production system, which runs Solaris 2.6
and conserver 7.1.3 (upgrades to both are imminent).

John R. Jackson, Technical Software Specialist, jrj@purdue.edu