Re: conserver eventually goes catatonic after SIGPIPE (on NetBSD)

Bryan Stansell bryan@conserver.com
Sun, 26 May 2002 00:10:06 -0700 (PDT)


On Thu, May 23, 2002 at 04:45:42PM -0400, Greg A. Woods wrote:
> conserver eventually seems to go catatonic after SIGPIPE (on NetBSD)
> 
> I think this is the same problem as what I reported some time ago with
> a somewhat older release, but now with 7.2.1 it's just far less common.
> (note the code I'm running includes the patches I sent to the list,
> though I don't see how any of those changes could affect any signal
> processing....)

ugh...well, i'll address this at the stack trace...

> I suspect the SIGPIPE is triggered by an attempt to write to a socket
> that's been closed (TCP RST) by the client.  The server should just
> close the socket and do any per-client cleanup necessary, but I don't
> see a signal handler for SIGPIPE anywhere....

right...no SIGPIPE handler.  but, there is a SIGCHLD handler, which
gets called when forked processes die.  i don't think this is a problem
(yet, at least).
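
for reference, here's a rough sketch of the usual way to get the
behavior you describe (ignore SIGPIPE so a write to a dead client fails
with an error instead of killing the process, then clean up on the
error).  IgnoreSigpipe() and WriteClient() are made-up names for
illustration, not anything in the conserver source, and i haven't
tried this against the real code:

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* hypothetical helper: call once at startup so that writing to a
 * closed socket returns -1/EPIPE instead of raising a fatal SIGPIPE */
static void
IgnoreSigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}

/* hypothetical helper: write the whole buffer to a client.
 * returns 0 on success, -1 if the write failed (errno is left set,
 * e.g. EPIPE) so the caller can close the fd and do its cleanup */
static int
WriteClient(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;          /* peer is gone or some other error */
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

the caller would treat a -1 return as "this client is dead" and drop
it, rather than relying on a signal handler.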

> I'm recompiling now.....  Hmm.... seems it was waiting for a PID that
> didn't exist:
> 

[chopped gdb header]

> (gdb) where
> #0  0x100b7e5c in wait4 ()
> #1  0x100b4e88 in waitpid ()
> #2  0xa6cc in ConsChat (pCE=0x15400) at group.c:3016
> #3  0x7a9c in Kiddie (pGE=0x47180, sfd=0x6c010) at group.c:1458
> #4  0xa20c in Spawn (pGE=0x47180) at group.c:2907
> #5  0xcad0 in FixKids () at master.c:143
> #6  0xd2f4 in Master () at master.c:313
> #7  0xc874 in main (argc=269482840, argv=0x14400) at main.c:724
> (gdb) up
> #1  0x100b4e88 in waitpid ()
> (gdb) up
> #2  0xa6cc in ConsChat (pCE=0x15400) at group.c:3016
> 3016                    while (waitpid(pid, &cstatus, 0) < 0) {
> (gdb) print pid
> $1 = 21006
> (gdb) 

looking at frame #2, you can see it's calling waitpid() from ConsChat().
ConsChat() is part of your patch.  the problem, i'm guessing, is that
the while loop around the waitpid() has slightly bad logic.
specifically, what happens when the waitpid() returns an error that
isn't EINTR?  it'll come around for another waitpid() and, i suppose,
lock up like this.  at least, that's my guess - i haven't done any real
testing - just scanned the code quickly.
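
to make the guess concrete, here's a minimal sketch of a wait loop
that only retries on EINTR and gives up on anything else (ECHILD, for
example, which is what waitpid() reports when the pid isn't a child
that can be waited for).  WaitForChild() is a made-up name, and this
is not the code from your patch or from group.c, just the shape i'd
expect:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* hypothetical helper: wait for one specific child.
 * returns 0 once the child has been reaped (status in *pstatus),
 * -1 on any error other than EINTR (errno is left set) */
static int
WaitForChild(pid_t pid, int *pstatus)
{
    for (;;) {
        if (waitpid(pid, pstatus, 0) >= 0)
            return 0;           /* got the child's status */
        if (errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        return -1;              /* e.g. ECHILD: stop, don't loop */
    }
}
```

the point is just that the loop condition has to look at errno, not
only at the return value, so a pid that's already gone can't keep the
loop alive.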

so, unfortunately, it looks like the patch you've put together might
need a little work.  i don't think folks using the base 7.2.1 will see
this type of problem.  i'd love to get the whole chat-based thing
integrated in...hopefully this stuff can be worked out.

anyway, there's my two cents.

Bryan