
Re: lines going "down"

David Harris zonker@certaintysolutions.com
Wed, 28 Nov 2001 09:19:23 -0800 (PST)

  I've seen a similar scenario in one particular lab, using a
Cisco 3640 with NM-32A cards. I don't think this is a brand
issue. I merely offer it as another clue....

  When we see the failure, we typically see 8 ports in a group
go down...all 8 in a modulo-8 group. (e.g. 1-8 or 17-24.)
All of the affected lines are run by the same OCTART chip.
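  If you want to check a list of down ports against that pattern,
here is a minimal Python sketch. It assumes ports are numbered from
1 and that each block of 8 consecutive ports shares a chip, as in
the 1-8 and 17-24 examples; the function names are mine, not from
any tool.

```python
from collections import defaultdict

def group_of(port, size=8):
    """Return (first, last) of the block of `size` ports containing a 1-based port."""
    first = ((port - 1) // size) * size + 1
    return (first, first + size - 1)

def fully_down_groups(down_ports, size=8):
    """Return the blocks in which every port is reported down."""
    by_group = defaultdict(set)
    for port in down_ports:
        by_group[group_of(port, size)].add(port)
    return sorted(g for g, ports in by_group.items() if len(ports) == size)
```

  A block that shows up in fully_down_groups() matches the symptom
above; isolated down ports would not.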

  While I could point to a failure in IOS for this (which
would only be circumstantial and unsupported by fact), I 
actually have another working theory, based on looking at
the devices attached...

  In these cases, there was usually a network interruption
between the conserver and the console server. This could be
a switch/router failure in the network, or a forced reboot
of the conserver host without a polite shutdown...and the
devices showing 'down' were what I call 'quiet hosts'. (A
quiet host is a device that only replies when you talk to
it...it doesn't usually offer any log traffic, time stamps,
etc. to the logs unless someone is typing to it.)

  In the case of a network break like this, the TCP sessions
to all of the ports (from Conserver to the Console Server)
don't get cleared out when the connectivity failure occurs!
Since the host doesn't generate any traffic on the serial
port, the console server never tries to send traffic to the
conserver host, and the console server leaves the session
open, thinking that the conserver host is just idle. The
root cause here is that the TCP FIN sequence never occurred.
So, when you restart your Conserver, and it tries to then
connect to these ports on the console server, the console
server tells the conserver that the TCP port is busy (since
the console server thinks the old session is still
there and idle...)
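  The general mechanism for noticing a dead peer on an otherwise
idle session is TCP keepalive. The Python sketch below only
illustrates that mechanism on an ordinary socket. It is not how
conserver or the console server is configured, and the helper name
and timer values are my own assumptions. Note that it is the end
holding the stale session (here, the console server) that would
need to probe for the session to be cleared.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Ask the kernel to probe an idle TCP connection and drop it if the peer is gone."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tunables below are platform-specific (present on Linux);
    # any that this platform does not define are skipped.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

  With probing enabled, a session whose other end has vanished is
eventually torn down by the kernel instead of sitting idle forever.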

  In these cases, our cure has been to log into the console
server, and reset each affected line, one by one. This will
blow away the (already broken) TCP session, and allow you to 
either restart your conserver, or just force open each of
the lines that were down.
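  When a whole group is stuck, typing the resets by hand gets
tedious. A small Python sketch that prints the commands to paste is
below. It assumes the IOS exec command "clear line <n>" and that you
already know the router's own line numbers for the affected ports
(take them from the router, not from the conserver port list); the
helper name is hypothetical.

```python
def clear_line_commands(lines):
    """Build one 'clear line N' command per affected line, de-duplicated and sorted."""
    return [f"clear line {n}" for n in sorted(set(lines))]

if __name__ == "__main__":
    # Example line numbers only; substitute the ones your router reports.
    for command in clear_line_commands([35, 33, 34, 33]):
        print(command)
```

  This only builds the text; you still log in and run the commands
yourself, one line at a time.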

  While this doesn't happen too often in the data centers, 
I have seen this in some of the remote locations. Maybe
that's another good argument for having a distributed
Conserver deployment, and putting a logging host 'closer'
to the console servers? :-)


      -Z-    http://www.conserver.com/consoles/breakoff.html