Jeffrey I. Schiller (jis@BITSY.MIT.EDU)
Thu, 20 Feb 86 23:35:17 EST
I thought I would share this one with everybody:
We have just finished installing an Ethernet with ~40 single
user Vaxstation II's running UNIX 4.2 on it. We also have a few IBM RT
PCs. The Vaxstations use the MIT-developed RVD protocol to
access the files in "/usr" which are in fact located on a VAX 11/750.
I mention RVD not because it is related to the problem described, but
because it means that these workstations make quite regular use of the
network even when involved in otherwise non-network activity. And
obviously they also depend strongly on the network in order to
function properly at all.
Therefore it was a major problem when people began complaining
to me about the network dying for 20 second intervals about once a
minute. The network would come back to life with the Vaxstations
spitting out an error about their Ethernet board being wedged and
having to be restarted.
One might normally expect there to be some hardware problem
behind all this. This turned out to only partially be the case, which
is to say the hardware problem found was of a design nature, not just
one broken board... read on.
A running examination of network hardware statistics showed
that we were getting a large number of collisions about once a minute,
exactly. At this point we turned on "Netwatch." For those who don't
know what it is: Netwatch is a program which is part of the MIT PC/IP
package for the IBM PC (PC/XT/AT and compatibles). It allows one to
examine packet traffic on the Ethernet and also does some automatic
packet analysis (i.e. Type of packet, Protocol, Source and
The problem turned out to be that one workstation was
misconfigured so that it thought it was on network 46 (whereas the
real network is 18.72; we use a subnetting scheme). Most UNIX 4.2
systems have a cute little daemon named "rwhod" that broadcasts
information to all other stations once per minute. In the case of our
misconfigured workstation this packet was sent to the Ethernet
broadcast address and to IP address 18.104.22.168. Now this resulted in all
other workstations on the network receiving this packet and deciding
that they would be good citizens and forward it on to our gateway.
Because this level of processing occurs at interrupt level, and most
of our stations are not loaded to the point that significant interrupt
latency occurs, all ~40 workstations would simultaneously attempt to
forward this packet to our gateway. This resulted in a monster
collision on the Ethernet. Now combine this with a hardware design
problem in the Ethernet controller that results in the Ethernet
hardware having a good probability of wedging in the (otherwise
rare) case of a monster collision, and add a 20 second timeout in the
device driver software and you have a neat problem on your hands.
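The chain of events above can be sketched in a few lines of Python. This is only an illustrative model, not the 4.2 kernel code: the function names, the /16 mask for network 18.72, and the destination 46.0.0.1 (a made-up address on network 46) are all assumptions chosen for the example. It counts how many stations would try to transmit at the same instant after one stray broadcast frame, with and without forwarding enabled.

```python
import ipaddress

def handle_broadcast_frame(dst_ip, local_net, forwarding_enabled):
    """Decide what one station does with an IP packet that arrived in an
    Ethernet broadcast frame. Returns "deliver", "forward" or "discard"."""
    if ipaddress.ip_address(dst_ip) in local_net:
        return "deliver"            # addressed to our own network
    # Not for our network: a station acting as a gateway passes it on.
    return "forward" if forwarding_enabled else "discard"

def simultaneous_transmissions(n_hosts, dst_ip, local_net, forwarding_enabled):
    """Count the stations that would transmit right after the frame arrives."""
    actions = [handle_broadcast_frame(dst_ip, local_net, forwarding_enabled)
               for _ in range(n_hosts)]
    return actions.count("forward")

# Assumed values for illustration only.
LOCAL_NET = ipaddress.ip_network("18.72.0.0/16")
STRAY_DST = "46.0.0.1"

storm = simultaneous_transmissions(40, STRAY_DST, LOCAL_NET, True)
quiet = simultaneous_transmissions(40, STRAY_DST, LOCAL_NET, False)
print(storm, quiet)
```

With forwarding enabled every one of the 40 stations makes the same decision at the same moment, which is the monster collision; with forwarding disabled none of them transmits.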
Obviously we fixed the misconfigured machine; however, this
problem has occurred several times since (we are still installing
Btw, tracking down the location of the offending machine is
also a bit of a bear. The network is in an office area. Each office
has two Ethernet drops, each coming from a dedicated transceiver above
the (icky dirty) ceiling. The offices are generally locked. People
leave their workstations on, up and in operation. Admittedly this
Ethernet topology is just asking for this kind of problem (i.e. what do
you do when an interface starts jabbering).
To help track this kind of problem down I have established a
database of IP address => Ethernet Address => Physical Location.
Luckily the software tools our workstations have do not allow a user
to easily change his Ethernet address, so that can be relied on to
determine the location of the offending machine.
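A database like this can be quite small. The sketch below is a hypothetical Python version, not the tool actually in use: the table entries, addresses, room names and the check_frame helper are all invented for illustration. It keys on the Ethernet address, since that is the part that can be trusted, and flags a station whose observed IP address differs from the registered one.

```python
# Hypothetical sample entries; real data would come from site records.
HOSTS = [
    {"ip": "18.72.0.10", "ether": "08:00:2b:00:00:01", "location": "Office 1, drop A"},
    {"ip": "18.72.0.11", "ether": "08:00:2b:00:00:02", "location": "Office 2, drop B"},
]

def normalize_ether(addr):
    """Lower-case the address and use ':' separators so lookups match."""
    return addr.lower().replace("-", ":")

# Index by Ethernet address, the one field a user cannot easily change.
BY_ETHER = {normalize_ether(h["ether"]): h for h in HOSTS}

def check_frame(ether_src, ip_src):
    """Look up the sender of an observed packet.

    Returns (location, registered_ip, misconfigured), or None when the
    Ethernet address is not in the table."""
    entry = BY_ETHER.get(normalize_ether(ether_src))
    if entry is None:
        return None
    return (entry["location"], entry["ip"], entry["ip"] != ip_src)

# A station announcing itself with an address on the wrong network.
print(check_frame("08-00-2B-00-00-02", "46.0.0.7"))
```

Feeding it the source addresses seen in a packet trace gives the office to visit directly, without crawling through the ceiling to follow cables.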
Moral of the story:
1) The Ethernet topology described above may not be the best possible
for an office area (but we all probably know that anyway).
2) A Network Spy program like Netwatch is an invaluable tool.
3) WORKSTATIONS THAT ARE HAPPY TO ACT AS GATEWAYS are probably a BAD
idea. In fact it is probably best that when a workstation receives
a packet incorrectly sent to it, the packet be DISCARDed and
perhaps logged for debugging purposes.
Note that if a workstation sends an ICMP destination
unreachable message back to the source, the monster
collision would still occur as ALL workstations would try
to send the ICMP message.
4) Ethernet controllers that work "most" of the time may not be
P.S. Sorry this message was so long.