Over a month ago, I truly hoped an elusive network misbehavior issue had been solved. At the institute, we had a strange situation that caused a (seemingly random) stack of switches to go out to lunch and leave one of the floors without network access until it decided to restore itself (usually even power-cycling it was not enough). The equipment, ten SuperStack 4200 switches, is new; we got it around September, so I expected 3Com to give me at least some support on the issue. The behavior was apparently random, and I could not track it down, only (thanks to the very nice Cacti) prove it existed.
December 5 was terrible. The whole building's network refused to work. Switches were able to reply to my pings for no more than a minute after booting before they got flooded with "Spanning Tree CTRL MAC PAUSE: Quanta 65535" frames (according to another great tool, Ethereal).
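For reference, what Ethereal was decoding there looks like an 802.3x MAC-control PAUSE frame, and "Quanta 65535" is the maximum pause time such a frame can request (65535 units of 512 bit-times). Here is a minimal sketch of what one looks like on the wire, built from the standard frame layout rather than from our actual capture, with a made-up source MAC:

```python
import struct

# 802.3x PAUSE frame sketch (layout from the standard, not from our capture).
PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved multicast for MAC Control
SRC_MAC   = bytes.fromhex("000102030405")  # hypothetical source address
ETHERTYPE = 0x8808                         # MAC Control
OPCODE    = 0x0001                         # PAUSE
QUANTA    = 0xFFFF                         # 65535, as in the decode above

frame = PAUSE_DST + SRC_MAC + struct.pack("!HHH", ETHERTYPE, OPCODE, QUANTA)
frame += bytes(42)  # zero padding up to the 60-byte minimum frame size

print(len(frame))          # 60
print(frame[12:14].hex())  # 8808
```

A flood of these with maximum quanta effectively tells every receiver to shut up indefinitely, which matches what the pings were showing.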
I had been bugging the support people at 3Com for at least three weeks before this incident; they kept promising they would come soon to help us track down the cause of the elusive problem. We threatened to send the switches back and pursue reimbursement through whatever means we could (we had much better stability and performance with our old 10 Mbps hubs, so my boss insisted on replacing them), and that finally got them to come. The answer they gave me? Upgrade your firmware to version 2.50 (not available on their website, of course) and disable spanning tree functionality on the switches.
I have to admit the big problem immediately ceased. This is not to say I was pleased with the response time or with the switches' software quality, but at least we could work.
Happily, about seven weeks went by with no major outages (although with a few small ones).
Yesterday, the complete chaos came back. This time, though, I was able to get a good and useful tcpdump snapshot, passed it through Ethereal, and actually got something interesting (available to whoever asks me for it, if it sounds interesting). In short: something happens that confuses a switch (I guess it's a faulty NIC) and makes it ask all of the neighbouring switches (via spanning tree protocol; yes, it's disabled, but still) for their complete ARP tables. But somehow this noise, or whatever it is, keeps confusing it, so the spanning tree requests start repeating every two seconds for slightly over one minute. Then the same switch that requested this information gets tired of having so many ARP tables thrown at it, and sends out a storm of packets (one every 0.04 seconds) asking every other switch _not_ to send updates anymore. This is broadcast to all of every switch's ports, and the network dies.
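The storm itself jumps out of the capture once you look at inter-packet timing. Something along these lines would flag it; the timestamps here are synthetic and the thresholds are hypothetical, but the real ones came straight out of the tcpdump file:

```python
# Flag a sustained run of packets arriving faster than one every 0.04 s.
# Synthetic data; in practice the timestamps come from the capture file.

def find_storm(timestamps, max_gap=0.05, min_run=10):
    """Return (start, end) indices of the first run of >= min_run packets
    spaced closer than max_gap seconds apart, or None if there is none."""
    run_start = 0
    for i in range(1, len(timestamps)):
        if timestamps[i] - timestamps[i - 1] >= max_gap:
            run_start = i          # gap too wide: the run restarts here
        elif i - run_start + 1 >= min_run:
            return run_start, i    # enough tightly-spaced packets in a row
    return None

# Five seconds of normal traffic, then a burst at one packet per 0.04 s:
normal = [i * 1.0 for i in range(5)]
storm = [4.0 + 0.04 * i for i in range(1, 20)]
print(find_storm(normal + storm))  # (4, 13)
print(find_storm(normal))          # None
```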
My grudge against 3Com is: how come this frigging noise (at least that's what I assume it to be) on one of the ports kills the whole network? If a switch is supposed to be smart, would it be too much for it to just disable the misbehaving port?
Anyway... a day and a half went down the drain trying to find the cause of the problem, sending the report to the 3Com guy (who is very nice in person, yes, but that's by far not enough!), running up and down the stairs to reset different switches, and trying to lock down the source of the distress... And as my 3Com contact was busy with another client (who presumably had not yet given them money), I never got a call back from them.
All my good intentions of spending some nice quiet time coding for the three projects we currently need to have ready soon (one of them, my dear Comas conference management system, is in production and many of you have used it, but it requires heavy tinkering to be extended) were stuck at exactly that: good intentions.
I seem to have narrowed the problem down to one of three computers (that is, the network seems stable again once I disconnected them). I insist it must be a faulty NIC. My boss bets it's an über-contaminated Windows machine. My boss's boss insists we should get the switches double-grounded, just to be sure that's not the reason for the failure. I am tired, in a bad mood, and have, once again, bored you to death with a too-long blog entry.
Blame it all on 3Com.