traffic storm - management interface retransmits broadcasts

We’ve had a very strange issue where a broadcast will be retransmitted by all Canopy Subscriber Modules in NAT mode. This results in a storm of traffic that halts valid network activity for a few seconds at a time.

We’ve been running NAT mode for over a year and a half, so we’re wondering whether this issue was caused by the 7.2.9 firmware update, or whether it has been happening all along and is only now starting to affect us due to network growth.

After sniffing a few of these storms, we’ve verified that the actual source is the management interface of the SM. Taking customers out of NAT mode resolves the issue, but this is not a viable solution due to the high number of customers in this mode; plus, as mentioned before, we’ve been running in NAT mode for over a year and a half problem-free, which points back to the firmware theory.
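
For anyone wanting to repeat the capture, below is a minimal sketch of the kind of tally we did, assuming a machine with Scapy attached to a mirrored/monitor port; the interface name and the capture window are placeholders, not our actual setup.

```python
# Sketch: tally broadcast frames by source MAC to see which devices are
# (re)sending them. Requires scapy (pip install scapy) and root privileges.
# "eth0" and the 60-second window are placeholders.
from collections import Counter
from scapy.all import sniff, Ether

sources = Counter()

def tally(pkt):
    if Ether in pkt and pkt[Ether].dst.lower() == "ff:ff:ff:ff:ff:ff":
        sources[pkt[Ether].src] += 1

sniff(iface="eth0", prn=tally, store=False, timeout=60)

# MACs with unexpectedly high counts are the retransmitters; compare them
# against your list of SM management-interface MACs.
for mac, count in sources.most_common(20):
    print(f"{mac}  {count} broadcast frames")
```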

Anyone else experiencing these issues?

–jtn39

What exactly is the management interface of the SM broadcasting for? What is it looking for? What type of broadcast packet is it?

msmith wrote:
What exactly is the management interface of the SM broadcasting for? What is it looking for? What type of broadcast packet is it?


The management interface isn't the originating source of the broadcast; instead, it's re-transmitting it.

The original broadcast comes from some other device on the network (e.g. a random customer's computer or router); the broadcast type has been anything from stray RIP advertisements to NetBIOS, etc.

Instead of ignoring this traffic as it should, the management interface resends the packets onto the network, so a single broadcast packet can suddenly generate thousands of retransmits, one from each NATed SM.
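
To put a number on that amplification, a sketch like the one below (Scapy again; the interface and capture window are placeholders) groups broadcasts by payload and counts how many distinct source MACs send each one. One original sender plus a copy from every NATed SM is the signature described above.

```python
# Sketch: measure how many distinct MACs replay the same broadcast payload.
# One original frame plus N copies from NATed SMs shows up as N+1 senders
# of an identical payload. Interface and capture window are placeholders.
from collections import defaultdict
from scapy.all import sniff, Ether

copies = defaultdict(set)   # payload bytes -> set of source MACs

def track(pkt):
    if Ether in pkt and pkt[Ether].dst.lower() == "ff:ff:ff:ff:ff:ff":
        payload = bytes(pkt[Ether].payload)
        copies[payload].add(pkt[Ether].src)

sniff(iface="eth0", prn=track, store=False, timeout=30)

for payload, macs in sorted(copies.items(), key=lambda kv: -len(kv[1]))[:10]:
    print(f"{len(macs)} distinct senders for a {len(payload)}-byte broadcast")
```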

Can you downgrade the firmware of one single SM in your network, put the SM in NAT-mode, and see if the problem disappears?

We are seeing a similar problem: a broadcast message being repeated. We have about 400 SMs and 30 APs on the network and have been running for about 15 months.

About a week ago we upgraded the APs and SMs to 7.2.9, and all of a sudden we have started getting constant ARP traffic.

We are running MRTG to monitor our SMs, and we see all of them with a constant traffic flow of about 40 kbps even with nothing connected.
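
If you want to spot-check a single SM outside of MRTG, a rough sketch is below. It assumes net-snmp's snmpget is installed and the SM answers SNMP v2c; the host, community string, and ifIndex shown are placeholders for your own values.

```python
# Sketch: the same check MRTG does, done by hand -- poll ifInOctets twice
# and convert the delta to kbps. Host, community, and ifIndex are placeholders.
import subprocess
import time

HOST, COMMUNITY = "10.0.0.2", "public"
OID = "1.3.6.1.2.1.2.2.1.10.1"   # ifInOctets for ifIndex 1

def in_octets():
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, HOST, OID], text=True)
    return int(out.strip().rsplit(" ", 1)[-1])   # "... = Counter32: 12345"

INTERVAL = 60
first = in_octets()
time.sleep(INTERVAL)
second = in_octets()

kbps = (second - first) * 8 / INTERVAL / 1000
print(f"{HOST}: {kbps:.1f} kbps inbound over the last {INTERVAL}s")
# Anything well above zero on an SM with no customer traffic is suspect.
```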

We have started the painful process of getting Motorola involved.

We are seeing a similar issue to JTN39. We have noticed that the ARP traffic on the net has increased exponentially. It seems that any ARP traffic targeted to or originating from the NAT public interface is being transmitted multiple times, to the extent that some packets are apparently being transmitted over 30 times. This started to occur shortly after upgrading to 7.2.9.

Our network is of similar size to the previous posting, and we can see 32–52 kbps attributed to ARP traffic. Having spoken to Motorola, they want everything but the kitchen sink to start looking into it. The obvious downfall in the meantime is that whilst all this traffic is flying around, it's affecting our users. Maybe the Moto guys on the forum can tell us if they have been contacted by others about this?
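
A quick way to measure that duplication is to count repeats of identical ARP requests in a short window. The sketch below uses Scapy; the interface name and the 10-second window are placeholders.

```python
# Sketch: count repeats of identical ARP requests in a short window to put a
# number on the duplication (some packets are apparently repeated 30+ times).
# Interface name and the 10-second window are placeholders.
from collections import Counter
from scapy.all import sniff, ARP

repeats = Counter()

def count_arp(pkt):
    if ARP in pkt and pkt[ARP].op == 1:          # who-has requests only
        key = (pkt[ARP].hwsrc, pkt[ARP].psrc, pkt[ARP].pdst)
        repeats[key] += 1

sniff(iface="eth0", filter="arp", prn=count_arp, store=False, timeout=10)

for (mac, src, dst), n in repeats.most_common(10):
    print(f"who-has {dst} tell {src} ({mac}): seen {n} times")
```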

We have contacted Motorola about the issue; so far they have asked us for configuration information taken from an AP and an SM using the Gather Customer Support Information tool in CNUT.

After a day or two they got back to us. Unfortunately, we had accidentally included a bridged module instead of a NATed module configuration, so I just sent them another email with the correct information…

I’m looking forward to seeing what they come up with…

We managed to sort out the problem.

It was a VLAN-configured SM causing the problems; at this stage we're not sure if it was the SM itself or something in the customer's network, so we need to investigate further.

You need to go through a process of elimination: have your sniffer running and start taking down parts of your network until the problem goes away. We did this at 2 am on Sunday morning.

vj wrote:
We managed to sort out the problem.

It was a VLAN-configured SM causing the problems; at this stage we're not sure if it was the SM itself or something in the customer's network, so we need to investigate further.

You need to go through a process of elimination: have your sniffer running and start taking down parts of your network until the problem goes away. We did this at 2 am on Sunday morning.


We have gone through extensive analysis of network traffic during the problem on multiple occasions, and each time we have eliminated a "threat". The issue will just come back in another form, from another customer, a few hours or a day later.

The core issue here is not tracking down a customer who is the "cause"; we have done that multiple times, each time eliminating the "threat" of the moment. Rather, the issue is that the Canopy modules are retransmitting traffic in ways a network device never should, and in the process they turn all sorts of traffic that would be considered "normal" on most networks into a possible problem trigger.

We intend to try downgrading the firmware in a test environment to see if this helps resolve the issue; if it does, though, the process of downgrading over a thousand Subscriber Modules isn't exactly a welcome prospect.

I see…

During our testing we did see other duplication, but not as bad as this. I take it that when you have found the culprit, you have just swapped out the SM for a new one with the same settings and the problem has been temporarily fixed, i.e. the problem has been the SM and not something on the customer's network. In our case we had one extremely bad SM and a couple of not-so-bad ones that are creating a little bit of duplication.

We were assuming that it was something on the customer network.

When we sent our network diagram in to Motorola (8 clusters, 400 SMs), the first thing Motorola showed surprise at was that it was all one big cluster, so a broadcast storm could take it down, but that was their only reason for surprise.

It looks like a bug with the 7.2.9 software, in which case we're stuffed. I will try downgrading the SM and then upgrading it again to see if the initial upgrade corrupted something.

Solutions I can think of, as we may need to come up with one too, are:

1. Downgrade to 6.1.
2. Put everyone on a different VLAN plus a management VLAN (could be a management nightmare); need to look at it more.
3. Break the switch cloud into smaller clouds and stick in routers (again a management nightmare).
4. Instead of interlinking our 8 clusters, set up 2 new dedicated BH clusters and fire all clusters to these POPs. Make them all slaves so the timing is controlled by the master at the other POPs; that way you don't need a CMM for timing and can use a Cisco switch with VLAN routing to separate our clusters.

Anyone else out there on 7.2.9 software experiencing similar problems? You may have a problem and not realise it yet.

We are seeing the same ARP traffic. At what point will we see this start overloading our network? 200 SMs? 300 SMs? 500 SMs? We have all of our SMs in NAT mode and are not using any of the VLAN settings. It sounds like VJ is using VLANs and is having the problem. Is the solution no VLANs and NAT disabled?

Wow! I just read this thread and alarm bells are going off. I have been complaining and griping about general unexplainable slowdowns within the network since I first put 7.2.9 on my network. This has been a frustration I have been living with. I have called support several times and pushed them for an answer, and they have said there are no known issues regarding this type of problem.

That’s frustrating. We have about 10% of the network’s SMs NAT’d. We have everyone on routers, all verified; I have software I use to regularly test for routers plugged in backwards, and the appropriate ports are blocked on the non-NAT’d SMs, and it has still driven us bananas.

I view this as a critical issue. In my area, us vs. DSL vs. cable is very competitive, and the network slowing down when I have less than 1/3 of my expensive fiber circuit to the Net in use is really a bad thing.

What I did find helped a little (for a time anyway) was to replace all the cheaper little switches with something at least like a 3Com 3300. I had various Ciscos (29xx, 35xx) out there and found the 3Com did a better job of “buffering” the problem a little, because at times the network would grind to a halt.

Moto suggested VLANs, and yes, I understand the logic, but the admin and other issues didn’t make me want to jump right in and change the way things work, because in general my network is very sparse, though we are ready to grow big-time. We have about 400 SMs spread across 65 APs at 23 different locations (some places full clusters, others a single AP) on 3 “physical” segments, and nobody is killing us consistently with bandwidth use, so we really shouldn’t be having this slowdown.

And this all did start when we moved to 7.x software. I saw a little of it with 7.1.4, and with 7.2.9 it has become a real problem.

In any of the cases raised with Moto, has anyone gotten an answer on the prognosis for a timely fix?

Paul

I think we are on our own… To answer Jaybug’s question very quickly: we were generating about 30 kbps of constant ARP traffic, and MRTG showed all our SMs constantly doing 30 kbps day and night (even when not being used), but it wasn’t the bandwidth that was killing us or causing the slowdown, it’s the number of packets.

Although Moto have not come out and said this, and I am doing intensive testing in this area, Moto’s claim of 6.4 Mbps raw data and “you just need to optimise your network” is a smokescreen: there is a restriction on the number of packets Canopy can handle (as there is with all equipment), but the limit is very low with Moto.

So guys, watch out… ARP is eating into your packet limit, not your bandwidth, hence the slowdown.
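
To put rough numbers on that, here is a back-of-the-envelope conversion showing why a trivial amount of bandwidth can still mean a lot of packets. The 30 kbps figure is from above; the 60-byte frame size and 400-SM count are assumptions for illustration only.

```python
# Back-of-the-envelope: convert a small, constant ARP load into packets per
# second. The 30 kbps per SM is from the MRTG graphs above; the 60-byte frame
# size and 400-SM count are assumptions for illustration.
ARP_KBPS_PER_SM = 30          # constant background traffic seen per SM
ARP_FRAME_BYTES = 60          # roughly a minimum-size Ethernet frame
SM_COUNT = 400

pps_per_sm = (ARP_KBPS_PER_SM * 1000 / 8) / ARP_FRAME_BYTES
total_pps = pps_per_sm * SM_COUNT

print(f"per SM:  {pps_per_sm:.0f} packets/s")   # ~62 pps
print(f"network: {total_pps:.0f} packets/s")    # ~25,000 pps
# The bandwidth is trivial; the per-packet processing load is not.
```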

Has anyone found the solution to this problem? I had some RIP responses that ate up my network for a while today… Even after I removed who I thought was doing the damage, I still get some horrible spikes… Do the later firmware versions seem to fix this problem?