Service being dropped on ePMP 1000

Hi

We have ePMP 1000 units set up as APs and SMs across the network, with their own static IP range and management VLAN. Traffic VLANs are configured on a per-bridge/PMP basis under QoS. After a broadcast storm/loop on the management VLAN, we noticed we couldn't reach a few APs while others were fine. Oddly enough, service was also cut for clients. As a bridge device, why would the AP's gigabit port also drop client traffic? Any suggestions to resolve this and to prepare for future storms/loops?

Software Versions: 3.4.1 and 3.5.1


@MartinWandira wrote:

Hi

We have ePMP 1000 units set up as APs and SMs across the network, with their own static IP range and management VLAN. Traffic VLANs are configured on a per-bridge/PMP basis under QoS. After a broadcast storm/loop on the management VLAN, we noticed we couldn't reach a few APs while others were fine. Oddly enough, service was also cut for clients. As a bridge device, why would the AP's gigabit port also drop client traffic? Any suggestions to resolve this and to prepare for future storms/loops?

Software Versions: 3.4.1 and 3.5.1


Hi,

To limit the effect of broadcast storm events in your network, you can enable the Broadcast/Multicast Traffic Shaping feature on the APs and configure the Broadcast Packet Rate.

You can find these options on the Configuration -> Network page.

Thank you.

A broadcast storm is often called an ARP storm, because ARP requests are usually the broadcasts that feed it.

A bit of understanding can help in diagnosis, and this is still the most misunderstood concept in network engineering. (I know for a fact that the various certificate holders will comment about how wrong I am. I still stand by the IETF understanding that I was taught, and I know I may not be 100% right.)

How ARP works:

ARP requests are broadcasts. A device that knows an IP address but not the matching MAC address sends a request to the broadcast address ff:ff:ff:ff:ff:ff asking who owns that IP (remember that the application knows the IP address, but Ethernet uses MAC addresses to get from point A to point B). All devices listen to ff:ff:ff:ff:ff:ff, and bridging devices such as an L2 switch forward the broadcast to the devices behind them. A switch receives the ARP request, logs the MAC address that it came from along with the port it arrived on, and then sends it out all other interface ports (unless blocked on purpose). This forwarding is key, as it allows devices beyond that switch to respond if the requested address is theirs. When the correct device responds, it sends a unicast reply. The switch learns that sender's MAC address and port as well, which is how it builds its lookup table of MAC address to interface, while the requesting host stores the IP/MAC pairing in its ARP table.
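To make the learn-then-flood behaviour concrete, here is a minimal Python sketch of a learning switch. It is only an illustration of the idea described above, not how any particular switch or the ePMP firmware is written; the class name, port numbers, and MAC addresses are made up for the example.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy L2 switch: learns source MACs, floods what it cannot place."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        # Learn (or refresh) where the sender lives.
        self.mac_table[src_mac] = in_port

        # Known unicast destination: send out of exactly one port.
        if dst_mac != BROADCAST and dst_mac in self.mac_table:
            out_port = self.mac_table[dst_mac]
            # A frame whose destination sits on the arrival port is dropped.
            return [] if out_port == in_port else [out_port]

        # Broadcast or unknown destination: flood everywhere but the source.
        return [p for p in self.ports if p != in_port]


sw = LearningSwitch(ports=[1, 2, 3])
# ARP request from host A on port 1: flooded to ports 2 and 3.
request_out = sw.receive(1, "aa:aa:aa:aa:aa:aa", BROADCAST)
# Unicast ARP reply from host B on port 3: goes straight back to port 1.
reply_out = sw.receive(3, "bb:bb:bb:bb:bb:bb", "aa:aa:aa:aa:aa:aa")
print(request_out, reply_out)
```

Note that the flood branch has no memory of frames it has already seen. That is what makes a physical loop dangerous.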

How a storm happens and what it does:

Say you create a loop on the L2 network with, say, three switches. Not a bad thing in theory: the data will flow down either path to get where it has to go. This works fine until the first ARP request for an address that is not in the switches' lookup tables is transmitted. Now the switch does what it is supposed to and sends the request out on all interfaces, which goes to the next switch, which rebroadcasts it to the next switch, which rebroadcasts it to the first switch. Now the loop has been made, and the request continues to circulate indefinitely, because an Ethernet frame has no hop limit and the looped network has no end. It gets worse wherever a switch has more than one other path to flood to, since each rebroadcast then creates two or more new broadcasts and the count grows exponentially. In seconds a 10 Gbps link can be flooded with nothing but ARP request broadcasts, but that's not the whole of the issue.

Every device receiving the broadcasts must spend some amount of CPU cycles dealing with them as they come in at ever-increasing rates. Plus, each device only has so much memory for the ARP table, which keeps getting bigger due to the broadcast storm. At some point the device will run out of memory and stop responding. If the firmware is coded right, a timer will time out, the memory will get flushed, and it will start responding again. The problem is that the storm is still there, so this cycle repeats, and eventually the CPU will not be able to keep up and a hard-lock condition is met. Now the CPU has stopped processing and the timer cannot time out.
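A rough way to see how a single broadcast behaves in a looped network is to count the copies in flight after each hop. The Python sketch below assumes idealised switches that flood every broadcast out of every inter-switch link except the one it arrived on, with no spanning tree and no rate limit. The topologies and the function name are invented for the illustration.

```python
def simulate_flood(links, start, hops):
    """Count broadcast copies in flight on inter-switch links per hop."""
    # Build an adjacency map from the list of switch-to-switch links.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each in-flight copy is (sending switch, receiving switch).
    in_flight = [(start, n) for n in sorted(neighbors[start])]
    counts = [len(in_flight)]

    for _ in range(hops - 1):
        next_round = []
        for src, dst in in_flight:
            # The receiver floods to every neighbor except the sender.
            next_round.extend(
                (dst, n) for n in sorted(neighbors[dst]) if n != src
            )
        in_flight = next_round
        counts.append(len(in_flight))
    return counts


# Three switches in a ring: the copies never die, one per direction.
ring = [("A", "B"), ("B", "C"), ("C", "A")]
# Four fully meshed switches: every copy spawns two more at each hop.
mesh = [(a, b) for i, a in enumerate("ABCD") for b in "ABCD"[i + 1:]]

ring_counts = simulate_flood(ring, "A", 6)
mesh_counts = simulate_flood(mesh, "A", 6)
print(ring_counts)  # [2, 2, 2, 2, 2, 2]
print(mesh_counts)  # [3, 6, 12, 24, 48, 96]
```

In the plain three-switch ring a single request circles forever without multiplying, and the load climbs because every new broadcast from every host joins the ones already circulating. As soon as switches have more than two inter-switch links, each copy multiplies at every hop.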

The ePMP equipment is very robust, but even it has its limits. Some are directly due to cost (memory and processors, mostly), and some are due to performance factors. Would you pay $10,000.00 for the same AP that just has more RAM and a faster processor but cannot move significantly more packets? I know I would not! Under a storm condition, most networking gear is put to the ultimate test, and I have yet to see any that survives cleanly. This is why there is a broadcast limit option in these radios. Setting the radios to handle only, say, 400 broadcasts a second allows the radio to process the broadcasts in due measure, and if it misses the broadcast it was supposed to respond to, the protocol says to retry after a delay.
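The effect of a broadcast limit can be sketched as a simple per-second counter. This is only an assumption about the general technique, not the actual ePMP implementation. The class name and the storm rate of 10,000 packets per second are made up for the example, and the 400 figure is the one mentioned above.

```python
class BroadcastLimiter:
    """Allow at most rate_pps broadcasts per one-second window."""

    def __init__(self, rate_pps):
        self.rate_pps = rate_pps
        self.window = None  # which whole second is being counted
        self.count = 0      # broadcasts accepted in that second

    def allow(self, now):
        """Return True if a broadcast arriving at time `now` is processed."""
        window = int(now)
        if window != self.window:
            # A new second has started: reset the counter.
            self.window = window
            self.count = 0
        if self.count < self.rate_pps:
            self.count += 1
            return True
        return False  # over the limit: drop without further processing


limiter = BroadcastLimiter(rate_pps=400)
# Simulated storm: 10,000 broadcasts spread evenly over one second.
accepted = sum(limiter.allow(i / 10_000) for i in range(10_000))
print(accepted)  # 400
```

The dropped frames cost the device almost nothing, which is the point: the CPU stays available for management and client traffic, and a legitimate ARP request that was dropped simply gets retried by the sender.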


Thanks for this advice. I am starting to implement the measures now. Is there a recommended "Broadcast Packet Rate"?

Using port isolation can help control your storm risk as well.

We configure our switches so each AP can talk to the local router, and not to each other. Isolating your CPEs in the AP so they do the same will also reduce your risk. The isolation functions also reduce waste traffic.
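As a rough illustration of what port isolation does to flooding, the Python sketch below assumes a switch with one uplink toward the router and a set of isolated ports facing the APs. The function and port names are hypothetical and not taken from any vendor's configuration.

```python
def allowed_egress(in_port, ports, uplink, isolated):
    """Return the ports a flooded frame from in_port may be sent out of."""
    if in_port in isolated:
        # Isolated ports may only talk toward the router.
        return [uplink]
    # Frames from the uplink (or any normal port) flood as usual.
    return [p for p in ports if p != in_port]


ports = ["uplink", "ap1", "ap2", "ap3"]
isolated = {"ap1", "ap2", "ap3"}

from_ap = allowed_egress("ap1", ports, "uplink", isolated)
from_router = allowed_egress("uplink", ports, "uplink", isolated)
print(from_ap)      # ['uplink']
print(from_router)  # ['ap1', 'ap2', 'ap3']
```

A broadcast that enters on one AP port never reaches the other AP ports directly, so a loop or storm behind one AP has only one way out, and that path can be policed at the router.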


Port isolation will help, but it adds a wrinkle: all traffic must go core-ward just to loop back to another SM. Think about the same client in multiple spots where you're bridging their network for them.

What I do is set the broadcast limit stupid low: 400 on the AP and 200 on the SM. Combine this with good switches and a router per tower, and we limit a storm's ability to leave a tower and prevent one tower from inhibiting another.

There is no one answer that is right. You know your network design, your policies, and what you want to accomplish. Test each setting individually so you know what it does and how it will affect your network.