On the subject of Radios disappearing

shrimpdaddy · October 4, 2005, 5:19am

Reading through the list, I notice some interesting issues with radios disappearing on the network. As far as I can tell, there are (at least) two causes of this, and in both situations the radios are still ‘registered’ to an access point:

1. A situation where the radio cannot be pinged, yet it still bridges traffic to the network (it’s not in NAT mode here; if it is in NAT mode, then the whole thing is ‘down’ as far as the customer is concerned)

2. A situation where the radio may be pingable, but does not respond to web requests

And while nobody can adequately explain the cause for #1, the cause for #2 is clearly shit programming. In all fairness to Motorola, I have not encountered #2 in the 7.x software releases yet.

I think the cause for #1 is fairly simple and obvious. I just figured it out after I added a Cisco router to my hub site today.

Basically, the Cisco has a 4 hour ARP timeout. So, it will just assume that it can reach a particular MAC address for 4 hours after the ARP table entry was last updated.

Unfortunately, the reality of the situation is that the Canopy SMs, APs and BHs have a default bridge entry timeout of only 25 minutes.

Let’s paint a simple scenario here:

Customer ARPs and finds the Cisco’s IP address. Traffic happens.

30 minutes go by with no traffic. AP drops the bridge entry which points customer’s MAC address to their radio.

Cisco router still knows customer is at a particular MAC address and decides to send traffic to that IP address. BH or AP doesn’t know where to route that frame so it just ignores it. That IP is “down” now.

If that MAC address initiates conversation back to the router, then they are now “up”.

You can replace “customer” with “AP” or “BH” or any other MAC address that resides behind an AP, SM, or BH.

The solution is simple: if you lower the ARP timeout on the Cisco to less than 25 minutes, then the Cisco is guaranteed never to “know” about (have an entry for) a particular MAC address that the BH, SM and/or AP does not also “know” where to route (e.g. out the radio link, or not). The Cisco will send out a broadcast packet with the ARP, which will hit all subscriber radios, and the appropriate one will respond. The SM/BH/AP will see the MAC address traveling through its bridge and it will automatically have a new MAC routing entry pointing in the right direction. The Cisco will talk to that MAC address and the intermediate Motorola radio bridge will know where to forward the traffic to.
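To put rough numbers on that, here is a minimal Python sketch of the mismatch. It assumes the ARP entry and the bridge entry were both refreshed at the same moment and that no traffic follows; the function name is made up for illustration and is not anything from Cisco or Motorola.

```python
def stale_window_minutes(arp_timeout_min, bridge_timeout_min):
    """Minutes during which the router still holds an ARP entry
    while the radio's bridge entry has already aged out.

    Assumption: both entries were last refreshed at the same time
    and the link is idle afterwards.
    """
    return max(0, arp_timeout_min - bridge_timeout_min)


# Cisco default (4 hours) against the 25 minute bridge timeout
print(stale_window_minutes(240, 25))  # 215

# ARP timeout lowered below the bridge timeout
print(stale_window_minutes(20, 25))   # 0
```

With the defaults, an idle host can be unreachable from the router for up to 215 minutes; once the ARP timeout is below the bridge timeout the window closes.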

I hope this helps to explain what appears to frustrate several people. Lowering the ARP timeout isn’t really a “Workaround”, it’s necessary when the MAC bridge timeout is so low.

The 25 minute timeout makes sense in an environment where the customer could jump between access points…

msmith · October 4, 2005, 6:11pm

Instead of lowering the ARP cache timeout on our Cisco I simply increased whatever Motorola calls their ARP cache timeout to the maximum, 1440 minutes - 24 hours.

This seemed to do the trick, however my situation was a bit unique and when I think back on it, I’m still not sure how this worked prior to changing all radios to 1440.

T1 terminates into a Cisco 3662, then Fast Ethernet 0/0 terminates to a Layer-2 D-Link 3662. Attached to the D-Link switch is (1) BHS, and (2) Red Hat Linux 9.0 servers. With the Cisco 3662 set to ARP timeout of 4 hours, and all radios at default of 25 minutes, I could ping customer routers from the Red Hat servers only, and not from the Cisco. As soon as the ping response came back, I could then ping from the Cisco and get responses as well.

Yes, the MAC addresses of the customer routers were indeed in the Cisco ARP cache, confirmed with a “sh arp” command. The strange thing is that the MAC addresses were NOT always in the ARP cache of the Linux servers, but yet the servers were the ones that were able to get the replies back from the pings. So if the servers did not have the MAC addresses of the customer routers, the ARP broadcast went out, hit the switch, went out every port on the switch, up the backhaul, to the radio network and eventually came back with a response from the destination. At that point, the MAC address would be added to the Linux server ARP cache, and the actual ICMP echo requests and responses would occur.

Assuming the Cisco was working correctly, the following should have occurred. If the Cisco already had the MAC addresses of the customers, it would query the routing table and then forward the ICMP request out FE0/0. Once it hit the switching fabric, if the MAC of the customer was associated with the BHS port, it would be forwarded out that port; if not, it would be broadcast and still hit the BHS port. If it was not an ARP broadcast, it would hit the BHS; then, if the BHS had the destination MAC, it would be forwarded, and if it didn’t, I believe it would have been dropped? If it was an ARP broadcast, the packet should have simply been forwarded to the BHM, then the AP cluster, to the SM, then the customer router.
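The forwarding behaviour guessed at above can be modeled with a toy learning bridge in Python. This is only a sketch under assumptions: it takes the "unknown unicast gets dropped" guess at face value (not confirmed Canopy behaviour), ignores which side of the link each MAC sits on, and the class name and MAC addresses are made up.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class AgingBridge:
    """Toy bridge: learns source MACs, ages them out, and (by
    assumption) drops unicast frames to unknown destinations."""

    def __init__(self, timeout_min):
        self.timeout_min = timeout_min
        self.last_seen = {}  # source MAC -> minute it was last seen

    def handle(self, now_min, src, dst):
        # Every frame refreshes the entry for its source MAC.
        self.last_seen[src] = now_min
        if dst == BROADCAST:
            return "flood"
        seen = self.last_seen.get(dst)
        if seen is not None and now_min - seen <= self.timeout_min:
            return "forward"
        return "drop"


router = "00:00:0c:00:00:01"    # hypothetical addresses
customer = "00:11:22:33:44:55"
bridge = AgingBridge(timeout_min=25)

results = [
    bridge.handle(0, customer, BROADCAST),  # customer ARPs for the router
    bridge.handle(0, router, customer),     # ARP reply reaches the customer
    bridge.handle(30, router, customer),    # stale ARP entry, bridge entry gone
    bridge.handle(30, router, BROADCAST),   # a fresh ARP request gets through
    bridge.handle(30, customer, router),    # the reply re-teaches the bridge
    bridge.handle(31, router, customer),    # unicast works again
]
print(results)
```

The third frame is the "down" case: the router sends unicast on the strength of its own ARP cache, and the bridge has already forgotten the customer.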

This still doesn’t explain why a ping from the Linux servers would get responses and the Cisco would not. I’m sure I could have figured out exactly what was going on if I had done some packet sniffing from different locations, but switching the radio ARP timeouts to 24 hours worked and has been flawless since.