Force 200 losing wireless IP 3.5.2 onward

I've tried multiple times to open a ticket with support, but they continue to find nothing wrong and have not managed to dig in and find the problem, so I'm hoping I can find a solution here.

Long story short, we use RADIUS and PPPoE with certificates to authenticate our Force 200s to ePMP2000 APs. We ran software version 3.5.1 for a very long time, as it was super stable and we really had no issues with it. When Cambium came out with the new MAC address Force 200s, we were forced to move past 3.5.1. Somewhere between versions 3.5.2 and 3.5.6 a major bug was introduced that causes the SMs to not fully authenticate after a disconnect: they authenticate and get a management IP, but the wireless IP is left blank, along with the DNS IPs. The only fix is to log in to each SM and manually reboot it.

We've continued to try every software version after 3.5.6 in hopes that this bug would finally be fixed, but to no avail.  

The only promising news we've seen is that the Cambium engineers are quite obviously poking around in this part of the code: in 4.4.1, the SMs no longer have a blank wireless IP but instead somehow get the default 192.168.0.2 IP address. Again, only a reboot of the SM fixes the issue. The logging in 4.4.1 also finally gives much more useful data. We're seeing the following messages on both the SM and AP side:

kernel: [7850272.050000] SM disassociated from AP F=5835 11naht20.  Reason: 53 (AP KEEP ALIVE RX STUCK)

kernel: [7850311.070000] SM associated with AP
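
For anyone collecting these messages from many radios, here is a minimal sketch of pulling the reason code out of the disassociation line. The pattern is modelled only on the two lines quoted above; the function name and the field names are my own, and other firmware versions may word the message differently.

```python
import re

# Pattern based solely on the log lines quoted in this post (an assumption,
# not a documented log format).
DISASSOC_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] SM disassociated from AP "
    r"F=(?P<freq>\d+) (?P<mode>\S+)\.\s+Reason: (?P<code>\d+) \((?P<reason>[^)]*)\)"
)


def parse_disassociation(line):
    """Return a dict describing a disassociation event, or None if no match."""
    match = DISASSOC_RE.search(line)
    if match is None:
        return None
    return {
        "uptime": float(match.group("uptime")),
        "freq": int(match.group("freq")),
        "mode": match.group("mode"),
        "reason_code": int(match.group("code")),
        "reason": match.group("reason"),
    }
```

Feeding it a syslog export line by line and counting events per reason code would show how often reason 53 precedes a stuck SM.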

This is an extremely frustrating issue, and constantly rebooting SMs to keep customer complaints down makes the network a real handful. We're stuck with no version of software that actually works. Please help!

This is a foible of using PPPoE on the newer firmware.

Cambium is using the Linux PPPoE stack, which activates as soon as a network interface is made active. That would normally be fine, but on the wireless side we must wait for the link to come up before PPPoE works properly. Having experienced the same issue with ours, we have determined that it is a race condition in their boot script that allows this.
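
To illustrate the ordering problem, here is a rough sketch of "wait for the link, then start PPPoE". This is not Cambium's boot script; `link_is_up` and `start_pppoe` are hypothetical callables standing in for whatever the radio actually uses, and the timeout values are arbitrary.

```python
import time


def start_pppoe_when_link_up(link_is_up, start_pppoe, timeout=60.0, poll=1.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Call start_pppoe() only once link_is_up() returns True.

    link_is_up and start_pppoe are hypothetical callables supplied by the
    caller. Returns True if PPPoE was started, False if the link never came
    up before the timeout.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if link_is_up():
            start_pppoe()
            return True
        sleep(poll)  # link not up yet; check again shortly
    return False
```

The race described above is what you get when the `link_is_up()` check is missing and the PPPoE client starts as soon as the interface exists.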

Things we have done to mitigate this (note this is not a way to completely resolve it):

We use a Cisco router as our PPPoE BRAS, and it handles PPPoE client address DHCP with a tie-in to our client-side DNS servers (mostly for reporting and tracking, in case we are asked who had what IP and when; it's easier to search by name than by connection number). We shortened the connection timeout timer and sped up the hold-off and IP recycle timers. This improved IP availability for clients whose link flapped, so we no longer have the same user holding 50 stale IP addresses.

We have set our APs not to act as an intermediate and the subscribers to use the shortest keep-alive time possible.

You also do not need to reboot the subscriber; just deregister it, and when it re-establishes the link, the PPPoE module will trigger and work.

Honestly though, we are moving away from PPPoE, as RADIUS EAP-TTLS and IPoE do most of the same things we used RADIUS for. cnMaestro and OpenNMS/Cacti (choose your favorite) provide the rest, with a little scripting to make changes to the RADIUS database (we are using FreeRADIUS with daloRADIUS on SQL for all users, including PPPoE) for limited-state users, who are becoming fewer and fewer as the difference in package plans versus unlimited/no data cap gets smaller all the time.
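
As an idea of what "a little scripting" against the RADIUS database can look like, here is a minimal sketch that moves a user between groups in the standard FreeRADIUS `radusergroup` table. It uses an in-memory SQLite database purely as a stand-in so it runs anywhere; the group names and username are made up, and a real MySQL or PostgreSQL backend would need its own driver and placeholder style.

```python
import sqlite3


def set_user_group(conn, username, groupname):
    """Move username into groupname, replacing any existing membership."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM radusergroup WHERE username = ?", (username,))
        conn.execute(
            "INSERT INTO radusergroup (username, groupname, priority) "
            "VALUES (?, ?, 1)",
            (username, groupname),
        )


def demo():
    # Stand-in database with only the columns this sketch touches.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE radusergroup "
        "(username TEXT, groupname TEXT, priority INTEGER)"
    )
    set_user_group(conn, "customer42", "unlimited")  # hypothetical names
    set_user_group(conn, "customer42", "limited")
    return conn.execute(
        "SELECT groupname FROM radusergroup WHERE username = ?",
        ("customer42",),
    ).fetchall()
```

Calling `demo()` leaves the example user in the "limited" group only. The attributes attached to each group (rate limits and so on) would live in `radgroupreply` and are not shown here.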

EAP-TTLS is used between the AP and the subscriber radio; it's a bit of work to set up but works great. We still maintain two VLANs: our device management VLAN and our clients' VLANs (some have private VLANs for multi-site service).

This change has also reduced our support calls for "no internet"; we just have to work at moving all our sites over.

Add a cnPilot to a client's connection and even the cnPilot will register with the RADIUS server via EAP-TTLS. And it's a premium service that you can bill for!

Thanks for the incredibly detailed response.  Certainly some additional items I can try to alleviate this headache. 

This morning I tried simply deregistering CPEs to recover them, as you had suggested, but unfortunately that does not resolve the problem. The CPEs come back, register with the AP and get a management IP, but the wireless IP is always set to the unit's default instead of the CPE authenticating and getting the correct IP.
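
For spotting affected units in bulk, here is a small sketch that flags CPEs whose wireless IP is blank or still at the 192.168.0.2 default mentioned earlier in this thread. How the name-to-IP mapping gets collected (SNMP, an NMS export, etc.) is left open; the function names and the input format are my own assumptions.

```python
# Default wireless IP reported earlier in this thread for 4.4.1.
DEFAULT_WIRELESS_IP = "192.168.0.2"


def wireless_ip_is_stuck(wireless_ip):
    """True if the wireless IP is blank/missing or still the unit default."""
    ip = (wireless_ip or "").strip()
    return ip == "" or ip == DEFAULT_WIRELESS_IP


def cpes_needing_attention(cpes):
    """Given {name: wireless_ip}, return sorted names whose IP looks stuck."""
    return sorted(name for name, ip in cpes.items() if wireless_ip_is_stuck(ip))
```

The blank case covers the 3.5.x behavior and the default-address case covers what 4.4.1 does.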

If the wireless IP is staying at the default, then change the IP mode from DHCP to manual and put a temporary IP in there. Then make sure your RADIUS is set up to push a DHCP IP to the radio.

Douglas - This has been working fine for the last few years without issue; it was only with software version 3.5.2 and newer that this started happening sometimes when clients disconnect. It doesn't happen every time. That makes me think you're onto something with the race condition you suspect in the CPE.

Hi Jason,

Can you give me the details of the tickets you opened with support so I can take a look?

Thanks,

Dmitry

PS Unrelated to your issue, but it is better to use 4.4.3 instead of 4.4.1 in any case.

Dmitry,

Request #192046 is the one. I couldn't go much further because I'm not going to disrupt a RADIUS server that is in service. Absolutely nothing has changed on the RADIUS server in nearly 5 years. Like I said in the ticket as well as this post, something was changed in the ePMP software between 3.5.2 and 3.5.6. 3.5.2 and before works perfectly. In fact, if we weren't forced to upgrade to 3.5.6 for the new Force 200 hardware, we'd still be running 3.5.1. Can you please have the software engineers check what changed between those versions and identify the problem?

FYI - 4.4.3 also does not work. We are constantly having to reboot the CPEs.

Hi,

Let me check the ticket and come back to you. Sorry for the delay.

Dmitry

Dmitry,

Additionally, here is what a CPE on 3.5.6 says in this scenario.