Anyone else using EAP-TTLS with ePMP?

JLLC · March 15, 2021, 8:29pm

Hello-

Anyone here running EAP-TTLS with ePMP F300 gear?

We’ve run this way since 2016, through multiple firmware iterations. Everything works as expected with N clients, regardless of AP. All AC clients have intermittent, repeatable issues, regardless of AP.

Opened a support ticket a year ago and the most recent guidance from Cambium’s top tier engineers is to either not use AC clients, or do WPA-PSK instead.

TYIA.

Douglas_Generous · March 17, 2021, 11:59am

Did you drop Cambium’s certificates in your AC gear?

We haven’t seen any issues with EAP-TTLS on the 300 series SMs.
We are using freeradius3 and MySQL (MariaDB) with daloRADIUS providing the SQL tables and a handy direct interface (though a bit confusing until you’re used to it). All of our tower routers are Ciscos and each acts as a NAS, with the AP also providing NAS support.

JLLC · March 18, 2021, 5:56pm

We generate our own certificates. Again, the only difference is that F100/200 consistently work, whereas F300 will regularly and randomly stop forwarding traffic. A CPE reboot will usually fix it, or if we leave it alone until the next key exchange, that usually resolves it.

Are you using the default Cambium certificates?

Douglas_Generous · March 18, 2021, 6:39pm

This may not be an EAP issue but here are our observations for the F300’s:

It does not matter if you use a public cert or a self-signed one; what I was asking is whether you dropped/disabled Cambium’s certs. If you haven’t, I would really suggest doing so and making sure you’re on 4.6.0.1 firmware. Even though the Cambium certs are not normally used, they do pose several issues. The major one for us was certificate looping causing the SM to not pass data (an authorized/not-authorized issue where the device presents two certs and only one is authorized, so it joins the network and the data flow hits a not-authorized on periodic re-auth). This is not unique to Cambium products, as we have had a range of devices (TP-Link, Linksys, Cisco, D-Link, etc.) that also exhibited similar issues.

We have also had issues with the F300 smartspeed feature that cause SMs to stop passing data. In our investigations we found that the port doesn’t always refresh its state and we were getting mismatches, which would cause the port to not pass data. To mitigate this we turn smartspeed off too, but leave auto-negotiation enabled.

There is also an issue with RFC1918-to-RFC1918 NAT on the F300 radios, so if you’re using RFC1918 addresses as the bridging IP space (PPPoE or IPoE double NAT), I suggest using the CGN addresses instead. Just remember to block them at your gateways. We haven’t noticed any issues with RFC1918 addresses for management IP space though (yet?).
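
For anyone auditing their address plan against the RFC1918 vs CGN suggestion above, here is a minimal sketch using Python's standard `ipaddress` module. It only sorts an address into RFC 1918 private space, the RFC 6598 shared (CGN) range 100.64.0.0/10, or anything else. The `classify` helper is a hypothetical name, not something from the ePMP tooling.

```python
import ipaddress

# RFC 1918 private ranges and the RFC 6598 shared address space used for CGN.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]
CGN = ipaddress.ip_network("100.64.0.0/10")


def classify(addr: str) -> str:
    """Return 'cgn', 'rfc1918', or 'other' for a single IP address string."""
    ip = ipaddress.ip_address(addr)
    if ip in CGN:
        return "cgn"
    if any(ip in net for net in RFC1918):
        return "rfc1918"
    return "other"


if __name__ == "__main__":
    # Example addresses are made up for illustration.
    for a in ("10.20.30.40", "100.64.12.1", "203.0.113.5"):
        print(a, classify(a))
```

The same two network definitions can double as the block list for the gateways, since CGN space should not leak past them.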

JLLC · March 18, 2021, 6:58pm

Doug-

I appreciate you taking time to respond.

I can’t say we have consistently deleted the default certs, so I’ll be sure to do that in our lab setup. It would be awesome if it turns out to be that simple. We’ve never experienced that issue with the N-based radios, so I think it would still point to a flaw in the ePMP software.

Also, to be clear, there is never a TTLS authorization failure. You can show a client has been connected for weeks on end, but it stops passing L3 traffic, both for management and traffic beyond the ethernet port. We’ve been able to demonstrate this both when the client radio is running in NAT-router mode and in bridge mode.

We don’t do any private-to-private NAT. Either we deploy the radio in NAT-router mode with publics on the WLAN interface and privates on the LAN interface, or we deploy in bridge mode with privates for management and pass a public IP to the client’s device.

We’ve seen this behavior since PTMP operation was supported on F300 all the way into 4.6.0.1

Douglas_Generous · March 18, 2021, 7:36pm

Keep in mind that the ePMPs run a stripped-down version of Linux, and even some of the applications have been stripped down too, so this does point to a possible bug/flaw, hence my point about the smartspeed issues we have had. Since no management data passes (no L3 at all) regardless of the network mode, and it is random, I would be looking at smartspeed or power issues (may or may not be the SM), as you have basically eliminated the other possibilities. I am not discounting any other possibilities, but the provided information and the experience I have gained in the last 20+ years (the last 18 using Motorola and Cambium radios) give me a different insight. Of course it’s not the same as being onsite and dealing with it personally.

Are you using cnMaestro? Do you use MRTG/Cacti? If so, see if it’s just a subset of radios or if all of your F300s are doing this. Keep in mind you can only see a week’s worth of data in cnMaestro unless you pay for the pro license. We use MRTG (not as nice as Cacti but simpler IMHO) to provide long-term data keyed on the SM’s device name (makes things easier if you swap an SM or if the IP changes, which does happen on our network), tracking RSSI, SNR, status (online), local ethernet port data throughput (there is always something chattering) and bridged data throughput. We also send as much logging as we can from each radio to our log servers, as a simple reboot wipes the log displayed. It still doesn’t capture a lot when things are weird, but it does provide some helpful hints at times.

We also had one client with power issues (rural farm on a well and septic system) where the pump would cut in and, at random, the SM would stop passing data. This took us a bit to determine, as it is normal for lights to flicker in these setups and there is enough capacitance in the SM power supply to ride out most power sags. The problem was his wifi router would have the port flash off and on (like a broken cable in the wind) during the pump operation, and the SM would disable the port but leave no log for it nor show the ethernet connection was down. A quick reboot and it would be back up. On a laptop directly connected it wouldn’t happen, but on wifi it would. We have since made it clear that he needed a UPS on the router and SM or we could not provide service or support. He has a UPS and we haven’t heard from him in 8 months. Again, this may not be the issue, but it is a possibility.

By the way, you will not get an auth failure message in the SM log since one cert passes and the radio joins the network. From what we figured, the WLAN MAC is auth’d to the network and joins, but the local port can be auth’d separately and there seems to be a race over which cert is applied during re-auths (which are supposed to happen periodically). Depending on your RADIUS software and its config you may not see it in the logs there either. We had fun isolating this on our network, and it almost made us scrap the EAP-TTLS system until we turned on full debug on our RADIUS and saw the dual certs.
I am not saying this IS what is happening, but it’s been seen and could be a possibility, though I am leaning toward smartspeed.

JLLC · March 22, 2021, 2:26pm

We haven’t been using Cambium for 18 years, but we’ve been doing this the same way since 2016.

Cambium engineers have confirmed there is an issue, but have pointed at forward compatibility, which isn’t it because I can demonstrate the same issue with 3k or 3k-L APs.

When I’ve asked in other groups it became clear most users don’t seem to utilize EAP, they do PSK and so they never see this issue. If we disable EAP and just do PSK, then the issue disappears.

I don’t understand why smart speed would cause traffic on the WLAN interface to stop forwarding. I agree it seemed like it caused ethernet forwarding issues in older firmware, so I typically disable it.

It’s not related to the default certs. It’s not smart speed. It’s not a power issue, we’ve confirmed the issue at a number of locations, many with battery backup.

If you’re running EAP successfully with F300 radios, then there has to be a work-around, so we’ll keep looking at our systems since Cambium support stopped responding.

Douglas_Generous · March 22, 2021, 3:06pm

Which RADIUS and EAP-TTLS implementation are you using? We demo’d a Windows-based system but had nothing but issues. Went back to freeradius3 on Debian and can’t say there are any EAP-TTLS issues.

We do have the odd radio stop passing data here and there, but it is not specific to a hardware generation and happens a lot less with 4.6.0.1 on everything. We have a mix of e1k and e3kL APs and just about one of every type of SM out there except the F190s.

We are also moving away from NAT at the radio as we provide managed wifi to a lot of our clients now. It makes things easier for us too.

JLLC · March 22, 2021, 4:59pm

I’ll have to check specific releases, but IIRC, we’re running CentOS and Free RADIUS from about 5 years ago. Built the VM when we started moving to ePMP and that is its sole purpose.

Douglas_Generous · March 27, 2021, 11:58pm

The CentOS version of FreeRADIUS had issues (CentOS-based, IIRC).

My suggestion: If you are using an sql backend, spin up a debian 10 vm and test it.

khoff · February 24, 2022, 1:12am

For the archives…

Late to the party here, but we found that the N-based devices (Force 100, ePMP1000, etc.) would complete radius requests if you ignored CHAP just fine. But if you don’t actually complete the CHAP challenge/handshake on AC-based devices, they go into an infinite loop and never connect. Our config was completely ignoring the username/password CHAP challenge and just authorizing on EAP and MAC, so we ran into the issue with the first Force300s deployed.

Hope that helps.

Douglas_Generous · February 24, 2022, 6:34am

Never seen that on our N-based devices… Definitely true for the AC-based devices! Just tried it on the test network and it failed spectacularly! We use the MAC as the username, so this is probably why we didn’t see this before.

JLLC · February 24, 2022, 3:56pm

@khoff Are you saying that the F300 doesn’t connect at all or that it may intermittently get stuck in a loop?

Our issue is that there is no failure on the RADIUS server. The F300 stays connected to the AP, it just stops passing traffic. We’ve tested on multiple firmware versions, multiple AP revisions, multiple Linux OSes and multiple Free RADIUS builds. We’ve even simplified our responses to enable/disable (rather than extended responses with MIRs, etc.)

We can recreate this scenario easily.

F300 doesn’t disconnect from AP, so you can look and say “geez, this thing has been connected to the AP for 40 days, this customer must be crazy.” When we put more granular monitoring in place, we find both management and customer traffic will stop forwarding at random. Rebooting the unit restores connectivity or if you just wait until the next key exchange, it typically will start forwarding traffic again. The more traffic moving across the link, the more often it will stop forwarding traffic.
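
A minimal sketch of the kind of granular monitoring described above: given periodic samples that record whether the SM is still registered on the AP and whether a probe through it succeeded, it reports the windows where the radio was associated but not forwarding. The `Sample` type, its field names, and the `silent_stalls` helper are hypothetical; how the samples get collected (SNMP, ping, log scraping) is left open.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    ts: int           # Unix timestamp of the poll
    associated: bool  # SM still shows as registered on the AP
    reachable: bool   # probe through the SM (e.g. ping) succeeded


def silent_stalls(samples: List[Sample]) -> List[Tuple[int, int]]:
    """Return (first_ts, last_ts) windows where the SM was associated but unreachable."""
    stalls: List[Tuple[int, int]] = []
    start = None
    last = None
    for s in sorted(samples, key=lambda x: x.ts):
        if s.associated and not s.reachable:
            if start is None:
                start = s.ts  # a new stall window opens
            last = s.ts
        elif start is not None:
            stalls.append((start, last))  # window closed by a good or dissociated sample
            start = None
    if start is not None:
        stalls.append((start, last))  # still stalled at the end of the data
    return stalls
```

Lining the window boundaries up against the key-exchange interval would show whether recovery really tracks the rekey, as observed above.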

khoff · February 24, 2022, 4:30pm

In the case I was describing, the F300 would connect enough to begin the TTLS handshake, but on the RADIUS server, you would see an endless loop of the same RADIUS requests.

What you’re describing sounds like a bug we reported when using the F300 with RADIUS and VLANs on earlier versions of the firmware (pre-4.5.0). When the F300 would re-authenticate after a few hours, it would stop passing traffic on the data VLAN. A power cycle would resolve the issue. Turned out that the script that runs on the radio to set up the VLANs was broken on re-authentication (it was passing the wrong arguments to brctl or something like that). I looked it up and that ticket was from May 2020. Ahh, here it is in the release notes from 4.5.4…

ACG-9628 Fixed an issue for when the data traffic is not passed via the DATA VLAN on a reconnecting SM whose wireless security setting is set to RADIUS(EAP-TLS).

Hope that helps.

JLLC · February 24, 2022, 4:42pm

Here are the scenarios we can break the radio in:

straight bridged.
bridged with separate mgmt and data vlans
nat with a shared mgmt & data vlan

We obviously haven’t tested every scenario. The commonality above is EAP-TTLS. If I switch to WPA2-PSK, problem goes away.

We are not passing VLAN data back in our RADIUS response. Those are manually programmed into the radio upon deployment.

Our ticket has been open since March 2020. I came here to see if anyone was successfully running RADIUS. Considering 2 users responded, my assumption is most users do PSK due to simplicity and either manually enable/disable customers, or run through a walled garden or Powercode BMU, etc.

Thank you for taking the time to respond.

khoff · February 24, 2022, 6:00pm

In our case, it was bridged SMs, data/mgmt VLAN configured on the CPE (not by RADIUS VSA), and EAP-TTLS for auth. Switching to PSK resolved the issue in testing. Are you running 4.5.4 or newer on the SM? If not, start there.

If it helps, you can reference our ticket 214959.

JLLC · February 24, 2022, 6:18pm

We can reproduce the issue all the way up to 4.6.2.

Some of the top level Cambium techs get auto-generated email blasts from us when our test units drop. They’re well aware there is an issue. I was hoping someone else found a work around because “we’re working on it” as their monthly update is getting a little old after having this product in release for 3 years and us identifying the issue for them 2 years ago.

I think they might be going down UBNT’s old faithful road of “see we fixed it” by releasing a completely new product line, such as the 4000/400 series.

Douglas_Generous · February 24, 2022, 7:15pm

Out of curiosity, have you entered a management and a data VLAN on the radio prior to using RADIUS EAP-TTLS? If not, you should, since RADIUS VSAs update the current config and it must not be null. We had that issue when we first set up EAP-TTLS on our network. Because our NASes are Cisco, we were digging in the Cisco documentation and found that config values being changed must not be null; set to 0 is OK, but if null they wouldn’t get updated.

As for traffic just up and stopping without the wireless side dropping, we found the majority of our repeat offenders had power issues that would lockup part of the radios. Not 100% sure if that applies to your issue, but it is something to consider.

JLLC · February 24, 2022, 8:34pm

Yes, the VLANs are being populated prior to connecting.

I would think that if RADIUS was populating the VLANs there would be indications such as the vlan reporting differently, non-vlan populated bridged connections still working, etc.

The radios will self-correct if left alone. In my experience, that doesn’t happen with radios that lock up due to brown outs.

Also, we’ve ruled out power by running tests behind battery backup.

Douglas_Generous · February 24, 2022, 9:21pm

Do you have smart speed enabled?
This little “feature” caused us a lot of grief until we disabled it. Similar issue to what you report: zero data passing but the radio has great uptime, a quick reboot solves it for now, and then it seems to randomly just stop again.

After this, I would be guessing hardware issues, but that would need to be correlated with serial numbers and MAC addresses to see if you got a batch that’s acting up.
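
One rough way to do that correlation, as a sketch only: group the misbehaving radios by MAC OUI and by a leading chunk of the serial number, then see whether one group dominates. The helper names, the assumption that a serial prefix identifies a production batch, and the sample values are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def normalize_oui(mac: str) -> str:
    """Strip separators and return the first three octets (the OUI) in upper case."""
    hex_digits = "".join(c for c in mac if c.isalnum()).upper()
    return hex_digits[:6]


def batch_counts(
    radios: Iterable[Tuple[str, str]], serial_prefix_len: int = 6
) -> Counter:
    """Count radios per (OUI, serial prefix) from (mac, serial) pairs."""
    return Counter(
        (normalize_oui(mac), serial[:serial_prefix_len].upper())
        for mac, serial in radios
    )


if __name__ == "__main__":
    # Made-up MACs and serials, for illustration only.
    failing = [
        ("aa:bb:cc:00:00:01", "E3XX0001"),
        ("AA-BB-CC-00-00-02", "E3XX0002"),
        ("11:22:33:00:00:01", "E4YY0009"),
    ]
    for batch, n in batch_counts(failing, serial_prefix_len=4).most_common():
        print(batch, n)
```

Running the same count over the whole fleet gives the baseline to compare against, since a batch only stands out if its share of failures is higher than its share of deployed radios.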