Crazy connection failure problem

I've seen this problem on two different links now, at different times.

ePMP1000 5.8, 2.3.4 firmware, WPA2, used as a PtP link, absolutely clear channel. (noise floor is -100dB +-50MHz from center)

Reboot or make radio changes to master end. Radio link occasionally fails to restore. Watching it, I see it try to reconnect every 90 seconds or so, each time telling me it fails due to 'invalid security key'. (which was not changed, and is not changed from then until the link finally DOES come back up)

Watching the monitor->wireless page I see it start to associate, and the stats show it at a -62dB (estimated) downlink RSSI, and -94dB to -97dB uplink, then drops the link.

Further radio changes or reboots usually restore normal behavior, but it usually takes 10 minutes or more, multiple reboots (both ends) etc.

My best guess based on what I'm seeing is that the transmit power on the slave/SM end is NOT sufficient at first for it to successfully register. I don't know what changes it (or restores it) and I don't know why it shows up as an authentication error in the log. (well, the last I guess is because the authentication key gets garbled travelling over a too-weak uplink)

Both links where I've seen this happening are on DFS channels.

j


log snip from AP/BHM end - rebooted and failed link three times:

Dec 31 19:00:19 JSG2LVF kernel: VAP device ath0 created Feb 4 14:51:44 JSG2LVF crond[1449]: time disparity of 23717991 minutes detected Feb 4 14:54:20 JSG2LVF hostapd: ath0: STA 00:04:56:c9:a6:87 IEEE 802.11: associated Feb 4 14:54:28 JSG2LVF kernel: STA[00:04:56:c9:a6:87] aid=1 disassociated. Reason: INVALID SECURITY KEY Feb 4 14:54:28 JSG2LVF hostapd: ath0: STA 00:04:56:c9:a6:87 IEEE 802.11: deauthenticated due to local deauth request Feb 4 14:54:28 JSG2LVF hostapd: ath0: STA 00:04:56:c9:a6:87 IEEE 802.11: disassociated Feb 4 14:56:05 JSG2LVF hostapd: ath0: STA 00:04:56:c9:a6:87 IEEE 802.11: associated Feb 4 14:56:05 JSG2LVF kernel: tdd_ratectrl_stat_add: suspicious ARQ statistic, ARQ fn = 48604, RA last FN = 0 Feb 4 14:56:13 JSG2LVF kernel: STA[00:04:56:c9:a6:87] aid=1 disassociated. Reason: INVALID SECURITY KEY Feb 4 14:56:13 JSG2LVF hostapd: ath0: STA 00:04:56:c9:a6:87 IEEE 802.11: deauthenticated due to local deauth request Feb 4 14:56:13 JSG2LVF hostapd: ath0: STA 00:04:56:c9:a6:87 IEEE 802.11: disassociated Feb 4 14:57:50 JSG2LVF hostapd: ath0: STA 00:04:56:c9:a6:87 IEEE 802.11: associated Feb 4 14:57:58 JSG2LVF kernel: STA[00:04:56:c9:a6:87] aid=1 disassociated. Reason: INVALID SECURITY KEY Feb 4 14:57:58 JSG2LVF hostapd: ath0: STA 00:04:56:c9:a6:87 IEEE 802.11: deauthenticated due to local deauth request Feb 4 14:57:58 JSG2LVF hostapd: ath0: STA 00:04:56:c9:a6:87 IEEE 802.11: disassociated

This may seem like an odd question, but do you have layer-2 redundancy in the network and are you using STP?  I've seen a similar behavior and have traced it to a specific senario where there is a redundant path and the root bridge is upsream from the AP.

No and no.

j

Update - I switched one of the links to 5210 MHz and still have this problem.  The dowstream end of that link experienced a brownout 45 minutes ago, as of right now we've still not gotten the link back up.  

Multiple reboots of both units, changed frequency settings, no change.

j

Hi Joel, 

This cannot be a power issue. For authentication, managements messages are used and these messages go out at max power and at MCS1 (by default). Have you tried switching AP Management Packet Rate to MCS0 (under Configuration->Radio)? Also just as a test, have you tried changing Security mode to "Open" to see if the problem goes away?

Lastly, you mentioned these are on DFS channels. Do you see the "Radar (DFS) Detections" counter incrementing under Monitor->Performance page?

Thanks,
Sriram

Haven't tried MCS0 vs MCS1 yet, nor 'Open' vs WPA2. 

The most recent occurance of this (yesterday for about an hour) was with the link running on 5210, NOT a DFS frequency.  If it happens again on the link still running a DFS channel then I'll take note of the DFS counter.

I really don't want to screw with this link - the other end of it serves 3 retail businesses, plus the City of Hamlet and Chamber of Commerce.  Total seven customers, and all seven use it regularly, and will be on the phone within moments if the link drops.  (reboot the AP and some of them will call before it restores sessions)

Boss already decided if it barfs again he's replacing it with a UBNT powerbridge or an Orthogon.  We can't afford questionable stability on that backhaul.

j

Hi newkirk.  IMHO, if it's had this  'crazy problem' a few times already, but if you're unwilling to try the suggestions to narrow the cause down, then it's pretty much guaranteed to 'barf again', sooner or later. :)

I'm not trying to be a jerk - but it's unlikely that it's going to magicly fix itself, and it's also unlikely that Cambium is going to be able to figure out what's going on without you being willing to work with them, right?  :)

I'm working on a similar issue.  We have two NON-SYNC 5GHz ePMP connectorized units in PTP mode, Flexible, and were running 5320 freq.  The issue is every once in a random while, we lose connectivity to the SM side site.  Now, a few things of note.  When it happens the RF does not disconnect, CNS shows no drops, or lost packets.  The radio on the AP side shows the SM connected.  However, we are using /30's on these BH's, and so in an effort to deteremine if we are truly losing the link we just recently moved the gateway of the SM side to the AP side so if it happens again, we will know if we can still reach the SM BH.  We are also running OSPF.  There was a known issue we had in the past with UBNT gear and OSPF, it involved Spanning Tree and IP aliasing and an older FW version.  I don't believe it's related, but worth mentioning.  Generally speaking what we believe to be the core of the problem here is some issue with the connection, wether it's ethernet, RF, or some other network component, is causing the OSPF session to drop, and to re-establish the session we have to reboot the AP side BH.  Now, we moved from 5320 to 5240 to rule out a possible intereference issue on that channel, we also forced the SM to full power and increased the target RF on the AP for the SM to -40, all in an attempt to rule out as much RF condition issues as we could.  We also have these BH's on the newest firmware, as that was one of our first steps in troubleshooting, however the BH's still presented the problem with the newest released FW.  So now we wait, if it happens again, we are hoping we will have a better understanding with the gateway moved as to where the drop is truely happening.  If it's in the RF connection, even though it's maintaining session, yet it's not passing traffic, then the units will likely be replaced.  However, it may help us identify a different problem such as ethernet or 600SS issues.  I'll keep everyone posted on our troubleshooting progress.


@ninedd wrote:

Hi newkirk.  IMHO, if it's had this  'crazy problem' a few times already, but if you're unwilling to try the suggestions to narrow the cause down, then it's pretty much guaranteed to 'barf again', sooner or later. :)

I'm not trying to be a jerk - but it's unlikely that it's going to magicly fix itself, and it's also unlikely that Cambium is going to be able to figure out what's going on without you being willing to work with them, right?  :)


Next few days I'm going to be trying to reproduce the behavior on the bench.   The problem so far has only arisen when one of the radios has lost power or been rebooted.  (probably because I haven't made any changes that would cause the session to be dropped and re-established, but of course I don't know that)  The downstream end of that link was NOT (don't ask why, I've no clue) on a UPS, but now it is.

At the moment, it's stable.  If it goes down again, I need to send two crews out to resolve it hands-on, and the owner has already stated that if that happens he wants them to replace the gear with different tech, rather than fussing with the ePMP.  So the situation is such that if I test and probe the link I risk causing it to fail, leading to the boss replacing it rather than letting me continue testing.  Best case at that point would be leaving it in place as a  dead-end link for testing while another link carries traffic.

I'm not averse to working with Cambium to test/debug/resolve problems, as they can attest.  But if I do so on this particular backhaul it likely won't achieve anything useful unless the link fails, at which point the boss blames the outage and customer complaints on either myself or Cambium, and replaces the link with different gear.

j

OK, another one running on 5210 dropped today and exhibited the same symptoms.  The slave end was set at the time to 20+40 scanning all channels, no ap preference, wpa2 only.  Master end coloed with 4 ePMP 5.8 APs and a PTP58600, slave with one ePMP 5.8 AP.

Changed MCS1 to MCS0 on master, same failure.

Ran ACS on master (moved to 5240 with noise floor at -97 vs -95 at 5210), same failure.

Changed master to Open v WPA2 (knowing slave was WPA2-only), same failure, switched back to WPA2 and link came up.

Now I got into slave end and enabled Open along with WPA2.  Radio link drops when I apply, and same registration/authentication failures resume.

Again change master to Open, link comes up.  Clear and re-enter WPA2 key both ends, still won't work secured but Open will link up.

I also noted the slave has been up 6 days, master 9 days, so no reboot triggered this, no settings changed.

If I have some time I'll experiment more on this link after 1am. 

j

Joel, 

Thanks for the information. Is it possible to leave the link on Open security for long? At least, as long as the time needed to see the failure with WPA2. Hopefully you can MAC restrict on the Master so only the Slave you want connects to it. Also, does your WPA2 password have any non-ASCII characters. Not that it matters but checking anyway. 

Thanks,

Sriram

Currently running Open, will leave it that way for a few days.

Still need to lock it down at both ends to each other, will wait til 1am though unless something happens before.  (master currently locked to 'first mac' iirc)

Passphrase all ASCII: uppercase letters and digits, 11 characters.  (other link that failed previously also all ASCII)

Also, both units running 2.3.4, didn't mention that in previous post.

j

Thanks Joel.

Next time you see the issue, please grab syslog from the radio like you usually do. It will help us troubleshoot further. 

Thanks,

Sriram

i've seen some drop offs and this on my sm

Dec 31 18:00:40 SM203 kernel: tdd_ratectrl_stat_add: suspicious ARQ statistic, ARQ fn = 33431, RA last FN = 0

and i do have two switches using rstp

Hi all,
we are facing a similiar issue.
We have a epmp AP, with several clients (all with 2.5.1 software). They are domestic customers.
One of the epmp 1000 SM is configured as bridge and has a mikrotik AP behind it.

These SMs suffer random reachability issues and connectivity loss. This happens to several of them but not at the same time. About once or twice a day, each one.
They also remain connected to the base station (AP) (link uptime doesn't reset) and they don't restart themselves (electric uptime doesn't reset).

We get these logs from the epmpm 1000 with mikrotik AP behind.

Sep  1 00:00:54 Rilancio_Scuola_San_Secondo kernel: Flags 0, Scheduler Mode 0
Sep  1 00:00:54 Rilancio_Scuola_San_Secondo kernel: connection is UP
Sep  1 00:00:54 Rilancio_Scuola_San_Secondo kernel: SM associated with AP[00:04:56:cd:9c:39]
Sep  1 00:00:54 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for 00:0c:42:9f:fe:04, through ni=00:04:56:cd:9c:39
Sep  1 00:00:55 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for 4c:5e:0c:74:48:83, through ni=00:04:56:cd:9c:39
Sep  1 00:00:56 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for 4c:5e:0c:0a:cc:5b, through ni=00:04:56:cd:9c:39
Sep  1 00:00:57 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for 4c:5e:0c:b8:15:8b, through ni=00:04:56:cd:9c:39
Sep  1 00:00:57 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for 00:04:56:cd:d2:a8, through ni=00:04:56:cd:9c:39
Sep  1 00:00:58 Rilancio_Scuola_San_Secondo ifup: Enabling Router Solicitations on lan (br-lan)
Sep  1 00:00:58 Rilancio_Scuola_San_Secondo firewall: removing lan (br-lan) from zone lan
Sep  1 00:00:59 Rilancio_Scuola_San_Secondo firewall: adding lan (br-lan) to zone lan
Sep  1 00:01:03 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for d4:ca:6d:f0:ed:3b, through ni=00:04:56:cd:9c:39
Sep  1 00:01:04 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for d4:ca:6d:4e:70:1f, through ni=00:04:56:cd:9c:39
Sep  1 00:01:04 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for 4c:5e:0c:06:0a:e5, through ni=00:04:56:cd:9c:39
Sep  1 00:01:04 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for 4c:5e:0c:bf:09:40, through ni=00:04:56:cd:9c:39
Sep  1 00:01:04 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for 00:0c:42:cb:b8:5b, through ni=00:04:56:cd:9c:39
Sep  1 00:01:06 Rilancio_Scuola_San_Secondo kernel: tdd_ratectrl_stat_add: suspicious ARQ statistic, ARQ fn = 33041, RA last FN = 0
Oct 28 14:30:19 Rilancio_Scuola_San_Secondo kernel: Adding WDS entry for 00:0c:42:9f:fe:04, through ni=00:04:56:cd:9c:39
Oct 28 14:36:17 Rilancio_Scuola_San_Secondo admin[172.25.0.32]: web_interface: User 'admin' is logged in...
Oct 28 14:38:26 Rilancio_Scuola_San_Secondo admin[172.25.0.116]: web_interface: User 'admin' is logged in...


The MAC addresses listed (00:0c:42 and 4c:5e:0c) are Mikrotik clients associated with the Mikrotik AP. 00:04:56:cd:d2:a8 is the bridged SM itself. 00:04:56:cd:9c:39 is the epmp1000 AP.

Focus on this:

Sep  1 00:00:54 Rilancio_Scuola_San_Secondo kernel: connection is UP
[...]
Sep  1 00:00:58 Rilancio_Scuola_San_Secondo firewall: removing lan (br-lan) from zone lan
Sep  1 00:00:59 Rilancio_Scuola_San_Secondo firewall: adding lan (br-lan) to zone lan
[...]
Sep  1 00:01:06 Rilancio_Scuola_San_Secondo kernel: tdd_ratectrl_stat_add: suspicious ARQ statistic, ARQ fn = 33041, RA last FN = 0

What do these mean? Does the antenna reset some interfaces?