Some DHCP clients get a 0.0.0.0 address

@Ashok sadly we have had the fix going for awhile now and nothing has changed im sending you more tech files now

@Ashok also i did see it on the 2.4 not just the 5ghz so sadly that is ruled out i hadnt seen any 2.4 earlier when previously got tech files but i did see 2.4 showing same symptoms this time

Thanks @Network_Support_Spec will go through the logs and get back to you.

Unfortunately, problem is back, so @Ashok was correct that factory reset didn’t actually fix it. Not sure why it took so much longer to come back this time, though. Very interesting.

I went ahead and ran the tests Ashok wanted me to run the next time that I saw the issue, and sent him the techsupport files that I generated. Hopefully you guys will soon manage to unearth the root cause of the problem…

It would be a good idea to put all debugging info into the ticket for preserving in a known location for historical purposes and for making available for others to review.

Yes @Pete_C we have created one jira internally to track this issue.

@Ashok also to follow up we tried factory reset as well and did not seem to change anything… intially seemed to work maybe a day or so but may have just been coincidence

I am happy to update the Cambium support ticket that I have open with the TS files I’ve generated recently. I have just been e-mailing directly with Ashok since that seemed to be the preferred method of communication, and I’m trying to do whatever I can to have this issue investigated and addressed as expeditiously as possible.

We are seeing the 0.0.0.0 IP allocation on a mix of XV2-2 and E505s. One WLAN seems to be fine but the 2nd WLAN gets the zeros. Both WLANs get addresses from the same DHCP server.

Do you need more tech support files?

Thanks @Network_Support_Spec for the update. Trying to replicate the same issue here in our lab unfortunately it is not reproduced. Right now working on @nathana setup to find any clue.

If it required is it possible to provide remote access to your setup if it is required. So that it will be helpful in further debugging.

1 Like

I strongly suspect that specific conditions in the environment are tripping this bug (for example, a particular client chipset that behaves in a certain way and transmits some pattern of traffic), and once this occurs, it puts the AP into a bad state. This is why you have been unable to reproduce, and why trigger in our environment has been so random.

It seems like there should be some clue when this happens, such as unusual entries in kernel dmesg from Atheros driver, or something along those lines.

you would just need remote dashboard access correct?

Okay, I think I finally found something.

I issued ‘service radio iwpriv --all’ on both a working AP and on an AP that is experiencing the problem. I then diff’d them against each other to compare them.

Remember that, at least in our experience, the particular VAP that is experiencing the problem is always one of the open/unsecured ones. (It also always seems to be on 5GHz, but still unclear if that is coincidence or not.)

There was an obvious difference with one iwpriv attribute between working and non-working states: when the open 5GHz VAP is working, ‘authmode’ is 1, but when it is not working, ‘authmode’ is 3. So something is changing this.

Information on the Atheros iwpriv ioctls is hard to come by (probably protected by confidentiality / NDA). But according to internet sources (what little has leaked), everybody seems to be in agreement that ‘authmode’ of 1 means open. So it should be 1. (Also, the working VAP on 2.4GHz shows ‘authmode’ of 1 while the broken one on 5GHz shows 3, which is further evidence.)

It is less clear what 3 means. Some sources say WEP + 802.1x. Others say WPA1-EAP.

(All of the (working) VAPs with WPA2-PSK are showing ‘authmode’ of 6.)

Anyway, wlan16 is the Linux interface name for the VAP that keeps failing. And if I run ‘service radio iwpriv wlan16 authmode 1’, then problem instantly gets fixed without rebooting the E410.

I went ahead and rebooted all APs shortly afterward, just in case that would make the problem come back more quickly. Only a few hours later, it did on the one AP. Immediately after reboot, everything was still working and ‘authmode’ for the VAP showed 1. After it stopped working again, I re-checked ‘authmode’ on that same VAP interface, and it was back to 3.

One other difference I noticed is that ‘uciphers’ and ‘ucastcipher’ are both normally 1 on VAP that is set to open security. But after the problem happens and ‘authmode’ changes to 3, I also see that both ‘uciphers’ and ‘ucastcipher’ show 0 instead. (On working 2.4GHz open VAP, and on all other open VAPs on other E410, they show 1.)

Still unclear what is triggering this, but I did notice that the AP changed channels on 5GHz sometime between the last reboot, and when the problem started happening again. It is possible that this is not a coincidence. (Though clearly it doesn’t happen every time the Pilot decides to change channels, otherwise it would be happening on all of our E410s in the same AP Group.) But this might still explain why it seems to be happening on 5GHz only: it could be that the 5GHz radio is more likely to change channels than 2.4GHz, due to DFS requirements in the U.S.

At this point I do not believe this is either a bridging bug or a “coplane” bug. I’d guess either a wifid bug (when it brings VAPs down and back up during channel change maybe?), or a bug in the underlying Atheros kernel driver or HAL.

2 Likes

Wow. That is some awesome troubleshooting and great information. Hopefully this is key to resolving the issue.

1 Like

THEORY CONFIRMED.

I rebooted all of the APs again to see if I could once again trigger the issue, but this time I put a client in place that I scripted to constantly disconnect/reconnect from/to the problematic SSID every few seconds. The hope was that with this, I could pinpoint the exact time that it stopped working, in order to collect a techsupport file as close to the event as possible. (It seems like lots of the old log entries have already expired by the time we manage to get around to generating a techsupport file, which in part is what has been making determining the trigger so difficult.)

To my great fortune, within 30 minutes of the reboot, the problem happened again. And I was still sitting down at my computer watching things when it happened, so within 5 minutes of the event I managed to collect a techsupport file as well as make some additional observations and tests.

The problem coincided EXACTLY with “Auto RF” picking a new channel. When Pilot first booted up, it had picked channel 36 on the 5GHz. After roughly 30 minutes, it changed to channel 52, and these events were logged:

WIFI_AUTORF_ICA_COMPLETE - Initial Channel Assignment complete on [5GHz] radio
WIFI_AUTORF_CHANNEL_SWITCH - Channel switched from [36] to [52] on [5GHz] radio, [High Intf on channel]

At precisely that exact moment, my test client suddenly became unable to complete a DHCP exchange.

For completeness’ sake, I both collected a techsupport file after a fresh reboot while things were confirmed working, as well as collected one immediately after this happened. In addition, I also issued ‘service radio iwpriv wlan16 get_authmode’ on the CLI right after a fresh reboot, as well as immediately after the problem occurred. And sure enough: upon fresh boot and while things were working, wlan16 (first VAP on 5GHz radio, set to ‘open’ security) returned a 1 for the authmode iwpriv ioctl, and then returned a 3 once the problem started happening.

So the trigger is quite clearly Auto RF. Auto RF only changed the 5GHz channel, and that AP has a tendency to first select channel 36 on bootup. If the algorithm it is using always tends to find “High Intf [interference]” on that channel in that particular AP’s particular location, then that explains why this specific AP is experiencing the problem as frequently as it is: it picks 36 first, always finds interference on it, and switches away from it. At that point, the problem occurs.

After I was able to reproduce the issue so cleanly, I then decided to test setting the first VAP to have WPA2-PSK security instead, and repeat the test. I did so, and this time, when Auto RF picked a new 5GHz channel, that SSID continued to work: ‘authmode’ showed 6 before and after Auto RF did its thing, and data continued to flow properly on that VAP. So this demonstrates that the problem is only happening on VAPs that are set to ‘open’ security (or, more accurately, it is at the very least not happening to VAPs configured for WPA2-PSK security).

I think Auto RF or the Atheros driver itself is misconfiguring the open-security VAPs whenever it has to bring them down and then back up after making a channel change.

At the very least, all of this implies that one can temporarily work around the issue for now by setting specific Pilots to use specific channels, instead of using auto channel, while we wait for a fix.

Frankly, this is kind of a ridiculous bug…how did this one squeak through QA…

I will be attaching my techsupport files to my open ticket.

4 Likes

Great diagnosing and sleuthing. I hope this is appreciated, and an official solution comes quickly.

So although I can repeat this experiment nearly 100% every time with the particular AP at the particular site, I’m having difficulty coming up with a “minimum viable config” that can repro the issue. If I simply take a new AP out of the box, default it, and configure a single open-security VAP on 5GHz with Auto-RF enabled, the problem doesn’t happen after a channel change.

I’m still trying to confirm, but it’s looking like it might be a combination of 1) open-security + 2) auto-rf + 3) channel-list on the 5GHz radio being set to ‘all-channels’. And that perhaps it only triggers if Auto-RF picks a DFS channel, and moves from a non-DFS channel to the new DFS channel.

Okay, yes, this is exactly what’s going on: the problem is triggered when Auto-RF selects a DFS channel to switch to.

So the conditions that must be met for this bug to trigger are as follows:

  1. There must be at least one open-security VAP
  2. It must be running on 5GHz
  3. Auto-RF must be enabled on the underlying radio
  4. ‘all-channels’ must be set on the underlying radio
  5. The Auto-RF subsystem decides to change channels, and it changes from a non-DFS channel to a DFS one

Here is my minimum viable config, and steps to reproduce:

wireless radio 1
shutdown
exit

wireless radio 2
channel-width 40
channel-list all-channels
off-channel-scan
auto-rf
exit

wireless wlan 1
ssid Test
band 5GHz
exit

For bug reproduction to succeed, we need the cnPilot to pick a non-DFS channel initially, and then switch to a DFS channel.

  1. Load software 4.2.2.1-r3 onto an E410.
  2. ‘delete config’ and reboot to ensure it is running at defaults.
  3. Apply & save the above sample config.
  4. If ‘show wireless radios’ shows that 5GHz radio picked a DFS channel from the start, then force it to try to immediately pick a new channel with the following:
wireless radio 2
channel 36
apply
channel auto
apply
exit

(just issuing ‘shutdown’ and ‘no shutdown’ does not work! you need to actually set non-auto channel, then set back to auto channel)

  1. If ‘show wireless radios’ continues to show a DFS channel, then repeat step 4 until it finally picks a non-DFS one.
  2. Use a second AP to generate noise on the channel that your test E410 picked. I do this by setting a second AP to use the same channel, connecting a client to it, and then running a continuous transmit (e.g., iperf test)
  3. Wait for Auto-RF on the test E410 to decide channel utilization exceeds the acceptable threshold, and performs a channel change (can take 30+ minutes).

If Auto-RF picks a DFS channel, then you can issue ‘service radio iwpriv wlan16 get_authmode’, and this will return ‘get_authmode:3’. This indicates the bug has been triggered. At this point you can try to connect a wireless client to the Test SSID, and will be able to verify that you cannot move any data through that VAP.

If Auto-RF changes the channel but picks another non-DFS one, then the bug will not be triggered (‘get_authmode’ will show ‘1’ and wireless clients will have no trouble moving data when connected to Test SSID), and you will need to try again until you can get the Auto-RF subsystem to pick a DFS channel.

I previously commented that I had a hard time understanding how such a bug – an Auto-RF channel change event breaking all open-security VAPs – could have managed to avoid being caught by the engineers. But now that it’s clear there is this additional condition that must be met (channel must change to a DFS one), it’s a bit more understandable how this could have been missed.

EDIT: It also looks like maybe ‘channel-width 40’ might also be necessary for bug to be tripped.

3 Likes

For the record, 6.4.2.1 does not appear to suffer from this bug.

(‘authmode’ mappings also seem to maybe be different in this version of firmware, FYI. Have not dug deeply enough yet to be sure though. I’m getting back ‘4’ when a VAP is configured as open, but some sources suggest this is an “auto” mode, controlled by external daemon. In which case it might be ‘4’ regardless of actual config, even when using WPA.)

6.4.x feels very unfinished/“beta” in many other ways though. I was able to get Auto-RF to trip (and VAP still worked after that happened, thankfully), but…

E410-xxxxxx(config)# show events
Aug 21 07:52:17 WIFI-6-AUTORF-CHANNEL-SWITCH Channel switched from [48] to [128(20MHz)] on radio [2] band [5GHz], [DCS triggered]
[...]

E410-xxxxxx(config)# show auto-rf
Unable to process JSON response: Invalid JSON message received
E410-xxxxxx(config)# show auto-rf channel-info
E410-xxxxxx(config)# show auto-rf history
E410-xxxxxx(config)#

…seriously? None of the “show auto-rf” commands work at all?

Is 6.4.x on E4xx/E5xx/E6xx models known to be unfinished / in a beta state, and I either just missed the memo or nobody bothered to explicitly state it anywhere?

Nathan,
Thanks for troubleshooting and providing very useful information. I really appreciate your troubleshooting efforts to narrow down the root cause of the issue. It will help us in fixing the issue quickly.

Also can you please confirm issue is not seen when channel-width is configured as 20MHz