Some DHCP clients get a 0.0.0.0 address

Thanks @Network_Support_Spec for the update. Trying to replicate the same issue here in our lab unfortunately it is not reproduced. Right now working on @nathana setup to find any clue.

If it required is it possible to provide remote access to your setup if it is required. So that it will be helpful in further debugging.

1 Like

I strongly suspect that specific conditions in the environment are tripping this bug (for example, a particular client chipset that behaves in a certain way and transmits some pattern of traffic), and once this occurs, it puts the AP into a bad state. This is why you have been unable to reproduce, and why trigger in our environment has been so random.

It seems like there should be some clue when this happens, such as unusual entries in kernel dmesg from Atheros driver, or something along those lines.

you would just need remote dashboard access correct?

Okay, I think I finally found something.

I issued ‘service radio iwpriv --all’ on both a working AP and on an AP that is experiencing the problem. I then diff’d them against each other to compare them.

Remember that, at least in our experience, the particular VAP that is experiencing the problem is always one of the open/unsecured ones. (It also always seems to be on 5GHz, but still unclear if that is coincidence or not.)

There was an obvious difference with one iwpriv attribute between working and non-working states: when the open 5GHz VAP is working, ‘authmode’ is 1, but when it is not working, ‘authmode’ is 3. So something is changing this.

Information on the Atheros iwpriv ioctls is hard to come by (probably protected by confidentiality / NDA). But according to internet sources (what little has leaked), everybody seems to be in agreement that ‘authmode’ of 1 means open. So it should be 1. (Also, the working VAP on 2.4GHz shows ‘authmode’ of 1 while the broken one on 5GHz shows 3, which is further evidence.)

It is less clear what 3 means. Some sources say WEP + 802.1x. Others say WPA1-EAP.

(All of the (working) VAPs with WPA2-PSK are showing ‘authmode’ of 6.)

Anyway, wlan16 is the Linux interface name for the VAP that keeps failing. And if I run ‘service radio iwpriv wlan16 authmode 1’, then problem instantly gets fixed without rebooting the E410.

I went ahead and rebooted all APs shortly afterward, just in case that would make the problem come back more quickly. Only a few hours later, it did on the one AP. Immediately after reboot, everything was still working and ‘authmode’ for the VAP showed 1. After it stopped working again, I re-checked ‘authmode’ on that same VAP interface, and it was back to 3.

One other difference I noticed is that ‘uciphers’ and ‘ucastcipher’ are both normally 1 on VAP that is set to open security. But after the problem happens and ‘authmode’ changes to 3, I also see that both ‘uciphers’ and ‘ucastcipher’ show 0 instead. (On working 2.4GHz open VAP, and on all other open VAPs on other E410, they show 1.)

Still unclear what is triggering this, but I did notice that the AP changed channels on 5GHz sometime between the last reboot, and when the problem started happening again. It is possible that this is not a coincidence. (Though clearly it doesn’t happen every time the Pilot decides to change channels, otherwise it would be happening on all of our E410s in the same AP Group.) But this might still explain why it seems to be happening on 5GHz only: it could be that the 5GHz radio is more likely to change channels than 2.4GHz, due to DFS requirements in the U.S.

At this point I do not believe this is either a bridging bug or a “coplane” bug. I’d guess either a wifid bug (when it brings VAPs down and back up during channel change maybe?), or a bug in the underlying Atheros kernel driver or HAL.

2 Likes

Wow. That is some awesome troubleshooting and great information. Hopefully this is key to resolving the issue.

1 Like

THEORY CONFIRMED.

I rebooted all of the APs again to see if I could once again trigger the issue, but this time I put a client in place that I scripted to constantly disconnect/reconnect from/to the problematic SSID every few seconds. The hope was that with this, I could pinpoint the exact time that it stopped working, in order to collect a techsupport file as close to the event as possible. (It seems like lots of the old log entries have already expired by the time we manage to get around to generating a techsupport file, which in part is what has been making determining the trigger so difficult.)

To my great fortune, within 30 minutes of the reboot, the problem happened again. And I was still sitting down at my computer watching things when it happened, so within 5 minutes of the event I managed to collect a techsupport file as well as make some additional observations and tests.

The problem coincided EXACTLY with “Auto RF” picking a new channel. When Pilot first booted up, it had picked channel 36 on the 5GHz. After roughly 30 minutes, it changed to channel 52, and these events were logged:

WIFI_AUTORF_ICA_COMPLETE - Initial Channel Assignment complete on [5GHz] radio
WIFI_AUTORF_CHANNEL_SWITCH - Channel switched from [36] to [52] on [5GHz] radio, [High Intf on channel]

At precisely that exact moment, my test client suddenly became unable to complete a DHCP exchange.

For completeness’ sake, I both collected a techsupport file after a fresh reboot while things were confirmed working, as well as collected one immediately after this happened. In addition, I also issued ‘service radio iwpriv wlan16 get_authmode’ on the CLI right after a fresh reboot, as well as immediately after the problem occurred. And sure enough: upon fresh boot and while things were working, wlan16 (first VAP on 5GHz radio, set to ‘open’ security) returned a 1 for the authmode iwpriv ioctl, and then returned a 3 once the problem started happening.

So the trigger is quite clearly Auto RF. Auto RF only changed the 5GHz channel, and that AP has a tendency to first select channel 36 on bootup. If the algorithm it is using always tends to find “High Intf [interference]” on that channel in that particular AP’s particular location, then that explains why this specific AP is experiencing the problem as frequently as it is: it picks 36 first, always finds interference on it, and switches away from it. At that point, the problem occurs.

After I was able to reproduce the issue so cleanly, I then decided to test setting the first VAP to have WPA2-PSK security instead, and repeat the test. I did so, and this time, when Auto RF picked a new 5GHz channel, that SSID continued to work: ‘authmode’ showed 6 before and after Auto RF did its thing, and data continued to flow properly on that VAP. So this demonstrates that the problem is only happening on VAPs that are set to ‘open’ security (or, more accurately, it is at the very least not happening to VAPs configured for WPA2-PSK security).

I think Auto RF or the Atheros driver itself is misconfiguring the open-security VAPs whenever it has to bring them down and then back up after making a channel change.

At the very least, all of this implies that one can temporarily work around the issue for now by setting specific Pilots to use specific channels, instead of using auto channel, while we wait for a fix.

Frankly, this is kind of a ridiculous bug…how did this one squeak through QA…

I will be attaching my techsupport files to my open ticket.

4 Likes

Great diagnosing and sleuthing. I hope this is appreciated, and an official solution comes quickly.

So although I can repeat this experiment nearly 100% every time with the particular AP at the particular site, I’m having difficulty coming up with a “minimum viable config” that can repro the issue. If I simply take a new AP out of the box, default it, and configure a single open-security VAP on 5GHz with Auto-RF enabled, the problem doesn’t happen after a channel change.

I’m still trying to confirm, but it’s looking like it might be a combination of 1) open-security + 2) auto-rf + 3) channel-list on the 5GHz radio being set to ‘all-channels’. And that perhaps it only triggers if Auto-RF picks a DFS channel, and moves from a non-DFS channel to the new DFS channel.

Okay, yes, this is exactly what’s going on: the problem is triggered when Auto-RF selects a DFS channel to switch to.

So the conditions that must be met for this bug to trigger are as follows:

  1. There must be at least one open-security VAP
  2. It must be running on 5GHz
  3. Auto-RF must be enabled on the underlying radio
  4. ‘all-channels’ must be set on the underlying radio
  5. The Auto-RF subsystem decides to change channels, and it changes from a non-DFS channel to a DFS one

Here is my minimum viable config, and steps to reproduce:

wireless radio 1
shutdown
exit

wireless radio 2
channel-width 40
channel-list all-channels
off-channel-scan
auto-rf
exit

wireless wlan 1
ssid Test
band 5GHz
exit

For bug reproduction to succeed, we need the cnPilot to pick a non-DFS channel initially, and then switch to a DFS channel.

  1. Load software 4.2.2.1-r3 onto an E410.
  2. ‘delete config’ and reboot to ensure it is running at defaults.
  3. Apply & save the above sample config.
  4. If ‘show wireless radios’ shows that 5GHz radio picked a DFS channel from the start, then force it to try to immediately pick a new channel with the following:
wireless radio 2
channel 36
apply
channel auto
apply
exit

(just issuing ‘shutdown’ and ‘no shutdown’ does not work! you need to actually set non-auto channel, then set back to auto channel)

  1. If ‘show wireless radios’ continues to show a DFS channel, then repeat step 4 until it finally picks a non-DFS one.
  2. Use a second AP to generate noise on the channel that your test E410 picked. I do this by setting a second AP to use the same channel, connecting a client to it, and then running a continuous transmit (e.g., iperf test)
  3. Wait for Auto-RF on the test E410 to decide channel utilization exceeds the acceptable threshold, and performs a channel change (can take 30+ minutes).

If Auto-RF picks a DFS channel, then you can issue ‘service radio iwpriv wlan16 get_authmode’, and this will return ‘get_authmode:3’. This indicates the bug has been triggered. At this point you can try to connect a wireless client to the Test SSID, and will be able to verify that you cannot move any data through that VAP.

If Auto-RF changes the channel but picks another non-DFS one, then the bug will not be triggered (‘get_authmode’ will show ‘1’ and wireless clients will have no trouble moving data when connected to Test SSID), and you will need to try again until you can get the Auto-RF subsystem to pick a DFS channel.

I previously commented that I had a hard time understanding how such a bug – an Auto-RF channel change event breaking all open-security VAPs – could have managed to avoid being caught by the engineers. But now that it’s clear there is this additional condition that must be met (channel must change to a DFS one), it’s a bit more understandable how this could have been missed.

EDIT: It also looks like maybe ‘channel-width 40’ might also be necessary for bug to be tripped.

3 Likes

For the record, 6.4.2.1 does not appear to suffer from this bug.

(‘authmode’ mappings also seem to maybe be different in this version of firmware, FYI. Have not dug deeply enough yet to be sure though. I’m getting back ‘4’ when a VAP is configured as open, but some sources suggest this is an “auto” mode, controlled by external daemon. In which case it might be ‘4’ regardless of actual config, even when using WPA.)

6.4.x feels very unfinished/“beta” in many other ways though. I was able to get Auto-RF to trip (and VAP still worked after that happened, thankfully), but…

E410-xxxxxx(config)# show events
Aug 21 07:52:17 WIFI-6-AUTORF-CHANNEL-SWITCH Channel switched from [48] to [128(20MHz)] on radio [2] band [5GHz], [DCS triggered]
[...]

E410-xxxxxx(config)# show auto-rf
Unable to process JSON response: Invalid JSON message received
E410-xxxxxx(config)# show auto-rf channel-info
E410-xxxxxx(config)# show auto-rf history
E410-xxxxxx(config)#

…seriously? None of the “show auto-rf” commands work at all?

Is 6.4.x on E4xx/E5xx/E6xx models known to be unfinished / in a beta state, and I either just missed the memo or nobody bothered to explicitly state it anywhere?

Nathan,
Thanks for troubleshooting and providing very useful information. I really appreciate your troubleshooting efforts to narrow down the root cause of the issue. It will help us in fixing the issue quickly.

Also can you please confirm issue is not seen when channel-width is configured as 20MHz

@Ashok, I have now tried multiple times to reproduce with 20MHz channel-width, and I can confirm that I cannot make the problem happen. So 20MHz is also a work-around.

It is a very strange bug indeed, that seems to require multiple components to all line up: 5GHz DFS, auto-rf, open-security, and channel widths > 20. My hunch is that this is some sort of timing/deadlock situation, where when a 40MHz DFS channel is selected, the AP has to complete radar detect both for the main carrier as well as the second HT carrier, and somehow the security config of the VAP gets stuck in a bad state, or isn’t properly initialized completely, because things either aren’t happening in the right order or DFS channel select takes longer than some other subsystem is expecting it to.

(I have not yet tried 80MHz.)

For now, I am working around the issue by simply using ‘channel-list prefer-non-dfs’. This is actually the default on 5GHz out-of-the-box. I don’t remember setting this to all-channels, and I suspect that though non-DFS might be the default in the actual AP firmware, all-channels might be the default in cnMaestro.

2 Likes

Thanks @nathana for confirming.
We are working on this issue in our lab and will try to fix the issue ASAP.
Yes, the default value of Channel-list in cnMaestro(all-channels) and AP(non-dfs) both are different.

@Ashok so is the changing range to 20 the official fix atm or is there an update resolving this? just wanting an official yay or nay here so i can take this back and be like here is an approved solution.

@Network_Support_Spec

We have found the root cause of the issue and fix will be available in the upcoming release.

1 Like

@Network_Support_Spec The other workarounds that you can use until official fix comes (if you need wider than 20MHz channels on 5GHz) is to either use static 5GHz channels instead of auto-channel, or to set 5GHz radio channel-mode to “prefer-non-dfs”.

was just wondering what was the Root cause so we can refrence this and is there a ETA on the firmware update/what will it be called? version ect.

Will that EPSK fix make it into this release too?

@Springs The firmware branch in question that is being discussed is 4.x. If I had to guess, the release will likely be versioned as 4.2.4.

Your EPSK VLAN bug as I understand it is in the 6.x firmware branch. The bug being discussed here (where open-security WLANs stop working on 5GHz when the radio decides to switch channels and picks a DFS channel) doesn’t exist in the 6.x branch as far as I’ve been able to tell, so I suppose upgrading to 6.x would be one other possible work-around to those suffering from this issue, assuming there isn’t some other feature that they happen to be relying on that is buggy in 6.x (out of the frying pan, into the fire).

Of course, 4.x is not applicable to XV-series (WiFi 6) hardware, only to older e-series (WiFi 5) hardware, and my understanding is also that you are working with XV hardware. I believe XV-series shipped with 6.x firmware on day 1. Recently it was ported to the older e-series stuff, though I have avoided deploying it in production since I have not had too much time to extensively test the 6.x releases made for e4xx/5xx.

has the new firmware been rolled out yet?