THEORY CONFIRMED.
I rebooted all of the APs again to see if I could once again trigger the issue, but this time I put a client in place that I scripted to constantly disconnect/reconnect from/to the problematic SSID every few seconds. The hope was that with this, I could pinpoint the exact time that it stopped working, in order to collect a techsupport file as close to the event as possible. (It seems like lots of the old log entries have already expired by the time we manage to get around to generating a techsupport file, which in part is what has been making determining the trigger so difficult.)
To my great fortune, within 30 minutes of the reboot, the problem happened again. And I was still sitting down at my computer watching things when it happened, so within 5 minutes of the event I managed to collect a techsupport file as well as make some additional observations and tests.
The problem coincided EXACTLY with “Auto RF” picking a new channel. When Pilot first booted up, it had picked channel 36 on the 5GHz. After roughly 30 minutes, it changed to channel 52, and these events were logged:
WIFI_AUTORF_ICA_COMPLETE - Initial Channel Assignment complete on [5GHz] radio
WIFI_AUTORF_CHANNEL_SWITCH - Channel switched from [36] to [52] on [5GHz] radio, [High Intf on channel]
At precisely that exact moment, my test client suddenly became unable to complete a DHCP exchange.
For completeness’ sake, I both collected a techsupport file after a fresh reboot while things were confirmed working, as well as collected one immediately after this happened. In addition, I also issued ‘service radio iwpriv wlan16 get_authmode’ on the CLI right after a fresh reboot, as well as immediately after the problem occurred. And sure enough: upon fresh boot and while things were working, wlan16 (first VAP on 5GHz radio, set to ‘open’ security) returned a 1 for the authmode iwpriv ioctl, and then returned a 3 once the problem started happening.
So the trigger is quite clearly Auto RF. Auto RF only changed the 5GHz channel, and that AP has a tendency to first select channel 36 on bootup. If the algorithm it is using always tends to find “High Intf [interference]” on that channel in that particular AP’s particular location, then that explains why this specific AP is experiencing the problem as frequently as it is: it picks 36 first, always finds interference on it, and switches away from it. At that point, the problem occurs.
After I was able to reproduce the issue so cleanly, I then decided to test setting the first VAP to have WPA2-PSK security instead, and repeat the test. I did so, and this time, when Auto RF picked a new 5GHz channel, that SSID continued to work: ‘authmode’ showed 6 before and after Auto RF did its thing, and data continued to flow properly on that VAP. So this demonstrates that the problem is only happening on VAPs that are set to ‘open’ security (or, more accurately, it is at the very least not happening to VAPs configured for WPA2-PSK security).
I think Auto RF or the Atheros driver itself is misconfiguring the open-security VAPs whenever it has to bring them down and then back up after making a channel change.
At the very least, all of this implies that one can temporarily work around the issue for now by setting specific Pilots to use specific channels, instead of using auto channel, while we wait for a fix.
Frankly, this is kind of a ridiculous bug…how did this one squeak through QA…
I will be attaching my techsupport files to my open ticket.