Okay, I think I finally found something.
I ran ‘service radio iwpriv --all’ on both a working AP and an AP that is experiencing the problem, then diffed the two outputs.
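For anyone who wants to repeat the comparison, here is a rough sketch of the diff step in Python. It assumes the two outputs have already been saved as text; the sample lines in the demo are made up for illustration and are not the real output format of ‘service radio iwpriv --all’, which I am not reproducing here.

```python
import difflib


def diff_dumps(working: str, broken: str) -> list:
    """Return only the lines that differ between two captured dumps.

    Lines prefixed "- " appear only in the working dump,
    lines prefixed "+ " appear only in the broken dump.
    """
    delta = difflib.ndiff(working.splitlines(), broken.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in delta if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical sample data; the real dump format may look different.
    working_dump = "wlan16 authmode 1\nwlan16 uciphers 1\nwlan16 hide_ssid 0"
    broken_dump = "wlan16 authmode 3\nwlan16 uciphers 0\nwlan16 hide_ssid 0"
    for line in diff_dumps(working_dump, broken_dump):
        print(line)
```

Because this is a plain line-by-line text diff, it does not depend on knowing what any attribute means, only on the two dumps listing attributes in the same order.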
Remember that, at least in our experience, the particular VAP that is experiencing the problem is always one of the open/unsecured ones. (It also always seems to be on 5GHz, but it is still unclear whether that is a coincidence.)
There was an obvious difference with one iwpriv attribute between working and non-working states: when the open 5GHz VAP is working, ‘authmode’ is 1, but when it is not working, ‘authmode’ is 3. So something is changing this.
Information on the Atheros iwpriv ioctls is hard to come by (probably protected by confidentiality / NDA). But according to internet sources (what little has leaked), everybody seems to be in agreement that ‘authmode’ of 1 means open. So it should be 1. (Also, the working VAP on 2.4GHz shows ‘authmode’ of 1 while the broken one on 5GHz shows 3, which is further evidence.)
It is less clear what 3 means. Some sources say WEP + 802.1x. Others say WPA1-EAP.
(All of the (working) VAPs with WPA2-PSK are showing ‘authmode’ of 6.)
Anyway, wlan16 is the Linux interface name for the VAP that keeps failing. If I run ‘service radio iwpriv wlan16 authmode 1’, the problem is fixed instantly, without rebooting the E410.
I went ahead and rebooted all APs shortly afterward, to see whether that would make the problem come back more quickly. Only a few hours later, it did come back on that one AP. Immediately after the reboot, everything was still working and ‘authmode’ for the VAP showed 1. After it stopped working again, I re-checked ‘authmode’ on that same VAP interface, and it was back to 3.
One other difference I noticed is that ‘uciphers’ and ‘ucastcipher’ are both normally 1 on a VAP that is set to open security. But after the problem happens and ‘authmode’ changes to 3, both ‘uciphers’ and ‘ucastcipher’ show 0 instead. (On the working 2.4GHz open VAP, and on all other open VAPs on the other E410s, they show 1.)
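To make the symptom easy to check for, here is a small Python sketch that compares a VAP's values against what I see on a healthy open VAP (‘authmode’ 1, ‘uciphers’ 1, ‘ucastcipher’ 1). The expected values come only from my own observations above, not from any Atheros documentation, and the function name and the dict input are my own invention; how you get the values off the AP is up to you.

```python
from __future__ import annotations

# Values observed on open VAPs that are working. Taken from observation only,
# not from vendor documentation.
EXPECTED_OPEN_VAP = {"authmode": 1, "uciphers": 1, "ucastcipher": 1}


def find_mismatches(observed: dict[str, int]) -> dict[str, tuple]:
    """Map each attribute that deviates to (expected, observed).

    An attribute missing from `observed` is reported with observed=None.
    """
    return {
        name: (want, observed.get(name))
        for name, want in EXPECTED_OPEN_VAP.items()
        if observed.get(name) != want
    }


if __name__ == "__main__":
    # Values seen on the broken 5GHz open VAP (wlan16) after the problem occurs.
    broken = {"authmode": 3, "uciphers": 0, "ucastcipher": 0}
    for name, (want, got) in find_mismatches(broken).items():
        print(f"{name}: expected {want}, got {got}")
```

An empty result means the VAP matches the healthy open-VAP values; anything else matches the failure signature described above.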
It is still unclear what is triggering this, but I did notice that the AP changed channels on 5GHz sometime between the last reboot and when the problem started happening again. It is possible that this is not a coincidence. (Though clearly it doesn’t happen every time the Pilot decides to change channels, otherwise it would be happening on all of our E410s in the same AP Group.) But this might still explain why it seems to be happening on 5GHz only: it could be that the 5GHz radio is more likely to change channels than the 2.4GHz radio, due to DFS requirements in the U.S.
At this point I do not believe this is either a bridging bug or a “coplane” bug. I’d guess either a wifid bug (perhaps when it brings VAPs down and back up during a channel change), or a bug in the underlying Atheros kernel driver or HAL.