Some DHCP clients get a 0.0.0.0 address

nathana · August 21, 2022, 2:04am

Okay, yes, this is exactly what’s going on: the problem is triggered when Auto-RF selects a DFS channel to switch to.

So the conditions that must be met for this bug to trigger are as follows:

There must be at least one open-security VAP
It must be running on 5GHz
Auto-RF must be enabled on the underlying radio
‘all-channels’ must be set on the underlying radio
The Auto-RF subsystem decides to change channels, and it changes from a non-DFS channel to a DFS one

Here is my minimum viable config, and steps to reproduce:

wireless radio 1
shutdown
exit

wireless radio 2
channel-width 40
channel-list all-channels
off-channel-scan
auto-rf
exit

wireless wlan 1
ssid Test
band 5GHz
exit

For bug reproduction to succeed, we need the cnPilot to pick a non-DFS channel initially, and then switch to a DFS channel.

Load software 4.2.2.1-r3 onto an E410.
‘delete config’ and reboot to ensure it is running at defaults.
Apply & save the above sample config.
If ‘show wireless radios’ shows that 5GHz radio picked a DFS channel from the start, then force it to try to immediately pick a new channel with the following:

wireless radio 2
channel 36
apply
channel auto
apply
exit

(just issuing ‘shutdown’ and ‘no shutdown’ does not work! you need to actually set non-auto channel, then set back to auto channel)

If ‘show wireless radios’ continues to show a DFS channel, then repeat step 4 until it finally picks a non-DFS one.
Use a second AP to generate noise on the channel that your test E410 picked. I do this by setting a second AP to use the same channel, connecting a client to it, and then running a continuous transmit (e.g., iperf test)
Wait for Auto-RF on the test E410 to decide channel utilization exceeds the acceptable threshold, and performs a channel change (can take 30+ minutes).

If Auto-RF picks a DFS channel, then you can issue ‘service radio iwpriv wlan16 get_authmode’, and this will return ‘get_authmode:3’. This indicates the bug has been triggered. At this point you can try to connect a wireless client to the Test SSID, and will be able to verify that you cannot move any data through that VAP.

If Auto-RF changes the channel but picks another non-DFS one, then the bug will not be triggered (‘get_authmode’ will show ‘1’ and wireless clients will have no trouble moving data when connected to Test SSID), and you will need to try again until you can get the Auto-RF subsystem to pick a DFS channel.

I previously commented that I had a hard time understanding how such a bug – an Auto-RF channel change event breaking all open-security VAPs – could have managed to avoid being caught by the engineers. But now that it’s clear there is this additional condition that must be met (channel must change to a DFS one), it’s a bit more understandable how this could have been missed.

EDIT: It also looks like maybe ‘channel-width 40’ might also be necessary for bug to be tripped.