Stochastic disappearance and reappearance of e510

waw · June 29, 2020, 5:09pm

Hello Community,

My installation consists of 10 e510s in the following setup:

-A public access network (vlan11) and a management network (vlan92) are routed to the internet via a Linux router, as shown in network_topology.png
-E510s are configured using cnMaestro. See the vlan settings in cnmaestro_vlan_settings.png.

After about a week of normal operation one e510 went offline without any apparent reason. The next day, a 2h power cut brought the whole network down. After power was restored, the device reappeared fully operational BUT, another e510 went offline...

Trying to understand what is going on, I ran netdiscover and tcpdump on the Linux router. In both cases, I got the following results:
a) the offline AP is discoverable with netdiscover, at the management network interface (eth1)
b) the offline AP shows up with THREE IP addresses (all at the same MAC, the e510 ethernet MAC): its expected IP address on the management network, an IP address from the public access network and 192.168.0.1. I must repeat that all 3 IPs are found on eth1 which is the management network, which is not directly connected to the public access network.
c) the offline AP is continuously trying to communicate with the router (172.20.2.2) but apparently fails to receive the replies, as can be seen in the tcpdump output (172.20.2.78 is the offline e510):
19:39:42.728061 ARP, Request who-has 172.20.2.2 tell 172.20.2.78, length 46
19:39:42.728074 ARP, Reply 172.20.2.2 is-at 00:0c:29:XX:XX:XX (oui Unknown), length 28
...<ignoring irrelevant capture>...
19:39:43.720053 ARP, Request who-has 172.20.2.2 tell 172.20.2.78, length 46
19:39:43.720078 ARP, Reply 172.20.2.2 is-at 00:0c:29:XX:XX:XX (oui Unknown), length 28
...<ignoring irrelevant capture>...
19:39:59.733948 ARP, Request who-has 172.20.2.2 tell 172.20.2.78, length 46
19:39:59.733961 ARP, Reply 172.20.2.2 is-at 00:0c:29:XX:XX:XX (oui Unknown), length 28
...<ignoring irrelevant capture>...
19:40:00.726303 ARP, Request who-has 172.20.2.2 tell 172.20.2.78, length 46
19:40:00.726313 ARP, Reply 172.20.2.2 is-at 00:0c:29:XX:XX:XX (oui Unknown), length 28
19:40:01.734339 ARP, Request who-has 172.20.2.2 tell 172.20.2.78, length 46
19:40:01.734362 ARP, Reply 172.20.2.2 is-at 00:0c:29:XX:XX:XX (oui Unknown), length 28
d) the offline AP does not respond to ping requests on its proper IP on the management network, nor is it reachable by any other means.
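
For longer captures, the pattern in (c) can be summarised with a short script rather than read by eye. The following is a minimal Python sketch, assuming the text format tcpdump printed above; the function name `summarize_arp` and the regular expressions are illustrative, not part of any tool. A sender that keeps repeating the same request even though a reply goes out each time is probably not receiving those replies.

```python
import re
from collections import Counter

# Patterns for the two kinds of ARP line that tcpdump printed in the capture above.
ARP_REQUEST = re.compile(
    r"^\d{2}:\d{2}:\d{2}\.\d+ ARP, Request who-has (?P<target>\S+) tell (?P<sender>\S+),"
)
ARP_REPLY = re.compile(
    r"^\d{2}:\d{2}:\d{2}\.\d+ ARP, Reply (?P<ip>\S+) is-at (?P<mac>\S+)"
)

def summarize_arp(lines):
    """Count ARP requests per (sender, target) pair and replies per answered IP."""
    requests = Counter()
    replies = Counter()
    for line in lines:
        line = line.strip()
        m = ARP_REQUEST.match(line)
        if m:
            requests[(m["sender"], m["target"])] += 1
            continue
        m = ARP_REPLY.match(line)
        if m:
            replies[m["ip"]] += 1
    return requests, replies

# Four lines copied from the capture above.
sample = """\
19:39:42.728061 ARP, Request who-has 172.20.2.2 tell 172.20.2.78, length 46
19:39:42.728074 ARP, Reply 172.20.2.2 is-at 00:0c:29:XX:XX:XX (oui Unknown), length 28
19:39:43.720053 ARP, Request who-has 172.20.2.2 tell 172.20.2.78, length 46
19:39:43.720078 ARP, Reply 172.20.2.2 is-at 00:0c:29:XX:XX:XX (oui Unknown), length 28
""".splitlines()

requests, replies = summarize_arp(sample)
for (sender, target), count in requests.items():
    print(f"{sender} asked for {target} {count} times; {replies[target]} replies were sent")
```

The counts only describe what the router's interface saw. They cannot show whether a reply actually reached the AP.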

Can anyone help me understand what is going on there? My feeling is that it has to do with vlans and the way they are implemented at e510s, which does not seem to be the way I would expect.

Before I thank you for taking the time to consider this weird behaviour, please take into account that these 10 APs are an extension to a network with another 17 Ubiquiti APs, which has been operating seamlessly for 4 years now. I really do not expect the problem to have anything to do with the rest of the infrastructure.

Thank you very much in advance!
Niko

network_topology.png (22.1 KB)
cnmaestro_vlan_settings.png (22 KB)

TrevorM · June 29, 2020, 5:54pm

You have not mentioned the layer 2 vlan configuration for eth1 on the AP (this is the AP group configuration under network->ethernet ports). Given your network diagram it should be set to trunk and allowed VLANs should include 11 and 92. I assume native VLAN is at the default value of 1, though this is not necessary as you don't actually use vlan 1 in your network. What is the corresponding switchport configuration on the ethernet switch?

You have three layer 3 interfaces configured for DHCP on your AP (vlan1, vlan11 and vlan92).

I assume you don't have a server that could give an IP to VLAN1, so it falls back to the default VLAN1 IP, which is 192.168.0.1. That's one of the IPs you're seeing. Assuming the layer 2 vlan config is set as above you will see arp responses for this IP sent with no 802.1q header.

VLAN1 also takes on a zeroconf IP address in the 169.254.0.0/16 subnet. That's the second IP address you're seeing. You can disable this IP in the AP group config under network->VLANs by disabling the zeroconf setting for that VLAN interface. Assuming the layer 2 vlan config is set as above you will see arp responses for this IP sent with no 802.1q header.

I assume the third IP is for VLAN92. Assuming the layer 2 vlan config is set as above you will see arp responses for this IP sent with an 802.1q vlan tag with vid 92. What about the ARP request and response in your packet capture below - do you see tags on both (tcpdump -e, assuming you've port mirrored the switch port connected to the AP)?
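
To make "sent with an 802.1q vlan tag with vid 92" concrete: a tagged Ethernet frame carries the value 0x8100 where the EtherType normally sits, followed by a 16-bit field whose low 12 bits are the VLAN ID. Below is a minimal Python sketch of that check. The function name `vlan_id` and the sample frames (including the source MAC) are made up for illustration and are not taken from the capture.

```python
import struct

TPID_8021Q = 0x8100  # value in the EtherType position that marks an 802.1Q tag

def vlan_id(frame: bytes):
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if it is untagged."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 12-13 follow the destination and source MAC addresses.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != TPID_8021Q:
        return None
    if len(frame) < 18:
        raise ValueError("truncated 802.1Q tag")
    (tci,) = struct.unpack_from("!H", frame, 14)
    return tci & 0x0FFF  # low 12 bits of the tag control field are the VLAN ID

# Hypothetical frames: broadcast destination, made-up source MAC, ARP EtherType 0x0806.
macs = bytes.fromhex("ffffffffffff" "020000000001")
untagged = macs + struct.pack("!H", 0x0806) + bytes(28)
tagged_92 = macs + struct.pack("!HH", TPID_8021Q, 92) + struct.pack("!H", 0x0806) + bytes(28)

print(vlan_id(untagged))   # None
print(vlan_id(tagged_92))  # 92
```

This is the same information tcpdump -e prints in its link-level header output, so the script only helps when working from raw frame bytes.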

waw · June 30, 2020, 1:29pm

Thanks for your reply, I think it led me to the root of the problem.

>You have not mentioned the layer2 vlan configuration for eth1 on the AP

>What is the corresponding switchport configuration on the ethernet switch?

At the AP it was: allowed vlans:11,92. Native vlan 92 - tagged

The hp 1820g switch has the AP port participating in 11 and 92, both tagged.

I believe that declaring vlan92 as native, and especially as tagged, is the reason at least for the AP not receiving arp replies. My hypothesis is this:

-the AP sends arp requests from an interface which does not tag traffic (vlan1 iface?), but tags them with the native vlan tag (92) before transmitting to the switch port

-since I had vlan92 native tagged, the reply is transmitted from the switch port to the AP with the vlan92 tag and thus does not make it to the interface listening for untagged traffic

I have changed the configuration on the APs to: native vlan 1 untagged and allowed vlans 11,92, hoping for my problem to disappear. The 9 APs that are online have synced with the new configuration and they are still online. As this is a remote installation, someone will power cycle the offline AP tomorrow and I hope it will sync as well.

>What about the ARP request and response in your packet capture below - do you see tags on both

>(tcpdump -e, assuming you've port mirrored the switch port connected to the AP).

This is a remote installation and I can not have access to a mirrored port (a second switch is also in between and the 2 switches are wirelessly connected).

Thanks again, I will update the thread in the coming days with the outcome.

TrevorM · June 30, 2020, 2:52pm

Your original configuration sounds correct - if it is 11 and 92 tagged on the HP end, on the AP end it should be allowed vlans:11,92. Native vlan 92 - tagged

However, the fact that you're seeing ARP responses for 192.168.0.1 suggests that there is some VLAN mismatch happening. Was the config with VLAN tags pushed through cnMaestro, or were the APs configured through the local UI?

If you look under Jobs->configuration in cnMaestro it will tell you whether the config push succeeded from the cnMaestro point of view.

You mentioned that there is another switch between the HP and AP. Could that or the wireless link be modifying the tagging?

waw · July 1, 2020, 12:03am

The AP is back online. It was a switch configuration issue. I must have missed saving the switch configuration. Up to the moment of the power failure the AP was accessible because the switch was on the ‘running configuration’ where vlan 92 was tagged. After the power failure, the switch booted with the saved configuration where vlan92 was untagged at the AP’s port, making the AP inaccessible. What remains to be investigated is why the APs reply to arp queries at 192.168.0.1. I will update the thread if I find out. Apologies for my lack of thoroughness and thanks for the suggestions.
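
The difference between the running and the saved switch configuration can be illustrated with a toy model of the switch-to-AP direction only. This is a simplified Python sketch, not a description of how the HP 1820G or the e510 are implemented. The class names are hypothetical, the membership of vlan 11 in the saved configuration is assumed, and the rule that the AP drops untagged frames when its native vlan is set to tagged is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SwitchPort:
    tagged_vlans: frozenset       # VLANs sent out of this port with an 802.1Q tag
    untagged_vlan: Optional[int]  # VLAN sent out of this port without a tag, if any

    def egress_tag(self, vlan: int) -> Optional[int]:
        """Tag placed on a frame of `vlan` leaving the port; None means untagged."""
        if vlan in self.tagged_vlans:
            return vlan
        if vlan == self.untagged_vlan:
            return None
        raise ValueError(f"port is not a member of vlan {vlan}")

@dataclass(frozen=True)
class ApUplink:
    allowed_vlans: frozenset
    native_vlan: int
    native_tagged: bool

    def ingress_vlan(self, tag: Optional[int]) -> Optional[int]:
        """VLAN an incoming frame is assigned to, or None if it is dropped.

        Assumption: with a tagged native VLAN, untagged frames are dropped.
        """
        if tag is None:
            return None if self.native_tagged else self.native_vlan
        return tag if tag in self.allowed_vlans else None

# Running config before the power cut: vlan 92 tagged on the AP's port.
running = SwitchPort(tagged_vlans=frozenset({11, 92}), untagged_vlan=None)
# Saved config loaded after the power cut: vlan 92 untagged (vlan 11 assumed unchanged).
saved = SwitchPort(tagged_vlans=frozenset({11}), untagged_vlan=92)
# AP uplink as originally configured: allowed vlans 11,92, native vlan 92 tagged.
ap = ApUplink(allowed_vlans=frozenset({11, 92}), native_vlan=92, native_tagged=True)

for name, port in (("running", running), ("saved", saved)):
    delivered = ap.ingress_vlan(port.egress_tag(92))
    print(f"{name} config: management frame lands in vlan {delivered}")
```

Under these assumptions the running configuration delivers management traffic to vlan 92 on the AP, while the saved configuration sends it untagged and the AP never hands it to its vlan 92 interface, which matches the AP repeating ARP requests that the router had already answered.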