ePMP 1000 3.5.1: connection loss on backhaul makes 90% of SMs lose communication with the AP

jparrot · February 7, 2018, 6:59pm

I have a remote site with different technologies: 2.4 GHz ePMP and 900 MHz UBNT.

We started to have problems with the 2.4 last week. The 900 is fine, so we looked for interference...

Since we found no interference, we started pinging every device on this site.

We have found the problem. We lose some pings on the PTP feeding this tower... but this is not the main issue.

When the PTP loses a ping, 90 to 100% of the SMs registered to our ePMP 2.4 stop responding. We have to click the Deregister button to regain access to the SMs. The 900 and the other technologies at the same site continue to work correctly after the ping loss, but for some unknown reason the 2.4 ePMP stops working right after it.

Dmitry_Moiseev · February 8, 2018, 11:42pm

Hi!

What kind of PTP are you using to feed the tower? How often does it happen? Can you monitor the PTP link interface and see if there is anything that looks like a broadcast storm during the ping loss?

Thanks,

Dmitry

jparrot · February 12, 2018, 12:44pm

We have found the problem. The link is an airFiber from UBNT. We are also using a ToughSwitch from UBNT.

We have two ePMP radios on ports 1 and 2 of the switch, and two 900 UBNT radios on ports 3 and 4.

The PTP is on port 8.

On ToughSwitch port 8 I see some errors:

RXFcsErr: 40887 and increasing.

Every time that error counter increases (on the PTP port), we lose customers on the ePMP radios. We also see ping loss on the 900 radios, but their SMs stay online.
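For anyone who wants to check the same correlation on their own towers, here is a minimal Python sketch. It assumes you already have periodic samples of the port's RXFcsErr counter and a list of times when the SMs stopped answering pings; how you collect them (SNMP, the switch UI, a script) is up to you. The function names, the 10-second window, and every sample value after the first are made up for illustration.

```python
def counter_increases(samples):
    """Return the timestamps at which the counter grew versus the previous sample.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    """
    return [t for (_, prev), (t, cur) in zip(samples, samples[1:]) if cur > prev]


def outages_near_increases(increase_times, outage_times, window=10):
    """Return the outages that started within `window` seconds after an increase."""
    return [
        o for o in outage_times
        if any(0 <= o - t <= window for t in increase_times)
    ]


# Hypothetical data: only the starting value 40887 comes from the post.
fcs_samples = [(0, 40887), (10, 40887), (20, 40901), (30, 40901)]
sm_outages = [22, 50]  # seconds at which SMs stopped answering pings (made up)

increases = counter_increases(fcs_samples)
print("counter increased at:", increases)
print("outages following an increase:", outages_near_increases(increases, sm_outages))
```

If most outages line up with a counter increase, that points at the PTP port rather than at RF interference on the AP sector.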

It's a strange problem, as if the packet loss on the PTP were affecting the SM link to the AP.

We have replaced the PTP and the problem is now solved, but I'm currently watching all my other towers for similar problems. We have been running this site for a long time... this problem seems to have appeared just after upgrading to 3.5.1.

thanks

Chris_Bay · February 12, 2018, 1:13pm

Could you tell us more about your network config? PPPoE, static, etc.

jparrot · February 12, 2018, 1:22pm

SMs are static, NAT with a separate IP address for management.

When the problem occurs, we still see the SM in the AP list with good signal, but get no ping reply from the management or public IP of the SM. We click the Deregister button and it comes back. It acts as if only the radio part is working and not the network.

dkeltgen · February 12, 2018, 2:34pm

Are you running STP on any of the switches? If your bad port is causing a TC (topology change), it will flush the MAC/CAM table from the switch. Because the switch no longer has the MAC addresses of the SMs, it will flood any packets destined for those unknown MAC addresses out all ports (with the exception of the receiving port). ePMP radios don't appear to pass along unknown unicast packets. Because of this, the SM will not be reachable until either the SM transmits a packet, allowing the switch to relearn the location of its MAC address, or the router's ARP entry for that device times out, which triggers a broadcast ARP request that does get passed to the SM.

To test this, clear the ARP entry on your router for a non-responsive SM and see if it starts responding.

Some solutions would be:

1. Disable STP if you're not using it.

2. Lower the ARP timeout on your router.

3. Do nothing, because the moment your client tries to use their connection the SM will TX a packet and everything will work fine.
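To make the mechanism above concrete, here is a toy Python model of a MAC-learning switch. It is only a sketch of generic learning-bridge behavior, not ToughSwitch or ePMP firmware. The class and function names are made up, and `epmp_ap_forwards` simply encodes the guess from this post that the AP drops flooded unknown unicast.

```python
class LearningSwitch:
    """Toy model of a MAC-learning Ethernet switch."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source MAC, then return (output ports, flooded flag)."""
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood out every port except the ingress port.
            return self.ports - {in_port}, True
        if out_port == in_port:
            return set(), False  # destination sits on the ingress segment
        return {out_port}, False

    def flush(self):
        """Clear the table, as a topology change would."""
        self.mac_table.clear()


def epmp_ap_forwards(flooded):
    """ASSUMPTION from this post: the AP does not pass flooded unknown unicast."""
    return not flooded


SM, ROUTER = "sm-mac", "router-mac"  # placeholder addresses
switch = LearningSwitch(range(1, 9))  # 8 ports: AP on port 1, PTP on port 8

switch.receive(1, SM, ROUTER)  # the SM transmits once, so the switch learns it
ports, flooded = switch.receive(8, ROUTER, SM)
print("before flush:", sorted(ports), "reaches SM:", epmp_ap_forwards(flooded))

switch.flush()  # MAC table cleared
ports, flooded = switch.receive(8, ROUTER, SM)
print("after flush: ", sorted(ports), "reaches SM:", epmp_ap_forwards(flooded))

switch.receive(1, SM, ROUTER)  # any frame from the SM repopulates the table
ports, flooded = switch.receive(8, ROUTER, SM)
print("after SM TX: ", sorted(ports), "reaches SM:", epmp_ap_forwards(flooded))
```

In this model the SM becomes reachable again as soon as it sends a single frame, which is why option 3 can work if the customer's traffic resumes on its own.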

jparrot · February 12, 2018, 3:52pm

STP is disabled.

Flushing the ARP table didn't help; I've already tried it.

We got a LOT of customer complaints, so waiting for a reconnect did not work.

The strange thing is that it's affecting only the ePMP radios. The other ones at the same site are connecting correctly after the error occurs.