ePMP APs randomly rebooting... in unison?

We're seeing a very weird issue affecting our entire ePMP deployment, which so far consists of 5 APs on 5 towers and 57 customers in total. The APs seem to be randomly rebooting throughout the day, but the odd thing is that they tend to do it together (based on system uptime). Generally, they don't go more than 3-4 hours without a reboot.

Uptimes:
  Tower1: 1hr40min (NON-DFS, 5.7)
  Tower2: 1hr40min (DFS, 5.5)
  Tower3: 2hr37min (NON-DFS, 5.7)
  Tower4: 2hr37min (NON-DFS, 2.4)
  Tower5: 5hr57min (DFS, 5.5, just installed a day or two ago)
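
A minimal sketch of how the "in unison" pattern can be checked from uptimes alone: devices whose uptimes match within a small tolerance must have booted at about the same moment. The uptimes are copied from the list above; the 2-minute tolerance and the function name are assumptions made for illustration.

```python
from datetime import timedelta

# Uptimes copied from the list above.
uptimes = {
    "Tower1": timedelta(hours=1, minutes=40),
    "Tower2": timedelta(hours=1, minutes=40),
    "Tower3": timedelta(hours=2, minutes=37),
    "Tower4": timedelta(hours=2, minutes=37),
    "Tower5": timedelta(hours=5, minutes=57),
}

def group_by_uptime(uptimes, tolerance=timedelta(minutes=2)):
    """Group devices whose uptime is within `tolerance` of the previous
    device in the sorted order (assumed tolerance, chained comparison)."""
    groups = []
    for name, up in sorted(uptimes.items(), key=lambda item: item[1]):
        if groups and up - groups[-1][-1][1] <= tolerance:
            groups[-1].append((name, up))
        else:
            groups.append([(name, up)])
    return [[name for name, _ in group] for group in groups]

print(group_by_uptime(uptimes))
# [['Tower1', 'Tower2'], ['Tower3', 'Tower4'], ['Tower5']]
```

With these numbers, Tower1/Tower2 and Tower3/Tower4 fall into shared groups even though each pair mixes different bands or DFS settings.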

None of these devices has been rebooted manually at any time since Friday of last week, when they were upgraded to the 2.4 software release. We've seen this issue on 2.3.4 and 2.3.3, but it may have been occurring before those. These devices are managed manually via Web/SSH and monitored via SNMP. I've tried to forcefully replicate the issue, but haven't been able to cause the devices to reboot.

I'm at a loss as to what could cause this, considering the range of frequencies and DFS/NON-DFS. The only other common point would be GPS Sync. I'm not sure how this would cause our issue, however. 

Do the APs have public IPs on them? I don't know if it affects ePMP, but once upon a time our PMP100 APs had public IP addresses on them, and this worked fine for us for many years. Then one day they all started rebooting every few hours, seemingly almost in unison. It turned out that if you flood a PMP100 AP with telnet requests, the AP will reboot, and when we looked at the data going to the radios we saw thousands of telnet requests from a handful of Asian IP addresses.

It could be that one of your monitoring tools is doing something that causes the radios to reboot.

First I want to validate my interpretation of "5 APs on 5 towers" and the information following. There are 5 towers with 1 AP on each tower, not 5 towers with 5 APs each, correct? Are the APs on the same management subnet? How many other subnets are on the APs, and are they the same between APs? Were there any log messages? What are the carrier frequency and bandwidth? I am trying to get a picture and figure out the commonality, especially since the APs listed with the same uptimes are on different towers with different frequencies.

Correct, we have currently deployed 5 ePMP APs TOTAL across 5 towers, 1 AP per tower so far. They are on the same management subnet, with only one subnet. 

Here is the syslog from an AP that just rebooted 1 hour and 20 minutes ago:

Dec 31 19:00:17 APR_ePMP1000 kernel: GPS Sync Lost. (00:00:10:23961)
Dec 31 19:00:17 APR_ePMP1000 kernel: GPS Sync Restored. (00:00:15:527935)
Dec 31 19:00:19 APR_ePMP1000 snmpd[1533]: DFS status: N/A
Mar 20 07:39:22 APR_ePMP1000 crond[1482]: time disparity of 23780919 minutes detected

Frequency and bandwidth differ from tower to tower, with the exception of Tower2 and Tower5.

Tower1: 5760MHz @ 20MHz
Tower2: 5680MHz @ 40MHz
Tower3: 5835MHz @ 40MHz
Tower4: 2442MHz @ 10MHz
Tower5: 5680MHz @ 40MHz
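
As a quick check of the channel plan listed above, the sketch below computes each channel's edges from its center frequency and width and reports which towers share spectrum. The frequencies and widths are copied from the list; the function names are illustrative, and channels that merely touch at an edge are deliberately not counted as overlapping.

```python
from itertools import combinations

# (center MHz, channel width MHz), copied from the list above.
channels = {
    "Tower1": (5760, 20),
    "Tower2": (5680, 40),
    "Tower3": (5835, 40),
    "Tower4": (2442, 10),
    "Tower5": (5680, 40),
}

def edges(center, width):
    """Return the (low, high) edges of a channel in MHz."""
    return center - width / 2, center + width / 2

def overlapping_pairs(channels):
    """Return pairs of towers whose channels share any spectrum."""
    pairs = []
    for (a, ch_a), (b, ch_b) in combinations(channels.items(), 2):
        lo_a, hi_a = edges(*ch_a)
        lo_b, hi_b = edges(*ch_b)
        if lo_a < hi_b and lo_b < hi_a:  # strict: touching edges do not count
            pairs.append((a, b))
    return pairs

print(overlapping_pairs(channels))
# [('Tower2', 'Tower5')]
```

Only Tower2 and Tower5 reuse a channel; every other pair is separated by at least 45 MHz between channel edges.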

All APs are operating on their integrated GPS Sync with the included GPS antennas.

Edit: brubble1: unfortunately no, they all have private addresses only accessible from within our network. 

Hi, 

Are you sure the APs are rebooting, or is it simply that the SMs are dropping their links to the AP? If it's the latter, then the logs you posted make sense. The APs are losing GPS sync, and when they do, the AP's transmitter shuts off, dropping all the links. This can happen for several reasons, ranging from the GPS antenna not having a clear line of sight to the sky to periodic interference near the GPS L1 frequency (about 1.575 GHz) around the towers. We've come across situations where nearby 3G UMTS repeaters have caused problems with the GPS receivers on the AP.

First, can you confirm you are running the latest GPS firmware? Under Monitor->GPS, ensure that the GPS firmware is AXN_1.51_2838.

If not, please upgrade the GPS firmware from the Tools->Software Upgrade page. 

You can also keep the transmitter on when the AP loses GPS signal by configuring a larger value under Configuration->Synchronization->Synchronization Holdoff Time.

Lastly, is there any reason why you are using GPS sync on your towers? There is only one AP per tower, and it appears that the APs' frequencies are quite far apart. If you are not reusing frequencies or using adjacent channels between towers, I would suggest using "Flexible" mode on all your APs. Flexible mode does not require GPS sync and will temporarily resolve your issue while you troubleshoot why the GPS signal keeps getting lost on these APs.

But please confirm that it's not actually an AP reboot but simply the SMs dropping their links. The logs you posted don't suggest an AP reboot.

Thanks,

Sriram

All APs are running the latest GPS firmware version. The reason why we suspect that the APs are rebooting is based on 2 factors: 1. When this happens, ALL SMs lose connection to the AP, and 2. the AP "system uptime" counter is reset to 00:00:00. 

I would be happy to disable GPS sync for the time being to test, but we eventually plan to deploy 360-degree coverage at each tower, and GPS sync would be required at that point. One or two of our towers may be near a cellular site, but the majority are not close.

FYI, I just verified that the AP becomes unreachable when this happens. Seems to be a legitimate reboot of the AP itself. I collected a packet capture of data destined for the AP up to the point where it dropped and am sorting through that now. 

Here is the syslog from the AP that just went down. System Uptime for the AP reads 2 minutes and 58 seconds:

Dec 31 19:00:17 APR_ePMP1000 kernel: GPS Sync Lost. (00:00:10:40574)
Dec 31 19:00:17 APR_ePMP1000 kernel: GPS Sync Restored. (00:00:15:938094)
Dec 31 19:00:19 APR_ePMP1000 snmpd[1534]: DFS status: N/A
Mar 20 12:06:42 APR_ePMP1000 crond[1483]: time disparity of 23781186 minutes detected

This looks like a reboot to me, as the system comes up with the default time before contacting the NTP server and adjusting the internal clock.
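
The arithmetic behind that reading can be sketched as follows: if the clock really started at the Unix epoch (which appears as Dec 31 19:00 at UTC-5), then adding crond's reported disparity to the epoch should land on the timestamp of the crond line itself. The fixed UTC-5 offset is an assumption inferred from the Dec 31 19:00 boot timestamps, and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the APs log in a fixed UTC-5 zone, inferred from the
# "Dec 31 19:00" boot timestamps (epoch 0 shown five hours behind UTC).
LOG_ZONE = timezone(timedelta(hours=-5))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def clock_after_jump(disparity_minutes):
    """Approximate wall-clock time after a jump of `disparity_minutes`
    from a clock that started at the Unix epoch."""
    return (EPOCH + timedelta(minutes=disparity_minutes)).astimezone(LOG_ZONE)

# Disparity values copied from the two crond lines in this thread.
for minutes in (23780919, 23781186):
    print(minutes, clock_after_jump(minutes).strftime("%Y-%m-%d %H:%M"))
# 23780919 2015-03-20 07:39
# 23781186 2015-03-20 12:06
```

Both results match the crond timestamps in the logs (Mar 20 07:39 and Mar 20 12:06), which is consistent with the clock having been reset to the epoch by a reboot shortly before each NTP correction.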

Hello sarnold,

I agree that it does look like a reboot. The entries in the syslog are after-the-fact entries, meaning they were written when the AP was already recovering. Would it be possible for you to set up one of the APs to forward syslog entries to a syslog server and also enable all debug levels, as shown in the pic attached? It may give us some more information about the event.

Luis

OK, I collected logs from all of our APs over the weekend. Lots of reboots, with two just this morning (7 a.m. and 9 a.m. EST).

Attached are the syslog messages. The "excerpt" file contains just the logs from the most recent reboots, whereas the other contains all the syslog messages from today.

Hello sarnold,

So the APs with IPs .143, .146 and .176 did reboot a couple of times this morning (around 7:00am and 9:00am) as you reported, all three at the same time. AP .144 seems to report its Ethernet link going down at those times too, but it does not appear to be rebooting at those times. AP .147 does not report anything out of the ordinary at those times.

Unfortunately the syslogs for the rebooting APs do not show anything that could be causing the reboot.

Could you please map these APs to the Tower information you provided previously?

How is power supplied to these APs? Brick, CMM, or other?

Would it be possible to provide remote access to some of these APs so we can look for any other evidence that could help narrow down this problem?

Thanks,

Luis

Of course, the towers match as follows:

Tower1: 5760 x.x.0.146
Tower2: 5680 x.x.0.143
Tower5: 5680 x.x.0.176

Power is supplied to all units via the included 1000base-T PoE injector.

I can absolutely provide remote access, but it would have to be tomorrow (we are Eastern Standard Time).

Edit: at the time of this writing, all APs have an 8 hour uptime. Nothing has changed in our configuration or network.  

Another two APs rebooted this morning. Please let me know when you would like remote access. We're eager to get this resolved. 


Hello sarnold,

You have PM.

Thanks,

Luis

Just to put this out there, I'm assuming there's other gear at the same tower sites that is NOT rebooting?  And that the APs in question are on power filtration and backup?

j

No mention of power at all?  What about power supplies?  Are you running them DC by chance?

Correct, the rest of the tower is unaffected. All APs are powered via their included AC/DC PoE injectors on large APC UPSes, which power the rest of the devices on the tower as well.

Looked like Luis got some good info about what's happening from our remote session today. Fingers crossed for a quick fix. 

Any news on this? I have 2, maybe 3, ePMP connectorized CPEs that seem to have this same issue. Any updates to this thread about how this issue was solved would be appreciated.

Hello tiny,

Sarnold's issue was due to certain multicast/IGMP traffic not being handled correctly by the ePMP software. The fix is already available in release 2.4.1. I don't know if you are experiencing exactly the same issue sarnold was, but it may be worth upgrading to that load anyway, if you have not already done so.
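
The thread does not say which multicast traffic triggered the bug, so the following is only a rough triage sketch: given packet summaries exported from a capture, it tallies per source how many packets are IGMP or addressed to a multicast group. The record layout `(src, dst, ip_protocol)`, the function name, and the sample addresses are all hypothetical.

```python
import ipaddress
from collections import Counter

IGMP_PROTO = 2  # IANA-assigned IP protocol number for IGMP

def tally_multicast(packets):
    """Count, per source IP, packets that are IGMP or have a multicast
    destination. `packets` is an iterable of (src, dst, ip_protocol)."""
    counts = Counter()
    for src, dst, proto in packets:
        if proto == IGMP_PROTO or ipaddress.ip_address(dst).is_multicast:
            counts[src] += 1
    return counts

# Hypothetical records, not taken from the capture discussed in this thread.
sample = [
    ("192.0.2.5", "239.255.255.250", 17),  # UDP to a multicast group
    ("192.0.2.5", "224.0.0.22", 2),        # IGMP report
    ("192.0.2.9", "192.0.2.146", 6),       # ordinary unicast TCP
    ("192.0.2.7", "224.0.0.251", 17),      # UDP to a multicast group
]
print(tally_multicast(sample).most_common())
# [('192.0.2.5', 2), ('192.0.2.7', 1)]
```

A tally like this only shows who is sending multicast toward the radio; it cannot confirm that any particular flow causes a reboot.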

Luis

I updated the SMs from 2.3.3 to 2.4.1 yesterday, and one rebooted twice this morning.  I'll update the AP, which isn't rebooting, and continue to monitor them.

I have a little more insight. These links are definitely longer than the "norm": one is 23 miles and the other is 33 miles. They only reboot when there is some amount of traffic being broadcast. When I switch the few subs on this tower back to the old backhaul, the ePMP SM stays powered up and stable. Within hours of switching the traffic back to the ePMP1000, it starts rebooting. I can make it reboot simply by running a Mikrotik bandwidth test across the ePMP1000 link.