13.2 to 13.4 System Reset Exception -- Watchdog Reset

Last week we rolled out 13.4 to our entire OFDM network after testing it in-house and on a few live APs since it was in beta.

We are seeing more frequent watchdog reset events on at least two APs.  The resets seem to be random, and the graph data we are saving does not show high traffic or high utilization before the events.

Prior to the 13.4 update, I would see one particular AP reboot randomly with the "Watchdog Reset" message about once a week. 

Since the 13.2 to 13.4 update we performed on 8/20/15, this AP has rebooted 21 times.  Another AP I found has rebooted 10 times since the upgrade.

Is anyone else still seeing Watchdog resets after upgrading to 13.4?

Cambium, since it doesn't seem to be happening to ALL of our APs, what do you suggest I try other than a downgrade and/or a replacement of the AP?

Found this on a third AP.  Watchdog resets happening daily. 

We just found a 4th 450 5G OFDM AP on our network doing this.  On this latest AP, all 17 registered SMs reboot along with it.  We run a proactive support query, and clients whose connections are re-registering or power cycling are contacted so we can solve the issue.  When everyone on this AP came up on the query, we found that the AP watchdog events line up with the SMs power cycling as well.
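For anyone who wants to run the same kind of check, here is a rough Python sketch of lining up AP reset times against SM re-registration times. It is not our actual support query. The function name `correlate_resets`, the 5-minute window, and all of the sample timestamps and SM names are made up for illustration, and it assumes you can already export both event lists with timestamps.

```python
from datetime import datetime, timedelta


def correlate_resets(ap_resets, sm_events, window=timedelta(minutes=5)):
    """For each AP reset time, list the SMs with an event within `window` of it.

    ap_resets: list of datetime objects (AP watchdog reset times).
    sm_events: list of (sm_name, datetime) tuples (SM re-reg / power-cycle times).
    """
    matches = {}
    for reset_time in ap_resets:
        matches[reset_time] = sorted(
            {sm for sm, event_time in sm_events
             if abs(event_time - reset_time) <= window}
        )
    return matches


# Hypothetical sample data, not taken from a real network.
ap_resets = [datetime(2015, 9, 1, 3, 12), datetime(2015, 9, 2, 14, 40)]
sm_events = [
    ("sm-01", datetime(2015, 9, 1, 3, 13)),
    ("sm-02", datetime(2015, 9, 1, 3, 14)),
    ("sm-01", datetime(2015, 9, 1, 18, 0)),   # unrelated re-reg, hours later
    ("sm-03", datetime(2015, 9, 2, 14, 41)),
]

matches = correlate_resets(ap_resets, sm_events)
for reset_time, sms in matches.items():
    print(reset_time.isoformat(), sms)
```

If every SM on the sector shows up next to every AP reset, the SM events are most likely a side effect of the AP going down rather than separate customer-side problems.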

Brian,

Do you have a support case open for this issue?  If not, please open one.  If so, please work with support on resolving this.  We are getting close to a maintenance release to help with this issue.

We will open a ticket soon.  We downgraded the affected APs, and I swapped one of them yesterday with a spare so we can test it in-house.

Do you still see this happen on the AP that is in house?   We have a test load we'd like to have you try if so.  

Thus far, strangely no.  We have one SM registered to it, not passing any traffic.  My next step to replicate the problem is to generate traffic.  All other settings are applied by a script, and it uses the same sync source.

Thinking that the problem would go away if we did the upgrade again, yesterday we tried it on another affected AP in the field running 13.2.   About 10 hours after we upgraded back to 13.4, it Watchdog reset again. 

Just an update.  The affected AP on the bench is NOT displaying the Watchdog reboot issue running 13.4, but neither is the AP we replaced it with.  We submitted a ticket with support (case number SC23857) and were told to wait until the next release, 13.4.1.

Aaron, if you would like this AP to try your 13.4.1 Alpha release on, I will trade you.  :-)  We do not have an easy way to generate traffic that won't affect our network or day-to-day operations.  And if you want that one, I have a few more I can pluck out of our network for you too.

Update:

With the cold weather this past week, we have seen an increasing number of 450 APs in the field Watchdog Resetting (SWDR, as the engineers call it).  As the temps fall, the frequency of the resets increases.  I have a support ticket open, but it's not progressing very fast.  Our solution thus far is to roll the affected APs back to 13.2, where there is less of an issue with the SWDR.  We like the Frame Utilization statistics and would like a fix for this ASAP!!  We are up to roughly 10 APs that can't run 13.4.  14.1.1 didn't help either.

We are experiencing the same problem and really would like to see a fix.  Clearly, if a software downgrade resolves this, it should be fixable.

I just did a count of the APs we have rolled back to 13.2, and it's 29% of all our OFDM APs.

We have reproduced the issue in our lab, and it is most prevalent in the cold (for us, between -20C and -50C), which doesn't help given the current season in the northern hemisphere!  However, because we are now able to see this repeatedly, we are digging in to resolve the issue as our highest priority.  Right now, it appears that 13.2.1 will be more stable than 13.4/13.4.1.  The 14.1.2 build on the open beta site has been reported to have better results, but the issue is still not resolved.

Keep an eye out on the Open Beta site - as soon as we have a new build that fixes the issue and passes our internal testbed, we'll have the load available for download.

I sincerely apologize for the inconvenience of this issue.  AP resets are really bad and we understand the seriousness of the problem.

Good to hear that you guys are taking the AP reboot seriously now.  It was as serious for us two years ago as it is now.

I hope you guys allocate more resources to track these bugs down quickly in the future.

Thanks,

Tushar

Hi Tushar -

Cross posting an email I just sent to the AFMUG list addressing this issue further. 

Hi Everyone –

Sorry for the delay in response on this thread.  I’d like to give an update of where we are with this issue.

First off, I would like to apologize for the issues that this is causing.  We have heard reports for a while in varying forms, and Tushar had been talking about things like this for quite some time, but we had trouble finding any correlation between the reports (configuration, network topology, etc.) and were unable to recreate the issue in our lab on demand.  This issue appears to have definitely gotten worse in the 13.4 release and is becoming more widespread as the weather turns.

What we have found out in the last several weeks is that there is an issue with the memory controller code in the FPGA.  This leads to a loss of memory coherency, which has now been verified to cause several issues.  We had seen reports of various resets over time but had no reason to correlate them to one root cause until now.  The most prevalent of these is the Watchdog Reset without any accompanying crash log.  The other issues with the same root cause are the Illegal Instruction crash, the Invalid NiBuf crash, and any Null Exception Handler crash.  The bottom line is that when memory contents glitch under your software, the outcome depends on when it happens.  We have found this to be very reproducible at very cold temperatures (-20C to -50C), but it has been seen and reported at higher temperatures, just not as often.

The nature of the FPGA based memory controller is that there can be timing issues that get exacerbated at extreme temperatures.  If you don’t have proper constraints in place for a given signal path, its timing characteristics can change on you as temperature changes.  Also, if you don’t have a proper constraint in place, even recompiling the FPGA can change the timing characteristics, making something that used to work fine susceptible to extremes.  Something happened with the 13.4 FPGA that brought this to the edge such that it is now a problem and, as we are seeing with the winter cold coming in, is becoming much more prevalent at cold temperatures.  13.4 and 13.4.1 have the same FPGA.  14.1.2 has a new FPGA and there have been some improvements made in this area, but we have found it is still susceptible to the problem.

We are reproducing the problem in our lab and we have multiple developers digging in to figure out what is going on.  These types of issues with timing are generally very difficult to find and fix, but this is our highest priority right now and we will not have another release until this is fixed. 

I’ve talked mostly about 13.4 and 13.4.1 here, but the nature of this issue and how it can interact with hardware doesn’t preclude it from having been the cause of the issues some (like Tushar) have seen over time.  Once we have a fix for this, we will be adding more rigorous regression testing including an internal HW memory test to validate that this type of memory issue doesn’t come back again.
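To give a rough idea of what a memory pattern test checks, here is a small Python sketch. It is purely illustrative and is not the internal HW memory test mentioned above: a real test would run on the target hardware against actual RAM, typically across temperature. The names `pattern_test` and `StuckBitMemory`, the chosen patterns, and the simulated fault are all assumptions made for the example.

```python
def pattern_test(mem, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each byte pattern across `mem`, read it back, and collect mismatches.

    Returns a list of (offset, expected, actual) tuples; empty means no errors.
    """
    errors = []
    for pattern in patterns:
        for offset in range(len(mem)):
            mem[offset] = pattern
        for offset in range(len(mem)):
            actual = mem[offset]
            if actual != pattern:
                errors.append((offset, pattern, actual))
    return errors


class StuckBitMemory:
    """Simulated memory where one bit at one offset always reads back as 0.

    Hypothetical fault model, used only to show the test catching an error.
    """

    def __init__(self, size, bad_offset, bad_bit):
        self._data = bytearray(size)
        self._bad_offset = bad_offset
        self._mask = ~(1 << bad_bit) & 0xFF

    def __len__(self):
        return len(self._data)

    def __setitem__(self, offset, value):
        self._data[offset] = value

    def __getitem__(self, offset):
        value = self._data[offset]
        return value & self._mask if offset == self._bad_offset else value


good = pattern_test(bytearray(64))
bad = pattern_test(StuckBitMemory(64, bad_offset=10, bad_bit=3))
print("healthy memory errors:", good)
print("faulty memory errors:", [(o, hex(e), hex(a)) for o, e, a in bad])
```

Alternating patterns such as 0xAA and 0x55 are used so that every bit is exercised in both states; a single pattern would miss a bit stuck at the value that pattern happens to write.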

From what we’ve seen and heard, this issue only affects the 450 AP FPGA and is not an issue on the 450 SM, the 430 AP/SM, or the 450i devices.  The 450i is a very different architecture and has a hardware based memory controller and watchdog timer, whereas on the 450/430 based devices these items are in the FPGA.

Again, I apologize for the severe inconvenience and realize that it is getting colder and colder in NA so we are racing against the clock with this.   As soon as we have any updates and new open beta loads with a fix, I’ll let you know.

I appreciate your patience.

Regards,

-Aaron

Has this issue now been solved?

Also, can a 450i SM on 14.1.1 be wirelessly connected to a 450 AP on 13.4, or do I need to downgrade the 450i to 13.4?

Yes, this issue is solved on R14.1.2.

It is highly recommended that all 450 platform radios be brought to this software release at a minimum.