Hi Tushar -
Cross-posting an email I just sent to the AFMUG list that addresses this issue further.
Hi Everyone –
Sorry for the delayed response on this thread. I’d like to give an update on where we are with this issue.
First off, I would like to apologize for the issues this is causing. We have heard reports in various forms for a while, and Tushar has been talking about seeing things like this for quite some time, but we had trouble finding any correlation between the reports (configuration, network topology, etc.) and were unable to recreate the issue in our lab on demand. The issue appears to have gotten worse in the 13.4 release and is becoming more widespread as the weather turns.
What we have found in the last several weeks is that there is an issue with the memory controller code in the FPGA. This causes memory coherency to be lost, which we have now verified leads to several different issues. We had seen reports of various resets over time but had no reason to tie them to one root cause until now. The most prevalent is the Watchdog Reset without any accompanying crash log. The other issues with the same root cause are the Illegal Instruction crash, the Invalid NiBuf crash, and any Null Exception Handler crash. The bottom line is that when memory contents glitch underneath the software, the outcome depends on when the glitch happens. We have found this to be very reproducible at very cold temperatures (-20C to -50C), but it has also been seen and reported at higher temperatures, just not as often.
The nature of an FPGA-based memory controller is that timing issues can be exacerbated at extreme temperatures. If you don’t have proper constraints in place for a given signal path, its timing characteristics can change on you as the temperature changes. Without a proper constraint, even recompiling the FPGA can shift those characteristics enough that something which used to work fine becomes susceptible to extremes. Something happened with the 13.4 FPGA that pushed this to the edge, so it is now a problem and, as we are seeing with the winter cold coming in, is becoming much more prevalent at cold temperatures. 13.4 and 13.4.1 have the same FPGA. 14.1.2 has a new FPGA with some improvements in this area, but we have found it is still susceptible to the problem.
We are reproducing the problem in our lab and we have multiple developers digging in to figure out what is going on. These types of issues with timing are generally very difficult to find and fix, but this is our highest priority right now and we will not have another release until this is fixed.
I’ve talked mostly about 13.4 and 13.4.1 here, but the nature of this issue and how it can interact with hardware doesn’t preclude it from having been the cause of the issues some (like Tushar) have seen over time. Once we have a fix for this, we will be adding more rigorous regression testing including an internal HW memory test to validate that this type of memory issue doesn’t come back again.
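To give a rough idea of what a memory pattern test of this general kind can look like, here is a minimal sketch in Python. It is purely illustrative and is not the internal HW memory test mentioned above: the function names, the `StuckBitMemory` fault simulator, and the use of a `bytearray` to stand in for real memory are all assumptions made for the example.

```python
def walking_ones_test(mem, addr):
    """Write each single-bit pattern to one cell and read it back.

    Returns True if every pattern reads back unchanged.
    """
    for bit in range(8):
        pattern = 1 << bit
        mem[addr] = pattern
        if mem[addr] != pattern:
            return False
    return True


def expected_value(addr):
    """Address-derived byte, so neighbouring cells hold different data."""
    return (addr * 31 + 7) & 0xFF


def address_test(mem):
    """Fill every cell with an address-derived value, then verify all cells.

    Returns the list of addresses whose contents did not read back correctly.
    """
    for addr in range(len(mem)):
        mem[addr] = expected_value(addr)
    return [addr for addr in range(len(mem)) if mem[addr] != expected_value(addr)]


class StuckBitMemory:
    """Hypothetical fault model: one cell has bits that always read as 0."""

    def __init__(self, size, stuck_addr, stuck_mask):
        self._cells = bytearray(size)
        self._stuck_addr = stuck_addr
        self._stuck_mask = stuck_mask

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, addr, value):
        self._cells[addr] = value

    def __getitem__(self, addr):
        value = self._cells[addr]
        if addr == self._stuck_addr:
            value &= ~self._stuck_mask & 0xFF  # stuck bits read back as 0
        return value


if __name__ == "__main__":
    healthy = bytearray(256)
    faulty = StuckBitMemory(256, stuck_addr=5, stuck_mask=0x02)
    print("healthy bad cells:", address_test(healthy))
    print("faulty bad cells:", address_test(faulty))
```

A test like this only catches faults that are present while it runs, so for a temperature-dependent timing problem it would need to be run repeatedly across the temperature range of interest.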
From what we’ve seen and heard, this issue only affects the 450 AP FPGA and is not an issue on the 450 SM, the 430 AP/SM, or the 450i devices. The 450i is a very different architecture with a hardware-based memory controller and watchdog timer, whereas on the 450/430-based devices these items are in the FPGA.
Again, I apologize for the severe inconvenience. I realize that it is getting colder and colder in NA, so we are racing against the clock with this. As soon as we have any updates or new open beta loads with a fix, I’ll let you know.
I appreciate your patience.
Regards,
-Aaron