"Detected Ethernet Tx stalled" issues

We've been seeing a lot of these errors on our SMs lately - the snippet of the Event Log below is continuous (not edited)

Redacted - Event Log [root]

08/18/2018 : 20:46:16 CDT : :Detected Ethernet Tx stalled - recovering now. Please contact Canopy Support at support@cambiumnetworks.com
08/20/2018 : 16:36:49 CDT : :Time Set
08/20/2018 : 17:08:45 CDT : :Time Set
08/21/2018 : 08:47:54 CDT : :Detected Ethernet Tx stalled - recovering now. Please contact Canopy Support at support@cambiumnetworks.com
08/21/2018 : 09:35:37 CDT : :Web user; user=root; Reboot from Webpage;
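
For anyone who wants to tally these across many radios, here is a minimal sketch that pulls the stall events out of Event Log text shaped like the excerpt above. The line format is inferred only from this excerpt, the function names are made up for illustration, and the timezone label is ignored, so treat it as a starting point rather than a parser for every log line the radio can produce.

```python
import re
from datetime import datetime

# Line shape assumed from the excerpt: "MM/DD/YYYY : HH:MM:SS TZ : :message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4}) : (\d{2}:\d{2}:\d{2}) (\w+) : :(.*)$"
)

STALL_TEXT = "Detected Ethernet Tx stalled"


def parse_event_log(text):
    """Return (timestamp, message) tuples for every line matching LINE_RE.

    Timestamps are naive local times; the timezone label (e.g. CDT) is dropped.
    """
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip headers, blank lines, anything unrecognized
        date_s, time_s, _tz, message = match.groups()
        stamp = datetime.strptime(f"{date_s} {time_s}", "%m/%d/%Y %H:%M:%S")
        events.append((stamp, message.strip()))
    return events


def tx_stall_times(events):
    """Return the timestamps of events that report a Tx stall."""
    return [stamp for stamp, message in events if STALL_TEXT in message]


SAMPLE = """\
08/18/2018 : 20:46:16 CDT : :Detected Ethernet Tx stalled - recovering now. Please contact Canopy Support at support@cambiumnetworks.com
08/20/2018 : 16:36:49 CDT : :Time Set
08/20/2018 : 17:08:45 CDT : :Time Set
08/21/2018 : 08:47:54 CDT : :Detected Ethernet Tx stalled - recovering now. Please contact Canopy Support at support@cambiumnetworks.com
08/21/2018 : 09:35:37 CDT : :Web user; user=root; Reboot from Webpage;
"""

if __name__ == "__main__":
    for stamp in tx_stall_times(parse_event_log(SAMPLE)):
        print(stamp.isoformat())
```

How the log text gets off the radio (copy and paste, a saved page, etc.) is left out on purpose.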

This seems to be a new-ish problem, although it is entirely possible that this behavior has always occurred and just wasn't being reported in the Event Log. Either way, it is a problem, and it still occurs with the latest firmware version, 15.2.

Our team has opened a ticket with Cambium about this, but there's not been much progress. Is anyone else noticing this? How are you handling these?

Darren - Which type of SM, and what are they connected to? Is this affecting performance, or just something you noticed in the log?

SMs are 5GHz PMP 450 and 450d SMs, connecting to PMP 450, 450i, and 450m APs. Firmware versions are mostly 15.1.5, but we are seeing it on 15.2, as well.

We first noticed these when subscribers called in to report that their internet connection was "dropping off". We looked at signal levels, ran link tests, and checked the NAT table, and found nothing unusual there, but the Event Log had these sorts of errors. Sometimes the timing made it clear that this was the cause of the symptoms, so it is performance-affecting.

Hi Darren,

Is there any way to get remote access to these devices?  When that message shows up it means we have detected and corrected the issue but we have also saved off some logs that would be helpful for us to see.  That is why there is a request to contact support.  You can reach me by DM in here with details.

Thanks!

OK, just to clarify in the public forum (I'm also communicating with Aaron privately), after further investigation, we have determined that all the SMs exhibiting this symptom have these factors in common:

  1. They are registered to PMP 450m APs
  2. They are in NAT mode
  3. Firmware version is at least 15.1 (the oldest example we've found was over a year ago)
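
If you keep an inventory of your SMs, the three factors above can be checked mechanically. This is only a sketch: the `SM` record, its field names, and the sample entries are hypothetical, and the firmware comparison assumes plain dotted numeric versions like "15.1.5".

```python
from dataclasses import dataclass


@dataclass
class SM:
    """Hypothetical inventory record; field names are made up for this sketch."""
    name: str
    ap_model: str      # model of the AP the SM is registered to
    nat_enabled: bool  # True if the SM is in NAT mode, False for bridge mode
    firmware: str      # dotted numeric version, e.g. "15.1.5"


def version_tuple(version):
    """Turn "15.1.5" into (15, 1, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def matches_pattern(sm):
    """True if the SM has all three factors listed above."""
    return (
        sm.ap_model == "PMP 450m"
        and sm.nat_enabled
        and version_tuple(sm.firmware) >= (15, 1)
    )


# Invented example data, not real radios.
INVENTORY = [
    SM("sm-a", "PMP 450m", True, "15.1.5"),
    SM("sm-b", "PMP 450m", False, "15.2"),
    SM("sm-c", "PMP 450i", True, "15.2"),
    SM("sm-d", "PMP 450m", True, "15.0.3"),
]

if __name__ == "__main__":
    for sm in INVENTORY:
        if matches_pattern(sm):
            print(sm.name)
```

Comparing versions as tuples rather than strings avoids surprises such as "15.10" sorting before "15.2".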

Interesting. We have a similar, or perhaps the same, problem, and we have a ticket open as well. We also first encountered it over a year ago, and the first time was with 15.1 too, though at the time the incidents were few and far between. It got progressively worse over the next year, but we were not sure what was happening because it was hard to catch while it was happening. We finally realized that all the trouble calls we were getting were genuine when we asked customers to call before they rebooted their radio. It turned out that in each case the Ethernet port had stopped sending any packets and was continuously counting up out discards. In fact, the radio would not even hand out an IP to the client. The only way to get it passing traffic again was to reboot the radio. One customer said that if he waited 12 hours or so it would come back, but I never asked a customer to wait that long to see what happened.

Almost all our radios run in NAT mode with DMZ to the first address in the DHCP pool, since most of our customers just run a router. Cambium support recommended we turn off DMZ. We had run with that config for over 6 years with no problem, but we tried it anyway, and it did change the behavior. After we disabled DMZ, the radio would no longer lock up for hours at a time or until a reboot; instead it would stop passing packets out the Ethernet port for only 3 to 5 minutes and then pick up again.

Of note, we still have some old FSK (PMP 100) and 430 radios running the old NAT/DMZ config on the same network with no issue whatsoever. In fact, we just spent the last year converting almost all our customers over to the 450 system, mostly on 450m APs, and in my 18 years of working with wireless this last year has been the worst because of this issue (and a few other bugs, but those are resolved now). We have lost quite a few customers over it, and it has ruined our reputation in the community.

I realized, though, that I had not heard from certain customers who I know would have called even if their connection blinked for a second. So I called one up, and much to my surprise he described his last year of service as the “best he had ever had from us”, and he has been a customer for 17 years. I called a few others and they said the same. At this point we realized that radios in bridge mode did not experience the issue. This customer has a static IP, so we had left his radio in bridge mode; it’s also why I never experienced the issue myself, as my radio runs in bridge mode as well.

So, in a desperate attempt to salvage customers, we just finished switching all our radios into bridge mode and (jinx) one week in, the phones are dead quiet. We were getting anywhere from 5 to 15 calls a day before the change. I have also checked in with a few who had the issue badly, and they report no problems since. It’s not a true fix, though; we would like to go back to running NAT ASAP, but this is the only workaround we could find for now.

We still don’t know what causes the problem. We could never trigger it or force it to happen; it seemed to be completely random. I have a love-hate relationship with the 450 system: in one way it’s easily the best-performing system we have ever had, but to call it buggy and unstable would be a kindness. We would love to figure this out, seeing as we just spent almost $500,000.00 to “upgrade to 450”. If anyone has more info or questions, please share or ask.
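
Since the symptom described above is an Ethernet port that stops sending while out discards keep climbing, one way to catch it in the act might be to compare successive counter readings. The sketch below is only a heuristic under that assumption: the dictionary keys and sample numbers are invented, how the counters are collected (SNMP polling, the web UI, etc.) is left out, and counter wraparound is not handled.

```python
def looks_stalled(prev, curr):
    """Heuristic: no packets sent in the interval while discards increased.

    prev and curr are counter snapshots with the hypothetical keys
    'out_pkts' and 'out_discards'.
    """
    sent = curr["out_pkts"] - prev["out_pkts"]
    discarded = curr["out_discards"] - prev["out_discards"]
    return sent == 0 and discarded > 0


def stalled_intervals(samples):
    """Return each index i where the interval samples[i-1] -> samples[i] looks stalled."""
    return [
        i for i in range(1, len(samples))
        if looks_stalled(samples[i - 1], samples[i])
    ]


# Invented readings, one per polling interval.
SAMPLES = [
    {"out_pkts": 1000, "out_discards": 0},
    {"out_pkts": 1500, "out_discards": 0},
    {"out_pkts": 1500, "out_discards": 40},
    {"out_pkts": 1500, "out_discards": 95},
    {"out_pkts": 2100, "out_discards": 95},
]

if __name__ == "__main__":
    print(stalled_intervals(SAMPLES))
```

An idle port with no discards is not flagged, which keeps a customer who is simply not using the connection from showing up as a stall.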
