Lost access to 2000 AP during upgrade to 3.5

We were upgrading a 2000 AP from 3.3 to 3.5 and have lost all access to the AP. The upload went fine, but after the reboot at the end of the software upgrade, the AP never came back. We have cycled power on the AP via the Netonix switch. The switch shows a 1 Gbps link to the AP, but the AP's MAC address no longer shows as active in the switch, and the AP will not answer pings, SSH, or HTTP at its IP address.

Anything else I can try - other than climbing a 100' tower at the top of a mountain to see what is going on?  We upgraded almost 100 SMs and 6 other APs to 3.5 without issue. This is the first 2000 AP we tried.  

Are you able to get to one of the SMs on this AP to see if the AP is accessible from there? That is assuming the SMs are registered to the AP and you have allowed access to the AP via the wireless interface. 

All SMs failed over to a backup AP (that is now overcrowded). They won't re-establish to the bad one.

The SMs see the bad AP, but the RSSI is -90 (down from the -50s), so it's like it went into a low-power mode. They won't connect at that level.

I was able to get a couple of SMs to apparently re-attach to the AP. When they do, they vanish off the network. It seems they are re-attaching to the bad AP, but I lose all network connectivity to them until I kill power to that AP, forcing them to connect to the secondary AP in their list.


@Au Wireless wrote:

Able to get a couple SMs to apparently re-attach to the AP. 


I think the idea is that instead of driving to the mountain and climbing the tower, you take an SM to a spot where you can get a decent connection to the AP, connect to the AP via the SM, and then log into the AP to see if you can figure out the problem or revert the firmware. This assumes you have the APs configured to allow access via the wireless interface.

Drove to the base of the mountain where I was able to connect a SM to the bad AP. No problem getting access to it from the wireless side. The bad AP reported a 1 Gbps connection on the Ethernet port but it was unable to connect out or ping anything. Matches the behavior from the other side of that port.

I downgraded from 3.5 to 3.2.2.  AP came back up after a reboot and was working as normal.

I then upgraded from 3.2.2 to 3.4.1. No problems after reboot. All working normal.

I then upgraded from 3.4.1 to 3.5. After the reboot, the AP would not connect to anything on its Ethernet port. It was essentially dead to the world again.

Downgraded back to 3.4.1 and everything is fine again.

So, as far as I can tell, there is an issue with the 2000 hardware and 3.5.  I have 2 other 2000 APs but I am not really willing to test this theory on them since it's pretty disruptive to my customers.

The system log was downloaded while the AP was on 3.5 and would not connect via Ethernet, but there is nothing of interest in there other than cnMaestro trying to connect.

**UPDATE** For kicks, I uploaded 3.5RC7 to the AP. Same behavior as 3.5 - no communication out the Ethernet port.

For now, running 3.4.1 with no issues.  I am happy to send the json config file to Cambium for lab testing to reproduce this.

Curious if anyone else is seeing this behaviour?

I was just getting ready to roll out 3.5 to all my 2000 APs and saw this.

Now I am hesitant.



We will test this scenario with the provided configuration files ASAP and get back to you.

Thank you.

Hi,

I tested on an ePMP 2000 that I have that is not currently in use. I updated it to version 3.3 and later to version 3.5, and everything went normally without major problems.

I have one on the test bench that is working well with 3.5, but I have held firmware roll out on our network until it is clear what is going on.

A bit disappointing if this is an issue, as stable firmware was one of the advantages Cambium has over Ubiquiti. And this comes after the 3.3 reboot issue :(

Could Cambium please enlighten us on what quality testing these firmware releases go through prior to being released?

For example: How many units do you test a firmware on prior to release? What sort of test bench do you have, etc.?

Would love to see a write-up on this to explain the procedure, as I am sure a lot of us are curious how a company handles this.

I am sure we all understand that you can only test so much in house, and that once out in the wild, new bugs may be discovered due to the different setups used.

Thanks.

P.S. Here's to hoping I do not have to clear the cookies every time I change a password in the next firmware.

I will tell you this... Within minutes of me posting the problem on this forum, Cambium tech support emailed me asking for my config files for that AP so they could complete internal testing. I sent them the config files and am sure they are trying to duplicate this. I could have bad hardware. There could be something in the way I am set up in software that is causing this. I am happy to hear it is not widespread.

However, I know Cambium is listening and taking it seriously. I have worked with their tech support in the past (as well as UBNT, Mimosa and others) and found Cambium support to be very responsive. Yes, it sucks when firmware comes out with issues. It's how you deal with that that counts. Let's see what happens in the next few days. I upgraded 120 SMs with no issues and a dozen 1000 GPS APs with no issues. I can see the throughput gains on our 40 MHz APs. It's not made up. But my first 2000 AP failed and I can reproduce the problem over and over. I have an issue. Maybe it's just me, maybe not. I think we'll hear from Cambium Monday with internal testing results.



Chris, 

Without revealing too many details, we have more than a dozen different test setups that each release goes through. The test setups range from PTP to a small number of SMs (4-6) to medium (25-40) to 120 SMs. We test all types of radios (ePMP 1000, F180, F190, F200, ePMP 2000) and bands. There are also different modes and configurations that are tested using automation and traffic generators. Then there are upgrade/downgrade tests using GUI/SNMP/CNUT/CNSS/cnMaestro etc. We also have our field test setup, where we run configurations which we can legally run in an outdoor system. Once this is all done, we proceed to open beta, where we spend at least two weeks having beta customers (can't thank these customers enough for helping out!) try out the new release and give feedback on field configurations which we cannot possibly simulate in a lab or simply did not think of.

That said, with the number of configuration options and modes and bands and radio types, it is nearly impossible to cover all permutations and combinations. We work to make continuous improvements to our test and quality process to prevent escaped defects, but it's an endless, ongoing process as we rapidly add more capability and complexity to the product line.

If you are ever on this side of the Pacific and visit Chicago, I'd be more than happy to take you on a tour of our test labs. It is something we're proud of and a legacy we carried over from Motorola. 

Thanks,

Sriram



Having seen their testing lab myself, it is pretty cool.


Thanks for the write-up, Sriram. All sounds good.

Look forward to future updates. Any more word on 3.5? Have you managed to replicate this issue others have experienced?

 



Hi Chris, 

We have been unable to reproduce the issue that AU Wireless ran into with his ePMP 2000 with the configuration he sent us. We'll continue to work with AU and provide updates as we progress further. 

Thanks,

Sriram


Any update on this?

We cannot reproduce the issue in our lab with the configuration provided by Chadwick.

All possible scenarios were tested.

Thank you.

I think I have figured out a relevant point. Are all the devices that have shown this behavior connected to a Netonix switch? I don't believe this is restricted to just the 2000 series; I have seen it on 1000 series devices as well.

Rolling out 3.5 on our network has been very smooth, with the exception of devices connected to a Netonix switch where remote access to said devices is only through the Netonix switch (an AP downstream of a BH connected through a Netonix). Any device connected to a stand-alone power supply, CMM3, or CMM4 has upgraded just fine and remains accessible.

In almost all cases, I've not been able to access the affected APs via an SM (but then, access via the wireless interface was not enabled). I've been able to plug the affected device into a stand-alone power supply and access it via the secondary IP of 169.254.1.1, but not on the normal management IP. If I make any change to the config (in my case I enabled management access via wireless) and save the changes, the device becomes accessible via the management IP again. When I plug it back into the Netonix switch, everything has returned to normal.

I will note that management of our devices is on a separate VLAN from customer data, and all affected devices have been those with GPS sync.
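
For anyone who wants to check which of the two addresses a device answers on, here is a minimal Python sketch. It only tries to open TCP connections to the SSH and web ports. The port list, the function names, and the `10.0.0.10` address in the usage note are my own placeholders, not anything from Cambium or from the post above. Keep in mind that 169.254.1.1 is a link-local address, so this probably only works from a machine on the same Ethernet segment as the radio.

```python
import socket


def probe(host, ports=(22, 80, 443), timeout=2.0, connect=socket.create_connection):
    """Return {port: bool}: whether a TCP connection to host:port succeeds.

    The default ports (SSH, HTTP, HTTPS) are an assumption about which
    management services are enabled on the radio.
    """
    results = {}
    for port in ports:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            results[port] = False
        else:
            conn.close()
            results[port] = True
    return results


def summarize(addresses, **kwargs):
    """Return {address: True if any probed port accepted a connection}."""
    return {addr: any(probe(addr, **kwargs).values()) for addr in addresses}
```

Calling `summarize(["10.0.0.10", "169.254.1.1"])`, with the first address replaced by the real management IP, would show whether the device answers on the fallback address only, which is the symptom described above. The `connect` parameter is there only so the function can be exercised without a live radio.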

I have 10 APs, all connected to and powered by Netonix switches. 3 of those are 2000s and 7 are 1000s. All have GPS. None of the 1000s had any upgrade issues. Two of the 2000s failed (same Netonix). The 3rd 2000 I have not tried to upgrade.

The Netonix connected to the 2000s is a WS-8-150-DC. All the rest are WS-6-Mini switches.
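
To make the numbers above easier to compare, here is a small Python tally of the same results. It assumes my reading of the post is right, namely that all three 2000s hang off the WS-8-150-DC and all seven 1000s off WS-6-Mini switches. The tuple layout and the `tally` helper are just for illustration.

```python
from collections import Counter

# (AP model, switch model, outcome), as read from the post above.
# Assumption: all three 2000s are on the WS-8-150-DC.
reports = (
    [("ePMP 2000", "WS-8-150-DC", "failed")] * 2
    + [("ePMP 2000", "WS-8-150-DC", "not upgraded")]
    + [("ePMP 1000", "WS-6-Mini", "ok")] * 7
)


def tally(rows, key_index):
    """Count outcomes grouped by one column (0 = AP model, 1 = switch model)."""
    return Counter((row[key_index], row[2]) for row in rows)


by_ap = tally(reports, 0)
by_switch = tally(reports, 1)

for label, counts in (("By AP model", by_ap), ("By switch model", by_switch)):
    print(label)
    for (group, outcome), n in sorted(counts.items()):
        print(f"  {group}: {outcome} x{n}")
```

Both groupings split the failures the same way, so on these ten APs alone the AP model and the switch model cannot be told apart as the cause. Upgrading a 1000 on the WS-8-150-DC, or a 2000 on a WS-6-Mini, would be the kind of data point that separates them.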