SM's Disconnecting


ninedd wrote: 

Also, I don't see anything about it losing GPS Sync - I set the Syncronization Holdoff TIme  up to 600 seconds to try that, but I don't see that it's losing Sync or that increasing the holdoff time has made any difference.  GSP losing sync would be indicated in the Access Point's  MONITOR -> SYSTEM LOG - correct?


The onboard memory is limited, hence with all the Device Agent messages being printed out, the GPS sync loss messages may have been overwritten. Would you be able to re-direct the syslogs to a syslog server? Or perhaps disable cnMaestro (remote management) temporarily so we can be sure we're not missing anything in the syslogs. 

With a mass SM drop, its unlikely its interference at the SMs. Interference at the AP could be a reason but like you said, if other APs on his tower are ok, its unlikely that its uplink interference. 

Is there anything specific you changed recently that you can remember? You said you just upgraded the sector to 3.1. Did it start happening after the upgrade? How about the other sectors, are they running 3.1 or are they running older software?

>Is there anything specific you changed recently that you can remember?

>You said you just upgraded the sector to 3.1. Did it start happening after the upgrade?
>How about the other sectors, are they running 3.1 or are they running older software?

No, this has always been the case on this sector. I just upgraded to 3.1 to see if that would help, and because I had that one SM that was stuborn to upgrade. However, all the SM's are 3.1 now.

Later tonight (like 3 AM) I'll run a Spectrum on the AP and I'll see what I can see.  It's a shame I can't run a Spectrum on an SM and see what they can hear. :(  But - I'll run one on the AP and see what's up from that perspective.  Maybe there's just a crapload of noise northward that isn't there S / E / W.  Maybe it's that simple.

The other thing I could try - these SM's are Force200's - so I could go try changing one of them out to an  ePMP1000 with a 22dBi antenna.  That could gain ~6 dB at the AP and if that SM lives though the de-register cycle, then that would point to noise at the AP.


@Cambium_Sri wrote:

Those error messages are from the Device Agent, the process that communicates to cnMaestro. It simply means the Device Agent on the radio is not able to communicate with cnMaestro as it is not able to reach cnMaestro.


OK.  So two questions then.
1) Where should I find the disconnect reason displayed then?

2) Does the inability for the Device Agent to communicate with cnMaestro relate to this issue I reported then?
http://community.cambiumnetworks.com/t5/ePMP-2000-and-1000/cnMaestro-Device-Offline-when-not-offline/m-p/61630#U61630


@ninedd wrote:

@Cambium_Sri wrote:

Those error messages are from the Device Agent, the process that communicates to cnMaestro. It simply means the Device Agent on the radio is not able to communicate with cnMaestro as it is not able to reach cnMaestro.


OK.  So two questions then.
1) Where should I find the disconnect reason displayed then?

2) Does the inability for the Device Agent to communicate with cnMaestro relate to this issue I reported then?
http://community.cambiumnetworks.com/t5/ePMP-2000-and-1000/cnMaestro-Device-Offline-when-not-offline/m-p/61630#U61630


1) The disconnect reason is in the logs you posted. The device is not able to reach cnMaestro and therefore Device Agent disconnects from it. Can you try a ping test from the device to the url of cnMaestro to double check if it can be reached? Also, make sure DNS server is reachable as well. Remember that both port 443 and 80 must be open in your network for cnMaestro to work. 

2) If the Device Agent is not able to reach cnMaestro, it is possible that it will show as offline on cnMaestro while in reality all other radio functions are operating normally. 

Thanks,

Sriram

OK, we added a new client to that sector today, and we put a 22dBi grid on it instead of using a Force200 with it's 16dB anntenna.  That's brought this clients signals in the mid 50's, with an SNR better than 40.


So, maybe if it's just simply interference - then perhaps this SM will stay connected when the others disconnect.  As you can see in this attached image, the first 3 SM's all disconnected at the same time 5:35 ago.  Actually, the 4th (Sally) disconnected at that same time 5:35 ago as well - but she aslo disconncted 3:39 ago.

In any event, this might let us know if it's as simple as needing more signal due to interference on this sector, or if it's still something else unknown.


Does eDetect on this AP show you anything (something other than the 3 other APs on this tower)?

Also the SNR is just one measure of the noise. Only an SA scan will show you the actual interference at the AP. Is it at all possible to disrupt this AP and run an SA scan?

OK, this client just disconnected while I was watching a few seconds ago.

The SM's uptime is 22 hours, 31 minutes (so it didn't reboot or power cycle).

eDetect on the SM says 'No interferers' and doesn't show anyting other than its own MAC address.

I can't run a Spectrum on the SM of course, so I don't know...

The logs on the SM shows this (his disconnection was a 2:01 PM our time - so right at the bottom of the log, and nothing has overwritten it or scrolled anything off.

Nov 15 13:55:51 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (3)
Nov 15 13:56:09 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (4)
Nov 15 13:56:26 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (5)
Nov 15 13:56:26 Frank Suchorab DEVICE-AGENT[2881]: Missed 5 consecutive PONGS, disconnecting with server
Nov 15 13:56:58 Frank Suchorab DEVICE-AGENT[2881]: callback_websocket: LWS_CALLBACK_CLOSED
Nov 15 13:57:10 Frank Suchorab DEVICE-AGENT[2881]:  SMs pmac [00:04:56:CA:0E:0B]
Nov 15 15:25:55 Frank Suchorab DEVICE-AGENT[2881]: callback_websocket: LWS_CALLBACK_CLOSED
Nov 15 15:26:03 Frank Suchorab DEVICE-AGENT[2881]:  SMs pmac [00:04:56:CA:0E:0B]
Nov 15 15:27:40 Frank Suchorab DEVICE-AGENT[2881]: callback_websocket: LWS_CALLBACK_CLOSED
Nov 15 15:27:47 Frank Suchorab DEVICE-AGENT[2881]:  SMs pmac [00:04:56:CA:0E:0B]
Nov 15 15:38:57 Frank Suchorab DEVICE-AGENT[2881]: callback_websocket: LWS_CALLBACK_CLOSED
Nov 15 15:39:05 Frank Suchorab DEVICE-AGENT[2881]:  SMs pmac [00:04:56:CA:0E:0B]
Nov 15 19:59:32 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (1)
Nov 15 19:59:58 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (2)
Nov 15 20:00:23 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (3)
Nov 15 20:00:52 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (4)
Nov 15 20:01:17 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (5)
Nov 15 20:01:17 Frank Suchorab DEVICE-AGENT[2881]: Missed 5 consecutive PONGS, disconnecting with server
Nov 15 20:01:48 Frank Suchorab DEVICE-AGENT[2881]: callback_websocket: LWS_CALLBACK_CLOSED
Nov 15 20:01:57 Frank Suchorab DEVICE-AGENT[2881]:  SMs pmac [00:04:56:CA:0E:0B]

And this same client disconnected just now again at 2:13 PM.


Nov 15 19:59:32 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (1)
Nov 15 19:59:58 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (2)
Nov 15 20:00:23 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (3)
Nov 15 20:00:52 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (4)
Nov 15 20:01:17 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (5)
Nov 15 20:01:17 Frank Suchorab DEVICE-AGENT[2881]: Missed 5 consecutive PONGS, disconnecting with server
Nov 15 20:01:48 Frank Suchorab DEVICE-AGENT[2881]: callback_websocket: LWS_CALLBACK_CLOSED
Nov 15 20:01:57 Frank Suchorab DEVICE-AGENT[2881]:  SMs pmac [00:04:56:CA:0E:0B]
Nov 15 20:12:12 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (1)
Nov 15 20:12:33 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (2)
Nov 15 20:12:56 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (3)
Nov 15 20:13:14 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (4)
Nov 15 20:13:37 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (5)
Nov 15 20:13:37 Frank Suchorab DEVICE-AGENT[2881]: Missed 5 consecutive PONGS, disconnecting with server
Nov 15 20:13:55 Frank Suchorab DEVICE-AGENT[2881]: callback_websocket: LWS_CALLBACK_CLOSED
Nov 15 20:14:01 Frank Suchorab DEVICE-AGENT[2881]:  SMs pmac [00:04:56:CA:0E:0B]

This does mean that sometimes it's not all the SM's at the same time.  We do obviously see that (because sometimes all the Session Times on the AP are all identical) but since I'm sitting here and staring at it today, it's also true that they disconnect individually at different times.

And the logs do seem to indicate 'disconnectin from server' and 'callback_websocket' right at the point of disconnection.

PLUS - I need to make another correction...  this AP is still 3.0.1 and all the clients are 3.1 presently.  There's no way to see client throughput on the AP and the ''ePMP GUI+'' browser extension isn't compatible with 3.1 (and I haven't had time to work on a 3.1 version) so I guess the AP was reverted to 3.0.1 so that we could use ePMP GUI+ and see what clients are trasferring.


@ninedd wrote:

And this same client disconnected just now again at 2:13 PM.


Nov 15 19:59:32 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (1)
Nov 15 19:59:58 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (2)
Nov 15 20:00:23 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (3)
Nov 15 20:00:52 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (4)
Nov 15 20:01:17 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (5)
Nov 15 20:01:17 Frank Suchorab DEVICE-AGENT[2881]: Missed 5 consecutive PONGS, disconnecting with server
Nov 15 20:01:48 Frank Suchorab DEVICE-AGENT[2881]: callback_websocket: LWS_CALLBACK_CLOSED
Nov 15 20:01:57 Frank Suchorab DEVICE-AGENT[2881]:  SMs pmac [00:04:56:CA:0E:0B]
Nov 15 20:12:12 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (1)
Nov 15 20:12:33 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (2)
Nov 15 20:12:56 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (3)
Nov 15 20:13:14 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (4)
Nov 15 20:13:37 Frank Suchorab DEVICE-AGENT[2881]: Not received PONG for the last ping (5)
Nov 15 20:13:37 Frank Suchorab DEVICE-AGENT[2881]: Missed 5 consecutive PONGS, disconnecting with server
Nov 15 20:13:55 Frank Suchorab DEVICE-AGENT[2881]: callback_websocket: LWS_CALLBACK_CLOSED
Nov 15 20:14:01 Frank Suchorab DEVICE-AGENT[2881]:  SMs pmac [00:04:56:CA:0E:0B]

This does mean that sometimes it's not all the SM's at the same time.  We do obviously see that (because sometimes all the Session Times on the AP are all identical) but since I'm sitting here and staring at it today, it's also true that they disconnect individually at different times.

And the logs do seem to indicate 'disconnectin from server' and 'callback_websocket' right at the point of disconnection.

PLUS - I need to make another correction...  this AP is still 3.0.1 and all the clients are 3.1 presently.  There's no way to see client throughput on the AP and the ''ePMP GUI+'' browser extension isn't compatible with 3.1 (and I haven't had time to work on a 3.1 version) so I guess the AP was reverted to 3.0.1 so that we could use ePMP GUI+ and see what clients are trasferring.


Can you disable Remote Management (if not already) on this SM and check to see if it stays up?

Well, yesterday before leaving work, I setup a continuous ping (ping -t) to each SM.  Since I only see the ''Session Time'' - I wanted to have an idea how often each SM was going offline each night.  But, to my surprise, everything stayed linked for the last 14+ or more hours.

So - I'm not sure what that means...  perhaps the pings served as some sort of keep-alive, or maybe it's an unrelated fluke, or maybe the two new clients we added yesterday (Deb and Brad) who are both a bit better signals than most, maybe they have helped everyone somehow...   or maybe it's a fluke.

I'm also going to upgrade evertyhing to 3.1 today across the whole ePMP network here.  I had still been running some AP's as 3.0.1 because of the ePMP GUI+ Extension we use... and V3.1 broke it.


However, the programmer got a new V3.1 version completed today, and so we're able to upgrade all our ePMP to V3.1 today.  So, that will reset all the uptimes when the AP is upgraded - but then we will have everything talking V3.1 to V3.1 on all AP's and SM's.

Hello again.  Well, simply having a ping keepign connections alive isnt' the issue.  Here's a screen shot of how things look right now.  On the left I'm showing pings from my desktop, through our network to the AP (10.23.2.31)  and to each of the SM's.

So, 'BRAD' at the top of the list, he's been connected for 1D:23H currently.  However, as you can see, everyone else all disconnected 3H:42M ago, all at the same time.  Now, BRAD is the best signal of the bunch, he's measured at -58 on the AP.  Of course, SALLY is a -59 at the AP and she disconnected along with all the others.

I saw this happen yesterday - so ping times started going up, lots of dropped pings to all SM's and then everyone (except BRAD) de-registered.  Everyone re-reg'd right away, within a few seconds - and everyone was stable right away again.

SO - extreme interference at the AP side isn't ruled out, maybe it's just some noise in the 60's that is only occasinally.  This is a 2.4Ghz AP, so we do know there is lots of noise around of course.  But, our other AP's and SM's are all fighting with lots of noise in 2.4Ghz eveywhere and this is the only place that it really acts like this, and I need to track down the why's of what is happening.

SO - how can I tell what the AP and SM's think the reason for the disconnections are?

Hi,

Could you please attach full System Log from Ap and one of SM which experiencing the issue?

Hope we will find reason of disconnects there.

Thank you.

What's your logging level on the AP and SMs?  The default setting will not log the disconnect reason.


@dkeltgen wrote:

What's your logging level on the AP and SMs?  The default setting will not log the disconnect reason.


Hmmm...  Where do I configure that?


@ninedd wrote:

@dkeltgen wrote:

What's your logging level on the AP and SMs?  The default setting will not log the disconnect reason.


Hmmm...  Where do I configure that?


Under Configuration->System->Syslog. Just check all the boxes

syslog.png

AH - thank you.  I had been over the GUI with a fine-toothed comb... but not fine enough!  :)

OK, now that I have logging of all things checked on, this is what I see in the SM's when a bunch of them disassociate at the same time.

Nov 24 08:28:55 SM NAME kernel: [676116.380000] SM disassociated from AP[00:04:56:ca:0e:0c] F=2457 11nght10. Reason: 48 (COMMUNICATION LOST)
Nov 24 08:28:59 SM NAME kernel: [676120.400000] TPC set initial Tx-Power to 30dbm
Nov 24 08:29:15 SM NAME kernel: [676136.920000] SM associated with AP[00:04:56:ca:0e:0c]
Nov 24 08:29:15 SM NAME kernel: [676136.940000] connection is UP

So, ''Communication Lost'' on each of the SM's doesn't tell me much. :)  Maybe 'Reason 48' means something to you?

On the AP, a tthe same time - this is what I see

Nov 23 18:48:49 AP NAME kernel: [626441.810000] SM[00:04:56:c9:66:25] aid=4 disassociated. Reason: COMMUNICATION LOST
Nov 23 18:49:06 AP NAME kernel: [626458.570000] SM[00:04:56:f9:58:03] aid=1 disassociated. Reason: COMMUNICATION LOST
Nov 24 06:46:39 AP NAME kernel: [669511.850000] SM[00:04:56:f9:5b:03] aid=2 disassociated. Reason: COMMUNICATION LOST
Nov 24 08:28:55 AP NAME kernel: [675647.050000] SM[00:04:56:f9:58:03] aid=1 disassociated. Reason: COMMUNICATION LOST
Nov 24 08:29:08 AP NAME kernel: [675660.590000] SM[00:04:56:c9:66:25] aid=4 disassociated. Reason: COMMUNICATION LOST

Hi,

Please find belov link to KB wjith description of Registration State Codes:

http://community.cambiumnetworks.com/t5/ePMP-Troubleshooting/Network-Entry-State-Definitions-and-Courses-of-Action/m-p/40712#U40712

Communication Lost means the SM cannot be reached by AP polling, resulting in SM disassociation.

Please check RSSI and SNR levels.

Also please check with Spectrum Analyzer tool whether there is any bursty interference on the channel which may cause Radio Link quality degradation.

Thank you.

OK, thank you.

The RSSI's are between -56 and -73, and the SNR's are all about 30-40 range - so that should be pretty decent.

However, those two that are close to -70 are Force200's so I've been planning on changing those out to Connectorized radios and 22 dBi Grid Antennas, that should put them at -65 or so, and maybe that'd make a difference.  However, there are connections on here which are -64/64 signal and -38/38 SNR and they both unlinked 5:25 ago

So, maybe it is just occasionaly LOUD noise, but that'll be tricky to catch on the Spectrum - because I'd have to basically watch it live for 6 hours at a time.  Also, it seems strange because the logs show these clients seem to reassociate within seconds and performance it good for them right away it seems.

However - the simplest thing for me to try on this end is to repalce the Force200 with ePMP1000-22dB Antennas and see if an extra 6 dBi has a good effect on them.  It does seem like the client with the best signal (-56) does have the least disconnection problems.