Prizm stops auth on CentOS

Greetings,

I’m running Prizm 2.1 on a CentOS 4.4 server with PostgreSQL as the database backend. Everything runs great for the most part, but I have had a few occasions (two today) where I can still access the server and everything except authentication seems to be functioning.

I’ve done packet captures on UDP port 1234 and I see the requests coming in from the AP but no response from Prizm.
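For reference, the capture itself is nothing fancy; roughly this, with eth0 being whatever interface faces the APs on your setup:

tcpdump -n -i eth0 -w prizm-auth.pcap udp port 1234

Reading the file back with "tcpdump -n -r prizm-auth.pcap" is how I spotted the inbound requests from the AP with no reply from Prizm.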

The only solution I have for now is to keep an eye on it and restart Prizm when this happens.

Anyone have thoughts on this or should I just purchase RHEL so Moto support will talk to me?


Thanks,
Charlie

Noticed the exact same problem on Red Hat!

I recommend that you don’t mention that it’s CentOS when you talk to support. There’s no functional difference between RHEL and CentOS; if they refuse to talk to you on those grounds, it’s pretty much a cop-out. Anything you find in RHEL will be in the same place in CentOS.

We have had BAM stop functioning on a few occasions, and Prizm attempted to restart it; the evidence is in the open events…

We are debating whether or not to use the BAM functionality at this point. I’d rather risk someone with a rogue radio trying to circumvent my other security layers than deny service to paying clients…

I have the same problem on RH WS v.4. I contacted technical support 12 days ago, but they still have not resolved my problem. However, they have had me run some tests. I will update you on their answers.

Fabio Marocco
@Works S.r.l.

Thanks to all that responded. It’s good to know that it’s NOT a CentOS issue.

When I first started with Canopy I made the mistake of telling a Moto tech that I had CentOS and he put it in my notes. I could always tell them I have since changed to RHEL.

So far I have NOT been impressed with Moto technical support (if you can call it that). Early in my Canopy adventures I had one guy telling me I had to route 169.254.1.1 on my private 10.x.x.x network and that Prizm wasn’t designed to be used on a routed network.

I had another guy swear up and down that they had seen a licensing issue on several CentOS servers and that I needed to switch to RHEL. Turns out they had issued the Prizm license to the MAC address of a redundant licensing server instead of my Prizm server. I blame them as I had specifically indicated which were which when I emailed the licensing people.
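On that note, if anyone else wants to double-check which box their licence is actually tied to, the NIC MAC addresses are easy to list with stock CentOS tools (nothing Prizm-specific about this):

/sbin/ifconfig -a | grep -i hwaddr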

Unfortunately we’re stuck with Prizm and its BAM functionality, as my boss insists we use 2.4GHz Lite SMs. The Lite SMs require Prizm for anything above the default 512kbps aggregate bandwidth.

Again, thanks for the information. I’ll check back for updates. Hopefully they can stabilize this issue.

Charlie

If I understand this issue correctly, you have Prizm and authentication working, and then every so often authentication stops responding.

A couple of things to check:

Verify that authentication is active (Help -> About PrizmEMS).
In the Server tab, make sure Features: has BAM listed in it.

Open the Prizm system log: Tools -> Administrator -> System Log

Enable debug logging: Edit -> Set Severity Level -> DEBUG

Look at the BAM Subsystem Logs:
Log_#2_bam_engine.txt
Log_#1_bam_engine.txt

Check the trace to see what happens after an authentication request comes in.

For Linux, also check whether the BAM engine (engined) is still running in the process list.

Other things to check:
1. Is the database on the same machine? If the ODBC connection is lost, the BAM engine does not recover and must be restarted. (A combined command-line pass over these checks is sketched below.)
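If it helps, the command-line side of the checks above can be strung together in one quick pass. This is just a sketch; the log path assumes a default /usr/local/Canopy/Prizm install, so adjust it for your system:

# Is the BAM engine still alive? (the [e] keeps grep from matching itself)
ps -ef | grep [e]ngined

# Is anything still listening on the auth port (UDP 1234)?
lsof -i UDP:1234

# Tail the BAM engine logs (path assumed from a default install)
tail -n 50 '/usr/local/Canopy/Prizm/logs/Log_#1_bam_engine.txt'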

Authentication is active.
Debug output is shown below; if you would like the full versions, PM me.
Engined is still running.
The database is local, using MySQL.

log_#1_bam_engine.txt
Debug 01/27/2007 14:13:11 File: main.cpp Line: 268 Message: Connection from 1.1.1.1 with socket 59
Debug 01/27/2007 14:13:11 File: api.cpp Line: 1609 Message: In canopy__getFailedLicenseCheckouts
Debug 01/27/2007 14:13:18 File: main.cpp Line: 268 Message: Connection from 1.1.1.1 with socket 59
Debug 01/27/2007 14:13:18 File: api.cpp Line: 677 Message: In canopy__getAuthSmInfo
Debug 01/27/2007 14:13:18 File: pollhandler.cpp Line: 229 Message: BeginBadSmInfo,
Debug 01/27/2007 14:13:18 File: main.cpp Line: 268 Message: Connection from 1.1.1.1 with socket 59
Debug 01/27/2007 14:13:18 File: api.cpp Line: 1609 Message: In canopy__getFailedLicenseCheckouts
Debug 01/27/2007 14:13:24 File: main.cpp Line: 268 Message: Connection from 1.1.1.1 with socket 59
Debug 01/27/2007 14:13:24 File: main.cpp Line: 268 Message: Connection from 1.1.1.1 with socket 59
log_#2_bam_engine.txt
Info 01/27/2007 2:49:20 File: database.cpp Line: 83 Message: The string convert to 04000800000027100000c3500000000000000000000000000000000000000000
Info 01/27/2007 2:49:20 File: standauth.cpp Line: 23 Message: Destroyed Standard Authenicator
Info 01/27/2007 2:49:34 File: standauth.cpp Line: 18 Message: Created Standard Authenicator
Info 01/27/2007 2:49:34 File: database.cpp Line: 83 Message: The string convert to 04000800000027100000c3500000000000000000000000000000000000000000
Info 01/27/2007 2:49:34 File: standauth.cpp Line: 23 Message: Destroyed Standard Authenicator
Info 01/27/2007 2:49:37 File: standauth.cpp Line: 18 Message: Created Standard Authenicator
Info 01/27/2007 2:49:37 File: database.cpp Line: 83 Message: The string convert to 04000800000027100000c3500000000000000000000000000000000000000000
Info 01/27/2007 2:49:37 File: standauth.cpp Line: 23 Message: Destroyed Standard Authenicator
Info 01/27/2007 2:49:47 File: standauth.cpp Line: 18 Message: Created Standard Authenicator
Info 01/27/2007 2:49:47 File: database.cpp Line: 83 Message: The string convert to 271027100007a1200007a1200000000000000000000000000000000000000000
Info 01/27/2007 2:49:47 File: standauth.cpp Line: 23 Message: Destroyed Standard Authenicator
Info 01/27/2007 2:49:50 File: standauth.cpp Line: 18 Message: Created Standard Authenicator
Info 01/27/2007 2:49:50 File: database.cpp Line: 83 Message: The string convert to 271027100007a1200007a1200000000000000000000000000000000000000000
Info 01/27/2007 2:49:50 File: standauth.cpp Line: 23 Message: Destroyed Standard Authenicator


This is the only interesting thing out of the second file:
Severe 01/27/2007 15:16:10 File: businessrules/SubscriberLicenseRetriever.cpp Line: 169 Message: Bad Tuples because of bad paramId which is: -2147442773
(this identical line appears nine times in a row with the same timestamp)

Some additional questions for those of you experiencing difficulty with authentication:

1) Are the SMs you are attempting to authenticate behind one AP–and all the other SMs behind other APs authenticate normally–or are no authentications occurring whatsoever? Or perhaps even only certain SM(s) behind a particular AP? Or some other combination I haven’t thought of?

2) Is this occurrence somewhat predictable such that you can reproduce the problem? If so, it would be helpful if you run ethereal (often included with Linux; if not available on your Linux system, or if you are using Windows, you can obtain it at www.ethereal.com) on the NIC from which AP traffic is sent to Prizm, and then attempt your authentication. Try to note the time in HH:MM:SS when you start your ethereal packet capture; Prizm logs messages in system clock time, while ethereal logs time in seconds since packet capture initiation. If you are able to reproduce your authentication problem and have all the logs w18638 mentioned above, as well as the ethereal capture saved in a file, forward this along with your Canopy technical support request so that we can see exactly what packets your equipment is sending to Prizm.
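If it is easier to drive from the command line, tethereal (the console version included with the ethereal package) can do the same capture; a minimal sketch, assuming eth0 is the NIC in question:

date
tethereal -i eth0 -w auth-capture.pcap udp port 1234

The date output gives you the wall-clock start time to line up against Prizm’s log timestamps.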

ccw040 wrote:
Some additional questions for those of you experiencing difficulty with authentication:

1) Are the SMs you are attempting to authenticate behind one AP, or are no authentications occurring whatsoever? … [snip]

2) Is this occurrence somewhat predictable such that you can reproduce the problem? … [snip]


1. In our case we can go to any tower and do an AP-eval on a registered radio, and it will drop when it tries to authenticate. We have seen no tower or AP correlation. Upon restarting the Prizm server, all is well.

2. The only thing close to a correlation that we have been able to find is that it seems to happen around the time we have Prizm processing large requests.

For the record, we are running Prizm on an HP DL145 with two single-core AMD Opteron processors and 2 GB of RAM. Additionally, we have somewhere close to 3,000 elements being managed by Prizm.
ccw040 wrote:
Some additional questions for those of you experiencing difficulty with authentication:

1) Are the SMs you are attempting to authenticate behind one AP, or are no authentications occurring whatsoever? … [snip]

2) Is this occurrence somewhat predictable such that you can reproduce the problem? … [snip]



In answer to question one above, this is affecting ALL APs. We have 9 APs in our network and none can authenticate when we experience this issue.

For question two, I'll try to do what I can. The problem is somewhat random. That said, I've had one idea that may have merit. It seems that every time this has happened, I've had the EMS client open. For the last few days, I've been closing out the EMS client whenever I'm not actively looking at information. So far (knock on wood) this seems to be helping.

In response to the Ethereal suggestion, I guess I could start a packet capture on the console and leave it running to capture all data. I'm not sure what this will accomplish though. It seems that when authentication starts failing, it's because BAM is no longer listening. This is verified by running "lsof -i" on my CentOS system. Under normal operation, I see the following:

engined 23783 root 12u IPv4 3584463 UDP *:ap

When authentication fails, I no longer see this process listening for authentication requests. I guess it's possible some weird packet foo is coming in from the AP and causing the process to quit, but that just doesn't make much sense to me; if that were the case, I would only see the failure when whatever is generating the bad packet causes the AP to send that data. It's also possible that it's just completely random and dumb luck.

Just so there is no confusion, "ap" is listed in /etc/services as UDP port 1234.
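Since the missing *:ap line seems to be a reliable tell on my box, I’m tempted to script the babysitting. Here’s a rough sketch of what I have in mind; note that the init script name is a guess on my part, so substitute whatever actually restarts Prizm on your system:

#!/bin/sh
# Check once a minute; if nothing is listening on UDP 1234, bounce Prizm.
while sleep 60; do
    if ! lsof -i UDP:1234 | grep -q engined; then
        echo "`date`: BAM not listening, restarting Prizm" >> /var/log/prizm-watchdog.log
        /etc/init.d/PrizmEMS restart    # init script name is a guess
    fi
done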

My Prizm server is a 2.4GHz P4 with 512MB of RAM.

Charlie

Couple additional questions for all of you experiencing this problem:

1) When you are seeing authentications no longer working, run the following command:
ps -ef | grep engined

You should see one or more lines, one of which reads something like this:
root 23517 1 0 11:51 ? 00:00:00 /usr/local/Canopy/Prizm/bin/engined

Are you seeing this entry? If so, this tells us if the program which responds to authentication requests is still running. If not, it has for some reason died/exited.

(By the way, if the answer to this question is that the engined program is NOT running, there is a watchdog which should restart it in ~ 10 minutes. This doesn’t exactly solve the problem, but it may mean less “babysitting” of the server is required on your part until we find the cause of this.)

2) If the answer to question (1) reveals that the engined program IS still running, what does the following command reveal:
lsof -i | grep engined

On a functional system, you should see the *:ap entry which boeningc mentioned, but you should additionally at least have an entry similar to the following:
engined 23517 root 7u IPv4 117380 TCP *:canopyapi (LISTEN)

Reason I ask about this is that there are multiple threads of execution within the engined program, and there is (more or less) a thread of execution for each of these network services. This helps us determine whether all your threads have stopped listening to their respective network ports, or just the “ap” (i.e., port 1234) service.

I am having similar problems using Windows 2003 / MS SQL, and the traffic is not that much either.

It’s just at random times that it stops authenticating SMs. We might see only a few clients live on the AP, and when we reboot the AP none return; the only solution is a restart.

I checked the licence file and we are not maxing that out either; we have at least 10 licences left unused when this happens.

Looks like something to do with Prizm’s mechanism!

One additional question for those of you seeing this problem:

3) What version of Canopy radio software are you using?

Running mainly 8.1.4 - just started moving to 8.1.5.1.

We are still seeing this issue. We have to restart Prizm two or three times each day after authentication stops working.

I’m running 8.1.4 here. Haven’t started migrating to anything newer.

In a previous post I mentioned I thought the issue was related to the EMS client. The other day I applied a new bandwidth profile to an existing subscriber (they wanted a faster speed) and that’s when BAM stopped responding to auth requests.

ccw040 … Next time this happens, I’ll check the processes and lsof. I’ve actually checked lsof before and the “ap” process was NOT listening. I will re-verify, though, and post results.


Charlie

ccw040 wrote:
Couple additional questions for all of you experiencing this problem:

1) When you are seeing authentications no longer working, run the following command:
ps -ef | grep engined
… [snip]

2) If the answer to question (1) reveals that the engined program IS still running, what does the following command reveal:
lsof -i | grep engined
… [snip]

1. The engined process is running.
2. The output from that command is below.
3. We are running 8.1.4.


engined 9919 root 7u IPv4 958530 TCP *:canopyapi (LISTEN)
engined 9919 root 8u IPv4 958534 TCP localhost.localdomain:59178->localhost.localdomain:mysql (ESTABLISHED)
engined 9919 root 9u IPv4 958536 TCP localhost.localdomain:59179->localhost.localdomain:mysql (ESTABLISHED)
engined 9919 root 10u IPv4 958538 TCP localhost.localdomain:59180->localhost.localdomain:mysql (ESTABLISHED)
engined 9919 root 11u IPv4 958550 TCP <name changed to protect the innocent>:59183-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 12u IPv4 958559 UDP *:ap
engined 9919 root 15u IPv4 958568 TCP <name changed to protect the innocent>:59188-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 16u IPv4 989828 TCP <name changed to protect the innocent>:34531-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 17u IPv4 958582 TCP <name changed to protect the innocent>:59192-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 18u IPv4 970015 TCP <name changed to protect the innocent>:33003-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 19u IPv4 958680 TCP <name changed to protect the innocent>:59234-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 20u IPv4 958807 TCP <name changed to protect the innocent>:59288-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 21u IPv4 1024249 TCP <name changed to protect the innocent>:37006-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 22u IPv4 959139 TCP <name changed to protect the innocent>:59379-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 23u IPv4 959310 TCP <name changed to protect the innocent>:59455-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 24u IPv4 959514 TCP <name changed to protect the innocent>:59541-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 25u IPv4 959655 TCP <name changed to protect the innocent>:59611-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 26u IPv4 960057 TCP <name changed to protect the innocent>:59732-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 27u IPv4 961413 TCP <name changed to protect the innocent>:60021-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 28u IPv4 964348 TCP <name changed to protect the innocent>:60441-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 29u IPv4 964764 TCP <name changed to protect the innocent>:60492-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 30u IPv4 965260 TCP <name changed to protect the innocent>:60538-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 31u IPv4 965715 TCP <name changed to protect the innocent>:60586-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 32u IPv4 969872 TCP <name changed to protect the innocent>:32986-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 33u IPv4 972768 TCP <name changed to protect the innocent>:33302-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 34u IPv4 977583 TCP <name changed to protect the innocent>:33747-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 35u IPv4 990086 TCP <name changed to protect the innocent>:34563-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 36u IPv4 988312 TCP <name changed to protect the innocent>:34414-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 37u IPv4 991566 TCP <name changed to protect the innocent>:34674-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 38u IPv4 991967 TCP <name changed to protect the innocent>:34711-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 39u IPv4 1014072 TCP <name changed to protect the innocent>:35954-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 40u IPv4 1023975 TCP <name changed to protect the innocent>:36898-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 41u IPv4 1024264 TCP <name changed to protect the innocent>:37012-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 42u IPv4 1030288 TCP <name changed to protect the innocent>:37643-><name changed to protect the innocent>:58917 (ESTABLISHED)
engined 9919 root 43u IPv4 1030295 TCP <name changed to protect the innocent>:37645-><name changed to protect the innocent>:58917 (ESTABLISHED)

In my case, it looks more and more related to making bandwidth service plan changes. I did another speed change this morning and authentication stopped responding. I looked at the process list; engined was running and UDP port 1234 was still listed as listening.

I’m going to do some more tests to verify this and get some packet captures. I don’t know that they will help, though, as it seems the engined process listening on UDP port 1234 just stops responding.

I did some more testing and verified that changing a subscriber’s bandwidth does indeed cause this issue (at least in my case).

Here are the results of “lsof -i | grep engine”. As you can see, BAM appears to still be listening.

engined 23986 root 6u IPv4 7236164 TCP *:canopyapi (LISTEN)
engined 23986 root 10u IPv4 7236192 TCP prizm1:58851->prizm1:35107 (ESTABLISHED)
engined 23986 root 12u IPv4 7236197 UDP *:ap
engined 23986 root 15u IPv4 7762629 TCP prizm1:40298->prizm1:35107 (ESTABLISHED)
engined 23986 root 16u IPv4 7770057 TCP prizm1:41544->prizm1:35107 (ESTABLISHED)
engined 23986 root 17u IPv4 7806703 TCP prizm1:48945->prizm1:35107 (ESTABLISHED)
engined 23986 root 18u IPv4 7606795 TCP prizm1:38605->prizm1:35107 (ESTABLISHED)
engined 23986 root 19u IPv4 7904887 TCP prizm1:38013->prizm1:35107 (ESTABLISHED)
engined 23986 root 20u IPv4 7806727 TCP prizm1:48949->prizm1:35107 (ESTABLISHED)
engined 23986 root 21u IPv4 7920191 TCP prizm1:40809->prizm1:35107 (ESTABLISHED)
engined 23986 root 22u IPv4 7925234 TCP prizm1:41588->prizm1:35107 (ESTABLISHED)
engined 23986 root 23u IPv4 7926349 TCP prizm1:41800->prizm1:35107 (ESTABLISHED)

I have a packet capture I need to go through. I started it prior to the change and stopped it after restarting Prizm.

I also didn’t see anything relevant in the logs. There were no errors reported.

Could it be a Java version issue? I’m running 1.5.0_09.

For now, I guess I just know I have to restart Prizm after any speed changes.
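To make that less painful, I’ll probably just verify the listener right after each plan change with something like:

lsof -i UDP:1234 | grep engined || echo "BAM is down - restart Prizm"

If the grep comes back empty, I restart Prizm before the phones start ringing.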


Charlie

Wow, I’m so happy to see I’m not the only one with these same issues.

I have had these issues since 2005, yet everyone keeps telling me I’m at fault and there is no issue with the BAM software.

We also run CentOS and BAM, and we moved to Prizm with the same issue… it’s just not stable. Moto support is useless; no one seems to help. It’s great to finally see I’m not alone.

I was having the same issue. We have Prizm 2.0 managing 25 APs, 10 BHs, 10 CMMmicros, and over 500 SMs.

I have the APs, BHs, and CMMs set to poll data every 10 minutes. I did have the SMs set to poll every 30 minutes, which ran fine when we had about 400 SMs, but once we got near the 500 SM mark the server would stop registering SMs and only a reboot would fix it.

I then turned off polling for all the SMs and we haven’t had a registration issue since then.

The machine currently running Prizm is a low-end Celeron box; we will be upgrading to Prizm 2.1 on a 3.4GHz Pentium D with 3GB of RAM after I get back from Canopy training in April.

I’ll repost after I upgrade and re-enable polling of all our SMs to see if that fixes the issue. It’s really handy for tech support to be able to see the full history of the links and what has happened to the signal over time.