Bizarre DHCP problem last night....

Okay, here’s a stumper we ran into last night:

Right now, our customers are a mixture of 900 MHz and 5.7 GHz, all running 7.07. Some customers have static IPs; most run DHCP. Our DHCP server is hosted by another local ISP. Yesterday morning, ALL of our DHCP clients dropped out, while ALL of our static customers remained online.

ALL APs and ALL SMs appear to be functioning normally.

Here’s what we found:

1) At the central CMM2, I was able to plug a laptop into one of the open switch ports and get online as a DHCP client and pull an IP from our DHCP server, while no Canopy customers were able to do this. So, I concluded that my DHCP server is working correctly, and that I may have a bad switch in my CMM2.

2) I routed around my switch with another, external switch. Same problem: all DHCP customers still offline, but static customers remained fine. CMM2 switch eliminated.

3) We check to see if we possibly have a rogue server on our network. We can release/renew without getting some bizarre IP, so we are 99% confident we don’t have a rogue server.

4) After much frustration, we decide we’re just going to bite the bullet and issue all of our DHCP customers static IPs to get them online.

5) Before we can begin calling customers to issue them static IPs, however, our DHCP server all of a sudden begins handing out IP addresses!! What. The. Hell. :evil:
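The rogue-server check in step 3 amounts to asking whether a renewed lease falls inside the pool the operator expects. A minimal sketch of that test, assuming a hypothetical pool of 10.20.0.0/16 (the real pool is not given in this thread) and a made-up function name. A 169.254.x.x address is the self-assigned fallback many clients use when no DHCP server answers at all:

```python
import ipaddress

def classify_lease(address, expected_pool):
    """Classify an address a client ended up with after release/renew."""
    ip = ipaddress.ip_address(address)
    if ip.is_link_local:
        # 169.254.0.0/16: the client gave up on DHCP and assigned itself an address.
        return "no server answered (self-assigned address)"
    if ip in ipaddress.ip_network(expected_pool):
        return "lease from expected pool"
    return "unexpected address, possible rogue server"

# Hypothetical pool and address, for illustration only.
print(classify_lease("10.20.5.7", "10.20.0.0/16"))
```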

This morning, everything is fine – like nothing happened. Bizarre.

We called Motorola, and they say that since we are passing static IPs, that eliminates Canopy altogether. But… how was I able to plug into my CMM2 switch as a DHCP client and pull a valid IP??

We are up and running, but confused as hell. Motorola guys have no idea what might have happened, either. I thought I’d post this and go straight to the real world operators to see if you all can come up with some ideas/suggestions how we can prevent this from happening again. :frowning:

Thanks, everyone.

This is a bit of grasping at straws, but was there some issue with the broadcasts for DHCP requests somehow being lost by the Canopy kit? I’ve not seen anything like this with Canopy, but I have seen a failing switch lose its ability to forward broadcast and multicast traffic properly.

I don’t suppose you sniffed any packets whilst plugged into the CMM switch to see if the requests from the customers were making it that far? It does sound like the DHCP server was working - but was never being asked for addresses from the Canopy connected clients.
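What a sniffer on the CMM switch would be looking for is a DHCPDISCOVER broadcast from a customer's MAC address. A rough sketch of recognising one, assuming the raw UDP payload has already been pulled out of the frame and follows the BOOTP/DHCP layout in RFC 2131. The function names and the sample MAC are made up for illustration:

```python
import struct

MAGIC_COOKIE = b"\x63\x82\x53\x63"
MESSAGE_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
                 5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM"}

def parse_options(data):
    """Walk the code/length/value options that follow the magic cookie."""
    options = {}
    i = 0
    while i < len(data):
        code = data[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option, single byte
            i += 1
            continue
        if i + 1 >= len(data):   # truncated option
            break
        length = data[i + 1]
        options[code] = data[i + 2:i + 2 + length]
        i += 2 + length
    return options

def summarize_dhcp(payload):
    """Return (message type, client MAC, broadcast flag), or None if not DHCP."""
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        return None
    hlen = min(payload[2], 16)
    (flags,) = struct.unpack("!H", payload[10:12])
    mac = ":".join(f"{b:02x}" for b in payload[28:28 + hlen])
    raw_type = parse_options(payload[240:]).get(53)   # option 53: message type
    name = MESSAGE_TYPES.get(raw_type[0], "UNKNOWN") if raw_type else "UNKNOWN"
    return name, mac, bool(flags & 0x8000)

def build_discover(mac, xid):
    """Build a bare DHCPDISCOVER payload, only to exercise the parser."""
    header = struct.pack("!BBBBIHH", 1, 1, 6, 0, xid, 0, 0x8000)
    addresses = b"\x00" * 16              # ciaddr, yiaddr, siaddr, giaddr
    chaddr = mac + b"\x00" * 10           # client hardware address, padded to 16
    sname_and_file = b"\x00" * 192
    options = bytes([53, 1, 1, 255])      # message type = DISCOVER, then end
    return header + addresses + chaddr + sname_and_file + MAGIC_COOKIE + options

sample = build_discover(bytes.fromhex("0a003e123456"), 0x1234)
print(summarize_dhcp(sample))
```

If DISCOVERs like this show up at the switch from the laptop but never from a customer MAC, the requests are being lost somewhere on the radio side.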

Paul.

I didn’t run a packet sniffer, at least not yet. You are right, though: it seemed as if no broadcasts were going through to the DHCP clients. Since we’re up and running, I just ordered a backup PoE switch from NewEgg just in case I have to bypass our CMM2 switch :frowning:

Our next step will be to run a sniffer; I just haven’t done it yet.

Beware that normal Ethernet PoE is not the same as the way Motorola do power over Ethernet - so if you remove the CMM you’ll need to inject the power onto the wire with SM power supplies (or something similar).

I know, it’s a pain… :x

Paul.

Were the failed DHCP-supplied clients getting an address, mask, and gateway from the DHCP server, or did their request fail?

If they received a proper address, but lacked connectivity, how did you test connectivity? A failed ping to an address, or a Server Not Found message in a browser? Did you check DNS? If your notebook had statically assigned DNS entries, but the customers received theirs via DHCP – or the reverse – DNS problems could explain the symptoms.
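One way to separate a DNS failure from a loss of IP connectivity is to test a known IP address first and name resolution second. A minimal sketch, assuming a TCP connection to port 80 stands in for the ping test (an assumption, not what was done here). The function name is hypothetical, and the `resolve` and `connect` parameters exist only so the check can be exercised without a live network:

```python
import socket

def diagnose(hostname, ip, port=80, timeout=3.0,
             resolve=socket.gethostbyname, connect=socket.create_connection):
    """Report whether IP reachability, DNS, or neither is the problem."""
    try:
        # Step 1: reach a host by address, so DNS is not involved at all.
        connect((ip, port), timeout).close()
    except OSError:
        return "no IP connectivity"
    try:
        # Step 2: only now ask the configured resolver for a name.
        resolve(hostname)
    except OSError:
        return "IP works, DNS fails"
    return "IP and DNS both work"
```

Running it once on the notebook and once on an affected customer PC would show whether the two are getting different DNS settings.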

If their DHCP requests were failing behind an SM, but succeeding ahead of the AP, then you’re back to a sniffer. It’d be faster, however, to reboot the AP and an affected SM to see if the problem disappeared.

Teknix:

It was a failed request. The customers never received an address at all. Something we did try was to shut off APs, one by one, and see if it was isolated by sector – it wasn’t, and the problem stayed. What we also tried was to remove ALL APs from the CMM2, and then one by one, plug them back in… same results.

We tested connectivity via ping and browser message. If my notebook had a statically-assigned IP, everything was fine – all our customers who had static IPs also had no problems whatsoever.

Would the broadcast repeat count parameter in the APs have anything to do with this issue?

msmith wrote:
Would the broadcast repeat count parameter in the APs have anything to do with this issue?


Hmmmm...don't know. I don't believe we checked that. Which part of the expanded stats is this located in?

It is a config setting on the APs. The GUI says it takes a value from 0 to 2 as input. Mine are set to 2, which I think is the default.

When you said you plugged your laptop into the CMM and you were able to get an IP from DHCP, where did that broadcast traffic have to travel? Do you have a BH connected to the CMM which at the other end interfaces with your DHCP server? What type of DHCP server is it?

We have the DHCP server (Red Hat) plugged directly into the CMM switch – no BH is located before it. Broadcast repeat count is set to 2 (default).

That makes the problem a lot more interesting…hmmm.

Given that the DHCP server is plugged into the CMM switch, and the CMM switch clearly does work OK (a laptop plugged in gets an address from the server), it looks like the broadcasts from the clients weren’t making it to the server.

I think that if you see this again, a quick tcpdump on the Red Hat box is called for to see if it’s actually receiving the requests - my bet would be that it isn’t.
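A capture could be saved on the server with something like `tcpdump -i eth0 -w dhcp.pcap udp port 67 or udp port 68` (the interface and file names are assumptions) and then tallied offline. The sketch below counts DHCP message types per client MAC. It assumes a classic microsecond-resolution pcap file, untagged Ethernet frames and IPv4; the function names are hypothetical:

```python
import struct
from collections import Counter

DHCP_COOKIE = b"\x63\x82\x53\x63"

def iter_pcap_frames(data):
    """Yield each captured frame from the bytes of a classic pcap file."""
    if data[:4] == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif data[:4] == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")
    (linktype,) = struct.unpack(endian + "I", data[20:24])
    if linktype != 1:
        raise ValueError("this sketch only handles Ethernet captures")
    offset = 24                                   # skip the global header
    while offset + 16 <= len(data):
        _sec, _usec, caplen, _origlen = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        offset += 16
        yield data[offset:offset + caplen]
        offset += caplen

def dhcp_event(frame):
    """Return (client MAC, DHCP message type number), or None if not DHCP."""
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":   # IPv4 only
        return None
    ihl = (frame[14] & 0x0F) * 4
    if frame[23] != 17:                                   # UDP only
        return None
    udp = 14 + ihl
    if len(frame) < udp + 8:
        return None
    sport, dport = struct.unpack("!HH", frame[udp:udp + 4])
    if {sport, dport} - {67, 68}:
        return None
    dhcp = frame[udp + 8:]
    if len(dhcp) < 240 or dhcp[236:240] != DHCP_COOKIE:
        return None
    mac = ":".join(f"{b:02x}" for b in dhcp[28:34])       # chaddr field
    i = 240
    while i + 1 < len(dhcp) and dhcp[i] != 255:
        if dhcp[i] == 0:                                  # pad option
            i += 1
            continue
        length = dhcp[i + 1]
        if dhcp[i] == 53 and length >= 1 and i + 2 < len(dhcp):
            return mac, dhcp[i + 2]                       # 1=DISCOVER, 2=OFFER, ...
        i += 2 + length
    return None

def tally_dhcp(data):
    """Count (client MAC, message type) pairs seen in a capture."""
    counts = Counter()
    for frame in iter_pcap_frames(data):
        event = dhcp_event(frame)
        if event:
            counts[event] += 1
    return counts
```

Calling `tally_dhcp(open("dhcp.pcap", "rb").read())` would then show which MACs sent type 1 (DISCOVER). If only the laptop's MAC appears and no customer MACs do, the requests are not reaching the server.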

Doesn’t help find out why though. Perhaps some use of the packet capture on the AP might show you if the DHCP requests are getting that far.

Paul.

That’s going to be our next step: some kind of packet sniffer at the AP. We’ve been 100% up since everything magically fixed itself, so I may just have to wait until we’re down again to get a real-world view of this problem.