Some questions for others using PPPoE

Those of you using PPPoE: what hardware are you using for the server?

What do you have the MTU/MRU/MRRU set to on the server?

Do you change the MTU on the radios?

Do you have MSS clamping enabled on the radios? The server?

Any gotchas you have come across doing PPPoE?

I ask because our PPPoE server of 10 years died recently, and we have been having some odd issues since then related to PPPoE MTU/MRU, MSS clamping, etc. I set the old server up 10 years ago, and other than adding/changing some IP pools on it and the occasional OS update, I have not messed with it or touched it. So needless to say, I'm more than a little rusty when it comes to setting up a new PPPoE server. The old server created and emailed backups every night, but it was an old x86 PowerRouter and the new PPPoE server is a CCR1036-12G-4S (tile), so the configs could not be restored to the new server and I had to configure it from scratch (something I had not done for 10 years).


There is a basic PPPoE config article here, if that helps in any way.


Yeah, I looked at that the other day and tried setting things up accordingly, but it didn't help.


@brubble1 wrote:

Those of you using PPPoE: what hardware are you using for the server?

What do you have the MTU/MRU/MRRU set to on the server?

Do you change the MTU on the radios?

Do you have MSS clamping enabled on the radios? The server?

Mix of CCR and RB1100x2.  In short, RouterOS.

MTU/MRU 1492, MRRU disabled (even if you set it to something, if you are using ePMP CPEs as PPPoE clients it is doubtful they support MRRU, so setting it on the concentrator side is going to make no difference).

We bridge-mode our radios and leave their MTU at 1500.  Customer routers will do 1492 over PPP interface, and I'd wager most consumer routers implement MSS clamping on their side, too.

Yes to MSS clamping on server side, though if everything else is properly engineered (as in, no stupid ICMP filtering) it technically shouldn't be necessary.  The problems come when bad actors on the 'net break PMTUD through their bloody ignorance, and unfortunately there are (still) a lot of bad actors out there.
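For reference, the concentrator-side settings described above can be sketched in RouterOS terms roughly like this (the interface, profile, pool names, and addresses here are placeholders, not anyone's actual config):

```
# Sketch: PPPoE server with MTU/MRU 1492, MRRU left disabled,
# and MSS clamping enabled via the PPP profile.
/ip pool
add name=pppoe-pool ranges=10.0.0.2-10.0.0.254
/ppp profile
add name=pppoe-profile change-tcp-mss=yes local-address=10.0.0.1 \
    remote-address=pppoe-pool
/interface pppoe-server server
add service-name=service1 interface=ether3 max-mtu=1492 max-mru=1492 \
    default-profile=pppoe-profile disabled=no
```

Leaving `mrru` unset on the server keeps MRRU disabled, which matches the ePMP-client situation above.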

The old server created and emailed backups every night but was an old x86 Powerrouter and the new PPPoE server is a CCR1036-12G-4S(tile) so the configs could not be restored to the new PPPoE server and I had to configure it from scratch (something I had not done for 10 years).

So if it was an x86 PowerRouter, I take it that it was also running RouterOS, and that you didn't buy a PowerRouter only to wipe ROS off of it and replace it with something else (e.g. Linux with Roaring Penguin running atop)? If so, then of course you can restore the backups from your old router onto the new one. Not everything will restore correctly, but the only stuff that will need fixing is pretty much anything that maps to particular interfaces that are of course no longer present on the new hardware, so you'll have to go in and fix those kinds of things manually. But it's absolutely doable, and if you can't spot whatever config difference is causing your recent headaches when you compare your new router side-by-side to the backups you took from the old router, perhaps abandoning the new config you came up with and instead attempting to patch up the old config so that it works on the new router is the right way to go in this instance.

-- Nathan


Well, that's exactly how my old x86 (yes, RouterOS) was set up, except MSS clamping was just "default", but those settings sure don't work now. It looks like the most recent RouterOS update broke PMTUD on my own network and changed how MSS clamping works on the PPPoE server. I made heavy use of port groups on the switches, and the new OS took those away. So it looks like I need to learn the new way of doing things on the 'tik switches (split horizon bridges?), because this bridge/root port thing the upgrade replaced my port groups with broke stuff.

The backups from the old x86 were not txt exports; they were binary backups, not human-readable by anything I knew of to load them into. The CCR refused to load them at all, while an almost-as-old and much less powerful (than the old PowerRouter) x86 RouterOS machine we have here loaded them just fine (so we know the backups were good), but it couldn't handle the load.

ROS can't/won't break PMTUD unless you are defining some kind of explicit ICMP filtering in your IP > Firewall > Filter rules, in which case that wouldn't be ROS's fault, that would be you shooting yourself in your own foot...

I've never had a binary backup generated via "/system backup save" outright refuse to load on any ROS device.  It might have some missing stuff after loading the backup, sure, because not everything matches up hardware-wise between the old device and the new, but it will still *try* to put as many things back as possible.  What *exactly* happened when you tried to restore it on the CCR / what message(s) did it display?

If you *can* make the backup load onto a different x86 box, why don't you: 1) restore backup onto the slow x86, 2) upgrade that x86 box to same ROS version as your CCR afterwards, 3) generate a new backup on that x86 box under the new software, 4) try to load that backup on your CCR?  Failing that, run "/export compact file=<blah>" on the x86 box after backup restore and software upgrade, which will generate a human-readable "backup" as "<blah>.rsc".  Unlike a binary backup, it absolutely will not restore (commands that reference interfaces that don't exist will cause the script to stop executing at that point), but you can use it as a point of reference for configuration.
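In RouterOS commands, the sequence suggested above would look roughly like this (file names are placeholders; the upgrade step in the middle is done via the usual package upload/reboot):

```
# On the spare x86 box: restore the old binary backup
# (the box reboots with the restored config).
/system backup load name=old-powerrouter.backup

# ...then upgrade that box's ROS to match the CCR, and afterwards
# take both a fresh binary backup and a human-readable export:
/system backup save name=migrated
/export compact file=migrated-config

# Copy migrated.backup over to the CCR and attempt:
/system backup load name=migrated.backup
```

The `/export compact` output (`migrated-config.rsc`) is the reference-only fallback: as noted above, it won't restore cleanly across different hardware, but you can read it.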

Also, this leads me to ask: what ROS version was running on the PowerRouter before it kicked the bucket?

We are running latest 6.x on our PPPoE servers and having none of the issues you are bringing up.

MSS clamping behavior was changed/updated in newer versions, several years ago. With MSS clamping enabled, very very old ROS (I want to say 3.x or possibly 4.x? ancient history) would generate dynamic IP > Firewall > Mangle rules that changed the MSS on EVERY TCP SYN it forwarded, regardless of what the original MSS was. This is Bad(tm) because there would be scenarios where it would accidentally "upclamp" instead of only ever "downclamping" (e.g., a device on the internet actually happens to have a lower MTU than your customer has, sends your customer a TCP SYN/ACK with an MSS of say 1260, the 'tik indiscriminately changes the MSS to 1452 before it hits your customer, so your customer's equipment thinks the other side has an MTU of 1492 instead of an MTU of 1300). ROS since at LEAST v5.x (and possibly older) has changed this behavior so that it changes the MSS if *and only if* the original MSS is higher than the MSS value you are trying to clamp to, which is exactly the right thing to do.
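The modern "downclamp only" behavior is equivalent to what you would write by hand in a mangle rule: match only SYN packets whose advertised MSS already exceeds the target. A sketch (the numbers assume a 1492-byte PPP MTU, so MSS = 1492 - 40 bytes of IP+TCP headers = 1452; `all-ppp` matches any PPP-type interface):

```
# Clamp MSS to 1452, but only on SYNs that advertise more than 1452 --
# a peer that advertised a smaller MSS is left alone, never "upclamped".
/ip firewall mangle
add chain=forward protocol=tcp tcp-flags=syn tcp-mss=1453-65535 \
    action=change-mss new-mss=1452 out-interface=all-ppp
add chain=forward protocol=tcp tcp-flags=syn tcp-mss=1453-65535 \
    action=change-mss new-mss=1452 in-interface=all-ppp
```

The `tcp-mss=1453-65535` match range is what prevents the upclamping scenario described above.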

As far as bridging changes go, I have no idea what you are talking about re: "port groups", which sounds to me more like a hardware switch thing than a software bridge thing.  STP "root ports" and "split horizon" have existed since ROS 3.x, and the last release in the 3.x line was released in 2009!  This *really* makes me wonder what you were running before...

-- Nathan

"ROS can't/won't break PMTUD"

So here is what I see. Fiber customers connect to a Calix C7, which connects to a port on the CCR router with a PPPoE server bound to that port just for them. On another port of the CCR, all of the wireless customers connect and have a PPPoE server bound to that port just for them. Both servers (PPP > PPPoE Servers) are configured IDENTICALLY. One is called Wireless and the other is called FTTH, one is bound to Port 3 and the other to Port 4, both have MTU/MRU set to 1492, MSS clamping is set to "default"; every single setting is identical between the two servers. There are no firewall rules of any kind on any of the switches, and the only firewall rules on the PPPoE server are there to stop brute-force connection attempts.

New router running 6.42.2 (old one was running 6.32) and switches updated from 6.32 to 6.42.2. Fiber customers, who pass through not a single MT switch, are unfazed and continue to work perfectly as they always have. Wireless customers, who all pass through updated MT switches, suddenly and without exception cannot access much of the internet at all, and what they can access is very slow with massive packet loss. Turning MSS clamping on on their radios appears to fix it. What changed other than the software on the switches? Even with MSS clamping turned on, the radios sporadically can't access much of the internet (always the same sites: Netflix and Yahoo mail load but won't run, etc., while eBay, Google, MSN, and YouTube all run great). Anything that forces the radio to drop and reconnect to the PPPoE server seems to correct the problem until it happens again, which may be hours or days later. Also, several customers who use VPNs to connect to work/home office/whatever started complaining about constant VPN disconnects, packet loss, and slow speeds.

Setting the MTU/MRU on the Wireless PPPoE server to 1480/1480 (it has been 1492 since the beginning of time) AND having MSS clamping turned on on the radios appears to remedy or greatly reduce the sporadic "can't reach much of the internet" problem. The VPN users have so far reported "still get dropped once in a while, which almost never used to happen, but running much better than it was over the last week or so." Note that in an attempt to figure out this problem, I had even set a few customers to bridge mode and set their Netgear/Belkin/Linksys/crap wifi routers to do PPPoE, and some of them STILL had the random "can't reach much of the internet" problem and just had to start rebooting their routers instead of our radios.

So FTTH, crossing no MT switches, is still running like a charm. Wireless customers, crossing MT switches, are forced to turn on MSS clamping and set the PPPoE server MTU/MRU down to 1480.

As far as the backup goes: when attempting to load the x86 backup onto the CCR (and I tried on an RB1200 also), it simply didn't load. It went through the motions and even rebooted, but not a single thing was changed, nothing. I did export the restored config from the other x86 and load it onto the CCR, but there was so much that just wasn't taking that I decided setting it up from scratch was better/faster than trying to hack and slash the config file into working.

Regarding port groups: since every single MikroTik switch comes from the factory with either 1 or 2 master ports configured and the other ports slaved to that master (in groups, if there is more than one master port), and this is how it has been all the way up until 6.41, I have no idea how you could not know what I'm talking about.

Mikrotik Wiki example:

This example requires a group of switched ports. Assume that all ports used in this example are in one switch group configured with master-port setting.

# pre-v6.41 master-port configuration

/interface ethernet
set ether6 master-port=ether2
set ether7 master-port=ether2
set ether8 master-port=ether2
set ether9 master-port=ether2
set ether10 master-port=ether2

# post-v6.41 bridge hw-offload configuration

/interface bridge
add name=bridge1 igmp-snooping=no protocol-mode=none
/interface bridge port
add bridge=bridge1 interface=ether2 hw=yes
add bridge=bridge1 interface=ether6 hw=yes
add bridge=bridge1 interface=ether7 hw=yes
add bridge=bridge1 interface=ether8 hw=yes
add bridge=bridge1 interface=ether9 hw=yes
add bridge=bridge1 interface=ether10 hw=yes


Okay, so by "port groups" you WERE talking about hardware switching, not software bridging.

One of the problems here seems to be that either you omitted important details or didn't explain things well the first time, and maybe think that the rest of us can somehow read minds?  Your two separate topics that you started here only talked about a PPPoE server, one that started out as a "PowerRouter"-brand x86 box, and which recently was replaced with a CCR.  Your posts mentioned nothing about MikroTik switches in between the wireless network and your CCR.  So when you started talking about "port groups" on your "MikroTik" in a later post, and the only MikroTiks that have been mentioned so far are an x86 one and a CCR, neither of which has a built-in switch chip, yes: I did find that extremely confusing.

So, yes, MT did change the way that the hardware switching functionality is exposed in the UI for models that have switch chips.  In the past, ports that were "slaves" to a "master" port could not both be members of a hardware switch port group AND members of a bridge, for hopefully obvious reasons.  They apparently decided to collapse the switching and bridging functionality UI-wise into the "bridge" configuration interface, and to replicate the switch config with the new UI, you enable "hardware offload" as you have already shown.

I have not done much testing with the new software's switching support, so perhaps you have already hit upon part of the issue?  Maybe it has zero to do with the PPPoE server MikroTik and something to do with the switches in between?  What if you backlevel the RouterOS on the switches to what they were running before?  Is there a reason that you decided to upgrade them in the first place?  The ROS on the switches and the ROS on the CCR doing your PPPoE do not have to match.

Also, troubleshooting is most effective when you are only changing one variable at a time.  In this case, it sounds like 3 variables were all changed at roughly the same time:

1) The PPPoE server hardware

2) The OS version on the PPPoE server

3) The OS version on the switches

Since 1 is non-negotiable (given that you experienced a hardware death), what if you tried to not change 2 or 3?  Since the x86 box and the switches were running 6.32 until recently, and that seemed to be working for you, backlevel BOTH the switches AND the CCR to 6.32.  If the problem is gone, then change ONE THING: either the CCR software or the switch software, and see if that breaks things again.  If running 6.32 across the board still doesn't resolve things and you are 100% positive that everything is configured the way it was back when things were working, then at that point you can conclude that it is somehow related to the x86 > CCR change (though how or why exactly, I can't fathom).  If 6.32 works but then upgrading past that on either the CCR or the switches breaks things again, then you're at least closer to knowing the rough location of the problem.
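For the backleveling itself, RouterOS has a standard package-downgrade procedure: upload the older combined package for the platform (e.g. the 6.32 `routeros-tile` npk for the CCR, `routeros-x86` for x86) to the device's Files list, then:

```
# After uploading the older routeros-*.npk to Files:
/system package downgrade
# The router reboots and installs the older package it finds in Files.
```

This is the generic procedure, not something specific to this setup; back up the current config first, since a downgrade can drop settings that only exist in the newer version.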

If backleveling software is not an option for whatever reason, or if you went through those steps and now have a better idea of whether the switch OS upgrade is to blame or the CCR upgrade is to blame and want to solve the root cause, perhaps at this point I might suggest 2 other troubleshooting steps, neither one of which will fix the problem, but will perhaps get you closer to the answer:


1. The next time a customer is impacted by the issue, have them try to send ICMP Echo down their PPPoE tunnel towards a host that will respond to it, with the do-not-fragment bit set, and have them ratchet the packet size up or down until they find the various breaking points.  IF there is a "large packet transport" problem, then there should be three distinct results:

a) They get a proper response

b) They get no response

c) They get a response from a host downstream (most likely their CPE) that the packet is too big and it can't fragment it because DF (don't-fragment) is set.

If you have MTU/MRU set at 1480, then they should be able to ping a host out on the internet with unfragmented packets up to 1480 in size and still get responses back.  If there is a point at which they stop getting responses BEFORE they hit 1480, then you have a problem with larger packets getting through for some reason.

Note that this is not necessarily a "PMTUD" issue.  PMTUD issue is one where a host that sits between two network segments with different MTUs does NOT send the proper ICMP response ("packet too big, can't fragment") to notify the sender even if it knows about the discrepancy.  In this case, the problem may be that the path MTU of the circuit is SUPPOSED to be 1480, but something in the middle is not reliably able to forward packets that large for *whatever* reason, even if it should be able to.  It's simply not possible to predictably "discover" that condition in software...the fault is not with the path MTU discovery process but with whatever gear is supposed to be transporting those large frames.

Note that the ping implementation in various operating systems can be quite different, and so what they interpret you to mean by "packet size" can vary.  Some (e.g., MikroTik's built-in ping) take "size" to mean the entirety of the IP packet, headers and all.  Others (e.g., Windows ping) do NOT count either the IP *OR* ICMP header size in the value...IPv4 header is typically 20 bytes, and ICMP is 8 bytes, so if you tell Windows ping to send a 1500-byte ping, it actually attempts to send a 1528-byte ping; thus 1472 really == 1500, 1464 == 1492, etc.  And even others (e.g., seemingly many *nix-like platforms) take into account the ICMP header but not the IP header, etc.  So you need to be familiar with the ping utility you are using or are asking your customers to use in the course of troubleshooting.

Here's an example of me pinging from a MikroTik that is acting as a PPPoE client:

[admin@MikroTik] > /ping size=1492 do-not-fragment              
HOST                                     SIZE TTL TIME  STATUS
                                         1492  59 21ms
                                         1492  59 25ms
                                         1492  59 60ms
                                         1492  59 24ms
    sent=4 received=4 packet-loss=0% min-rtt=21ms avg-rtt=32ms max-rtt=60ms 
[admin@MikroTik] > /ping size=1493 do-not-fragment  
HOST                                     SIZE TTL TIME  STATUS                                                   
                                                        packet too large and cannot be fragmented                
aaa.bbb.yyy.zzz                           576  64 10ms  fragmentation needed and DF set                          
    sent=1 received=0 packet-loss=100% 

(where aaa.bbb.yyy.zzz in the second example is the IP address it was assigned on the PPPoE tunnel, so the MikroTik here is replying to itself: it itself has the lowest MTU along the entire path from it to Facebook, on the PPPoE interface)

If some kind of artificial packet-size cutoff (lower than the PPPoE MTU) is discovered with a customer who is actively being impacted, and then they reboot their CPE and magically that cutoff doesn't appear in the ping test results anymore, it's hard not to conclude that the ePMP CPE is somehow a variable in this equation...


2. To conclusively prove that (or at least test whether) the issue is not with any of the ePMP gear, it seems like you should be able to plug something that has the ability to act as a PPPoE client *directly* into a port on one of the switches that an ePMP AP also plugs into.  Take a router of some kind (preferably one with remote access...maybe a cheap little MikroTik?) up to one of your AP sites, plug it in to the same switch that the AP is plugged into, have it make a PPPoE connection, leave it running for a while, and see if it starts suffering from the same problem that your wireless customers are experiencing.  If it does, well, then you can discount ePMP as being a factor.
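A minimal test-client config for a small MikroTik plugged into that switch would be something like this (the interface name and credentials are placeholders; use a real account from your PPPoE secrets):

```
# Hypothetical test PPPoE client plugged into the same switch as the AP.
/interface pppoe-client
add name=pppoe-test interface=ether1 user=testuser password=testpass \
    add-default-route=no use-peer-dns=no disabled=no

# Once connected, run the same DF ping sweep down the tunnel, e.g.:
/ping 8.8.8.8 size=1480 do-not-fragment
```

With `add-default-route=no`, the test box won't hijack its own management path, and you can leave it up for days and re-run the sweep whenever customers report the problem.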

Hope at least some of this helps to inspire some new troubleshooting ideas, or gives you new places to look,

-- Nathan


If I remember correctly, we had some odd issues with PPPoE on ePMP clients connected to newer RouterOS versions (like 6.40 and above, or something like that), which we resolved by enabling MSS clamping on the server.