Users suddenly can't access half the internet; rebooting the radio fixes it.

Since setting up a new PPPoE server (the old one died), we have a problem we didn't have before. Customers will suddenly find that they cannot reach Netflix (even the website netflix.com will not load) or Yahoo Mail (yahoo.com loads slowly but mail.yahoo.com will not load at all), and fast.com loads but the speed test never starts. They can go to hulu.com or amazon.com, but no show or movie will ever load; it just sits there like it's buffering. At the same time they can go to ebay.com, google.com, Gmail, MSN, and YouTube (and play YouTube videos just fine).

Reboot the radio and they can go everywhere again.

Something notable: before the old PPPoE server crashed, in the 10+ years we have been doing PPPoE, we never had MSS Clamping turned on on the radios (starting with the old Canopy, then Ubiquiti, and now a mix of FTTH, Canopy, 450i and ePMP). It's off by default and everything always worked, so I never messed with turning it on.

Immediately after we got the new PPPoE server up and running, half the internet didn't work for anyone. People could go to MSN, YouTube, eBay, Google, all that, but just as above, no Netflix, no Yahoo Mail, and fast.com would load but not run. Turning on MSS Clamping fixed it, mostly: it still happens randomly even though everyone has MSS Clamping turned on now. If you turn MSS Clamping off, you lose half the internet. So I'm pretty sure this is an MTU/MRU PPPoE problem of some kind, but I've packet captured and Wiresharked myself silly and can't figure out what the problem is.
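For anyone following along, here is a minimal sketch of the arithmetic behind the MTU/MSS suspicion. It assumes plain IPv4 with no IP or TCP options; the function names are made up for illustration. The point is that a host advertising the usual Ethernet MSS of 1460 produces 1500-byte packets, which do not fit in a 1492-byte PPPoE link unless something clamps the MSS down to 1452. One common explanation for "only some sites break" is that those sites never see (or ignore) the ICMP "fragmentation needed" replies, but that is a guess here, not something the captures have confirmed.

```python
# Sketch of the PPPoE MTU/MSS arithmetic (IPv4, no IP or TCP options assumed).

ETHERNET_MTU = 1500   # standard Ethernet payload size
PPPOE_OVERHEAD = 8    # 6-byte PPPoE header + 2-byte PPP protocol ID
IPV4_HEADER = 20      # minimum IPv4 header, no options
TCP_HEADER = 20       # minimum TCP header, no options


def pppoe_mtu(ethernet_mtu: int = ETHERNET_MTU) -> int:
    """Largest IP packet that fits inside one PPPoE frame."""
    return ethernet_mtu - PPPOE_OVERHEAD


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def packet_size(mss: int) -> int:
    """Size of a full IPv4/TCP packet carrying `mss` bytes of payload."""
    return mss + IPV4_HEADER + TCP_HEADER


link_mtu = pppoe_mtu()                   # 1492
unclamped = mss_for_mtu(ETHERNET_MTU)    # 1460, what an Ethernet host advertises
clamped = mss_for_mtu(link_mtu)          # 1452, what clamping should rewrite it to

print("PPPoE link MTU:", link_mtu)
print("Unclamped MSS:", unclamped, "-> packet", packet_size(unclamped),
      "fits" if packet_size(unclamped) <= link_mtu else "does NOT fit")
print("Clamped MSS:  ", clamped, "-> packet", packet_size(clamped),
      "fits" if packet_size(clamped) <= link_mtu else "does NOT fit")
```

In a capture, the thing to look for would be the MSS option in the SYN and SYN/ACK packets on the far side of the PPPoE server: if it still reads 1460 there, the clamp is not being applied on that path.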

Maybe an MSS clamping implementation bug in the ePMP CPE firmware?

If so, and since it worked for you in the past without enabling it customer-side, have you tried enabling it on the PPPoE server and letting the server do the heavy lifting instead of the CPE?

-- Nathan

There was in fact a change in the way MikroTiks handle MSS Clamping in recent updates, and it for sure affected ePMP radios, but I don't know why. For now, turning on MSS Clamping and changing the MTU/MRU (on the PPPoE server) to 1480 (it worked at 1492 for 10 years) seems to have greatly reduced the "suddenly can only go to half the internet" problem but has not completely gotten rid of it. We are also having issues with customers who work through a VPN complaining that they get dropped several times a day now or that their throughput changes drastically throughout the day. Changing the MTU/MRU to 1480 helped some VPN users but seems to have made it worse for others... I don't know what is going on... weirdest, most frustrating problem ever...
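Since 1480 versus 1492 is partly guesswork at this point, one way to pin it down is to measure the largest packet that actually gets through from a customer's side. Below is a rough sketch of a binary search over packet sizes. `probe` is a hypothetical callable that returns True if a packet of that total IP size passes without fragmentation; the `fake_probe` helper only simulates a link so the logic can be checked. A real probe could wrap something like a don't-fragment ping (on Linux, `ping -M do -s <payload>`), where the ICMP payload is the packet size minus 28 bytes.

```python
from typing import Callable

ICMP_ECHO_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP echo header
IPV4_TCP_HEADERS = 40    # 20-byte IPv4 header + 20-byte TCP header, no options


def find_path_mtu(probe: Callable[[int], bool],
                  low: int = 576, high: int = 1500) -> int:
    """Return the largest size in [low, high] for which probe(size) is True.

    Assumes probe(low) succeeds and that any size smaller than a working
    size also works (no intermittent loss), which a real link may violate.
    """
    if not probe(low):
        raise ValueError(f"even {low}-byte packets do not pass")
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always progresses
        if probe(mid):
            low = mid                 # mid passes, answer is mid or larger
        else:
            high = mid - 1            # mid fails, answer is smaller
    return low


def fake_probe(limit: int) -> Callable[[int], bool]:
    """Hypothetical stand-in for a real DF-bit ping: passes sizes <= limit."""
    return lambda size: size <= limit


mtu = find_path_mtu(fake_probe(1480))
print("Path MTU:", mtu)
print("Ping payload at that size:", mtu - ICMP_ECHO_OVERHEAD)
print("Matching TCP MSS:", mtu - IPV4_TCP_HEADERS)
```

Running this from behind a radio while a customer is in the "half the internet" state, and again after a reboot, would show whether the usable packet size is actually changing or whether something else is going on. For VPN users the tunnel adds its own overhead on top of PPPoE, so the number that works inside the tunnel will be smaller than the one measured outside it.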