Random customers can't reach 1/2 the internet doing PPPoE on ePMP

Been going on for 5 or 6 months now.

A completely random problem that usually affects a single ePMP customer. It behaves exactly like the MTU / MSS clamping / fragmented-packet problems we have seen in the past.

An ePMP customer calls and reports the internet is down.
Some investigation reveals that roughly half the internet is down. No Netflix, no openspeedtest.com; fast.com loads but won’t run; YouTube works but is very, very slow with lots of buffering; some websites are reachable and work fine, while others are unreachable or very slow to load. It’s always the same websites/services, and they are sites we know will not work if packets are being fragmented.
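For anyone not familiar with why MTU/MSS matters here, this is a rough sketch of the arithmetic. It assumes a standard 1500-byte Ethernet MTU, IPv4 with no options and TCP with no options; the function names are just for illustration and are not anything from the ePMP or the PPPoE server.

```python
# Header sizes in bytes (assumptions: IPv4 without options, TCP without options).
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol field
IPV4_HEADER = 20
TCP_HEADER = 20


def pppoe_mtu(ethernet_mtu: int = 1500) -> int:
    """MTU left for IP packets once PPPoE encapsulation is added."""
    return ethernet_mtu - PPPOE_OVERHEAD


def clamped_mss(link_mtu: int) -> int:
    """TCP MSS that lets a full-size segment fit in one packet on this link."""
    return link_mtu - IPV4_HEADER - TCP_HEADER


def segment_fits(mss: int, link_mtu: int) -> bool:
    """True if a full TCP segment of this MSS fits without fragmentation."""
    return mss + IPV4_HEADER + TCP_HEADER <= link_mtu


if __name__ == "__main__":
    mtu = pppoe_mtu()                # 1492 on a 1500-byte Ethernet
    print(mtu, clamped_mss(mtu))     # 1492 1452
    print(segment_fits(1460, mtu))   # False: an unclamped 1460 MSS is too big
```

If clamping is not actually being applied, a server that sends full 1460-byte segments and never sees the "fragmentation needed" ICMP reply will look exactly like a site that loads partially or not at all.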

The problem appears to come and go randomly. It can last hours or days and then suddenly start working on its own. Usually, once a customer has this problem, they continue to have it on and off until we bridge their radio and move PPPoE to their device.

Bridging the ePMP radio and moving the PPPoE to the customer’s wifi router (or other device) permanently resolves the problem every time.

Customers behind our old Canopy PMP 100 and 450i radios, which are configured just like the ePMP, do not have this problem. Neither do our GPON customers, but GPON customers provide their own device/router to do PPPoE.

Power cycling or rebooting the radio does not fix it.

I spent several hours at a customer’s home a while back.

(1) Connecting my tablet directly to the radio, I confirmed I could not reach all the same places the customer couldn’t, so the problem wasn’t their network.

(2) Connected to my VPN at the NOC, where the PPPoE server is located. While connected to the VPN, the internet works fine and I can load/access all websites/services. So with a VPN tunnel inside the PPPoE tunnel, the internet works fine; with no VPN and just the PPPoE tunnel, it does not.

(3) Bridged the radio and used a PPPoE client on my tablet. Everything works fine.

(4) Set the radio back to doing PPPoE (it had been rebooted several times at this point) and confirmed I still could not reach much of the internet.

(5) Used ping to confirm that packets had to be much smaller than expected in order not to be fragmented with an MTU of 1480 (I think around 1280; I don’t remember exactly, but it was smaller than it should have been).

(6) Changed the MTU on the PPPoE server to smaller and smaller values. Each time I changed the MTU the PPPoE server was set to use, I rebooted the radio so it would connect with the new MTU, and confirmed the new MTU was being used. I tried going back up to 1492, then 1000, 500 and I think even 250. Nothing fixed it.

(7) I removed the IP the customer was being handed from the pool and rebooted so the radio/PPPoE session got a new IP, but it didn’t help.

(8) I changed the default Ethernet MTU on the radio from 1538 to match the PPPoE MTU, then to less than the PPPoE MTU, then to the size of the largest packet I could send without fragmenting. Nothing affected the problem.

(9) I enabled and disabled MSS clamping in the radio’s PPPoE settings. I also changed the MTU in the radio’s PPPoE settings to the same as the PPPoE MTU, then to smaller than the PPPoE MTU. No help.

(10) Upgraded firmware from 4.4.3 to… I think 4.6RC something, which was the most recent available at the time. Nothing.

(11) Downgraded firmware to 4.something else (I don’t remember which). Didn’t help.

(12) I logged into the PPPoE server, created an entirely new PPPoE server with an entirely new profile, then specified that server by name so the radio would use it. It did, and the problem was unchanged.

(13) Gave up, bridged the radio and set up PPPoE on the customer’s router. Problem solved; haven’t heard from them since.
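For the ping test in step (5), here is roughly how the largest unfragmented size can be pinned down more precisely next time. This is only a sketch: `probe` is a hypothetical stand-in for a real don't-fragment ping (on Linux something like `ping -M do -s SIZE host`, on Windows `ping -f -l SIZE host`), and the simulated path MTU of 1308 is a made-up number chosen only to resemble the roughly 1280 I remember, not a measurement.

```python
from typing import Callable

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def find_largest_payload(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary-search the largest payload size for which probe(size) succeeds.

    Assumes that if a size passes, every smaller size passes too.
    Returns low - 1 if even the smallest size fails.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best = mid       # mid got through, try something larger
            low = mid + 1
        else:
            high = mid - 1   # mid was dropped, try something smaller
    return best


def simulated_probe(size: int, path_mtu: int = 1308) -> bool:
    """Hypothetical probe: pretends the path silently limits packets to path_mtu."""
    return size + IPV4_HEADER + ICMP_HEADER <= path_mtu


if __name__ == "__main__":
    expected = max_ping_payload(1480)
    measured = find_largest_payload(simulated_probe, 0, expected)
    print(f"expected payload {expected}, measured payload {measured}")
    print(f"effective path MTU {measured + IPV4_HEADER + ICMP_HEADER}")
```

With an MTU of 1480, a payload of 1452 bytes should get through with don't-fragment set. If the search bottoms out well below that, something between the radio and the PPPoE server is eating larger packets regardless of the negotiated MTU.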

Yesterday a customer called with the same thing: an F200 radio running 4.4.3. I upgraded the radio to 4.6.0.1 and I thought it was fixed! It works! Wooooo... until 8:30 that night, when the customer called: internet down again. Same thing. I went out this morning, worked with it for an hour or so, tried down/upgrading firmware… no joy. So my previous upgrade that I thought fixed it just coincided with it randomly deciding to start working.

Bridged their radio, set PPPoE up on their router, and they are working again.

I’m at a complete loss here. I would assume this was a MikroTik problem, except no other devices are having this problem AT ALL. Not the Canopy PMP 100 or the 450i, which are both configured just like the ePMP, located on the same towers, crossing the same switches and connecting to the same PPPoE server. Not the hundreds of customer-owned routers/devices of different brands doing PPPoE on the GPON, and not the growing number of devices behind now-bridged ePMP radios.

Any ideas what I can try? Or something else I can check? Or has anyone had this same problem and fixed it?