Okay, so by "port groups" you WERE talking about hardware switching, not software bridging.
Part of the problem here seems to be that you either omitted important details or didn't explain things well the first time, and perhaps assumed the rest of us could somehow read minds. The two topics you started here only talked about a PPPoE server: one that started out as a "PowerRouter"-brand x86 box and was recently replaced with a CCR. Your posts mentioned nothing about MikroTik switches sitting between the wireless network and your CCR. So when you started talking about "port groups" on your "MikroTik" in a later post, and the only MikroTiks mentioned so far were an x86 box and a CCR, neither of which has a built-in switch chip, yes: I found that extremely confusing.
So, yes, MT did change the way that the hardware switching functionality is exposed in the UI for models that have switch chips. In the past, ports that were "slaves" to a "master" port could not both be members of a hardware switch port group AND members of a bridge, for hopefully obvious reasons. They apparently decided to collapse the switching and bridging functionality UI-wise into the "bridge" configuration interface, and to replicate the switch config with the new UI, you enable "hardware offload" as you have already shown.
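For anyone following along, the before/after looks roughly like this on the CLI (just a sketch; the interface names here are examples, not your actual port layout):
Old style (pre-6.41), slave ports hanging off a master port:
/interface ethernet set ether3 master-port=ether2
/interface ethernet set ether4 master-port=ether2
New style (6.41+), everything is a bridge port with hardware offload enabled:
/interface bridge add name=bridge1
/interface bridge port add bridge=bridge1 interface=ether2 hw=yes
/interface bridge port add bridge=bridge1 interface=ether3 hw=yes
/interface bridge port add bridge=bridge1 interface=ether4 hw=yes
If I remember correctly, /interface bridge port print marks ports that are actually being offloaded to the switch chip with an "H" flag, so after an upgrade it's worth checking that the flag really shows up on the ports you expect.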
I have not done much testing with the new software's switching support, so perhaps you have already hit upon part of the issue? Maybe it has nothing to do with the PPPoE server MikroTik and everything to do with the switches in between? What if you backlevel the RouterOS on the switches to whatever they were running before? Is there a reason you decided to upgrade them in the first place? The ROS on the switches and the ROS on the CCR doing your PPPoE do not have to match.
Also, troubleshooting is most effective when you are only changing one variable at a time. In this case, it sounds like 3 variables were all changed at roughly the same time:
1) The PPPoE server hardware
2) The OS version on the PPPoE server
3) The OS version on the switches
Since 1 is non-negotiable (given that you experienced a hardware death), what if you tried not changing 2 or 3? Since the x86 box and the switches were running 6.32 until recently, and that seemed to be working for you, backlevel BOTH the switches AND the CCR to 6.32. If the problem is gone, then change ONE THING: either the CCR software or the switch software, and see if that breaks things again. If running 6.32 across the board still doesn't resolve things, and you are 100% positive that everything is configured the way it was back when things were working, then at that point you can conclude that it is somehow related to the x86 > CCR change (though how or why exactly, I can't fathom). If 6.32 works but upgrading past it on either the CCR or the switches breaks things again, then you're at least closer to knowing the rough location of the problem.
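For reference, the mechanics of backleveling are simple enough (a sketch, assuming 6.32 is the target; the exact routeros-*.npk file name depends on each board's architecture):
Upload the matching 6.32 routeros-*.npk package to the router's file list (Winbox drag-and-drop, FTP, whatever), then:
/system package print
/system package downgrade
The router asks for confirmation and reboots into the older version. One caveat I'd watch for: I would not assume that switch/bridge config created under the new unified bridge UI converts back cleanly to the old master-port style, so have your old exports/backups handy and double-check the switch config after the downgrade.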
If backleveling software is not an option for whatever reason, or if you went through those steps and now have a better idea of whether the switch OS upgrade or the CCR upgrade is to blame and want to chase the root cause, then at this point I might suggest 2 other troubleshooting steps, neither of which will fix the problem by itself, but which may get you closer to the answer:
1. The next time a customer is impacted by the issue, have them try to send ICMP Echo down their PPPoE tunnel towards a host that will respond to it, with the do-not-fragment bit set, and have them ratchet the packet size up or down until they find the various breaking points. IF there is a "large packet transport" problem, then there should be three distinct results:
a) They get a proper response
b) They get no response
c) They get a response from a host downstream (most likely their CPE) that the packet is too big and it can't fragment it because DF (don't-fragment) is set.
If you have MTU/MRU set at 1480, then they should be able to ping a host out on the internet with unfragmented packets up to 1480 in size and still get responses back. If there is a point at which they stop getting responses BEFORE they hit 1480, then you have a problem with larger packets getting through for some reason.
Note that this is not necessarily a "PMTUD" issue. A PMTUD issue is one where a host that sits between two network segments with different MTUs does NOT send the proper ICMP response ("packet too big, can't fragment") to notify the sender, even though it knows about the discrepancy. In this case, the problem may be that the path MTU of the circuit is SUPPOSED to be 1480, but something in the middle is not reliably able to forward packets that large for *whatever* reason, even though it should be able to. It's simply not possible to predictably "discover" that condition in software... the fault is not with the path MTU discovery process but with whatever gear is supposed to be transporting those large frames.
Note that the ping implementation in various operating systems can be quite different, and so what they interpret you to mean by "packet size" can vary. Some (e.g., MikroTik's built-in ping) take "size" to mean the entirety of the IP packet, headers and all. Others (e.g., Windows ping) do NOT count either the IP *OR* the ICMP header in the value... the IPv4 header is typically 20 bytes and the ICMP header is 8 bytes, so if you tell Windows ping to send a 1500-byte ping, it actually attempts to send a 1528-byte packet; thus 1472 really == 1500, 1464 == 1492, etc. And still others (e.g., seemingly many *nix-like platforms) take into account the ICMP header but not the IP header, and so on. So you need to be familiar with the ping utility you are using, or are asking your customers to use, in the course of troubleshooting.
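So, assuming your PPPoE MTU really is 1480, these should be roughly equivalent tests (8.8.8.8 is just an example of a host that answers pings; use whatever target you like):
On a MikroTik: /ping 8.8.8.8 size=1480 do-not-fragment
On a Windows customer PC: ping -f -l 1452 8.8.8.8 (1452 data + 8 ICMP + 20 IP = 1480 on the wire)
Bump the Windows size to -l 1453 (1481 total) and the customer should see a "Packet needs to be fragmented but DF set" message (result c above) instead of a normal reply; if instead the replies just silently stop at some size well below 1452, that's result b, and you have a large-packet transport problem.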
Here's an example of me pinging from a MikroTik that is acting as a PPPoE client:
[admin@MikroTik] > /ping www.facebook.com size=1492 do-not-fragment
HOST SIZE TTL TIME STATUS
aaa.bbb.ccc.ddd 1492 59 21ms
aaa.bbb.ccc.ddd 1492 59 25ms
aaa.bbb.ccc.ddd 1492 59 60ms
aaa.bbb.ccc.ddd 1492 59 24ms
sent=4 received=4 packet-loss=0% min-rtt=21ms avg-rtt=32ms max-rtt=60ms
[admin@MikroTik] > /ping www.facebook.com size=1493 do-not-fragment
HOST SIZE TTL TIME STATUS
packet too large and cannot be fragmented
aaa.bbb.yyy.zzz 576 64 10ms fragmentation needed and DF set
sent=1 received=0 packet-loss=100%
(where aaa.bbb.yyy.zzz in the second example is the IP address it was assigned on the PPPoE tunnel, so the MikroTik here is replying to itself: it itself has the lowest MTU along the entire path from it to Facebook, on the PPPoE interface)
If some kind of artificial packet-size cutoff (lower than the PPPoE MTU) is discovered with a customer who is actively being impacted, and then they reboot their CPE and magically that cutoff doesn't appear in the ping test results anymore, it's hard not to conclude that the ePMP CPE is somehow a variable in this equation...
2. To conclusively prove that (or at least test whether) the issue is not with any of the ePMP gear, it seems like you should be able to plug something that has the ability to act as a PPPoE client *directly* into a port on one of the switches that an ePMP AP also plugs into. Take a router of some kind (preferably one with remote access...maybe a cheap little MikroTik?) up to one of your AP sites, plug it in to the same switch that the AP is plugged into, have it make a PPPoE connection, leave it running for a while, and see if it starts suffering from the same problem that your wireless customers are experiencing. If it does, well, then you can discount ePMP as being a factor.
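A rough sketch of what the test box's PPPoE config could look like (the account name, password, and interface here are placeholders; use whatever test credentials make sense on your network):
/interface pppoe-client add name=pppoe-test interface=ether1 user=testuser password=changeme add-default-route=yes disabled=no
/interface pppoe-client monitor pppoe-test once
/ping 8.8.8.8 size=1480 do-not-fragment count=20
Leave it connected for a while and re-run the size-1480 DF ping now and then (or wrap it in a /system scheduler script if you want it hands-off), and see whether it ever starts behaving the way your affected customers describe.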
Hope at least some of this helps to inspire some new troubleshooting ideas, or gives you new places to look,