Even on ATmega32U4 (8-bit "old-style" AVR), the native USB peripheral has 832 bytes of dual-ported RAM, but to the AVR core, it's more like a configurable set of FIFOs dedicated for USB.
(That is quite a lot, considering the chip has only 2560 bytes of SRAM, but it is how it can sustain about 1 Mbyte/s over USB, or about 8 Mbit/s of payload data.)
The most RAM variants I know of on a single microcontroller are definitely on the Teensy 4.1: normal built-in RAM (OCRAM), tightly-coupled data memory (DTCM), tightly-coupled instruction memory (ITCM), and external RAM (I add 8 or 16 Mbytes of PSRAM). Plus there are 32k of instruction cache and 32k of data cache, which are transparent to the code (except for the cache eviction and prefetch instructions one can use to get the most out of them). All with only one core.
Fortunately, it has a single 32-bit address space, and even PSRAM is accessed without any shenanigans in the very same address space. Aside from timing differences and architectural details when doing shenanigans (DMA and FlexIO support functions), one doesn't really need to worry about which kind of RAM a given object lives in. (There are even two GPIO peripherals to choose from for each I/O pin, depending on which interconnect bus you want to use.)
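To make the "single address space" point concrete, here is a minimal sketch in C that tells which kind of memory a 32-bit address falls into. The base addresses and window sizes are my recollection of the i.MX RT1062 memory map as used on Teensy 4.1, so treat them as assumptions and check the reference manual before relying on them; the names `teensy41_region` and `enum region` are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum region { REGION_OTHER, REGION_ITCM, REGION_DTCM, REGION_OCRAM,
              REGION_FLASH, REGION_PSRAM };

/* Assumed address windows; verify against the i.MX RT1062 reference manual.
   ITCM and DTCM are both carved out of the same 512k of FlexRAM. */
static const struct {
    uint32_t    base;
    uint32_t    size;
    enum region id;
} windows[] = {
    { 0x00000000u, 0x00080000u, REGION_ITCM  },  /* tightly-coupled code */
    { 0x20000000u, 0x00080000u, REGION_DTCM  },  /* tightly-coupled data */
    { 0x20200000u, 0x00080000u, REGION_OCRAM },  /* normal on-chip RAM   */
    { 0x60000000u, 0x00800000u, REGION_FLASH },  /* 8 MiB program Flash  */
    { 0x70000000u, 0x01000000u, REGION_PSRAM },  /* up to 16 MiB PSRAM   */
};

/* Classify an address. The unsigned subtraction wraps around for addresses
   below the base, so a single comparison checks both ends of the window. */
enum region teensy41_region(uint32_t addr)
{
    size_t i;
    for (i = 0; i < sizeof windows / sizeof windows[0]; i++)
        if (addr - windows[i].base < windows[i].size)
            return windows[i].id;
    return REGION_OTHER;
}
```

On the Teensy itself, something like `teensy41_region((uint32_t)(uintptr_t)&some_buffer)` should tell where the linker put an object. As far as I know, Teensyduino provides the `DMAMEM`, `EXTMEM`, and `FASTRUN` qualifiers to steer objects and functions into OCRAM, PSRAM, and ITCM respectively, while ordinary variables end up in DTCM.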
(One reason I like Teensy 4.1 so much is that for USD $42, you can get a Cortex-M7 with hardware FP (32- and 64-bit) running at 600 MHz with 17 MiB (512k + 512k + 8M + 8M = 17,825,792 bytes) of RAM, 8 MiB of Flash, "lockable" with run-time code encryption/decryption support, 100Base-T Ethernet, two high-speed (480 Mbit/s) native USB interfaces (with support for one Host and one Client port currently), almost 50 usable I/O pins with lots of peripherals, and a pretty good Teensyduino support library for development in the Arduino environment with sources at GitHub; plus a forum where the developer participates actively. Very nice for hobbyists like me. The one proprietary part is the programming/bootloader chip.)
I'm pretty sure the RAM model for tile processors (like XMOS xCore) has even more parts, because there each core has some local RAM, and on top of that there are one or more variously shared RAM regions.
On architectures with virtual memory support, we can go an order of magnitude more complicated. Over a decade ago, I created an example of how to use memory-mapped files in Linux for manipulating a terabyte-sized (1,099,511,627,776 bytes) object in memory, as long as you have sufficient disk storage, with speed limited only by the storage-RAM bandwidth. Not to mention things like mapping the same RAM pages twice consecutively, so that you can treat a cyclic buffer (whose size is a multiple of the page size) as always having linear data (assuming cache coherency allows this safely; often it does not).
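Both tricks fit in a few dozen lines of C on Linux. What follows is a rough sketch written for this post, not the original example: `map_sparse_file` and `ring_map` are made-up names, the ring buffer uses an unlinked temporary file in `/tmp` as its backing store (a real implementation might prefer `memfd_create()` or `shm_open()`), and a 64-bit system is assumed so that huge sizes fit in `size_t` and `off_t`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* for MAP_ANONYMOUS, MAP_NORESERVE, mkstemp() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Set the file at 'path' to 'size' bytes (sparse where it grows) and map it
   read-write. Storage is only consumed for pages actually written to.
   Returns NULL on failure; release with munmap(ptr, size). */
void *map_sparse_file(const char *path, size_t size)
{
    void *p;
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return NULL;
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_NORESERVE, fd, 0);
    close(fd);          /* the mapping stays valid after close() */
    return (p == MAP_FAILED) ? NULL : p;
}

/* Map the same 'size' bytes twice, back to back, so that ring[i] and
   ring[i + size] are the same byte. 'size' must be a nonzero multiple of
   the page size. Returns NULL on failure; release with munmap(ring, 2*size). */
unsigned char *ring_map(size_t size)
{
    char tmpl[] = "/tmp/ring-XXXXXX";   /* assumption: /tmp is writable */
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *base;
    int fd;

    if (page <= 0 || size == 0 || size % (size_t)page != 0)
        return NULL;
    fd = mkstemp(tmpl);
    if (fd == -1)
        return NULL;
    unlink(tmpl);       /* the file disappears once it is unmapped */
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return NULL;
    }
    /* Reserve 2*size of contiguous address space, then overlay both halves
       with shared mappings of the same file contents. */
    base = mmap(NULL, 2 * size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    if (mmap(base, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
        mmap(base + size, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
        munmap(base, 2 * size);
        close(fd);
        return NULL;
    }
    close(fd);
    return base;
}
```

With `ring_map(n)`, a record that starts near the end of the buffer can be read or written with a single `memcpy()` that runs past index `n`, because the second mapping mirrors the first. With `map_sparse_file()`, passing `(size_t)1 << 40` asks for the full terabyte; whether that succeeds depends on the filesystem's maximum file size and any address space limits in effect.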
There definitely is a limit as to how deep one should dive into the details, before actually needing the info and applying it in practice.
I personally only bother to remember the key ideas, and how the different features can be useful; and just look up the details whenever I need them (for example to write this post).
(That is also why I prefer written text over face-to-face for technical discussions. Face-to-face, I'm always repeating "oh wait, I recall I had read something pertinent about that... now, what was it again... lemme quickly look it up and review before we continue". The exception is when we're doing research or experimenting on a new-to-all thing, so that everyone is interested in those other references; then it's fun and pretty effective, too. Definitely does not impress the administrative types, though!)