I don't think you'll see hybrids take over. The barrier to entry has always been higher, and there's been no indication that it will fall. (It would be one thing to have a cheap 3D-printed process that includes pick-and-place and die bondout -- that would at least facilitate prototypes -- but quite another to have a pattern-printing process that's actually production-worthy and just as cheap to set up and operate!)
The trend is well underway already: if you're going to go for integration, don't half-ass it, go all the way. Remember the Pentium Pro, in that rectangular package, which literally had the extra L2 cache die strapped in alongside the CPU? (With the ceramic package, it's kind of on topic. Of course, multi-die packages have practically always been around, and aren't about to go away.) More and more CPU-adjacent parts have been integrated; today, the "North Bridge" has been absorbed completely, so that the CPU itself includes the bus and DRAM controllers (and, in recent parts, the PCIe root complex too).
I don't need to say 'soon', because we already have SoCs of supercomputer capability* for just a few watts, peripherals included on chip. Multicore ARM Cortex-A9s and the like at 800MHz+ are astonishing power, for practically peanuts in both money and electricity. (Not that all of those are fully integrated with every single peripheral, let alone RAM/ROM. But those are available, usually in the scaled-down models.)
*Comparable to, like, an early-model Cray or something like that. '80s era, when supercomputing meant one or a few really freaking beefy and fast CPUs.
So far, top-of-the-line desktop/workstation CPUs haven't reached that level of integration, but there could be many reasons for that. We're talking serious die area, so monolithic yields would be bad. (Polylithic yields might be bad too, but you aren't staking everything on one chip -- see the back-of-envelope below. That said, they've also been making them configurable after test, to fuse off derpy cores or memory segments, which recovers still more yield.) Processes aren't necessarily optimal for putting high-density RAM (let alone Flash and the rest) on chip. And power dissipation is huge and disparate, so it doesn't make sense to put everything side by side, let alone stacked on top of each other.
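As a rough illustration of the yield argument -- a minimal sketch, assuming a simple Poisson defect model (yield = exp(-area x defect density)) and made-up numbers for area and defect density, with dice tested before assembly (known good die):

    from math import exp

    # Illustrative assumptions, not real process data:
    A = 6.0    # total silicon area needed, cm^2 (a big monolithic chip)
    D = 0.5    # defect density, defects/cm^2

    def silicon_per_good_system(k):
        """Raw silicon fabbed per working system, splitting area A over k dice.

        Dice are tested before assembly, so only good ones get packaged:
        each good die of area A/k costs (A/k) * exp(A*D/k) of raw silicon
        on average under the Poisson model.
        """
        return k * (A / k) * exp(A * D / k)

    for k in (1, 2, 4, 8):
        y = exp(-A * D / k)  # per-die yield
        print(f"{k} dice: per-die yield {y:.1%}, "
              f"silicon per good system {silicon_per_good_system(k):.2f} cm^2")

Splitting the same total area over more dice doesn't change the odds that any given square centimeter has a defect; the win is that a defect scraps one small die instead of the whole assembly, so the silicon cost per good system drops steeply (from ~120 cm^2 monolithic to ~9 cm^2 at eight dice, with these made-up numbers).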
Which reminds me, I'd like to see full-scale development of that "lossless logic" (adiabatic, or charge-recovery, logic is the usual name for it, I believe) that was talked about some years ago. There are a few approaches, I think. The one I had read about was a system of synchronous, quadrature (or otherwise phased) switches, which connect (or don't) the switching nodes (which have capacitance) to a main commutation rail, which is always ramping up or down. The commutation rail therefore sees a nearly pure capacitive load (which could be driven by a tuned clock oscillator, recovering the charge each cycle), and very little power is spent forcing gate capacitances to transition. The figures given (in the paper I'm thinking of, where some logic -- a multiplier, I think -- was fabricated at MOSIS) were ~3x lower density and >10x power savings, which seems worthwhile.
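The energy argument, as a back-of-envelope sketch (all component values below are assumed for illustration, not from that paper): hard-switching a node capacitance C through a switch resistance R burns CV^2 per cycle no matter how small R is, while ramping the rail over a time T >> RC dissipates only about (RC/T)*CV^2 per transition, which shrinks as the ramp slows.

    # Conventional vs. adiabatic (ramped-rail) charging of one node.
    # Illustrative assumptions:
    R = 1e3      # switch on-resistance, ohms
    C = 10e-15   # node capacitance, farads (~10 fF gate node)
    V = 1.8      # logic swing, volts
    T = 10e-9    # rail ramp time, seconds (must be >> R*C for the approximation)

    # Conventional CMOS: hard-switching dissipates (1/2)CV^2 in the switch on
    # charge and another (1/2)CV^2 on discharge -- CV^2 per full cycle,
    # independent of R.
    E_conventional = C * V**2

    # Adiabatic: ramping the rail over T makes the charging current ~CV/T,
    # so dissipation is I^2*R*T = (RC/T)*CV^2 per transition (for T >> RC).
    E_adiabatic = 2 * (R * C / T) * C * V**2   # x2 for charge + discharge

    print(f"conventional: {E_conventional:.3e} J/cycle")
    print(f"adiabatic:    {E_adiabatic:.3e} J/cycle")
    print(f"savings:      {E_conventional / E_adiabatic:.0f}x")

With these numbers the ideal savings come out around 500x; presumably the reason the paper's measured figure is >10x rather than anything like that is that the resonant clock driver, the phased switch overhead, and leakage all eat into the recovered energy.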
Tim