He he, welcome to power supply design :-)
Big picture comment: to design a power supply means juggling heat, voltage, current, stability, accuracy, speed, sometimes cost, and functionality, among other things. It's a lot to juggle.
I get that you like to push the envelope. That's great! One of the great things about at least *investigating* something others don't usually do is that you will find out why they don't do it. Then, if you are lucky, you'll think outside the box and come up with something others haven't thought of. I think that's called "progress" :-) If you are not so lucky, you'll have to relax a constraint or two, but at least you'll know *why* you're doing so. You will also gain a real appreciation for just how great some of the designs you'll see really are (and how crappy some others are).
In dealing with heat, there are the usual suspects:
1) Paralleling devices *does* work--lots of commercial PSUs parallel a few devices; the HP 36xx series I mentioned earlier paralleled four pass devices. (I mention that series because the schematics are available online.) You still need a (usually big) heat sink, but paralleling spreads the load across more heat sink area, and each device (obviously) needs to dissipate less heat. The issue with paralleling devices is that you need to take steps to get them to share the current equally. Some devices (BJTs) conduct more as they warm up (Vbe drops with temperature), so in a parallel group one will inevitably carry a bit more current than the others, heat up more, pass still more current, and you get thermal runaway. MOSFETs typically *raise* their resistance as they heat up, so you don't tend to get thermal runaway. There are techniques for dealing with BJTs in parallel, but that's outside my area of knowledge, so you'd have to research that one on your own. MOSFETs are paralleled all the time--for example, the controllers for electric vehicles (many kW) have been doing it for years.
2) You can switch to a pass device with a lower Rtheta. Rtheta isn't a searchable characteristic on websites like Digikey, but you can use the power dissipation rating as a surrogate. MOSFETs typically have a lower Rtheta, sometimes *very low* (0.05 C/W is the lowest I've seen). I don't recall what the lowest Rtheta is for BJTs. Just to give you a reference point, the 3055 mentioned by K (the TO-3 part, not an SMD one) is about 1.52 C/W junction-to-case, and the datasheet claims you can run it at 200C (I was looking at the one from ON Semiconductor).
3) There are pre-regulators and there are pre-regulators. I agree wholeheartedly with the decision to not tackle a "real" pre-regulator for now, but you can use a simple one: the "high low" switch. Transformers today typically do not have a bunch of taps like they did in "the old days". However, many transformers *do* come with one "tap"--the secondaries can be wired in series or parallel. Use a DPDT switch to switch the transformer leads from series to parallel when you want low voltage--it halves the raw voltage ahead of the pass device(s), so at low output voltages the drop across them, and with it the power dissipation, is cut by half or more.
4) Don't know if you've gone down this path yet, but you can stitch together (with vias) top side and bottom side PCB copper to get increased dissipation. I've never gone that route, but lots of designs for other things do, so the info is likely out there regarding Rtheta of that construction. I suspect it wouldn't be sufficient for 40W, however. You could also try to solder some small copper "fins" to the SMD's pad to decrease Rtheta.
5) You could solder an SMD pass device to a sheet of copper and get a lower Rtheta than you'll get on a PCB, but I doubt you'd get as low as you need to, and it probably defeats the whole reason you're considering SMD in the first place. Just to continue the thought, however, soldering to copper sheets is done with lighting LEDs all the time. There are online calculators for *simple* (flat plate) heat sinks. You can also search Digikey for heatsinks to get a *rough* idea of how much heat can be dissipated by the various types, but the manufacturers often use extremely optimistic numbers; I'm not a passive heatsink guru (though I've done my share of heat calcs for solid bars of various materials), but I'd suggest cutting their claimed dissipation by half. The idea here is that you'd make a copper heatsink shaped like a commercial aluminum one and use the aluminum one's Rtheta to guide your design. You can also use the flat plate of copper as a heat "spreader" and fasten that to a larger aluminum heat sink. In the end, you'd probably have to do a prototype of just the transistor and see how hot it really gets. The bottom line, of course, is that PCB copper can only dissipate a certain amount of heat--period.
6) Forced air; yeah, I know :-( But it really gets rid of a lot of heat....
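To put rough numbers on the heat options above, here is a minimal Python sketch of the usual steady-state thermal chain: the heat sink sits above ambient by total power times its Rtheta, and each junction sits above the sink by its own share of the power times (Rtheta j-c + Rtheta case-to-sink). The 40W load and the 1.52 C/W j-c figure come from the discussion above; the 0.5 C/W insulator, the 1.0 C/W heat sink, the 25C ambient, and the function name are assumptions picked only for illustration. It also assumes the paralleled devices share current perfectly, which is exactly the part you have to work for.

```python
def junction_temp_c(p_total_w, n_devices, r_jc, r_cs, r_sa, t_ambient_c=25.0):
    """Estimate steady-state junction temperature (C) for n identical devices
    mounted on ONE shared heat sink, assuming perfect current sharing.

    r_jc: junction-to-case Rtheta per device (C/W)
    r_cs: case-to-sink Rtheta per device, e.g. an insulating pad (C/W)
    r_sa: sink-to-ambient Rtheta of the shared heat sink (C/W)
    """
    if n_devices < 1:
        raise ValueError("need at least one device")
    p_each = p_total_w / n_devices
    # The sink carries the total power, no matter how many devices feed it.
    t_sink = t_ambient_c + p_total_w * r_sa
    # Each junction only carries its own share above the sink temperature.
    return t_sink + p_each * (r_jc + r_cs)


if __name__ == "__main__":
    # Assumed example values: 40 W total, 0.5 C/W pad, 1.0 C/W heat sink.
    for n in (1, 2, 4):
        tj = junction_temp_c(40.0, n, r_jc=1.52, r_cs=0.5, r_sa=1.0)
        print(f"{n} device(s): Tj is roughly {tj:.1f} C")
```

With these made-up values one device lands around 146C and four land around 85C. Note that the heat sink term doesn't shrink when you add devices, so paralleling only helps with the junction-to-sink part of the rise.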
Of course, you can use a combination of the above. Sometimes that can get you where you need to go without making too many sacrifices. E.g., a high-low switch and a different pass device. Or you find a pass device with a higher junction temperature rating. And/or you decide to run right at the junction temp rating (*not* recommended for a whole bunch of reasons).
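As a rough illustration of what a high-low switch buys you, here is a small Python sketch that sweeps the output voltage at full current and finds the worst-case pass device dissipation, (Vin - Vout) x Iout, with and without the switch. Every number in it is an assumption: 20V raw DC with the secondaries in series (10V in parallel), 2A maximum, and 2V of headroom for the regulator. It ignores ripple and transformer sag, and the function names are hypothetical.

```python
def pass_dissipation_w(v_in, v_out, i_out):
    """Power burned in a linear pass device (W)."""
    return (v_in - v_out) * i_out


def pick_input(v_out, v_hi, headroom):
    """Use the low (parallel) range whenever it still leaves enough headroom."""
    v_lo = v_hi / 2
    return v_lo if v_out + headroom <= v_lo else v_hi


def worst_case_w(v_hi, i_max, headroom, use_switch):
    """Sweep Vout in 0.1 V steps at full current and return the peak dissipation."""
    worst = 0.0
    max_tenths = int(round((v_hi - headroom) * 10))
    for tenths in range(max_tenths + 1):
        v_out = tenths / 10
        v_in = pick_input(v_out, v_hi, headroom) if use_switch else v_hi
        worst = max(worst, pass_dissipation_w(v_in, v_out, i_max))
    return worst


if __name__ == "__main__":
    # Assumed example: 20 V raw in series, 2 A max, 2 V headroom.
    print("no switch :", worst_case_w(20.0, 2.0, 2.0, use_switch=False), "W")
    print("with switch:", round(worst_case_w(20.0, 2.0, 2.0, use_switch=True), 1), "W")
```

With these assumed numbers the switch takes the worst case from 40W to about 24W. The new worst case sits just above the switchover point rather than at 0V, so the saving over the whole range is a bit less than a clean half.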
You may find yourself making some unpalatable design decisions here....as I said at the outset, I went back and forth on pass devices many times before I settled on one. Part of it was heat--MOSFETs were great for heat, terrible for stability; BJTs were the opposite. When I thought I had heat licked using MOSFETs, I fought instability until I went nuts. Then I went back to BJTs and fought with heat until I went nuts. I looked at *a lot* of data sheets :-)
--Steve
Edit: corrected Rtheta j-c for 3055 due to misreading the datasheet....