That surprises me. I have looked at RAS and CAS on the scope before and seen swings almost all the way to Vdd and Vss. (Used the 2GHz Siglent scope we had, with a soldered-on test lead.) Is this driver dependent? It certainly seems like the Zynq uses most/all of the swing for address/ctrl. Not sure about the data lines, as there are no easily accessible test pads on the current board (an omission I will look to correct.)
I don't have the equipment to check this myself (I would love to, but it's too damn expensive!), so I'm going off what the specs and manufacturers' appnotes say. Unfortunately, I couldn't find the source of that 1mW number in my documents library, so take it as unconfirmed for now; I can't remember exactly where I got it from. The driver probably does swing all the way to 0 or Vddq, but termination is at the receiving end, so the swing there isn't going to be as dramatic due to transmission line losses, at least during operation. DDR3L memory module datasheets only require a 0.1-0.175 V offset from Vref to register a valid level, so one can use lower drive levels and still have a working interface while reducing reflections. The Zynq, of course, can only drive memory at 533 MHz max, which is a far cry from the 933 MHz top JEDEC spec, so its drivers are not super-high strength, which actually helps with termination-less connections.
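To put rough numbers on that margin, here is a minimal sketch. It assumes the nominal DDR3L Vddq of 1.35 V and Vref = Vddq/2, and takes the 0.1-0.175 V offsets quoted above at face value; the function name is made up for illustration.

```python
def input_thresholds(vddq, offset):
    """Return (max valid low, min valid high) in volts, assuming Vref = Vddq / 2."""
    vref = vddq / 2.0
    return vref - offset, vref + offset


VDDQ = 1.35  # assumed nominal DDR3L I/O supply, volts

for offset in (0.100, 0.175):  # offsets from Vref quoted in the post
    v_low, v_high = input_thresholds(VDDQ, offset)
    print(f"offset {offset:.3f} V: low <= {v_low:.3f} V, high >= {v_high:.3f} V, "
          f"minimum swing {v_high - v_low:.2f} V on a {VDDQ:.2f} V rail")
```

So even with the larger offset, the receiver only needs about 0.35 V of swing centred on Vref, which leaves plenty of room to back off the drive strength.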
Can't, unfortunately; our stack depends entirely on things like Linux networking, file I/O, caching etc. That's a huge amount to rewrite, and not worth it for a tiny power saving. Linux does have decent power management on Zynq (it can use cpuidle and suspend), but we can't do some of the more advanced stuff, like switching DRAM to self refresh in the idle state, without modifying the kernel. Our application does require the CPU to be available most of the time, so I don't know that it would make much difference.
In this case I'd suggest inspecting your kernel/services config with a magnifying glass and disabling everything you don't need; there is a lot of stuff in there that can potentially consume CPU resources. Also, like I said above, try downclocking everything and see if you get any benefit at all (every SoC has a "sweet spot" mode in which it is most efficient, and that is typically neither the top clock frequency nor the lowest one).
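If you do try the downclocking experiment, one way to compare operating points is energy per unit of work rather than raw power. A minimal sketch, assuming you log the average board power and the time to finish a fixed workload at each clock setting; the numbers below are placeholders to show the arithmetic, not measurements.

```python
def energy_per_task(power_w, runtime_s):
    """Energy in joules needed to finish one fixed workload."""
    return power_w * runtime_s


def sweet_spot(points):
    """points: list of (freq_mhz, avg_power_w, runtime_s) tuples.

    Returns the tuple with the lowest energy per task.
    """
    return min(points, key=lambda p: energy_per_task(p[1], p[2]))


# Placeholder values only; replace with your own measurements.
measurements = [
    (222, 1.10, 30.0),
    (444, 1.40, 16.0),
    (667, 2.00, 12.0),
]

best = sweet_spot(measurements)
print(f"lowest energy per task at {best[0]} MHz: "
      f"{energy_per_task(best[1], best[2]):.1f} J")
```

If the CPU can't idle much anyway, the same comparison works with average power at a fixed throughput instead of energy per task.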
On the hardware side, you might want to have a look at your PDS design and see if you can increase its efficiency as well. Most DC-DC designs involve trade-offs of solution size vs cost vs efficiency, so you can gain some efficiency by using a lower-ESR inductor (these are typically larger and/or more expensive). You can also use switchers which support margining and bring all rails closer to the lower end of the spec, or even lower if combined with downclocking.
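For a feel of what those two knobs are worth, here is a rough sketch using first-order textbook relations: the loss in the inductor's series resistance is I^2 * R, and dynamic CMOS power scales roughly with V^2 at a fixed clock (leakage and converter switching losses are ignored). The current, resistance and rail values are illustrative assumptions, not taken from any particular design.

```python
def inductor_conduction_loss(i_rms_a, r_series_ohm):
    """Resistive loss in the inductor winding, in watts (I^2 * R)."""
    return i_rms_a ** 2 * r_series_ohm


def dynamic_power_ratio(v_new, v_nominal):
    """First-order ratio of dynamic power after trimming a rail (P ~ V^2)."""
    return (v_new / v_nominal) ** 2


# Assumed example: 1.5 A rail, swapping a 100 mOhm inductor for a 30 mOhm one.
loss_before = inductor_conduction_loss(1.5, 0.100)
loss_after = inductor_conduction_loss(1.5, 0.030)
print(f"winding loss saved: {(loss_before - loss_after) * 1000:.1f} mW")

# Assumed example: 1.00 V core rail margined down to 0.95 V.
ratio = dynamic_power_ratio(0.95, 1.00)
print(f"dynamic power after trim: {ratio * 100:.2f}% of nominal")
```

With those assumed numbers, a 5% voltage trim is worth a bit under 10% of the dynamic power on that rail, which is why margining is worth a look before swapping parts.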
I also looked at an LPDDR2 datasheet, and it looks like the current consumption is about the same as for a DDR3L module of the same capacity, so I don't think it's going to be worth it. But of course, if you decide to prototype it anyway to confirm, please let us know, as hard facts always trump any theories.