If you don't need constant-current drive, NXP's PCA9685 is a cheap PWM driver with internal clock and I2C control.
This. TI's TLC59116 is also I2C, with 16 channels of 8-bit PWM plus a brightness control for the entire chip. It's designed for this sort of application and has enough address pins to put 8 of them on the same bus. The outputs are constant-current sinks (one external resistor sets the current for the whole chip, so you don't need a resistor per LED), and the part works on 5V and 3.3V. Since the outputs only sink current, use common-anode bicolor LEDs.
They also have a global "All Call" address, so you can send a command to all of them simultaneously.
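For what it's worth, here is a rough sketch of what a broadcast brightness write might look like. It only builds the address and payload bytes and doesn't talk to any hardware. The All Call address (0x68, 7-bit) and the group brightness register number (0x12) are from my reading of the TLC59116 datasheet, so treat them as assumptions and check them before use. The function name is made up for illustration.

```python
# Assumed values: verify both against the TLC59116 datasheet.
ALLCALL_ADDR_7BIT = 0x68   # assumed default All Call address (7-bit form)
REG_GRPPWM = 0x12          # assumed group brightness (GRPPWM) register


def allcall_brightness_write(level: int) -> tuple[int, bytes]:
    """Return (i2c_address, payload) for a broadcast group-brightness write.

    Hypothetical helper: hand the result to whatever I2C write call
    your platform provides.
    """
    if not 0 <= level <= 255:
        raise ValueError("brightness level must fit in 8 bits")
    # Payload is the register number followed by the new value.
    return ALLCALL_ADDR_7BIT, bytes([REG_GRPPWM, level])


addr, payload = allcall_brightness_write(128)
print(hex(addr), payload.hex())
```

As far as I remember, the channels also have to be put in the "individual plus group" output mode for the group brightness value to have any effect, so the chip needs that setup once at startup.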
In my opinion, it is much more elegant than bit-banging PWM on a shift register, and it takes fewer MCU cycles.
I have used them myself. Right now they are about $3 each in quantities of 10.
What Mike suggests, multiplexing the PWM driver, sounds like a good idea.
I'd compromise. Personally, I'd pick the TLC59116 and use 4 of them to handle the 64 LEDs. Now what about the 2 colors? I'd multiplex them with PNP transistors on the high side, one per color. That means I'd need common-cathode LEDs, with each LED's cathode on a driver output.
That keeps things simple when you want it to be monochrome. When you want it to be bi-color, it's still relatively easy because the multiplexing is not hard on you or the MCU. You're only multiplexing the 2 colors, not 8 banks of LEDs.
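To make the two-phase idea concrete, here is a minimal sketch of one refresh cycle in Python. It assumes 4 driver chips with 16 channels each and one switch per color. The callbacks `write_chip`, `select_color` and `hold` are hypothetical stand-ins for your own I2C write, transistor control and delay code, not part of any real library.

```python
from typing import Callable, Optional, Sequence

CHIPS = 4
CHANNELS_PER_CHIP = 16
NUM_LEDS = CHIPS * CHANNELS_PER_CHIP  # 64 LEDs, one driver channel each


def split_by_chip(levels: Sequence[int]) -> list[list[int]]:
    """Split 64 brightness values into 4 groups of 16, one per chip."""
    if len(levels) != NUM_LEDS:
        raise ValueError(f"expected {NUM_LEDS} brightness values")
    return [list(levels[i:i + CHANNELS_PER_CHIP])
            for i in range(0, NUM_LEDS, CHANNELS_PER_CHIP)]


def refresh_once(
    red: Sequence[int],
    green: Sequence[int],
    write_chip: Callable[[int, list[int]], None],     # hypothetical: PWM write to one chip
    select_color: Callable[[Optional[str]], None],    # hypothetical: drives the color switches
    hold: Callable[[], None],                         # hypothetical: dwell time per phase
) -> None:
    """Run one red phase and one green phase."""
    for color, levels in (("red", red), ("green", green)):
        select_color(None)  # both colors off while the drivers are updated
        for chip, values in enumerate(split_by_chip(levels)):
            write_chip(chip, values)
        select_color(color)  # switch on only the color just loaded
        hold()               # keep it lit before moving to the other color
```

For monochrome you would skip the loop entirely: load one color once and leave its transistor switched on.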