All this shifting and masking, extracting 3-bit fields that span bytes, and reassembling doesn't look like much less work than just sampling a bit and then toggling a GPIO twice with a few NOPs in between (which an 8 MHz AVR can do no problem).
Assuming you want the CPU to do other things too, you would need to prepare a buffer, fill it with data, start the transfer and let it sail, which frees the CPU.
For every bit you want to send, you need to:
- with PWM: select one of the two duty cycles and store the result
- with UART: if the bit is 0, OR a mask into the output byte, and store that byte after every 3 bits
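To make the per-bit work concrete, here is a rough sketch of both buffer fills in C. The duty-cycle values, the UART bit masks and the function names are placeholders I made up for illustration; the real numbers depend on your timer period, the UART frame format and whether the line is inverted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical duty-cycle values for a "0" and a "1" pulse. */
#define DUTY_ZERO 3u
#define DUTY_ONE  6u

/* PWM: one duty-cycle entry per data bit, MSB first.
 * dst must have room for 8 * n entries. Returns entries written. */
size_t encode_pwm(const uint8_t *src, size_t n, uint8_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++)
        for (int bit = 7; bit >= 0; bit--)
            dst[out++] = ((src[i] >> bit) & 1u) ? DUTY_ONE : DUTY_ZERO;
    return out;
}

/* UART: three data bits share one output byte.
 * The base pattern and the per-slot masks are placeholders. */
#define UART_BASE 0x00u
static const uint8_t uart_mask[3] = { 0x01u, 0x08u, 0x40u };

/* dst must have room for (8 * n + 2) / 3 bytes. Returns bytes written. */
size_t encode_uart(const uint8_t *src, size_t n, uint8_t *dst)
{
    size_t out = 0;
    unsigned slot = 0;
    uint8_t acc = UART_BASE;

    for (size_t i = 0; i < n; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (((src[i] >> bit) & 1u) == 0)
                acc |= uart_mask[slot];   /* "if the bit is 0 then OR" */
            if (++slot == 3) {            /* store after every 3 bits */
                dst[out++] = acc;
                acc = UART_BASE;
                slot = 0;
            }
        }
    }
    if (slot != 0)
        dst[out++] = acc;  /* partial last group; unused slots stay un-ORed */
    return out;
}
```

The PWM loop is a select and a store per bit. The UART loop adds a conditional OR, a slot counter and a 3-bit grouping that does not line up with byte boundaries, which is where the extra work comes from.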
If you want to save processing time, generating PWM buffers will be somewhat faster than generating UART buffers (unless there are too many bus wait cycles).
If you want to prefill more buffers, UART will let you store 3 times more data in the same memory (6 times if the PWM duty cycle is 16 bits wide).
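The 3x and 6x figures fall out of simple arithmetic, sketched below. The helper names are made up; the only assumptions are one duty-cycle entry per data bit for PWM and three data bits per UART byte.

```c
#include <stddef.h>

/* PWM: one duty-cycle entry (1 or 2 bytes wide) per data bit. */
size_t pwm_buf_bytes(size_t data_bytes, size_t duty_width_bytes)
{
    return data_bytes * 8u * duty_width_bytes;
}

/* UART: three data bits per output byte, rounded up. */
size_t uart_buf_bytes(size_t data_bytes)
{
    return (data_bytes * 8u + 2u) / 3u;
}
```

For 3 data bytes (24 bits) this gives 24 buffer bytes with an 8-bit PWM duty cycle, 48 with a 16-bit one, and 8 with UART.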