Another vote for bulk bit-banging with a common SCL line. Find a 16- or 32-bit micro so you have at least 16 pins on the same port (not all micros expose all of those pins conveniently), and just write and read at the port level. This is much easier if you can precompute the patterns: that way you need very few cycles to copy a state to the port and toggle the SCL pin twice.
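A rough sketch of that inner loop, assuming the patterns are already precomputed as one 16-bit word per clock. The names `sda_port`, `scl_pin`, `scl_write` and `send_slices` are made up for illustration, and plain variables stand in for the real memory-mapped registers so it can run on a PC. Start/stop conditions, ACK handling and open-drain behaviour are left out.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for hardware registers (assumption, not a real
   MCU API). On a real part these would be the port output register that
   carries the 16 data lines and the pin driving the shared SCL line. */
static volatile uint16_t sda_port;    /* one data line per device */
static volatile uint8_t  scl_pin;     /* shared clock line */
static unsigned          scl_toggles; /* host-side bookkeeping only */

static inline void scl_write(uint8_t level)
{
    scl_pin = level;
    scl_toggles++;
}

/* Clock out precomputed slices: slices[i] holds the bit for all 16
   devices at clock i, so each clock costs one port write plus two
   SCL toggles. */
void send_slices(const uint16_t *slices, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        sda_port = slices[i]; /* change data only while SCL is low */
        scl_write(1);         /* rising edge: devices sample the data */
        scl_write(0);         /* falling edge: ready for the next slice */
    }
}
```

On real hardware you would probably unroll this loop and pad it with NOPs to hit the exact bit timing.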
If you need to generate everything on the fly, well, it's not hard, just ugly to make as fast as possible. You have 16 LEDs, each with a byte of data you want to send, and you need to rotate those bytes into 16-bit slices, one slice per bit position. So it's mainly going to be bit manipulation, e.g. ADDRESS |= (1<<BIT), handling each bit in 2-3 clock cycles. Ironically, this would result in almost perfect utilization of, say, a 16 MHz AVR. If instead of running at a hard 400 kHz you settled on something like 380 kHz, you would even have cycles left to, say, read in external data via SPI. But yes, it would be as bare-metal as coding gets.
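For what the "rotate into slices" step could look like, here is a plain, unoptimized sketch. The function name `make_slices`, the MSB-first order and the mapping of device N to port bit N are all assumptions for illustration. A speed-tuned version would unroll this or drop to assembly.

```c
#include <stdint.h>

/* Transpose 16 bytes (one per device) into 8 port-wide slices.
   Assumptions: data[d] belongs to the device wired to port bit d,
   and bits go out MSB first, so slices[0] carries every byte's bit 7. */
void make_slices(const uint8_t data[16], uint16_t slices[8])
{
    for (int bit = 0; bit < 8; bit++) {
        uint16_t s = 0;
        for (int dev = 0; dev < 16; dev++) {
            /* Test the current bit of this device's byte, MSB first. */
            if (data[dev] & (0x80u >> bit))
                s |= (uint16_t)(1u << dev); /* same idea as ADDRESS |= (1<<BIT) */
        }
        slices[bit] = s;
    }
}
```

Each resulting slice is one word you can copy straight to the port before pulsing SCL.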