I have resorted to a sort of microcode in FPGAs before.
i used to make accelerators that way.
on an ARM7 processor you can set the bus timing. i set the read timing all the way to the end of the cycle. the fpga has a clock generator that makes two clocks and a multiphase enable signal. One clock feeds the arm processor, the other clock feeds the fpga logic and the multiphase (non-overlapping) enables. i could create up to 8 time slots in one arm bus cycle (between the start of the cycle and the point where the arm actually 'latches' the data). for a write cycle the arm would latch at the beginning of the cycle. this gives me 7 more time slots to perform something in the fpga. for a read the arm would read at the end of the cycle, giving me another 7.
If the arm has to write something to the fpga to be 'executed' and read back the result i have 14 time slots to perform what i needed to do. No waitstates needed.
for example the ARM writes 'read adc channel 7' to the fpga. The fpga would set the channel mux , start the a/d , grab the data (serially) , scale it , sign adjust it , add an offset and put the data on the bus. by the time the processor does the read all that stuff has happened.
the transport to the a/d or peripheral ran at high clockspeed so i could turn around the data within the available time slots.
I had a programmable transport generator.
for example : read a byte , alter bits 3 to 5 with new payload and send out.
The 32 bit operation was encoded as : 8 bits for 'instruction', 3 argument bytes ( all in one 32 bit word )
Let's assume for a second that you have an adc control register with the following map :
- [busy][channel (2:0)][gain (2:0)] that resides in register 0x14 of the adc
The following 'instruction' word would access 'channel(2:0)' :
0x22 , 0x14 , 0b00111000 , 0x02 : FPGA instruction 0x22 : read address 0x14 of the ADC, alter bits 5 to 3 (indicated by the mask 0b00111000) to '02' and write it back
So i would define a 'constant' 0x22143800 that encodes the 'FPGA instruction'.
To perform the operation on the adc all i needed to do is write that word to a memory location
#define Set_ADC_Channel 0x22143800
in the arm code : (FPGA_execute is a memory location)
FPGA_execute = Set_ADC_Channel + 0x02; // the arm would write 0x22143802 to the FPGA 'instruction' register.
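to make the encoding explicit, here is a minimal sketch in C of how such an instruction word could be packed. the byte order (opcode in the top byte, then register address, mask, argument) follows the 0x22143800 example; the names FPGA_WORD, SET_ADC_CHANNEL and fpga_with_arg are made up for illustration and are not from the original design.

```c
#include <stdint.h>

/* Hypothetical packing macro: opcode | register address | bit mask | argument.
 * The layout mirrors the 0x22143800 example: one byte per field, opcode on top. */
#define FPGA_WORD(op, reg, mask, arg)            \
    (((uint32_t)(op)   << 24) |                  \
     ((uint32_t)(reg)  << 16) |                  \
     ((uint32_t)(mask) <<  8) |                  \
      (uint32_t)(arg))

/* Opcode 0x22, ADC register 0x14, mask 0b00111000 (bits 5:3), argument left at 0. */
#define SET_ADC_CHANNEL FPGA_WORD(0x22u, 0x14u, 0x38u, 0x00u)

/* Fill in the argument byte of a pre-built instruction word. */
static inline uint32_t fpga_with_arg(uint32_t base, uint8_t arg)
{
    return (base & 0xFFFFFF00u) | arg;
}
```

with this, fpga_with_arg(SET_ADC_CHANNEL, 2) gives the same 0x22143802 word as the addition above, and masking the low byte means a stale argument in the base word cannot leak through.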
the fpga decodes the instruction : 0x22 means it needs to execute the following microcode :
- read memory location (0x14) from the adc
- alter bits 5:3 to 0x02
- write back to adc
- take reading from adc
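the read / alter / write-back part of that microcode can be modelled in software. this is only a sketch of the idea in C, not the actual fpga logic: the adc register file is a plain array, the argument is assumed to be right-justified and shifted up to the lowest set bit of the mask, and the final 'take reading' step is left out. all names (adc_regs, rmw_field, fpga_execute_model) are hypothetical.

```c
#include <stdint.h>

/* Stand-in for the ADC's register file; in the real system this sits
 * behind a serial transport (SPI, I2C, ...). */
static uint8_t adc_regs[256];

/* Position of the lowest set bit in the mask (0 if the mask is empty). */
static unsigned lowest_set_bit(uint8_t mask)
{
    unsigned n = 0;
    while (mask && !(mask & 1u)) {
        mask >>= 1;
        n++;
    }
    return n;
}

/* Replace only the masked bits of 'old' with 'payload'.
 * Assumption: payload is right-justified and gets aligned to the mask. */
static uint8_t rmw_field(uint8_t old, uint8_t mask, uint8_t payload)
{
    unsigned shift = lowest_set_bit(mask);
    uint8_t field = (uint8_t)((payload << shift) & mask);
    return (uint8_t)((old & (uint8_t)~mask) | field);
}

/* Decode a 32-bit instruction word and run the 0x22 read-modify-write. */
static void fpga_execute_model(uint32_t word)
{
    uint8_t opcode  = (uint8_t)(word >> 24);
    uint8_t reg     = (uint8_t)(word >> 16);
    uint8_t mask    = (uint8_t)(word >> 8);
    uint8_t payload = (uint8_t)word;

    if (opcode == 0x22) {
        uint8_t old = adc_regs[reg];                    /* read register      */
        adc_regs[reg] = rmw_field(old, mask, payload);  /* alter, write back  */
    }
    /* Other opcodes are ignored in this model. */
}
```

for example, if register 0x14 holds 0x45 (busy set, channel 0, gain 5), executing 0x22143802 only touches bits 5:3 and leaves busy and gain alone.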
so the next ARM instruction would be :
ADC_result = FPGA_readback;
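on the arm side the write followed by the read could be wrapped in one small function. a rough sketch in C, assuming the 'instruction' and readback registers sit next to each other in memory; the struct layout and the names fpga_regs_t and read_adc_channel are assumptions for illustration, and on real hardware the pointer would be the fixed address the fpga is mapped to.

```c
#include <stdint.h>

/* Hypothetical register layout of the FPGA as seen from the ARM.
 * 'volatile' keeps the compiler from reordering or dropping the accesses. */
typedef struct {
    volatile uint32_t execute;   /* write: 32-bit FPGA instruction word */
    volatile uint32_t readback;  /* read: result of the last instruction */
} fpga_regs_t;

/* Select an ADC channel (0..7) and fetch the reading: one write, one read.
 * 0x22143800 is the 'set channel' instruction word from the text. */
static inline uint32_t read_adc_channel(fpga_regs_t *fpga, uint8_t channel)
{
    fpga->execute = 0x22143800u | (channel & 0x07u);
    return fpga->readback;
}
```

the point of the design is that this function needs no polling loop and no wait states: by the time the read happens the fpga has already finished.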
no need to cache anything in the fpga or arm , no need for the arm to waste time doing bit-twiddling , shift operations, sign expansions or any other 'data massaging'. all the complicated 'transport' related stuff was handled by the FPGA as emulated operations.
All the bit fields that resided somewhere in registers , over a serial transport, were 'virtualized' as single arm operations. some transport was over spi , other over i2c. others parallel. the arm didn't care. it just said 'i need to modify these bits in that register to this.' and it could do that in a single memory access. the fpga did all the work.
combine that with true dual-port mailboxes and you could create a system that never needed any printf or scanf. the 'host' would write something in the mailbox , the fpga would set things up , fire the interrupt to the arm , the arm picks up its 'instruction' ( from the host program on the pc ) and arguments , and does its thing. some of those things are in turn accelerated by the fpga. the results are written back to the mailbox memory , which clears the wait flag so the host can read the answer. the host communicated using bulk usb packets with a fixed frame length. one out, one in. Everything was timed to zero waitstates.
the host could do
for channel = 0 to 7
print read_adc_channel(channel)
next