An example-
A 256k PIC32MM has a chip erase time of 20ms and a row write time of ~1.5ms, which means 1024 rows * 1.5ms + 20ms = ~1.5s to program the whole chip (these are self-programming numbers).
USB HID at full speed can do 64 bytes x 1000 interrupt transfers per second = 64000 bytes/s (although I think one can use control transfers/reports to crank that up significantly).
So the chip can do 256k in ~1.5s and USB FS HID can transfer 256k in ~4s, but it takes about 80 seconds to program the 256k from the PICkit 3/PKOB. I know there are a number of little gears between the input and output of this transmission system, but it does seem like the clutch (software) has some major slippage.
They could probably double the speed (or more) with not much effort, it would seem to me.
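For anyone who wants to check the arithmetic, here is a rough sketch in Python. The timing numbers are the ones quoted above; the "serial" and "overlapped" cases are only my assumption about the two extremes (transfer and programming done one after the other, or fully in parallel), not a description of how the PICkit 3 actually works.

```python
# Back-of-envelope timing for programming a 256k PIC32MM over USB FS HID.
# All figures come from the post above; nothing here is measured.
FLASH_BYTES = 256 * 1024      # 256k of flash
ROW_COUNT = 1024              # rows to write
CHIP_ERASE_S = 0.020          # chip erase time
ROW_WRITE_S = 0.0015          # row write time (approximate)
HID_BYTES_PER_S = 64 * 1000   # 64-byte interrupt transfer every 1 ms frame
OBSERVED_S = 80.0             # rough time seen with the PICkit 3 / PKOB

flash_s = ROW_COUNT * ROW_WRITE_S + CHIP_ERASE_S
usb_s = FLASH_BYTES / HID_BYTES_PER_S

# Assumed extremes: do the two jobs back to back, or overlap them completely.
serial_s = flash_s + usb_s
overlapped_s = max(flash_s, usb_s)

print(f"flash only:      {flash_s:.2f} s")
print(f"usb only:        {usb_s:.2f} s")
print(f"serial estimate: {serial_s:.2f} s")
print(f"overlapped:      {overlapped_s:.2f} s")
print(f"observed is {OBSERVED_S / serial_s:.0f}x the serial estimate")
```

Even the pessimistic back-to-back case leaves a large gap to the observed time, and that is before counting verify or protocol overhead, which this sketch ignores.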
Similar example:
The MSP430G2955 is a new device, but it belongs to the 20-year-old 2xx flash family. It has the same flash module as the 20-year-old 2xx devices, and I am using it as an example only because it has the largest memory in the 2xx family (and with the largest memory it gives the best performance numbers for my flasher).
You can calculate the maximum flash write rate from the datasheet numbers for block write (at 476 kHz, which is the maximum flash frequency), and it is (if I remember right) a little over 50 KByte/sec (these are self-programming numbers). My flasher uses a slightly lower flash frequency of 461.5 kHz for the target device flash, and by the datasheet the time for 128 bytes, in flash clock cycles, is...
25 (1st word, block start) + 63 * 18 (63 words) + 6 (block termination) = 1165
and with a 461.5 kHz flash frequency the flash write rate will be
49.5 KByte/sec (these are self-programming numbers).
The benchmark (write command) for my flasher with this target device connected, on Win / Linux / OS X, shows
48.3 KByte/sec. This rate includes the time elapsed from the moment just before the PC sends the write command (with all data) to the flasher until the PC receives confirmation from the flasher that the job is done. Communication between the PC and the flasher uses CDC (OS drivers), which needs no extra installation: plug and play.
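The calculation above can be reproduced with a few lines of Python. The cycle counts and frequencies are the ones given in this post; treating KByte as 1024 bytes is my assumption, chosen because it reproduces the 49.5 figure.

```python
# MSP430 2xx block-write rate from the cycle counts quoted above.
BLOCK_BYTES = 128
WORDS_PER_BLOCK = BLOCK_BYTES // 2   # 16-bit words
FIRST_WORD_CYCLES = 25               # 1st word, block start
NEXT_WORD_CYCLES = 18                # each of the remaining words
BLOCK_END_CYCLES = 6                 # block termination
KBYTE = 1024                         # assumption: KByte = 1024 bytes


def block_cycles():
    """Flash clock cycles needed to write one 128-byte block."""
    return (FIRST_WORD_CYCLES
            + (WORDS_PER_BLOCK - 1) * NEXT_WORD_CYCLES
            + BLOCK_END_CYCLES)


def write_rate_kbyte_s(flash_hz):
    """Theoretical block-write rate in KByte/sec at a given flash clock."""
    blocks_per_s = flash_hz / block_cycles()
    return blocks_per_s * BLOCK_BYTES / KBYTE


MEASURED_KBYTE_S = 48.3              # benchmark figure from the post
theory = write_rate_kbyte_s(461_500)
efficiency = MEASURED_KBYTE_S / theory

print(f"cycles per block:   {block_cycles()}")
print(f"rate at 476 kHz:    {write_rate_kbyte_s(476_000):.1f} KByte/sec")
print(f"rate at 461.5 kHz:  {theory:.1f} KByte/sec")
print(f"measured / theory:  {efficiency:.1%}")
```

By this reckoning the measured 48.3 KByte/sec is within a few percent of what the flash module itself allows at 461.5 kHz, so almost all of the transfer overhead is hidden behind the flash write time.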
Double? You can do much better. 1.5 sec is too optimistic: you need to transfer the data to the chip, then spend some time on verification. But about 5-6 sec sounds about right. And you don't need FPGAs or fancy 32-bit MCUs to achieve this speed. You can get pretty good speed with a little programmer built around a small PIC16.
And my flasher is based on a 10-year-old 24 MHz MSP430 device. The most important thing is to have everything running in parallel without any latency, but this must be hand-coded in assembler, and there is nobody crazy enough (except me) today to spend time on this. C++ with 5 abstraction layers is what is mostly used today (as I can see in TI FET's open sources), and in combination with people who don't know how to do it, the result is what it is.