I can see how FP16 would allow representing smaller values in the [0, 1] interval than a fixed-point format of similar width. What I'm still not convinced about is the real benefit over fixed point. I'd like to see comparative examples that clearly show FP16 yielding better results overall. For instance: at some point you have numbers so small that they would be represented as 0 in fixed point, but as some nonzero (if low-precision) value in FP16... would that contribute enough to the overall weighted sum to matter? (Given that even big NNs are usually limited in the number of "neurons", and as I understand it, these days we tend to favour more layers rather than more neurons per layer...)
To sum it up, I would like to see a real comparison between FP16 and fixed point that would clearly show the benefits in real cases.
Maybe in the end FP16 takes fewer resources overall than fixed point: even though fixed point itself is cheaper to implement, more care is needed for the calculations that use it, so fixed point could end up being more expensive overall. I'm just not quite sure or convinced at this point, and I would be interested in reading papers about that specifically, if there are any.
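To make my underflow concern concrete, here is a toy numpy sketch (the weight and activation values are made up, not taken from any real network) comparing how a batch of very small weights survives Q15 fixed point versus FP16:

```python
# Toy illustration of the underflow concern, with made-up numbers (not a real
# network): weights smaller than half a Q15 step (1/2^15 ~ 3.05e-5) are flushed
# to zero in 16-bit fixed point, while FP16 still keeps them as low-precision
# subnormals, so their contribution to the weighted sum survives.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(1e-6, 1e-5, size=1000)   # all below half a Q15 LSB
acts    = rng.uniform(0.5, 1.0, size=1000)     # ordinary activations

q15  = np.round(weights * 2**15).astype(np.int16) / 2**15   # Q15 fixed point
fp16 = weights.astype(np.float16)                           # IEEE half precision

print("exact sum:", float(np.dot(weights, acts)))
print("FP16  sum:", float(np.dot(fp16.astype(np.float64), acts)))
print("Q15   sum:", float(np.dot(q15, acts)))   # 0.0 -- every weight rounds to zero
```

Whether the lost terms matter in a real net depends on how much of the sum they actually carry, which is exactly the comparison I'd like to see measured.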
There are many research papers showing that even INT8 gives acceptable results for neural nets, if the network is translated to it. Usually the training is done in FP32, which takes a lot of memory (GBs) and a long time. The end result is a descriptor of the neural net, a few megabytes in size. Take this as an example:
https://github.com/tesseract-ocr/tessdata

This is Google's text recognition neural net. One goal of reduced precision is to shrink this descriptor so that the main functions of the OS, the descriptor, and the actual net being run all fit into the cache of a CPU. If it is in the cache, it runs a lot more efficiently and faster than from main memory. English is 22 MB, which is already kind of doable, as there are CPUs with 32 MB+ L2 cache. Similarly, you might want to run the thing on a GPU, where the cache is similarly small, and distributed.
So if you reduce the precision from FP32 to FP16 or the newer bfloat16, the model takes half the size, and the internal data is half the size. But you have to translate the network to run on this, and verify that it still works. bfloat16 is great because it is just a truncation of FP32, so this is an easy step without much risk.
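A quick numpy sketch of that truncation (plain truncation shown for simplicity; real hardware typically rounds to nearest, but the idea is the same):

```python
# bfloat16 keeps the FP32 sign bit and full 8-bit exponent and drops the low
# 16 mantissa bits, so the dynamic range of FP32 is preserved and only
# precision is lost.
import numpy as np

def truncate_to_bfloat16(x):
    """Zero the low 16 bits of FP32 values; result kept in an FP32 container."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([0.1234567, 3.14159265, 1e-30], dtype=np.float32)
print(truncate_to_bfloat16(x))
# ~2-3 significant decimal digits survive, but 1e-30 (far outside FP16 range)
# keeps its magnitude because the exponent field is untouched.
```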
Note that for a Cortex-M type MCU, the entire external memory is about the size of such a cache, so you are I/O-limited anyway.
I've read some papers about INT8, which is apparently very fast and energy efficient, with only a small drop in accuracy (in word recognition, for example).
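For reference, this is roughly what translating FP32 weights to INT8 looks like in the simplest case, a symmetric per-tensor post-training quantization sketch (real toolchains such as TensorFlow Lite also calibrate activation ranges and quantize per channel):

```python
# Minimal sketch of symmetric, per-tensor post-training INT8 quantization of
# FP32 weights. It only shows the basic storage/precision trade-off; nothing
# here is specific to any particular framework.
import numpy as np

w_fp32 = np.random.default_rng(1).normal(0.0, 0.1, (64, 64)).astype(np.float32)

scale  = np.abs(w_fp32).max() / 127.0                          # one scale per tensor
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
w_deq  = w_int8.astype(np.float32) * scale                     # dequantize to compare

print("storage:", w_fp32.nbytes, "bytes (FP32) ->", w_int8.nbytes, "bytes (INT8)")
print("worst-case round-trip error:", float(np.abs(w_fp32 - w_deq).max()))
```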
The real problem is that most big companies are doing their own thing, even producing new silicon around a concept, so results vary.
In any case, I have an application where I have an expensive link (GSM or 4/5G) to the mothership, and a relatively large amount of data being generated that needs analysing (an audio stream). I need to process the data locally as much as possible, and send as little as possible, only when an event happens. The more DSP-type or hardware acceleration I have, the better.