So anyway, if you want to represent exactly -1 and +1 (assuming you really need that), you'd use Q1.14, which would mean 1 integer bit, 14 fractional bits, and an implied sign bit.
We're digressing a bit, but anyway. The Q notation is always a bit ambiguous: there are actually two conventions, one including the sign bit and the other not. If you know the word length is 16-bit, Q1.15 usually means that the leading 1 bit is just the sign bit (as you assumed), but that's confusing since it's meant to be the integer part, so definitions vary a bit from one vendor/textbook to another. As long as you know exactly what is being talked about, it's fine...
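To make the ±1 point concrete, here is a minimal sketch (the macro and helper names are mine, assuming a 16-bit word holding sign + 1 integer bit + 14 fractional bits, i.e. Q1.14 in the convention that does not count the sign bit; no rounding or saturation for brevity):

Code:
#include <stdint.h>

/* Q1.14 in a 16-bit word: 1 sign bit, 1 integer bit, 14 fractional bits.
   Scale factor is 2^14, so +1.0 and -1.0 are exactly representable
   (16384 and -16384), which plain Q0.15 cannot do for +1.0. */
#define Q14_ONE   (1 << 14)

static inline int16_t float_to_q14(float x)  { return (int16_t)(x * Q14_ONE); }
static inline float   q14_to_float(int16_t q) { return (float)q / Q14_ONE; }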
I've often used unsigned fixed point. A whole range of calculations don't need negative numbers. Some definitions would write what I meant as: UQ1.15 ('U' as in unsigned).
Regarding unsigned fractional, UQm.n, it is like so:
UQ1.15 means from 0.00000 up to 1.999something.
I think wikipedia is quite clear about this: https://en.wikipedia.org/wiki/Q_(number_format)
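A quick numeric check of that range (my own arithmetic, assuming UQ1.15 packed into a 16-bit word):

Code:
#include <stdint.h>

/* UQ1.15 in a 16-bit word: no sign bit, 1 integer bit, 15 fractional bits.
   Scale factor is 2^15 = 32768. */
uint16_t lo = 0x0000;   /* 0     / 32768 = 0.0               */
uint16_t hi = 0xFFFF;   /* 65535 / 32768 = 1.999969482421875 */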
However, writing it as Q1.14 implies you have a 15-bit register. That is a pretty nonstandard use of the notation and I would strongly suggest against using it.
There are two conflicting notations for fixed point. Both notations are written as Qm.n, where:
Q designates that the number is in the Q format notation – the Texas Instruments representation for signed fixed-point numbers (the "Q" being reminiscent of the standard symbol for the set of rational numbers).
m. (optional, assumed to be zero or one) is the number of bits set aside to designate the two's complement integer portion of the number, exclusive or inclusive of the sign bit (therefore if m is not specified it is taken as zero or one).
n is the number of bits used to designate the fractional portion of the number, i.e. the number of bits to the right of the binary point. (If n = 0, the Q numbers are integers – the degenerate case).
One convention includes the sign bit in the value of m,[1][2] and the other convention does not. The choice of convention can be determined by summing m+n. If the value is equal to the register size, then the sign bit is included in the value of m. If it is one less than the register size, the sign bit is not included in the value of m.
Nope. Read again the very article you linked to; the excerpt above describes both conventions.
And I just prefer the second convention, which doesn't include the sign bit in "m", as I find it more consistent. I know the other notation is also common.
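For what it's worth, a tiny illustration of the two namings applied to the same 16-bit word (my own example values, not from the posts above):

Code:
#include <stdint.h>

/* The same 16-bit two's-complement word, two naming conventions:
   - sign bit counted in m:     "Q1.15", m + n = 16 = word size
   - sign bit not counted in m: "Q0.15", m + n = 15 = word size - 1
   Either way the real value is raw / 2^15. */
int16_t raw   = (int16_t)0xC000;      /* bit pattern 1100 0000 0000 0000 */
double  value = -16384 / 32768.0;     /* raw = -16384, i.e. -0.5 under both namings */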
Assembler? That is what the intrinsics are for. Perfectly fine to be used within the C environment. Even the CMSIS DSP library is written completely in C, using the intrinsics.
(I would not call that an assembly language).
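As a rough sketch of what that looks like in practice (assuming a Cortex-M core with the DSP extension so the __SMLAD intrinsic is available from the CMSIS-Core / device header; the helper name dot_q15 is mine, not from CMSIS):

Code:
#include <stdint.h>
#include <string.h>
/* Assumes the device/CMSIS-Core header providing __SMLAD is included,
   e.g. via the usual device header in a real project. */

/* Q15 dot product: each product is Q30, summed into a 64-bit accumulator.
   __SMLAD does two signed 16x16 multiply-accumulates per instruction. */
static int64_t dot_q15(const int16_t *a, const int16_t *b, uint32_t n)
{
    int64_t acc = 0;
    uint32_t i = 0;
    for (; i + 2 <= n; i += 2) {
        uint32_t pa, pb;
        memcpy(&pa, &a[i], sizeof pa);   /* pack two Q15 samples per 32-bit word */
        memcpy(&pb, &b[i], sizeof pb);
        acc += (int32_t)__SMLAD(pa, pb, 0);
    }
    for (; i < n; i++)                   /* tail element if n is odd */
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;                          /* shift right by 15 to scale back to Q15 */
}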
The only reason I can think of for the 16-bit float is that you can reduce expensive store/load operations if 32-bit is unnecessary.
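A small sketch of that idea (assuming a GCC/Clang ARM toolchain where __fp16 is available as a storage-only half-precision type; the function name mac_fp16 is mine):

Code:
/* Weights kept in memory as 16-bit floats (half the load/store traffic),
   arithmetic still done in 32-bit float: __fp16 promotes to float when used. */
float mac_fp16(const __fp16 *w, const float *x, unsigned n)
{
    float acc = 0.0f;
    for (unsigned i = 0; i < n; i++)
        acc += (float)w[i] * x[i];
    return acc;
}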
From what I read, an important usage of FP16 is for applications that need high dynamic range but not very high SNR. Speech recognition is an example: the audio samples can have a huge dynamic range, but recognizing the sound may only need a moderate SNR. Whether you speak right next to the microphone or 10 feet away, algorithms implemented with FP16 work well in both cases.
I see, but FP16 has only a 5-bit exponent... the dynamic range will still not be very "high".
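For reference (my own back-of-the-envelope numbers, not from the posts): IEEE binary16 has 1 sign, 5 exponent and 10 mantissa bits; the largest finite value is 65504 and the smallest normal is 2^-14 ≈ 6.1e-5 (subnormals reach about 6e-8), so the normal range spans roughly 1e9, or about 180 dB, with about 11 significant bits (≈66 dB) of precision at any given scale.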
As to dynamic range vs. SNR, that would be an interesting discussion. Someone may argue that a good AGC plus a VGA for the microphone could get the work done too, but obviously that's harder to get right and much less appealing in this digital world.
Yes that too. Well if you're relying on FP operations, you're still in the digital domain. So if it can be done with FP, it can be done with a purely digital AGC with fixed point (and get you better SNR).
But yeah it again all comes down to keeping things very simple, and likely relying on ready-made FPUs as I hinted above.
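A toy sketch of what such a purely digital, fixed-point AGC could look like (entirely my own simplification, not from any of the posts: a one-pole envelope follower in Q15 driving a crude gain correction; assumes arithmetic right shift, as on typical ARM compilers):

Code:
#include <stdint.h>

/* Toy Q15 AGC: track the signal envelope with a one-pole filter and
   nudge the gain toward a target level. All values are Q15. */
#define AGC_TARGET  8192          /* desired envelope, ~0.25 full scale */
#define AGC_ALPHA   64            /* envelope smoothing, ~0.002 in Q15  */

static int32_t env  = 0;          /* envelope estimate, Q15 */
static int32_t gain = 32767;      /* current gain, Q15 (max ~1.0) */

int16_t agc_step(int16_t x)
{
    int32_t mag = x >= 0 ? x : -x;
    /* env += alpha * (mag - env), done in Q15 */
    env += (AGC_ALPHA * (mag - env)) >> 15;
    /* crude gain adjustment: step gain up/down toward the target */
    if (env > AGC_TARGET && gain > 0)          gain--;
    else if (env < AGC_TARGET && gain < 32767) gain++;
    /* apply gain: Q15 * Q15 -> Q30, shift back to Q15, saturate to int16 */
    int32_t y = ((int32_t)x * gain) >> 15;
    if (y > 32767)  y = 32767;
    if (y < -32768) y = -32768;
    return (int16_t)y;
}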
You are misunderstanding the targeted application area -- neural networks.
The trained networks are very fuzzy, in the sense of being imprecise. No element or weight has been studied and calculated to have the correct value, or even a reasonable value. Instead they just run the training until the wrong results are reduced in influence enough that a not-obviously-wrong output is produced. Each element contributes a little bit of the influence for the result, and each intermediate response is weighted.
Like in the real world, some voices get a weight of 0.0000000001 and others get a weight of 0.99. With enough inputs, it doesn't matter if a particular weight is 0.51 or 0.52. But you don't want the 0.0000000001 rounded up to 0.01 or forgotten completely.
The point here is that, while superficially similar to SNR, it's quite different in detail. In particular the concept of AGC doesn't apply because you simultaneously want to add in very tiny contributions while being heavily influenced by the strong weightings.
I don’t get it.
Who is really doing AI / ML at the far edge on Cortex processors? I'm sure someone is, but how many people could this really apply to?
See for example the Kendryte K210 chip
Is there a complete datasheet and user manual for that thing? What about toolchains and tutorials for it? It seems very interesting, but docs and info are lacking, or at least I could not find them.
I can see how FP16 would allow representing smaller values in the [0, 1] interval than fixed point of similar width. What I'm still not convinced about is the real benefit over fixed point. I'd like to see comparative examples which clearly show that FP16 yields better results overall. For instance, at some point you have numbers so small that they would be represented as 0 with fixed point but as some (low-precision) value with FP16... but would that contribute enough to the overall weighted sum to matter? (Given that even big NNs are usually limited in the number of "neurons", and as I understand it, we tend to favor more layers these days rather than more neurons per layer...)
To sum it up, I would like to see a real comparison between FP16 and fixed point that would clearly show the benefits in real cases.
Maybe in the end FP16 takes fewer resources overall than fixed point: even though fixed point itself is cheaper to implement, the calculations using it need more care, so fixed point could end up being more expensive overall. I'm just not quite sure or convinced at this point, and would be interested in reading papers about that specifically if there are any.
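To put the "too small for fixed point" case in concrete numbers (my own toy values, comparing Q15 against IEEE binary16, and assuming the compiler provides _Float16, e.g. recent GCC/Clang for ARM; this illustrates the question above, it doesn't settle whether the contribution matters in a real network):

Code:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    float w = 1e-5f;   /* a small weight */

    int16_t  q15  = (int16_t)(w * 32768.0f + 0.5f);  /* 1e-5 * 32768 ≈ 0.33, rounds to 0 */
    _Float16 half = (_Float16)w;                      /* subnormal in binary16, but nonzero */

    printf("Q15: %d  (back to real: %g)\n", q15, q15 / 32768.0);
    printf("FP16 as float: %g\n", (double)(float)half);
    return 0;
}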
It seems really strange to have ML in a processor targeting embedded, battery powered devices.
There is a massive difference between training the NN and applying the results of the training.
Well, I understand the explanation about the tiny and large weights. But a lot of ML algorithms use 8-bit fixed point. What's the difference here? What kind of ML algorithm needs floating point, and what kind only requires 8-bit fixed point?