The bottom-end MAX 10 has sixteen 18x18-bit multipliers. So at 18 bits, if you write:
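    // clk is an assumed clock name; every assignment below is registered
    always @(posedge clk) begin
        // clock 1: four input products
        p3  <= p1  * p2;
        p6  <= p4  * p5;
        p9  <= p7  * p8;
        p12 <= p10 * p11;
        // clock 2: two intermediate products
        p13 <= p3 * p6;
        p14 <= p9 * p12;
        // clock 3: final product
        final_result <= p13 * p14;
    end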
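You would eat 7 multipliers and 126 logic cells (7 registered results x 18 bits each), and the path from inputs p1, p2, p4, p5, p7, p8, p10, p11 to the output final_result would be a 3-clock pipe. With a 150MHz source clock, this code would be doing about 1 billion multiplies a second (7 multiplies per clock x 150 million clocks a second = 1.05 billion).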
I am not counting the logic cells where the source values p1, p2, p4, p5, p7, p8, p10, p11 are stored. If they are all 8-bit, that would be another 64 logic cells.
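For anyone who wants to try the fragment above in a synthesis tool, here is a rough sketch of it as a complete module. The module name mult_tree, the parameter W, and the clk port are my own placeholders and not from the post. It assumes every signal stays 18 bits wide, so each product is truncated to its low 18 bits, which is what the 7-multiplier and 126-register figures imply. The actual resource usage depends on how the tool maps the registers and multipliers.

```verilog
// Sketch only: mult_tree, W and clk are hypothetical names.
// Only p1..p14 and final_result come from the original fragment.
module mult_tree #(
    parameter W = 18                // assumed width of every signal
) (
    input  wire         clk,        // assumed source clock, e.g. 150MHz
    input  wire [W-1:0] p1, p2, p4, p5, p7, p8, p10, p11,
    output reg  [W-1:0] final_result
);
    reg [W-1:0] p3, p6, p9, p12;    // stage 1 registers
    reg [W-1:0] p13, p14;           // stage 2 registers

    always @(posedge clk) begin
        // Stage 1: four products of the inputs.
        // Operands and targets are all W bits wide, so each product
        // is truncated to its low W bits.
        p3  <= p1  * p2;
        p6  <= p4  * p5;
        p9  <= p7  * p8;
        p12 <= p10 * p11;

        // Stage 2: two products of the stage 1 results
        p13 <= p3 * p6;
        p14 <= p9 * p12;

        // Stage 3: final product, valid 3 clocks after the inputs
        final_result <= p13 * p14;
    end
endmodule
```

Because all the assignments are non-blocking, each stage reads the values registered on the previous clock edge. That gives the 3-clock latency, and a new set of inputs can go in on every clock.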
To think that a sub-$10 chip can do anything a billion times a second and still be relatively empty, with room to do quite a bit more, is kind of amazing, yet there is so much out there that is larger and faster.