You also have to be careful about your compiler's optimization settings; how much difference they make varies a lot from case to case. For example, if I run "square" through avr-gcc with the -O3 setting, I get:
00000000 <square>:
0: 88 9f mul r24, r24
2: 90 01 movw r18, r0
4: 89 9f mul r24, r25
6: 30 0d add r19, r0
8: 98 9f mul r25, r24
a: 30 0d add r19, r0
c: 11 24 eor r1, r1
e: c9 01 movw r24, r18
10: 08 95 ret
(That's avr-gcc 4.3.3 rather than 4.5.3, but I get approximately the same improvement with 4.9.x.)
(Also note that "int" on an AVR is 16 bits, while "int" on the ARM and x86 is 32 bits...)
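To make the listing a bit easier to follow, here is a rough C sketch of what those instructions appear to be doing. It assumes the original "square" was simply "return x * x;" on an int (the source isn't shown above, so that is a guess), and the names square16_by_bytes and check_square16 are made up for illustration. Since int is 16 bits on the AVR, only the low 16 bits of the product are needed, so the high-byte-times-high-byte term drops out and the two cross products only contribute their low byte:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed source of "square" (not shown in the post): */
int square(int x) { return x * x; }

/* Hypothetical emulation of the -O3 listing: a 16-bit square built
 * from 8x8 multiplies, keeping only the low 16 bits of the result. */
uint16_t square16_by_bytes(uint16_t x)
{
    uint8_t lo = (uint8_t)(x & 0xFF);   /* r24 */
    uint8_t hi = (uint8_t)(x >> 8);     /* r25 */

    /* mul r24,r24 ; movw r18,r0  -> full 16-bit lo*lo */
    uint16_t low_sq = (uint16_t)((uint16_t)lo * lo);

    /* mul r24,r25 (and mul r25,r24): only r0, the low byte of the
     * product, is used by the following "add r19,r0". */
    uint8_t cross = (uint8_t)((uint16_t)lo * hi);

    /* add r19,r0 twice: 8-bit adds, any carry out is discarded */
    uint8_t high = (uint8_t)((low_sq >> 8) + cross + cross);

    /* movw r24,r18 -> result returned in r25:r24 */
    return (uint16_t)(((uint16_t)high << 8) | (low_sq & 0xFF));
}

/* Compare against a plain truncated 16-bit product for every input. */
void check_square16(void)
{
    uint32_t x;
    for (x = 0; x <= 0xFFFF; x++) {
        uint16_t expected = (uint16_t)(x * x);
        assert(square16_by_bytes((uint16_t)x) == expected);
    }
}
```

For example, 300 squared is 90000, which doesn't fit in 16 bits; both the byte-wise version and a truncated 16-bit multiply give 24464 (90000 mod 65536). On a 32-bit-int ARM or x86 the same "square" would return the full 90000.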