I don't know what you mean by "in byte variables for speed", but "int" is guaranteed to be at least as fast, since on 16- and 32-bit CPUs compilers have to issue extra instructions to comply with 8-bit semantics (masking or extending results back to 8 bits). Some people think they are optimizing (like using uint8_t counters in short "for" loops), but in reality they are just making things worse.
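A minimal C sketch of the pattern I mean (hypothetical buffer-summing functions; the point is what the compiler has to do with each counter type):

```c
#include <stdint.h>

/* "Optimized" version: the uint8_t counter obliges the compiler to
   wrap i modulo 256, which on 16/32-bit targets typically costs an
   extra masking or extension instruction per iteration. */
uint32_t sum_u8_counter(const uint8_t *buf, uint8_t n)
{
    uint32_t s = 0;
    for (uint8_t i = 0; i < n; i++)
        s += buf[i];
    return s;
}

/* Natural-width counter: the loop index lives in a register at the
   CPU's native width, no fix-up instructions needed. */
uint32_t sum_int_counter(const uint8_t *buf, int n)
{
    uint32_t s = 0;
    for (int i = 0; i < n; i++)
        s += buf[i];
    return s;
}
```

Both return the same result; only the generated code differs.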
I am sure you know all this, so I am surprised by your question.
Take the Z80. HL holds a uint16_t x and you are doing x << 4.
ld b, 4
loop: add hl, hl
djnz loop
The loop counter is a byte.
If you were forced to use an "int", which on the IAR C compiler was 16 bits, you would need something like
ld bc, 4
loop: add hl, hl
dec bc
ld a, b
or c
jr nz, loop
There are a million cases where 8-bit values are far faster on those micros. Some of them also had an 8x8=16 multiply, a 16/8=8 divide, etc. Converting every int8 into 16-bit arithmetic bloats the code 5x to 10x. In fact I remember rewriting a lot of IAR-generated code in asm and wondering why it was doing everything in 16 bits, with the top byte set to 0. An 8x8=16 multiply is probably 10x faster than doing 16x16=32 and then discarding the top 16 bits. No wonder C got a crap name back then for generating crap code.
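A sketch of the multiply case, assuming a target whose compiler can map this onto an 8x8=16 hardware multiply (function name is mine):

```c
#include <stdint.h>

/* Both operands are genuinely 8 bits and the full 16-bit product is
   wanted, so a decent compiler for a micro with an 8x8=16 multiply
   can emit a single instruction here. */
uint16_t mul8x8(uint8_t a, uint8_t b)
{
    /* The casts matter: a and b are promoted to int before the '*',
       and on a target where int is 16 bits, 255*255 = 65025 would
       overflow signed int (undefined behaviour). Forcing unsigned
       arithmetic makes it well-defined everywhere. */
    return (uint16_t)((unsigned int)a * (unsigned int)b);
}
```

Without the casts the code happens to work where int is 32 bits, which is exactly how this class of bug hides.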
In later years, CPUs became "genuinely 16 bit" even though they still had the crippling 64k address space. H8/500 was one such.
On the arm32 I use int or uint32_t for loop counters; it merely wastes RAM
Re your example, that is wrong code, because anybody adding up even two bytes should know the result can be > 255. Classic integer maths overflow. If using integers (rather than floats) one needs to be intimately familiar with the actual range of the values, i.e. the actual data. This has been well known since forever; the Apollo guidance code used integer maths heavily and had to deal with exactly this. So avg should be a uint16_t, and possibly a uint32_t if adding up lots of them. Auto-promoting avg to uint16_t only saves your skin partially, and hides the real problem.
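A sketch of the averaging done properly, with the accumulator sized for the worst-case sum rather than for one element (hypothetical function; assumes n > 0):

```c
#include <stdint.h>
#include <stddef.h>

/* Averaging a buffer of bytes: the accumulator must hold the
   worst-case sum, n * 255.  A uint16_t is only good up to n = 257
   samples; beyond that a uint32_t is required. */
uint8_t avg_u8(const uint8_t *buf, size_t n)
{
    uint32_t sum = 0;               /* sized for n * 255 */
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return (uint8_t)(sum / n);      /* caller guarantees n > 0 */
}
```

The division result always fits back in a uint8_t, because an average of bytes cannot exceed 255.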
People who can't be bothered with actual data ranges have to use floats - and pay a heavy price for it... in the old days, x100 to x1000 less speed. Not today, with a 168 megaflop 32F417.
AIUI,
uint32_t avg = (a[0] + a[1] + a[2] + a[3] + a[4]);
would not promote each of the five items to a uint32_t before doing the addition. Integer promotion takes each uint8_t to int, the additions are done in int, and only the final sum is converted to uint32_t by the assignment. So on a target where int is 16 bits, a longer sum like this can still overflow before the destination type comes into play - relying on the left-hand side to widen the arithmetic is wrong too.
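A sketch showing both forms, assuming standard C integer promotion (function names are mine):

```c
#include <stdint.h>

/* The additions here happen in int, whatever the left-hand side
   says: each a[i] is promoted to int, summed, and only the result
   is converted to uint32_t.  Fine where int is 32 bits; where int
   is 16 bits a long enough sum would overflow before conversion. */
uint32_t sum5(const uint8_t a[5])
{
    uint32_t avg = (uint32_t)(a[0] + a[1] + a[2] + a[3] + a[4]);
    return avg;
}

/* Casting one operand forces the whole chain into uint32_t, so the
   arithmetic is wide on every target, including 16-bit-int ones. */
uint32_t sum5_safe(const uint8_t a[5])
{
    return (uint32_t)a[0] + a[1] + a[2] + a[3] + a[4];
}
```

Five bytes max out at 1275, which fits even a 16-bit int, so both versions agree here; the difference only bites on longer sums.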