"bl" is a branch with link, meaning that the return address is saved in lr. If that is handwritten assembler, then I agree, you could optimize it to a non-linked jump (a plain "b"). But in C, it is up to the compiler to figure out whether a function ever returns.
But that won't change the sp reservation. I presume your main() has plenty of code with locals that need some space on the stack. Typically the function prologue adjusts sp to reserve that space (the ARM stack grows downward, so sp is decremented), and the epilogue restores it. I would expect that most locals in your main function will then be stored/retrieved with [sp + offset], where the offset is between 0 and 591.
Any nested function call can also have its own sp prologue/epilogue, so the total stack usage of your main call tree could be even bigger. I think there are ways of measuring that: either fill the stack with a 'waterlevel' pattern such as 0xCC and, after running long enough, check how high the 'waterlevel' got, or use static analysis.
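To make the 'waterlevel' idea concrete, here is a minimal sketch in C. It assumes a descending stack and treats the stack as a plain byte region; the names stack_paint and stack_high_water are made up for illustration, and on a real target you would pass the stack bounds from your linker script instead of a test buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xCC  /* fill pattern mentioned above */

/* Paint the whole stack region with the fill pattern.
 * Hypothetical helper: call it before the region is used as a stack. */
static void stack_paint(uint8_t *base, size_t size)
{
    memset(base, STACK_FILL, size);
}

/* Assumes a descending stack: usage starts at base + size and grows
 * toward base. Scan upward from the lowest address; the first byte
 * that no longer holds the pattern marks the deepest point reached.
 * Returns the number of bytes that appear to have been used. */
static size_t stack_high_water(const uint8_t *base, size_t size)
{
    size_t untouched = 0;
    while (untouched < size && base[untouched] == STACK_FILL)
        untouched++;
    return size - untouched;
}
```

Keep in mind that this only gives a lower bound: if the deepest bytes written happen to equal 0xCC, or a frame reserves space it never writes, the real peak is higher than what the scan reports.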
I think the only optimization here is in the stmdb instruction. If main() never exits, there is no point in saving the caller's r4-r11 and lr registers. But I'm not sure if the C calling convention allows one to change that. Maybe by defining main as a naked function? The GCC documentation says that usage of locals in a naked function should be avoided, which suggests that it would also omit the stack prologue/epilogue. And apparently you need that, as your main requires 592 bytes of stack space.