So maybe this: at the beginning of each function (except some highly performance-critical ones), use a helper macro to allocate 4-8 bytes on the stack and store a magic number plus some identifier for the function, such as a #defined unique constant for each .c file combined with the __LINE__ number; together these would fit in maybe 20 bits.
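As a rough sketch of just the value encoding described here (not of how the value gets onto the stack), one could pack a magic number, a per-file constant, and __LINE__ into one 32-bit word. The 12/8/12 bit split, the magic value 0xA5C, and all MARK_* and mark_* names are assumptions made for illustration:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical layout: 12-bit magic | 8-bit file ID | 12-bit line number. */
#define MARK_MAGIC      0xA5Cu
#define MARK_FILE_BITS  8
#define MARK_LINE_BITS  12

static_assert(MARK_FILE_BITS + MARK_LINE_BITS == 20,
              "file ID plus line number must fit in 20 bits");

/* Each .c file would #define its own FILE_ID before using MARK_HERE(). */
#ifndef FILE_ID
#define FILE_ID 7u
#endif

/* Build the marker; out-of-range file IDs or line numbers are truncated. */
#define MARK_PACK(file, line) \
    ((((uint32_t)MARK_MAGIC) << 20) | \
     ((((uint32_t)(file)) & 0xFFu) << MARK_LINE_BITS) | \
     (((uint32_t)(line)) & 0xFFFu))

/* Marker identifying the current file and source line. */
#define MARK_HERE() MARK_PACK(FILE_ID, __LINE__)

/* Helpers for whatever later scans memory for markers. */
static inline int      mark_is_valid(uint32_t m) { return (m >> 20) == MARK_MAGIC; }
static inline unsigned mark_file(uint32_t m)     { return (m >> MARK_LINE_BITS) & 0xFFu; }
static inline unsigned mark_line(uint32_t m)     { return m & 0xFFFu; }
```

With this split, files longer than 4095 lines would alias line numbers, so the widths would need adjusting for a real code base.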
It is surprisingly hard to get a C or C++ compiler to reliably put something on the stack. Even expressions like
    *(void **)__builtin_alloca(sizeof (void *)) = (void *)(expression);
can be optimized out, and when you force it, say
    _label: *(void *volatile *)__builtin_alloca(sizeof (void *)) = &&_label;
you end up generating a lot of code instead of the simple mov+push one might expect (on ARM Thumb or x86 or x86-64, for example).
This is why I suggested using an external stack for the call graph and stack pointers: that can be done with very tight extended asm, and thus with low overhead.
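To make the idea of an external stack concrete, here is a minimal sketch in plain C rather than the extended asm suggested above, so it only shows the data structure and makes no claim about the overhead. The xstack names, the fixed depth, and the use of __builtin_frame_address(0) (a GCC/Clang builtin) as a stand-in for the stack pointer are all assumptions, and the sketch is single-threaded:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSTACK_DEPTH 256           /* hypothetical maximum recorded depth */

struct xframe {
    uint32_t  id;                  /* function identifier */
    void     *sp;                  /* approximate stack pointer at entry */
};

static struct xframe xstack[XSTACK_DEPTH];
static size_t        xstack_top;   /* number of live entries */

/* Record entry into a function. */
static inline void xstack_push(uint32_t id, void *sp)
{
    if (xstack_top < XSTACK_DEPTH) {
        xstack[xstack_top].id = id;
        xstack[xstack_top].sp = sp;
    }
    xstack_top++;   /* keep counting so pushes and pops stay balanced when full */
}

/* Record exit from a function. */
static inline void xstack_pop(void)
{
    assert(xstack_top > 0);        /* unbalanced ENTER/LEAVE is a bug */
    xstack_top--;
}

/* Place these at function entry and before every return. */
#define XSTACK_ENTER(id) xstack_push((id), __builtin_frame_address(0))
#define XSTACK_LEAVE()   xstack_pop()
```

Because the entries live in ordinary static storage, the compiler cannot optimize the stores away the way it can with dead stack slots. A multithreaded program would need one such stack per thread, for example by marking both objects _Thread_local.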