My 2 cents.
I've written an RTOS in the past as a learning experience, and the minimum amount of assembly you need is the stack switch itself. It will typically look something like:
[Save all registers]
[Store SP]
CALL osSwitchStacks
[Load SP]
[Load all registers]
[Resume task]
The osSwitchStacks function has a prototype like void* osSwitchStacks(void*). It's called with the old SP, saves it to the active task object (to remember where that task's top of stack was), and returns the SP of the task to switch to. To prepare a task, you simply fill in initial values for all registers in the order they are pushed/popped, with the task's entry point in the PC slot.
This really is a minimal example. There are lots of details to get right, like FP registers, coprocessor state (if relevant to the architecture), separate stacks for kernel/tasks w.r.t. interrupt handling, etc. I'm by no means an RTOS expert, nor have I worked with one recently enough to recall every detail.
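To make the description above a bit more concrete, here is a minimal sketch of the C side. The reduced OsTask_t struct, the OS_SAVED_REGS count, the osPrepareStack helper and the frame layout (PC slot at the highest address, registers below it, full descending stack) are all assumptions for illustration; the real layout has to mirror whatever your assembly stub pushes and pops on your architecture.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced task object: only the saved SP matters here. */
typedef struct OsTask {
    void *sp;                 /* top of stack at the last switch-out */
} OsTask_t;

OsTask_t *osActiveTask;       /* task currently running */
OsTask_t *osPendingTask;      /* task chosen to run next (assumed non-NULL) */

/* Called from the assembly stub with the old SP; returns the SP to load. */
void *osSwitchStacks(void *oldSp)
{
    osActiveTask->sp = oldSp;         /* remember where the old task stopped */
    osActiveTask = osPendingTask;     /* the pending task becomes active */
    return osActiveTask->sp;          /* the stub loads this into SP */
}

/* Assumed number of general-purpose registers saved by the stub. */
#define OS_SAVED_REGS 8u

/* Build an initial frame so the first "restore" starts the task at entry.
 * Assumes a full descending stack and that the stub pops the registers
 * first and the PC last. */
void osPrepareStack(OsTask_t *task, uintptr_t *stackBase, size_t words,
                    void (*entry)(void))
{
    uintptr_t *sp = stackBase + words;    /* one past the highest word */

    *--sp = (uintptr_t)entry;             /* PC slot: task entry point */
    for (size_t i = 0; i < OS_SAVED_REGS; ++i) {
        *--sp = 0;                        /* initial register values */
    }
    task->sp = sp;                        /* what osSwitchStacks will return */
}
```

On a real target you would also have to account for things like a status register, alignment requirements and a return address for when the task function exits.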
Note that the above code doesn't run the scheduler itself. In my view, the scheduler could be triggered at a (fixed) timer interval to check whether it has to change the active task based on the current conditions (e.g. perhaps a delay has expired, or you have a round-robin scheduler). You would also need to run the scheduler when a task blocks on some call (e.g. a delay, a wait on some resource, etc.).
In my experience working with an RTOS, you'll find that most tasks end up in a blocked state until something triggers a change in the system. You could also have more direct context switches without running the scheduler. For example, say you have a low priority task currently running and a high priority task is waiting for an event flag to be set. The function that sets the flag is called, and can look something like:
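As a rough illustration of the timer-driven case, a round-robin tick handler could look like the sketch below. The Task struct, the state names and schedTick are hypothetical, and the function only picks the next task index; the actual context switch would still go through the stack switch code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task states for this sketch. */
typedef enum { TASK_READY, TASK_DELAYED, TASK_BLOCKED } TaskState;

typedef struct {
    TaskState state;
    uint32_t  wakeTick;   /* tick at which a delayed task becomes ready */
} Task;

/* Run once per timer tick. Wakes tasks whose delay has expired, then
 * returns the index of the next ready task after `current` (round robin).
 * Returns `current` if nothing else is ready; a real kernel would
 * normally fall back to an idle task instead. */
size_t schedTick(Task *tasks, size_t count, size_t current, uint32_t now)
{
    for (size_t i = 0; i < count; ++i) {
        /* The signed difference keeps the comparison valid across tick
         * counter wrap-around (relies on two's complement conversion). */
        if (tasks[i].state == TASK_DELAYED &&
            (int32_t)(now - tasks[i].wakeTick) >= 0) {
            tasks[i].state = TASK_READY;
        }
    }

    for (size_t step = 1; step <= count; ++step) {
        size_t idx = (current + step) % count;
        if (tasks[idx].state == TASK_READY) {
            return idx;
        }
    }
    return current;
}
```

Tasks in TASK_BLOCKED are deliberately left alone here: they would be made ready by whatever releases the resource they are waiting on, not by the tick.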
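    void osSignalEvent(OsTask_t* pTask, uint32_t flag) {
        pTask->signal_flags |= flag;
        /* Is the task waiting on any of the flags that are now set? */
        if (pTask->signal_mask & pTask->signal_flags) {
            pTask->state = READY;
            /* Only preempt if the woken task outranks the running one. */
            if (osActiveTask->priority < pTask->priority) {
                osPendingTask = pTask;
                osPendContextSwitch();
            }
        }
    }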
Finally, to estimate how much stack a given task needs: GCC can output a per-function stack usage estimate if you pass -fstack-usage during compilation. I'm not sure how accurate it is for all kinds of code (I have no idea how well it deals with calls through function pointers).
Another method is to fill the stack with a known pattern and then, after the system has run for long enough, check how far into the stack that pattern has been overwritten.
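A minimal sketch of that pattern-fill approach, assuming a descending stack (so the lowest addresses are the last to be used) and a hypothetical fill value and function names:

```c
#include <stddef.h>
#include <stdint.h>

/* Arbitrary fill value; anything unlikely to appear as real data works. */
#define STACK_FILL_PATTERN 0xA5A5A5A5u

/* Paint the whole stack before the task is started. */
void osPaintStack(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; ++i) {
        stack[i] = STACK_FILL_PATTERN;
    }
}

/* Count the words that still hold the pattern, scanning from the lowest
 * address upward. With a descending stack this is the headroom that was
 * never touched. */
size_t osStackUnusedWords(const uint32_t *stack, size_t words)
{
    size_t unused = 0;
    while (unused < words && stack[unused] == STACK_FILL_PATTERN) {
        ++unused;
    }
    return unused;
}
```

The result is only a lower bound on what the task may need: it reflects the deepest point reached during that particular run, and a task that happens to write the pattern value itself can make the stack look less used than it was.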