As for the way it worked, what I remember (which may be wrong) was that the second CPU was run far enough behind that it could be interrupted and its pre-fault stack puke saved. Then, when the main CPU was ready to resume the faulted process, it would reload that good stack puke and resume at the instruction that had faulted.
The more I think about this, the more I think this one-instruction-behind business can't be correct.
I don't think the 2nd CPU could be more than one instruction behind; otherwise you would have all sorts of problems keeping the CPUs executing the same code, due to the effect of writes to memory, or the lack of them. Even then it might be hard. The lead CPU would probably have to have its writes ignored. E.g. what happens if the lead CPU executes a test-and-set (TAS, an atomic R-M-W operation on the 68000)? If the write is ignored, fine: the 2nd CPU will see the unchanged value and take the same path. If not, it will possibly take a different branch (since test-and-set is usually followed by a branch).
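To make the test-and-set case concrete, here is a toy Python model. It is only a sketch of the argument, not of any real hardware: `tas`, `lockstep_tas`, the `commit` flag and the address 0x1000 are all made up for illustration. The only 68000 fact it relies on is that TAS tests a byte and then sets bit 7 of it.

```python
def tas(memory, addr, commit=True):
    """Toy model of a test-and-set: test the byte, then set bit 7.

    Returns True if the byte was free (bit 7 clear), i.e. the lock was taken.
    `commit=False` models a CPU whose write to memory is ignored (assumption).
    """
    old = memory[addr]
    if commit:
        memory[addr] = old | 0x80
    return (old & 0x80) == 0


def lockstep_tas(lead_writes_commit):
    """Run the same TAS on the lead CPU, then on the lag CPU."""
    memory = {0x1000: 0x00}  # hypothetical lock byte, initially free
    lead_got_lock = tas(memory, 0x1000, commit=lead_writes_commit)
    lag_got_lock = tas(memory, 0x1000, commit=True)
    return lead_got_lock, lag_got_lock


# Lead write reaches memory: the lag CPU sees the lock already taken.
print(lockstep_tas(lead_writes_commit=True))
# Lead write ignored: both CPUs see a free lock and branch the same way.
print(lockstep_tas(lead_writes_commit=False))
```

With the lead write committed the two CPUs disagree about who got the lock, which is exactly the point where their branches would diverge.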
However, if writes from the lead CPU are ignored, think about what happens if code reads a value, modifies it, writes it back and then immediately reads it again. You probably wouldn't write code like that by hand, but it could be generated by a compiler, especially one without much optimisation. The lead CPU's write is ignored, so its second read will get the old value.
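The read, modify, write, read-again problem can be shown with another small Python sketch. Everything here is an assumption made for the sake of the argument: a four-instruction toy program, a single accumulator per CPU, the lag CPU exactly one instruction behind, and lead stores simply dropped.

```python
def run(program, memory, lag=1):
    """Run `program` on a lead and a lag CPU sharing one memory.

    Assumptions: the lag CPU runs `lag` instructions behind, lead stores
    are discarded, and only lag stores actually reach memory.
    """
    mem = dict(memory)
    acc = {"lead": 0, "lag": 0}
    pc = {"lead": 0, "lag": -lag}
    while pc["lag"] < len(program):
        for cpu in ("lead", "lag"):
            i = pc[cpu]
            if 0 <= i < len(program):
                op, arg = program[i]
                if op == "load":
                    acc[cpu] = mem[arg]
                elif op == "add":
                    acc[cpu] += arg
                elif op == "store" and cpu == "lag":
                    mem[arg] = acc[cpu]  # lead stores fall through, ignored
            pc[cpu] += 1
    return acc, mem


# read x, modify it, write it back, read it again
PROGRAM = [("load", "x"), ("add", 1), ("store", "x"), ("load", "x")]

acc, mem = run(PROGRAM, {"x": 5})
print(acc, mem)
```

In this model the lead CPU ends up holding the stale 5 while the lag CPU and memory hold 6, so the two CPUs are no longer executing the same thing.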
You might fix the above by prioritising lag CPU writes to main memory before lead CPU reads within the same cycle, but what about instructions with wildly differing timings? E.g. a register op followed by a DIV. The lead CPU will do the register op, then start the DIV. At the same time the lag CPU will be on the register op, but it will move on to the DIV while the lead CPU is still executing it. In that case the lead CPU would no longer be one instruction ahead; it would just be a few clock cycles ahead.
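The timing point is easy to check with a little arithmetic. In the sketch below the cycle counts (4 for the register op, 140 for the DIV) are rough, assumed figures chosen only to show the effect, and the helper names are invented for the example.

```python
from itertools import accumulate


def start_cycles(costs, offset=0):
    """Cycle at which each instruction starts, given per-instruction costs."""
    return [offset + s for s in accumulate([0] + costs[:-1])]


def instruction_at(costs, starts, cycle):
    """Index of the instruction executing at `cycle`, or None if idle."""
    for i, (start, cost) in enumerate(zip(starts, costs)):
        if start <= cycle < start + cost:
            return i
    return None


COSTS = [4, 140]  # assumed cycle counts: register op, then DIV

lead = start_cycles(COSTS)                   # lead starts at cycle 0
lag = start_cycles(COSTS, offset=COSTS[0])   # lag starts one instruction later

print(lead, lag)
# At cycle 8 both CPUs are inside the DIV (instruction index 1).
print(instruction_at(COSTS, lead, 8), instruction_at(COSTS, lag, 8))
```

The lag CPU starts the DIV only 4 cycles after the lead CPU does, so for almost the whole of the DIV both CPUs are on the same instruction, and "one instruction behind" has turned into "a few cycles behind".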
We haven't even thought about arbitrating the bus between the two CPUs.
In short I can't see that the idea of executing the same code works.
What would work, however, is for one CPU to execute the code and the other just to handle page faults. The CPU which is executing code is simply delayed, perhaps by not asserting DTACK until the memory operation can complete.
Looking at a few of the online notes, that is how the scheme is described.
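For what it's worth, the stall-on-DTACK idea can be sketched as below. This is a guess at the shape of the scheme, not a description of any real board: `PagedMemory`, `fill_cycles` and the page size are all hypothetical, and the "second CPU" is reduced to a method that copies a page in from a backing store.

```python
class PagedMemory:
    """Toy memory that withholds DTACK while a missing page is fetched."""

    def __init__(self, backing, page_size=4, fill_cycles=100):
        self.backing = backing          # full contents, as if on disk
        self.page_size = page_size      # words per page (assumed)
        self.fill_cycles = fill_cycles  # assumed cost of a page-in
        self.resident = {}              # page number -> list of words

    def _service_fault(self, page):
        """Stand-in for the second CPU: bring the page in, report the delay."""
        start = page * self.page_size
        self.resident[page] = self.backing[start:start + self.page_size]
        return self.fill_cycles

    def read(self, addr):
        """Return (value, wait_cycles) for a read by the executing CPU.

        The executing CPU never sees a fault; its bus cycle is just
        stretched by `wait_cycles` until the data is available.
        """
        page, offset = divmod(addr, self.page_size)
        waits = 0
        if page not in self.resident:
            waits = self._service_fault(page)
        return self.resident[page][offset], waits


mem = PagedMemory(backing=list(range(16)))
print(mem.read(5))  # first touch of the page: long stall
print(mem.read(6))  # same page, now resident: no stall
```

The point of doing it this way is that the executing CPU needs no ability to restart an instruction at all; from its side a page fault just looks like very slow memory.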