400 MHz, 8 KB D/I cache
IDT-79RC32H435, made in 2006
1 instruction per cycle, five-stage pipeline
Oh wow. I took a look at the µarch.
That's a *maximum* of one instruction per cycle, and there are going to be a lot of times when it isn't achieved.
The TLB setup in particular is very minimal by modern standards. There's a 16-entry "JTLB" that raises an exception on a miss, so OS software then walks the page table and inserts the translation into the JTLB before returning. There are also 3-entry ITLB and DTLB that run at full speed; an access that misses in the ITLB or DTLB but hits in the JTLB costs a 2-cycle delay.
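Here's a rough Python model of that three-level lookup, just to make the costs concrete. It's a toy under stated assumptions: 4 KB pages, LRU replacement in every TLB, and a made-up `refill_cycles` figure standing in for exception entry plus the software page walk (the real number depends on the OS handler). The names `TLB` and `translation_cycles` are hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4 * 1024     # assumption: 4 KB pages
UTLB_MISS_PENALTY = 2    # uTLB miss that hits in the JTLB, per the description above


class TLB:
    """Fully associative TLB with LRU replacement. LRU is a simplifying
    assumption; MIPS refill handlers commonly pick a victim with tlbwr,
    which writes at a pseudo-random index."""

    def __init__(self, entries):
        self.entries = entries
        self.vpns = OrderedDict()

    def hit(self, vpn):
        if vpn in self.vpns:
            self.vpns.move_to_end(vpn)   # mark as most recently used
            return True
        return False

    def fill(self, vpn):
        """Insert a translation; return the evicted virtual page number, if any."""
        evicted = None
        if len(self.vpns) >= self.entries:
            evicted, _ = self.vpns.popitem(last=False)
        self.vpns[vpn] = True
        return evicted

    def drop(self, vpn):
        self.vpns.pop(vpn, None)


def translation_cycles(addr, utlb, jtlb, refill_cycles):
    """Extra cycles one access spends on address translation in this toy model.
    refill_cycles is a hypothetical stand-in for the software refill cost."""
    vpn = addr // PAGE_SIZE
    if utlb.hit(vpn):
        return 0                         # full-speed ITLB/DTLB hit
    cycles = UTLB_MISS_PENALTY
    if not jtlb.hit(vpn):
        evicted = jtlb.fill(vpn)         # software writes the new JTLB entry
        if evicted is not None:
            utlb.drop(evicted)           # don't keep a stale uTLB copy around
        cycles += refill_cycles
    utlb.fill(vpn)                       # uTLB reloads the translation from the JTLB
    return cycles


utlb, jtlb = TLB(3), TLB(16)
addrs = [0, 8, 4096, 8192, 12288, 0]
costs = [translation_cycles(a, utlb, jtlb, refill_cycles=100) for a in addrs]
print(costs)  # [102, 0, 102, 102, 102, 2]
```

The last access shows the middle case: page 0 has been pushed out of the 3-entry uTLB by pages 1 to 3, but it's still in the JTLB, so it only costs the 2-cycle penalty rather than a trip through the exception handler.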
That's the same kind of structure as modern CPUs, but tiny!
For example, the Arm A53 has 10 entries each in its ITLB and DTLB, 512 entries in the shared L2 TLB, and hardware refill of the L2 TLB on a miss. SiFive's U74 core (a competitor to the Arm A53 and A55) has 40 ITLB and 40 DTLB entries and 512 shared L2 TLB entries.
AMD Zen2 has 64 L1 TLB entries and 2048 L2 TLB entries.
What this means is that on the MIPS 79RC32H435, using 4 KB pages, the JTLB can cover any 64 KB of RAM (16 × 4 KB) without misses. If you're using more RAM than that, there will be very frequent exceptions to refill the JTLB in software.
The Arm A53 and SiFive U74, on the other hand, both cover 2 MB of RAM with their L2 TLBs (512 × 4 KB). The U74-MC in e.g. the VisionFive 2 also has 2 MB of L2 cache, so that matches quite well. The Arm A53 supports a maximum of 2 MB of L2 cache, but the Pi 3 / Pi Zero 2, for example, have 512 KB.
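Those coverage ("TLB reach") numbers are just entries × page size. Here's a small sketch of that arithmetic, assuming 4 KB pages and one page per entry, the way the numbers above count them. One caveat: MIPS32 TLB entries each map an even/odd pair of pages, so if the 16 JTLB entries are really pairs, the reach would be 128 KB; the hypothetical `pages_per_entry` parameter covers that case.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size=4 * KB, pages_per_entry=1):
    """Bytes of address space a TLB can map at once without a miss."""
    return entries * pages_per_entry * page_size


# Entry counts quoted above, all with 4 KB pages.
cores = {
    "79RC32H435 JTLB": 16,
    "Arm A53 L2 TLB": 512,
    "U74 L2 TLB": 512,
    "Zen 2 L2 TLB": 2048,
}

for name, entries in cores.items():
    print(f"{name:<18} {tlb_reach(entries) // KB:>5} KB")
```

That gives 64 KB for the MIPS part, 2048 KB for the A53 and U74, and 8192 KB for Zen 2: a 32× to 128× gap before even counting hardware vs software refill.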
The 79RC32H435 appears to have no L2 cache, just 8 KB each of icache and dcache.
I don't see anything about branch prediction. I guess there is none, and it just relies on the MIPS branch delay slot to cover fetching the instruction at the branch target.
Sooo ... the 79RC32H435 should be able to run at 1 IPC a lot of the time with carefully written small code and data, but things such as bash, awk, Python etc. are going to spend a LOT of time waiting on TLB and cache misses.
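To see how sharp the 64 KB cliff is, here's a toy sweep: a loop touching every 4 KB page of a working set in order, over and over, with a 16-entry JTLB. LRU replacement is an assumption, and a cyclic sweep is its worst case: one page past the reach and every page misses on every pass. Random replacement (what `tlbwr` gives a MIPS refill handler) degrades more gently, but the cliff is still there. `refills_per_pass` is a hypothetical helper, not anything from a real OS.

```python
from collections import OrderedDict

PAGE_SIZE = 4 * 1024   # assumption: 4 KB pages
JTLB_ENTRIES = 16      # one page per entry, as counted above


def refills_per_pass(working_set_bytes, passes=4):
    """Average JTLB refill exceptions per pass (after the cold first pass) when
    a loop touches every page of the working set in order, repeatedly."""
    pages = (working_set_bytes + PAGE_SIZE - 1) // PAGE_SIZE
    jtlb = OrderedDict()
    refills = 0
    for p in range(passes):
        for vpn in range(pages):
            if vpn in jtlb:
                jtlb.move_to_end(vpn)       # hit: mark as most recently used
                continue
            if p > 0:
                refills += 1                # skip cold misses on the first pass
            if len(jtlb) >= JTLB_ENTRIES:
                jtlb.popitem(last=False)    # evict the least recently used page
            jtlb[vpn] = True
    return refills / (passes - 1)


for kb in (32, 64, 68, 128):
    print(f"{kb:>4} KB working set: {refills_per_pass(kb * 1024):.0f} refills per pass")
```

Up to 64 KB the steady state has zero refills; at 68 KB it's 17 per pass and at 128 KB it's 32, i.e. every page traps to the handler every time around.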