Interesting project.
By and large, prefetching isn't worth the effort for anything but the instruction stream (I-stream), and even there the case isn't all that compelling. (There are other ways, see below.)
But there is still room for improvement in your system.
Associativity isn't all it's cracked up to be. In particular, 8-way is far beyond the optimal point for a 32 KB cache and "normal" (that is to say, "nearly all") programs. Back in the 68030 era, direct-mapped was pretty good, and 4-way was an outer limit. (This is based on extensive research I saw in 1984 or so, which simulated hundreds of cache configurations against real address traces from all kinds of programs, from numerical apps to database systems.) A much more important dimension is the block size. While the DRAM might want to deliver 64 bytes, 64 bytes is a sub-optimal (far sub-optimal) block size for a 32 KB cache. 16 bytes is about where I'd think the right compromise might be. Yes, you throw away 3/4 of the DRAM fetch, but it probably wasn't worth keeping anyway.
If I were asked, I'd say 8 byte blocks, two way (max).
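To put numbers on that tradeoff, here's a quick sketch (the configurations are just the ones discussed above) of how block size and associativity determine the number of distinct indexes (sets) in a fixed 32 KB cache:

```python
# Sets = total size / (block size * associativity), for a fixed 32 KB cache.
CACHE_BYTES = 32 * 1024

def num_sets(block_bytes: int, ways: int) -> int:
    return CACHE_BYTES // (block_bytes * ways)

for block, ways in [(64, 8), (64, 2), (16, 2), (8, 2)]:
    print(f"{block:>2}-byte blocks, {ways}-way: {num_sets(block, ways):4d} sets")
```

Shrinking the block from 64 to 16 bytes quadruples the number of distinct indexes at the same associativity, which is exactly the tradeoff at issue here.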
Some will make the argument that streaming justifies large blocks. Fine, but unless you are doing some DFT stuff, it isn't worth cutting the number of distinct cache indexes by 4x. If you really believe that streaming is important, there are two things you can do:
1. Leave the DRAM page open after each read.
2. Implement a "recently fetched" buffer (a small cache) of about 48 (or 2 × 48) bytes to hold the remains of the last one or two DRAM fetches.
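A minimal sketch of option 2 (the class name and interface are my own, assuming a 64-byte DRAM fetch): a streaming read that misses the small-block cache can still be served from the remains of a recent DRAM fetch, without going back to DRAM and without polluting the cache with large blocks.

```python
DRAM_FETCH = 64   # bytes delivered per DRAM access (assumed)

class FetchBuffer:
    """Holds the data of the last one or two DRAM fetches."""

    def __init__(self, entries: int = 2):
        self.entries = entries
        self.lines = []          # list of (base_address, data) pairs

    def lookup(self, addr: int):
        base = addr - (addr % DRAM_FETCH)
        for b, data in self.lines:
            if b == base:
                return data      # hit: the remains of a recent DRAM fetch
        return None              # miss: must go to DRAM

    def fill(self, addr: int, data: bytes):
        base = addr - (addr % DRAM_FETCH)
        self.lines.append((base, data))
        if len(self.lines) > self.entries:
            self.lines.pop(0)    # keep only the most recent fetches
```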
As for stream prefetching, the best bet for something in the 68030 era is likely an 8-instruction fetch buffer. This would hold a typical basic block following a branch target. It was a feature of the first of the one-chip VAX processors and worked well. Stream prefetching for data is generally not worth the effort, as it pollutes the cache.
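A rough sketch of that fetch-buffer idea, assuming 4-byte instruction words (the interface is invented for illustration): on a taken branch, the next 8 sequential words are prefetched so the basic block at the target streams out of the buffer.

```python
BUF_WORDS = 8   # buffer holds 8 instructions (assumed)
WORD = 4        # bytes per instruction word (assumed)

class IFetchBuffer:
    def __init__(self, memory_read):
        self.read = memory_read   # callable: address -> instruction word
        self.base = None
        self.words = []

    def redirect(self, target: int):
        """On a taken branch, refill the buffer from the target address."""
        self.base = target
        self.words = [self.read(target + i * WORD) for i in range(BUF_WORDS)]

    def fetch(self, pc: int):
        if self.base is not None and 0 <= (pc - self.base) < BUF_WORDS * WORD:
            return self.words[(pc - self.base) // WORD]   # buffer hit
        return self.read(pc)                              # fall back to memory
```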
---
I am sure that "caches are evil" was meant tongue-in-cheek, but if not: unless the goal is to know exactly how many cycles each instruction takes, a cache is important. No practical general-purpose processor has been without some kind of cache since at least the '90s, if not the late '70s. Certainly all of the last 10 general-purpose designs I've participated in had caches (four Alphas, three VAXes, three MIPS).
The important metric here, by the way, is "miss rate," not "hit rate," even though the two are directly related (m = 1 - h). Improving the hit rate from 98 to 99 percent sounds insignificant, but it actually cuts the number of external accesses in half. That's a big deal, especially if the DRAM access time is significantly longer than the cache hit time. Sure, it's only a matter of presentation, but plot the data on a useful axis (miss rate, not hit rate) and it becomes obvious that the hit-rate curve is almost meaningless.
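Here's that arithmetic spelled out as a trivial sketch (the timing numbers are made up for illustration):

```python
def external_accesses(hit_rate: float, total_accesses: int) -> float:
    # Every miss is an external (DRAM) access: m = 1 - h.
    return (1 - hit_rate) * total_accesses

def avg_access_time(hit_rate: float, t_hit: float, t_dram: float) -> float:
    # Hits cost the cache hit time; misses cost the DRAM access time.
    return hit_rate * t_hit + (1 - hit_rate) * t_dram

misses_98 = external_accesses(0.98, 1_000_000)   # ~20,000 external accesses
misses_99 = external_accesses(0.99, 1_000_000)   # ~10,000 -- half as many
```

With, say, a 1-cycle hit and a 10-cycle DRAM access, that one point of hit rate takes the average access time from about 1.18 cycles down to about 1.09.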