68030 prefetch thoughts?
| ZaneKaminski:
Anyone ever implement a prefetcher for an MC68030 or similar-generation system? For fun, I'm working on a RAM controller for a 33 MHz MC68030-based system. The controller is implemented in a modern FPGA, and the RAM is 133 MHz SDR SDRAM connected to the 68030 data bus through some 74LVC245A level-shifting buffers. I have achieved 4-clock read latency for "random reads" and 3-clock latency when there's a hit in an open DRAM row. The row size is 2 kB and there are 8 SDRAM banks with independent sense amps. My controller always leaves the row open after a read/write, and it interleaves each 2 kB chunk of RAM across the 8 banks, so you have to go 16 kB to get a bank conflict. In case of a bank conflict, the precharge and activate can both be completed in 4 clocks, same as if the bank were closed. With this arrangement, plus a 32 kB, 8-way, 3-clock-latency L2 cache in the FPGA, I expect to get 3-clock accesses much of the time. The pin-to-pin latency of the FPGA through the level-shifting buffers is too long to allow a 2-clock access in any case, so 3 clocks is the best I can do.

In addition to the cache and fast-ish RAM, I'd like to implement a "prefetcher" to reduce the remaining number of 4-clock RAM accesses. Since a row hit is 3 clocks and I can't go faster than that, what I really need is not so much a prefetcher as a pre-precharge-and-activator. It needs to decide to close a particular row in one of the 8 banks and open a different one, thus converting a future 4-clock access into a 3-clock access (at least if the prediction is right). Any thoughts on a good algorithm?

A lot of prefetchers have been studied in academia and industry, but this particular situation is kind of odd. Modern systems prefetch something on the order of 64 or 128 bytes at a time, and the latency is reduced by roughly 10x when there's a correct prediction. Here I just want to open the right row, so the equivalent prefetch size is 2 kB and the latency is only reduced by 25%. Quite a different situation from what has been studied.
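To make the question concrete, here is the simplest scheme I can think of, written as a rough C behavioral model rather than RTL: a sequential pre-activator that opens the next chunk's row whenever a stream nears the top of its current 2 kB chunk. The address split and the "near end" threshold are just placeholders matching the layout described above; with the interleaving, the next chunk is in a different bank, so the row being read stays open.

--- Code: ---
/* Rough behavioral model of a "pre-precharge-and-activator", not RTL.
 * Assumed address split (placeholder): bits [10:0] = offset within a
 * 2 kB row, bits [13:11] = bank (8 banks), bits [31:14] = row. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_BANKS   8
#define BANK_SHIFT  11                 /* 2 kB chunks interleaved across banks */
#define BANK_MASK   0x7u
#define ROW_SHIFT   14                 /* 16 kB stride back to the same bank   */
#define ROW_BYTES   2048u
#define NEAR_END    (ROW_BYTES - 256u) /* "close to end of row" trigger        */

static uint32_t open_row[NUM_BANKS];
static bool     row_valid[NUM_BANKS];

/* Stand-in for the controller's PRECHARGE + ACTIVATE command sequence. */
static void precharge_and_activate(unsigned bank, uint32_t row)
{
    printf("activate bank %u row %u\n", bank, (unsigned)row);
    open_row[bank]  = row;
    row_valid[bank] = true;
}

/* Called for every access that reaches the SDRAM controller. */
static void on_access(uint32_t addr)
{
    unsigned bank   = (addr >> BANK_SHIFT) & BANK_MASK;
    uint32_t row    = addr >> ROW_SHIFT;
    uint32_t offset = addr & (ROW_BYTES - 1);

    /* Ordinary open-row bookkeeping: this is the 4-clock case. */
    if (!row_valid[bank] || open_row[bank] != row)
        precharge_and_activate(bank, row);

    /* Naive sequential pre-activator: when a stream nears the top of its
     * 2 kB chunk, open the row holding the next chunk.  Because of the
     * interleaving, that chunk lives in a different bank, so the row we
     * are currently reading is not disturbed. */
    if (offset >= NEAR_END) {
        uint32_t next  = (addr & ~(ROW_BYTES - 1)) + ROW_BYTES;
        unsigned nbank = (next >> BANK_SHIFT) & BANK_MASK;
        uint32_t nrow  = next >> ROW_SHIFT;
        if (!row_valid[nbank] || open_row[nbank] != nrow)
            precharge_and_activate(nbank, nrow);
    }
}

int main(void)
{
    /* Walk 64 kB sequentially: every row change after the first should
     * already have been activated by the pre-activator. */
    for (uint32_t a = 0; a < 0x10000; a += 4)
        on_access(a);
    return 0;
}
--- End code ---

Obviously this only helps sequential streams and never predicts across a branch or a strided access pattern, which is exactly where I'd like better ideas.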
| bson:
A bit belated, but why even bother with DRAM and caching anymore? 2M x 16-bit, 10 ns SRAM: IS61WV204816BLL-10TLI. Runs up to 100 MHz async, with zero wait states.
| DiTBho:
--- Quote from: bson on December 24, 2024, 08:34:46 pm ---
A bit belated, but why even bother with DRAM and caching anymore? 2M x 16-bit, 10 ns SRAM: IS61WV204816BLL-10TLI. Runs up to 100 MHz async, with zero wait states.
--- End quote ---

Nice idea, cache is evil :D
| radiogeek381:
Interesting project. By and large, prefetching isn't worth the effort for anything but the instruction stream, and even there the case isn't all that compelling. (There are other ways; see below.) But there is still room for improvement in your system.

Associativity isn't all it's cracked up to be. In particular, 8-way is far beyond the optimal point for a 32 kB cache and "normal" (that is to say, nearly all) programs. Back in the 68030 era, direct-mapped was pretty good, and 4-way was an outer limit. (This is based on extensive research I saw around 1984, which simulated hundreds of cache configurations against real address traces from all kinds of programs, from numerical apps to database systems.)

A much more important dimension is the block size. While the DRAM might want to deliver 64 bytes, 64 bytes is a far sub-optimal size for a 32 kB cache. 16 bytes is about where I'd expect the right compromise to be. Yes, you throw away 3/4 of the DRAM fetch, but it probably wasn't worth fetching anyway. If I were asked, I'd say 8-byte blocks, two-way (max). Some will argue that streaming justifies large blocks. Fine, but unless you are doing some DFT stuff, it isn't worth losing 4x the number of distinct cache indexes. If you really believe that streaming is important, there are two things you can do:

1. Leave the DRAM page open after each read.
2. Implement a "recently fetched" buffer (a small cache) of about 48 or 2 x 48 bytes to hold the remains of the last (or last two?) DRAM fetches.

As for stream prefetching, the best bet for something of the 68030 era is likely an 8-instruction fetch buffer. This would hold a typical basic block following a branch target. It was a feature of the first of the one-chip VAX processors and worked well. Stream prefetching for data is generally not worth the effort, as it pollutes the cache.

---

I am sure that "caches are evil" was meant tongue in cheek, but if not: unless the goal is to know exactly how many cycles each instruction takes, a cache is important. No practical general-purpose processor has been without some kind of cache since at least the 90s, if not the late 70s. Certainly all of the last 10 general-purpose designs I've participated in had caches (four Alphas, three VAXes, three MIPS).

The important metric here, by the way, is "miss rate" and not "hit rate", even though the two are directly related (m = 1 - h). Improving the hit rate from 98 to 99 percent sounds insignificant, but it actually cuts the number of external accesses in half. That's a big deal, especially if the DRAM access time is significantly longer than the cache hit time. Sure, it is only a matter of presentation, but plotted on a hit-rate axis the differences that matter are nearly invisible; miss rate is the useful axis.
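To put rough numbers on that last point, here is a back-of-the-envelope calculation. It uses ZaneKaminski's 3-clock and 4-clock figures purely as stand-ins; the halving of external accesses holds regardless of the timings.

--- Code: ---
/* Miss rate is the useful axis: 98% -> 99% hits sounds marginal, but it
 * halves the number of accesses that leave the cache. */
#include <stdio.h>

int main(void)
{
    const double refs   = 1e6;   /* memory references                       */
    const double t_hit  = 3.0;   /* clocks on a cache hit (OP's figure)     */
    const double t_miss = 4.0;   /* clocks on an external access (stand-in) */
    const double hit_rates[] = { 0.98, 0.99 };

    for (int i = 0; i < 2; i++) {
        double h = hit_rates[i];
        double m = 1.0 - h;                      /* miss rate               */
        double external = refs * m;              /* accesses going to DRAM  */
        double amat = h * t_hit + m * t_miss;    /* average access time     */
        printf("hit %2.0f%%: %8.0f external accesses per million, AMAT %.2f clocks\n",
               h * 100.0, external, amat);
    }
    return 0;
}
--- End code ---

With these particular numbers the average access time barely moves, because the penalty is only one extra clock; the point is the external-access count, which drops from 20,000 to 10,000 per million references.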
| SiliconWizard:
You could read this: https://dl.acm.org/doi/pdf/10.1145/2907071 |