The fundamental limits are the speed of light (i.e. latency) and heat generation. We are pushing both: compare the heat flux with that in a kettle or nuclear reactor, and the time it takes a signal to get across a chip (worse: to another chiplet) with the clock period.
AMD/Intel chips (and SMP big iron) all rely on cache coherency. The messages associated with cache coherency protocols become a limiting factor: too many messages, too much latency.
There are a few techniques for avoiding the need for cache coherency, e.g. CSP and its descendants, or MapReduce etc. We need more techniques.
You are right. There are two ways of looking at the situation, and I was mainly promoting the other way.
One way is to ask what the maximum number of cores is that you can have (ignoring the practical limits on how many cores a system/program can usefully use). So I meant that in theory, assuming cores continue to get ever cheaper in the future.
So (hypothetically), at some future date we may see domestic home computers with 1,000 cores, maybe later still 1,000,000, and later still even a billion or trillion cores. It is not clear where the boundaries are going to be (especially if quantum computers become a thing), or at what cost.
The fastest computer I can quickly find by googling (Summit) seems to have around 2.5 million cores.
https://www.top500.org/system/179397 (Cores: 2,414,592)
So, presumably, one day we will see home computers with that kind of power (2.5 million cores, theoretical peak (Rpeak) of 200,795 TFlop/s) for under $1,000. But when?
2021, 2030, 2050/60, 2199, 9998?
But on the other hand (what you seem to be describing), there are limits on how many cores an actual real-life problem or program can really use.
For some problems, such as converting a long video from one format to another (or compressing it), the video can be split into hundreds of thousands of individual frames, and each frame can have millions of pixels.
So millions of cores can fairly easily be used in parallel.
But many other problems need communication (such as the cache coherency you mentioned, as part of shared memory), which can severely limit the number of cores that can usefully be used to parallelize a particular task.
Also (as you said), latency can be the fundamental limit, especially in some real-time embedded systems. E.g. a car essentially has to respond to certain events (such as airbag deployment) within rather limited amounts of time; otherwise the accident (or whatever the driver/car was trying to do) is over, and it is too late.
So things like the XMOS xCORE may well help provide a starting point for such work, which is what the Inmos Transputer and its Occam programming language were trying to achieve many years ago.
Sadly, the reality is that single-core performance is still the main performance factor for many things, and C/C++-like languages, rather than highly parallel languages, are still the norm.
tl;dr
Even if hardware with many cores becomes cheap and commonplace, the software to use such potentially huge computing power could be a decade or many decades away. In a number of cases, theory says that fully using the cores is not even possible: the serial fraction of a program bounds the achievable speedup (Amdahl's law).
https://en.wikipedia.org/wiki/Amdahl%27s_law