Of course you can only guess, until it's actually done. And even then, that's just one instance, and probably not an optimized one; after much optimization, soul-searching, and pondering over coffees, it still won't be perfectly optimized (most likely there cannot be such a thing), but perhaps getting closer.
So, what then? Well, of course if you have much experience, you can guess that such-and-such operations will take this-or-that many LUTs/gates/whatever, and probably land well within the ballpark. Say +/-20%, more than good enough to budget for the right chip (say, choosing between the 10k and the 16k LUT model).
If you don't have that much experience, you'll have to break it down further. What, at a high level, do you need to implement your system? Probably some block RAM, or a bus interface to external RAM of some type. Counters, latches and decoders, to drive state machines and interfaces. Miscellaneous gates for tying everything together. How wide is each of these interfaces? How much arithmetic needs to be performed on each of them? (A counter might only need an increment block, but an advanced state machine might need, say, several full adders; if you're doing DSP, you'll need lots of adders, and multipliers too. Etc.) Then, what does your target platform support -- does it have just LUTs and nothing else, or hardware adders (or carry chains, or carry lookahead for that matter)? Hardware multipliers? Block RAM? Other special blocks? With some hand-waving, you can roughly guess (maybe a +100/-50% ballpark?) how many resources your problem is likely to take, given a typical synthesis.
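To make the hand-waving concrete, here's a minimal back-of-envelope sketch of that kind of budget. All the block names and per-bit LUT costs are illustrative guesses of mine, not vendor data -- substitute figures from your own synthesis experiments.

```python
# Back-of-envelope FPGA resource budget, in the spirit described above.
# Per-block LUT costs are illustrative assumptions, NOT vendor figures.
def estimate_luts(blocks):
    """Sum rough LUT costs for (name, bit_width, luts_per_bit) tuples."""
    return sum(width * luts_per_bit for _, width, luts_per_bit in blocks)

design = [
    # (block name,     bit width, assumed LUTs per bit)
    ("pixel counter",  10, 1.5),   # counter plus carry logic
    ("line counter",    9, 1.5),
    ("state machine",  16, 2.0),   # state register + next-state logic
    ("bus interface",  32, 1.0),
    ("misc glue",      64, 1.0),
]

total = estimate_luts(design)
low, high = total * 0.5, total * 2.0   # the +100/-50% ballpark
print(f"estimate: {total:.0f} LUTs (plausible range {low:.0f}-{high:.0f})")
```

The point isn't the numbers themselves, but that even crude per-block costs, summed up, tell you which size class of chip to shop for.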
And if that's still too advanced -- break it down again! Learn introductory VHDL or Verilog; get a dev kit and make some very basic things: counters, blinkers, interfaces, decoders, whatever. Read and understand the output (synthesis and compilation), usually given in terms of RTL (register transfer level -- buses, latches, adders, etc., the kinds of components I mentioned above) and gates (layout and allocation).
You could break it down one more level still, and synthesize gates yourself -- but this would actually not be such a great idea, unless you really want to solve it by hand, pad and pen, discrete gate ICs, that sort of thing. Writing out combinatorial logic as basic process statements is actually a bad idea: sure, you can create any kind of gate imaginable, but the synthesizer doesn't know how to map those into the gates it has. It's not designed to solve backwards like that. (Or at least, this was the case as of, whatever it was, Quartus Web 10.0 I might've been using back then.) It will end up taking hugely more LUTs and propagation delay than a slightly higher level solution will.
So implement it primarily in behavioral statements: understand that a case statement instantiates a mux, say; an if (clk'event and clk='1') statement instantiates an edge-triggered register; arithmetic statements instantiate adders and such; etc.
Partial aside: you can generally ignore connections, I think, but highly interconnected systems may become routing constrained, and then you won't be able to use as many LUTs. I've not thought about this much before (let alone done much of anything in an FPGA, for quite some time!), but routing is just another resource, and subject to availability all the same. Most problems, at least, can be solved with an average number of connections, and that's likely the ratio of gates to interconnects the manufacturer chose for their design. Related reading:
https://en.wikipedia.org/wiki/Rent%27s_rule

So, as for your particular case -- I would guess there's buffering in there somewhere? You're smashing down a lot of bandwidth: namely, 12 bits * 75 MHz = 900 Mbps in, versus a 320x240 x 24bpp x 60Hz ~= 110.6 Mbps display (or less, depending, but probably not much more?). Where does the excess bandwidth go? Will you be discarding it wholesale (decimation)? If so, can you just skip over the useless data at the source itself? (Maybe not -- that would need an addressable image sensor, which I would guess isn't common.) Averaging or other filtering? Per line, or between lines (which requires some line buffer RAM and arithmetic blocks)? Maybe the scan rate or resolution is wholly different, so you need a good 2Mb of frame buffer RAM? Presumably there'll be a state machine to handle startup and operation, pixel or scan counters, etc.
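A quick sanity check on those bandwidth and buffer figures (assuming, as above, a 12-bit stream at the 75 MHz clock and a 320x240, 24bpp, 60 Hz display):

```python
# Sanity-check the bandwidth figures quoted above.
in_rate = 12 * 75e6             # 12 bits per pixel clock at 75 MHz -> bps
out_rate = 320 * 240 * 24 * 60  # 320x240, 24 bpp, 60 Hz display -> bps
frame_bits = 320 * 240 * 24     # one full frame held in a buffer

print(f"input:  {in_rate / 1e6:.1f} Mbps")
print(f"output: {out_rate / 1e6:.1f} Mbps")
print(f"excess: {(in_rate - out_rate) / 1e6:.1f} Mbps to discard or filter")
print(f"frame buffer: {frame_bits / 1e6:.2f} Mb")
```

A full frame comes to about 1.84 Mb, hence the "good 2Mb of frame buffer RAM" above; and nearly 790 Mbps has to go somewhere, which is why the decimation/filtering question matters.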
I would guess startup can just be ROM blocks (really just emulated by muxes, I think -- unless block RAM can be instantiated and initialized? I don't know offhand); you're just reading off values and streaming them into the port (and maybe streaming out state info as well, like delay cycles, in case some options require initialization time).
That's still pretty far from even a tentative RTL schematic, but gives some ideas where to go in that direction. And yeah, I would guess it'll fit in less than 10k LUTs/gates, but I'd be very nervous about just 1 or 2k, especially if any kind of buffering or filtering is added.
As for high speed, what they said -- edge rate is what's critical. But on top of that, you don't have any room to slow down the edges, because you have merely ~13 ns per cycle to work with, and spending even 2 ns per edge will give you some pretty iffy timing windows.
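The arithmetic behind that, assuming the 75 MHz clock and charging one rising and one falling edge against each cycle:

```python
# Timing budget at the quoted 75 MHz pixel clock.
f_clk = 75e6
period_ns = 1e9 / f_clk              # ~13.3 ns per cycle
edge_ns = 2.0                        # assumed time spent per edge
window_ns = period_ns - 2 * edge_ns  # what's left after rise + fall

print(f"period: {period_ns:.2f} ns")
print(f"usable window with {edge_ns:.0f} ns edges: {window_ns:.2f} ns")
```

Two 2 ns edges eat roughly a third of the cycle before you've accounted for any setup, hold, or skew at all.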
At lower rates, you can (and should!!) slow down the edge rates, usually with series resistors or ferrite beads. Keep this in mind for low bandwidth signals, status, I/O, whatever.
For a point of reference, that's faster than 74HC logic can move. It's fast enough that, in a foot of ribbon cable, say, and at a given instant in time, you can have ~VCC at one end of a wire and ~GND at the other, as the edge propagates along it near the speed of light. You will need good grounding (solid ground plane PCB; alternating GND and signal for cables, preferably shielded as well), and not really controlled impedances but mindful impedances: traces can't go every which way, they should be consistent, and terminated accordingly at the source, load or both (as applicable). Lengths matched as closely as needed.
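To see why a foot of cable is enough for both logic levels to coexist, work out the physical length of an edge. The 0.66c propagation velocity and 1.5 ns edge time here are typical assumed figures, not measurements:

```python
# Physical extent of a fast edge in cable.
c = 3.0e8            # m/s, speed of light
v = 0.66 * c         # assumed signal velocity in ribbon cable
t_rise = 1.5e-9      # assumed edge time, seconds

edge_length_m = v * t_rise
foot_m = 0.3048
print(f"edge occupies ~{edge_length_m * 100:.0f} cm of cable")
print(f"one foot is {foot_m * 100:.1f} cm -- the same order of magnitude,")
print("so one end can sit at VCC while the other is still at GND")
```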
If you can use differential signaling, do take advantage of it, especially for long runs, anything in cables, etc. LVDS and others have definitely been applied, in commercial applications, at lower clock rates than these! Differential alleviates not only cable EMI issues, but supply bypassing demands as well: you don't have a half dozen signals drawing tiny gulps of charge at the same time; instead it's a constant-current load, and can be less average current anyway (in much the same way that TTL vs. CMOS trades off idle vs. dynamic current draw with speed).
Expect to employ timing constraints on your platform; these will be specified somehow, maybe directly in code, maybe in supplementary project files (I think Quartus does, or did, it this way?). One output of this will be a listing of the timing constraints on each signal, which informs your physical layout -- how closely the lengths need to be matched. Be careful that some blocks can be instantiated in very different ways, so as to meet some constraints (like propagation delay from a clock) while missing others (maybe it has a latency of one or more cycles); pipelining is a powerful example of this, and every paralleled instance of a pipelined block takes up its own resources; use carefully! (You probably won't need anything pipelined for this, I would guess, but keep it in mind if you find you're doing more advanced arithmetic -- division and floating point being common examples.)
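For flavor, constraints in the Quartus/Synopsys world typically live in an SDC file, something like the sketch below. The port names (clk, pixdata) are made up for illustration; check your tool's documentation for the exact commands it supports.

```tcl
# Hypothetical SDC constraints for a 75 MHz pixel clock.
# Port names here (clk, pixdata) are illustrative only.
create_clock -name pixclk -period 13.333 [get_ports clk]

# Require outputs valid within 2 ns of the clock edge (illustrative figure)
set_output_delay -clock pixclk -max 2.0 [get_ports pixdata*]
```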
Tim