Author Topic: FPGAs with embedded ARM cores, who makes them?  (Read 12525 times)


Offline daveshah

  • Supporter
  • ****
  • Posts: 356
  • Country: at
    • Projects
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #50 on: August 03, 2018, 07:16:41 am »
Critical path delay, among other things
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8168
  • Country: fi
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #51 on: August 03, 2018, 09:07:09 am »
Optimization means finding a solution that produces the maximum (or minimum) value of something while satisfying given conditions and constraints. What do you think the FPGA routing optimizes?

1) The longest delay from any flip-flop output, through the logic and routing matrix, to another flip-flop input, called the critical path delay. This sets the maximum clock frequency. So you obviously want to minimize it.

OTOH, only the longest delay in the clock domain defines the clock speed. Once you have optimized the longest delay to be as short as possible, the rest do not matter. So you don't want to over-optimize them, because that would eat into the other things to optimize for:

2) Number of logic resources used. You see, by duplicating some logic, you'll be able to shorten the critical path. (There's a quick sketch of this a bit further down.)

3) Placement and routing resource usage. Placement of the LUTs is a highly critical optimization process. If they are placed poorly, you'll soon run out of routing resources.

Actually, if you look at the settings of your synthesis tool, you'll find a shitload of options to adjust this optimization process - for example, a slider so you can balance between speed and area (#1 and #2).

So, many metrics to optimize for. Some work against each other, so you need to balance them.
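
As a rough, hypothetical sketch of the speed/area trade in #2 (not from any real design - just illustrating the idea of duplicating a high-fanout register so the placer can keep each copy close to the logic it drives):

Code: [Select]
// Hypothetical sketch: duplicating a high-fanout control register. More
// flip-flops, but each copy drives a smaller, more local net, so the
// worst-case routing delay shrinks.
module fanout_dup #(
    parameter WIDTH = 64
) (
    input  wire             clk,
    input  wire             enable_in,
    input  wire [WIDTH-1:0] d_a,
    input  wire [WIDTH-1:0] d_b,
    output reg  [WIDTH-1:0] q_a,
    output reg  [WIDTH-1:0] q_b
);
    // Two copies of the same enable signal.
    reg enable_a, enable_b;

    always @(posedge clk) begin
        enable_a <= enable_in;
        enable_b <= enable_in;
        if (enable_a) q_a <= d_a;   // each copy gates only half of the datapath
        if (enable_b) q_b <= d_b;
    end
endmodule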

An FPGA is like a super-complex PCB with a very limited number of layers and tens of thousands of components. It's impossible for a human to route; hence "autorouting" is necessary.

Yes, even though part of the reason for the slow tools is bloat, they still are complex inside. Which is part of the reason no open source synthesis tools exist.

All of the complexity is hidden from the designer. The synthesis tools feel sucky, and yes we all hate them, as we always love/hate the EDA tools we use, but they are actually quite cool pieces of software.

Bloat is not the only reason it takes 10 hours for the "place & route" algorithm to meet timing requirements and fit the design into the available device.

If you had ever used an FPGA, you would know most of this - it's pretty basic stuff. The compilers tend to be quite verbose as well, and the GUI shows you the optimization results, such as the critical path and design resource usage, in a very explicit way. Your comments clearly show you have no idea about FPGAs whatsoever, so why bother commenting like that?

I agree that FPGA place&route is probably hard enough that finding the most optimal solution would possibly take years of synthesis time even for a fairly simple design. So it's all about getting close enough in manageable time and software complexity.
« Last Edit: August 03, 2018, 09:19:46 am by Siwastaja »
 

Offline daveshah

  • Supporter
  • ****
  • Posts: 356
  • Country: at
    • Projects
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #52 on: August 03, 2018, 09:38:14 am »

Yes, even though part of the reason for the slow tools is bloat, they still are complex inside. Which is part of the reason no open source synthesis tools exist.


https://github.com/YosysHQ/yosys
https://github.com/YosysHQ/nextpnr
https://github.com/verilog-to-routing/vtr-verilog-to-routing

While I accept they are not at the level of complexity of the commercial tools, there are certainly open source FPGA flows out there - and they win on startup time compared to the vendor tools if nothing else. You can go from Verilog to a programmed iCE40 with a small design in about 2 seconds.
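
As a minimal, hypothetical example of that flow (a blinky for an HX8K; the exact command flags depend on your yosys/nextpnr versions and your board's .pcf file):

Code: [Select]
// blink.v - a minimal design to push through the open source iCE40 flow.
// One possible invocation (flag names may differ between tool versions):
//   yosys -p "synth_ice40 -json blink.json" blink.v
//   nextpnr-ice40 --hx8k --json blink.json --pcf board.pcf --asc blink.asc
//   icepack blink.asc blink.bin
//   iceprog blink.bin
module blink (
    input  wire clk,   // e.g. the 12 MHz oscillator pin named in your .pcf
    output wire led
);
    reg [23:0] counter = 24'd0;

    always @(posedge clk)
        counter <= counter + 24'd1;

    assign led = counter[23];   // roughly 0.7 Hz blink from a 12 MHz clock
endmodule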
 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3481
  • Country: us
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #53 on: August 03, 2018, 01:23:35 pm »
@Siwastaja gave a good overview of the optimization problem in terms of FPGA synthesis.  My math background let me look at the physical hardware and recognize that it constitutes a Traveling Salesman variant, at which point I know quite a lot without knowing much about the details.  For example, at a certain density of utilization it is very desirable to move to the next larger part in the line.  That's obvious from the mathematics.  But when that point is reached is much harder to determine.

NP hard problems are intractable.  To find the desired answer you may  have to test all of the possible answers.  For large problems this is physically impossible.  However, for smaller problems you might find a near optimal solution after a week or two of computer time.  I would guess that a lot of the "bloat" is a large collection of  such solutions to common customer synthesis blocks which are then used as the starting point for the synthesis of the entire FPGA.  You'll meet the constraints much faster if you start out close to a solution.

The software engineering of Vivado is poor.  That's obvious from the way it is packaged.  Quartus lets you download the pieces you want.  Still large, but less likely to fail.  I had a 17 GB Vivado download fail after some 10-12 hours.  I've got a 3 Mb/s link.  Packaging Vivado for all platforms in a single file is crazy.  That person should be fired for gross incompetence.

Writing code to solve NP hard problems is the most difficult class of programming.  Doing a good job requires a person who spends a good bit of their personal time buying books, reading journal papers and experimenting on their own systems.  I spent at least 4-8 hours each week doing that and several thousand dollars each year.  If the person tasked with doing the work is a 9-5 type, the results will be very poor.  The best programmers are more than 10x better than the average programmer, for the simple reason that they care about the subject and their work.

I'll have to take a look at the FOSS synthesizer.  There was some very important work by Emmanuel Candès and David Donoho in 2004-2006 which proved that optimal solutions to certain L0 (aka NP hard) problems could be found in L1 (simplex method and similar) time.  That's very different from finding a near optimal solution in L1 time, which is current practice.  There are some serious restrictions, but it has also led to some interesting results on regular polytopes in N dimensional space which run much faster than simplex or interior point methods.

Observing Amdahl's law is critical to making good use of an FPGA with embedded hard cores.  You write the code on the ARM, profile it and move the slow parts into the FPGA.  Of course, for a DSO the acquisition has to start in the FPGA.  But once it hits memory there are more options.  I want split screen waterfall and time domain displays among many other things that are not available in typical DSOs.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3140
  • Country: ca
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #54 on: August 03, 2018, 02:30:44 pm »
1) The longest delay from any flip-flop output, through the logic and routing matrix, to another flip-flop input, called the critical path delay. This sets the maximum clock frequency. So you obviously want to minimize it.

This is a constraint, not something to optimize for. If the delay meets the setup/hold requirement of the receiving flip-flop, it works. If the setup/hold requirements are not met, it doesn't work. It doesn't make any sense to try to make the delay shorter if the setup is already met. Worse yet, if you make it too short you risk failing the hold.

2) Number of logic resources used.

This is also a constraint, imposed either by the number of available logic resources or by floorplanning. It is to be met, not optimized.

You see, by duplicating some logic, you'll be able to shorten the critical path.

BTW: On a number of occasions, I came across a situation where I had duplicated registers to meet the timing, but the tool "optimized" my design and replaced duplicated registers with one, and then failed timing. Of course, this has nothing to do with mathematical optimization.
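
(The usual fix is a synthesis attribute on the duplicated registers telling the tool to leave them alone - a hypothetical sketch below; the attribute spelling varies by vendor, DONT_TOUCH being Vivado's, with "keep"/"preserve" used elsewhere.)

Code: [Select]
// Hypothetical sketch: asking synthesis not to merge manually duplicated
// registers. Check your own tool's documentation for the attribute name.
module dup_keep (
    input  wire clk,
    input  wire d,
    output reg  q_left,
    output reg  q_right
);
    (* DONT_TOUCH = "true" *) reg d_copy_left;
    (* DONT_TOUCH = "true" *) reg d_copy_right;

    always @(posedge clk) begin
        d_copy_left  <= d;
        d_copy_right <= d;
        q_left       <= d_copy_left;    // feeds logic placed on one side
        q_right      <= d_copy_right;   // feeds logic placed on the other
    end
endmodule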

3) Placement and routing resource usage. Placement of the LUTs is a highly critical optimization process. If they are placed poorly, you'll soon run out of routing resources.

Again, there's no reason to optimize anything. The placement you create either allows satisfactory routing which meets your timing constraints, or it doesn't.

An FPGA is like a super-complex PCB with a very limited number of layers and tens of thousands of components. It's impossible for a human to route; hence "autorouting" is necessary.

I'd say an FPGA has much better routing capabilities than a PCB. The tools let you route manually if you wish, but it is just as tedious as routing PCBs. You wouldn't want to do this except in some limited cases.

Yes, even though part of the reason for the slow tools is bloat, they still are complex inside. Which is part of the reason no open source synthesis tools exist.

All of the complexity is hidden from the designer.

We don't actually know what is hidden from the designer. I suspect reverse-engineering the FPGA bitstreams and creating your own tools for place and route would make things much faster. But reverse-engineering is a slow and boring job and no one wants to do it. BTW: There's an open source effort under way:

https://github.com/SymbiFlow/prjxray

but it doesn't seem to move very quickly.

I agree that FPGA place&route is probably hard enough that finding the most optimal solution would possibly take years of synthesis time even for a fairly simple design. So it's all about getting close enough in manageable time and software complexity.

Again, there's no optimal solution. Any solution that meets all of your constraints is just as good as any other. And there are billions of them (many more, actually) for any given design. But you only need one.

Wait a minute. I'll take that back. There is one thing you can optimize for, and that is the compile time, but it doesn't seem to be on the vendors' radar.

 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2730
  • Country: ca
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #55 on: August 03, 2018, 03:29:27 pm »
Again, there's no optimal solution. Any solution that meets all of your constraints is just as good as any other. And there are billions of them (many more, actually) for any given design. But you only need one.
This is not correct. "Optimize area" vs "Optimize performance" is an obvious example; there are reasons to reduce the amount of resources used even if they are available in the chip of your choice - more resources used => more power consumed => more heat generated. This can drive much more than many suspect - such as using a physically bigger package because it lets you get away with less heat management, which ironically can make the entire solution smaller and cheaper. Less consumed power can mean longer battery life (or the ability to get away with a smaller capacity battery while still meeting your requirements on battery life), and/or cheaper DC-DC converters with smaller inductors, less powerful PSUs and so on. FPGAs do not exist in isolation, and only the parameters of the entire system are important. Power is actually a big topic - for example, an Artix-100T can consume up to about 7 amps of current on its Vccint rail alone! Not many "general purpose" DC-DC converters can deal with that kind of load and maintain good efficiency (less efficiency => more heat => heat management becomes more problematic), while specialized ICs designed to deal with that kind of load generally cost quite a bit more.
Another factor is that achieving timing closure becomes progressively harder as resource utilization rolls over some magical number (about 70% in my experience). There were times when even 166 MHz was almost too much to ask for, because components were too far apart and net delays were too high - and that in turn was caused by some bad decisions made during PCB development, while a board re-spin was not an option.
There is a reason there are a million settings and parameters for both synthesis and P&R tools: developing a system with an FPGA is very often about compromises.

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3140
  • Country: ca
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #56 on: August 03, 2018, 04:12:26 pm »
Again, there's no optimal solution. Any solution that meets all of your constraints is just as good as any other. And there are billions of them (many more, actually) for any given design. But you only need one.
This is not correct. "Optimize area" vs "Optimize performance" is an obvious example; there are reasons to reduce the amount of resources used even if they are available in the chip of your choice - more resources used => more power consumed => more heat generated.

Power depends on the number of switching events per unit of time. It is not clear whether a design using more area will consume more power compared to a smaller design. An aggressive design on an A15T can draw more power than a lazy design on an A100T. Static analysis cannot estimate power with any reasonable accuracy, so you cannot optimize for power unless you run special simulations. Vivado estimates power consumption on every implementation run. Compare this to real power measurements. These two have nothing in common.

The "Optimize area" and "Optimize performance" settings do not imply that the tools run optimization process trying to minimize area, or maximize performance (whatever this means). It's much simpler. Often, the same thing may be done in a number of ways. Say, if you have a simple 32-bit counter, it may be possible to implemented it using carry logic, or using DSP, or, if you want it really fast, you could do it in general fabric consuming more LUTs. If a similar decision is to be made, the tools may use your settings to select one option or another.

 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3481
  • Country: us
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #57 on: August 03, 2018, 05:35:34 pm »
In mathematics, satisfying constraints is called an optimization problem.  In the case of an NP-hard problem, finding the best solution is generally not practical, so you settle for something close to optimal.  The optimal solution might only be a picosecond less variation in the latency of the individual bits of an adder output.  The tighter your constraints, the more difficult the optimization problem is.

You have a set of constraint equations.  What is commonly minimized is the summed absolute deviation of the solution from the constraints.  This is generally referred to as an L1 solution.  But the mathematics are quite general.  Normal notation is : min <some expression> s.t. <some set of constraint equations>. In the case of FPGA synthesis, one would most likely want to apply weights to the error terms so that the bounds on the high speed portions are tighter than on the low speed portions.  So different weights would be applied to different errors in the minimization expression.
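
For illustration only - this is a generic weighted formulation, not the objective any vendor tool actually uses - it might look something like:

\[
\min_{x \in \mathcal{F}} \;\; \sum_{p \,\in\, \mathrm{paths}} w_p \, \max\!\bigl(0,\; d_p(x) - (T_{\mathrm{clk}} - t_{\mathrm{setup}})\bigr)
\]

where \(\mathcal{F}\) is the set of legal placements and routings, \(d_p(x)\) is the delay of path \(p\), and the weights \(w_p\) are larger on the paths you care most about.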

Mathematicians, scientists and engineers have collectively spent millions of hours studying the problem looking for practical solutions.  And continue to do so as NP hard problems are very important in many applications.  The literature on the subject is so vast that no person could  ever read all of it. 

If anyone wants to learn more about the topic, I suggest:

Linear Programming: Foundations and Extensions
Robert J. Vanderbei
3rd ed., Springer, 2008

Vanderbei is a professor at Princeton and writes beautifully.  And the GNU Linear Programming Kit, GLPK, is excellent even if it is not as fast as the $100K/seat commercial packages.  In some 8 years of using it and following the mailing list, I cannot think of a single failure that was not due to user error.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26892
  • Country: nl
    • NCT Developments
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #58 on: August 03, 2018, 07:05:45 pm »
In mathematics, satisfying constraints is called an optimization problem.  In the case of an NP-hard problem, finding the best solution is generally not practical, so you settle for something close to optimal.  The optimal solution might only be a picosecond less variation in the latency of the individual bits of an adder output.  The tighter your constraints, the more difficult the optimization problem is.
In an FPGA this doesn't matter at all. In an FPGA, logic is typically synchronous. Sure, there are clock domain crossings (not forgetting that the inputs and outputs are a clock domain crossing!), but these can all be caught by timing constraints. The place & route just needs to make sure the delay doesn't exceed the time available before the clock arrives in the worst case scenario. Actually the software isn't that sophisticated. It needs a lot of steering from the user to get timing closure in many cases.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3140
  • Country: ca
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #59 on: August 03, 2018, 07:06:57 pm »
In mathematics, satisfying constraints is called an optimization problem.

We won't get far if we start twisting basic mathematical terms.

Functions may have minimums and maximums. Since the same methods can deal with both minimums and maximums, they are often called optimums. An optimum can be either a maximum or a minimum.

The task of finding minimums is called minimization. The task of finding maximums is called maximization. The task of finding optimums is called optimization.

The optimization may be complicated by constraints - which restrict the search for the optimum to the sub-space of the inputs defined by the constraints.

Often the optimization cannot be performed computationally within reasonable time (what you call NP-hard). In this situation, finding a point which is close enough to the optimum gives you an approximate solution.

Finding an arbitrary point in the constrained sub-space is not optimization. Unlike optimization, this task does not have any approximate solution. The point either lies within the constraints, or it doesn't. It cannot be more within the constraints, or less within the constraints. It is either in or not. Makes sense so far?

This is what happens in FPGA. There are constraints, predominantly timing constraints. If these constraints are met the design will work across specified conditions. If not, the design may fail. Very simple.

In the case of FPGA synthesis, one would most likely want to apply weights to the error terms so that the bounds on the high speed portions are tighter than on the low speed portions.  So different weights would be applied to different errors in the minimization expression.

An FPGA uses the RTL model (Register Transfer Level), which includes sequential elements (often called registers or flip-flops) and combinatorial logic between them (various gates, LUTs, muxes, interconnect etc.).

When a clock hits a flip-flop, the input of the flip-flop gets registered, transferred to the output, and starts to propagate through the combinatorial logic to the next sequential element, where it is supposed to be registered on the next (for simplicity) clock edge.

The delay through combinatorial logic is generally unpredictable - it varies from FET to FET and depends on the process, voltage and temperature. But the vendor performs characterization work - they measure the delays for thousands of FPGAs across all the various conditions and come up with two numbers: minimum delay and maximum delay. The vendor does this for each and every element within the FPGA.

Once these numbers are known, you can sum them up across the combinatorial path and use the numbers to determine whether the design is acceptable or not. This is done with two comparisons.

1. The combined minimum delay must be big enough to make sure that by the time the signal propagates to the next sequential element, this next sequential element has already finished working with its previous input. This is defined by the "Hold" characteristic - the time starting from the clock edge and ending at the point in time where the register doesn't need the input any more.

2. The combined maximum delay must be small enough to make sure that the signal gets to the next sequential element before that element starts registering it. This is defined by the "Setup" characteristic - the time starting from the point when the register must have a valid input to the clock edge.
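
Written as inequalities (the standard textbook form, ignoring clock skew and jitter):

\[
t_{\mathrm{clk\to q}}^{\max} + t_{\mathrm{logic}}^{\max} + t_{\mathrm{setup}} \;\le\; T_{\mathrm{clk}} \qquad \text{(setup check)}
\]
\[
t_{\mathrm{clk\to q}}^{\min} + t_{\mathrm{logic}}^{\min} \;\ge\; t_{\mathrm{hold}} \qquad \text{(hold check)}
\]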

Thus the sequence of events should be such:

clock edge - hold expires - new data arrives - setup point - next clock edge

Note that the uncertainty of the delays doesn't propagate to the next clock cycle and doesn't accumulate. Each clock cycle starts anew, error free (not counting clock jitter). This feature lets the RTL system work for very long periods of time without errors.

However, for the RTL system to work, the events must happen in this exact order (clock - hold - data - setup - clock) regardless of the conditions - voltage, temperature etc. If data arrives before hold expires or after the setup point for even one single flip-flop in your FPGA, the whole design may be doomed. Worse yet, a flip-flop clocked while its input is unstable may become metastable.

Therefore, the design must meet the timing constraints. This cannot be approached gradually or done approximately - the constraints must be met. And vice versa, once the constraints are met, there's no reason to tweak things any further: the design is guaranteed to work anyway. Makes sense?

 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3481
  • Country: us
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #60 on: August 03, 2018, 10:34:41 pm »
On the first part:

This looks to be a tolerably decent summary.

https://en.wikipedia.org/wiki/Convex_optimization

The 2nd part is obvious by inspection from my summary description of an FPGA as a collection of hard blocks and an N layer interconnect controlled by FET switches which are set by a bit pattern in memory.  Solving for the connections and choice of hard blocks which meet the timing constraints is a convex optimization problem.  It's actually equivalent to a regular polytope in N dimensional space which as N gets large is overwhelmingly likely to be convex.  Which is nice as it makes things easier.

I had hoped my air transport example would have made the concepts clear.  But you may believe whatever you wish.
 

Offline Bassman59

  • Super Contributor
  • ***
  • Posts: 2501
  • Country: us
  • Yes, I do this for a living
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #61 on: August 03, 2018, 11:51:42 pm »
So yeah, it would be great if it was reasonable to "write once, synthesize everywhere," but in practice, that isn't possible.
I agree with everything above, but in addition to that there is an elephant in the room - DSO application will most certainly require using DSP tiles, and they are some of the most non-portable even across different FPGA families of a single vendor, much less so between vendors.
:-// What do you mean?
I wrote a complete image scaler and video processor in SystemVerilog in Altera's Quartus 3 years ago.  All math was written out as simple adds, multiplies, divides in Verilog.  I did not use any DSP tiles, yet, once compiled, Quartus placed all the arithmetic into the DSP blocks all on its own.

The DSP block in the Xilinx Spartan 3E/3A/3AN and Spartan 6 is quite clever, as it can be configured to do MAC, various clear/set, with programmable pipeline stages and etc etc. It has an opcode input that can be used to configure it on-the-fly on a per-clock basis. We used it to build a dual-slope integrator.

The problem was that ISE wasn't clever enough, and it would use the DSP block for the multiplier only. In order to have it do what we wanted, we had to instantiate the block and write a state machine that controlled the opcode and kept track of the data as it moved through the pipeline. Annoying and non-portable? Yes. Did it work? Yes. Whatever. The product shipped and I didn't particularly care that it was "inelegant."
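
For comparison, a purely behavioral multiply-accumulate that synthesis usually does map into a DSP tile on its own might look like this (hypothetical sketch, widths chosen to fit an 18x18-style multiplier):

Code: [Select]
// Registered multiply followed by a registered accumulate - the classic
// structure most tools will pack into a single DSP tile. The on-the-fly
// opcode tricks described above still need explicit instantiation.
module mac18 (
    input  wire               clk,
    input  wire               clr,   // synchronous clear of the accumulator
    input  wire signed [17:0] a,
    input  wire signed [17:0] b,
    output reg  signed [47:0] acc
);
    reg signed [35:0] prod;

    always @(posedge clk) begin
        prod <= a * b;                  // pipeline stage 1: multiply
        if (clr) acc <= 48'sd0;         // stage 2: accumulate or clear
        else     acc <= acc + prod;
    end
endmodule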
 

Offline carl0s

  • Supporter
  • ****
  • Posts: 276
  • Country: gb
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #62 on: August 04, 2018, 12:31:33 am »
Bet this site's had quite a few visits lately: https://en.wikipedia.org/wiki/NP-hardness. Still makes no frickin' sense to me.


If you can do a 'scope equivalent of OpenTx (https://www.google.com/search?q=horus+radio ) then that would be super cool.

What we really want, is for Chinese 'scope manufacturers to be pushing (and competing on) their hardware manufacturing limits, to be utilised by a standard operating system.

It's Android, for oscilloscopes.
« Last Edit: August 04, 2018, 12:33:59 am by carl0s »
--
Carl
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3140
  • Country: ca
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #63 on: August 04, 2018, 01:04:56 am »
But you may believe whatever you wish.

Thank you for the permission.

However, mathematics (like any science) is not based on beliefs, but rather on proofs.

If you believe that FPGA design is an optimization of untold convex functions, I don't think I can say anything useful that would help. Please accept my apologies for disturbing your thread.

 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3481
  • Country: us
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #64 on: August 04, 2018, 02:49:09 am »
The traveling salesman problem (TSP) asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?" It is an NP-hard problem in combinatorial optimization, important in operations research and theoretical computer science.

 In 2006, Cook and others computed an optimal tour through an 85,900-city instance given by a microchip layout problem, currently the largest solved TSPLIB instance.

From:  https://en.wikipedia.org/wiki/Travelling_salesman_problem

See also:

sarielhp.org/teach/2004/b/27_lp_2d.pdf

If you believe that FPGA design is an optimization of untold convex functions, I don't think I can say anything useful that would help. Please accept my apologies for disturbing your thread.

It is FPGA synthesis that is a convex optimization, not FPGA design.

In a purely mathematical description, the constraints form planes in an N dimensional space.  The possible solutions lie within the convex polytope called the feasible region, which contains all the possible solutions that satisfy the constraints.  If one wishes to clock the system at the highest possible rate, then one seeks the vertex of the polytope with the highest possible clock rate.  One may choose a wide variety of traits to optimize, such as minimum latency.  This amounts to reorienting the polytope in N dimensional space and seeking the minimum.  Every vertex of the polytope is optimal for some objective.

I'm a scientist, not a mathematician.  I only learned this a few years ago when I solved some problems and then realized I'd been taught they could not be solved.  So I set out to find out why what I'd been taught was wrong. Or more precisely, when what I'd been taught was wrong. The general case was true, but there were exceptions which I'd not been shown.
« Last Edit: August 04, 2018, 03:35:30 am by rhb »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #65 on: August 04, 2018, 04:25:15 am »
Waffling on about the mathematical purity of the FPGA P+R process makes as much sense as lamenting that you can't see all the animals at the zoo because deciding where to walk is an NP-hard problem.

Treating it as an optimization problem was the "gen 1" approach to FPGA tools. It had to be abandoned and replaced with a set of heuristics that give reasonable results in a reasonable amount of time.

Take this vague observation: the increase in P+R runtime vs FPGA and design size is not consistent with the standard 'NP hard' scaling.  Compare the hardness of travelling salesman for 1,000 cities vs 10,000 cities against the place and route of a design of 1,000 LUTs vs one of 10,000 LUTs.

Why is an 85% full design on a small FPGA harder to route successfully than the same design on an FPGA that is twice the size? By that scaling the larger part should be harder, as it has far more solutions that need to be explored - yet in practice it is easier.

I am sure somebody will find this interesting to watch:

"Visualization of nextpnr placing & routing two PicoRV32s on an iCE40 HX8K (10x speed)"

https://twitter.com/i/status/1024623710165237760
« Last Edit: August 04, 2018, 04:29:02 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline Bassman59

  • Super Contributor
  • ***
  • Posts: 2501
  • Country: us
  • Yes, I do this for a living
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #66 on: August 04, 2018, 06:40:57 am »

It is FPGA synthesis that is a convex optimization, not FPGA design.

You are confusing synthesis and fitting (place and route). They are two pretty much entirely separate processes.



The goal of synthesis is to translate your hardware description into a netlist that implements the design in the primitives available in the target architecture and shows the connections between those primitives. One part of synthesis is pattern matching: follow the guidelines in the manual. If you want a flip-flop, write code this way. If you want a large RAM block, write code that way.

The other part of synthesis is logic optimization (and not necessarily minimization), and this optimization is highly dependent on what works best for a given target architecture. Consider a simple shift register. Xilinx CLBs can be configured as shift registers, so say a four-bit shift register fits into one CLB. MicroSemi logic "tiles" are too fine-grained for that, so a shift register synthesizes into four flip-flops. The Xilinx CLB can also be configured as a small RAM (called LUT RAM or distributed RAM), and that feature doesn't exist in MicroSemi's parts, either, so you get an array of flip-flops to implement that memory.
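
A hypothetical example of that difference - the same RTL, and it's entirely the synthesis tool's call whether it becomes an SRL, a chain of flip-flops, or something else on a given architecture:

Code: [Select]
// A behavioral DEPTH-stage delay line. On parts with LUT shift registers
// this tends to infer SRLs; on finer-grained architectures it becomes a
// chain of flip-flops. Same source, different primitives.
module shift_delay #(
    parameter WIDTH = 8,
    parameter DEPTH = 4
) (
    input  wire             clk,
    input  wire [WIDTH-1:0] d,
    output wire [WIDTH-1:0] q
);
    reg [WIDTH-1:0] pipe [0:DEPTH-1];
    integer i;

    always @(posedge clk) begin
        pipe[0] <= d;
        for (i = 1; i < DEPTH; i = i + 1)
            pipe[i] <= pipe[i-1];
    end

    assign q = pipe[DEPTH-1];
endmodule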

The obvious point here is the ultimate implementation of the logic in specific target-device primitives actually doesn't matter -- as long as the result is functionally correct, then, really, who cares what the synthesis tool creates?

But there are FPGA features that the synthesis tool can't, or won't, infer. In some cases, inferring a primitive like a PLL or a gigabit serializer from a purely behavioral description is just too complicated. It's better to instantiate a black box, which the synthesis tool just puts into the netlist and passes along to the place and route. (There are some things that synthesis should infer but doesn't, like DDR flops in the I/O.)

Since the synthesis tool has no real way of knowing what the fitter will do, it cannot do a proper timing analysis. But synthesis understands loading and fan-out and will replicate and buffer nets for performance reasons.



The role of the fitter is to take the netlist of primitives and their connections and fit them into the fabric. There is more to the fitting process than just the netlist, though - more than just the traveling-salesman problem of "optimal" placement and routing.

That's where the constraints come in. They are in two broad categories: timing and placement constraints.

Timing constraints are usually straightforward. FPGAs are synchronous digital designs, so you need a period constraint for each clock to ensure your flip-flop to flip-flop logic (both primitives and routing) has no failures. Managing chip input and output timing is often straightforward, too; there are not that many ways of connecting things.

You set the timing constraints for your actual design requirements. If you have a 50 MHz clock, you don't set a 75 MHz period constraint, as it makes the tools work harder than they have to, and it might not close timing. Then the tools run and you see whether you win or not. If you don't, then you need to reconsider things. Set the tools for extra effort. Look to where you can minimize logic. Look to see if the synthesis built something wacky, so you have to re-code.

Placement constraints are a lot more complicated, and this is because each FPGA family has different rules. You have to choose pins for each signal that goes to or comes from off-chip. Sounds easy, right? Well, the layout person has ideas about routing to make that job easier. But you have to mind specialist pins for your clock inputs. You have to abide by I/O standards and I/O supply voltages, and whether a 3.3 V LVCMOS signal can go on a bank with LVDS signals. Oh, and LVDS requires choosing the pair of pins. And you have to specify output drive and slew rate, and on inputs you specify termination, input delay, pull-up/pull-down/keeper. And on and on.

Only once your placement constraints are set should you let the traveling salesman run.
 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3481
  • Country: us
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #67 on: August 04, 2018, 01:52:48 pm »
Thank you!  Finally, some sensible comments. I'd meant for this to end long ago, but tried to provide NorthGuy with an explanation of the terminology mathematicians use for these problems.

Two brief comments:

An almost full chip is harder to synthesize because  many of the heuristics that succeed when it is not so full don't succeed.  So you are forced to try more possibilities.  No one ever attempts a full optimization because it simply cannot be done for anything other than uselessly small problems.  It's really just a search for any point inside the feasible region.  But the mathematical community have a lexicon and language for this. So I follow their rules.

I only referenced the TSP as a simple explanation of what NP hard meant and because the synthesis and placement problem is by inspection  at least as hard as a TSP.  No one previously bothered with the distinction between synthesis and placement which as noted is significant, as they involve very different problems.

I asked what I thought was a simple question.  Are there any other FPGAs with hard ARM cores similar to the Zynq and Cyclone V?  Boy, was I ever wrong!  It turned into quite an odyssey,

Have Fun!
Reg

 

Offline Bassman59

  • Super Contributor
  • ***
  • Posts: 2501
  • Country: us
  • Yes, I do this for a living
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #68 on: August 04, 2018, 09:48:55 pm »
Thank you!  Finally, some sensible comments. I'd meant for this to end long ago, but tried to provide NorthGuy with an explanation of the terminology mathematicians use for these problems.

Two brief comments:

An almost full chip is harder to synthesize because  many of the heuristics that succeed when it is not so full don't succeed.

Again -- that's not a synthesis problem, that's a fitter problem.  But, yes, you are correct, because as the device fills up, routing resource availability might become strained. That was a problem on ancient devices (ugh, XC3000?) but the newer stuff has a lot of routing so the problem really becomes, "can we place the related logic close enough to each other so that routing between them meets our timing constraints?"

Quote
So you are forced to try more possibilities.  No one ever attempts a full optimization because it simply cannot be done for anything other than uselessly small problems.  It's really just a search for any point inside the feasible region.  But the mathematical community have a lexicon and language for this. So I follow their rules.

But optimization is dependent on your goal. I mean, this is engineering, right? We don't need to strive for perfection, because we can't define that anyway. But we can say, "the design has 5,000 flip-flops and has to run at 50 MHz." Meeting the former constraint requires choosing a device with at least 5,000 flip-flops. If there are more, well, great. Meeting the latter constraint means the design will work. We don't care if it can run faster.


Quote
I asked what I thought was a simple question.  Are there any other FPGAs with hard ARM cores similar to the Zynq and Cyclone V?  Boy, was I ever wrong!  It turned into quite an odyssey,

Well, this is the Internet, where veering off-topic is a given.
 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3481
  • Country: us
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #69 on: August 04, 2018, 11:00:04 pm »
Sorry, yes.  It's just that no one made the distinction between synthesis and fitting before.  So I'm still tending to describe it in that fashion.  But it's obvious that they are quite different problems.  The synthesis step is entirely governed by the available hard blocks.

In this case, "optimization" is complete once the constraints are met unless actual performance fails to meet expectations from the simulations.  But the mathematicians still call it "optimization" even if it's really just finding the feasible region.

As a consequence of stumbling across the work of Candes and Donoho, I spent some 3 years reading over 3000 pages of complex mathematics on optimization.   It's really cool stuff.  Google "single pixel camera" if you'd like to really blow your mind. TI is using it in a near IR spectrometer product although they call it "Hadamard sensing".  I still need to get serious on the general subject of convex optimization, but that sort of stuff is a lot of work to read.  I also have no one to talk to about it, so it's not as much fun as if I did. I'd rather play with hardware right now.  And in particular compare the Zynq world to the Cyclone V world.

Again, thank you for writing a clean crisp description of the process.  You have to know the topic well to do that, and even then it's work to do well.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7726
  • Country: ca
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #70 on: August 05, 2018, 03:15:34 am »
So yeah, it would be great if it was reasonable to "write once, synthesize everywhere," but in practice, that isn't possible.
I agree with everything above, but in addition to that there is an elephant in the room - DSO application will most certainly require using DSP tiles, and they are some of the most non-portable even across different FPGA families of a single vendor, much less so between vendors.
:-// What do you mean?
I wrote a complete image scaler and video processor in SystemVerilog in Altera's Quartus 3 years ago.  All math was written out as simple adds, multiplies, divides in Verilog.  I did not use any DSP tiles, yet, once compiled, Quartus placed all the arithmetic into the DSP blocks all on its own.

The DSP block in the Xilinx Spartan 3E/3A/3AN and Spartan 6 is quite clever, as it can be configured to do MAC, various clear/set, with programmable pipeline stages and etc etc. It has an opcode input that can be used to configure it on-the-fly on a per-clock basis. We used it to build a dual-slope integrator.

The problem was that ISE wasn't clever enough, and it would use the DSP block for the multiplier only. In order to have it do what we wanted, we had to instantiate the block and write a state machine that controlled the opcode and kept track of the data as it moved through the pipeline. Annoying and non-portable? Yes. Did it work? Yes. Whatever. The product shipped and I didn't particularly care that it was "inelegant."

Yes, keeping track of those damn pipeline stages and where and when data is valid - even in Quartus it is a handful.  However, since I created my own full multiport read/write intelligent-cache DDR2 DRAM controller, and have done a lot of on-the-fly video processing, the trick I use whenever reading, writing or doing math anywhere else is to include 2 things in every Verilog function I have created to date:

1.  An enable input and enable output, where the enable runs through a DFF pipe the same length as the function's own pipeline, allowing the data flow to be switched on and off at any point, with the pipe length set by a parameter.
2.  A wide bus of DFFs in and out for destination address bits - basically the same as the enable in/out, but n address/data bits wide.  So if I read from my RAM controller, or process color in my color-enhancement processor, I also provide a destination address input along with the enable input.  Both follow the delay pipe through the Verilog function, and at the output the in-sync enable out, destination address out and the function's processed data all arrive in parallel.  Whatever number of bits I set in the parameters - or if I just don't wire the port - is no problem, since the Verilog compiler only includes the wired assets when building the firmware, not wasting time and space on anything unwired.

* the address may also be any unprocessed data which needs to stay in step with the Verilog function's processed data output.

Following this design convention for every function I make has enormously increased my ability to change pipeline lengths on the fly for any function when needed for optimization, with zero debugging effort or additional external fixes.  If I were to instruct someone in the art of designs that need variable-sized pipelines, this would be the first thing I would teach (a stripped-down sketch follows).
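
A stripped-down, hypothetical sketch of the convention (the real functions also carry the processed data through a matching pipe):

Code: [Select]
// The enable and destination address ride through a delay pipe of the same
// depth as the processing pipeline, so data, enable and address all arrive
// downstream in step. If addr_out is left unconnected, the tools prune the
// whole address pipe, so unused tags cost nothing.
module pipe_tags #(
    parameter PIPE_DEPTH = 5,    // set to match the function's own latency
    parameter ADDR_BITS  = 24
) (
    input  wire                 clk,
    input  wire                 ena_in,
    input  wire [ADDR_BITS-1:0] addr_in,
    output wire                 ena_out,
    output wire [ADDR_BITS-1:0] addr_out
);
    reg                 ena_pipe  [0:PIPE_DEPTH-1];
    reg [ADDR_BITS-1:0] addr_pipe [0:PIPE_DEPTH-1];
    integer i;

    always @(posedge clk) begin
        ena_pipe[0]  <= ena_in;
        addr_pipe[0] <= addr_in;
        for (i = 1; i < PIPE_DEPTH; i = i + 1) begin
            ena_pipe[i]  <= ena_pipe[i-1];
            addr_pipe[i] <= addr_pipe[i-1];
        end
    end

    assign ena_out  = ena_pipe[PIPE_DEPTH-1];
    assign addr_out = addr_pipe[PIPE_DEPTH-1];
endmodule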
« Last Edit: August 05, 2018, 03:20:48 am by BrianHG »
 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3481
  • Country: us
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #71 on: August 05, 2018, 04:53:29 am »
I'd very much like to learn more.  You can buy a 2.6/5.2 GSa/s 14-bit ADC eval board from AD for $1900, or a 2/4 GSa/s one for $1200, so one of my goals is to make what I write accommodate variable widths so that a person could assemble a bespoke instrument as a one-off using connectorized modules and eval boards.  Not sure it's possible, but it would be really useful if I can figure out how to build a framework that could do that.

I'm sure it will not be easy, but this is  a 12-18 month project.  There's no hope of work in the oil patch at current prices, so I might as well give something else a serious effort.  Whatever the outcome, I'm sure I'll learn a lot.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7726
  • Country: ca
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #72 on: August 05, 2018, 05:03:59 am »
I'd very much like to learn more.  You can buy a 2.6/5.2 GSa/s 14-bit ADC eval board from AD for $1900, or a 2/4 GSa/s one for $1200, so one of my goals is to make what I write accommodate variable widths so that a person could assemble a bespoke instrument as a one-off using connectorized modules and eval boards.  Not sure it's possible, but it would be really useful if I can figure out how to build a framework that could do that.

I'm sure it will not be easy, but this is  a 12-18 month project.  There's no hope of work in the oil patch at current prices, so I might as well give something else a serious effort.  Whatever the outcome, I'm sure I'll learn a lot.
Which ADC boards, so we can see how they interface on the digital side?  At these speeds you will usually need to use LVDS transceivers.  This complicates board-to-board linking, and on the FPGA side this will get to be as expensive as the ADC eval boards, if not double or triple.

As for the feasibility of doing it: yes, with money, anything at the speeds you listed can be done, but the connectors and board layout for the FPGA need careful attention, with a dedicated bank and a dedicated PLL clock domain to acquire at those speeds.

Hint, at these speeds, with real-time mathematical signal processing at full sample speed, you will be using a lot of pipe-lined functions to keep that maximum clock rate up there as well as multiple parallel channels.
« Last Edit: August 05, 2018, 05:06:29 am by BrianHG »
 

Offline rhb (Topic starter)

  • Super Contributor
  • ***
  • Posts: 3481
  • Country: us
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #73 on: August 05, 2018, 01:15:40 pm »
AD9689.  I had assumed that the FPGA side would be about double the ADC side.  I have no intention of building such a beast.  I just mentioned it because the capabilities it offers are quite intriguing.  I discovered it via an AD ad in my email.  That, and a good article in QEX not long ago about a 100 MHz BW 16 bit SDR built with a $500 ADC board and a $500 ZedBoard, prompted me to factor variable width ADCs into my DSO FOSS FW project.

Something like the AD9689 is *seriously* difficult to deal with, and the JESD204B interface IP might make a one-off uneconomic/impractical.  The AD9689 eval board is transformer coupled, so it would not make a general purpose DSO, but for a bespoke lab instrument it's cheaper and more capable than buying a $20K DSO.  A one-off like that is really PhD or post-doc project material though.

The stuff is quite pricey, but I've discovered that there has been a lot of standardization of interconnects between FPGA boards and ADC and DAC boards.  I presume the market developed because of the amount of time it takes to design a board to operate at these speeds.

This thread got started because I know from experience that good design methodology makes a huge difference both in productivity and bug rates.  I got lots of flak for suggesting developing on both the Zynq and Cyclone V at the same time.  But the best way to understand the strengths and weaknesses of each is to do a series of small tasks on both at the same time.  I'd hoped there was a third line with embedded hard ARM cores from a different vendor, but it appears not.

I'd really appreciate a general outline of your methodology for dealing with pipelines of variable bit width.  I'd like to generalize what I do, even though it's not needed and is more work.

For the FOSS DSO FW project I've assumed a pipeline from the ADC consisting of an AFE correction filter and a user selected BW filter with a choice of step responses (e.g. best rise time, least overshoot), followed by fanout to a pipelined set of math functions, a main data stream and the trigger functions.  Observations of period measurements using a couple of $20K 1 GHz DSOs, a Keysight 33622A (< 1 ps jitter), a Leo Bodnar GPSDO and one of his 40 ps rise time pulsers suggest to me that interpolating the trigger point is a significant problem.  This was confirmed by a comment by @nctnico.

A back of the envelope estimate suggests that a 10 point sinc interpolator lookup table would provide 1 ps resolution of the trigger set point at the cost of 10 multiply-add operations.  But I've not done any numerical experiments yet.  I'm still trying to get a basic discussion of minimum phase anti-alias filtering completed for the DSP 101 FOSS DSO FW thread, but keep being forced to do other things like spray weed killer.
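
To make the back-of-the-envelope concrete, here is a purely hypothetical sketch of the structure I have in mind - no real coefficients, arbitrary widths, and it would need proper pipelining in practice:

Code: [Select]
// The fractional trigger phase selects one of several precomputed sinc
// coefficient sets from a ROM; the dot product with the 10 samples nearest
// the crossing gives the interpolated value - 10 multiply-adds per trigger.
module trig_interp #(
    parameter TAPS       = 10,
    parameter SAMPLE_W   = 14,   // ADC sample width
    parameter COEFF_W    = 18,
    parameter PHASE_BITS = 7     // 128 fractional trigger positions
) (
    input  wire                                clk,
    input  wire signed [TAPS*SAMPLE_W-1:0]     samples,  // 10 samples, packed
    input  wire        [PHASE_BITS-1:0]        phase,    // fractional position
    output reg  signed [SAMPLE_W+COEFF_W+3:0]  value     // interpolated point
);
    // Sinc coefficient ROM: TAPS coefficients per fractional phase, generated
    // offline ("sinc_coeffs.hex" is a placeholder name).
    reg signed [COEFF_W-1:0] coeff [0:(1 << PHASE_BITS)*TAPS-1];
    initial $readmemh("sinc_coeffs.hex", coeff);

    integer t;
    reg signed [SAMPLE_W+COEFF_W+3:0] sum;

    always @(posedge clk) begin
        sum = 0;
        for (t = 0; t < TAPS; t = t + 1)
            sum = sum + $signed(samples[t*SAMPLE_W +: SAMPLE_W])
                        * coeff[phase*TAPS + t];
        value <= sum;   // a real version would pipeline these multiply-adds
    end
endmodule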

If you can suggest some graduate level texts on FPGAs I'd be grateful.  I've bought some books, but they're pretty basic undergraduate level texts.  I posted asking for recommendations, but no one suggested anything at the sort of level I was looking for.  I'd *really* like to find a graduate  level monograph on the IC design aspect of FPGAs so I have a better understanding of the actual hardware at the register and interconnect level.  My search attempts produced longer lists of introductory material than I wanted to wade through.



 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7726
  • Country: ca
Re: FPGAs with embedded ARM cores, who makes them?
« Reply #74 on: August 05, 2018, 07:10:25 pm »
Reading everything so far, I personally would choose a Cyclone V or Xilinx equivalent low-end dev board with an embedded ARM and hopefully 2 banks of DRAM - 1 dedicated to ARM software and another for high speed sampling - plus an HDMI/VGA output.  Start out with a cheap home-made 500 MSPS converter, as these dev boards will struggle to interface any faster; they might even limit you to a bit slower.

I know this will have an embarrassingly low cost, below $800 total, even if you need to make your own custom ADC daughter board, but the code you test and develop will be identical to your super fast, high end final product.  All the tricks you will need to speed up the Cyclone V to deal with a 500 MSPS ADC, with its 400-600 MHz DDR3 RAM and real time processing, will be equivalent to what you need when you upgrade to an Altera Arria FPGA to handle the speed of 1.3 GHz DDR3/4 RAM and a 2-3 GSPS ADC.  Your development and learning curve will be the same, and with a compiled project you will be better placed to pre-compile and decide whether you need an Arria or Stratix FPGA from Altera (or Xilinx equivalent) to make the jump into multi-GHz sampling.

I am not sure of the level of capability in Xilinx's IDE, but I would personally use SystemVerilog instead of Verilog, since at the time when I started my video scaler, SystemVerilog automatically tracked and handled the mixed unsigned and signed registers for the component color processing math I required when feeding Altera's DSP blocks, with less headache than regular old Verilog.  Things may have changed since then - this was 5 years ago.
 

