Quote: FPGAs are more power efficient than a CPU for most tasks,
Quote: What you miss is that most scopes, to get the information to the screen, stop the acquisition and have a substantial blind time: although the screen may be flickering away quickly at a high FPS, much of the information captured by the ADC is completely ignored.

Okay. And why is that?
Quote: What is more important is how much data actually gets to the screen,
Data on the screen is already in decimated form. So what exactly do you mean by "how much data actually gets to the screen" here?
More to the point, you can't see on the screen more data than the display itself is actually capable of displaying. If I have a 1000 pixel wide display, that means I can only display 1000 pixels of horizontal information. Period. Even if I have a million points of acquisition within the time frame represented by the screen, I can't display all of them. I can only display a 1000 pixel representation of them.
And that means that I might be able to take shortcuts, depending on the set of operations that are being performed in order to transform the data in the acquisition buffer into the data that is actually displayed. For instance, for the FFT, I can sample subsets of the acquisition data, and as long as my technique preserves the frequency domain statistical characteristics of the data in the acquisition buffer, what lands on the screen will be a reasonable representation of the FFT, right? Moreover, I can't display more than 1000 discrete frequency bins from the FFT at any given time, either, precisely because my display is limited to 1000 pixels of width.
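To make that concrete, here's a minimal sketch of that idea (mine, not anything out of a real scope), under a few stated assumptions: an 8-bit acquisition buffer, 2048-point FFTs, and a Welch-style average of a handful of windows spread evenly across the record instead of one giant transform over the whole thing. The 1024 output bins are already more than a 1000-pixel display can show.

Code:
#include <complex.h>
#include <math.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NFFT 2048           /* per-window transform size (power of two) */
#define NSEG 16             /* number of windows averaged */

/* In-place iterative radix-2 FFT; n must be a power of two. */
static void fft(double complex *x, int n)
{
    for (int i = 1, j = 0; i < n; i++) {        /* bit-reversal permutation */
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
    }
    for (int len = 2; len <= n; len <<= 1) {    /* butterfly stages */
        double complex wlen = cexp(-2.0 * I * M_PI / len);
        for (int i = 0; i < n; i += len) {
            double complex w = 1.0;
            for (int k = 0; k < len / 2; k++) {
                double complex u = x[i + k], v = x[i + k + len / 2] * w;
                x[i + k] = u + v;
                x[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}

/* Average the power spectrum of NSEG Hann-windowed segments taken at evenly
 * spaced offsets in the acquisition buffer, rather than transforming all of it. */
void averaged_spectrum(const uint8_t *acq, size_t acq_len, double power[NFFT / 2])
{
    double complex seg[NFFT];
    if (acq_len < NFFT)
        return;
    memset(power, 0, (NFFT / 2) * sizeof(double));
    size_t stride = (acq_len - NFFT) / (NSEG - 1);
    for (int s = 0; s < NSEG; s++) {
        const uint8_t *p = acq + (size_t)s * stride;
        for (int i = 0; i < NFFT; i++) {
            double w = 0.5 - 0.5 * cos(2.0 * M_PI * i / (NFFT - 1));  /* Hann window */
            seg[i] = w * (p[i] - 128.0);                              /* centre 8-bit codes */
        }
        fft(seg, NFFT);
        for (int i = 0; i < NFFT / 2; i++) {
            double re = creal(seg[i]), im = cimag(seg[i]);
            power[i] += (re * re + im * im) / NSEG;
        }
    }
}

If the spectral content is reasonably stationary across the record, averaging a few windows like this preserves the frequency-domain statistics well enough for display; if you genuinely need every bin of a full-depth FFT, you obviously can't take this shortcut.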
The intensity grading mechanism is one of the mechanisms where you can't get away with subsampling, though, but at the same time, the transforms required for it are very simple and, more importantly, because the target of the transforms is the display, it means the target buffer can be relatively small, which means it can be very fast.
In an intensity graded display you need to decimate so much that an individual sample doesn't even count. Therefore you can take huge shortcuts without misrepresenting a signal. And you don't really want an actual representation anyway because that would mean an 'area' with very little intensity is invisible.
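For what it's worth, the display-side accumulation being described really is that simple. Here's a minimal sketch (my own, with assumed display dimensions) of what an intensity-graded hit-count map amounts to: every sample lands in a (column, ADC code) cell and only the per-cell counts matter.

Code:
#include <stdint.h>
#include <stddef.h>

#define DISP_W 1000          /* assumed horizontal resolution */
#define DISP_H 256           /* one row per 8-bit ADC code keeps the mapping trivial */

/* Accumulate one acquisition record into a 2D hit-count map: the column is the
 * sample's time bucket, the row is its ADC code. Brightness is derived from the
 * counts afterwards, so no individual sample matters on its own. */
void accumulate(const uint8_t *samples, size_t n, uint32_t hist[DISP_H][DISP_W])
{
    for (size_t i = 0; i < n; i++) {
        size_t col = i * DISP_W / n;     /* which pixel column this sample falls into */
        hist[samples[i]][col]++;
    }
}

Mapping the counts to brightness with something like a log scale is what keeps a sparsely-hit region from disappearing entirely, which is the point made above.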
Quote: FPGAs are more power efficient than a CPU for most tasks,
They are? I thought the underlying fabric of an FPGA was relatively inefficient compared with straight-up dedicated hardware, which is what a CPU is.
I can see how an FPGA would be more efficient if the nature of the problem is such that a CPU would be a poor solution compared with a hardware solution. Certainly some of what we're discussing here qualifies, e.g. scope triggering.
I'll also be the first to admit that CPUs have a substantial amount of overhead, so they can be overkill for many problems for which FPGAs would be a much better solution. But I'm not sure about that "most tasks" claim ...
Quote: Okay. And why is that?

It's like you're learning nothing through all these threads, or you're intentionally pretending you don't understand. Here is the entry-level approach to it:
https://cdn.fswwwp.rohde-schwarz.com/pws/dl_downloads/dl_application/application_notes/1er02/1ER02_1e.pdf
While analog oscilloscopes just need to reset the horizontal system for the next electron beam sweep, digital oscilloscopes spend most of the acquisition cycle postprocessing the waveform samples [1]. During this processing time the digital oscilloscope is blind and cannot monitor the measurement signal.
A natural reaction to the discussion so far could be to say "Let's build a faster digital oscilloscope with improved processing power and pipelined architecture". However, such a solution would require massive processing capabilities. For example, a digital oscilloscope with a 10 Gsample/s 8-bit ADC produces 80 Gbits of continuous data that must be processed and displayed every second. In addition, DSP filtering, arithmetic operations, analysis functions and measurements are often applied to the waveform samples which require additional processing power. Real-time processing with no blind time is currently not feasible for a digital oscilloscope in a laboratory environment.
The measurement signal enters the oscilloscope at the channel input and is conditioned by attenuators or amplifiers in the vertical system. The analog-to-digital converter (ADC) samples the signal at regular time intervals and converts the respective signal amplitudes into discrete digital values called “sample points”. The acquisition block performs processing functions such as filtering and sample decimation. The output data are stored in the acquisition memory as “waveform samples”.
You can't take shortcuts, or you lose the information.

If you just want a scope that shows the screen with a min/max envelope then have at it (and come back when you've looked at the computational cost of computing just that alone), but the market demands a graduated display these days.
Quote: The intensity grading mechanism is one of the mechanisms where you can't get away with subsampling, though, but at the same time, the transforms required for it are very simple and, more importantly, because the target of the transforms is the display, it means the target buffer can be relatively small, which means it can be very fast.
You don't know in advance which of the samples in the waveform has the interesting information in them, so the scope needs to parse (process) ALL of them. If you knew where the interesting information was then you'd have captured just that part alone.
A vast array of applications has been targeted to FPGAs, and they improve the power efficiency:
http://www.ann.ece.ufl.edu/courses/eel6686_15spr/papers/paper1a.pdf
https://www.altera.com/en_US/pdfs/literature/wp/wp-01173-opencl.pdf
http://www.cc.gatech.edu/~hadi/doc/paper/2014-isca-catapult.pdf
But the repetitive, high-throughput tasks which FPGAs and ASICs are so well suited to are exactly the sort of computations that are needed in a scope. They can't effectively replace all the computations, but they can do the "heavy lifting".
That document says:

Quote: While analog oscilloscopes just need to reset the horizontal system for the next electron beam sweep, digital oscilloscopes spend most of the acquisition cycle postprocessing the waveform samples [1]. During this processing time the digital oscilloscope is blind and cannot monitor the measurement signal.

and:

Quote: A natural reaction to the discussion so far could be to say "Let's build a faster digital oscilloscope with improved processing power and pipelined architecture". However, such a solution would require massive processing capabilities. For example, a digital oscilloscope with a 10 Gsample/s 8-bit ADC produces 80 Gbits of continuous data that must be processed and displayed every second. In addition, DSP filtering, arithmetic operations, analysis functions and measurements are often applied to the waveform samples which require additional processing power. Real-time processing with no blind time is currently not feasible for a digital oscilloscope in a laboratory environment.

That is thoroughly unenlightening, especially in the context of entry-level scopes. It claims that the problem cannot be solved at all without explicitly stating why. All it gives is a vague claim about the sheer amount of processing power that would be required, without any recognition that the processing requirements depend enormously on which transforms are required of the data in the first place. It thus implicitly makes the same claim that you do, that all data must be processed for every operation, or alternatively it engages in the fallacy that because there exist some operations that need to operate upon all the acquired data, all operations must do so.
I'll cut to the chase with a question: which operations, aside from the intensity grading operation, require processing of all the data in the buffer? Even the FFT doesn't require that.
Quote: The measurement signal enters the oscilloscope at the channel input and is conditioned by attenuators or amplifiers in the vertical system. The analog-to-digital converter (ADC) samples the signal at regular time intervals and converts the respective signal amplitudes into discrete digital values called “sample points”. The acquisition block performs processing functions such as filtering and sample decimation. The output data are stored in the acquisition memory as “waveform samples”.

Why decimate the data before storing it to the acquisition memory instead of after?
That said, the block diagram they supply for the RTO architecture (page 13) is very much like what I'm envisioning here, but the memory implementation I have in mind would be double buffered so as to ensure that display processor reads from acquisition memory never collide with the writes coming from the acquisition engine.
Losing information for display purposes is unavoidable here. What matters is what the decimation process did to transform the acquired data into something for the display, and that is very operation-specific.
Quote: You don't know in advance which of the samples in the waveform has the interesting information in them, so the scope needs to parse (process) ALL of them. If you knew where the interesting information was then you'd have captured just that part alone.

You're conflating captured data with processed data. The triggering mechanism is what defines the points of interest, and that absolutely has to keep up with the sampling rate. I've never argued otherwise for that. The rest is a matter of how the UI interacts with the rest of the system.
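Just to pin down what "keeping up with the sample rate" means for the trigger: it's a compare or two (plus a little hysteresis state) per sample, which is exactly the kind of dumb, repetitive work that lives in the FPGA/ASIC front end. A rough sketch of a rising-edge trigger, purely illustrative and mine, not any scope's implementation:

Code:
#include <stdint.h>
#include <stddef.h>

/* Return the index of the first rising-edge crossing of 'level', or -1 if none.
 * The arm/fire logic with hysteresis keeps noise from retriggering; it touches
 * every sample, but it is only a compare or two per sample. */
ptrdiff_t find_rising_edge(const uint8_t *s, size_t n, int level, int hyst)
{
    int armed = 0;
    for (size_t i = 0; i < n; i++) {
        if ((int)s[i] < level - hyst)
            armed = 1;                    /* signal dipped below the arming threshold */
        else if (armed && (int)s[i] >= level)
            return (ptrdiff_t)i;          /* first sample at or above the trigger level */
    }
    return -1;
}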
Note how Dave notices the ASIC gets 'hot as hell' in his DSO1000X hack video.
Quote: I'll cut to the chase with a question: which operations, aside from the intensity grading operation, require processing of all the data in the buffer? Even the FFT doesn't require that.

An FFT on the entire memory depth is extremely useful and desirable; the Agilent/Keysight X series don't do it, and it's a limitation of them. As I keep repeating endlessly, take some of the simple display examples and do your own maths on the required computational load. Just the min/max to draw the envelope of the deep memory to a smaller display is computationally expensive enough to prove the point. Moving that sort of data throughput in and out of a CPU is not viable, and it requires dedicated hardware resources to achieve the peak rates that current products are doing.
Quote: Why decimate the data before storing it to the acquisition memory instead of after?

Quote: That said, the block diagram they supply for the RTO architecture (page 13) is very much like what I'm envisioning here, but the memory implementation I have in mind would be double buffered so as to ensure that display processor reads from acquisition memory never collide with the writes coming from the acquisition engine.
It's almost like you've never used a scope....
You can decimate before writing to the acquisition memory for modes such as min/max, or hi-res, where the ADC samples at a higher rate than you're writing to the sample memory (because of memory limitations and you want a longer record).
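For the sake of illustration, a sketch (mine, assuming 8-bit samples and a decimation factor N) of those two write-time decimation modes: peak-detect keeps a min and a max per group so narrow glitches survive, and hi-res trades sample rate for extra vertical bits by averaging.

Code:
#include <stdint.h>
#include <stddef.h>

/* Peak-detect decimation: store one min and one max per group of N raw samples,
 * so a narrow glitch still shows up even though the stored record is far shorter. */
void peak_detect(const uint8_t *raw, size_t n, size_t N, uint8_t *out /* 2*(n/N) bytes */)
{
    for (size_t i = 0; i + N <= n; i += N) {
        uint8_t lo = raw[i], hi = raw[i];
        for (size_t j = 1; j < N; j++) {
            if (raw[i + j] < lo) lo = raw[i + j];
            if (raw[i + j] > hi) hi = raw[i + j];
        }
        *out++ = lo;
        *out++ = hi;
    }
}

/* Hi-res decimation: boxcar-average each group of N samples; the average of N
 * 8-bit codes carries more than 8 bits, so store it scaled into 16 bits. */
void hi_res(const uint8_t *raw, size_t n, size_t N, uint16_t *out /* n/N words */)
{
    for (size_t i = 0; i + N <= n; i += N) {
        uint32_t acc = 0;
        for (size_t j = 0; j < N; j++)
            acc += raw[i + j];
        *out++ = (uint16_t)((acc * 256ull) / N);   /* keep some fractional bits */
    }
}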
Quote: Losing information for display purposes is unavoidable here. What matters is what the decimation process did to transform the acquired data into something for the display, and that is very operation-specific.

YES, and some simplifications are appropriate for some applications. But 2D histograms are the benchmark that almost all scopes today provide.
Quote: You're conflating captured data with processed data. The triggering mechanism is what defines the points of interest, and that absolutely has to keep up with the sampling rate. I've never argued otherwise for that. The rest is a matter of how the UI interacts with the rest of the system.

No, that's the LeCroy sleight of hand, where they have a banner spec for dead time that only applies to segmented captures where the data is not going to the screen at that throughput consistently (a tiny fast burst). If the trigger defines what you're interested in, then what are you doing with the deep memory?
You could wind it down to an acquisition depth the same as the screen size which is a trivial display case.
A: you have long memory and capture something interesting to look at later/slower/in depth
B: you want to capture as much information as possible and display it on the screen
A is easy, the interface is slow because to draw the XXMpts of memory to the screen takes a long time. Seriously sit down and consider how long a computer would take to read a 100,000,000 point record and just apply the trivial min/max envelope to it for display at 1000px across.
B is where the magic hardware comes in to do all that work in hardware, but it can also run on stopped traces to speed up case A. Thus a better scope for all uses because it has hardware accelerated plotting.
Quote: A is easy, the interface is slow because to draw the XXMpts of memory to the screen takes a long time. Seriously sit down and consider how long a computer would take to read a 100,000,000 point record and just apply the trivial min/max envelope to it for display at 1000px across.
That's an interesting idea, and proved to be an interesting exercise.
It took 6 milliseconds for the min/max computation on that number of points (in other words, just to determine whether each point was within the min/max envelope), with what amounts to a naive approach to the problem (no fancy coding tricks or anything). But once you add in the output buffer code, it rises to about 135 milliseconds. This is on a single core of a 2.6GHz Intel i5. On the one hand, this was just on a single core, and the nature of the operation is such that I'm writing to the same RAM as I'm reading from (which may or may not matter). But on the other hand, the CPU I tested this on, while not the most modern (2014 vintage), is certainly going to be relatively expensive for embedded use.
It looks to me like we're saying roughly the same thing (now especially, after the results of the test you had me perform), but for some reason you seem to believe that if you have more memory, then you have to scale the processing to match. I disagree with that position, if indeed that is your position in the first place. Keep in mind that the OP's question was about why we don't see scopes with big piles of modern off-the-shelf memory.
Quote: It took 6 milliseconds for the min/max computation on that number of points (in other words, just to determine whether each point was within the min/max envelope), with what amounts to a naive approach to the problem (no fancy coding tricks or anything).

So you expect us all to believe you got just under 7 points through a single core per clock cycle? 100M/(2.5G*6ms) = 6.666... points per clock, and a memory bandwidth of 16GB/s (130Gb/s), close to the bleeding edge in desktop performance.
It's possible to squeeze some extra through with vector instructions, but that's an implausible result once you include the I/O for data which doesn't fit into the cache, so you have memory overheads to consider as well. A min/max algorithm in a processor such as that, with a deep pipeline, either stalls the pipe to collect all the partial results or has a very nasty memory access pattern. And this is just a tiny part of the processing needed in a scope, but already it's not possible with a desktop processor.

That processing in an FPGA to min/max an 8-bit 10GS/s stream would use a dollar or two of FPGA logic, at around 20mW of dynamic consumption and a similar amount of static consumption, for a shocking 40-50mW total power requirement. This is why all the major brands are offloading work to the FPGAs (or ASICs): it's faster, cheaper, and lower power.
We've held your hand and shown you over and over again that the scopes are using commonly available memory to store their acquisitions in.

Deep memory scopes that have slow displays have been around a long time and they were never very popular, so when faced with the design of a new scope the manufacturers are putting in sufficient memory to meet the market's demands.

If you want a cheap scope with deep memory buy a Rigol, and then come back when you can't stand using it because navigation of the memory is so difficult.

What you're talking about exists, but it's not something people want to use and the market sees little value in it. The manufacturers aren't short-changing anyone, but they've all come to a similar point of what the market will pay for. Adding in an extra XXMpts of memory costs real money that they won't profit on, so oddly enough they choose not to do it.
Changing the test to do a min/max pass on groups of 100000 points, with one group per pixel, where I write the min value and max value to individual 1000-element buffers, yields 6 milliseconds on my 2.6GHz Core i5 processor system (a 2014 Macbook Pro 13"). On my Raspberry Pi 3, it yields 421 milliseconds. On my 2010 Mac Mini with a 2.4GHz Core i7, it yields 8 milliseconds. On my Linux virtual machine running on my Xeon E5-1650v4, it yields 8 milliseconds.
Note that these are all CPU times, not wall clock times. I've got a memory speed test program, which uses the same method for recording CPU time, that I will also attach, and it yields memory throughput results that are a very close match to what the memory test programs (PassMark MemTest86) I've run show for the systems on which I've run both. The memory speed test program shows a throughput of 1.5GB/s on the Raspberry Pi 3, 8GB/s on my Mac Mini, 20GB/s on my Macbook Pro, and 13.5GB/s in my virtual machine on my Xeon system.

Now, it's possible that the CPU timekeeping mechanism under-reports the amount of CPU time used, but most certainly not by orders of magnitude. The memory test on my virtual machine yields results fairly close to the actual value I'm getting from PassMark's MemTest86 program when run against the Xeon host itself (about 15GB/s).
You don't have to believe me. You can try it out for yourself. I'm attaching the source to the test program I'm using. Feel free to pull it apart and to run the test yourself.
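The attachment isn't reproduced here, so as a stand-in, here is roughly the shape of the test being described (my own sketch, not the attached source): 100,000,000 random 8-bit points, a min/max per group of 100,000, 1000-entry output buffers, and CPU time via clock(). Build it with optimisation enabled (-O3, as discussed a little further down) if you want comparable numbers.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>
#include <time.h>

#define NPOINTS   100000000UL
#define NPIXELS   1000
#define PER_PIXEL (NPOINTS / NPIXELS)

int main(void)
{
    uint8_t *buf = malloc(NPOINTS);
    if (!buf) return 1;
    for (size_t i = 0; i < NPOINTS; i++)          /* fill with pseudo-random samples */
        buf[i] = (uint8_t)rand();

    static uint8_t mins[NPIXELS], maxs[NPIXELS];
    clock_t t0 = clock();
    for (size_t px = 0; px < NPIXELS; px++) {     /* one min/max pair per pixel column */
        const uint8_t *p = buf + px * PER_PIXEL;
        uint8_t lo = p[0], hi = p[0];
        for (size_t j = 1; j < PER_PIXEL; j++) {
            if (p[j] < lo) lo = p[j];
            if (p[j] > hi) hi = p[j];
        }
        mins[px] = lo;
        maxs[px] = hi;
    }
    clock_t t1 = clock();

    unsigned long check = 0;                      /* consume the results so nothing is optimised away */
    for (size_t px = 0; px < NPIXELS; px++)
        check += mins[px] + maxs[px];
    printf("min/max envelope: %.1f ms CPU time (checksum %lu)\n",
           1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC, check);
    free(buf);
    return 0;
}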
Thanks for the well-laid-out code. You can run it on any of the web-facing servers for testing code, and it always returns times of 0.1 to 0.2 seconds per test, both by the number the code spits out and by their timers on the tasks (it's also confirmed to use 100MB, so we know it's not optimising everything away).

So something between your single-digit ms results and the real world is messed up again. Back-of-the-envelope calculations say it can't process the data that fast; we don't need to go any deeper unless you wish to point out some extremely deep vector instructions which make this task so fast.

This is a tiny part of the processing and it's already blowing out your processing budget; adding deep memory to a scope without making it slow as a wet week just doesn't work.
It looks like it's a question of the compiler optimizations. Use -O3 for the optimization option and you get the results I'm getting.
Interestingly enough, adding some instrumentation to it so as to force it to go through all the values even with full optimization does substantially increase the execution time, by over an order of magnitude. My Macbook generates an execution time of 100ms after adding that in. I'll add the revised code here so you can check it out.
A couple of things about the (not so) strange timing results:

1- The Visual Studio compiler will happily use the pmaxub and pminub SSE vector instructions (checked the .asm output). On my old i5-750 it takes 12ms per test.

2- In your second code version, you have:

Code:
if (*p > max)
    max ^= *p;
if (*p < min)
    min ^= *p;

The XOR defeats the optimization: SSE is not used, and time goes up. Was that actually done on purpose?

Quote: The XOR defeats the optimization: SSE is not used, and time goes up. Was that actually done on purpose?

I'm still interested in knowing how one can differentiate between poor UI coding and processing limitations when it comes to things like examining the memory. You mentioned the poor Rigol memory handling in the UI more than once. Why can't it do things like subsampling while the UI is being twiddled, so as to retain responsiveness in the UI? Once the user has made the changes he wants, the scope can go back to full processing again (at whatever speed it's capable of, naturally).
I have no explanation for this, save perhaps that Intel CPUs are far more capable at some operations than he expects them to be.
$LL16@main:
; 45 : if (*p > max)
; 46 : max = *p;
movups xmm0, XMMWORD PTR [esi]
pmaxub xmm1, xmm0
pminub xmm2, xmm0
movups xmm0, XMMWORD PTR [esi+16]
add esi, 32 ; 00000020H
pmaxub xmm4, xmm0
pminub xmm3, xmm0
sub eax, 1
jne SHORT $LL16@main
Of course, that might not be all that relevant. Like I said, the type of processor used for embedded use (at least in entry level scopes) is likely to be more akin to that in the Raspberry Pi 3 rather than a modern PC, and the performance numbers for that processor when running this test are much more in line with what Someone expects. Even so, it may be illustrative of what modern processors are capable of, at least for certain types of operations.
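For anyone who wants to see why those numbers are plausible without reading assembler, here is roughly what the compiler has done in that listing, written out by hand with SSE2 intrinsics (my sketch, not the original test program): pminub/pmaxub process 16 bytes per instruction with no data-dependent branches, which is why the naive C loop vectorises so well.

Code:
#include <emmintrin.h>   /* SSE2: _mm_min_epu8 / _mm_max_epu8 are the pminub / pmaxub above */
#include <stdint.h>
#include <stddef.h>

/* Min/max over one pixel's worth of samples, 16 bytes per step, the way the
 * vectorised loop in the listing works. n is assumed to be a multiple of 16
 * to keep the sketch short. */
void minmax_sse2(const uint8_t *p, size_t n, uint8_t *lo_out, uint8_t *hi_out)
{
    __m128i lo = _mm_set1_epi8((char)0xFF);     /* running per-lane minima */
    __m128i hi = _mm_setzero_si128();           /* running per-lane maxima */
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        lo = _mm_min_epu8(lo, v);               /* pminub */
        hi = _mm_max_epu8(hi, v);               /* pmaxub */
    }
    /* reduce the 16 lanes down to single bytes */
    uint8_t lanes[16];
    _mm_storeu_si128((__m128i *)lanes, lo);
    uint8_t l = lanes[0];
    for (int k = 1; k < 16; k++) if (lanes[k] < l) l = lanes[k];
    _mm_storeu_si128((__m128i *)lanes, hi);
    uint8_t h = lanes[0];
    for (int k = 1; k < 16; k++) if (lanes[k] > h) h = lanes[k];
    *lo_out = l;
    *hi_out = h;
}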
As far as UI responsiveness, they are just poorly programmed. Fast is not the same thing as real time.
I suspect there is a general trend toward using "safe" programming languages with garbage collection which makes things worse if extreme care is not taken. This is the sort of thing that makes the 50 MHz 32-bit ARM in my HP50g slower in the practical sense than the 4 MHz 4-bit Saturn in my decades old HP48g.
It also makes my ancient DSOs with their slower user interfaces faster and more usable than their feature set and hardware specifications would suggest.
Subsampling would have to be a separate selected feature in the FPGA code. Otherwise the processor has to spend the full cost of copying the FPGA memory and then discarding part of it.
As far as processing large record lengths with a high performance CPU, not only does the CPU have to process the acquisition record, but it has to *access* it somehow.
In current designs this means copying the contents of the FPGA's memory to CPU memory through the FPGA, during which time the FPGA cannot use its own memory without special provisions which are going to require faster memory yet. And the CPU's fastest interface, its memory interface, is not amenable to multimastering or attachment to an FPGA.

Copying is actually not as bad as it seems and may be preferable. While processing the copied acquisition record, the FPGA can grab another acquisition. Adding another memory channel to the FPGA to support double buffering would only double the acquisition rate, and not even that if the CPU processing is a bottleneck.
Quote: Subsampling would have to be a separate selected feature in the FPGA code. Otherwise the processor has to spend the full cost of copying the FPGA memory and then discarding part of it.
Yeah, I was thinking it could be something that the FPGA design could incorporate as an option.
Quote: As far as processing large record lengths with a high performance CPU, not only does the CPU have to process the acquisition record, but it has to *access* it somehow.
I was presuming that the acquisition FPGA could write directly to the acquisition DRAM, and the acquisition memory could be set up in a double-buffered ("banked"?) configuration so that the FPGA's writes wouldn't collide with reads performed by the downstream processing pipeline. Which is to say, the FPGA's writes would go through a switch which would send the data to one DRAM bank or the other, depending on which one was in play at the time.
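A minimal sketch of that banked arrangement, assuming two DRAM banks and a single flag that flips ownership (my illustration, not any particular scope's design); a real implementation would also have to make the swap wait until the display side has finished with its bank:

Code:
#include <stdint.h>
#include <stdatomic.h>

/* Two acquisition banks; the acquisition engine writes one while the display
 * pipeline reads the other, and an atomic index flips on each completed record
 * so reads and writes never land on the same bank. */
typedef struct {
    uint8_t *bank[2];            /* the two acquisition DRAM banks */
    _Atomic int display_bank;    /* bank currently owned by the display pipeline */
} acq_mem_t;

/* Acquisition side: the bank it may write into right now. */
static inline uint8_t *acq_write_bank(acq_mem_t *m)
{
    return m->bank[1 - atomic_load(&m->display_bank)];
}

/* Called when an acquisition completes: hand the freshly filled bank to the
 * display side and start filling the one it just gave up. */
static inline void acq_swap(acq_mem_t *m)
{
    atomic_store(&m->display_bank, 1 - atomic_load(&m->display_bank));
}

/* Display/processing side: the bank it is allowed to read right now. */
static inline const uint8_t *display_read_bank(acq_mem_t *m)
{
    return m->bank[atomic_load(&m->display_bank)];
}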
Quote: In current designs this means copying the contents of the FPGA's memory to CPU memory through the FPGA, during which time the FPGA cannot use its own memory without special provisions which are going to require faster memory yet. And the CPU's fastest interface, its memory interface, is not amenable to multimastering or attachment to an FPGA.
It's not? How have multiprocessor systems managed it?
Quote: Copying is actually not as bad as it seems and may be preferable. While processing the copied acquisition record, the FPGA can grab another acquisition. Adding another memory channel to the FPGA to support double buffering would only double the acquisition rate, and not even that if the CPU processing is a bottleneck.
Right. Eventually the CPU processing would become a bottleneck, but it could easily have FPGA help. I've not looked at the Zynq setup, but wouldn't be surprised if its onboard CPU has some kind of fast interface to the FPGA fabric to make CPU+FPGA operations possible and relatively straightforward.