Author Topic: Which FGPA/tool for this project? (Read 15167 times)

mikeselectricstuff · « **Reply #25 on:** September 22, 2017, 06:56:28 pm »

As a quick & easy reality check, it would probably be worth getting your code running on a RasPi to see where you're at speed-wise.

PartialDischarge · « **Reply #26 on:** September 22, 2017, 07:03:00 pm »

A fast MCU would be the best solution, I agree. I need around 5-10 readings/sec, enough to feel interactive to a human. A laptop, any laptop, is too big and heavy.

Marco · « **Reply #27 on:** September 22, 2017, 07:56:22 pm »

Between 3x the clockspeed, dual issue and the instruction count reduction from going floating point I'd be surprised if a STM32F7 wouldn't get you your desired speedup.
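To make the "instruction count reduction" point concrete, here is a minimal C sketch of a 4x4 matrix product done both ways. The names (`q15_t`, `mat4_mul_q15`, `mat4_mul_f32`) and the Q15 format are assumptions for illustration; the OP's actual code is not shown in the thread. The fixed-point version has to widen, round, shift and saturate around every dot product, while the float version is plain multiply-accumulate, which a part with a hardware FPU can do directly.

```c
#include <stdint.h>

/* Q15 fixed point (assumed format): a 16-bit integer v represents v / 32768.0 */
typedef int16_t q15_t;

/* 4x4 product in Q15: widen, accumulate, then round, shift and saturate.
 * Assumes >> on a negative value is an arithmetic shift (true on common compilers). */
void mat4_mul_q15(q15_t a[4][4], q15_t b[4][4], q15_t out[4][4])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            int64_t acc = 0;                          /* wide enough for 4 Q30 products */
            for (int k = 0; k < 4; k++)
                acc += (int64_t)a[i][k] * b[k][j];
            acc = (acc + (1 << 14)) >> 15;            /* round, back to Q15 */
            if (acc > INT16_MAX) acc = INT16_MAX;     /* saturate instead of wrapping */
            if (acc < INT16_MIN) acc = INT16_MIN;
            out[i][j] = (q15_t)acc;
        }
    }
}

/* The same product in single-precision float: no scaling bookkeeping at all. */
void mat4_mul_f32(float a[4][4], float b[4][4], float out[4][4])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float acc = 0.0f;
            for (int k = 0; k < 4; k++)
                acc += a[i][k] * b[k][j];
            out[i][j] = acc;
        }
    }
}
```

Whether the float version actually wins depends on the compiler and on how much of the algorithm is dominated by these inner loops, so it is worth measuring rather than assuming.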

PartialDischarge · « **Reply #28 on:** September 22, 2017, 08:03:49 pm »

Quote from: Marco on September 22, 2017, 07:56:22 pm

Between 3x the clockspeed, dual issue and the instruction count reduction from going floating point I'd be surprised if a STM32F7 wouldn't get you your desired speedup.

I'll take a look. Also people seem happy with Atollic

technix · « **Reply #29 on:** September 23, 2017, 03:05:05 am »

Quote from: MasterTech on September 22, 2017, 05:51:11 pm

Quote from: technix on September 22, 2017, 05:30:34 pm
If you can tolerate some latency you can use an ESP8266/ESP32/Raspberry Pi Zero W to stream the data off the system to a Wi-Fi network, to be processed using a laptop, a desktop PC, a gaming PC with one or two high-end GPUs, a server with a lot of high power x86 cores, or some cloud service like Amazon EC2. I don't think your calculations can hog up an 8-core Ryzen 7 and two GeForce GTX 1080 Ti's.

Latency I can tolerate. However, the system must be light, portable, self-contained and battery operated. Sorry, I can't disclose the application I've thought of.

Just stream it off to a server on the cloud for the number crunching then.

Sal Ammoniac · « **Reply #30 on:** September 23, 2017, 05:32:38 am »

Quote from: MasterTech on September 22, 2017, 08:03:49 pm

Also people seem happy with Atollic

If you don't mind that it's as slow as molasses in the winter.

forrestc · « **Reply #31 on:** September 23, 2017, 07:09:35 am »

Quote from: MasterTech on September 21, 2017, 05:09:33 pm

A long time ago I programmed a somewhat limited algorithm to locate the position of a magnet in space. Now I'd like to go further and port it to an FPGA.
I'm currently using a PSoC 5LP running at 67 MHz (Cortex-M3).
The algorithm makes heavy use of 16-bit fixed-point multiplications and sums for speed, because I have multiple 4x4 matrix multiplications, Jacobians, determinants, cosines, sines... For example, trigonometric functions are calculated by Taylor series to avoid slow floating-point functions.

I agree with others that a fast processor core might be the best option. However, here are a few things about FPGAs that others haven't specifically mentioned:

Note: I haven't done much real work with an FPGA, so these are just things I've picked up along the way:

1) There are C-to-HDL (aka FPGA) compilers. These take something C-like and convert it to something like Verilog or VHDL. I'm not sure how well they work, but I do know that for some of them you have to write your code in a 'special' way. See https://www.xilinx.com/video/hardware/getting-started-vivado-high-level-synthesis.html as an example; it's just the first one I found.

2) For at least some of the functions you're talking about, there are cores out there (aka chunks of HDL code) that do a lot of the math. In particular, Xilinx has CORDIC IP which does a lot of the trig functions. Other vendors probably have something similar.

3) You may want to look at the math projects at opencores.org
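For readers who haven't met CORDIC: it computes sine and cosine with only shifts, adds and a small arctangent table, which is why it maps well onto FPGA fabric. Below is a minimal software sketch in C of the rotation-mode iteration. It is not the interface of the Xilinx IP; the function name, the Q16.16 format and the 16-iteration count are assumptions chosen for illustration.

```c
#include <stdint.h>

#define CORDIC_ITERS 16

/* atan(2^-i) in Q16.16 radians for i = 0..15 (rounded) */
static const int32_t cordic_atan_q16[CORDIC_ITERS] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};

/* Reciprocal of the CORDIC gain, about 0.60725, in Q16.16 */
#define CORDIC_K_Q16 39797

/* Rotation-mode CORDIC. angle is in Q16.16 radians and should stay within
 * roughly +/-1.74 rad, the convergence range of the basic algorithm.
 * Assumes >> on a negative value is an arithmetic shift. */
void cordic_sincos_q16(int32_t angle, int32_t *sin_out, int32_t *cos_out)
{
    int32_t x = CORDIC_K_Q16;   /* pre-scaled so the final vector has unit length */
    int32_t y = 0;
    int32_t z = angle;          /* residual angle still to be rotated */

    for (int i = 0; i < CORDIC_ITERS; i++) {
        int32_t dx = x >> i;
        int32_t dy = y >> i;
        if (z >= 0) {           /* rotate counter-clockwise by atan(2^-i) */
            x -= dy;
            y += dx;
            z -= cordic_atan_q16[i];
        } else {                /* rotate clockwise by atan(2^-i) */
            x += dy;
            y -= dx;
            z += cordic_atan_q16[i];
        }
    }
    *cos_out = x;
    *sin_out = y;
}
```

In hardware each loop iteration typically becomes one pipeline stage (or one pass through a shared stage), so the same structure trades area against throughput.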

legacy · « **Reply #32 on:** September 23, 2017, 10:27:32 am »

Quote from: forrestc on September 23, 2017, 07:09:35 am

3) You may want to look at the math projects at opencores.org

yeah, the best advice ever (sarcasm)

rstofer · « **Reply #33 on:** September 23, 2017, 10:09:41 pm »

At Ultibo's GitHub, there is an example where a CPU is dedicated to a single task and the remaining 3 CPUs do whatever (RPi 3 in my case).

At the culmination of the example there is a simple assembly language loop to increment a value in memory while a task running elsewhere displays the iterations per second. How about 603 MILLION loops per second? Sure, there is only an increment register and a memory write per loop but that's a LOT of loops!

Other Pascal loops are down around 7 MILLION per second which still might be sufficient.

And the RPi3 has floating point support.

https://github.com/ultibohub/Examples/tree/master/Advanced/DedicatedCPU/RPi2

I don't pretend to understand what is going on. There's a lot to learn...

PartialDischarge · « **Reply #34 on:** September 24, 2017, 05:38:50 am »

Assembly code is somewhere I don't want to go. It is a nightmare and the algorithm I'm handling sometimes fries my head.
Back in 1998 I programmed a receiver-transmitter for direct-sequence spread-spectrum modulation on a TMS320C54x. I happen to have found the code for it. It contains the Taylor series for cos(x), sin(x), cos(4x) and sin(4x), which I used because there is not enough memory for a look-up table, and if you use 8 bits for that, the noise is too high for the demod operations.
The receiver used a complex I-Q demodulating loop

Code:

 
	 .title "Receptor DS-SS"
         .mmregs	;memory mapped registers
	 .setsect ".text",   0x200,0	; start of the executable code
						
	 .setsect "vectors", 0x180,0	; start of the vector space

	 .setsect ".data", 0x2000,0	

;-----constant definitions

flag_rcv .set 0x0001	; flag signalling that a sample has been received
kv	 .set 0x0002	; VCO constant
phi	 .set 0x4F35	; discrete centre frequency of the VCO
			; phi=2*pi*fc*Ts (1.4 kHz)
n_taps	 .set 15	; number of taps in the loop low-pass filter

buff_length .set 15*9	; length of the buffer for the PN code
			; used in the acq routine.
umbral_iq .set	17100	; threshold for the acq routine.

	.copy "const.dat" ; copies in the constants section.
	.sect "vectors"
	.copy "vectors1.asm"

;-----start of the main program	

	.text		; program section
start:
	intm=1		; interrupts disabled.
	call AC01INIT	; configure the A/D converter.
	pmst = #01a0h	; processor mode status register
	sp = #27fah	; init stack pointer
	imr = #240h	; unmask TDM RINT and HPIINT(host port interface)
	intm = 0        ; globally enable all interrupts
	sxm = 1		; sign extension enabled.
	*(flags)=#0	
	AR0 = #1	; important for the MAC instructions.
	
inicio	TC=bitf(*(flags),flag_rcv)	; barrier -> wait until
 	nop
	if (NTC) goto inicio	; we can process another sample
	*(flags) ^= #flag_rcv	; clear the received flag
	
	T = *(samplerx)		; T = received sample
	nop
	A = T * *(vcocout)	; multiply the input by the output
				; of the I oscillator.
	*(f_data1) = hi(A)	; feed the multiplier output
				; into the I filter
	T = *(samplerx)
	nop
	A = T * *(vcosout)	; multiply the input by the output
				; of the Q oscillator.
	*(f_data2) = hi(A)	; feed the multiplier output
				; into the Q filter
	call filtrar		; filtering function (I/Q)

	A = *(lp1)		; take the I filter output
	*(b_2) = A		; and store it in buff2.
	A = *(lp2)		; take the Q filter output
	*(b_3) = A		; and store it in buff3.
	call acq1		; routines that filter with the local PN code.
	call acq2

	*(update) = #0xffff	; deassert the update signal.
	A = *(filt_out1)	; sum the magnitudes of the filter outputs
	A = |A|			; that give the correlation.
	B = A
	A = *(filt_out2)
	A = |A|
	A = A + B
	*(i2q2) = A
	A = A - #umbral_iq	; check the threshold.
	nop
	if (ALT) goto no_update
	nop
	*(update) = #0x5000	; signal that the outputs of blocks
				; acq1 and acq2 should be updated.

no_update T = *(ultimo1)	; T = output of block acq1.
 	A = T * #0x7fff
	B = *(ultimo2) * hi(A)
	call vco

	T = *(ultimo2)
	*(sampletx) = T
	goto  inicio


filtrar: 
	push(AR0)	; save AR0 on the stack
	
			; filter 1
	AR0 = #f_data1_end
	A = #0
	repeat (#(n_taps-1))
	macd(*AR0-,h0,A)
	*(lp1) = hi(A)

			; filter 2
	AR0 = #f_data2_end
	A = #0
	repeat (#(n_taps-1))
	macd(*AR0-,h0,A)
	*(lp2) = hi(A)

	AR0 = pop()	; restore AR0 from the stack
	return_enable

vco:			; implements the VCO function
	push(AR3)

	
	AR3 = #vcomem
	A = B << -6	; input in B
	T = #kv
	A = T * hi(A)
	A = A + #phi	; A = phi
	A = A + dbl(*AR3)	; A = phi + VCO input + previous value of vcomem
	call modpi		; reduce 'A' modulo pi
	dbl(*AR3) = A		; store 'A' in 'vcomem'
	A = A<<-2		; shift right to divide
				; the argument by four
	*(cosarg) = A
	*(sinarg) = A
	call coseno
	call seno
	
	AR3 = pop()
	return_enable

modpi:	; reduces the variable 'vcomem' modulo pi
	push(AR1)
	push(AR2)

	AR1 = #dospi	
	AR2 = #pi

	*(camsig) = #0
	if (AGEQ) goto loop1
	*(camsig) = #1
	A = |A|
loop1	B = A
	B = B - dbl(*AR2)
	nop
	nop
	if (BLEQ) goto fin	; goto fin if -pi <= B <= pi
	A = A - dbl(*AR1)	; if B is out of range, subtract 2*pi from 'A'.
	goto loop1
fin	TC= (*(camsig)==#1)
	nop
	nop
	if (TC) execute(1)	; if the sign was flipped at the start,
		A=-A		; flip it back.

	AR2=pop()
	AR1=pop()
	return_enable
	
coseno: ; computes cos(4x) from cos(x)
	; Needed because the argument of the cos() function is vcomem
	; divided by 4.
	call cos
	A = *(cresult) * *(cresult)
	A = A -#0x7fff <<16
	A = T * hi(A)
	nop
	nop
	A = T * hi(A)
	A = A << -13
	A = A + #0x7fff
	*(vcocout) = A
	return_enable

seno:	; computes sin(4x) from sin(x).
	call sin
	A = #0
	A = *(cresult) * *(cresult)
	nop
	A = A << 1
	A = A - #0x7fff << 16
	T = *(cresult)
	A = T * hi(A)
	nop
	nop
	T = *(sresult)
	A = T * hi(A)
	A = A << -14
	*(vcosout) = A
	return_enable

cos:	; computes the cosine using the Taylor series
	; argument cosarg between -1 rad and 1 rad.

	push(AR2)
	push(AR3)
	push(AR4)
		
	AR2 = #cosarg
	AR3 = #c_coffs
	AR4 = #C_1
	A = *AR2+ * *AR2+
	*AR2 = hi(A) 
	|| B = *AR4<<16 ;
	A = B - *AR2+ * *AR3+
	A = T * hi(A)
	*AR2 = hi(A)
	A = B - *AR2- * *AR3+
	B = *AR2+ * hi(A)
	*AR2 = hi(B)
 	|| B = *AR4<<16
	A = B - *AR2- * *AR3+
	A = A <<-1
	A = -A
	B = *AR2+ * hi(A)
	B = B + *AR4 <<16
	*AR2=hi(B)

		
	AR4 = pop()
	AR3 = pop()
	AR2 = pop()
	return_enable

sin:	; computes the sine using the Taylor series
	; argument sinarg between -1 rad and 1 rad.
	push(AR2)
	push(AR3)
	push(AR4)
		
	AR2 = #sinarg
	AR3 = #s_coffs
	AR4 = #C_1
	A = *AR2+ * *AR2+
	*AR2 = hi(A) 
	|| B = *AR4<<16 ;
	A = B - *AR2+ * *AR3+
	A = T * hi(A)
	*AR2 = hi(A)
	A = B - *AR2- * *AR3+
	B = *AR2+ * hi(A)
	*AR2 = hi(B)
 	|| B = *AR4<<16
	A = B - *AR2- * *AR3+
	B = *AR2+ * hi(A)
	*AR2 = hi(B)
 	|| B = *AR4<<16 ;
	A = B - *AR2- * *AR3+
	B = *(sinarg) * hi(A)
	*(sresult) = hi(B)
	
	AR4 = pop()
	AR3 = pop()
	AR2 = pop()
	return_enable

acq1:	; routine that correlates with the local PN sequence.

	push(AR1)	; save AR1 on the stack

	AR1 = #b_2end		; delay the samples in buff2
	repeat(#(buff_length-1))
	delay(*AR1-)	

	A = #0
	repeat(#(buff_length-1)) ; compute the product of the input and the
				; static local code
	macp(*AR1+, #b_1, A)
	*(filt_out1) = hi(A)
	A = *(update)
	A = A - #0x9
	nop
	nop
	if (ALT) goto end_acq1
	nop
	T = *(filt_out1)
	*(ultimo1) = T
end_acq1
	AR1 = pop()
	return_enable

acq2:	; routine that correlates with the local PN sequence.

	push(AR1)	; save AR1 on the stack

	AR1 = #b_3end		; delay the samples in buff3
	repeat(#(buff_length-1))
	delay(*AR1-)	

	A = #0
	repeat(#(buff_length-1)) ; compute the product of the input and the
				; static local code.
	macp(*AR1+, #b_1, A)
	*(filt_out2) = hi(A)
	A = *(update)
	nop
	nop
	if (ALT) goto end_acq2
	T = *(filt_out2)
	*(ultimo2) = T

end_acq2
	AR1 = pop()
	return_enable

transmit:
   	B=trcv
	*(samplerx)=B
	*(sampletx) &= #0xfffc	; strip the control bits.
	A=*(sampletx)
	tdxr=A
	*(flags) |= #flag_rcv 	; signal that a sample has been received
	return_enable

	.copy "ac01ini1.asm"	; A/D converter configuration.


	.end
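
For anyone who would rather not read the DSP assembly, the idea behind the `cos`/`sin` and `coseno`/`seno` routines can be sketched in C: evaluate a short Taylor polynomial on a small argument, then recover the 4x angle with double-angle identities. This is only a sketch under assumptions. The coefficients in `const.dat` are not shown in the post, so the textbook Taylor coefficients, the Q15 scaling and all function names below are illustrative, not a translation of the original routines.

```c
#include <stdint.h>

#define Q15_ONE 32768   /* 1.0 in Q15, held in a 32-bit int so that +1.0 fits */

/* Q15 x Q15 -> Q15 via a 64-bit intermediate.
 * Assumes >> on a negative value is an arithmetic shift. */
static int32_t q15_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 15);
}

/* cos(x) ~ 1 - x^2/2 + x^4/24 - x^6/720, in Horner form, for |x| <= 1 rad */
int32_t q15_cos_taylor(int32_t x)
{
    int32_t x2 = q15_mul(x, x);
    int32_t p = 1365 - q15_mul(x2, 46);      /* 1/24 - x^2/720 */
    p = 16384 - q15_mul(x2, p);              /* 1/2 - x^2 * (...) */
    return Q15_ONE - q15_mul(x2, p);
}

/* sin(x) ~ x - x^3/6 + x^5/120 - x^7/5040, in Horner form, for |x| <= 1 rad */
int32_t q15_sin_taylor(int32_t x)
{
    int32_t x2 = q15_mul(x, x);
    int32_t p = 273 - q15_mul(x2, 7);        /* 1/120 - x^2/5040 */
    p = 5461 - q15_mul(x2, p);               /* 1/6 - x^2 * (...) */
    p = Q15_ONE - q15_mul(x2, p);
    return q15_mul(x, p);
}

/* cos(4x) from c = cos(x): apply cos(2a) = 2*cos(a)^2 - 1 twice */
int32_t q15_cos4(int32_t c)
{
    int32_t c2x = 2 * q15_mul(c, c) - Q15_ONE;
    return 2 * q15_mul(c2x, c2x) - Q15_ONE;
}

/* sin(4x) from s = sin(x), c = cos(x): sin(2a) = 2*sin(a)*cos(a), applied twice */
int32_t q15_sin4(int32_t s, int32_t c)
{
    int32_t s2x = 2 * q15_mul(s, c);
    int32_t c2x = 2 * q15_mul(c, c) - Q15_ONE;
    return 2 * q15_mul(s2x, c2x);
}
```

For example, with x = 0.25 rad (8192 in Q15), `q15_cos4(q15_cos_taylor(8192))` approximates cos(1.0). Results are returned as 32-bit values; storing them in 16 bits would need saturation at +1.0.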

tggzzz · « **Reply #35 on:** September 24, 2017, 07:10:39 am »

Consider using one of these https://www.digikey.com/en/product-highlight/x/xmos/startkit which can be regarded as being halfway to being an FPGA

For a miserly £12 you get:

8*100MIPs 32-bit cores, with some instructions specialised for DSP
boards transparently daisy-chainable if you need more cores
continue to program in C/C++
free development environment, Eclipse/LLVM/gdb
excellent flexible, fast, low latency FPGA-like I/O
USB comms

Other processors in the family go up to 4000MIPs, but not on that dev board.

mikeselectricstuff · « **Reply #36 on:** September 24, 2017, 08:51:49 am »

Quote from: tggzzz on September 24, 2017, 07:10:39 am

Consider using one of these https://www.digikey.com/en/product-highlight/x/xmos/startkit which can be regarded as being halfway to being an FPGA

For a miserly £12 you get:
8*100MIPs 32-bit cores, with some instructions specialised for DSP
boards transparently daisy-chainable if you need more cores
continue to program in C/C++
free development environment, Eclipse/LLVM/gdb
excellent flexible, fast, low latency FPGA-like I/O
USB comms

Other processors in the family go up to 4000MIPs, but not on that dev board.

Might be a solution, but would need the algorithm to be pipelined to make good use of all cores.
The OP is only looking for 5-10x the performance of a 67MHz Cortex M3.
It would make the most sense to try conventional MCU or DSP options before going to anything more exotic. It shouldn't take more than a day to get the code and I2C sensors running on a RasPi (or similar) even if you'd never used Linux before. That would immediately give some definite numbers for speed and power consumption to inform decisions on which way to go.
Until that gets done there is little point spending time thinking about anything more exotic.
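
Getting "definite numbers for speed" can be as simple as wrapping the solver in a timing loop. Here is a minimal sketch in standard C, assuming the algorithm can be called as a function with no arguments. `readings_per_second` and `dummy_solve` are hypothetical names, and `dummy_solve` is only a placeholder workload, not the OP's algorithm.

```c
#include <time.h>

typedef void (*solve_fn)(void);

/* Runs solve_once() `iterations` times and returns solves per second of CPU time.
 * Returns -1.0 if the run was too short for clock() to resolve. */
double readings_per_second(solve_fn solve_once, int iterations)
{
    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        solve_once();
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return -1.0;            /* raise `iterations` and try again */
    return iterations / seconds;
}

/* Placeholder workload standing in for one position solve (assumption). */
volatile float dummy_sink;

void dummy_solve(void)
{
    float acc = 0.0f;
    for (int k = 0; k < 1000; k++)
        acc += (float)k * 0.001f;
    dummy_sink = acc;           /* volatile store keeps the loop from being optimised away */
}
```

Calling `readings_per_second(dummy_solve, 100000)` with the real solver substituted gives a figure to compare directly against the 5-10 readings/sec target. Note that `clock()` measures CPU time, so time spent waiting on the I2C sensors would have to be measured separately.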

tggzzz · « **Reply #37 on:** September 24, 2017, 11:13:49 am »

Quote from: mikeselectricstuff on September 24, 2017, 08:51:49 am

Quote from: tggzzz on September 24, 2017, 07:10:39 am
Consider using one of these https://www.digikey.com/en/product-highlight/x/xmos/startkit which can be regarded as being halfway to being an FPGA

For a miserly £12 you get:
8*100MIPs 32-bit cores, with some instructions specialised for DSP
boards transparently daisy-chainable if you need more cores
continue to program in C/C++
free development environment, Eclipse/LLVM/gdb
excellent flexible, fast, low latency FPGA-like I/O
USB comms

Other processors in the family go up to 4000MIPs, but not on that dev board.
Might be a solution, but would need the algorithm to be pipelined to make good use of all cores.

I suspect that many navigation applications have some low-level noise-filtering grunt work, plus some "higher level" integration algorithms. I noted that there were several sensors, and guessed that the one-core-per-peripheral doing a lot of grunt work might be useful.

In addition, pure C doesn't have very useful DSP arithmetic modes, e.g. saturating arithmetic. Since the xCORE devices are aimed at DSP, their facilities might avoid losing performance when running C DSP. But I haven't investigated that, so it would be up to the OP to check my presumptions.
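
As an illustration of the saturating-arithmetic point, this is roughly what a saturating 16-bit add looks like when written in portable C, next to the wrapping behaviour you get by default. The function names are made up for this sketch. Many DSP-oriented instruction sets can do the saturating version in a single instruction, whereas plain C needs the widen-and-compare sequence unless the compiler recognises the pattern.

```c
#include <stdint.h>

/* Saturating add: clamps to the int16_t range instead of wrapping. */
int16_t q15_add_sat(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Wrapping add, for comparison. The final conversion to int16_t is
 * implementation-defined in C, but wraps modulo 2^16 on common compilers. */
int16_t q15_add_wrap(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}
```

With a = 30000 and b = 10000, the saturating version returns 32767, while the wrapping version returns -25536 on a typical two's-complement target, which in a filter or control loop shows up as a large sign-flipped glitch.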

Without knowing the algorithm, it is impossible to say more.

Quote

The OP is only looking for 5-10x the performance of a 67MHz Cortex M3.
It would make the most sense to try conventional MCU or DSP options before going to anything more exotic. It shouldn't take more than a day to get the code and I2C sensors running on a RasPi (or similar) even if you'd never used Linux before. That would immediately give some definite numbers for speed and power consumption to inform decisions on which way to go.
Until that gets done there is little point spending time thinking about anything more exotic.

Agreed. The xCORE devices are very nice w.r.t. precise guaranteed timing and w.r.t. bit-banged IO.

rstofer · « **Reply #38 on:** September 24, 2017, 01:35:14 pm »

Quote from: MasterTech on September 24, 2017, 05:38:50 am

Assembly code is somewhere I don't want to go. It is a nightmare and the algorithm I'm handling sometimes fries my head.

Of course not! But several million Pascal loops per second seems pretty impressive when you consider the slow rate coming from the sensors.

The idea of a dedicated 1.2 GHz processor with floating point just seems impressive. The other 3 processors can deal with grabbing data and doing whatever with the output while the dedicated processor does nothing but crunch numbers.

rstofer · « **Reply #39 on:** September 24, 2017, 06:12:41 pm »

I decided to write a little Pascal program to see how the RPi3 handled floating point. To my surprise, with the Free Pascal compiler, 64-bit is the default for type Real. That's nice because the A53 processor defines 32 (check this) pairs of 32-bit registers to hold the values (d0..d31 are the register names). I haven't been able to find out how many clocks a multiply takes, but it can't be many because the processor, overall, is rated in the 2+ GFlop range.

It seems ARM doesn't produce timing specs for instructions on some processors because it gets swamped by outside factors like memory access, cache hits, and so on.

The processor also does short vector processing. I didn't research this but it could be useful.

tggzzz · « **Reply #40 on:** September 24, 2017, 06:29:38 pm »

Quote from: rstofer on September 24, 2017, 06:12:41 pm

It seems ARM doesn't produce timing specs for instructions on some processors because it gets swamped by outside factors like memory access, cache hits, and so on.

Do you know of any current processors that have decent performance and do have such timing specs?

The only ones I'm aware of are the xCORE processors, which avoid needing interrupts and don't have caches.

chickenHeadKnob · « **Reply #41 on:** September 24, 2017, 09:58:54 pm »

Quote from: tggzzz on September 24, 2017, 06:29:38 pm

Quote from: rstofer on September 24, 2017, 06:12:41 pm
It seems ARM doesn't produce timing specs for instructions on some processors because it gets swamped by outside factors like memory access, cache hits, and so on.

Do you know of any current processors that have decent performance and do have such timing specs?

The only ones I'm aware of are the xCORE processors, which avoid needing interrupts and don't have caches.

The Texas Instruments AM3358 or 3359 in the beaglebone series have 2 PRU units which run at 200Mhz and are deterministic if I recall correctly. Not surprising as they target the same type of problems that xCORE or propeller cpus are intended for.

tggzzz · « **Reply #42 on:** September 24, 2017, 11:10:39 pm »

Quote from: chickenHeadKnob on September 24, 2017, 09:58:54 pm

Quote from: tggzzz on September 24, 2017, 06:29:38 pm
Quote from: rstofer on September 24, 2017, 06:12:41 pm
It seems ARM doesn't produce timing specs for instructions on some processors because it gets swamped by outside factors like memory access, cache hits, and so on.

Do you know of any current processors that have decent performance and do have such timing specs?

The only ones I'm aware of are the xCORE processors, which avoid needing interrupts and don't have caches.

The Texas Instruments AM3358 or 3359 in the beaglebone series have 2 PRU units which run at 200Mhz and are deterministic if I recall correctly. Not surprising as they target the same type of problems that xCORE or propeller cpus are intended for.

A quick scan of the TI AM335x shows the ARM-A8 has cache (of course), the PRU-ICSS have interrupts and 120(!) registers and 12k shared RAM and "limited" peripherals. I haven't assessed the effects those features have in practical systems, but they are orange flags I would want to investigate. The tools do code profiling, which is an orange flag.

The PRU-ICSS appear, in some subtle ways, to be regarded as bolt-ons to the ARM-A8. I'd prefer it to be the other way around!

In the past there have been many many asymmetric multicore processors, and the programming environment has always been an afterthought. Given that, I'm not entirely surprised there's little prominence given to how you program the hard real time parts, and the communications with the other cores. That's a shame, because the best hardware is useless without decent programming environments. I haven't spotted any instruction timings, nor part of the IDE that predicts worst-case timing - pointers would be welcome.

OTOH, the xCORE processors are symmetrical and have an excellent programming environment: xC which is based on Communicating Sequential Processes (also included in Go and Rust, apparently).

rstofer · « **Reply #43 on:** September 24, 2017, 11:57:44 pm »

Quote from: tggzzz on September 24, 2017, 06:29:38 pm

Do you know of any current processors that have decent performance and do have such timing specs?

No, but given that these are register to register operations, it should be possible to determine how many cycles it takes to add or multiply. Add is complicated due to alignment and would probably be omitted or bounded. I can't see any reason ARM couldn't describe the number of clocks required to multiply two reals.

NorthGuy · « **Reply #44 on:** September 25, 2017, 12:30:41 am »

Quote from: rstofer on September 23, 2017, 10:09:41 pm

Other Pascal loops are down around 7 MILLION per second which still might be sufficient.

My old Sandy Bridge does about 2 billion Pascal loops per second with Delphi (fetching a value from a long array).

Code:

for i := 0 to N-1 do begin
  Buf[255] := char(WorkBuf[i]);
end;

tggzzz · « **Reply #45 on:** September 25, 2017, 07:24:03 am »

Quote from: rstofer on September 24, 2017, 11:57:44 pm

Quote from: tggzzz on September 24, 2017, 06:29:38 pm
Do you know of any current processors that have decent performance and do have such timing specs?

No, but given that these are register to register operations, it should be possible to determine how many cycles it takes to add or multiply. Add is complicated due to alignment and would probably be omitted or bounded. I can't see any reason ARM couldn't describe the number of clocks required to multiply two reals.

What guarantees the variables are kept in registers? They probably are, but compiler optimisation algorithms are notoriously fickle and, um, "heuristic".

If the system performance is solely dependent on such an inner-loop, then fine. However in most systems detailed timing is dependent on other factors, e.g. interrupts, memory accesses in other parts of the codebase, etc, etc.

Floating point arithmetic performance is more or less impossible to guarantee, especially if IEEE754 is involved. Not only can operations be short-circuited, but operations involving denormal numbers are often notoriously slow - they often require fixups in software. What happens is very implementation dependent, and therefore highly non-portable.
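
One common software-side mitigation, offered here only as a sketch, is to detect values in the subnormal range and flush them to zero before they circulate in a feedback path such as a filter state. The helper names below are hypothetical, and whether the flush is acceptable depends on the numerical needs of the algorithm.

```c
#include <float.h>

/* True if x is non-zero but smaller in magnitude than the smallest normal double.
 * NaN compares false everywhere, so it is reported as not subnormal. */
int is_subnormal(double x)
{
    return (x != 0.0) && (x > -DBL_MIN) && (x < DBL_MIN);
}

/* Replaces subnormal values with zero and passes everything else through. */
double flush_subnormal(double x)
{
    return is_subnormal(x) ? 0.0 : x;
}
```

Some processors also offer a hardware flush-to-zero mode, but how to enable it is platform-specific and not covered here.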

mikeselectricstuff · « **Reply #46 on:** September 25, 2017, 08:21:36 am »

Quote from: tggzzz on September 25, 2017, 07:24:03 am

What guarantees the variables are kept in registers? They probably are, but compiler optimisation algorithms are notoriously fickle and, um, "heuristic".

In principle you can use the register qualifier to tell the compiler which variables to optimise most, but how successful this is (or if you can even tell what it has done) will depend on the compiler

tggzzz · « **Reply #47 on:** September 25, 2017, 08:42:11 am »

Quote from: mikeselectricstuff on September 25, 2017, 08:21:36 am

Quote from: tggzzz on September 25, 2017, 07:24:03 am

What guarantees the variables are kept in registers? They probably are, but compiler optimisation algorithms are notoriously fickle and, um, "heuristic".
In principle you can use the register qualifier to tell the compiler which variables to optimise most, but how successful this is (or if you can even tell what it has done) will depend on the compiler

My understanding is that all non-trivial compilers have ignored the "register" hint for at least 30 years!

Someone · « **Reply #48 on:** September 25, 2017, 09:47:36 am »

Quote from: tggzzz on September 25, 2017, 08:42:11 am

Quote from: mikeselectricstuff on September 25, 2017, 08:21:36 am
Quote from: tggzzz on September 25, 2017, 07:24:03 am

What guarantees the variables are kept in registers? They probably are, but compiler optimisation algorithms are notoriously fickle and, um, "heuristic".
In principle you can use the register qualifier to tell the compiler which variables to optimise most, but how successful this is (or if you can even tell what it has done) will depend on the compiler
My understanding is that all non-trivial compilers have ignored the "register" hint for at least 30 years!

There are many compilers targeting embedded targets that strictly follow the C register keyword, especially when you want to mix C and assembly knowing the critical instructions and data.
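
For reference, this is what the `register` qualifier under discussion looks like in a typical inner loop. It is a minimal sketch with a made-up function name. The C standard treats the keyword as a hint only: the one thing it guarantees is that you may not take the address of such a variable, so whether `acc`, `pa` and `pb` actually live in registers has to be checked in the generated assembly for your particular compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit vectors with a 32-bit accumulator.
 * The caller is assumed to keep n small enough that acc cannot overflow. */
int32_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    register int32_t acc = 0;           /* hint: keep the accumulator in a register */
    register const int16_t *pa = a;     /* hint: keep both pointers in registers */
    register const int16_t *pb = b;

    while (n--)
        acc += (int32_t)(*pa++) * (*pb++);

    /* &acc would be a compile error here: register variables have no address. */
    return acc;
}
```

Pinning a variable to one specific named register is beyond what plain `register` can express and needs a compiler-specific extension.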

legacy · « **Reply #49 on:** September 25, 2017, 09:53:25 am »

Quote from: tggzzz on September 24, 2017, 11:10:39 pm

In the past there have been many many asymmetric multicore processors, and the programming environment has always been an afterthought

Which ones?

