Author Topic: 32F417 SPI running at one third the speed it should (Read 8382 times)

peter-h · « **Reply #50 on:** January 21, 2022, 05:39:10 pm »

Very good point, and I agree. By the time you have designed a PCB, most of the allocation of features is fixed.

For example I could have made good use of SPI1 (42MHz clock is possible) but lost SPI1 because I needed four UARTs, with some GPIO for them. And loads of stuff like that.

Siwastaja · « **Reply #51 on:** January 21, 2022, 06:14:29 pm »

Yeah, I stopped trying to write "generic" driver code when multiple peripherals work with each other (typical example: DMA & SPI, indeed). It's a HUGE task to get right, and usually you get just a half-ass solution which does half of the job, in other words, is useless.

But if you just write what you exactly need, the amount of code ends up so small that it's also quick to modify or rewrite, when necessary.

As you probably already realized by looking at DMA channel mapping list, there is quite limited choice available. If you need more than 1 DMA stream and more than 1 peripheral, chances are you just can't do it without going and redesigning the PCB to use a different peripheral instantiation; peripheral X might be only available on Stream Y, but peripheral Z is also only available on Stream Y, so you can't get both. (This is fixed in the F7/H7 series by introducing the DMAMUX allowing basically equivalent of >100 "channels" per stream; every peripheral available on every stream.)

This, combined with the fact that the instantiations of peripherals (like SPI1, SPI2, SPI3...) might be slightly different, makes attempts of creating a universal code interface where you can just swap between different SPIs, quite futile, unless you are ready to sacrifice something. The STM32 HAL also suffers from this compromise.

nctnico · « **Reply #52 on:** January 21, 2022, 06:30:36 pm »

Quote from: Siwastaja on January 21, 2022, 06:14:29 pm

Yeah, I stopped trying to write "generic" driver code when multiple peripherals work with each other (typical example: DMA & SPI, indeed). It's a HUGE task to get right, and usually you get just a half-ass solution which does half of the job, in other words, is useless.

Dang you are evil... you sure know how to bring back bad memories... a very long time ago I worked on a project with a web-based front-end. The programmers of that front-end spend ages on their universal layout engine for tables. In every project meeting they would come up with new and wild ideas about how to make it more universal while even in the current state it wasn't up the the task. A never ending saga!

SiliconWizard · « **Reply #53 on:** January 21, 2022, 06:57:01 pm »

Quote from: Siwastaja on January 21, 2022, 06:14:29 pm

Yeah, I stopped trying to write "generic" driver code when multiple peripherals work with each other (typical example: DMA & SPI, indeed). It's a HUGE task to get right, and usually you get just a half-ass solution which does half of the job, in other words, is useless.

Yes, this is something that pops up on a regular basis. I agree with you there. Rewriting too generic driver code is usually a waste of time, and the fact you think you're going to do it better than vendors (with their libraries) while taking the same approach is also delusional.

Now we may need to define what "generic" means here.
If that means, as vendors usually do, writing drivers that cover the complete features of every peripheral you use, then yes, it's the above approach, and is wasteful at best.
Now if that means writing drivers that cover your use case *only*, but with generic enough parameters (when possible) that porting them to other targets would be much easier - not having to change software interfaces is always a big plus - then go "generic". =)

peter-h · « **Reply #54 on:** January 22, 2022, 07:31:49 am »

I managed to break the code last night, and it turned out to be replacing

Code: [Select]

__HAL_SPI_CLEAR_OVRFLAG(hspi);
with (per the RM)

Code: [Select]

volatile uint32_t tmp;
tmp=SPI2->DR;
tmp=SPI2->SR;

but I originally did just

Code: [Select]

SPI2->DR;
SPI2->SR;

which someone here told me does a read of that register(s). Well, it does not, and I need to go through a lot of unrelated code to see where I may have used that.

I also get an "unused" compiler warning on tmp, which the macro gets around by using

Code: [Select]

/** @brief  Clear the SPI OVR pending flag.
  * @param  __HANDLE__ specifies the SPI Handle.
  *         This parameter can be SPI where x: 1, 2, or 3 to select the SPI peripheral.
  * @retval None
  */
#define __HAL_SPI_CLEAR_OVRFLAG(__HANDLE__)        \
  do{                                              \
    __IO uint32_t tmpreg_ovr = 0x00U;              \
    tmpreg_ovr = (__HANDLE__)->Instance->DR;       \
    tmpreg_ovr = (__HANDLE__)->Instance->SR;       \
    UNUSED(tmpreg_ovr);                            \
  } while(0U)

The UNUSED(tmp); fixes it.

Siwastaja · « **Reply #55 on:** January 22, 2022, 08:25:31 am »

Quote from: peter-h on January 22, 2022, 07:31:49 am

Code: [Select]
SPI2->DR; SPI2->SR;
which someone here told me does a read of that register(s). Well, it does not

It definitely does, unless you have written your own header which fails to qualify SPI2 as volatile, or use a broken compiler.

I do exactly that all the time and never had a problem.

Did you check the disassembly to verify the absence of reading the registers? Or are you just assuming it didn't happen? To me, this sounds like a timing issue where any random change fixed it. Because now, with the tmp as volatile, write to tmp is also forced, so there will be longer delay between the two reads.

I.e., original did this (in pseudo-assembly):
load SPI2->DR
load SPI2->SR

And the current incarnation does:
load SPI2->DR
store tmp # causes delay!
load SPI2->SR
store tmp

BTW, I always tend to read SR first, then DR. Don't know if this is relevant at all, just saying.

peter-h · « **Reply #56 on:** January 22, 2022, 09:14:17 am »

I will check the asm. Could also be optimisation level. Or a silicon bug which requires a delay between the reads.

But what is the objective? To clear the OVR flag, yeah, sure, but if you want to clear out all RX data, the macro won't do that. It removes just 1 value, The RX can contain 2 bytes! I've been around this block so many times. To empty out a UART with a 16 byte FIFO you have to do 17 reads. So I think one should read DR twice, to be 100% sure there is nothing in there which will bite you next time because the SPI RX won't be empty.

Code: [Select]

#define SPI2 ((SPI_TypeDef *) SPI2_BASE)
and then all this

Code: [Select]

/**
  * @brief  SPI handle Structure definition
  */
typedef struct __SPI_HandleTypeDef
{
  SPI_TypeDef                *Instance;      /*!< SPI registers base address               */

  SPI_InitTypeDef            Init;           /*!< SPI communication parameters             */

  uint8_t                    *pTxBuffPtr;    /*!< Pointer to SPI Tx transfer Buffer        */

  uint16_t                   TxXferSize;     /*!< SPI Tx Transfer size                     */

  __IO uint16_t              TxXferCount;    /*!< SPI Tx Transfer Counter                  */

  uint8_t                    *pRxBuffPtr;    /*!< Pointer to SPI Rx transfer Buffer        */

  uint16_t                   RxXferSize;     /*!< SPI Rx Transfer size                     */

  __IO uint16_t              RxXferCount;    /*!< SPI Rx Transfer Counter                  */

  void (*RxISR)(struct __SPI_HandleTypeDef *hspi);   /*!< function pointer on Rx ISR       */

  void (*TxISR)(struct __SPI_HandleTypeDef *hspi);   /*!< function pointer on Tx ISR       */

  DMA_HandleTypeDef          *hdmatx;        /*!< SPI Tx DMA Handle parameters             */

  DMA_HandleTypeDef          *hdmarx;        /*!< SPI Rx DMA Handle parameters             */

  HAL_LockTypeDef            Lock;           /*!< Locking object                           */

  __IO HAL_SPI_StateTypeDef  State;          /*!< SPI communication state                  */

  __IO uint32_t              ErrorCode;      /*!< SPI Error code                           */

#if (USE_HAL_SPI_REGISTER_CALLBACKS == 1U)
  void (* TxCpltCallback)(struct __SPI_HandleTypeDef *hspi);             /*!< SPI Tx Completed callback          */
  void (* RxCpltCallback)(struct __SPI_HandleTypeDef *hspi);             /*!< SPI Rx Completed callback          */
  void (* TxRxCpltCallback)(struct __SPI_HandleTypeDef *hspi);           /*!< SPI TxRx Completed callback        */
  void (* TxHalfCpltCallback)(struct __SPI_HandleTypeDef *hspi);         /*!< SPI Tx Half Completed callback     */
  void (* RxHalfCpltCallback)(struct __SPI_HandleTypeDef *hspi);         /*!< SPI Rx Half Completed callback     */
  void (* TxRxHalfCpltCallback)(struct __SPI_HandleTypeDef *hspi);       /*!< SPI TxRx Half Completed callback   */
  void (* ErrorCallback)(struct __SPI_HandleTypeDef *hspi);              /*!< SPI Error callback                 */
  void (* AbortCpltCallback)(struct __SPI_HandleTypeDef *hspi);          /*!< SPI Abort callback                 */
  void (* MspInitCallback)(struct __SPI_HandleTypeDef *hspi);            /*!< SPI Msp Init callback              */
  void (* MspDeInitCallback)(struct __SPI_HandleTypeDef *hspi);          /*!< SPI Msp DeInit callback            */

#endif  /* USE_HAL_SPI_REGISTER_CALLBACKS */
} SPI_HandleTypeDef;

Siwastaja · « **Reply #57 on:** January 22, 2022, 10:11:53 am »

Oh, SPI is trivial: you need to do exactly the same number of writes and reads to the DR. This is regardless of whether the "FIFO" is just one word long like on F417, or longer like on F7 series. This all stems from how the SPI protocol works.

Reading DR when no data is available may not cause any problems, but it isn't beneficial either and messes with your mind. Especially if you have a longer FIFO.

So no, an extra dummy read to "empty" the FIFO is definitely not required, you can't make empty more empty. The correct pattern is to see if RX FIFO is not empty (polling or interrupt) in this case RXNE, and if it's not, then do one read. If the RX FIFO had more than one word in it, another RXNE event (new interrupt, or polling using while() loop would succeed again) would immediately follow, causing another readout.

But since the RX "FIFO" is just 1 word, it can't contain 2 words. If RXNE = 1, then DR read returns valid data, and then RXNE = 0; you should read this value if you check. Unless your timing is so marginal that RXNE goes back to 1 immediately after reading out the previous data, but then you are on the brink of overrun anyway.

Do you even need to read SR? I don't remember, every STM32 peripheral is different and some require dummy reads on status register for no apparent reason when just reading the data out of DR would logically do. But this is at least usually documented.

The problem is you assuming, instead of checking. Earlier you assumed that compiler does not generate the reads, wasting a lot of time on it and writing a forum post about how the compiler does not do the reads. And now it turned out you never checked and are ready to go for the next theory about the delays. Whether the reads are generated or not would take just one minute to see from the disassembly.

One question, why are you posting a screenshot of I2S section? You don't have I2S, right? What makes you think anything in the I2S section is relevant to SPI use?

peter-h · « **Reply #58 on:** January 22, 2022, 01:20:21 pm »

"One question, why are you posting a screenshot of I2S section? You don't have I2S, right? What makes you think anything in the I2S section is relevant to SPI use?"

Because the RM sez so :

Follow the link.

In reality few people care for the overrun flag (unless interrupting from it, which almost nobody does). What really matters is that the RX channel is empty - otherwise you start by reading a byte of crap out of it next time.

"So no, an extra dummy read to "empty" the FIFO is definitely not required, you can't make empty more empty"

True but what is the alternative? You need to have a loop on RXNE, and keep reading until RXNE goes false. Then (if you are a "proper software engineer"

) you need a timeout on that loop, which is yet more fun if you want it to work in both startup code (ints disabled) and later code (ints + RTOS etc). You end up spending time doing it using something like CYCCNT, when reading N bytes is going to do the job perfectly.

Siwastaja · « **Reply #59 on:** January 22, 2022, 02:13:48 pm »

Well OK I understand, you are just looking at the OVR flag clear procedure, which is probably the same in SPI in I2S modes given they refer you to this section.

However, why do you want to clear the OVR flag?

"few people care for the overrun flag"; well then, I'm one of those "few" people. It is a serious error condition which indicates unrecoverable data loss, and it's most trivial to enable OVR interrupt and attach the error reporting function there, job done, no need to ever look at the flag. I normally enable all possible error interrupt sources, let them end up in the same handler, and there dump error code and all relevant status registers to UART or whatever, maybe once a second so you can attach the cable and see.

Do not depend on planned overruns which you then clear. Obviously in some cases you don't care about the data, and you can just let it overrun, but when the data matters, you don't want overruns at all; and being the SPI master yourself, preventing the overruns 100% is trivial timing work, it's your task as a programmer to guarantee, leaving the OVR flag as error detection mechanism only.

So I think recovery sequence is irrelevant then. For the same reason, I don't use timeouts for core functionality where timeout could be only triggered for two reason: colossal fuckup by me, or actual silicon failure. Sure, people love to throw around unhandled or poorly handled timeouts everywhere, but recovering from serious failures is hard. Do it well, or don't do it at all. (Latter usually works.) But even if you don't try to recover, do check for such failures and add a handler which reports a helpful error code, instead of the software starting to act randomly.

SPI master is deterministic and you have all the control. Having FIFO of 1 word, you don't need to loop on RXNE, only single event can happen. Read it out before the next one, guarantee this by code design and you are fine.

And if you don't have time to do that, then that's what DMA is for.

peter-h · « **Reply #60 on:** January 22, 2022, 02:33:53 pm »

Well, I agree on error recovery. When I stripped down that HAL SPI code, I removed all the timeouts because they either cannot happen except with duff silicon or they cannot be usefully handled at that (deep) level.

I've never implemented checking serial comms error flags

The error must be detected by the higher level code which then retries, etc. But for a retry to work you do have to clean out the RX "UART" channel, otherwise you will be forever a byte short

DMA is good, and I have just learnt a lot

I will now have a go at using DMA for this USB code. Not even sure the 32F417 DMA can connect to that USB FIFO. I don't think it can. There is a dedicated DMA for the USB I think (like there is for ETH, where you really need it) but is it not available in FS mode (only in HS mode). But if the USB FIFO has an address then surely one could do a memory-memory transfer? There is no "busy" status.

nctnico · « **Reply #61 on:** January 22, 2022, 02:39:56 pm »

Quote from: Siwastaja on January 22, 2022, 08:25:31 am

Quote from: peter-h on January 22, 2022, 07:31:49 am
Code: [Select]
SPI2->DR; SPI2->SR;
which someone here told me does a read of that register(s). Well, it does not

It definitely does, unless you have written your own header which fails to qualify SPI2 as volatile, or use a broken compiler.

This is interesting. It is the first time I ever see this mentioned. I just checked with GCC 4.6.2; the statement does perform a read but it does not result in less instructions compared to reading a value into a dummy variable (either optimised for size or speed ). That makes sense because in the end the data that is read must end up in a register even if it isn't used. I do get a warning from the IDE that it is a statement without effect though.

peter-h · « **Reply #62 on:** January 22, 2022, 02:57:08 pm »

I struggle to unravel the ST #defines and structs etc but DR itself it certainly declared as __IO i.e. volatile, otherwise very little code would ever work.

This is what I see

Siwastaja · « **Reply #63 on:** January 22, 2022, 02:59:45 pm »

Quote from: nctnico on January 22, 2022, 02:39:56 pm

Quote from: Siwastaja on January 22, 2022, 08:25:31 am
Quote from: peter-h on January 22, 2022, 07:31:49 am
Code: [Select]
SPI2->DR; SPI2->SR;
which someone here told me does a read of that register(s). Well, it does not

It definitely does, unless you have written your own header which fails to qualify SPI2 as volatile, or use a broken compiler.
This is interesting. It is the first time I ever see this mentioned. I just checked with GCC 4.6.2; the statement does perform a read but it does not result in less instructions compared to reading a value into a dummy variable (either optimised for size or speed ). That makes sense because in the end the data that is read must end up in a register even if it isn't used. I do get a warning from the IDE that it is a statement without effect though.

That is because the IDE does not understand the C language, which is no surprise, because IDEs are second-tier projects developed with much less understanding, compared to compilers which need to work.

This is because by very definition of C language, evaluating a value of a variable qualified volatile is not a statement without effect. The memory access is the effect, and needs to be generated.

Now indeed if you just assign the result of the evaluation to a normal variable which isn't qualified volatile, compiler is free to optimize that variable (and assignment to it) away and obviously will do it. But note that peter-h made that local variable volatile as well, and now compiler must write to it.

I prefer not to add any temporary variable because that generates an unnecessary warning, which then needs to be silenced, leading to excess code. I like my code to be "to the point". If I want read access to a register, I just write exactly that, it also conveys the meaning of "dummy" read very well. If I assign it to a temporary variable, it then raises questions about why this variable is never used.

Quote from: peter-h on January 22, 2022, 02:57:08 pm

I struggle to unravel the ST #defines and structs etc but DR itself it certainly declared as __IO i.e. volatile, otherwise very little code would ever work.

Yes, tracking down the actual "volatile" keyword through all the #defines requires some work, but all IO registers eventually are volatile, as you say otherwise nothing would work at all.

peter-h · « **Reply #64 on:** January 22, 2022, 03:00:35 pm »

So I went back to my non-working code, and it looks like SPI2->DR; does actually do those reads

peter-h · « **Reply #65 on:** January 22, 2022, 03:07:03 pm »

So I went back to my original hypothesis about clearing out the full two byte / two 16-bit word RX queue, with

Code: [Select]

   	SPI2->DR;
   	SPI2->DR;
   	SPI2->SR;

and it works

So the ST code, where they just clear out the OVR flag, is BS.

The scenario where this breaks is where you just happen, by design, to not read the RX data, which with SPI happens if just sending data. This is of course common. You can't prevent "stuff" being received. I guess ST got away with this by a combination of accidents. This is the relevant bit of their code (HAL_SPI_Transmit)

Code: [Select]

    /* Transmit data in 8 Bit mode */

  	// The need for this initial byte is unknown
    if (initial_TxXferCount == 0x01U)
    {
    	*((__IO uint8_t *)&SPI2->DR) = (*hspi->pTxBuffPtr);
    	hspi->pTxBuffPtr++;
    	hspi->TxXferCount--;
    }

    while (hspi->TxXferCount > 0U)
    {
    	/* Wait until TXE flag is set before loading data */
    	if (__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE))
    	{
    		*((__IO uint8_t *)&SPI2->DR) = (*hspi->pTxBuffPtr);
    		hspi->pTxBuffPtr++;
    		hspi->TxXferCount--;
    	}
    }

    //========= PH addition ===========
    // Best way to make sure TX is finished is by waiting for the RX byte (which is garbage) to arrive
    // Potential for hanging here...
    while (__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE))
    {
    	SPI2->DR;		// extract the byte and dump it
    }
    // ====================

    // Clear overrun flag in 2 Lines communication mode, just in case
   	__HAL_SPI_CLEAR_OVRFLAG(hspi);

    if (hspi->ErrorCode != HAL_SPI_ERROR_NONE)
    {
    	errorcode = HAL_ERROR;
    }

error:
  	return errorcode;

}

So the bit I added at the very end, to make it blocking for correct CS=1, masked the issue.

peter-h · « **Reply #66 on:** January 24, 2022, 11:44:43 am »

Well, I got another taste of the "loosely coupled" peripherals in this uC. At the end of this code (discussed earlier) I had a GPIO waggle, to get a pulse on a scope to measure timing. This evening, having finished with it all, I removed that line, and the code stopped working. Running my test code I was finding that the last byte (byte 511) of each page was wrong. I had noticed previously that this could be induced by not fully clearing RX data (in the "dump rx data" mode, where SPI is merely sending), so I spent a bit of time tracking it down.

It turned out that a small delay is needed before doing the final SPI2->DR (2 of them).

Code: [Select]

	  	while ((TxXferCount > 0U) || (RxXferCount > 0U))
	  	{
	  		/* Do a TX if TX empty */
			if ( txallowed && (TxXferCount > 0) )
			{
				if ( (SPI2->SR & (1<<1)) != 0 )
			    {
					*(__IO uint8_t *)&SPI2->DR = *pTxBuffPtr;
			    	pTxBuffPtr++;
			    	TxXferCount--;
			    	txallowed=rxdump;					// block another TX until RX done (tx always ok in rxdump)
			    }
			}

		    /* Do an RX if RX not empty */
			if ( !rxdump )
			{
				if ( ((SPI2->SR & 1) != 0) && (RxXferCount > 0) )
				{
	   				(*(uint8_t *)pRxBuffPtr) = SPI2->DR;
	   		    	pRxBuffPtr++;
	   		    	RxXferCount--;
	   		    	txallowed = true;					// allow tx again
		    	}
			}
	  	}

		#ifdef TIMING_DEBUG
	  		ADS1118_an_2p5v_external(1);
		#endif

  	}

    // Clear any rx data and the overrun flag in case not all received data was read; mandatory in rxdump mode
	// Have to read DR 2x to be sure the RX is cleared. The ST command __HAL_SPI_CLEAR_OVRFLAG(__HANDLE__) does
	// only 1 read of DR and this failed in rxdump mode, if DR was not read above (which it actually is).
   	// A small delay is needed because the very last transfer may not propagate to DR.

   	hang_around_us(1);
   	SPI2->DR;
   	SPI2->DR;
   	SPI2->SR;

   	return (true);

}

One could also solve it with a 3rd SPI2->DR but it's not clear with how much margin.

The APB buses are 42MHz but there are so many clocks between the CPU and the SPI...

Siwastaja · « **Reply #67 on:** January 24, 2022, 12:01:46 pm »

You can always consider resetting the SPI peripheral through the RCC reset registers. It doesn't take long, you need two writes to RCC (activate and deactivate reset), and then re-configure the SPI peripheral which is usually just a few register writes. This way, you should reliably end up with all the shift registers, buffers, and any other internal state equal to the reset state of the MCU. Maybe. Hopefully.

If you need to depend on cycle accurate synchronism between your code and peripherals, it's not going to work reliably. You need to remove the race conditions completely, not work around them.

peter-h · « **Reply #68 on:** January 24, 2022, 12:37:03 pm »

It is stuff which looks like what one might expect interfacing to a slow device of any kind.

The SPI runs at APB clock, 42MHz in this case, so let's say you get a RXNE status bit set, you can definitely read out that byte. But if say you had a 4 byte RX FIFO, with its propagation driven on a 42MHz clock, and you read it with a CPU running on a 168MHz clock, and you want to empty out that FIFO and dump the data, and you do

SPI2->DR;
SPI2->DR;
SPI2->DR;
SPI2->DR;

you may not get all four bytes - because they could not ripple through fast enough. I may be wrong in this; it is well possible that the SPI2->DR read operation itself is slow enough to enable the FIFO to propagate.

If I am right, then there are two approaches:

wait
SPI2->DR;
wait
SPI2->DR;
wait
SPI2->DR;
wait
SPI2->DR;

or have a loop testing RXNE, but that needs a timeout of some sort.

I don't get why ST run the peripherals from such a slow clock. They should have run them from the CPU clock. Lots of people have got their fingers burnt with this.

Anyway, I now have what looks like solid code; I can't break it

In DMA mode, 16 bit SPI mode, or 8 bit SPI mode. The 16 bit mode is used instead of DMA if the RAM is CCM.

peter-h · « **Reply #69 on:** February 03, 2022, 07:34:22 am »

Just in case someone finds this useful, below is the source for the 32F417 SPI3 DMA.

I had lots of problems getting it to work and it sounds like the DMA "streams" cannot be used for more than one device even if they are on different channels, so e.g. you cannot do this

Learn something every day! Basically there are just eight DMA controllers (on DMA1): stream 0-7. The "channel" is just connecting these controllers to the peripherals.

Code: [Select]

/*
*
* DMA-only version of the ultra-bloated HAL SPI TransmitReceive() but fixed for SPI3 and XX-only options added.
*
* For use where fast transfers are needed, on the limit of SPI3 speed so with zero gaps. This is impossible
* to do by polling at 10.5mbps or 21mbps and is probably marginal at 5.25mbps. The 16 bit SPI mode just
* manages gap-free with polling but works only with even block sizes, and has the "first byte problem" which
* DMA gets around.
*
* ** DMA ONLY SO NO CCM ACCESS SO THE TWO BUFFERS HAVE TO BE "STATIC" **
*
* This function is blocking so the caller can set CS=1 right away (check device data sheet!). A non-blocking
* version would make sense only if transmitting only (txonly=true) but the caller would have to tidy up
* the DMA, SPI3->CR2, etc.
*
* Two modes, obviously mutually exclusive, for tx-only and rx-only:
* If txonly, dumps rx data so you don't need to allocate a buffer for it
* If rxonly, transmits all-0x00 so you don't need to feed SPI with some "known garbage"
*
* The rx-only mode is superfluous in most cases but it does avoid shifting non-0x00 data to the device
* while we are reading data out of it. With some devices this can matter. The ADS1118 ADC is one such.
*
* SPI3 basic config is done elsewhere.
*
* Returns (false) if memory is in CCM, size=0, null pointers...
*
* NULL pointers are allowed if using the txonly or rxonly modes; then the unused one can be NULL.
*
* Originally streams 2+5 were used but it didn't work and produced bizzare effects like transfers hanging
* with the transfer counters having 1 and 127 in them, for a required count of 1. Changing the streams to
* 0+7 fixed it. Seemingly a "stream" cannot be shared across peripherals, and stream 5 is used for DAC1
* even though the "channel" is different.
*
*/

bool SPI3_DMA_TransmitReceive(uint8_t *pTxData, uint8_t *pRxData, uint16_t Size, bool txonly, bool rxonly)
{

  	// Check for invalid inputs

	if ( Size==0 ) return (false);
  	if ( (pTxData==NULL) && !rxonly ) return (false);
  	if ( (pRxData==NULL) && !txonly ) return (false);

	uint32_t txadd = (uint32_t) pTxData;
	uint32_t rxadd = (uint32_t) pRxData;
	static uint8_t txonly_target=0;					// must not be in CCM
	static uint8_t rxonly_source=0;					// must not be in CCM

	// Test for CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K (linkerscript.ld)
	if ( ((txadd>=0x10000000) && (txadd<0x10010000)) || ((rxadd>=0x10000000) && (rxadd<0x10010000)) )
	{
		return(false);
	}

	// DMA ch 1 clock enable already done in b_main.c
	//RCC->AHB1ENR |= (1u << 21);					// DMA1EN=1 - DMA1 clock enable
	//hang_around_us(1);							// give it a chance to wake up

	// DMA1 Ch 0 Stream 0 is SPI3 RX

	DMA1_Stream0->CR = 0;							// disable DMA so all regs can be written

	DMA1->LIFCR = (0x03d << 0);						// clear int flags & transfer complete - 111101 stream 0

	DMA1_Stream0->NDTR = Size;

	if (txonly)
	{
		DMA1_Stream0->M0AR = (uint32_t) &txonly_target;		// memory address to dump rx data to
	}
	else
	{
		DMA1_Stream0->M0AR = rxadd;					// memory address in normal mode
	}

	DMA1_Stream0->PAR = (uint32_t) &(SPI3->DR);		// peripheral address
	DMA1_Stream0->FCR = 0;							// direct mode

	if (txonly)
	{
		DMA1_Stream0->CR = 0 << 25					// CHSEL: ch 0
						|  0 << 23			  		// MBURST: memory burst - single transfer
						|  0 << 21					// PBURST: peripheral burst - single transfer
						|  3 << 16					// PL: highest priority
						|  0 << 15					// PINCOS: no peripheral address increment offset
						|  0 << 13					// MSIZE: memory data size: byte
						|  0 << 11					// PSIZE: peripheral data size: byte
						|  0 << 10					// MINC: memory address increment: 0
						|  0 << 9					// PINC: peripheral address increment: 0
						|  0 << 8					// CIRC: no circular mode
						|  0 << 6					// DIR: peripheral to memory
						|  0 << 5					// PFCTRL: DMA is flow controller
						|  1 << 0;					// EN: enable stream
	}
	else
	{
		DMA1_Stream0->CR = 0 << 25					// CHSEL: ch 0
						|  0 << 23			  		// MBURST: memory burst - single transfer
						|  0 << 21					// PBURST: peripheral burst - single transfer
						|  3 << 16					// PL: highest priority
						|  0 << 15					// PINCOS: no peripheral address increment offset
						|  0 << 13					// MSIZE: memory data size: byte
						|  0 << 11					// PSIZE: peripheral data size: byte
						|  1 << 10					// MINC: memory address increment: 1
						|  0 << 9					// PINC: peripheral address increment: 0
						|  0 << 8					// CIRC: no circular mode
						|  0 << 6					// DIR: peripheral to memory
						|  0 << 5					// PFCTRL: DMA is flow controller
						|  1 << 0;					// EN: enable stream
	}

	// DMA1 Ch 0 Stream 7 is SPI3 TX

	DMA1_Stream7->CR = 0;							// disable DMA so all regs can be written

	DMA1->HIFCR = (0x03d << 22);					// clear int flags & transfer complete - 111101 stream 7

	DMA1_Stream7->NDTR = Size;

	if (rxonly)
	{
		DMA1_Stream7->M0AR = (uint32_t) &rxonly_source;		// memory address to fetch dummy tx data from
	}
	else
	{
		DMA1_Stream7->M0AR = txadd;					// memory address in normal mode
	}

	DMA1_Stream7->PAR = (uint32_t) &(SPI3->DR);		// peripheral address
	DMA1_Stream7->FCR = 0;							// direct mode

	if (rxonly)
	{
		DMA1_Stream7->CR = 0 << 25					// CHSEL: ch 0
						|  0 << 23			  		// MBURST: memory burst - single transfer
						|  0 << 21					// PBURST: peripheral burst - single transfer
						|  0 << 16					// PL: priority low
						|  0 << 15					// PINCOS: no peripheral address increment offset
						|  0 << 13					// MSIZE: memory data size: byte
						|  0 << 11					// PSIZE: peripheral data size: byte
						|  0 << 10					// MINC: memory address increment: 0
						|  0 << 9					// PINC: peripheral address increment: 0
						|  0 << 8					// CIRC: no circular mode
						|  1 << 6					// DIR: memory to peripheral
						|  0 << 5					// PFCTRL: DMA is flow controller
						|  1 << 0;					// EN: enable stream
	}
	else
	{
		DMA1_Stream7->CR = 0 << 25					// CHSEL: ch 0
						|  0 << 23			  		// MBURST: memory burst - single transfer
						|  0 << 21					// PBURST: peripheral burst - single transfer
						|  0 << 16					// PL: priority low
						|  0 << 15					// PINCOS: no peripheral address increment offset
						|  0 << 13					// MSIZE: memory data size: byte
						|  0 << 11					// PSIZE: peripheral data size: byte
						|  1 << 10					// MINC: memory address increment: 1
						|  0 << 9					// PINC: peripheral address increment: 0
						|  0 << 8					// CIRC: no circular mode
						|  1 << 6					// DIR: memory to peripheral
						|  0 << 5					// PFCTRL: DMA is flow controller
						|  1 << 0;					// EN: enable stream
	}

	// Config SPI3 to let DMA handle the data. These need to be cleared when transfer complete!
	// This starts the transfer

	SPI3->CR2 |= 3;									// TXDMAEN, RXDMAEN: 11 - both set in one go
	SPI3->CR1 |= (1<<6);							// SPE=1 enable SPI

	// Wait for DMA to finish. Blocking is necessary to prevent device CS=1 too early.
	// There could be a timeout here but a failure is impossible short of duff silicon, because
	// we are a Master and generating the SPI clock.

	while(true)
	{

		// Either method below worked fine

		//uint16_t temp1;
		uint32_t temp2;

		//temp1 = DMA1_Stream0->NDTR;
		//if ( temp1 == 0 ) break;					// transfer count = 0

		temp2 = DMA1->LISR;
		if ( (temp2 & (1<<5)) !=0 ) break;			// TCIF0

	}

	SPI3->CR2 &= ~3;								// TXDMAEN, RXDMAEN: 00 - both cleared in one go

	// Clear int pending flags. They get cleared at the top of this function anyway, but...

	DMA1->LIFCR = (0x03d << 0);					// clear int flags & transfer complete - 111101 stream 0
	DMA1->HIFCR = (0x03d << 22);				// clear int flags & transfer complete - 111101 stream 7

    // Clear any rx data and the overrun flag in case not all received data was read

	SPI3->CR1 &= ~(1<<6);							// SPE=0 disable SPI

   	SPI3->DR;
	SPI3->DR;
   	SPI3->SR;

   	return (true);

}

Siwastaja · « **Reply #70 on:** February 03, 2022, 09:52:31 am »

This is what I told you earlier.

Yes, it's limiting. Luckily, SPI3_TX is also available on Stream7, so just use it.

Sometimes you get in a corner where you just can't make it work, unless you change to a different SPI for example, so you need to look at these tables already at PCB design stage.

STM32H7 comes with a DMAMUX which allows every peripheral connected to every DMA stream; basically over 100 "channels" instead of just 7.

peter-h · « **Reply #71 on:** February 03, 2022, 10:49:58 am »

I read every page of the data sheet before doing the PCB, and on the whole got everything right. A lot of time was spent on it. I didn't read the 2000 page RM

Everything worked 1st time. Well, not the USB cable insertion detection... done on issue B.

Looking back on it, the only thing not optimal was that SPI1 was lost, to get the four serial ports, and losing SPI1 means losing 42mbps SPI option (2 and 3 can do 21mbps max) which makes the FatFS file system a bit slower, but that doesn't matter, and would have compromised the min baud rate on the UART(s) driven from the same clock as SPI1 (can't recall which).

Looking at what DMA may be needed for one day, DMA2 can do the two ADCs. That could be clever e.g. storing readings in a circular buffer and then some code can add them up and divide by how many, for an easy FIR filter. I can't think what DMA would be good for for timers; perhaps storing captured values?

Yes I am using streams 0+7 now, for SPI3. You are right - I didn't have to move stream 2 to stream 0.

Siwastaja · « **Reply #72 on:** February 03, 2022, 01:54:56 pm »

Quote from: peter-h on February 03, 2022, 10:49:58 am

I read every page of the data sheet before doing the PCB, and on the whole got everything right. A lot of time was spent on it. I didn't read the 2000 page RM

Me neither; one could think the datasheet contains all the basic circuitry and peripheral mapping to allow a PCB design - I pretty much used to ignore the reference manual at PCB design stage, until reality hits in the face, and then I learn, and then I want to remind others of these traps, too.

Two examples:

* The first design to actually need more than just a few DMA channels: two SPIs, one UART + ADC, for total of 7 streams. This combination hit the inability to do DMA channel mapping at all. A different UART or different SPI would have worked out, so it was solvable with pin changes. Would have required board respin, which I didn't want to do, so used interrupt-driven UART in the end; not a big problem, but good to remember.

* Capacitor on NRST pin. Many reference designs leave this unconnected, I also did that for many many designs. Until one day I had STM32H743 which does not boot, at all. Touching NRST with oscillosscope probe makes it boot. The datasheet does not tell you this: it shows a reset button and a capacitor across the button, circled as a combination called "External reset circuit", and capacitor described as something "protecting the device from parasitic resets". Fine, I don't have a button, I don't need debouncing, and I don't get parasitic resets, the CPU never boots at all. So this can't be the problem, right?

But lo and behold, the reference manual, where this kind of circuitry thing does not belong at all, suddenly has this circuit diagram (RM0433 fig. 43), and it has completely different explanation. It shows the capacitor alone, and goes on: "when it [NRST] is not used, connect this pin to ground via a 10 to 100 nF capacitor".

So yes, reading that 3319 pages of reference manual before PCB design would be highly recommendable. Of course if the design does not use Ethernet or CAN or camera interface or DMA2D graphics accelerator etc., you can skip most of the pages.

And I actually read all of it (excluding obviously unused peripherals) before committing to project. I think I read it for a week or two, day in day out. And I still missed that capacitor thing!

peter-h · « **Reply #73 on:** February 03, 2022, 03:57:39 pm »

Hmmm... I have NRST unconnected too. Didn't put a cap on it because the debugger drives it. But presumably 10nF should be safe?

Given it has a ~40k pullup, one would need a fairly powerful radiated spike to pull it down for long enough. This is the data sheet, showing 100nF

Just tested 10nF and it works fine with STLINK V3 ISOL.

I thought about DMA and UARTs and it isn't generally useful, due to the inevitably low baud rates (below 1mbps) in NRZ async encoding. I've been in serial comms for > 40 years and very few products use high rates. Most uses 9.6k

and 115.2k is the highest generally.

Ethernet uses its own DMA - at least if using the ST library code.

I am not sure the "camera interface" is useful today. It is quite slow.

harerod · « **Reply #74 on:** February 03, 2022, 05:12:02 pm »

I just finished another STM32 design where the NRST ends up "all over the board" (slight exaggeration). MCU, DCDC-open-drain-power-good, JTAG, secondary circuitry. I prefer to connect NRST to a 10k pull-up and a 100nF to ground. This cappa should be placed closely to the MCU.
An example where this concept doesn't work, would be some new ATtiny (e.g. ATTINY814). Here the RESET-pin doubles as programming interface. It may take 1..4.7nF or so. The point is that this is not a RESET pin in the normal sense.
The NRST on STM32F4 is a real RESET pin. Thus we can add protection. And any professional programmer (JLINK, STLINK) can handle an 100nF "Angstkondensator".


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: 32F417 SPI running at one third the speed it should (Read 8382 times)

Share me