-
I am getting warnings when compiling some ST Cube code. These are the two functions
/**
* @brief USB_WritePacket : Writes a packet into the Tx FIFO associated
* with the EP/channel
* @param USBx Selected device
* @param src pointer to source buffer
* @param ch_ep_num endpoint or host channel number
* @param len Number of bytes to write
* @param dma USB dma enabled or disabled
* This parameter can be one of these values:
* 0 : DMA feature not used
* 1 : DMA feature used
* @retval HAL status
*/
HAL_StatusTypeDef USB_WritePacket(USB_OTG_GlobalTypeDef *USBx, uint8_t *src, uint8_t ch_ep_num, uint16_t len, uint8_t dma)
{
uint32_t USBx_BASE = (uint32_t)USBx;
uint32_t *pSrc = (uint32_t *)src;
uint32_t count32b, i;
if (dma == 0U)
{
count32b = ((uint32_t)len + 3U) / 4U;
for (i = 0U; i < count32b; i++)
{
USBx_DFIFO((uint32_t)ch_ep_num) = *((__packed uint32_t *)pSrc);
pSrc++;
}
}
return HAL_OK;
}
/**
* @brief USB_ReadPacket : read a packet from the Tx FIFO associated
* with the EP/channel
* @param USBx Selected device
* @param dest source pointer
* @param len Number of bytes to read
* @param dma USB dma enabled or disabled
* This parameter can be one of these values:
* 0 : DMA feature not used
* 1 : DMA feature used
* @retval pointer to destination buffer
*/
void *USB_ReadPacket(USB_OTG_GlobalTypeDef *USBx, uint8_t *dest, uint16_t len)
{
uint32_t USBx_BASE = (uint32_t)USBx;
uint32_t *pDest = (uint32_t *)dest;
uint32_t i;
uint32_t count32b = ((uint32_t)len + 3U) / 4U;
for (i = 0U; i < count32b; i++)
{
*(__packed uint32_t *)pDest = USBx_DFIFO(0U);
pDest++;
}
return ((void *)pDest);
}
The macro USBx_DFIFO is
#define USBx_DFIFO(i) *(__IO uint32_t *)(USBx_BASE + USB_OTG_FIFO_BASE + ((i) * USB_OTG_FIFO_SIZE))
The macro __IO is just "volatile" and is used for CPU registers.
The code works. It is a loop which copies across 64 bytes, in 16 moves, each one 4 bytes. One could use DMA for it (this came up in a thread here long ago) but by the time DMA was set up, you don't gain anything.
I am getting these warnings:
(https://peter-ftp.co.uk/screenshots/20221017454920611.jpg)
Someone had a look at this and thought the "packed" is completely meaningless, and there is a lot of ST Cube code which that applies to, and indeed removing it doesn't break anything, but I wonder if anyone here can think of a reason the programmer did that. One normally uses 'packed' on structs.
-
It is not meaningless, it indicates that alignment may be arbitrary, not 4 bytes in this case. You can see that pSrc was obtained by casting uint8_t* buffer to uint32_t*, so the buffer is generally not aligned.
But I'm not sure GCC ever supported it that way, I think this is ARM/Keil feature. I'm not even sure they implemented it when moving to Keil5, which is based on Clang. But I guess they might have for compatibility.
-
I don't think 'packed' makes any sense for a pointer. It's an attribute meant for a struct definition. 'Packed' guarantees that struct members are not padded. It is absolutely meaningless for qualifying a pointer.
-
It makes sense for the pointed data though. __packed here says that uint32_t pointed by the pointer is not aligned (pointer lower bits are not 00). Without this the code here is not correct. This cast " uint32_t *pSrc = (uint32_t *)src;" is wrong given that src is uint8_t*.
So compiler would be forced to generate 4 bytes load/stores instead of one word load/store for architectures that don't support unaligned access.
-
It makes sense for the pointed data though. __packed here says that uint32_t pointed by the pointer is not aligned (pointer lower bits are not 00). Without this the code here is not correct. This cast " uint32_t *pSrc = (uint32_t *)src;" is wrong given that src is uint8_t*.
So compiler would be forced to generate 4 bytes load/stores instead of one word load/store for architectures that don't support unaligned access.
I see, but I don't think GCC supports that, at least with this attribute. I'll have to check.
-
That's what I'm saying. This is supported by old Keil (<5) before they moved to Clang.
It looks like it is supported by new versions too:
https://developer.arm.com/documentation/dui0472/k/Compiler-Coding-Practices/The---packed-qualifier-and-unaligned-data-access-in-C-and-C---code?lang=en
and
https://developer.arm.com/documentation/dui0472/k/Compiler-specific-Features/--packed There is an example at the end where individual field of a struct is packed.
No idea if this comes from Clang itself, or they added this for Keil.
-
GNUC supports things like
struct __dealign_uint16 { uint16_t datum; } __attribute__((packed));
struct __dealign_uint32 { uint32_t datum; } __attribute__((packed));
struct __dealign_uint64 { uint64_t datum; } __attribute__((packed));
to misalign data, but I don't think there is a type generic way for non-struct data types.
-
I just played a bit with godbolt. GCC and Clang support packing of individual struct members:
struct foo
{
char one;
__packed short two;
char three;
int four;
} c;
But not packing for the individual variables.
-
I just played a bit with godbolt. GCC and Clang support packing of individual struct members:
struct foo
{
char one;
__packed short two;
char three;
int four;
} c;
But not packing for the individual variables.
That's a useful thing to remember. I've only packed whole structures for things like the dealign macros I showed earlier, and making a structure match the layout of packets.
-
Can anyone suggest the right way to do this which runs fastest?
I can probably make sure the RAM buffer is 4 byte aligned. Currently, AFAICT by working through a dozen nested calls in the horrid USB code, the buffer is
static uint8_t cdc_receive_temp_buffer[64];
static uint8_t cdc_out_buf[256];
Neither is aligned, but presumably putting __attribute__ ((aligned (4))) after each would do it.
In reality both buffers are aligned but that could be just fortunate since both are in the FreeRTOS stack area and there are no objects smaller than 4 bytes before them
static void USBThread(void *argument)
{
#define CDCTXBUFSIZE 256
MX_USB_DEVICE_Init();
g_USB_started=true; // Enables port 0 (KDE -> PC) output functions
uint32_t ms_before_eject = 0;
static uint8_t cdc_out_buf[CDCTXBUFSIZE]; // linear buffer for USB data
The 32F4 doesn't need alignment for uint32_t but it would run slower.
-
The most portable way is to combine the byte buffer into a single variable by shifting each byte into the multi-byte variable. Anything is just praying that it works. It is the main reason why I avoid casting multi-byte variables / structs onto byte buffers.
-
Besides,
#include <stdint.h>
uint32_t get_u32le(const unsigned char *const src) {
return ((uint32_t)(src[0])) | ((uint32_t)(src[1]) << 8) | ((uint32_t)(src[2]) << 16) | ((uint32_t)(src[3]) << 24);
}
uint32_t get_u32be(const unsigned char *const src) {
return ((uint32_t)(src[0]) << 24) | ((uint32_t)(src[1]) << 16) | ((uint32_t)(src[2]) << 8) | ((uint32_t)(src[3]));
}
uint32_t get_u32(const unsigned char *const src) {
#if __BYTE_ORDER__ == __LITTLE_ENDIAN__
return ((uint32_t)(src[0])) | ((uint32_t)(src[1]) << 8) | ((uint32_t)(src[2]) << 16) | ((uint32_t)(src[3]) << 24);
#else
return ((uint32_t)(src[0]) << 24) | ((uint32_t)(src[1]) << 16) | ((uint32_t)(src[2]) << 8) | ((uint32_t)(src[3]));
#endif
}
compiles to quite acceptable code with both GCC and clang on all architectures when using -Og, -Os, or -O2.
The only one I wasn't sure about is risc-v (rv32gc), but even on that one,
#include <stdint.h>
struct word32 {
uint32_t u32 __attribute__((packed));
} __attribute__((packed));
uint32_t get_u32(const unsigned char *const src) {
return ((const struct word32 *)src)->u32;
}
compiles to the same code; I assume rv32gc doesn't like unaligned 32-bit accesses. On x86-64, where unaligned accesses are okay (just slower), these compile to a simple load (plus a bswap for get_u32be).
-
The most portable way is to combine the byte buffer into a single variable by shifting each byte into the multi-byte variable.
I do that too, to ensure byte order when storing e.g. a uint32 in an EEPROM.
But this is different. It needs to be very fast. The CPU register is 32 bits wide and it should be done in 16 moves.
-
The CPU register is 32 bits wide and [copying a 64-byte buffer to it] should be done in 16 moves.
Just because you do 16 stores, does not mean you have to do exactly 16 loads, too.
HAL_StatusTypeDef USB_WritePacket(USB_OTG_GlobalTypeDef *USBx, uint8_t *src, uint8_t ch_ep_num, uint16_t len, uint8_t dma)
{
uint32_t USBx_BASE = (uint32_t)USBx;
uint32_t *pSrc = (uint32_t *)src;
uint32_t count32b, i;
if (dma == 0U)
{
count32b = ((uint32_t)len + 3U) / 4U;
for (i = 0U; i < count32b; i++)
{
USBx_DFIFO((uint32_t)ch_ep_num) = get_u32(pSrc + i);
}
}
return HAL_OK;
}
void *USB_ReadPacket(USB_OTG_GlobalTypeDef *USBx, uint8_t *dest, uint16_t len)
{
uint32_t USBx_BASE = (uint32_t)USBx;
uint32_t *pDest = (uint32_t *)dest;
uint32_t i;
uint32_t count32b = ((uint32_t)len + 3U) / 4U;
for (i = 0U; i < count32b; i++)
{
set_u32(pDest + i, USBx_DFIFO(0U);
}
return ((void *)pDest);
}
which on stm32f4, because it supports unaligned 32-bit accesses, will optimize the get_u32() and set_u32() calls to ordinary loads and stores. You can implement them for example as
struct unaligned_u32 {
uint32_t u32 __attribute__((packed));
} __attribute__ ((packed));
static inline uint32_t get_u32(const void *const src)
{
return ((const struct unaligned_u32 *)src)->u32;
}
static inline uint32_t set_u32(void *const dst, uint32_t val)
{
((struct unaligned_u32 *)dst)->u32 = val;
return val;
}
In any case, a better solution is to use
HAL_StatusTypeDef USB_WritePacket(USB_OTG_GlobalTypeDef *USBx, const uint32_t *src, uint_fast8_t ch_ep_num, uint_fast8_t words, uint_fast8_t dma)
{
const uintptr_t USBx_BASE = (uintptr_t)USBx;
if (dma == 0U) {
const uint32_t *p = src;
const uint32_t *const q = src + words;
while (p < q) {
USBx_DFIFO((uint32_t)ch_ep_num) = *(p++);
}
}
return HAL_OK;
}
uint32_t *USB_ReadPacket(USB_OTG_GlobalTypeDef *USBx, uint32_t *dst, uint_fast8_t words)
{
const uintptr_t USBx_BASE = (uintptr_t)USBx;
uint32_t *p = dst;
uint32_t *const q = dst + words;
while (p < q) {
*(p++) = USBx_DFIFO(0U);
}
return dst;
}
and declaring uint32_t cdc_out_buf[(CDCTXBUFSIZE + 3)/4];.
If you need to access the buffer data at some byte offset, you can always use (offset + (unsigned char *)cdc_out_buf) as the unsigned char pointer to that byte.
-
Unfortunately you have stretched my C expertise past its limit there :)
What is actually wrong with the existing (ST) code assuming the buffers are explicitly 4-aligned?
Just because you do 16 stores, does not mean you have to do exactly 16 loads, too.
I think it does.
AIUI, if a RAM buffer, accessed as uint32, is not 4-aligned, and given that in many/most cases the 32F4 requires a 32 bit reg to be written (and sometimes read, too, AFAIK, especially status registers where a read clears a bit so you need to read all 32 in one go) via a 32 bit variable, a 16 word move from RAM to a 32 bit reg will perform 64 RAM reads and 16 register writes, or the opposite (16 reg reads and 64 RAM writes). Nobody is likely to notice because the RAM access is zero wait state (7ns cycle time) but it will be slower by 336ns (48x7).
FWIW the code shows the stupidity of ST code because dma=0 always (on USB FS; only USB HS uses DMA) :) I see I removed that test from one of the two functions...
-
I can probably make sure the RAM buffer is 4 byte aligned.
This is the easiest and cleanest way of dealing with this.
Neither is aligned, but presumably putting __attribute__ ((aligned (4))) after each would do it.
You don't even need non-standard attributes. C11 includes stdalign.h, so all you need to do is this:
#include <stdalign.h>
static alignas(4) uint8_t cdc_receive_temp_buffer[64];
static alignas(4) uint8_t cdc_out_buf[256];
-
Thanks. Yes, I found those macros too.
Mods done, and it still runs :)
-
I posted the question on th ST forum (where usually there are no replies) and got one
https://community.st.com/s/question/0D73W000001i4NaSAI/detail?fromEmail=1&s1oid=00Db0000000YtG6&s1nid=0DB0X000000DYbd&s1uid=0053W000001ojJ3&s1ext=0&emkind=chatterCommentNotification&emtm=1666118332031&t=1666121247984
This is for read/write a multibyte value (32-bit in this case) from/to unaligned location.
For gcc, __packed should be defined somewhere as __attribute__((packed)).
Alternatively, for any compiler use the CMSIS (ARM-defined) macros __UNALIGNED_UINT32_READ, __UNALIGNED_UINT32_WRITE -
I don't understand it.
Has to be said though that just because the 64 byte buffer is aligned, nobody actually knows whether all USB transfers are aligned within that buffer.
I think, for receiving, the USB ISR (the USB code is totally ISR based) always starts at the start of that 64 byte buffer but it may transfer a number of bytes which is not a multiple of 4, and that means the last buffer write will be done one byte at a time (1-3 bytes).
And for transmission, where I am supplying a 256 byte buffer, I again think it starts from the beginning of it but have no idea what it does with it. And since USB CDC never transfers more than 64 bytes, if I give it say 211 bytes in that buffer, it must be doing 64,64,64 and then 19.
USB MSC is always 512 bytes (transferred as 8 x 64), and those functions are different:
/**
* @brief .
* @param lun: .
* @retval USBD_OK if all operations are OK else USBD_FAIL
*/
int8_t STORAGE_Read_FS(uint8_t lun, uint8_t *buf, uint32_t blk_addr, uint16_t blk_len)
{
Adesto_FLASH_ReadPage(buf, blk_len * STORAGE_BLK_SIZ, blk_addr * STORAGE_BLK_SIZ);
return (USBD_OK);
}
and the transfer from FLASH to a 512 byte buffer which USB MSC then reads is done with DMA, in Adesto_FLASH_ReadPage(). I ought to check those buffers are aligned too.
-
I just played a bit with godbolt. GCC and Clang support packing of individual struct members:
struct foo
{
char one;
__packed short two;
char three;
int four;
} c;
But not packing for the individual variables.
It supports this (https://godbolt.org/z/PWYjeoq7z), for instance.
-
I don't understand it.
They are saying that you need to use those alignment attributes. But I don't see it working with GCC when used like this. No idea where this code was tried. May be there is some version of GCC that supports this use.
I think, for receiving, the USB ISR (the USB code is totally ISR based) always starts at the start of that 64 byte buffer but it may transfer a number of bytes which is not a multiple of 4, and that means the last buffer write will be done one byte at a time (1-3 bytes).
The code you provided assumes that the buffer size is multiple of 4 bytes and only transfers words. USB FIFO is word-only transfer and the code makes no attempts to split the word into bytes. No idea if this is an oversight or an API requirement.
And for transmission, where I am supplying a 256 byte buffer, I again think it starts from the beginning of it but have no idea what it does with it. And since USB CDC never transfers more than 64 bytes, if I give it say 211 bytes in that buffer, it must be doing 64,64,64 and then 19.
You still write multiple of 4 bytes, but then the length register specifies how many bytes to actually transfer. If your buffer is not a multiple of 4 you will just write some junk at the end, but it will never be transmitted, so it does not matter.
-
I just played a bit with godbolt. GCC and Clang support packing of individual struct members:
struct foo
{
char one;
__packed short two;
char three;
int four;
} c;
But not packing for the individual variables.
It supports this (https://godbolt.org/z/PWYjeoq7z), for instance.
Yes, that would do the trick. As ataradov mentioned above, using alignas(1) instead of a GCC attribute would do the same and be standard C11, although alignas() doesn't seem to be accepted in the context of a type definition, unfortunately.
With all that said, there's something I don't quite get. If you want the compiler to enforce the alignment you set for a given structure whena accessing fields, just dereference a pointer to the structure instead of messing with pointers to members (and risk getting alignment completely wrong.)
struct->field will access the field with whatever alignment was defined for the structure being pointed to.
-
They are saying that you need to use those alignment attributes. But I don't see it working with GCC when used like this. No idea where this code was tried. May be there is some version of GCC that supports this use.
Sure, I get that, but he doesn't say why, and more to the point doesn't say why it works without them. I have just intentionally misaligned the CDC buffer by 1 byte and it still all runs.
May be there is some version of GCC that supports this use.
I am now on GCC v10 and the ST code I am using was brought into the project ~ 3 years ago when GCC was on v7 (Cube was 1.4). This is from my notes when testing it:
Cube 1.5.1 has a compiler version 7.3.1 20180622.
Cube 1.6.1 has a compiler version 9.3.1 20200408.
Cube 1.7.0 is the same as 1.6.1.
Cube 1.8.0 is the same again (no change in compiler or linker)
Cube 1.9.0 has a compiler v10 which does stricter checking of various things
Cube 1.10 is same as 1.9 but crashes more often
So maybe v7 supported it. But the fact is that it works with those __packed attribs ignored. And the code does get tested exhaustively all day; both CDC and MSC. That said, an overhead of 3xx ns per 64 bytes will go totally unnoticed. This is a tiny fraction of nothing :)
I have the CDC buffers aligned but can't find the 512 byte MSC buffers, due to the way the MSC functions are called via multiple levels of indirection, via function tables etc. FWIW, a breakpoint on the r/w functions shows the buffers are aligned.
-
And the code does get tested exhaustively all day;
It is not about how hard you test and given binary. Your buffers may shift after you make a change to the code. But when it happens the failure would be obvious and instant.
-
I did misalign them, as I wrote above. Still works. And arguably it should work, given that the 32F4 supports unaligned 32 bit loads and stores.
IIRC, some members of the 32F family do not support unaligned data.
-
And arguably it should work, given that the 32F4 supports unaligned 32 bit loads and stores.
Well yes, if unaligned traps are not enabled, it would work. The only thing to keep in mind is that ldrd/strd do not support unaligned access, but that's generally not a problem.
All Cortex-M3/M4/M7 devices should support unaligned access. I don't think there is even an option in IP to disable that.
-
And arguably it should work, given that the 32F4 supports unaligned 32 bit loads and stores.
Well yes, if unaligned traps are not enabled, it would work. The only thing to keep in mind is that ldrd/strd do not support unaligned access, but that's generally not a problem.
All Cortex-M3/M4/M7 devices should support unaligned access. I don't think there is even an option in IP to disable that.
Not sure either.
Now for Aarch64 (which is not the OP's concern here, but just saying), unaligned access is not supported unless you configure the MMU - learned that when writing baremetal code for the RPi 4.
-
I posted the question on th ST forum (where usually there are no replies)
Of your 39 questions there, 3 went with no replies.
[EDIT]Okay, 4, as this (https://community.st.com/s/question/0D53W00000mm8NnSAI/usb-slave-does-the-code-need-to-change-usbdm-usbdp-pa11-pa12-to-inputs-when-vbussense0) is no reply either. I just counted the zeros.
JW
-
What is actually wrong with the existing (ST) code assuming the buffers are explicitly 4-aligned?
That USB FIFO works explicitly in 32-bit units, so feeding it a byte array and a byte length (which is silently rounded up to a multiple of four), is just the wrong approach.
Just because you do 16 stores, does not mean you have to do exactly 16 loads, too.
I think it does.
Nope. Using the functions I showed, it does 64 byte loads on architectures where unaligned 32-bit loads are not supported.
That's not all. You can also do 15 32-bit reads, and 4 byte reads (for a total of 19 reads), and exactly 16 writes.
You do this by checking the lowest two bits of the source pointer, and pick one of four functions, one for each alignment (correct, off-by-1, off-by-2, off-by-3). The aligned one is trivial. For the unaligned ones, you combine two 32-bit registers into a 64-bit word, to use as a barrel shifter recovering the alignment.
Consider the case where the buffer address low bits are 11, meaning there is a single byte, followed by 15 aligned 32-bit words, followed by three bytes. You load the first byte to the low bits of the upper word, and the first full word to the low word:
|00000000 00000000 00000000 aaaaaaaa|bbbbbbbb cccccccc dddddddd eeeeeeee|
and then rotate left by 24 bits, getting
|aaaaaaaa bbbbbbbb cccccccc dddddddd|eeeeeeee 00000000 00000000 00000000|
You write the upper word, now perfectly aligned. Then, you rotate left by 8 bits:
|bbbbbbbb cccccccc dddddddd eeeeeeee|00000000 00000000 00000000 00000000|
For the rest of the aligned 32-bit words, you repeat from the first step, reading the next aligned 32-bit word to the low part. For the three last bytes, you read them into the high bits of the low part instead:
|bbbbbbbb cccccccc dddddddd eeeeeeee|xxxxxxxx yyyyyyyy zzzzzzzz 00000000|
and do a final rotate left by 24 bits, getting the final 32-bit word,
|eeeeeeee xxxxxxxx yyyyyyyy zzzzzzzz 00000000|00000000 00000000 00000000 00000000|
writing the upper word. You're now done.
For an offset of k bits, the shifts are k and 32-k bits. This works with even non-byte-aligned inputs, you see.
You do not need hardware bit rotate support or even 64-bit type support to do this, either; you can use
struct u32pair {
uint32_t lo;
uint32_t hi;
};
static inline u32pair u32pair_shift_left(struct u32pair p, uint_fast8_t n)
{
return (struct u32pair){ .hi = (p.hi << n) | (p.lo >> (32 - n)), .lo = p.lo << n);
}
See?
Yes, the cost is four shifts per 32-bit word output, plus some extra work for the initial and final unaligned bytes, but that wasn't the point. The point is that just because you need to write exactly 16 32-bit words from a possibly unaligned buffer, it does not mean you have to do 16 unaligned 32-bit reads from said unaligned buffer: you can do 15 aligned 32-bit reads and 4 byte reads instead, plus some extra work per 32-bit word.
(If you don't see, just say so: I can write the functions [and test and verify they work as I described, which is the annoying bit] for you to examine at leisure.)
-
Of your 39 questions there, 3 went with no replies.
Did you count the
- replies with little or no information content
- replies weeks later
- replies by "Piranha" telling one that they are a useless idiot, can't read, and providing info which is deliberately incomplete so "Piranha" gets multiple chances to tell the person that they are a useless idiot who can't read (many others have been on the receiving end of this; not just me)
That USB FIFO works explicitly in 32-bit units, so feeding it a byte array and a byte length (which is silently rounded up to a multiple of four), is just the wrong approach.
Do we actually know this? All the info I have seen is in ST'd code. Debugs in the tx function (done at high speed using ITM console) on "len" show this:
18 18 64 34 26 4 36 18 18 9 64 34 18 4 28 18 4 38 2 4 2 26 1 36 13 36 13 12 13 8 13 7 8 13 64 64 64 64 64 64 64 64 13 23 13 23 13 8 13 8 9 13 64 64 64 64 64 64 64 64 64 13 34 64 64 64 64 64 64 64 64 4 13 8 13 8 13 38 64 64 64 64 64 64 64 64 13 36 64 64 64 64 64 64 64 64 26 13 8 13 8 13 64 64 64 64 64 64 64 64 13 8 13 64 64 64 64 64 64 64 64 13 8 13 8 13 64 64 64 64 64 64 64 64 13 8 13 13 8 13 8 13 64 64 64 64 64 64 64 64 13 13 64 64 64 64 64 64 64 64 13 8 13 8 13 64 64 64 64 64 64 64 64 13 8 13 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 13 13 8 13 8 13 64 64 64 64 64 64 64 64 13 64 64 64 64 64 64 64 64 13 8 13 8 13 64 64 64 64 64 64 64 64 13 8 13 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 13 13 8 13 8 13 64 64 64 64 64 64 64 64 13 64 64 64 64 64 64 64 64 13 8 13 8 13 64 64 64 64 64 64 64 64 13 8 13 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 13 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 13 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 13 23 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 8 13 8 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 13 13 13 13 13 13 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 13 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 13 13 13 13 7 7 7 7 13 13 13 13 13 13 13 13 13 13 24 13 5 64 64 64 64 50 10 6 4 6 4 19 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 6 2 7 64 12 2 3 2 3 2 3 1 13 13 13 13 13 13 13 13 13 13 13 13 13
so you can see it is any number from 1 to 64. Circumstantially I think most of the 64s above are MSC (which always transfers 8x64=512 rapidly) and the rest are CDC; I have Teraterm running on a VCP port.
So clearly the 32F4 USB hardware is happy to get 4 byte multiples but it discards the extra.
wek may well know all this.
But, hey, I did something innovative: googled on USBx_DFIFO (googling on register names is what 73.5% of ST coders do), and found this
https://community.st.com/s/question/0D50X0000BSXYHCSQ5/sw4stm32-doesnt-like-the-project-generated-by-stmcubemx-v53 (https://community.st.com/s/question/0D50X0000BSXYHCSQ5/sw4stm32-doesnt-like-the-project-generated-by-stmcubemx-v53)
:) :)
That is for 32F7xx. The last post was unanswered
Or the compiler will actually generate code to access unaligned data by bytes and reassemble?
but does it make sense? Is there a need for different machine code if the buffer is unaligned?
Eventually we come full circle :)
https://www.eevblog.com/forum/microcontrollers/32f417-usb-fs-(not-hs)-and-dma/ (https://www.eevblog.com/forum/microcontrollers/32f417-usb-fs-(not-hs)-and-dma/)
-
That USB FIFO works explicitly in 32-bit units, so feeding it a byte array and a byte length (which is silently rounded up to a multiple of four), is just the wrong approach.
Do we actually know this?
Even if it internally discards extra bytes, it still works in explicitly 32-bit units. Making the buffers 32-bit avoids all the alignment problems. And, as I showed, if one needs to access individual bytes in the buffer, that is still trivially possible using a simple cast.
Edited to add: Obviously, the USB messages themselves are not 32-bit aligned. Only the interface to the USB FIFO features on that MCU is 32-bit. And because it is, it makes a lot of sense to make the buffer interface 32-bit and 32-bit aligned also.
-
Putting the data in a buffer and specifying the size are two separate operations. It is not an actual FIFO. It is cleared after each frame is sent, so you can write in multiples of 4, the remaining 1-3 bytes will be discarded.
-
Putting the data in a buffer and specifying the size are two separate operations. It is not an actual FIFO. It is cleared after each frame is sent, so you can write in multiples of 4, the remaining 1-3 bytes will be discarded.
I suspected that would be the case.
Nevertheless, because the interface processes the messages in 32-bit units, making the C interface use 32-bit units for the message buffers makes most sense. Abstraction, making the interface consume (unaligned) arbitrary buffers, is incorrect here if the hardware does not always support unaligned 32-bit accesses. It is better to ensure alignment, because it is the Path of Least Surprises.
Perhaps the following would be acceptable? Data buffer is explicitly 32-bit units, to ensure proper alignment, but the length is in bytes? (You have much more experience with these than I do.)
HAL_StatusTypeDef USB_WritePacket(USB_OTG_GlobalTypeDef *USBx, const uint32_t *src, uint_fast8_t ch_ep_num, uint_fast16_t bytes, uint_fast8_t dma)
{
const uintptr_t USBx_BASE = (uintptr_t)USBx;
if (dma == 0U) {
const uint32_t *p = src;
const uint32_t *const q = src + (bytes + 3) / 4;
while (p < q) {
USBx_DFIFO((uint32_t)ch_ep_num) = *(p++);
}
}
return HAL_OK;
}
uint32_t *USB_ReadPacket(USB_OTG_GlobalTypeDef *USBx, uint32_t *dst, uint_fast16_t bytes)
{
const uintptr_t USBx_BASE = (uintptr_t)USBx;
uint32_t *p = dst;
uint32_t *const q = dst + (bytes + 3) / 4;
while (p < q) {
*(p++) = USBx_DFIFO(0U);
}
return dst;
}
Although, I must admit, the way the USBx_DFIFO() macro refers USBx_BASE gives me the willies. I'd really prefer
#define USBx_DFIFO(base, ep) (*(__IO uint32_t *)((unsigned char *)(base) + USB_OTG_FIFO_BASE + (uintptr_t)(ep) * USB_OTG_FIFO_SIZE))
so that the loop innards above become
USBx_DFIFO(USBx, ch_ep_num) = *(p++);
and
*(p++) = USBx_DFIFO(USBx, 0);
respectively.
-
Of your 39 questions there, 3 went with no replies.
Did you count the
- replies with little or no information content
- replies weeks later
- replies by "Piranha" telling one that they are a useless idiot, can't read, and providing info which is deliberately incomplete so "Piranha" gets multiple chances to tell the person that they are a useless idiot who can't read (many others have been on the receiving end of this; not just me)
Yes. And I counted also the information provided by unpaid volunteers, who provide genuine information and don't have time to custom-tailor them to your very particular setup, so the burden of understanding and implementing it is upon you.
That USB FIFO works explicitly in 32-bit units, so feeding it a byte array and a byte length (which is silently rounded up to a multiple of four), is just the wrong approach.
This is true, but irrelevant. The __packed attribute is placed on the user buffer, not the USB hardware FIFO.
As I've said over on the forum where nobody ever answers, the code has already changed. I don't know if it's perfect now, I don't use Cube.
JW
-
I've designed hardware FIFOs in the past (FPGA & FPGA -> ASIC) and they were always bytewide, if you want to end up with bytes at the output (which, with USB, you do).
So I think that 32 bit data register just gets dumped to a bytewide FIFO, in four clocks, until the byte count is exhausted and then the dumping stops.
Come to think of it, I can't see where the above byte count (1-64) is actually specified. It probably isn't, and the USB controller always transfers a multiple of 4 into a FIFO, and then extracts the required # of bytes from there (it knows how many it wants) and clears the FIFO.
I am a "simple" C programmer, and keep my code very obvious, because I am working alone and need to understand it myself. I had an occassion recently where I wanted to compare two uint8_t 2k buffers fast, and memcmp was not particularly fast (surprisingly; it is always claimed to be optimised) so I tried a loop which does it 32 bits at a time, but failed to make it work. I am sure it is dead simple.
I see wek has replied in the ST forum, affirmatively that the existing code should work regardless of the buffer being aligned, on the 32F4, but there were past issues with other 32F CPUs and this code is a fossil from there.
the code has already changed.
Where can it be found?
who provide genuine information and don't have time to custom-tailor them to your very particular setup, so the burden of understanding and implementing it is upon you.
You do, for which I thank you; not many others, and some of the other stuff is positively offensive (that guy needs to get a life).
-
That USB FIFO works explicitly in 32-bit units, so feeding it a byte array and a byte length (which is silently rounded up to a multiple of four), is just the wrong approach.
This is true, but irrelevant. The __packed attribute is placed on the user buffer, not the USB hardware FIFO.
My comment was about the C interface, so it most definitely is relevant.
Instead of simply marking the buffer unaligned, so the compiler will deal with any alignment issues, I claim that it is better to use the data type native to the interface (32-bit words) instead, since it also ensures the buffers are aligned, and reminds the programmers that this interface really prefers the buffers to be aligned. It does not preclude using the buffers in a byte-wise manner, since casting the buffer to an unsigned char pointer does exactly that. Path of Least Surprise and all.
-
I've designed hardware FIFOs in the past (FPGA & FPGA -> ASIC) and they were always bytewide, if you want to end up with bytes at the output (which, with USB, you do).
In 32-bitters, byte-aligned bus interface is an extra burden (read: silicon area, logic complexity hence speed, etc.), especially if you incorporate things like DMA (and that in both simple and scatter-gather form, optionally). The Synopsys OTG-USB module we are talking about here is a mammoth, in all senses of the word.
The other USB module (mostly device-only) used in "lower-end" STM32 is even more weird in this respect, as it for historical reasons uses 16-bit data, in some STM32 models mapped to 32-bits (yeah!) and in others accessed strictly-16-bit (http://efton.sk/STM32/gotcha/g104.html).
In both cases, the burden of packing/unpacking/realigning/whatever the naturally byte-natured data is upon the programmer, but at the moment somebody starts to write a "library", it is expected by the users of "library" that it's that "library" which takes upon itself that burden.
> Come to think of it, I can't see where the above byte count (1-64) is actually specified.
This includes 0.
We've discussed this already: the Synopsys OTG has the entire transfer detached into two separate processes - "set up transfer" and "actually dump data" - to allow for the optional DMA. Number of bytes to be transferred is written in the former into the respective DIEPTSIZx/DOEPTSIZx register.
JW
-
That USB FIFO works explicitly in 32-bit units, so feeding it a byte array and a byte length (which is silently rounded up to a multiple of four), is just the wrong approach.
This is true, but irrelevant. The __packed attribute is placed on the user buffer, not the USB hardware FIFO.
My comment was about the C interface, so it most definitely is relevant.
Indeed, I was just impatient, sorry.
Instead of simply marking the buffer unaligned, so the compiler will deal with any alignment issues, I claim that it is better to use the data type native to the interface (32-bit words) instead, since it also ensures the buffers are aligned, and reminds the programmers that this interface really prefers the buffers to be aligned. It does not preclude using the buffers in a byte-wise manner, since casting the buffer to an unsigned char pointer does exactly that. Path of Least Surprise and all.
I see your point. Let me not disagree while offering an alternative view: Path of User's Absolute Comfort.
the burden of packing/unpacking/realigning/whatever the naturally byte-natured data is upon the programmer, but at the moment somebody starts to write a "library", it is expected by the users of "library" that it's that "library" which takes upon itself that burden.
JW
-
All Cortex-M3/M4/M7 devices should support unaligned access
Yup, indeed! ARM is usually tolerant, other architectures are not.
It's a serious problem with machines like
- MIPS (R5k...R16K, MIPS32, MIPS64)
- PPC embedded (PPC4xx)
- SH4 embedded
their load/store don't support unaligned access.
/* uint08 */ when (IO.size isEqualTo 1 ) ----> no problem
/* uint16 */ when ((IO.size isEqualTo 2 ) and (modulo(addr,2) isNotEqualTo 0)) ----> trap(misaligned)
/* uint32 */ when ((IO.size isEqualTo 4 ) and (modulo(addr,4) isNotEqualTo 0)) ----> trap(misaligned)
/* uint64 */ when ((IO.size isEqualTo 8 ) and (modulo(addr,8) isNotEqualTo 0)) ----> trap(misaligned)
-
in others accessed strictly-16-bit
Dunno why, but things like C8900 (Ethernet) are 16-bit :-//
(8bit works, but ... it's hw-bugged)
-
All Cortex-M3/M4/M7 devices should support unaligned access
Yup, indeed! ARM is usually tolerant, other architectures are not.
Not in my experience. Even application processors may have ARM CPUs that don't allow unaligned access and I have come across systems that don't even throw a bus error but corrupt data silently.
-
This includes 0.
As a small point, I think 0 probably doesn't exist in the target -> PC direction (what USB calls "IN" direction). May be legal though, of course, but why would the target want to transmit 0 bytes? ST don't test for it, but
for (i = 0U; i < count32b; i++)
won't execute if count32b==0.
but at the moment somebody starts to write a "library", it is expected by the users of "library" that it's that "library" which takes upon itself that burden.
Yes; we are stuck with this, since it appears nobody actually knows how the hardware works, so "we" have to speculate.
Anyway it looks like, for the 32F4, the ST code is right with the __packed simply removed, and explicitly aligning the buffers is just an extra "safety" step.
It does make sense to feed peripherals with 32 bits, since they run much slower than the CPU (48MHz for the USB?).
-
> why would the target want to transmit 0 bytes?
The reason is the same in both directions - see USB2.0 5.3.2. Pipes, treatise of "short packets" (for definition of short packet see 9.4.3).
> for (i = 0U; i < count32b; i++)
> won't execute if count32b==0.
This is in the "data" stage of transfer; in case of ZLP (zero-length packet), the OTG machine won't proceed to that. The Operational model subchapter of OTG chapter(s) in RM0090 describe that.
JW
-
Long irrelevant now, but
And I counted also the information provided by unpaid volunteers
I've just got a reply (email notification) from "Piranha" to a Q I posted in March 2022 :)
I realise everyone is unpaid but that is quite wrong for a $14BN company.
-
This is in the "data" stage of transfer; in case of ZLP (zero-length packet), the OTG machine won't proceed to that. The Operational model subchapter of OTG chapter(s) in RM0090 describe that.
Would a "dummy" write to DFIFO fix that? (As in, trigger the transfer but discard all four bytes?)
If so,
len += !len; // Minimum 1
or
count32b = ((uint32_t)len + 3U + !len) / 4U; // Minimum 1
or
q = src + (bytes + 3 + !bytes) / 4; // Minimum 1
would be a simple fix.
-
Would a "dummy" write to DFIFO fix that? (As in, trigger the transfer but discard all four bytes?)
Fix what exactly? The trigger for a transfer is not a FIFO write, but the count write. You can write 0 and ZLP would be sent. No need to write FIFO in that case.
-
Would a "dummy" write to DFIFO fix that? (As in, trigger the transfer but discard all four bytes?)
Fix what exactly? The trigger for a transfer is not a FIFO write, but the count write. You can write 0 and ZLP would be sent. No need to write FIFO in that case.
Ah, okay. I understood wek's post above to mean the functions shown have a bug in the ZLP case. They don't, because in the zero length case nothing needs to be written to the DFIFO anyway.
The C interface feels increasingly odd to me. In particular, why is the count write separate from the buffer fill? Weird.
-
It is impossible to even enumerate the device without exchanging ZLPs, so there won't be bugs like that.
The interface is structured that way because this is a reasonably low level function, basically just an abstraction over hardware. Plus you may want to use multiple calls before you send the data. FIFO can store more data than a endpoint size and USB controller is capable of splitting that into multiple packets automatically.
-
I did wonder why ZLPs were needed... What a weird protocol.
In the early days of USB, the standard comment was that the spec was written by a bunch of kids, didn't work, but there was so much commercial drive behind it that eventually it got sorted out.
Bluetooth is a decade further along that curve ;)
This has been an interesting learning experience but I would hate to be someone employing coders at $100/hr having to sort out this stuff. It works for me because I value my time at zero, and I enjoy most of it. I guess larger industrial firms buy in commercial products for USB, ETH, etc, with paid support.
-
All Cortex-M3/M4/M7 devices should support unaligned access
Yup, indeed! ARM is usually tolerant, other architectures are not.
Not in my experience. Even application processors may have ARM CPUs that don't allow unaligned access and I have come across systems that don't even throw a bus error but corrupt data silently.
yeah, that's why I say "usually" ;D
My R18200 does the same: it silently corrupts things if you don't enable the bus misalignment trap.
So it's not a "tolerant" CPU, it's a "bastard" CPU.
Yup, there are also this kind of CPUs around!
-
I did wonder why ZLPs were needed... What a weird protocol.
ZLPs are extremely useful. If you need to send 512 byte array over 64 byte endpoint, you just send it in 8 packets and finish with ZLP notifying that you are done. CDC protocol relies on this, since CDC is not just serial ports as people assume, but it also sends fully separated Ethernet frames. ZLP acts as a packet separator.
In the early days of USB, the standard comment was that the spec was written by a bunch of kids
Get off your high horse. You have no idea about how the standard works, yet make comments like this.
I guess larger industrial firms buy in commercial products for USB, ETH, etc, with paid support.
Not really. Vendor stacks usually works fine if you take just a bit of time to figure it out and actually just take time to read the spec. Commercial firms just hire people that don't struggle with proimitive things.
-
Get off your high horse
I guess this is cultural but there is no need to be rude.
Just had a look. 50MB. Yeah, I admire you for knowing it all. Especially as you got paid for reading it :) I've never had a job since leaving univ (44 years ago) so never got paid for learning something.
-
Especially as you got paid for reading it :)
The other way around, I've got the job because I read stuff. I do it for fun in my spare time.
-
The other way around, I've got the job because I read stuff. I do it for fun in my spare time.
QFT.
Some years ago I changed job - internally, we are big - and one of the reasons I got the new job was that the person checking my technical competence did not expect me to know stuff totally unrelated to my previous positions.
Recently, I was surprised to get an offer from an external firm that provided us a course - same reason, as reported by the instructor (not interested, though).
I do not consider myself especially smart, and I'm a bit lazy, but expanding one's own knowledge pays off, where my innate charm :-DD doesn't cut it.
-
I recommend a google on the early days of USB, the criticism of the messy spec, widespread compatibility issues, and then how it developed, and who drove the standardisation (Intel, with their motherboard chip monopoly) which eventually produced a workable solution.
You just need to be a bit "old" and probably a lot of it is before the internet got going.
There is still a lot of funny stuff deep in there e.g. how writes to slow FLASH are handled. I did some long threads on it when looking for a solution for my FatFS impementation in a 15ms sector-write Adesto FLASH. I got slagged off for doing this 15ms write in the USB ISR (the interrupt which occurs when the 8th 64 byte MSC packet has arrived, and the 512 byte sector is ready to be written) but nobody had the slightest clue about how to do it "puritanically". One purist told me I was supposed to write a "driver" for the FLASH (which solves precisely nothing). Basically, it appears, returning any of the BUSY status codes breaks the Windows driver. The stuff about it retrying on a BUSY status appears to be BS. I spent days or weeks of my life going down that rabbit hole. Then it turned out that FLASH sticks do exactly what I was doing: they hang in the ISR for each sector write and return OK when it is finished. This approach works on all windows versions from XP onwards, and on Linux. The timeout on Windows MSC is huge - probably a second or so because I can single step through USB code in the debugger and it continues to run.
It's easy to criticise...
-
I'm well aware of USB history. It is virtually impossible to design a perfect standard out of the box. And they did pretty well and then worked to solve issues that came up. It is easy to criticize indeed, especially when you have not done anything on a similar scale and adoption.
Your USB struggles are related to you not understanding the standard and ST USB stack. They have nothing to do with the actual standard for USB or MSC.
-
I'm well aware of USB history. It is virtually impossible to design a perfect standard out of the box. And they did pretty well and then worked to solve issues that came up. It is easy to criticize indeed, especially when you have not done anything on a similar scale and adoption.
Pretty well? Try telling that to all the peripheral makers who pumped money into multiple generations of USB devices that didn't sell because the whole chain wasn't ready, and had become outdated by the time it was. Senior people at Intel quipped in retrospect that they took less time to fight the might of Germany and Japan than it took to get USB out the door fully formed and usable.
-
Try telling that to all the peripheral makers who pumped money into multiple generations of USB devices that didn't sell because the whole chain wasn't ready,
That will always happen with new standards. Also, multiple generations? USB 1.0 had issues, but it was released for 2 years before USB 1.1 addresses most of them and was perfectly usable. And by that time native OS support was common.
In the late 90s manufacturers released a lot of junk that was destined to fail, not USB fault.
USB is collaborative project, it would take longer time to release if you compare to proprietary stuff.
I'm not saying that USB is perfect, knowing all the issues, I would not design a new interface to be like that. But back then they did not know and had no good references.
-
Given how fragmented interfaces were before USB in general, and plug-and-play being a relatively brand-new feature overall at this time, it was still a great achievement.
And now that it's been around for about 25 years, with backwards compatibility always maintained. Uh yeah. Calling it a failure? :-DD
For people stating this kind of absurdity, please let us know what *you* have achieved. We can certainly all learn from that!
-
Your USB struggles are related to you not understanding the standard and ST USB stack
I posted about it here in the hope that a true USB expert could set me on the correct course. It would appear that you missed the opportunity ;)
And now that it's been around for about 25 years
Not really surprising that it works, after years :) Bluetooth might get there one day, too. Too late for all the people whose 5 year old car system has stopped working after a phone update... I will never try to develop a BT product; it would be a nightmare for any manufacturer who has to support their products (which doesn't apply to 99% of participants in the retail space).
please let us know what *you* have achieved.
I have achieved nothing, which is why I am here: to learn from the masters :)
https://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants
-
So, wrapping it all up,
- __packed attribute is already not there in said functions the current/recent incarnations of Cube
- it's a Keil/ARM compiler native qualifier, the attribute into which it converts for gcc does nothing (i.e. is both ineffective and harmless) except it throws a warning
- said function moves words into USB hardware (FIFO) from buffer in RAM passed through pointer
- pointer to buffer is best to be word-aligned for speed, but there are legitimate user cases where users want to use unaligned pointer (e.g. zero-copy higher-level protocols)
- plain dereferencing unaligned pointer to word throws error in 'CM0/'CM0+ and works somewhat less efficiently in 'CM3/'CM4/'CM7
- facing this error, Cube authors decided to allow for unaligned pointer for users convenience at the cost of more work for them
- for Keil/ARM compiler it's easy, just adding __packed moves the problem to the compiler which does whatever is needed (i.e. generates multiple split accesses)
- I wouldn't be surprised to learn that IAR has similar feature
- this may be not optimal as there are algorithms through which most accesses still can be word-wise at the cost of dis/reassembling words in software, but here we are
- Cube authors initially probably did not realize that this does not work in the same way in gcc, IIRC in current Cube there is explicit disassembly/reassembly as needed for gcc
- in the 'CM3/4/7 version, __packed is there too, maybe to suppress warning about unaligned access? but as said above, is mostly harmless
- in the current 'CM3/4/7 gcc version there's some cast to __packed array or something, probably ineffective/harmless in the same way except it does not thow the warning? I don't know don't care about Cube
- ZLPs are useful, I referred to chapter and verse in the fine manuals above
- in Synopsys OTG USB used in higher-end STM32, the "transfer start" and "dump data" (and then also a "done/PC got it") processes are separate to allow replacing the latter by built-in DMA seamlessly
- in that USB, when "transfer start" is with ZLP, it does not throw the "dump data" interrupt at all, so there's no point to discuss zero data case in said "dump data" functions
- the free service is late and lacking and the $14B company does not care
- USB is crap but who are we to judge
- sometimes users in fora are rude to each other
JW
-
Thank you.
__packed attribute is already not there in said functions the current/recent incarnations of Cube
Are these online somewhere? I looked in the Cube library (c:\st or c:\users) but the file stm32f4xx_ll_usb.c is not there. Maybe it is compressed into something...
sometimes users in fora are rude to each other
It achieves nothing though.
-
I'm out of office but try googling for CubeF4 github.
JW
-
I think I found it here
https://github.com/STMicroelectronics/STM32CubeF4/blob/master/Drivers/STM32F4xx_HAL_Driver/Src/stm32f4xx_ll_usb.c
and the code there is
/**
* @brief USB_WritePacket : Writes a packet into the Tx FIFO associated
* with the EP/channel
* @param USBx Selected device
* @param src pointer to source buffer
* @param ch_ep_num endpoint or host channel number
* @param len Number of bytes to write
* @param dma USB dma enabled or disabled
* This parameter can be one of these values:
* 0 : DMA feature not used
* 1 : DMA feature used
* @retval HAL status
*/
HAL_StatusTypeDef USB_WritePacket(USB_OTG_GlobalTypeDef *USBx, uint8_t *src,
uint8_t ch_ep_num, uint16_t len, uint8_t dma)
{
uint32_t USBx_BASE = (uint32_t)USBx;
uint8_t *pSrc = src;
uint32_t count32b;
uint32_t i;
if (dma == 0U)
{
count32b = ((uint32_t)len + 3U) / 4U;
for (i = 0U; i < count32b; i++)
{
USBx_DFIFO((uint32_t)ch_ep_num) = __UNALIGNED_UINT32_READ(pSrc);
pSrc++;
pSrc++;
pSrc++;
pSrc++;
}
}
return HAL_OK;
}
/**
* @brief USB_ReadPacket : read a packet from the RX FIFO
* @param USBx Selected device
* @param dest source pointer
* @param len Number of bytes to read
* @retval pointer to destination buffer
*/
void *USB_ReadPacket(USB_OTG_GlobalTypeDef *USBx, uint8_t *dest, uint16_t len)
{
uint32_t USBx_BASE = (uint32_t)USBx;
uint8_t *pDest = dest;
uint32_t pData;
uint32_t i;
uint32_t count32b = (uint32_t)len >> 2U;
uint16_t remaining_bytes = len % 4U;
for (i = 0U; i < count32b; i++)
{
__UNALIGNED_UINT32_WRITE(pDest, USBx_DFIFO(0U));
pDest++;
pDest++;
pDest++;
pDest++;
}
/* When Number of data is not word aligned, read the remaining byte */
if (remaining_bytes != 0U)
{
i = 0U;
__UNALIGNED_UINT32_WRITE(&pData, USBx_DFIFO(0U));
do
{
*(uint8_t *)pDest = (uint8_t)(pData >> (8U * (uint8_t)(i)));
i++;
pDest++;
remaining_bytes--;
} while (remaining_bytes != 0U);
}
return ((void *)pDest);
}
I can't understand what they are trying to do given the 32F4 supports unaligned 32 bit accesses in hardware.
-
I can't understand what they are trying to do given the 32F4 supports unaligned 32 bit accesses in hardware.
The code may be universal for all devises with the same controller, so they want to be as compatible as possible.
-
OK. What is this
do
{
*(uint8_t *)pDest = (uint8_t)(pData >> (8U * (uint8_t)(i)));
i++;
pDest++;
remaining_bytes--;
} while (remaining_bytes != 0U);
Isn't it writing 1 byte at a time into the USB FIFO?
-
Where is this code coming from? I see it from the quoted code above.
It is not. It reads FIFO value into pData and then copies it byte by byte into the final buffer.
Although pData is always aligned, so this "__UNALIGNED_UINT32_WRITE(&pData, USBx_DFIFO(0U));" is not necessary, it could just be "pData = USBx_DFIFO(0U)".
-
Could this entire issue not be solved simply by __align(4) on the tx and rx buffers, and by moving uint32 values to/from the USB controller?
Wek has a point about "zero-copy" buffer applications but do these really exist in USB, in the arm32 context? Much has been made of that in the ETH/LWIP department but actually the perf gain there is approximately minus zero, and I don't think anybody is up to fixing all the bugs in the latest ST ETH drivers.
-
Could this entire issue not be solved simply by __align(4) on the tx and rx buffers, and by moving uint32 values to/from the USB controller?
Yes, but you would have to align the buffers at the point of call, and they did not want to place this constraint on their APIs.
In my implementations I do require buffer alignment with no issues and it indeed simplifies everything.
-
you would have to align the buffers at the point of call,
No doubt I am missing something but surely when you call this code, the buffer can be either in main RAM or on the stack, and an __align directive will work in either case.
Not sure what happens with a buffer on the heap; no doubt it depends on the heap implementation.
-
an __align directive will work in either case.
Yes, but my point is that you as a caller have to ensure alignment of each buffer you pass. You can't just declare function arguments as aligned. This is not hard to do, but may be you expect users of this API to not remember this in all cases.
It is like needing to declare a variable shared with an interrupt handler as a volatile. It is a thing you need to do, but people forget about that all the time and run into issues.
So they may have decided to do more work inside the function so that you never have to think about it outside.
-
Not sure what happens with a buffer on the heap; no doubt it depends on the heap implementation.
malloc() returns a pointer that is aligned at least as strictly as max_align_t.
For even stricter aligment requirement you can call aligned_alloc().
-
So they may have decided to do more work inside the function so that you never have to think about it outside.
OK; how about defining the buffer as a uint32_t one? Then alignment is implicit, the caller can't forget, but he has to mess with pointers if feeding bytes into it.
malloc() returns a pointer that is aligned at least as strictly as max_align_t.
I can't find it in my source, but that sounds like 8 or even 16, according to google. My biggest scalar is 8. Thank you.
-
OK; how about defining the buffer as a uint32_t one? Then alignment is implicit, the caller can't forget, but he has to mess with pointers if feeding bytes into it.
Yes, but now if you naturally want to work with unaligned data, you are going to have the same messy loops all over the code instead of one place in the final function.
It is just a decision you need to make at the architecture stage. If you are ok designing everything around aligned buffers, then there is no issue. If you are not ok with that, then doing it where ST did is the best approach. ST did not want to make this decision for their customer, so they went the safe route.
And of you are writing the drivers yourself, you can make this decision yourself. Otherwise you just have to live with the decision vendor made for you.
-
malloc() returns a pointer that is aligned at least as strictly as max_align_t.
I can't find it in my source, but that sounds like 8 or even 16, according to google. My biggest scalar is 8. Thank you.
It's C11.
alignof(long double) may be 16 on some platforms.