Author Topic: 32 bit CRC - is it standard? (Read 8492 times)

oPossum · « **Reply #25 on:** July 29, 2021, 07:07:37 am »

Quote from: peter-h on July 29, 2021, 06:10:30 am

Would you be kind enough and post the table for JAMCRC?

Everything I have posted in this thread is 0x04C11DB7 reflected.

The initial value and final XOR don't change the LUT. Only the poly and reflection do.

Quote

but I am not sure if data value is uint32_t or uint8_t. The statement

It is 8 bits because the LUT has 256 entries.
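To make the point about the LUT concrete, here is a minimal sketch of how such a 256-entry table can be generated. The function names crc32_lut_reflected() and crc32_lut_normal() are made up for this example and are not from the earlier posts. Only the polynomial and the bit direction go in; the initial value and final XOR never appear.

```c
#include <stdint.h>

/* Table for a reflected (LSB-first) CRC. poly_reflected is the
 * bit-reversed polynomial, e.g. 0xEDB88320 for 0x04C11DB7. */
static void crc32_lut_reflected(uint32_t lut[256], uint32_t poly_reflected)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;                       /* one entry per 8-bit index */
        for (int bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (c >> 1) ^ poly_reflected : (c >> 1);
        lut[i] = c;
    }
}

/* Table for a non-reflected (MSB-first) CRC using the polynomial as written. */
static void crc32_lut_normal(uint32_t lut[256], uint32_t poly)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i << 24;                 /* index byte enters at the top */
        for (int bit = 0; bit < 8; bit++)
            c = (c & 0x80000000u) ? (c << 1) ^ poly : (c << 1);
        lut[i] = c;
    }
}
```

CRC-32 and JAMCRC would both use the table from crc32_lut_reflected(lut, 0xEDB88320u); they differ only in whether the result is inverted at the end.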

peter-h · « **Reply #26 on:** July 29, 2021, 11:57:40 am »

Yes that works great! Thank you.

The original func takes 2.3 secs to do 1MB and the table based one, 1 byte at a time, takes 0.4 sec.

The 4 byte one would probably be 0.1 sec.

Nominal Animal · « **Reply #27 on:** July 29, 2021, 08:20:33 pm »

Quote from: peter-h on July 29, 2021, 06:10:30 am

Would you be kind enough and post the table for JAMCRC?

Like oPossum wrote, that table is in their first LUT post, as well as in mine.

The poly 0x04C11DB7 (0b00000100110000010001110110110111) in inverse bit order yields mask 0xedb88320 (0b11101101101110001000001100100000).
Like the comment says, the five mask/poly lines just invert the bit order.
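The five lines referred to are in an earlier post and are not reproduced here. As an illustration only, a common five-step way to invert the bit order of a 32-bit value looks like the sketch below; reverse_bits32() is a name chosen for this example and the original lines may differ in detail.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word by swapping progressively
 * larger groups: single bits, pairs, nibbles, bytes, half-words. */
static uint32_t reverse_bits32(uint32_t v)
{
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);
    v = (v >> 16) | (v << 16);
    return v;
}
```

Applied to 0x04C11DB7 this gives 0xEDB88320, the mask quoted above.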

The function takes a pointer to the data, the number of bytes in the data to be checksummed, and the initial checksum; and returns the updated checksum.

The per-byte lookup is faster than the per-32bit, because the per-32bit iterates over bits, not bytes.
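For reference, a per-byte lookup with that kind of interface (pointer, byte count, initial checksum in, updated checksum out) can be sketched as follows. This is not the code from the earlier posts; crc32_update(), crc32_lut and crc32_lut_init() are names invented for this example, and the table is built lazily at run time rather than stored as a constant array.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_lut[256];
static int crc32_lut_ready;

/* Build the reflected table for polynomial 0x04C11DB7 (mask 0xEDB88320). */
static void crc32_lut_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
        crc32_lut[i] = c;
    }
    crc32_lut_ready = 1;
}

/* Update crc with len bytes at data; one table lookup per byte.
 * No initial or final inversion is applied here, the caller chooses both. */
uint32_t crc32_update(const void *data, size_t len, uint32_t crc)
{
    const uint8_t *p = (const uint8_t *)data;

    if (!crc32_lut_ready)
        crc32_lut_init();
    while (len--)
        crc = (crc >> 8) ^ crc32_lut[(crc ^ *p++) & 0xFFu];
    return crc;
}
```

Because the function returns the updated checksum, a large block can be fed in pieces: pass 0xFFFFFFFF for the first piece and the previous return value for each following piece. Used as is, that gives JAMCRC; inverting the final value gives the usual CRC-32.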

To get any faster, one would have to write an even tighter inner loop, which I'm not sure is possible (since GCC -O2 on Cortex-M4 gets it down to four instructions plus branch); or switch to an HMAC (hash-based message authentication code) using a hash function that is faster than four iterations of CRC32.

In this case, I suspect that MurmurHash, specifically Murmur3_32, would perform well. This is because Cortex-M4 has a binary rotate right assembly instruction (rors) to implement ROL, and (hopefully!) a single cycle unsigned integer multiplication (that yields the low 32 bits of 32×32 multiplication).
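To show where the rotates and multiplications come in, here is a minimal sketch of Murmur3_32 as I understand the public-domain algorithm. The function names murmur3_32() and rotl32() are chosen for this example, and assembling each 32-bit block byte by byte in little-endian order is my own choice, made so the result does not depend on alignment or host byte order. I have not benchmarked it on a Cortex-M4.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate left; r is only ever 13 or 15 here, so the shift counts are valid. */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32u - r));
}

uint32_t murmur3_32(const void *data, size_t len, uint32_t seed)
{
    const uint8_t *p = (const uint8_t *)data;
    const uint32_t c1 = 0xCC9E2D51u, c2 = 0x1B873593u;
    uint32_t h = seed;
    uint32_t k;
    size_t blocks = len / 4;

    /* Body: two multiplies and two rotates per 32-bit block. */
    for (size_t i = 0; i < blocks; i++, p += 4) {
        k = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
            ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k;  h = rotl32(h, 13); h = h * 5u + 0xE6546B64u;
    }

    /* Tail: the remaining 1 to 3 bytes, if any. */
    k = 0;
    switch (len & 3u) {
    case 3: k ^= (uint32_t)p[2] << 16; /* fall through */
    case 2: k ^= (uint32_t)p[1] << 8;  /* fall through */
    case 1: k ^= p[0];
            k *= c1; k = rotl32(k, 15); k *= c2;
            h ^= k;
    }

    /* Finalization: mix in the length, then avalanche the bits. */
    h ^= (uint32_t)len;
    h ^= h >> 16; h *= 0x85EBCA6Bu;
    h ^= h >> 13; h *= 0xC2B2AE35u;
    h ^= h >> 16;
    return h;
}
```

Note that Murmur3 is a non-cryptographic hash and makes no error-detection guarantees of the kind a CRC does, so whether it is a suitable replacement depends on what the checksum is protecting against.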

SiliconWizard · « **Reply #28 on:** July 29, 2021, 10:51:38 pm »

Quite a few MCUs have a CRC32 peripheral, which will get you there faster. The OP might have considered it and couldn't figure out how to use it on their MCU, or something, IIRC? But really, they should have a look if they want any faster.

Nominal Animal · « **Reply #29 on:** July 29, 2021, 11:43:28 pm »

Right, SiliconWizard. The STM32F417 reference manual does describe one. It essentially has two 32-bit registers: CRC_CR and CRC_DR. Writing 0x00000001 to CRC_CR resets the checksum to 0xFFFFFFFF, and writing 32-bit data values to CRC_DR updates the checksum. Reading CRC_DR yields the current checksum. So,

Code:

uint32_t  stm32f417_crc32(const uint32_t *src, const uint32_t *end)
{
    CRC_CR = 0x00000001;  /* Resets CRC_DR to 0xFFFFFFFF */
    while (src < end)
        CRC_DR = *(src++);
    return CRC_DR;
}

although you might need to enable the CRC peripheral clock first (RCC_AHB1ENR.CRCEN = 1; or RCC_AHB1ENR |= 1<<12;). The unit takes four AHB clock cycles (HCLK) per 32-bit word, and uses the same polynomial (0x04C11DB7) we've discussed here. The checksum is initialized to 0xFFFFFFFF. Therefore, the above should yield the same results as

Code:

uint32_t  stm32f417_crc32(uint32_t *src, const uint32_t *end)
{
    return crc32(src, (size_t)(end - src) * 4, 0xFFFFFFFF);
}

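using the crc32() in e.g. my previous posts, assuming the memory region to verify is 32-bit aligned and that the peripheral consumes bits in the same order as crc32() does. As far as I know, the F4 CRC unit shifts each word in MSB first with no reflection, so against a reflected (0xedb88320) crc32() each input word and the final result would need a bit reversal (RBIT) to match.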
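One way to check this on a PC is a bit-serial model of the peripheral. The following is a minimal sketch, assuming the unit behaves the way I understand the F4 one to: polynomial 0x04C11DB7, reset value 0xFFFFFFFF, each word shifted in MSB first, no reflection and no final XOR. The names crc_unit_model(), rbit32(), crc32_reflected() and crc_unit_via_reflected() are made up for this example, and none of it has been compared against real hardware here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed model of the CRC unit: XOR the word in, then 32 MSB-first steps. */
static uint32_t crc_unit_model(const uint32_t *src, const uint32_t *end)
{
    uint32_t crc = 0xFFFFFFFFu;              /* value after the CRC_CR reset */
    while (src < end) {
        crc ^= *src++;
        for (int bit = 0; bit < 32; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : (crc << 1);
    }
    return crc;
}

/* Portable stand-in for the Cortex-M RBIT instruction. */
static uint32_t rbit32(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Plain bitwise reflected CRC over bytes (mask 0xEDB88320), no inversions. */
static uint32_t crc32_reflected(const uint8_t *p, size_t len, uint32_t crc)
{
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc;
}

/* Same value as crc_unit_model(), obtained through the reflected routine:
 * bit-reverse each word, feed its bytes low byte first, bit-reverse the result. */
static uint32_t crc_unit_via_reflected(const uint32_t *src, const uint32_t *end)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (src < end) {
        uint32_t r = rbit32(*src++);
        uint8_t bytes[4] = {
            (uint8_t)(r & 0xFFu),         (uint8_t)((r >> 8) & 0xFFu),
            (uint8_t)((r >> 16) & 0xFFu), (uint8_t)((r >> 24) & 0xFFu)
        };
        crc = crc32_reflected(bytes, 4, crc);
    }
    return rbit32(crc);
}
```

Under these assumptions a single all-zero word gives 0xC704DD7B, which is a handy value to compare with what CRC_DR reads back on the actual chip.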

Siwastaja · « **Reply #30 on:** July 30, 2021, 06:48:11 am »

Best to use DMA to feed the peripheral.

If you end up just waiting for the DMA and peripheral in a busy loop, you could do part of the calculation in parallel on the CPU, finding the optimal slice width, for example processing 20% of the data in software while waiting for the DMA to do the trick for the other 80%, for the most complexity for relatively little gain.

I'm thinking, can you configure two DMA channels to trigger on the same DMA request signal, so that one of them moves the data to the CRC peripheral and another in memory? Probably not as reading out the SPI DR is the SPI FIFO pop operation.

peter-h · « **Reply #31 on:** July 30, 2021, 08:40:07 am »

ST offer a library (probably 1000 lines) for using the 32F4 on-chip CRC generator, but I decided not to pursue that because in my application I need to CRC check a data block produced externally, so I needed a software version anyway.

In the context of what I am doing, running a CRC on even the entire CPU FLASH (1MB) in 400ms is good enough.

And in another part of the code I have to CRC up to 1MB coming off a 21 Mbps SPI FLASH chip, which is about 2MB/sec, so of the same order as doing it in software. One could fully hide the CRC calculation in the SPI transfer time in that case, if I wanted to unravel the SPI code, which is currently a blocking function.

SiliconWizard · « **Reply #32 on:** July 30, 2021, 05:15:38 pm »

Quote from: Siwastaja on July 30, 2021, 06:48:11 am

Best to use DMA to feed the peripheral.

Yes. Obviously the most efficient way is to use the CRC peripheral along with DMA. Then, it'll be significantly faster, and most of all, you can do something else while it's computing the CRC on the whole block.

