Topic: Bootloader woes - corruption of data being flashed


Online HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Bootloader woes - corruption of data being flashed
« on: February 21, 2019, 12:44:11 am »
Just when I thought things were going nicely after porting Peter Fleury's STK500v2-compatible bootloader to run on the ATmega32M1 (as mentioned in my previous thread), I have run into a bizarre problem.

I had made some minor changes to my main application firmware code, and then wrote it to the chip using the bootloader. I then power-cycled the device, only to find that a critical part of my code was apparently no longer functional. I soon traced it down to a 'mode' variable - which is used by the main loop to decide what actions to take - having the wrong value! For some reason, it had the value zero, when it should have a value of one (which is what the code initialises it to in the variable definition). Somehow, the variable wasn't being initialised properly at boot-up! :wtf:

I checked the .map and assembly .lss output files to see whether the compiler was playing silly buggers and, like, I dunno, forgetting to load in the .data section or something. Nope, the variable has been put in the .data section, and the assembly includes an initial __do_copy_data block of code which appears (according to my limited understanding of AVR assembler) to copy the right number of bytes into memory. I even checked the .hex file, and the variable's correct data is indeed located at the exact place the .map output says it is.

So, next I thought I would read back the entire flash of the MCU and compare to what it had been programmed with. Here, I discovered the apparent cause of my problem. Some of the data written to flash was not the same as the .hex file!

Here is a screenshot with a side-by-side comparison of the contents of flash read from the chip (left), and the as-compiled .hex file's contents (right):



Highlighted is the one critical byte in question for the initial value of my 'mode' variable - it's different! Also, there's a bunch of extra crap afterwards.

What's going on here? Is this bootloader actually a buggy piece of crap that's not writing the data properly? :rant: I should probably note that my modifications to the bootloader did not touch any of the flash reading/writing code; it was mainly focused on UART initialisation.

The strange thing is, though - and something that had me chasing my tail for a while - is that if I change some of my main application code (e.g. just inserting a debugging printf() statement), the flash is written correctly and everything of course works as it should.
 

Offline Cicero

  • Contributor
  • Posts: 20
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #1 on: February 21, 2019, 11:55:27 am »
I haven't looked into the bootloader code, but first thought is it didn't do a full erase of application space prior to programming.  So I'd first look into that.

What are the addresses of the bootloader and app, and sizes?  Are you sure nothing's overlapping by accident?



 
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Bootloader woes - corruption of data being flashed
« Reply #2 on: February 21, 2019, 03:53:00 pm »
I find it suspicious that the byte in error is the very LAST byte of your program...

 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Bootloader woes - corruption of data being flashed
« Reply #3 on: February 21, 2019, 04:02:55 pm »

The strange thing is, though - and something that had me chasing my tail for a while - is that if I change some of my main application code (e.g. just inserting a debugging printf() statement), the flash is written correctly and everything of course works as it should.

Sounds like maybe it's not waiting for a write to complete before restarting
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Online HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #4 on: February 21, 2019, 05:28:46 pm »
What are the addresses of the bootloader and app, and sizes?  Are you sure nothing's overlapping by accident?

My boot section size is 512 words / 1KB, and the bootloader is just under 1KB; bootloader address is 0x7C00 (in bytes). My application code takes only 26KB (out of 32KB), so no chance of anything overlapping. In fact, you can see in the left-hand hex dump in the previously posted screenshot that there is empty space (0xFF bytes) after my application code.

I find it suspicious that the byte in error is the very LAST byte of your program...

Yes, it seems suspicious. In fact, even more so when I recall from my brief reading of the STK500v2 protocol app note that it allows flash programming by page or by word - the word size being 2 bytes, and the corruption starting at the last 2 bytes...

I have been looking at the code in the bootloader and trying to make sense of it to see if there are any subtle bugs in it. The relevant section is below - maybe someone else can see any problems with it?

Code:
            case CMD_PROGRAM_FLASH_ISP:
            case CMD_PROGRAM_EEPROM_ISP:
                {
                    unsigned int  size = (((unsigned int)msgBuffer[1])<<8) | msgBuffer[2];
                    unsigned char *p = msgBuffer+10;
                    unsigned int  data;
                    unsigned char highByte, lowByte;

                    if ( msgBuffer[0] == CMD_PROGRAM_FLASH_ISP )
                    {
                        address_t     tempaddress = (address<<1) ; //convert word to byte address (boot_xx macro needs byte address)

                        // erase only main section (bootloader protection)
                        if  (  eraseAddress < APP_END )
                        {
                            boot_page_erase(eraseAddress);  // Perform page erase
                            boot_spm_busy_wait();           // Wait until the memory is erased.
                            eraseAddress += SPM_PAGESIZE;   // point to next page to be erased
                        }

                        /* Write FLASH */
                        do {
                            lowByte   = *p++;
                            highByte  = *p++;

                            data =  (highByte << 8) | lowByte;
                            boot_page_fill(address<<1,data);  //convert word to byte address

                            address++;          // Select next word in memory
                            size -= 2;          // Reduce number of bytes to write by two
                        } while(size);          // Loop until all bytes written

                        boot_page_write(tempaddress);
                        boot_spm_busy_wait();
                        boot_rww_enable();              // Re-enable the RWW section
                    }
                    else
                    {
                        /* EEPROM writing stuff, not relevant to situation */
                    }
                }
                break;

Sounds like maybe it's not waiting for a write to complete before restarting

You mean the bootloader restarting the device? (As opposed to me manually power-cycling or resetting.) It doesn't actually do that at all. The default behaviour is not to exit the bootloader after a programming operation - the ENABLE_LEAVE_BOOTLADER option is not turned on.



I think a next step may be to break out the logic analyser and capture an entire programming transaction to see whether data for the bytes in question are actually being sent to the bootloader.

By the way, this was using Atmel Studio 7 to program. I might also try avrdude and see if there's any difference in behaviour.
 

Offline ajb

  • Super Contributor
  • ***
  • Posts: 2607
  • Country: us
Re: Bootloader woes - corruption of data being flashed
« Reply #5 on: February 21, 2019, 06:00:53 pm »
Sounds like maybe it's not waiting for a write to complete before restarting
You mean the bootloader restarting the device?
Or the erase before the write isn't completed properly.


What's the minimum erase/write size on the flash?  Examining how the discrepancies (BTW, were other locations incorrect?) line up with this may provide some clue.  Also, adding a check that the region has been erased before writing would be a nice defense.  Another option is to redirect the write function to something you can directly monitor (the serial port if you have to, but running it on the desktop and writing into an array or file might be easier--and if you can manage that you're halfway to an off-target test bench, which is always useful), and make sure that the actual data written is what you expect.
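
For what it's worth, the "check that the region has been erased" defence could look something like the minimal sketch below. It works on a plain RAM copy of a page so it can run on a desktop test bench; on the AVR itself the bytes would have to be read back from flash first. The helper name page_is_erased and the 128-byte PAGE_SIZE are assumptions for illustration, not anything from the bootloader source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 128u  /* assumed flash page size in bytes */

/* Hypothetical helper: returns true if every byte of the page image is
 * 0xFF, i.e. the page looks erased and should be safe to program. */
static bool page_is_erased(const uint8_t *page, size_t len)
{
    assert(page != NULL);
    for (size_t i = 0; i < len; i++) {
        if (page[i] != 0xFF) {
            return false;  /* at least one bit is already programmed */
        }
    }
    return true;
}
```

A bootloader using something like this could refuse to write (or report an error status) when the check fails, rather than silently programming over stale data.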
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Bootloader woes - corruption of data being flashed
« Reply #6 on: February 21, 2019, 07:06:25 pm »
Is this repeatable? If so, do you get the same corruption programming the same .hex file with avrdude rather than AS7? If so, what does "-vvvv" say about what is being transmitted?
 

Online HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #7 on: February 21, 2019, 07:38:40 pm »
What's the minimum erase/write size on the flash?  Examining how the discrepancies (BTW, were other locations incorrect?) line up with this may provide some clue.

The flash page size is 128 bytes. I checked, and the corrupted bytes don't lie on a page boundary - corruption occurs at 0x68BA, nearest page boundary is 0x6880. However, it is noteworthy that the additional garbage data that followed does occupy only up to the end of that page (0x6900). So, it seems that either that particular page didn't get erased properly, or perhaps the data buffer (msgBuffer in the code above) didn't get cleared properly - although in that case, one would expect the garbage data to be a copy of the previous page, which it isn't.
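
The page arithmetic above can be double-checked with a small sketch like the following. It only assumes the 128-byte page size quoted above; the helper names page_base and page_offset are made up for illustration.

```c
#include <stdint.h>

#define PAGE_SIZE 128u  /* flash page size in bytes, as stated above */

/* Start address of the page containing byte_addr.
 * Relies on PAGE_SIZE being a power of two. */
static uint16_t page_base(uint16_t byte_addr)
{
    return (uint16_t)(byte_addr & ~(PAGE_SIZE - 1u));
}

/* Offset of byte_addr within its page. */
static uint16_t page_offset(uint16_t byte_addr)
{
    return (uint16_t)(byte_addr & (PAGE_SIZE - 1u));
}
```

For the corrupted address, page_base(0x68BA) gives 0x6880 and page_offset(0x68BA) gives 58, so the corruption starts 58 bytes into the page and the garbage fills the remaining 70 bytes up to 0x6900.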
 

Online HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #8 on: February 21, 2019, 08:07:23 pm »
I have been analysing the code of the bootloader some more, and I am not sure whether I have found a deficiency in the code. I hesitate to call it a bug, as I'm not sure whether normal usage would ever actually render it meaningful. Maybe someone can tell me.

When writing a page to flash, the code does the following: 1) erase the page; 2) load data into the temporary page buffer; 3) write the page with buffer's contents.

The code for erasing the page (see earlier post - block following comment of "erase only main section (bootloader protection)") smells bad to me. The thought popped in my mind: where is eraseAddress set? Why is there a separate variable anyway? Isn't it supposed to be erasing the page it's about to write - controlled by a separate variable, address?

So where is eraseAddress set? Only two places: at variable declaration (initialised to zero), and within handler for a 'chip erase' command. The latter doesn't actually do anything, just responds okay and sets the variable to zero:

Code:
            case CMD_CHIP_ERASE_ISP:
                eraseAddress = 0;
                msgLength = 2;
                msgBuffer[1] = STATUS_CMD_OK;
                break;

Neither of these two places is within the handler for a 'load address' command, which is what I understand sets the actual flash address to write to (i.e. it is not given as part of the write command).

Code:
            case CMD_LOAD_ADDRESS:
#if defined(RAMPZ)
                address =  ((address_t)(msgBuffer[1])<<24)|((address_t)(msgBuffer[2])<<16)|((address_t)(msgBuffer[3])<<8)|(msgBuffer[4]);
#else
                address =  ((msgBuffer[3])<<8)|(msgBuffer[4]);
#endif

                msgLength = 2;
                msgBuffer[1] = STATUS_CMD_OK;
                break;

So, essentially this means that page erasure will normally take place starting only from address zero, regardless of if the programming application specifies a different initial address in a 'load address' command! Or - and perhaps this is part of my problem - if two subsequent programming operations are performed without exiting the bootloader in between, page erasure on the second operation will start from where the previous left off - that is, erasing pages after the app code (but only up to the start of the bootloader) and not the pages it is actually writing to! :scared: The only thing that might mitigate that problem is if a chip erase command is issued beforehand. I don't know if AS7 does that.
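
One possible way to remove the mismatch would be to derive the page to erase from the write address itself instead of keeping a separate counter. The following is only a host-side sketch of that idea, not a tested patch: page_to_erase and NO_ERASE are hypothetical names, SPM_PAGESIZE and APP_END are redefined here with the values discussed in this thread so the snippet compiles on a desktop, and it assumes the programmer sends one whole page per write command.

```c
#include <stdint.h>

#define SPM_PAGESIZE 128u     /* bytes per flash page on this part */
#define APP_END      0x7C00u  /* first byte of the boot section (value from this thread) */
#define NO_ERASE     0xFFFFu  /* hypothetical sentinel: nothing should be erased */

/* Hypothetical helper: given the current word address (as set by
 * 'load address' and auto-incremented by each write), return the byte
 * address of the page about to be written, or NO_ERASE if that page
 * lies inside the boot section. */
static uint16_t page_to_erase(uint16_t word_address)
{
    uint32_t byte_address = (uint32_t)word_address << 1;  /* word -> byte */
    if (byte_address >= APP_END) {
        return NO_ERASE;  /* never erase the bootloader itself */
    }
    return (uint16_t)(byte_address & ~(uint32_t)(SPM_PAGESIZE - 1u));
}
```

In the write handler, the result would take the place of eraseAddress in the boot_page_erase() call, so the page erased is always the page written, whatever 'load address' was sent beforehand.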
 

Offline tsman

  • Frequent Contributor
  • **
  • Posts: 599
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #9 on: February 21, 2019, 09:38:54 pm »
Yes. The disparity between load address and erase address is a known issue with that particular bootloader. It assumes you will always be starting from 0x0000 and for the majority of users, that is going to be true.
 

Online HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #10 on: February 21, 2019, 10:17:44 pm »
I did a comparison in programming behaviour between AS7 and avrdude. I also made a capture of serial communication for the entire programming procedure from both using my logic analyser.

With avrdude, programming worked fine: no errors, no corruption. The bytes in question got sent across the wire exactly as-is. Interestingly, even though the firmware .hex file contains no padding at all, avrdude appears to pad out the last page's data, explicitly causing 0xFF to be written in the unused area of that page. Also of note is that it does not rely on auto-incrementation of the address to be written - it sends an explicit 'load address' command for every page (contrary to the STK500v2 protocol docs, which proclaim that "All the [program/read] commands will increment an internal address counter, so this command needs only to be sent once."). It did an initial 'chip erase' command too.

However, I could not reproduce the error with AS7. It produced a working program too. :-// Bytes transmitted correctly. AS7 also pads the last page with 0xFF. However, it does not precede each page write command with a 'load address' command - it relies on the address auto-incrementing. Initial 'chip erase' performed too.

One obvious thing that I realised while doing this experiment is that the programming procedure for both apps includes a verification step - reading back the flash to ensure it conforms to what was written. I don't see how, if such a step had been performed, the corruption I experienced could have gone undiscovered. Another thing is that if the final page's data, as sent across the wire, is pre-padded with 0xFF, how on earth did garbage data end up being written in that page?

I have a hunch which I will try and test. In the initial circumstance when I experienced the corruption, I was using the 'Start Without Debugging' command in AS7 (i.e. compile and program) - not the standalone 'Device Programming' dialog. Perhaps the former behaves differently? Perhaps it uses different code? I would guess it does not do a verification step; it's hard to tell - the only indicator of activity is a single line in the status bar saying "Programming" (I forget the exact wording) and a percentage progress indicator that only seems to jump straight from 14% to 100% once programming is complete.

I shall have to try that again and see if that is the problematic scenario. But, I shall have to try and put my code back to a state where compilation produces a .hex file that matches what was being programmed when the original corruption occurred. I saved that .hex file, but not the code in that exact state. :(

Yes. The disparity between load address and erase address is a known issue with that particular bootloader. It assumes you will always be starting from 0x0000 and for the majority of users, that is going to be true.

Is it known? Where would one find that publicised?

Before I started with this bootloader, I went looking for any other forks of it in case anyone had already done the work I was planning to do. I only remember finding one fork on GitHub, where I don't recall that any of the changes were bug fixes of that nature.
 

Offline tsman

  • Frequent Contributor
  • **
  • Posts: 599
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #11 on: February 21, 2019, 11:21:51 pm »
Is it known? Where would one find that publicised?
It is known about but it isn't well documented. I first heard about it on a mailing list years ago and I've seen it mentioned in several places online. Some ports of that bootloader have it patched but don't say why they've altered it. The Arduino copy of it still has the bug even though the Wiring copy doesn't but I expect pretty much all of their users to never do anything except load from 0x0000 anyway.

https://github.com/msproul/Arduino-stk500v2-bootloader/issues/1

http://www.robotc.net/wikiarchive/ARDUINO_MEGA_Update_Bootloader

https://github.com/WiringProject/Wiring/blob/master/framework/hardware/Wiring/bootloaders/stk500v2.c#L395

https://github.com/Pinoccio/hardware-pinoccio/blob/master/firmware/bootloader/src/main.c#L1080
« Last Edit: February 21, 2019, 11:24:02 pm by tsman »
 

Online cv007

  • Frequent Contributor
  • **
  • Posts: 827
Re: Bootloader woes - corruption of data being flashed
« Reply #12 on: February 21, 2019, 11:28:07 pm »
Quote
pads the last page with 0xFF
That is just a natural result, not specifically intended (if it is done specifically, then that would be odd, as it is unnecessary since it's already being done naturally).

Page erase - all of the page's bytes become 0xFF (a 'setbit' operation)
Fill page write buffer, any unused bytes remain 0xFF (writing to the page buffer is a 'clearbit' operation)
Program page (a 'clearbit' operation), any bits set in the page write buffer will not change corresponding bits in the page
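
The three steps above can be modelled on a desktop with a few lines of C. This is an idealised sketch of the "erase sets bits, program clears bits" behaviour, not a description of the AVR's internals; flash_erase and flash_program are made-up names operating on an ordinary RAM array.

```c
#include <stddef.h>
#include <stdint.h>

/* Erasing sets every bit in the page to 1. */
static void flash_erase(uint8_t *page, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        page[i] = 0xFF;
    }
}

/* Programming can only clear bits: each byte becomes old AND new.
 * A 0xFF in the buffer therefore leaves the existing byte unchanged. */
static void flash_program(uint8_t *page, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        page[i] &= buf[i];
    }
}
```

Under this model, programming 0x01 over a stale 0xFE gives 0x00, and programming 0xFF padding over stale bytes leaves the stale bytes exactly as they were, which is the kind of result to expect if a page is written without being erased first.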

Quote
how on earth did garbage data end up being written in that page
Your bootloader is erasing pages as needed. If your app is now a little smaller, the pages beyond it remain unchanged. If you want to implement a 'chip erase', there is nothing stopping you from modifying the code to do so. If it means you end up going to the next size of bootloader space, just do it and you then no longer need to worry about it and can then change things as you want. With a 32K chip, I wouldn't worry about a little extra bootloader code if it gets me to something better.
 

Online HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #13 on: February 22, 2019, 12:44:37 am »
It is known about but it isn't well documented. I first heard about it on a mailing list years ago and I've seen it mentioned in several places online. Some ports of that bootloader have it patched but don't say why they've altered it. The Arduino copy of it still has the bug even though the Wiring copy doesn't but I expect pretty much all of their users to never do anything except load from 0x0000 anyway.

Thanks for the links. I don't know why I didn't find any of that stuff. I could swear I searched GitHub for "Peter Fleury" and/or "STK500v2", but didn't see any of those. Maybe the ol' memory is playing tricks and I never did.

Looks like I may have several other issues it will be worth fixing in my copy.

Your bootloader is erasing pages as needed. If your app is now a little smaller, the pages beyond it remain unchanged.

Yes, I know. What I was referring to was how data within a semi-utilised page ended up with garbage data in the unused portion of the page, given that a) a page erase should have been performed on it before writing, and b) the data for that page sent to the bootloader already has the unused portion padded with 0xFF bytes.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8173
  • Country: fi
Re: Bootloader woes - corruption of data being flashed
« Reply #14 on: February 22, 2019, 07:38:56 am »
Sorry if I appear off-topic on this, but really: a generic AVR bootloader project that people have used for years, with an undocumented, critical show-stopper bug sitting there all that time, and you need to debug it this deeply?

Frankly, this code looks very much like someone's specific one-day one-off which probably worked for them once or twice. Which is great for them, but does it make sense to spend any time in using it? This clearly isn't a maintained generic project.

I have never used a bootloader from anyone else; writing it yourself would probably make most sense. After all, with bootloaders, you tend to come up with some specific requirement related to your specific product and its programming environment or lifecycle, so when doing it yourself, you get what you need. A bootloader can be as simple as 100 lines of code, and when it's yours, you can debug it much more easily; and you can cut corners (like the original author here did, without documenting it!) as much as you want to make it even simpler, but very applicable to your specific case.
 

Offline Cicero

  • Contributor
  • Posts: 20
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #15 on: February 22, 2019, 11:38:36 am »
Sounds like you have a good grasp of things.

Yeah, to me it is strange to have eraseAddress auto increment, but specifically then set address with CMD_LOAD_ADDRESS.  Just seems like a point where things can go out of sync.  Seems more logical to have either the bootloader autoincrement both, or specifically set both.
 

Offline amyk

  • Super Contributor
  • ***
  • Posts: 8275
Re: Bootloader woes - corruption of data being flashed
« Reply #16 on: February 22, 2019, 12:49:17 pm »
Have you tried another chip? That will help determine if you might just have a slightly damaged chip instead of a software problem.

(In normal applications development, blaming the hardware/compiler is often a last resort. In embedded development, the other way around is not as rare.)

My boot section size is 512 words / 1KB, and the bootloader is just under 1KB; bootloader address is 0x7C00 (in bytes).
I wonder if that address was chosen as a tribute to the original IBM PC...
 

Online HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #17 on: February 22, 2019, 06:48:57 pm »
I managed to put my main application code back to the state I think it was in when first encountering this problem. At least, the compiled .hex files only differ now by one byte - day of build date in the .progmem section. Then I captured a programming operation using AS7's 'Start Without Debugging' command (i.e. build and program).

Didn't manage to reproduce the issue. :( There's obviously something circumstance-specific that caused the original issue. Bootloader state, dev environment state, some other external influence, I don't know.

But, I did notice that the 'Start Without Debugging' programming procedure is markedly different from the standalone 'Device Programming' dialog.

For starters, it does not do a full verification step. So corruption of data can easily go unnoticed. Secondly, it does some odd things, the purpose of which I cannot imagine. It does the initial expected stuff: sign-on, reading device signature and fuses, chip erase, etc. But the flash writing process looks like this:

LOAD ADDR: 0x6800 (26,624) - a page boundary, two pages before end of program code
READ FLASH: 256 bytes - last two pages of program code
LOAD ADDR: 0x0000 (0) - start of flash
PROG FLASH: 128 bytes - one page
[... repeat last operation for entire program code ...]
LOAD ADDR: 0x6800 (26,624) - again, two pages before end
READ FLASH: 256 bytes - two pages
LOAD ADDR: 0x6800 (26,624) - and again?
READ FLASH: 256 bytes - two pages
LOAD ADDR: 0x6880 (26,752) - last page of app code
PROG FLASH: 128 bytes - one page

I can't fathom what it could be doing here. Some kind of abbreviated verification? But then why re-write the final page? Or in some other aspect it's trying to be 'clever' and short-cut something? :-//

One thing is for sure, the latter part of this procedure will fall foul of the discrepancy of erase address versus write address. The final write of that single, last page will actually erase the page after the last page, and do the write without having erased the intended page first. Maybe this was a contributing factor in the original corruption?
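
To convince myself of that, here is a small host-side model of the two counters replaying the captured sequence. It is a simplification under stated assumptions: everything is kept in byte addresses (the real code holds a word address), "entire program code" is taken to mean pages 0x0000 through 0x6880, APP_END is taken as 0x7C00, and all function names are made up for the model.

```c
#include <stdint.h>

#define SPM_PAGESIZE 128u     /* bytes per flash page */
#define APP_END      0x7C00u  /* first byte of the boot section */
#define NO_ERASE     0xFFFFu  /* sentinel: no page was erased */

static uint16_t write_addr;  /* byte address of the next page write */
static uint16_t erase_addr;  /* models the bootloader's eraseAddress */

static void cmd_chip_erase(void) { erase_addr = 0; }

static void cmd_load_address(uint16_t byte_addr) { write_addr = byte_addr; }

/* Models one full-page PROG FLASH command with the original logic:
 * erase wherever erase_addr points, write wherever write_addr points,
 * then advance both by one page. */
static void cmd_program_page(uint16_t *erased, uint16_t *written)
{
    *erased = NO_ERASE;
    if (erase_addr < APP_END) {
        *erased = erase_addr;
        erase_addr += SPM_PAGESIZE;
    }
    *written = write_addr;
    write_addr += SPM_PAGESIZE;
}

/* Replays the captured sequence and reports what the final single-page
 * write erased and wrote. The READ FLASH commands are left out: each is
 * followed by a LOAD ADDR, and they never touch erase_addr. */
static void replay_as7_sequence(uint16_t *erased, uint16_t *written)
{
    cmd_chip_erase();
    cmd_load_address(0x0000);
    while (write_addr <= 0x6880u) {  /* pages 0x0000 .. 0x6880 */
        cmd_program_page(erased, written);
    }
    cmd_load_address(0x6880);        /* the odd extra write of the last page */
    cmd_program_page(erased, written);
}
```

After replay_as7_sequence(&erased, &written), the model ends with erased equal to 0x6900 and written equal to 0x6880: the final command erases the page after the application and programs the last application page without erasing it.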
 

Online HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #18 on: February 22, 2019, 06:54:44 pm »
I wonder if that address was chosen as a tribute to the original IBM PC...

You mean this - for the MBR? I wasn't aware of that association. Just coincidence that my bootloader section at end of flash is 1K size and the MCU is a 32KB part. ^-^
 

Offline ajb

  • Super Contributor
  • ***
  • Posts: 2607
  • Country: us
Re: Bootloader woes - corruption of data being flashed
« Reply #19 on: February 22, 2019, 07:08:00 pm »
It sort of looks like somebody started to implement a system to write only the changed sections but gave up halfway through. (J-link probes do this, but they actually read out the entire part every time, which is the only prudent way to do it.) 

Writing a page without erasing it could absolutely cause the behavior you saw, and would also explain the junk after the end of your program if a previous version of the program were a bit longer. 
 

Online cv007

  • Frequent Contributor
  • **
  • Posts: 827
Re: Bootloader woes - corruption of data being flashed
« Reply #20 on: February 22, 2019, 11:59:21 pm »
Why not simply implement the 'chip erase', and eliminate the boot_page_erase. At least then AS7 will be dealing with something that is actually doing what it was told to do.
 

Online HwAoRrDk (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Re: Bootloader woes - corruption of data being flashed
« Reply #21 on: February 23, 2019, 12:45:15 am »
Why not simply implement the 'chip erase', and eliminate the boot_page_erase. At least then AS7 will be dealing with something that is actually doing what it was told to do.

Eliminating the boot_page_erase() would not be a good thing to do. Each flash page is definitely supposed to be erased before writing; the datasheet says so: "Before programming a page with the data stored in the temporary page buffer, the page must be erased." (emphasis mine).

I can understand why chip erase is left unimplemented in AVR bootloaders - it's not particularly useful, and thus not worth wasting code space on. This is because, as I understand it, chip erase only really makes sense in the context of ISP SPI or Parallel programming, where there isn't actually any means to erase a page before writing it. The only way to erase anything using those methods is to do a chip erase.

Chip erase isn't totally useless in a bootloader, though. I suppose if one is programming a smaller-size application than already exists on flash, then without a chip erase you'll get some pages of the previous contents left untouched in flash. Not really important though, unless one wants to be 'tidy'. But I still don't think I'll implement it.



I found another instance where this bootloader code is deficient, although in this case more by omission than by doing anything wrongly. It doesn't handle the situation where the checksum of a received command is incorrect. In that situation, it's supposed to respond with an ANSWER_CKSUM_ERROR answer. If my interpretation of the code is correct, what it actually does is ignore the entire command and simply wait to receive the next one. The sending application would be none the wiser that anything was wrong, and just keep on truckin'. :o
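
For anyone wanting to test this off-target, the check itself is small. As I understand the protocol, the checksum is the XOR of every byte in the message, from the start byte through the end of the body. The sketch below is a host-side illustration with made-up function names, not the bootloader's actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* XOR of all bytes in buf. */
static uint8_t stk500v2_checksum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum ^= buf[i];
    }
    return sum;
}

/* A complete frame (header, body and trailing checksum byte) is valid
 * when the XOR over every byte, checksum included, comes out as zero. */
static bool stk500v2_frame_ok(const uint8_t *frame, size_t len)
{
    return len > 0 && stk500v2_checksum(frame, len) == 0;
}
```

As a usage example, the six bytes 1B 01 00 01 0E 01 (a sign-on command with sequence number 1) XOR to 0x14, so the seven-byte frame ending in 0x14 passes, and flipping any single bit makes stk500v2_frame_ok return false. A receiver would answer ANSWER_CKSUM_ERROR in the failing case instead of staying silent.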

I have now implemented the proper behaviour. It only added about 10 bytes to the code size. ^-^

Y'know, it strikes me that when Mr. Fleury was implementing the command-handling code, following the AVR068 app note's state table describing how commands are handled, he didn't quite grasp that the table illustrates how the sender works - e.g. what Atmel Studio, avrdude, et al. do. The behaviour of the receiver doesn't necessarily follow it exactly, and there is a paragraph on that page saying so. This would explain why that part of the code gets wrong not only the handling of bad checksums but also sequence numbering (that is one of the fixes implemented in some of the other forks that tsman linked to earlier). :palm:
 

Online cv007

  • Frequent Contributor
  • **
  • Posts: 827
Re: Bootloader woes - corruption of data being flashed
« Reply #22 on: February 23, 2019, 01:00:08 am »
Quote
Why not simply implement the 'chip erase'
There is a reason I single quoted chip erase- because there cannot be an actual isp type chip erase performed, but that does not mean you cannot do a 'chip erase'. When you get a chip erase command, simply erase all app space pages (you will still have a boot page erase function, but it will only run inside the chip erase command).

Then when AS7 starts poking around, it will probably see erased pages when it expects it. And you eliminate the page erase address increment problem at the same time.

You just move a little chunk of code, and change an 'if' to a 'while'

Code: [Select]
case CMD_CHIP_ERASE_ISP:
    //no real isp-type chip erase is possible here, so erase every app space page instead
    eraseAddress = 0;
    while( eraseAddress < APP_END ){
        boot_page_erase(eraseAddress);
        boot_spm_busy_wait(); //wait for this page erase to finish
        eraseAddress += SPM_PAGESIZE;
    }
    boot_rww_enable(); //and add this I guess, in case someone wants to do any reading after the chip erase
    msgLength = 2;
    msgBuffer[1] = STATUS_CMD_OK;
    break;
« Last Edit: February 23, 2019, 01:57:46 am by cv007 »
 

Offline mikeselectricstuff

  • Super Contributor
  • ***
  • Posts: 13748
  • Country: gb
    • Mike's Electric Stuff
Re: Bootloader woes - corruption of data being flashed
« Reply #23 on: February 23, 2019, 10:47:21 am »
Eliminating the boot_page_erase() would not be a good thing to do. Each flash page is definitely supposed to be erased before writing; the datasheet says so: "Before programming a page with the data stored in the temporary page buffer, the page must be erased." (emphasis mine).
In most flash implementations, program operations can only change '1' values to '0', so erase must be used to reset everything to 0xff
Youtube channel:Taking wierd stuff apart. Very apart.
Mike's Electric Stuff: High voltage, vintage electronics etc.
Day Job: Mostly LEDs
 

Online cv007

  • Frequent Contributor
  • **
  • Posts: 827
Re: Bootloader woes - corruption of data being flashed
« Reply #24 on: February 23, 2019, 04:39:09 pm »
Quote
Why not simply implement the 'chip erase', and eliminate the boot_page_erase
I did kind of phrase that wrong. The boot_page_erase is not eliminated; what is eliminated is the page erase 'as needed' during programming, since you would actually be following the protocol if a 'chip erase' is implemented.

Quote
in most flash implementations, program operations can only change '1' values to '0', so erase must be used to reset everything to 0xff
Which is why I referred to them as a setbit/clearbit operation-
Quote
Page erase- all page bytes are 0xFF (a 'setbit' operation)
Fill page write buffer, any unused bytes remain 0xFF (writing to the page buffer is a 'clearbit' operation)
Program page (a 'clearbit' operation), any bits set in the page write buffer will not change corresponding bits in the page

You could program the flash 1 bit at a time if you wanted. You can also use it to your advantage when 'encrypting' an app. I wrote a bootloader for an avr (256 words) that took a standard hex file via uart which programmed one record at a time. Optionally, I had 'encryption' where the hex file records were 'encrypted' (address and data fields only so as to not give away any info on the known fields)- the records were in a random order to not give away addresses and there were records that were 'split up' taking advantage of the 'clearbit'-only way the flash is programmed. The 'encryption' simply used an lfsr+xor where the lfsr starting point was given somewhere in the first few records. The bootloader did not allow programming until an erase was done (and verified erased).
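
As a rough illustration of the 'clearbit' behaviour described above, programming can be modelled on a host as a bitwise AND of the existing flash contents with the page buffer. This is only a model under that assumption, not AVR code, and flash_program_byte is a made-up name.

```c
#include <stdint.h>

/* Host-side model (assumption): programming can only clear bits,
 * so the result is the old flash byte ANDed with the buffer byte. */
static uint8_t flash_program_byte(uint8_t current, uint8_t buffer)
{
    return (uint8_t)(current & buffer);
}
```

With this model, a byte that should end up as 0x55 can be 'split up' over two records: programming 0xF5 into an erased byte and later 0x5F into the same byte leaves 0x55. Going the other way (getting a 0 bit back to 1) needs a page erase.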

With something like a pic32mm, you cannot take advantage of the 'clearbit' flash operation as they have ecc bits so you end up getting ecc errors (the ecc bits are also clear-only and are written on each write, so although you can write multiple times to the same word the ecc will now be wrong and you get an ecc exception when trying to read).

The 'chip erase' function I showed previously could also be sped up by simply reading the words in the page until something other than 0xFFFF is found- only then do a page erase. Probably not a big deal though, as we are talking about a worst case time of a little over a second to erase all pages (and probably a lot of those will have to be erased anyway, so the gain is fractions of a second).
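
A minimal host-side sketch of that 'only erase if not already blank' idea, using a simulated flash array so it can be run on a PC. PAGE_SIZE, APP_PAGES, sim_flash and the helper names are all made up for illustration; on the real part the read would go through something like pgm_read_word and the erase through boot_page_erase.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 128u  /* assumed page size in bytes (SPM_PAGESIZE on the chip) */
#define APP_PAGES 4u    /* tiny simulated app area, just for the example */

static uint8_t  sim_flash[PAGE_SIZE * APP_PAGES]; /* stand-in for app flash */
static unsigned erase_count;                      /* how many erases were done */

/* Stand-in for boot_page_erase + boot_spm_busy_wait. */
static void sim_page_erase(uint32_t addr)
{
    memset(&sim_flash[addr], 0xFF, PAGE_SIZE);
    erase_count++;
}

/* Returns 1 if every byte in the page is already 0xFF. */
static int page_is_blank(uint32_t addr)
{
    for (uint32_t i = 0; i < PAGE_SIZE; i++) {
        if (sim_flash[addr + i] != 0xFF) {
            return 0;
        }
    }
    return 1;
}

/* 'Chip erase' that skips pages which are already blank. */
static void chip_erase_skip_blank(void)
{
    for (uint32_t addr = 0; addr < sizeof sim_flash; addr += PAGE_SIZE) {
        if (!page_is_blank(addr)) {
            sim_page_erase(addr);
        }
    }
}
```

The blank check stops at the first non-0xFF byte, so pages holding code are detected almost immediately and only genuinely blank pages cost a full read.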
 

