One thing that you do not take into account is the relative frequency of the various operations. In your example (a clock?), you display the number far more often (many times per second, to multiplex the display) than you do calculations on it (one increment per second), so the performance comparison is biased toward the digit-per-byte approach. If you were doing a lot of calculations before displaying, the four-byte approach would cost more in computation, even though its display code is simpler.
You can also consider:
1) Binary Coded Decimal (two 4-bit digits in each byte). As free-electron says, direct support for BCD used to be pretty common, but it isn't any more. (There's a small packed-BCD sketch after this list.)
2) Calculate display patterns only when the value changes. This requires a byte per displayed digit, plus at least one bit somewhere to indicate a value change, so it is more RAM-intensive than either version you have suggested, but the display code becomes trivial. (The second sketch below shows the idea.)
3) Some compilers optimize "divide by constant" operations, typically replacing them with multiply-and-shift sequences, so the division may cost less than you expect.
4) Eliminating the divide routine from your binary only helps if you avoid division ANYWHERE else in your code; use it once and the library routine gets linked in anyway.
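
Here is a minimal sketch of option 1, not code from your question: a 4-digit counter kept as packed BCD, two digits per byte, so individual digits come out with masks and shifts instead of divide/modulo. The names `bcd_increment` and `bcd_digit` are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* value[0] holds the two low digits, value[1] the two high digits */
static uint8_t value[2] = {0x00, 0x00};

static void bcd_increment(uint8_t *v)
{
    for (int i = 0; i < 2; i++) {
        uint8_t lo = v[i] & 0x0F;
        if (lo < 9) {                 /* low nibble has room: just bump it */
            v[i]++;
            return;
        }
        v[i] &= 0xF0;                 /* low nibble rolls over to 0 */
        if ((v[i] >> 4) < 9) {        /* carry into the high nibble */
            v[i] += 0x10;
            return;
        }
        v[i] = 0x00;                  /* whole byte rolls over; carry to next byte */
    }
}

/* digit 0 is the least significant; pure masking/shifting, no division */
static uint8_t bcd_digit(const uint8_t *v, uint8_t n)
{
    uint8_t b = v[n >> 1];
    return (n & 1) ? (b >> 4) : (b & 0x0F);
}

int main(void)
{
    for (int i = 0; i < 1234; i++)
        bcd_increment(value);

    printf("%u%u%u%u\n",              /* prints 1234 */
           bcd_digit(value, 3), bcd_digit(value, 2),
           bcd_digit(value, 1), bcd_digit(value, 0));
    return 0;
}
```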
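
And a second sketch for option 2: keep ready-to-send segment patterns in a small cache and rebuild them only when the value changes, so the frequently-run multiplex code is just a table copy. The segment table, the 4-digit layout, and the pin functions are assumptions, not from your question.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_DIGITS 4

/* common-cathode 7-segment patterns for 0-9, gfedcba bit order (assumed) */
static const uint8_t SEG_TABLE[10] = {
    0x3F, 0x06, 0x5B, 0x4F, 0x66, 0x6D, 0x7D, 0x07, 0x7F, 0x6F
};

static uint16_t counter = 0;             /* the value being displayed */
static uint8_t  seg_cache[NUM_DIGITS];   /* one prebuilt pattern per digit */
static volatile bool value_changed = true;

/* stand-ins for real port writes -- hypothetical hardware hooks */
static void select_digit(uint8_t index)     { (void)index; }
static void write_segments(uint8_t pattern) { (void)pattern; }

/* called whenever the value is updated, e.g. once per second */
static void set_value(uint16_t v)
{
    counter = v;
    value_changed = true;                /* tell the display code to rebuild */
}

/* runs in the main loop: pays the divide cost only when something changed */
static void refresh_cache(void)
{
    if (!value_changed)
        return;
    uint16_t v = counter;
    for (uint8_t i = 0; i < NUM_DIGITS; i++) {
        seg_cache[i] = SEG_TABLE[v % 10];
        v /= 10;
    }
    value_changed = false;
}

/* called from the multiplex timer interrupt, many times per second:
 * nothing here but a table lookup, so it stays short and deterministic */
static void multiplex_tick(void)
{
    static uint8_t current = 0;
    select_digit(current);
    write_segments(seg_cache[current]);
    current = (uint8_t)((current + 1) % NUM_DIGITS);
}

int main(void)
{
    set_value(1234);
    for (int tick = 0; tick < 8; tick++) {   /* simulate a few refresh cycles */
        refresh_cache();
        multiplex_tick();
    }
    printf("patterns: %02X %02X %02X %02X\n",
           seg_cache[3], seg_cache[2], seg_cache[1], seg_cache[0]);
    return 0;
}
```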
There's a maxim: "premature optimization is the root of all evil" (or words to that effect).
It's pretty much true. There is little point in making some code take 4 bytes of RAM, 128 bytes of code, and run in 342 cycles if the chip you're using has 1k of RAM, 8k of code, and is fast enough that your code only "must run" within 8000 cycles. The same frequently applies if you can buy a slightly different chip for an extra $0.25 that doubles the code and RAM space. The best bet is to write your code so that it is easily understandable and maintainable, and then, if (and only if) it turns out to be too slow or too big (or you can save $x by going to a smaller/cheaper chip), think about optimizing it.
(and at that point, you analyze the code and optimize the parts where you get the biggest return for your efforts.)