The first thing to work out is what you want to achieve.
The second is how to do it.
The third is what to do when things go wrong.
The fourth is to work it all out in a way that you can fix later, when you discover you were wrong.
If you do not document your intentions and ideas at each step, you screw someone, usually a later version of yourself.
I liked Weiss' old Data Structures and Algorithm Analysis in C, because not only does it show good examples of various data structures, it also shows you how to evaluate and compare them along different properties (time and space complexity, best/worst/average case).
It seems that a lot of developers do not realize that on many architectures (possibly all non-microcontroller ones), typical tasks are constrained by memory bandwidth, and not CPU speed. Even in things like Discrete Cosine Transform (used in most audio and video compression schemes), the heavy computation itself is easy to optimize; the hard part is to efficiently decode and extract the components out of a binary stream.
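To illustrate why the bitstream side is the hard part, here is a minimal sketch of a variable-length decoder; the 4-bit length prefix and the end marker are hypothetical, not any real codec's format. The point is the serial dependency: you cannot know where symbol i+1 starts until symbol i has been decoded, so no amount of arithmetic optimization helps this loop.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        const uint8_t *data;
        size_t         bitpos;   /* absolute bit offset into data */
    } bitreader;

    static uint32_t read_bits(bitreader *br, unsigned n)
    {
        uint32_t v = 0;
        while (n--) {
            size_t   byte = br->bitpos >> 3;
            unsigned bit  = 7 - (br->bitpos & 7);   /* MSB-first */
            v = (v << 1) | ((br->data[byte] >> bit) & 1u);
            br->bitpos++;
        }
        return v;
    }

    /* Decoding loop: the width of each symbol depends on its prefix,
     * so iteration i+1 cannot even start until iteration i is done. */
    static size_t decode_stream(bitreader *br, int32_t *out, size_t max)
    {
        size_t n = 0;
        while (n < max) {
            unsigned len = read_bits(br, 4) + 1;  /* hypothetical length prefix */
            if (len == 16)                        /* hypothetical end marker */
                break;
            out[n++] = (int32_t)read_bits(br, len);
        }
        return n;
    }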
Current server and desktop architectures use multi-level caches to mitigate this, which means that to maximize memory efficiency, the memory access patterns must be such that the CPU can predict them to a high degree, transferring data between cache levels and memory efficiently, without unnecessary stalling.
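A concrete sketch of what a predictable access pattern means in practice: the two functions below sum exactly the same matrix, but the row-major loop walks memory in the order C stores it, so the prefetcher can stream it, while the column-major loop jumps N*sizeof(double) bytes per access and misses cache nearly every time. On a typical desktop the difference is easily severalfold, for identical arithmetic.

    #include <stddef.h>

    #define N 4096

    /* Row-major traversal: consecutive accesses are adjacent in
     * memory, so each cache line is fully used and the hardware
     * prefetcher can keep the pipeline fed. */
    double sum_rows(const double m[N][N])
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Column-major traversal of the same data: each access jumps
     * N*sizeof(double) bytes, defeating both cache lines and the
     * prefetcher. Same result, often several times slower. */
    double sum_cols(const double m[N][N])
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }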
A possibly surprising result is that an algorithm that behaves optimally in a microbenchmark may, in a real program on real data, be slower than an implementation that looks worse in that same microbenchmark. This is because in a microbenchmark, the cache is used only by the implementation under test. In a real program, the cache is shared by the entire task, and can even be evicted every now and then by process switches on that core. So, if you design an algorithm around specific sizes at specific cache levels, it is really unlikely to perform that way in a real program. That does not seem to stop people from developing just such algorithms and implementations...
Whenever parallel (or distributed) programming is discussed, the part that matters most for actual software development, the level at which parallelism yields the best results, is typically skipped entirely.
You can develop a 3D dot product that runs on three CPU cores, and optimize it to behave optimally even cache-wise, before realizing that the cost of saving and restoring the hardware state needed for each execution thread is several orders of magnitude higher than any speedup such an approach might gain, for any number of vectors.
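Here is a minimal sketch of exactly that mistake, using POSIX threads (the names are mine, just for illustration): each thread performs one multiplication worth roughly a nanosecond, while creating and joining it costs on the order of tens of microseconds, so the "parallel" version loses to the one-liner by orders of magnitude no matter how many vectors you feed it.

    #include <pthread.h>

    /* One thread per component of a single 3D dot product: a
     * textbook case of parallelizing at far too fine a granularity. */
    typedef struct { double a, b, product; } term;

    static void *multiply(void *arg)
    {
        term *t = arg;
        t->product = t->a * t->b;   /* ~1 ns of actual work... */
        return NULL;
    }

    double dot3_parallel(const double a[3], const double b[3])
    {
        pthread_t tid[3];
        term      t[3];
        for (int i = 0; i < 3; i++) {
            t[i].a = a[i];
            t[i].b = b[i];
            /* ...paid for with tens of microseconds of thread
             * creation, scheduling, and context-switch overhead. */
            pthread_create(&tid[i], NULL, multiply, &t[i]);
        }
        double s = 0.0;
        for (int i = 0; i < 3; i++) {
            pthread_join(tid[i], NULL);
            s += t[i].product;
        }
        return s;
    }

    /* The serial version the above can never beat. */
    double dot3(const double a[3], const double b[3])
    {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
    }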
The nastiest part about that is that it is usually a human-level question, not something you can compute or detect automatically, or even assume. Sometimes the user wants a task done in the background, using the fewest resources possible, because it can take a few minutes longer while the user is doing something more important. Other times, the CPU resources needed are not important, as long as the results are obtained with as little latency (in as little wall clock time) as possible.
Then, there is the problem of what "performance" means. When optimizing computation for speed, do you optimize wall clock time taken or CPU time taken? Those two are very, very different things. Or by performance, do you mean bandwidth? What about latencies? Or bounded execution times (any sort of guarantees on performance)?
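On POSIX systems the two are directly measurable and easy to compare: CLOCK_MONOTONIC tracks wall clock time, CLOCK_PROCESS_CPUTIME_ID tracks CPU time consumed by the process across all its threads. A sketch:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static double seconds(clockid_t id)
    {
        struct timespec ts;
        clock_gettime(id, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        double wall0 = seconds(CLOCK_MONOTONIC);
        double cpu0  = seconds(CLOCK_PROCESS_CPUTIME_ID);

        sleep(1);                 /* waiting: wall time passes, CPU time does not */
        for (volatile long i = 0; i < 100000000L; i++)
            ;                     /* spinning: both advance together */

        printf("wall: %.3f s, cpu: %.3f s\n",
               seconds(CLOCK_MONOTONIC) - wall0,
               seconds(CLOCK_PROCESS_CPUTIME_ID) - cpu0);
        return 0;
    }

The sleep advances only the wall clock; the spin loop advances both. Run the spin on four threads and the CPU time grows about four times faster than the wall time, which is exactly why optimizing one can pessimize the other.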
Many HPC simulations involve the Monte Carlo method. One of my favourite gripes is when a large simulation is spatially divided so that the same simulation can be calculated in parallel, i.e. distributed over multiple computational cores. If each random change affects a nonzero neighborhood, and it always does if the simulation is at all useful, changes near the core boundaries need to be communicated between cores, or the boundaries will show up in the final results. In most scientific papers on Molecular Dynamics where the base energy state of some chunk of matter is found using Monte Carlo simulation and atom swaps, you can clearly see the boundaries, because communicating between cores simply slows the computation down way too much -- several orders of magnitude, in fact. As far as I know, the usual workaround, sliding windows, has still not been proven to preserve detailed balance (meaning, to yield the same results as a single undivided simulation), but it definitely beats results containing core boundaries. My own solution, using a space-filling curve and transmitting information during computation, at least one window step ahead of it, is apparently too complicated for typical scientists to grok.
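For concreteness, here is what that boundary communication looks like in the simplest case I can write down: a 1D domain decomposition with one-site halos over MPI. The names and layout are mine, not from any of those papers. Every sweep must refresh the ghost cells from the neighbouring ranks, and it is this exchange, not the local arithmetic, that dominates once the per-rank chunk gets small.

    #include <mpi.h>

    #define LOCAL 1024            /* lattice sites owned by this rank */

    /* One halo exchange for a 1D periodic decomposition: cell[0] and
     * cell[LOCAL+1] are ghost copies of the neighbours' edge sites,
     * and must be refreshed before every sweep. */
    void exchange_halos(double cell[LOCAL + 2], int rank, int nranks)
    {
        int left  = (rank - 1 + nranks) % nranks;
        int right = (rank + 1) % nranks;

        /* Send my last real cell right, receive my left ghost cell. */
        MPI_Sendrecv(&cell[LOCAL], 1, MPI_DOUBLE, right, 0,
                     &cell[0],     1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Send my first real cell left, receive my right ghost cell. */
        MPI_Sendrecv(&cell[1],         1, MPI_DOUBLE, left,  1,
                     &cell[LOCAL + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }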
During that research, about a decade ago, I came to the conclusion that most programmers (or developers) do not understand that to get high performance, data must flow constantly. You cannot do computation, then communication, in turns: you need to do them at the same time to get your money's worth out of the hardware you have.
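In the same hypothetical MPI setting as above, that means posting the halo transfers as nonblocking operations, updating the interior sites (which do not depend on the ghost cells) while the network works, and only then waiting and finishing the edges:

    #include <mpi.h>

    #define LOCAL 1024

    /* Hypothetical local update using the two neighbouring sites. */
    static void update_site(double cell[], int i)
    {
        cell[i] = 0.5 * (cell[i - 1] + cell[i + 1]);
    }

    void step_overlapped(double cell[LOCAL + 2], int rank, int nranks)
    {
        int left  = (rank - 1 + nranks) % nranks;
        int right = (rank + 1) % nranks;
        MPI_Request req[4];

        /* Start the halo exchange, but do not wait for it yet. */
        MPI_Irecv(&cell[0],         1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&cell[LOCAL + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&cell[LOCAL],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&cell[1],         1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);

        /* Compute the interior while the network moves the edges. */
        for (int i = 2; i <= LOCAL - 1; i++)
            update_site(cell, i);

        /* Now block until the ghost cells have arrived... */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* ...and finish the two sites that needed them. */
        update_site(cell, 1);
        update_site(cell, LOCAL);
    }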
Using larger buffers does not help, either. Every buffer adds latency, because data has to flow in and fill it before continuing onwards. This is known as bufferbloat in packet-switched networks, but it affects all software, really. After all, when you read data from storage, do some kind of work on it, then write the data out, you are doing exactly what a network switch does to packets.
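That read-work-write loop is just this, written as a plain POSIX filter (the XOR is a stand-in for real work): the buffer size trades syscall overhead against how long each byte sits waiting, and making it huge only delays the first byte out without raising throughput once the pipeline is full.

    #include <unistd.h>

    #define BUF 65536   /* big enough to amortize syscalls, small
                           enough not to sit on data: a latency vs
                           throughput knob, not a free speedup */

    /* A minimal stdin -> work -> stdout filter. Like a switch port,
     * it holds each chunk only as long as the "work" takes. */
    int main(void)
    {
        static unsigned char buf[BUF];
        ssize_t n;

        while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
            for (ssize_t i = 0; i < n; i++)
                buf[i] ^= 0x55;              /* stand-in for real work */
            ssize_t off = 0;
            while (off < n) {                /* write may be partial */
                ssize_t w = write(STDOUT_FILENO, buf + off, n - off);
                if (w < 0)
                    return 1;
                off += w;
            }
        }
        return n < 0 ? 1 : 0;
    }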
Apologies for the ranty wall of text.