It is of course highly dependent on both the target in question and the code itself.
In my experience, -O3 yields faster execution than -O2 in many cases and is otherwise usually at least as fast; I've never personally run into a case where it was slower, so it's my usual default. (The exception is very small targets, where optimizing for size is critical: -O3 inlines code aggressively, so code size can inflate significantly, depending of course on your code structure.) I benchmarked sorting algorithms recently and consistently got about 10% faster execution with -O3 than with -O2 on PC targets. With some compute-intensive code (especially floating point), the gain can be as high as 30 to 50%.
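For reference, below is roughly how I time that kind of thing. It's not my actual benchmark, just a minimal, self-contained sketch with a hand-written quicksort; libc's qsort() is deliberately avoided, since the library is already compiled and wouldn't reflect your own optimization flags.

/* bench.c -- a minimal sketch, not my actual benchmark.
 * Build the same file at each level and compare, e.g.:
 *   gcc -O2 -o bench_o2 bench.c
 *   gcc -O3 -o bench_o3 bench.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2000000

/* plain recursive quicksort with Hoare-style partitioning */
static void quicksort(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;
    int pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++; j--;
        }
    }
    quicksort(a, lo, j);
    quicksort(a, i, hi);
}

int main(void)
{
    int *data = malloc(N * sizeof *data);
    if (!data)
        return 1;
    srand(42);                       /* fixed seed for repeatable runs */
    for (int i = 0; i < N; i++)
        data[i] = rand();

    clock_t t0 = clock();
    quicksort(data, 0, N - 1);
    clock_t t1 = clock();

    /* print a couple of elements so the sort can't be optimized away */
    printf("sorted %d ints in %.0f ms (first=%d last=%d)\n",
           N, (t1 - t0) * 1000.0 / CLOCKS_PER_SEC, data[0], data[N - 1]);
    free(data);
    return 0;
}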
I did some tests with Bruce's code, and I confirm that I also get the same execution time with -O2 and -O3 on my Core i7: 2490 ms. I then tried -Ofast, and it's actually slightly slower at 2550 ms, which is consistent with my previous benchmarks, where I often found -Ofast slower than -O3. This is not that surprising: -Ofast is essentially -O3 plus -ffast-math, which mostly affects floating-point code, and execution time depends on many other factors, including how code and data are cached.
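To illustrate (a hypothetical snippet, not anything from Bruce's code), -ffast-math mostly buys you something on floating-point reductions like this one, where it lets the compiler reassociate and vectorize the accumulation at the cost of strict IEEE 754 semantics. On integer-heavy code like a sort, there's little for it to do.

/* fsum.c -- hypothetical example of code that -Ofast actually changes.
 * With plain -O3, the additions must happen in source order (IEEE 754
 * semantics), which serializes the loop on the accumulator. With -Ofast
 * (-ffast-math), the compiler may reassociate the sum and use SIMD,
 * often much faster, at the cost of slightly different rounding. */
#include <stddef.h>

double fsum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}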
Of course, benchmarking across different targets is a tricky business, and how you do it depends entirely on your goals. If you want to know the fastest a given algorithm can execute on a given CPU, you could write it directly in optimized assembly, using any specialized instructions available to speed things up; that would really take full advantage of said CPU. But if you want more of a typical "feel" for what you'd get with real-life code in a high-level language, using generic C and moderate optimization levels makes sense.