The problem with just summing is that you are summing the error as well, as has been said. Average the error out by dividing the sum of your 500 samples by 500, and you should get a very stable output from the ADC.
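As a rough sketch of the averaging, assuming 12-bit right-aligned results: `adc_read_fn` and `adc_read_averaged` are names I made up for illustration, not part of any ST library, and `fake_adc_read` is only a stand-in so the code can be tried on a PC. On the real board you would pass in whatever function returns one raw conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for "give me one raw conversion result".
   On target this would wrap your own ADC read; it is not a vendor API. */
typedef uint16_t (*adc_read_fn)(void);

/* Sum `count` raw readings, then divide by `count`, rounding to the
   nearest code. A 32-bit accumulator is plenty for 12-bit samples:
   4095 * 500 is far below 2^32. */
static uint16_t adc_read_averaged(adc_read_fn read_sample, uint32_t count)
{
    assert(read_sample != NULL && count > 0);

    uint32_t sum = 0;
    for (uint32_t i = 0; i < count; i++) {
        sum += read_sample();
    }
    /* Adding count/2 before dividing rounds instead of truncating. */
    return (uint16_t)((sum + count / 2u) / count);
}

/* Stand-in source for host testing only: a mid-scale code of 2048 with a
   repeating zero-mean disturbance of a few LSB on top. */
static uint16_t fake_adc_read(void)
{
    static const int16_t offsets[5] = { -3, 2, 0, 3, -2 };
    static uint32_t index = 0;

    uint16_t sample = (uint16_t)(2048 + offsets[index]);
    index = (index + 1u) % 5u;
    return sample;
}
```

With the fake source, `adc_read_averaged(fake_adc_read, 500)` returns the underlying 2048, even though individual samples wander by up to 3 codes. Keep in mind that averaging only removes noise that is roughly zero-mean over the window; a constant offset or slow drift goes straight through.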
In terms of the board: yes, it's not a spectacularly good design (lack of decoupling, etc.), but since you have nothing else connected, it should not be that noisy. Also make sure that your debugger and the DUT are at the same ground potential and that you are using good-quality USB cables, as the noise might be induced from your PC. I've often found the best way to characterize a system's ADC is to power it from the design PSU and get the samples off via either BT or IrDA. PCs are noisy bastards, and if you are sharing a ground plane with one, you will need a lot more than just a few caps to clean that power up enough to keep 0.8mV of ripple from affecting your ADC pins.
Also, does the F4 you are using have a dedicated voltage reference, or are you referencing to Vcc? If it is the latter, the supply ripple will also affect the ADC.