The answer can get complicated pretty quickly, but the main reasons have to do with achieving high dynamic range, good image rejection, good frequency accuracy and low phase noise.
For example, to get good image rejection, you want the conversion images to be far away from the frequency you're tuned to, which makes them easy to filter at the front end. Since the image is 2x the IF away from the wanted signal, the higher the IF the better. But in order to get good selectivity (narrow RBW), you want the IF to be low. You also have to consider the difficulty of getting low-noise, high-accuracy/stability LOs - much more difficult to do for very wide frequency ranges.
These are a few of the many reasons that multiple conversion stages are used in spectrum analyzers (and receivers, etc.).
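To put rough numbers on the image-rejection point, here is a small Python sketch of the "image is 2x the IF away" arithmetic for a single mixing stage. The function name `image_frequency`, the 1 GHz input and the two IF values are all made up for illustration; they aren't taken from any particular analyzer.

```python
def image_frequency(f_rf_hz, f_if_hz, high_side_lo=True):
    """Return (LO frequency, image frequency) in Hz for one mixing stage.

    Assumes an ideal mixer where the IF is |f_lo - f_rf|.
    """
    if high_side_lo:
        f_lo = f_rf_hz + f_if_hz      # LO sits above the wanted signal
        f_image = f_lo + f_if_hz      # image ends up at RF + 2*IF
    else:
        f_lo = f_rf_hz - f_if_hz      # LO sits below the wanted signal
        f_image = f_lo - f_if_hz      # image ends up at RF - 2*IF
    return f_lo, abs(f_image)


if __name__ == "__main__":
    f_rf = 1.0e9  # hypothetical 1 GHz input signal
    for f_if in (10.7e6, 4.0e9):  # hypothetical low IF vs. high IF
        f_lo, f_img = image_frequency(f_rf, f_if)
        print(f"IF = {f_if / 1e6:.1f} MHz: LO = {f_lo / 1e6:.1f} MHz, "
              f"image = {f_img / 1e6:.1f} MHz, "
              f"separation = {abs(f_img - f_rf) / 1e6:.1f} MHz")
```

With the low IF the image is only 21.4 MHz from the wanted signal, so a front-end filter would have to be narrow and track the tuning. With the high IF the image is 8 GHz away and a simple fixed filter can deal with it. A later conversion down to a low IF is then where you'd get the narrow RBW.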