I want to implement a simple second order Butterworth filter. But I think I will have to implement it with fixed point operations (with integers) to gain speed.

You want either arm_biquad_cascade_df1_q15() or arm_biquad_cascade_df1_q31(), for 16-bit and 32-bit integer data respectively.

See https://arm-software.github.io/CMSIS-DSP/latest/group__BiquadCascadeDF1.html

A second order IIR needs just a single biquad stage.
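To make "a single biquad stage" concrete, here is a minimal plain C sketch of one direct form I stage in float. It is not the CMSIS implementation; the type and function names (biquad_df1_t, biquad_df1_process) are made up for illustration. It borrows the coefficient order {b0, b1, b2, a1, a2} and the sign convention that CMSIS-DSP documents for the float and q31 variants: the feedback terms are added, so a1 and a2 from design tools that write the denominator as 1 + a1 z^-1 + a2 z^-2 must be negated first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one direct form I biquad stage.
   coeffs = {b0, b1, b2, a1, a2}, feedback terms are ADDED:
   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] + a1*y[n-1] + a2*y[n-2] */
typedef struct {
    float coeffs[5];
    float x1, x2;   /* previous two inputs  */
    float y1, y2;   /* previous two outputs */
} biquad_df1_t;

/* Filter n samples from in[] to out[], keeping state between calls. */
void biquad_df1_process(biquad_df1_t *s, const float *in, float *out, size_t n)
{
    assert(s != NULL && in != NULL && out != NULL);
    const float *c = s->coeffs;
    for (size_t i = 0; i < n; i++) {
        float x = in[i];
        float y = c[0] * x + c[1] * s->x1 + c[2] * s->x2
                + c[3] * s->y1 + c[4] * s->y2;
        s->x2 = s->x1;  s->x1 = x;   /* shift input history  */
        s->y2 = s->y1;  s->y1 = y;   /* shift output history */
        out[i] = y;
    }
}
```

That is five multiplies and four adds per sample plus the state shuffling, which is why the cycle budget below is worth measuring.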

Note that most integer CMSIS DSP functions work with Q numbers. In the end this is just a matter of scaling, but you need to take care.
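As a rough illustration of what "just a matter of scaling" means for Q15 (1 sign bit, 15 fractional bits, range -1.0 to just under +1.0), here is a small plain C sketch. The helper names are made up for this post; CMSIS-DSP ships its own conversion routines such as arm_float_to_q15(), which you would normally use instead.

```c
#include <stdint.h>

/* Convert a float in [-1.0, 1.0) to Q15: scale by 2^15, round to nearest,
   and saturate instead of wrapping around at the ends of the range. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;
    if (scaled >= 32767.0f)  return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    return (int16_t)scaled;
}

/* Convert Q15 back to float. */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/* Multiply two Q15 numbers. The 32-bit product is Q30, so it is shifted
   right by 15 to get back to Q15. Assumes an arithmetic right shift for
   negative values, which is what GCC and Clang do on ARM. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 15;
    if (p > INT16_MAX) p = INT16_MAX;   /* only (-1.0) * (-1.0) overflows */
    return (int16_t)p;
}
```

The "take care" part is mostly the saturation and keeping track of where the binary point sits after each multiply.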

I have about 30 CPU cycles per ADC sample (at 2666kHz ADC), so the operations have to be very fast.

I doubt that 30 cycles per sample are enough for a biquad stage, but I may be wrong. You need to measure. I'd measure all three: q15, q31 and float. Don't forget to enable optimization (-O2). You can also check whether -O3 is even faster.

Another thing I can do is to group data and apply the filter to aggregate data, not to all samples.

Decimation with a 1st order boxcar filter is likely the fastest you can do. Just add up (say) 16 adjacent samples and replace them with a single sample containing the sum, then do the same for the next 16, and so on. I guess this fits into 5 cycles/sample, which leaves (30-5)*16=400 cycles for filtering each decimated sample. However, a 1st order boxcar filter is not a good anti-aliasing filter at all. This may or may not matter, depending on the frequency of a potential undesired (picked-up) interfering signal. If the downsampling happens to fold that frequency into the frequency band of interest, then it matters.
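A minimal sketch of the boxcar decimation in plain C, assuming unsigned 16-bit ADC samples and a fixed factor of 16 (both are my assumptions, and the function name is made up). The sums go into a 32-bit buffer because 16 full-scale 16-bit samples need 20 bits; with a 12-bit ADC the sum would still fit into 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECIM 16u   /* assumed decimation factor */

/* Replace each group of DECIM adjacent samples by their sum.
   n_in must be a multiple of DECIM. Returns the number of output samples. */
size_t boxcar_decimate(const uint16_t *in, size_t n_in, uint32_t *out)
{
    assert(n_in % DECIM == 0);
    size_t n_out = n_in / DECIM;
    for (size_t i = 0; i < n_out; i++) {
        uint32_t acc = 0;
        for (size_t k = 0; k < DECIM; k++)
            acc += in[i * DECIM + k];
        out[i] = acc;   /* sum, not mean: the gain of 16 is kept as extra bits */
    }
    return n_out;
}
```

Whether the inner loop really costs about 5 cycles/sample depends on the core and the compiler, so this needs to be measured like everything else.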

What cutoff frequency do you have in mind for the lowpass?

EDIT: The larger the ratio of sample rate to cutoff frequency, the higher the precision required for the coefficients and the calculation. It can then happen that Q15 (16-bit) is not sufficient.
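To put a number on this, here is a small sketch that quantizes the b0 coefficient of a 2nd order Butterworth lowpass designed with the bilinear transform, b0 = K^2 / (1 + sqrt(2)*K + K^2) with K = tan(pi*fc/fs). To keep it free of libm I approximate tan(x) by x, which is only reasonable for large fs/fc ratios; that shortcut, the helper names and the plain 1.15 / 1.31 coefficient scaling without any post shift are my assumptions, not something taken from CMSIS.

```c
#include <assert.h>
#include <stdint.h>

#define PI    3.14159265358979323846
#define SQRT2 1.41421356237309504880

/* b0 of a 2nd order Butterworth lowpass (bilinear transform).
   ASSUMPTION: tan(pi*fc/fs) is approximated by pi*fc/fs, so this is
   only meant for large fs/fc ratios. */
double butter2_b0(double fs_over_fc)
{
    assert(fs_over_fc >= 50.0);
    double k = PI / fs_over_fc;
    return (k * k) / (1.0 + SQRT2 * k + k * k);
}

/* Round a coefficient in [0, 1) to an integer with frac_bits fractional
   bits (15 for Q15, 31 for Q31), saturating at INT32_MAX. */
int32_t quantize(double coeff, int frac_bits)
{
    assert(coeff >= 0.0 && coeff < 1.0);
    assert(frac_bits > 0 && frac_bits <= 31);
    double scaled = coeff * (double)(1ULL << frac_bits) + 0.5;
    if (scaled >= 2147483647.0) return INT32_MAX;
    return (int32_t)scaled;
}
```

With fs/fc = 1000, b0 is about 1e-5, so quantize(butter2_b0(1000.0), 15) rounds to 0 and the Q15 filter would pass no signal at all, while 31 fractional bits still leave a usable value. At the same time a1 and a2 get very close to 2 and 1 in magnitude, so small rounding errors there shift the poles noticeably.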