An FIR filter with N taps has a delay of (N-1)/2 samples. Source: https://dspguru.com/dsp/faqs/fir/properties/
I believe this applies only to linear-phase filters; mine certainly will not be linear phase, as the coefficients come from an impulse response (from an FRF measurement). I know I will have some delay, and at some point I will have to make an engineering choice or simply accept that it only operates at lower frequencies. I'm still running mechanical experiments to see how long the delay is for various values of N and Fs. To do this test, I'm generating a drive signal to send to my actuator, based on the FRF and a desired reference time signal. Then I'm going to overlay the response that the drive signal produces when played back to the actuator with the original reference that was used to create it. Comparing the two should give me a sense of the delay introduced by the N-point convolution.
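One way to turn the overlay comparison into a number is to cross-correlate the reference with the measured response and pick the lag with the largest score. This is a minimal sketch of that idea, not the procedure described above: the function name `estimate_delay`, the synthetic signals, and the 7-sample delay are all made up for illustration, and it assumes the response can only lag the reference (non-negative lags).

```python
import random

def estimate_delay(reference, response, max_lag):
    """Return the lag (in samples) at which response best matches reference."""
    def score(lag):
        # Correlation between the reference and the response shifted back by `lag`.
        return sum(r * y for r, y in zip(reference, response[lag:]))
    return max(range(max_lag + 1), key=score)

# Synthetic stand-in for a measurement: a broadband reference and a response
# that is simply a scaled copy delayed by a known number of samples.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(200)]
true_delay = 7                                   # hypothetical value
response = [0.0] * true_delay + [0.8 * v for v in reference]

lag = estimate_delay(reference, response, max_lag=50)
print("estimated delay in samples:", lag)
```

Dividing the lag by Fs gives the delay in seconds. A narrowband reference (for example a single sine) produces an ambiguous correlation peak, so a broadband reference is the safer choice for this kind of test.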
Whether the filter is linear phase is largely irrelevant to the inherent algorithmic latency it introduces, and it really doesn't matter whether you are using an FIR or some other type of filter. An impulse causes its impulse response. This issue of causality means all the interesting stuff that goes into your filter happens after the impulse. It might be the impulse response dies really quickly after the impulse, or it might die very slowly, but until it has dropped below the threshold that is important to you, you can't get an output from the filter. This kind of delay is usually far longer than the delays caused by sigma-delta converters, and so the converter is not the dominant source of algorithmic delay in most cases.
This is not true, if I am reading you correctly. You can produce the first sample of the impulse response the instant the impulse comes in. Think about how nature works. Linear systems in the natural world don't have to wait for the length of their impulse responses before producing first output, do they?
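A minimal sketch of this point, assuming a direct-form FIR run one sample at a time. The taps are hypothetical and deliberately non-symmetric; the helper name `make_fir` is also just for illustration.

```python
from collections import deque

def make_fir(h):
    """Return a per-sample processor for a direct-form FIR with taps h."""
    state = deque([0.0] * len(h), maxlen=len(h))  # newest input at index 0
    def step(x):
        state.appendleft(x)                       # oldest sample falls off the end
        return sum(c * s for c, s in zip(h, state))
    return step

h = [0.5, 0.3, 0.2, -0.1]                         # hypothetical taps
step = make_fir(h)
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
out = [step(x) for x in impulse]
print(out)
```

The first output sample equals h[0] and is available on the same sample the impulse arrives; the remaining taps follow one per sample. The filter does not wait for its full length before responding.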
The oft-quoted (N-1)/2 delay, as already mentioned, only applies to linear-phase filters. Linear phase implies (and is implied by) symmetry or anti-symmetry in the impulse response, which is where the (N-1)/2 sample group delay comes from.
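To see the difference numerically, the group delay of an FIR can be evaluated at a given frequency as the real part of DTFT{n*h[n]} / DTFT{h[n]}. The sketch below uses that identity with two made-up tap sets, one symmetric and one not; the names and values are assumptions for illustration only.

```python
import cmath

def group_delay(h, w):
    """Group delay in samples of FIR taps h at frequency w (radians/sample)."""
    H = sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(h))
    Hn = sum(n * c * cmath.exp(-1j * w * n) for n, c in enumerate(h))
    return (Hn / H).real          # undefined where H is zero

symmetric = [1.0, 2.0, 3.0, 2.0, 1.0]            # N = 5, so (N-1)/2 = 2
decaying = [1.0, 0.5, 0.25, 0.125, 0.0625]       # not symmetric

for w in (0.1, 0.5, 1.0):
    print(w, group_delay(symmetric, w), group_delay(decaying, w))
```

For the symmetric taps the result is 2 samples at every frequency tested, matching (N-1)/2. For the decaying taps the group delay changes with frequency and is well under (N-1)/2, which is why the formula cannot be applied to an arbitrary measured impulse response.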
The real question is how to perform the convolution efficiently (that is, in the frequency domain) without imposing a latency equal to the block size. The answer is what Marco said: partitioned convolution. If you'd like to learn more, a key Google search term is "frequency delay line." I also recommend Frank Wefers's thesis, "Partitioned Convolution Algorithms for Real-time Auralization":
http://publications.rwth-aachen.de/record/466561/files/466561.pdf
If you can live with, say, four or eight samples of latency, you can do the whole thing in the frequency domain using one section of short FFTs and another section of long FFTs; this is called a "nonuniform partition."
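For orientation, here is a rough sketch of the simpler uniform case: the impulse response is cut into equal partitions, each input block is transformed once, and a frequency delay line holds the most recent input spectra so that every partition multiplies the spectrum it needs. This is not the nonuniform scheme and not taken from the thesis; the function names, taps, and block size are assumptions, and a naive DFT stands in for a real FFT to keep the example self-contained.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) DFT; a real implementation would use an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def partitioned_convolve(x, h, block):
    """Uniformly partitioned overlap-save convolution with a frequency delay line."""
    parts = -(-len(h) // block)                       # ceil(len(h) / block)
    h = list(h) + [0.0] * (parts * block - len(h))
    # Spectrum of each filter partition, zero-padded to 2*block.
    H = [dft(h[k * block:(k + 1) * block] + [0.0] * block) for k in range(parts)]
    fdl = [[0j] * (2 * block) for _ in range(parts)]  # newest input spectrum first
    prev = [0.0] * block
    y = []
    for start in range(0, len(x), block):
        cur = list(x[start:start + block])
        cur += [0.0] * (block - len(cur))             # pad a short final block
        fdl = [dft(prev + cur)] + fdl[:-1]            # shift the delay line
        Y = [sum(fdl[k][i] * H[k][i] for k in range(parts))
             for i in range(2 * block)]
        y.extend(v.real for v in dft(Y, inverse=True)[block:])  # keep valid half
        prev = cur
    return y

def direct_convolve(x, h):
    """Reference time-domain convolution, truncated to len(x) samples."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

h = [0.9, -0.4, 0.3, 0.2, -0.1, 0.05, 0.02]           # hypothetical 7-tap response
x = [1.0, 0.5, -0.25, 0.0, 2.0, -1.0, 0.75, 0.1, 0.0, 0.0, 0.3, -0.6]
y_fdl = partitioned_convolve(x, h, block=4)
y_ref = direct_convolve(x, h)
max_err = max(abs(a - b) for a, b in zip(y_fdl, y_ref))
print("max difference from direct convolution:", max_err)
```

The point of the structure is that the buffering latency is set by `block`, not by the filter length: making the filter longer only adds partitions to the delay line. A nonuniform partition goes further by using small blocks for the start of the impulse response and larger ones for the tail.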
I'm using similar techniques right now to implement multi-thousand-tap FIR filters on a Cortex-M7 MCU for active noise cancelation. No coincidence, you see this stuff a lot in ANC and in your application, which is basically the same thing. Also in adaptive echo cancelers. Things get fun when you need to make your long FIR filter adaptive, using LMS or something even fancier.