Going to ultrasound, which has a speed of roughly 3E4 cm/sec in air, changes this to somewhat less than 0.1 millisecond, which is much easier to measure with simple equipment.
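To put numbers on that, here is a minimal sketch of the arithmetic. The speed of sound (3.43E4 cm/sec, dry air at about 20 °C), the 1 cm target resolution and the 1 MHz timer clock are my assumptions for illustration, not values from a real design.

```python
# Assumed speed of sound in dry air at about 20 degrees C, in cm/s.
SPEED_OF_SOUND_CM_S = 3.43e4


def time_resolution_s(distance_resolution_cm, speed_cm_s=SPEED_OF_SOUND_CM_S):
    """Timing resolution needed to resolve a given distance along the path."""
    return distance_resolution_cm / speed_cm_s


def distance_per_tick_cm(timer_hz, speed_cm_s=SPEED_OF_SOUND_CM_S):
    """Distance the sound travels during one tick of a timer running at timer_hz."""
    return speed_cm_s / timer_hz


if __name__ == "__main__":
    # Hypothetical 1 cm target resolution: roughly 29 microseconds.
    print(time_resolution_s(1.0))
    # Hypothetical 1 MHz timer clock: roughly 0.034 cm per tick.
    print(distance_per_tick_cm(1e6))
```

So even a timer ticking at 1 MHz resolves well under a millimeter of path length, which is why cheap microcontrollers are enough here.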
Ultrasound is certainly doable, either by simple time of flight measurement and trilateration (one needs at least 3 measurements to obtain a position fix) or by going the whole hog and doing multilateration from the phase (arrival time) differences between receivers. I implemented an ultrasonic 3D "mouse" back in 1998 using an AT89C2051 micro and a PC for doing the data processing. The mouse had an ultrasonic transmitter and there were 4 fixed receivers.
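For illustration, a time of flight position fix with 4 receivers can be sketched as below. This is not the code from my 1998 project; the receiver coordinates, the speed of sound and the function names are hypothetical. With 4 receivers that are not all in one plane, subtracting the first sphere equation from the other three leaves a plain 3x3 linear system, which avoids the mirror-image ambiguity you get with only 3 receivers.

```python
import math

SPEED_OF_SOUND_CM_S = 3.43e4  # assumed: dry air at about 20 degrees C


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def trilaterate(receivers, tofs, speed=SPEED_OF_SOUND_CM_S):
    """Position (cm) from 4 receiver positions (cm) and 4 times of flight (s)."""
    if len(receivers) != 4 or len(tofs) != 4:
        raise ValueError("this sketch needs exactly 4 receivers")
    dists = [speed * t for t in tofs]
    p0, d0 = receivers[0], dists[0]
    n0 = sum(c * c for c in p0)
    # Subtracting sphere 0 from sphere i gives a linear equation:
    # 2 (p_i - p_0) . x = |p_i|^2 - |p_0|^2 - d_i^2 + d_0^2
    a, b = [], []
    for p, d in zip(receivers[1:], dists[1:]):
        a.append([2.0 * (pc - p0c) for pc, p0c in zip(p, p0)])
        b.append(sum(c * c for c in p) - n0 - d * d + d0 * d0)
    det = det3(a)
    if abs(det) < 1e-9:
        raise ValueError("receivers must not all lie in one plane")
    # Cramer's rule: replace one column of the matrix with b at a time.
    position = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        position.append(det3(m) / det)
    return tuple(position)


if __name__ == "__main__":
    # Hypothetical layout: receivers at the corners of a 1 m frame.
    receivers = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0),
                 (0.0, 100.0, 0.0), (0.0, 0.0, 100.0)]
    true_pos = (30.0, 40.0, 50.0)
    tofs = [math.dist(true_pos, r) / SPEED_OF_SOUND_CM_S for r in receivers]
    print(trilaterate(receivers, tofs))  # close to (30.0, 40.0, 50.0)
```

With noisy real measurements you would normally use more receivers and a least-squares fit instead of this exact solve.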
However, even an ultrasonic system can get complicated fast - e.g. in my case I wasn't able to take the time measurements simultaneously because that micro has only 2 timers and one was already generating the 40 kHz ultrasonic ping. The readings then come from different moments, which leads to large errors unless the object moves very slowly. A modern micro with 4 or more input capture channels wouldn't have that issue.
Then there is the problem with ultrasound transducers - most are very directional, so if you want to cover a larger volume, you will need many of them (you need to have at least 3 signals at any given time in order to be able to track). Timing that many channels will likely require an FPGA. Or you will have to find some special transducers with wide angle sensitivity, but those aren't common (nor cheap).
Another issue is detecting when the "ping" has actually arrived. If you use only a simple threshold, like I did, you will get errors due to external noise and due to the characteristics of the transducers (they don't transmit/receive with equal power in all directions). This is solvable the way sonar does it - by sending a modulated "chirp" instead and then cross-correlating the incoming signal with the known chirp to find where the "chirp" sits in it, regardless of any sensor-related attenuation or false signals. But that requires some decent processing power on the micro/FPGA because you are doing that for every channel.
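The chirp idea can be sketched like this. It is only a toy simulation, not what I ran in 1998: the sample rate, the 35-45 kHz sweep, the noise level and the attenuation are all made-up values, and the brute-force correlation loop is written for clarity rather than speed.

```python
import math
import random


def make_chirp(n_samples, f_start, f_end, sample_rate):
    """Linear chirp sweeping from f_start to f_end (Hz) over n_samples."""
    duration = n_samples / sample_rate
    sweep_rate = (f_end - f_start) / duration  # Hz per second
    chirp = []
    for i in range(n_samples):
        t = i / sample_rate
        chirp.append(math.sin(2 * math.pi * (f_start * t + 0.5 * sweep_rate * t * t)))
    return chirp


def find_arrival(received, template):
    """Lag (in samples) at which the template correlates best with the input."""
    n = len(template)
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(received) - n + 1):
        score = sum(r * c for r, c in zip(received[lag:lag + n], template))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical setup: 400 kHz sampling, 0.5 ms chirp around 40 kHz.
random.seed(1)
SAMPLE_RATE = 400_000
chirp = make_chirp(200, 35_000, 45_000, SAMPLE_RATE)

# Received signal: background noise plus an attenuated copy of the chirp.
true_delay = 317
received = [random.gauss(0.0, 0.1) for _ in range(1000)]
for i, c in enumerate(chirp):
    received[true_delay + i] += 0.5 * c

estimated_delay = find_arrival(received, chirp)
print(estimated_delay, estimated_delay / SAMPLE_RATE)  # lag in samples and seconds
```

The position of the correlation peak does not depend on the received amplitude, which is what makes this more robust than a fixed threshold. The cost is visible in the inner loop: one multiply-accumulate per template sample per lag, for every receiver channel.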