So we can clearly see that you are trained in marketing. But it's like Shark Tank: without a prototype, a patent, heck, even a white paper with example images, you are basically just an idea.
The thought is understandable: have AI watch all the data feeds and decide whether what it sees is normal or misconduct. But unless you have some cool tricks up your sleeve, AI won't do it. You say it will run on a small Linux-based device; good luck with that being powerful enough to feed any kind of network. A thermal imaging feed like the one you described, 320×240 at 9 Hz with 14-bit pixels, already gives you a lot of data. It is manageable, and real-time image classification for HD video at 60 fps is possible, but it needs a well-trained autoencoder and many levels of dimensionality reduction on the input vectors. Then you want to add a visible-light camera (for identification), an IR-illuminated camera for dark parts of the scene, a microphone or two, and whatever other sensors you can come up with. If your device is stationary you can eliminate a lot of repeating data, and your tCNN should as well.
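Just to put numbers on that "a lot of data" claim: here is a quick back-of-the-envelope calculation of raw bandwidth, using the figures above. The 24 bits/pixel for the visible-light camera is my assumption (standard 8-bit RGB), not something from the pitch.

```python
# Raw (uncompressed) bandwidth of one video feed in megabits per second.
# Thermal figures (320x240, 14-bit, 9 Hz) are from the post; 24 bpp for
# the HD RGB camera is an assumed standard, not a spec from the pitch.

def feed_rate_mbps(width, height, bits_per_pixel, fps):
    """Bits per second of one uncompressed feed, in Mbit/s."""
    return width * height * bits_per_pixel * fps / 1e6

thermal = feed_rate_mbps(320, 240, 14, 9)      # thermal core described above
hd_rgb = feed_rate_mbps(1920, 1080, 24, 60)    # HD visible-light @ 60 fps

print(f"thermal: {thermal:.1f} Mbit/s")   # ~9.7 Mbit/s
print(f"HD RGB:  {hd_rgb:.0f} Mbit/s")    # ~2986 Mbit/s, ~300x the thermal feed
```

So the thermal channel alone is fine, but the moment you bolt on an HD visible-light feed, your "small Linux device" has two orders of magnitude more raw data to chew through.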
The task is huge, and you will need training data: potentially upwards of 10k hours of all feeds at a good framerate.
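For a sense of scale, here is what 10k hours of just the thermal feed means in storage, uncompressed, using the frame size and rate from the post:

```python
# Storage estimate for a training corpus: 10,000 hours of the 320x240,
# 14-bit, 9 Hz thermal feed alone, stored uncompressed.

def dataset_terabytes(hours, width, height, bits_per_pixel, fps):
    """Raw size of an uncompressed video corpus in terabytes."""
    bits = hours * 3600 * fps * width * height * bits_per_pixel
    return bits / 8 / 1e12   # bits -> bytes -> TB

tb = dataset_terabytes(10_000, 320, 240, 14, 9)
print(f"{tb:.1f} TB")   # ~43.5 TB for the thermal channel alone
```

And that is the cheapest channel; add the HD camera and audio and you are well into the hundreds of terabytes, all of which someone has to label.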
If you want to reconstruct a 3D skeleton model of people in the room, you don't need AI: Kinect for room scale and Leap Motion for close-up hands already work fairly reliably.
You could use multiple devices in one room to create a 3D reconstruction and have a real-time model of what is happening. Thermal cameras could give you information about how much skin is showing, arousal status, and maybe even touch, since fingerprints and footprints leave heat traces.
Yet without seeing a single data point, or even a drawing of how you believe this will all be done, I am calling BS, even if your idea and intent are genuine. Thinking about buying 10k cores without having used a single one is not the right move.
Start with a Boson 320: get the wide-angle lens and the 30 Hz model with 40 mK sensitivity. The Boson core has a chip for "AI" workloads, and classification is one of the example projects in the FLIR developer community. Look up what FLIR offers on its surveillance side with mixed camera sensors and think about going in a similar direction.
And just to criticize your whole pitch: "in places where surveillance is inappropriate or not allowed" we will put a device with even more sensors and hide it. In dressing rooms, toilets, etc. Will any investigator or judge believe a technical device running unknown formulas over a victim who doesn't want to testify? You could end up dragging people into court who are not victims at all, because this needs years of training and testing.
No data recorded means no evidence for humans to see and judge; no connection to the internet means it won't be able to alert you, the employer, etc.
Leaving good training and behavior to be policed by a non-working robot is just a bad idea.
Show some examples to back yourself up.
The only thing I can think of that comes close was a wide-angle ceiling-mounted camera that detected how many people are in the room and whether they are working, smoking, or naked - no AI needed. I can't find the video right now, but I will look for it.
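The core of that no-AI approach is trivial: compare each frame against a stored "empty room" background and count how many pixels changed. Here is a minimal sketch of that idea; the frames are plain lists of grayscale values for illustration, and the threshold values are arbitrary picks, not tuned parameters from any real system.

```python
# No-AI occupancy check: background subtraction on a stationary camera.
# Frames are flat lists of grayscale pixel values (a stand-in for a real
# camera feed); thresholds are illustrative, not tuned.

def occupied(frame, background, pixel_thresh=30, area_thresh=50):
    """True if enough pixels differ from the empty-room background."""
    changed = sum(1 for f, b in zip(frame, background)
                  if abs(f - b) > pixel_thresh)
    return changed > area_thresh

empty = [10] * 1000                    # synthetic empty-room frame
person = [10] * 900 + [200] * 100      # 100 bright "person" pixels

print(occupied(empty, empty))     # False
print(occupied(person, empty))    # True
```

Counting people or classifying what they are doing takes more work (connected-component analysis, heuristics per zone), but none of it needs a neural network, let alone 10k cores.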
Show something, and stop trying to sell us and the world an idea with nothing behind it. It's a utopian thought that turns very dystopian on second thought.
Deep learning can do a lot, and it's awesome for video forensics and surveillance at large scale, but without the resources and data volumes of the Chinese government, this won't happen.