The screenshot they show has the video streams in separate windows, so it could be as simple as a USB hub with three cameras hooked up to it.
What you are trying to achieve isn't easy. You essentially do need an embedded computer to drive both cameras, perform the image processing, and pretend to be a camera to another USB host. You can't simply "combine USB signals" from multiple devices: a hub only lets the host talk to each device separately, it doesn't merge them into one. That said, the "computer" could almost certainly be reduced to a single ARM SoC.
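To give a feel for the "image processing" part only, here is a minimal sketch in plain Python of joining two frames side by side. Everything in it is an assumption for illustration: the `side_by_side` helper, the `Frame` representation (rows of grayscale values), and the tiny stand-in frames are all hypothetical. Actually grabbing frames from the cameras and presenting the result to the other host as a USB camera are the hard parts and are not shown.

```python
from typing import List

# Hypothetical frame format: a list of rows, each row a list of 0-255 grayscale values.
Frame = List[List[int]]


def side_by_side(left: Frame, right: Frame, fill: int = 0) -> Frame:
    """Join two frames horizontally, padding the shorter one with `fill` rows."""
    left_w = len(left[0]) if left else 0
    right_w = len(right[0]) if right else 0
    height = max(len(left), len(right))
    combined = []
    for y in range(height):
        # Use the real row where the frame has one, otherwise a blank padding row.
        l_row = left[y] if y < len(left) else [fill] * left_w
        r_row = right[y] if y < len(right) else [fill] * right_w
        combined.append(l_row + r_row)
    return combined


# Stand-ins for frames grabbed from two cameras (made-up data, not real captures).
cam_a = [[10, 10], [10, 10]]   # 2x2 frame
cam_b = [[200, 200, 200]]      # 3x1 frame
frame_out = side_by_side(cam_a, cam_b)
```

On real hardware you would do this on full-size frames at video rate, which is why a reasonably capable SoC is needed rather than a simple microcontroller.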
The CCTV market may have some off-the-shelf solutions, but I'm not familiar with that stuff.