Ok, let me retry.
This got me wondering how exactly this works on something like one of those cheap USB document scanners that Canon and Epson sell.
See SANE.
Is it one of the USB gadget protocols, and if so, which one?
No. There is the USB Still Image Capture Device Definition, and PTP/MTP, but they're mostly used with cameras. Scanner families have tended to have their own communications protocols and drivers, with scanner-specific variations.
I've been able to find surprisingly little information about how this works, but since an rpi w can operate as a gadget I'm wondering if I can emulate that approach.
You can. Simply use USB bulk transfers and your own protocol.
I would not do it that way, personally; I'd use a TCP/IP connection over Ethernet or WiFi. This is simpler, because the host OS does not matter, and you get galvanic isolation for free (because the Ethernet interface includes magnetic isolation for all signals). That means no ground loop risk, even if you power the 'Pi from a grounded or safety-grounded wall wart, i.e. not a class II isolated wall wart with no connection to ground.
As to the protocol, I would definitely design it as an interface boundary: as a control and data interface to an appliance. That is, each image (uncompressed RGB) would be preceded by a header that identifies all the settings and properties of the image (including the byte order, unless you standardize on all multi-byte values being little-endian), and perhaps a simple checksum of the entire frame. In addition to coordinates (x, y, z?), I'd also include a serial number based on the currently active document.
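To make the header idea concrete, here is a minimal Python sketch of one possible frame layout. Everything specific in it is my own assumption for illustration, not an existing standard: the magic value `SCAN`, the field order and widths, the choice of CRC-32 as the "simple checksum", and the names `pack_frame` and `unpack_frame`. It standardizes on little-endian, so no byte order field is needed.

```python
import struct
import zlib

# Hypothetical header layout; every multi-byte field is little-endian ("<").
# Fields: magic, version, width, height, channels, bits per channel,
#         x, y, z position, document serial number, payload length in bytes.
HEADER = struct.Struct("<4sHHHBBiiiII")
CRC = struct.Struct("<I")
MAGIC = b"SCAN"


def pack_frame(pixels, width, height, x, y, z, doc_serial):
    """Build header + CRC-32 + raw 8-bit RGB pixel data."""
    if len(pixels) != width * height * 3:
        raise ValueError("pixel data does not match width * height * 3")
    header = HEADER.pack(MAGIC, 1, width, height, 3, 8,
                         x, y, z, doc_serial, len(pixels))
    # The checksum covers both the header fields and the pixel data.
    crc = zlib.crc32(header + pixels)
    return header + CRC.pack(crc) + pixels


def unpack_frame(frame):
    """Return (fields, pixels); raise ValueError on a damaged frame."""
    if len(frame) < HEADER.size + CRC.size:
        raise ValueError("frame too short")
    header = frame[:HEADER.size]
    (magic, version, width, height, channels, bits,
     x, y, z, doc_serial, length) = HEADER.unpack(header)
    if magic != MAGIC:
        raise ValueError("bad magic")
    (crc,) = CRC.unpack_from(frame, HEADER.size)
    start = HEADER.size + CRC.size
    pixels = frame[start:start + length]
    if len(pixels) != length or zlib.crc32(header + pixels) != crc:
        raise ValueError("truncated or corrupted frame")
    fields = {"version": version, "width": width, "height": height,
              "channels": channels, "bits": bits,
              "x": x, "y": y, "z": z, "doc_serial": doc_serial}
    return fields, pixels
```

Because the payload length is in the header, the receiver can read a fixed number of header bytes from the TCP stream first, and then knows exactly how many more bytes belong to that image.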
An important facet of this is the asynchronous nature of the beast. If you design your control protocol as a synchronous one, you'll waste time waiting for stuff to complete everywhere. Instead, design each command (sent via the TCP/IP connection) so that it is immediately acknowledged or rejected, and, if accepted, followed later by a completed/failed status message. This means that each command you send should carry a nonce, for example a monotonic counter, that is echoed in both the immediate response and the status message. For example, to move the camera to a specific location, you can have separate commands for each axis that are sent concurrently; the device sends each command's status message only when that axis has reached the desired position.
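As a rough illustration of the nonce bookkeeping on the host side, here is a small Python sketch. The class name `CommandTracker`, the dictionary-shaped messages, and the command strings like `move_x 120` are all made up for the example; the actual wire format is up to you.

```python
import itertools


class CommandTracker:
    """Host-side bookkeeping for commands that are still in flight."""

    def __init__(self):
        self._counter = itertools.count(1)   # monotonic nonce source
        self.pending = {}                    # nonce -> "sent" or "accepted"

    def send(self, command):
        """Assign a nonce and return the message to write to the socket."""
        nonce = next(self._counter)
        self.pending[nonce] = "sent"
        return {"nonce": nonce, "cmd": command}

    def on_message(self, msg):
        """Process one message from the device and return its type."""
        nonce, kind = msg["nonce"], msg["type"]
        if nonce not in self.pending:
            raise KeyError(f"unknown nonce {nonce}")
        if kind == "accepted":
            self.pending[nonce] = "accepted"   # status message still to come
        elif kind in ("rejected", "completed", "failed"):
            del self.pending[nonce]            # this command is finished
        else:
            raise ValueError(f"unknown message type {kind!r}")
        return kind


# Two axis moves are issued back to back, without waiting in between.
tracker = CommandTracker()
move_x = tracker.send("move_x 120")
move_y = tracker.send("move_y 45")

# The device's replies can interleave in any order; the nonce sorts them out.
for reply in ({"nonce": move_y["nonce"], "type": "accepted"},
              {"nonce": move_x["nonce"], "type": "accepted"},
              {"nonce": move_y["nonce"], "type": "completed"}):
    tracker.on_message(reply)
```

After those three replies, only the X move is still pending, so the host knows exactly which axis it is waiting on without having blocked on either command.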
This also means you can add new features, like say zoom capability, light/backlight intensity controls, etc., by adding new commands to the protocol.
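One way to keep that kind of extension cheap on the device side is a handler table, so a new command is just one more registered function. The sketch below is only an assumption about how you might structure it; the command names `move_x` and `backlight`, the `command` decorator, and the reply fields are all hypothetical, and the handlers are placeholders that do not touch any hardware.

```python
HANDLERS = {}  # command name -> function taking the argument string


def command(name):
    """Decorator that registers a handler under a protocol command name."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register


@command("move_x")
def move_x(arg):
    return int(arg)          # placeholder: would start the X axis motor


@command("backlight")
def backlight(arg):
    level = int(arg)
    if not 0 <= level <= 100:
        raise ValueError("level out of range")
    return level             # placeholder: would set the backlight intensity


def acknowledge(nonce, line):
    """Return the immediate accepted/rejected reply for one command line."""
    name, _, arg = line.partition(" ")
    handler = HANDLERS.get(name)
    if handler is None:
        return {"nonce": nonce, "type": "rejected", "reason": "unknown command"}
    try:
        handler(arg)
    except ValueError as err:
        return {"nonce": nonce, "type": "rejected", "reason": str(err)}
    return {"nonce": nonce, "type": "accepted"}
```

A side benefit is that older firmware rejects a command it does not know immediately, so the host can detect a missing feature instead of waiting on it.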