Why design a Linux kernel specification so poorly that a click into another window renders useless any input controller with physical media buttons on it, like the play/pause button on a PC keyboard?
This has nothing to do with the Linux kernel, and everything to do with X11 (the X Window System).
Shouldn't such a standard specify to play/pause the last player in focus?
- specify in the standard to send the multimedia events to the last player that was in focus
- specify a switch current player mechanism (like the Alt+TAB for windows, but for players only), in case more players are already open
That would be the task of the window manager, or of the application itself. Any application can passively grab a specific keypress, including its modifier state (so you can grab, say, Alt+Shift+K without affecting how K, Shift+K, Alt+K, or Ctrl+K is handled). Unless the currently focused application has also grabbed it, the keypress is then delivered to the grabbing application (see XGrabKey()).
One could argue that it is silly that media player applications do not do this for Play/Pause/Stop/Forward/Back.
(Then again, certain folks insist that instead of using X11, we should use D-Bus for these, and in e.g. Cinnamon desktop, register for such events provided by the csd-media-keys daemon. Because.. you know. I'd like to rant, but I'll try not to.)
Alternatively, you can bind the keypress to a script that determines which window to direct the event to, and synthesizes a suitable X11 event directed to that window using e.g. xdotool.
Or, following Unix philosophy, you could use a small X11 application that manages such applications you call "players", by grabbing the keypresses. It can obtain the list of X applications using XQueryTree() (similar to xwininfo -tree -root), and send synthesized events like multimedia keypresses to specific applications using XSendEvent(). It can even monitor focus change events using XSelectInput(display, window, FocusChangeMask). This is low-level X11 stuff, too; very lightweight.
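The bookkeeping for "last player that was in focus", plus an Alt+Tab-like switch among players only, is small. Below is a sketch in plain C with no X11 calls at all, so every name in it (player_set, player_focus, player_cycle and so on) is invented for illustration. The assumption is that the managing application calls player_focus() whenever it sees a FocusIn event, and forwards media keys to whatever player_target() returns.

```c
#include <stddef.h>

#define MAX_PLAYERS 16

typedef unsigned long WindowId;   /* Xlib's Window is also an unsigned long */

struct player_set {
    WindowId ids[MAX_PLAYERS];
    size_t   count;
    size_t   current;             /* index of the target; valid if count > 0 */
};

/* Index of id in the set, or -1 if it is not a known player. */
static int player_index(const struct player_set *ps, WindowId id)
{
    size_t i;
    for (i = 0; i < ps->count; i++)
        if (ps->ids[i] == id)
            return (int)i;
    return -1;
}

/* Register a player window.  Returns 0 on success (or if already known),
   -1 if the set is full. */
int player_add(struct player_set *ps, WindowId id)
{
    if (player_index(ps, id) >= 0)
        return 0;
    if (ps->count >= MAX_PLAYERS)
        return -1;
    ps->ids[ps->count++] = id;
    return 0;
}

/* Forget a player window, e.g. when it is destroyed. */
void player_remove(struct player_set *ps, WindowId id)
{
    int    found = player_index(ps, id);
    size_t i;

    if (found < 0)
        return;
    for (i = (size_t)found; i + 1 < ps->count; i++)
        ps->ids[i] = ps->ids[i + 1];
    ps->count--;
    if (ps->current > (size_t)found)
        ps->current--;
    if (ps->current >= ps->count)
        ps->current = 0;
}

/* Call on every focus-in event; windows that are not players are ignored,
   so the last focused player stays the target. */
void player_focus(struct player_set *ps, WindowId id)
{
    int found = player_index(ps, id);
    if (found >= 0)
        ps->current = (size_t)found;
}

/* Window that should receive media keys, or 0 if there are no players. */
WindowId player_target(const struct player_set *ps)
{
    return ps->count ? ps->ids[ps->current] : 0;
}

/* "Alt+Tab for players": advance to the next player and return it. */
WindowId player_cycle(struct player_set *ps)
{
    if (ps->count)
        ps->current = (ps->current + 1) % ps->count;
    return player_target(ps);
}
```

Clicking into a terminal or browser never changes the target, which is exactly the behaviour asked for above.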
It seems to me that your needs would be best addressed using a small X11 application that reads codes/sequences/commands directly from your serial port (connected to a microcontroller via a USB-Serial bridge, or to a microcontroller with native USB, with that microcontroller having at least one IR receiver, as already discussed). It would of course also grab the multimedia keys, so that stuff would Just Work. It would need to know how to tell which windows are "Players" (either by window title, or by executable name), and it might be useful to give it OSD capability (backgroundless, frameless windows at a fixed position) to display applied actions, but it would be a quite straightforward application.
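Telling which windows are "Players" can be as simple as a short rule list. The sketch below is plain C with made-up names (player_rule, is_player), and the rule semantics are my own assumption: a title rule matches as a substring, an executable rule matches the base name exactly. In a real program the title could come from XFetchName() and the executable path from the window's _NET_WM_PID property plus /proc/PID/exe, but that part is left out here.

```c
#include <string.h>

enum rule_kind { RULE_TITLE, RULE_EXE };

struct player_rule {
    enum rule_kind kind;
    const char    *pattern;
};

/* Part of the path after the last '/', or the whole path if there is none. */
static const char *base_name(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Returns 1 if the window looks like a player according to any rule,
   0 otherwise.  title and exe_path may be NULL when unknown. */
int is_player(const struct player_rule *rules, size_t nrules,
              const char *title, const char *exe_path)
{
    size_t i;

    for (i = 0; i < nrules; i++) {
        if (rules[i].kind == RULE_TITLE) {
            /* Title rule: substring match anywhere in the window title. */
            if (title && strstr(title, rules[i].pattern))
                return 1;
        } else {
            /* Executable rule: exact match on the file name only. */
            if (exe_path && !strcmp(base_name(exe_path), rules[i].pattern))
                return 1;
        }
    }
    return 0;
}
```

With the rules { RULE_TITLE, "VLC media player" } and { RULE_EXE, "mpv" }, a window titled "song.mp3 - VLC media player" matches, and so does anything running /usr/bin/mpv, while a browser tab titled "mpv docs" does not.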
This has the benefit that it would adapt to the desktop user, quietly staying in the background and consuming very little memory and CPU time, waking only when window focus changes (if you want it to track focus among player applications). It could also use PulseAudio services to track which applications generate sound, associating them with their respective windows (if any). Either way, the serial IR controller would Do What You Mean: the application monitors which players are active or capable of generating sound, associates them with their windows, and uses X11 to send them the events.
In fact, the vast majority of the code in such an application, written in C, would deal with managing the underlying mappings, plus the UI for that, if you want a UI instead of configuring it entirely through flat text config files. Because only X11 (libX11) would be required (plus possibly some PulseAudio crap), it would be very stable and easy to maintain.
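To give an idea of how little code a flat text config needs, here is a sketch of a line parser for a purely hypothetical format: one mapping per line, a hexadecimal code as received from the serial port followed by an action name, with blank lines and lines starting with # ignored. The format, the struct, and the function name are all my own invention for illustration.

```c
#include <stdio.h>
#include <string.h>

struct mapping {
    unsigned long code;        /* code received from the serial port */
    char          action[32];  /* action name, e.g. "play-pause" */
};

/* Parse one config line of the form "<hex code> <action>".
   Returns 1 if a mapping was stored in *out, 0 if the line is blank or
   a comment, and -1 if it is malformed (*out may then be partly
   overwritten). */
int parse_mapping(const char *line, struct mapping *out)
{
    char extra;

    /* Skip leading blanks, then ignore empty lines and comments. */
    line += strspn(line, " \t");
    if (*line == '\0' || *line == '\n' || *line == '\r' || *line == '#')
        return 0;

    /* Exactly two fields must convert; a third token means trailing junk. */
    if (sscanf(line, "%lx %31s %c", &out->code, out->action, &extra) != 2)
        return -1;

    return 1;
}
```

Feed it each line from fgets(); a line such as "0x20DF10EF play-pause" (the code value here is just an example) yields code 0x20DF10EF and action "play-pause".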