It's more likely that the STL implementation is correct than any personal equivalent we might write. I'm not knocking your code specifically, but STL has gotten a great deal of exposure and most of the kinks have (hopefully) been worked out.
But not for embedded... A lot of "taught" C++ is about the STL and various other libraries, and they're swell until you try to use them on a chip with severely limited resources, no "exception" OS support (or no OS at all), requirements that prohibit dynamic allocation, and etc. Then you're back to implementing things yourself.
(I suppose that arguably, "severely limited environments" will just disappear. (hah!))
you could not implement any threading code in C
AFAIK, you can't "implement threading" in ANY language other than asm. You get the choice of putting in a bit of assembler, or using a threading implementation that has been put into the language for you. That's swell if it happens to meet your needs. In general, you get a choice between simple languages (C, asm) that make it easier for you to do things yourself, and complex languages that try to do everything for you, and end up with "bloated" implementations. (it's EXACTLY the same problem that vendor-provided "low-level libraries" have - by the time you get it to do everything possible, you have an implementation that no one wants to use.)
Last I heard, "Java Embedded Micro Edition" ran on "as little as 1MB/128k", and Embedded C# implementations were similar. And aren't getting much traction, despite people wanting standard-ish GUI interfaces (for example) on their embedded projects (for which these ought to be ideal.) (OTOH, micro-Python is catching on, and interpreted tokenized BASIC still has a significant following. So "performance" doesn't necessarily seem to be the limiting factor...)
"Seems like every firmware engineer has their own FIFO implementation. Why is that?"
because the usual embedded system doesn't need or want a full general FIFO implementation that can be used on any data structure, allocates dynamic memory as needed, throws exceptions on errors, catches exceptions from lower levels, is fully thread-safe for any number scheduling algorithms, etc, etc. And the "simple" FIFO code that they do need is pretty easy to write. And read. (I'm sure the STL is wonderful in its way. But the few times I've looked at an STL implementation of some algorithm, to see how it did something, I was faced with a relatively unintelligible mass of generalization, standardized but ugly internal names, and pretty advance C++ features (I'm sure it all make sense to someone who works ON STL. But I don't think the beginner or average C++ programmer has much of a chance. Sigh.)