I use my own. It is not difficult to write a simple one, but there are some details to consider first.
- Usage and matrix sizes
Matrices can represent many different things, from systems of equations (in linear algebra) to rotations and transforms in D-dimensional geometry.
If the matrix sizes and/or shapes are limited to a relatively small set, it is often more computationally efficient to implement the basic operations separately for each.
OP already mentioned they don't need sparse matrices, which simplifies such a library a lot already.
Things like "all my matrices are either square or one-dimensional (vectors)", or "my matrices have at most 32 rows/columns", can simplify the library a lot.
- Matrix inversion approach
There are algebraic expressions for square matrix inversion, but they are only feasible for small square matrices (say up to 3×3 or 4×4). For larger square matrices, there are several methods to choose from.
For non-square matrices, one uses a generalized inverse, or pseudoinverse; useful e.g. in linear algebra, when seeking solutions to a system of equations (with as many linearly independent equations as unknowns).
In particular, if the matrices are only used to represent systems of equations, it would make sense to use a different interface than just a generic "invert this matrix, please". Most cases can be reduced to Ax=y with y and matrix A known, but x unknown; the solution being the traditional x=A⁻¹y, where matrix A⁻¹ is the (generalized) inverse of matrix A. A dedicated interface for that would be better than having the programmer remember the sequence of "pure" linear algebra operations themselves.
- Memory fragmentation
If the matrices have varying lifetimes – especially if you have sequences like "create matrix A, create matrix B, create matrix C, delete matrix B, create matrix D, delete matrix A, delete matrix C, ..., delete matrix D" – it is possible to end up with a badly fragmented heap: small matrices scattered around with roughly matrix-sized holes in between, so that even though almost half of the memory is still available, you suddenly cannot allocate a slightly larger matrix anymore, because the only holes available are all too small.
For example, if only a limited number of matrices exist at any point in time, referring to them by index would allow the matrix data memory manager to compact existing matrices every now and then, basically avoiding all issues with memory fragmentation, at the cost of occasional cleanup work.
I personally use matrix and vector structures that can be declared as local variables (i.e., they are structures, not pointers to structures), with views to other matrices and vectors indistinguishable from the original matrices and vectors (similar to Unix/POSIX inodes and dentries). The actual data is stored in separate reference-counted "data owner" structures, each of which is deleted when the last vector or matrix referring to it is deleted.
If the maximum matrix size is small enough, and the maximum number of concurrently used unique matrices small enough, memory fragmentation can be avoided by always using the maximum data owner size.
I do not keep a ready-made library handy, because when I need computational power, I use the GNU Scientific Library and others; and when I need something small and lightweight, I write/rewrite my 'library' to fit the needs of the particular use case.
If I knew what OP uses the matrices for, what the actual matrix size and shape limits are, and what kind of embedded system the code is intended to run on (8-bit with FP emulation? Fixed point? 32-bit with FP emulation? 32-bit with hardware single-precision floating point support? 32-bit with double-precision emulation?), I might be able to give practical advice. The library itself is simple to write, but choosing the appropriate inversion method depends on the actual use case. One can even make tradeoffs between speed and accuracy. It is the "general" case that is the most painful, because that really requires implementing more than one method, and using heuristics to detect when to switch; especially so with single-precision floating point numbers, as many almost-singular matrices do have a generalized inverse, but finding it numerically is tricky.