Apologies to those who have heard me describe this before, but:
Xorshift64* is a very simple LFSR pseudo-random number generator, whose upper 32-40 bits pass all known randomness tests, and is thus better than e.g. the industry standard, Mersenne Twister (MT19937).
It is not cryptographically secure, but it is random enough even for scientific simulations and numerical work; again, both faster and "more random" than the industry standard. Just do not assume that the next number in a sequence cannot be predicted, as that would require a cryptographically secure PRNG.
It uses a 64-bit state, and has a period of 2⁶⁴-1. The only disallowed state is all zeros. (No other state will ever lead to the all-zeros state, so all other seed states will give the full sequence of 2⁶⁴-1 numbers.)
The state is updated using three exclusive OR operations and binary shifts:
state = state ^ (state >> 12);
state = state ^ (state << 25);
state = state ^ (state >> 27);
To generate the pseudorandom number corresponding to the (old or new) state, the state is multiplied by the 64-bit constant 2685821657736338717 (keeping only the low 64 bits of the product), and bits 32..63 (or 24..63) of the result are used:
number = (uint64_t)(state * UINT64_C(2685821657736338717)) >> 32;
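As a minimal sketch of the 24..63 alternative mentioned above: only the final shift changes. The name prng_u40 is made up for illustration, and the state is assumed to be nonzero.

```c
#include <stdint.h>

/* Hypothetical helper (not from any library): advances the state and
   returns bits 24..63 of the scrambled value, i.e. a 40-bit number
   in the low 40 bits of the result. *state must be nonzero. */
uint64_t prng_u40(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    /* Unsigned 64-bit multiplication wraps modulo 2^64; keep the top 40 bits. */
    return (x * UINT64_C(2685821657736338717)) >> 24;
}
```

Dropping the low 8 bits of this result gives the same value as the 32-bit output, since both come from the top of the same product.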
To use a 96-bit seed, you can use the first 32 bits of the seed as the upper half and a fixed (nonzero) value as the lower 32 bits of the initial state, then update the state roughly a dozen times. Then, exclusive-OR the state with the next 32 bits (combined with fixed low 32 bits to form a 64-bit value), and update the state a dozen times or so again. Repeat that with the last 32 bits, too. The result (as long as it is nonzero) is suitable for use as a Xorshift64* PRNG state, yielding the full 2⁶⁴-1 sequence from that point forwards.
On a 32-bit architecture, each state update involves nine bit shifts (by 5, 20, or 25 bits left/up; or by 7, 12, or 27 bits right/down), six exclusive-OR operations, and three OR operations, using four 32-bit registers. The multiplication is fast if you have a 32×32=64-bit or 32×32=high/low 32-bit operation, but slower if you only have a truncating 32×32=32-bit multiplication. (On ARMv6-m/Cortex-M0/M0+ it is slow, but on ARMv7e-m/Cortex-M3/M4/M7 and better it is just four instructions: muls, mla, umull, and adds.)
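To make the operation count above concrete, here is a rough sketch of the state update and the output step written with 32-bit operations only. The type and function names (xs64_state32, xs64_update32, xs64_output32) are my own for illustration; a compiler targeting a 32-bit core will typically produce something similar from the plain 64-bit code, so treat this as an explanation rather than an optimization.

```c
#include <stdint.h>

/* Hypothetical layout: the 64-bit state kept as two 32-bit halves. */
typedef struct {
    uint32_t lo;   /* bits 0..31  */
    uint32_t hi;   /* bits 32..63 */
} xs64_state32;

/* One state update: nine shifts, three ORs, and six XORs. */
void xs64_update32(xs64_state32 *s)
{
    uint32_t lo = s->lo, hi = s->hi;

    /* x ^= x >> 12; the low half receives 12 bits from the high half. */
    lo ^= (lo >> 12) | (hi << 20);
    hi ^= (hi >> 12);

    /* x ^= x << 25; the high half receives 25 bits from the low half. */
    hi ^= (hi << 25) | (lo >> 7);
    lo ^= (lo << 25);

    /* x ^= x >> 27; the low half receives 27 bits from the high half. */
    lo ^= (lo >> 27) | (hi << 5);
    hi ^= (hi >> 27);

    s->lo = lo;
    s->hi = hi;
}

/* Bits 32..63 of (state * multiplier) mod 2^64. Only one widening
   32x32=64 multiplication is needed; the two cross products only
   contribute their low 32 bits to the upper half of the result. */
uint32_t xs64_output32(const xs64_state32 *s)
{
    const uint32_t m_lo = (uint32_t)(UINT64_C(2685821657736338717));
    const uint32_t m_hi = (uint32_t)(UINT64_C(2685821657736338717) >> 32);

    uint32_t top = (uint32_t)(((uint64_t)s->lo * m_lo) >> 32);
    top += s->lo * m_hi;   /* truncating 32x32=32 multiply */
    top += s->hi * m_lo;   /* truncating 32x32=32 multiply */
    return top;
}
```

In each pair of lines, the half that receives bits from the other half is updated first, so it still sees the other half's old value.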
Example C code:
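#include <stdint.h>

/* Advances the Xorshift64* state and returns the next 32-bit number.
   *state must be nonzero. */
uint32_t prng_u32(uint64_t *state) {
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    /* Multiply modulo 2^64, then keep bits 32..63. */
    return (uint32_t)((x * UINT64_C(2685821657736338717)) >> 32);
}

/* Mixes a 96-bit seed into a 64-bit state, starting from seed[2].
   Each seed word is XORed into the state together with fixed[w],
   followed by rounds[w] + 1 state updates (the do-while body
   always runs at least once). */
uint64_t hash_96to64(const uint32_t seed[3], const uint32_t fixed[3], const uint8_t rounds[3]) {
    /* On little-endian targets u32[0] is the low half of u64;
       on big-endian targets the halves are swapped. */
    union {
        uint64_t u64;
        uint32_t u32[2];
    } state = { .u32 = { 0, 0 } };
    uint_fast8_t w = 2;
    do {
        state.u32[0] = state.u32[0] ^ fixed[w];
        state.u32[1] = state.u32[1] ^ seed[w];
        uint_fast8_t r = rounds[w];
        do {
            state.u64 ^= state.u64 >> 12;
            state.u64 ^= state.u64 << 25;
            state.u64 ^= state.u64 >> 27;
        } while (r-- > 0);
    } while (w-- > 0);
    return state.u64;
}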
The hash_96to64() is a parametrized hash function, letting you tune each use case using the fixed and rounds parameters. I've found that something like a dozen iterations suffices for a full avalanche effect (on the order of half the output bits differing whenever the seed value differs by one bit). Each of the fixed values should be nonzero, to ensure the state update rounds have full effect; preferably with both set and clear bits among the least and most significant bits (instead of all-zeros or all-ones).
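If you want to check the avalanche behaviour for your own choice of rounds, something like the following sketch should do. Here mix_96to64() is my restatement of hash_96to64() using shifts instead of the union (so it does not depend on byte order), and popcount64() and avalanche_bits() are hypothetical helper names, not part of any library.

```c
#include <stdint.h>

/* Same mixing as hash_96to64() on a little-endian machine:
   seed word in the upper half, fixed word in the lower half,
   then rounds[w] + 1 state updates. */
uint64_t mix_96to64(const uint32_t seed[3], const uint32_t fixed[3], const uint8_t rounds[3])
{
    uint64_t x = 0;
    for (int w = 2; w >= 0; w--) {
        x ^= ((uint64_t)seed[w] << 32) | fixed[w];
        for (unsigned r = 0; r <= rounds[w]; r++) {
            x ^= x >> 12;
            x ^= x << 25;
            x ^= x >> 27;
        }
    }
    return x;
}

/* Number of set bits in a 64-bit value. */
unsigned popcount64(uint64_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1;   /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Flips seed bit 'bit' (0..95) and returns how many of the
   64 output bits change as a result. */
unsigned avalanche_bits(const uint32_t seed[3], const uint32_t fixed[3],
                        const uint8_t rounds[3], unsigned bit)
{
    uint32_t flipped[3] = { seed[0], seed[1], seed[2] };
    flipped[bit / 32] ^= UINT32_C(1) << (bit % 32);
    return popcount64(mix_96to64(seed, fixed, rounds) ^
                      mix_96to64(flipped, fixed, rounds));
}
```

Looping bit from 0 to 95 and looking at the smallest and largest counts shows whether the chosen rounds are enough. Because the xorshift steps are linear over GF(2), the count depends only on which bit is flipped and on rounds, not on the seed or fixed values, so one pass over the 96 bits covers every seed.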
While I am not a mathematician, nor a cryptologist, a lot of stuff I do relies on "random" pseudorandom number sequences, and me understanding the details, so I've had to learn to work with the various methods. Linear feedback shift registers are definitely the fastest. As Xorshift and Xoroshiro variants show, they can be extremely random, too.
The two industry benchmarks for empirical randomness testing are the TestU01 framework's BigCrush test suite (which includes over a hundred individual tests), and the Diehard tests by Marsaglia et al. and its derivatives like Dieharder.
Using only 32 to 40 high bits of Xorshift64* is not the absolute best (least "weak" results in Dieharder while passing all BigCrush tests), nor the fastest, but it really has no "poor" seeds (except for zero, which is not allowed, and will not occur by itself during normal operation), while the better ones tend to suffer from some "poor" seed states generating less-random sequences so that such seeds need to be "avoided" or worked around. The relative density of poor seeds is not a problem; it is really only the initial sequence that is impacted by "poor" seeds, but we tend to use those initial sequences the most.
Also, apologies for the wall of text: I had a lot to say.