Oh, state machines are routine (and a CPU is a state machine!), but there are definite limits to how complex a state machine written as such can get and remain tractable.
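For a sense of what "written as such" means, here is a minimal sketch of a hand-written FSM in VHDL; the entity and signal names are invented for illustration:

    -- Three-state handshake controller: the whole machine fits in one case
    -- statement, which is roughly the scale at which this stays pleasant.
    library ieee;
    use ieee.std_logic_1164.all;

    entity tiny_fsm is
        port (
            clk     : in  std_logic;
            rst     : in  std_logic;
            start_i : in  std_logic;
            busy_o  : out std_logic;
            done_o  : out std_logic
        );
    end entity;

    architecture rtl of tiny_fsm is
        type state_t is (IDLE, WORK, DONE);
        signal state : state_t := IDLE;
    begin
        process (clk)
        begin
            if rising_edge(clk) then
                if rst = '1' then
                    state <= IDLE;
                else
                    case state is
                        when IDLE => if start_i = '1' then state <= WORK; end if;
                        when WORK => state <= DONE;  -- one cycle of "work"
                        when DONE => state <= IDLE;
                    end case;
                end if;
            end if;
        end process;

        busy_o <= '1' when state = WORK else '0';
        done_o <= '1' when state = DONE else '0';
    end architecture;

Past a few dozen states that case statement stops being reviewable, which is exactly when a soft core starts to look attractive.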
Throwing a MicroBlaze or the like into the fabric gains you two things:
You can program the stupid thing in C, debug it with a debugger, and generally treat some of the work as software dev.
Also, changes to the software side do not usually require P&R to be rerun, which, particularly with a mostly full part, can save literally hours of compute time.
High performance those little cores are not, but a lot of control (as opposed to data flow) stuff just does not need a high performance computer.
The on-chip block RAMs are in fact a KEY feature of an FPGA; they are usually relatively small memories (a few tens of kilobits, typically), BUT there are potentially hundreds of them! That makes for fantastic aggregate memory bandwidth, and they get used for everything from filter coefficients, to FIFOs, to lookup tables, to, yes, program and data storage. Being dual ported, they are also explicitly designed to be useful for crossing clock domains and similar purposes.
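To make that concrete, here is a sketch of the standard simple-dual-port RAM template that synthesis tools will typically infer a block RAM from (the sizes and names here are illustrative); a write port on one clock and a read port on another is also exactly the building block for a clock-crossing FIFO:

    -- Simple dual-port RAM, independent write and read clocks.
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity bram_sdp is
        generic (
            ADDR_W : natural := 10;  -- 1024 words, illustrative
            DATA_W : natural := 18   -- matches a common BRAM aspect ratio
        );
        port (
            wr_clk  : in  std_logic;
            wr_en   : in  std_logic;
            wr_addr : in  unsigned(ADDR_W-1 downto 0);
            wr_data : in  std_logic_vector(DATA_W-1 downto 0);
            rd_clk  : in  std_logic;
            rd_addr : in  unsigned(ADDR_W-1 downto 0);
            rd_data : out std_logic_vector(DATA_W-1 downto 0)
        );
    end entity;

    architecture rtl of bram_sdp is
        type ram_t is array (0 to 2**ADDR_W - 1)
            of std_logic_vector(DATA_W-1 downto 0);
        signal ram : ram_t;
    begin
        process (wr_clk)
        begin
            if rising_edge(wr_clk) then
                if wr_en = '1' then
                    ram(to_integer(wr_addr)) <= wr_data;
                end if;
            end if;
        end process;

        process (rd_clk)
        begin
            if rising_edge(rd_clk) then
                -- Registered read; tools generally want this to infer a BRAM.
                rd_data <= ram(to_integer(rd_addr));
            end if;
        end process;
    end architecture;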
Same with the 'DSP' cores, which are really some combination of adder/multiplier/accumulator, typically good (with some pipelining) to a few hundred MHz; but again, you get a hundred or so on the chip, and they can be wired directly to the block RAMs for data or coefficients.
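Here is a sketch of the sort of pipelined multiply-accumulate that tools will usually map onto one of those DSP slices; the 18x18 multiply into a 48-bit accumulator matches common hardware, but the widths and names are just for illustration:

    -- Pipelined MAC: input, multiplier, and accumulator register stages
    -- are what let a DSP slice run at those few-hundred-MHz rates.
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity mac is
        port (
            clk   : in  std_logic;
            clr   : in  std_logic;             -- clears the accumulator
            a     : in  signed(17 downto 0);
            b     : in  signed(17 downto 0);
            acc_o : out signed(47 downto 0)
        );
    end entity;

    architecture rtl of mac is
        signal a_r, b_r : signed(17 downto 0) := (others => '0');
        signal prod_r   : signed(35 downto 0) := (others => '0');
        signal acc_r    : signed(47 downto 0) := (others => '0');
    begin
        process (clk)
        begin
            if rising_edge(clk) then
                a_r    <= a;                   -- input register stage
                b_r    <= b;
                prod_r <= a_r * b_r;           -- multiplier register stage
                if clr = '1' then
                    acc_r <= (others => '0');
                else
                    acc_r <= acc_r + prod_r;   -- accumulator stage
                end if;
            end if;
        end process;

        acc_o <= acc_r;
    end architecture;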
A common mistake is to think of writing an FPGA configuration as software. You can sort of do that, but software generally has an implicit notion of sequence and VHDL does not: it describes wiring, lookup tables, and registers. If you need things to be sequential, you have to describe the circuit that makes them happen one after another; it is not built in.
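For example, something as trivial as "do A, then B, then C" has to be built as explicit sequencing hardware; a sketch, with made-up names:

    -- Pulses three strobes one after another. The phase counter IS the
    -- sequencing: in C the program counter gives you this for free,
    -- here you have to build it.
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity step_seq is
        port (
            clk    : in  std_logic;
            go     : in  std_logic;
            step_o : out std_logic_vector(2 downto 0)
        );
    end entity;

    architecture rtl of step_seq is
        signal phase : unsigned(1 downto 0) := "00";  -- explicit sequencing state
    begin
        process (clk)
        begin
            if rising_edge(clk) then
                if phase = 0 then
                    if go = '1' then
                        phase <= "01";     -- start the sequence
                    end if;
                elsif phase = 3 then
                    phase <= "00";         -- done, back to idle
                else
                    phase <= phase + 1;    -- advance one step per clock
                end if;
            end if;
        end process;

        -- Each strobe is just a combinational decode of the phase counter.
        step_o(0) <= '1' when phase = 1 else '0';
        step_o(1) <= '1' when phase = 2 else '0';
        step_o(2) <= '1' when phase = 3 else '0';
    end architecture;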
Say, rather, that IO, IO, IO is what FPGAs are really about. You can get some very nice performance as well, for problems that fit the model, basically masses of data flow stuff that pipelines nicely, but the selling feature is hundreds of fast IO pins and a handful of scary-fast transceivers.