Thanks, Vivado provides some AXI Verification IP (VIP), I'm going to try it.
I use it a lot, and it's great for simulating all kinds of AXI transactions!
BTW, initiating single 8/16/32-bit transfers manually from MicroBlaze is easy, but is there any way to initiate a wrapped cache transfer manually?
I honestly don't remember off the top of my head. Try skimming through the MicroBlaze user guide - I remember it being very detailed, so I'm sure it covers everything.
You didn't mention IDELAY. Am I right that you also used IDELAY to delay the RWDS strobe before feeding it to the CLK/CLKB pins of the SERDES?
You may or may not need them, depending on what your pinout looks like and which clock buffers you use. I don't have that design handy so I can't look it up, but I remember that clock buffers have a rather large insertion delay, so you might need to add IDELAY to the data lines to compensate for it. In my 200 MHz LPDDR1 design I have IDELAY blocks on both the clock and data lines. BTW - in case you need more delay resolution, there is an undocumented IDELAYE2_FINEDELAY component that can provide finer delays (though that is only really useful at much higher speeds, because the documented version already has a resolution of 78 ps with a 200 MHz reference clock).
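To show where the 78 ps figure comes from: the average IDELAYE2 tap delay is 1 / (32 x 2 x F_REF), and the delay line has 32 taps (0 to 31). Below is a minimal Python sketch of that arithmetic. The function names are made up for illustration, and the 1250 ps target is only an assumed example (a quarter of a 200 MHz clock period, i.e. a nominal 90 degree strobe shift). It ignores clock buffer insertion delay, so treat the result as a starting point rather than a value to copy.

```python
def idelay_tap_ps(refclk_hz: float) -> float:
    """Average IDELAYE2 tap delay in picoseconds: 1 / (32 * 2 * F_REF)."""
    return 1e12 / (32 * 2 * refclk_hz)


def taps_for_delay(target_ps: float, refclk_hz: float = 200e6) -> int:
    """Nearest tap setting for a target delay, clamped to the 0..31 tap range.

    Hypothetical helper: it does not account for buffer insertion delay
    or routing, which the timing report has to tell you.
    """
    tap = idelay_tap_ps(refclk_hz)
    return max(0, min(31, round(target_ps / tap)))


if __name__ == "__main__":
    print(f"tap resolution at 200 MHz refclk: {idelay_tap_ps(200e6):.3f} ps")
    # Assumed example target: quarter period of a 200 MHz clock (1250 ps).
    print(f"taps for a 1250 ps shift: {taps_for_delay(1250.0)}")
```

Note that the full 32-tap line only spans about 2.4 ns with a 200 MHz reference, so larger shifts have to come from somewhere else (routing, a clock buffer, or an MMCM phase shift).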
You have quite an interesting implementation, though I don't understand how it works stably, as your free-running clocks are still not synchronous to the RWDS domain. I don't understand how data is synchronized when going through the CLK->OCLK path inside the ISERDES in MEMORY_MODE...
This mode has been specifically designed to support strobe-based memory interfaces (which is what DDRx interfaces are, and so is HB1/2 as far as reads are concerned). UG471 provides functional schematics of how it works (it's basically a version of the classic double-flop synchronization, CLK->OCLK->CLKDIV, where the latter two are required to be phase-aligned), but you can just assume that it works, since this is a hard silicon block, provided that you use it exactly as they suggest. Carefully read the entire section of that user guide that covers ISERDESE2. You might need to read it several times to "get it", as I did, because it's not exactly the cleanest explanation, and the component itself is rather complex and covers many different use cases.
As for stability - this is what IO timing constraints are for. If you constrain your design properly, timing analysis can guarantee proper functioning of the interface in all conditions, as Vivado will adjust placement and routing to help satisfy the constraints (basically it can add quite a bit of delay by choosing a longer connection path inside the fabric's interconnect backbone, or move components around to ensure both setup and hold constraints are satisfied). As you increase your interface speed, at some point the timing margins will become so small that you won't be able to perform a static capture using constraints alone. In that case you will need to implement some sort of calibration to ensure that you sample data right in the middle of the data eye. In my experience, though, with the right pinout and some effort on your part you can successfully achieve static capture at 200 MHz DDR and below.
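To illustrate why the margins shrink with speed, here is a rough Python sketch of the setup/hold budget for a single DDR bit. It is a deliberately simplified model, not what Vivado's timing analysis actually does, and every number in it (the 400 ps strobe-to-data skew, the 100 ps setup and hold requirements) is an assumption picked for illustration, not a datasheet value.

```python
def capture_margins(bit_period_ps: float, skew_ps: float,
                    setup_ps: float, hold_ps: float,
                    sample_offset_ps: float) -> tuple:
    """Return (setup_margin, hold_margin) in ps for one DDR bit.

    Simplified model: data is assumed stable from `skew_ps` after the
    start of the bit until `skew_ps` before its end, and the capture
    edge lands `sample_offset_ps` after the start of the bit.
    """
    eye_start = skew_ps
    eye_end = bit_period_ps - skew_ps
    setup_margin = (sample_offset_ps - eye_start) - setup_ps
    hold_margin = (eye_end - sample_offset_ps) - hold_ps
    return setup_margin, hold_margin


if __name__ == "__main__":
    # Assumed values for illustration only.
    skew_ps, setup_ps, hold_ps = 400.0, 100.0, 100.0
    for clock_mhz in (200, 400, 500):
        bit_period_ps = 1e6 / clock_mhz / 2   # DDR: two bits per clock period
        center_ps = bit_period_ps / 2         # sample in the middle of the bit
        su, hd = capture_margins(bit_period_ps, skew_ps, setup_ps, hold_ps, center_ps)
        print(f"{clock_mhz} MHz DDR: setup margin {su:.0f} ps, hold margin {hd:.0f} ps")
```

With these made-up numbers the margin is comfortable at 200 MHz DDR and drops to zero at 500 MHz DDR, which is the point where static capture stops being viable and calibration takes over.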
Finally - I'm sure you know this, but someone reading this might not - Cypress provides simulation models for these chips, so you can run full functional simulations to make sure it works like it should in all cases and scenarios.