There is one driver for SD cards that uses the SDMMC interface (configurable for 1-bit or 4-bit mode) and a separate driver that uses the SPI interface. While the SDMMC driver works like a charm, the SPI driver sometimes fails to detect the card. It seems that SPI mode for SD cards is a bit tricky...
When working on my bootloader, I ran across some good information in a forum post that solved my problem with non-responding cards using SPI:
> After the last SD Memory Card bus transaction, the host is
> required to provide 8 (eight) clock cycles for the card to
> complete the operation before shutting down the clock.
> It could work without it, but since 8 cycles = 1 SPI output
> byte, it won't hurt much and it's just good to have it.
It appears that some cards just require more clocks to respond to commands properly, and I actually ran across one of them. It worked fine in my computer, but not in my bootloader.
I had already adopted the idea of briefly toggling the CS line high, then back low, at the beginning of each command, since I had read that some cards need that (instead of just tying CS low all the time). So I added a single 0xFF byte transmitted just before toggling the CS line, and the flaky card then worked properly. In other words, instead of sending an extra byte at the end of each command, I send it just before the beginning of the next command, and that has worked so far.
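Here is a rough sketch of that ordering in C, in case it helps. It is not my actual bootloader code: `spi_transfer`, `cs_high`, `cs_low`, and `sd_send_command` are hypothetical names, and the hardware hooks are stubbed to record what would go out on the bus so the sequence can be checked on a PC. On a real target you would replace the three stubs with your own SPI and GPIO routines and read the card's response after the frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real hardware: every byte and CS change
 * is recorded in bus_log instead of being driven onto the pins. */
#define LOG_SIZE 64
enum { EV_CS_HIGH = 0x100, EV_CS_LOW = 0x101 }; /* values above 0xFF mark CS events */

static uint16_t bus_log[LOG_SIZE];
static size_t bus_log_len;

static void log_event(uint16_t ev)
{
    if (bus_log_len < LOG_SIZE)
        bus_log[bus_log_len++] = ev;
}

/* Stub: a real implementation clocks out one byte and returns the byte read. */
static uint8_t spi_transfer(uint8_t out)
{
    log_event(out);
    return 0xFF;
}

static void cs_high(void) { log_event(EV_CS_HIGH); }
static void cs_low(void)  { log_event(EV_CS_LOW); }

/* Send one 6-byte SD command frame in SPI mode.
 * crc_byte is the full last byte of the frame (CRC7 shifted left, end bit set),
 * e.g. 0x95 for CMD0 with a zero argument. */
void sd_send_command(uint8_t cmd, uint32_t arg, uint8_t crc_byte)
{
    assert(cmd < 64);              /* command index is 6 bits */

    spi_transfer(0xFF);            /* 8 extra clocks to let the card finish the previous transaction */
    cs_high();                     /* brief CS toggle before each command */
    cs_low();

    spi_transfer(0x40 | cmd);      /* start bit 0, transmission bit 1, command index */
    spi_transfer((uint8_t)(arg >> 24));
    spi_transfer((uint8_t)(arg >> 16));
    spi_transfer((uint8_t)(arg >> 8));
    spi_transfer((uint8_t)arg);
    spi_transfer(crc_byte);
}
```

Calling `sd_send_command(0, 0, 0x95)` sends CMD0 (GO_IDLE_STATE) preceded by the extra 0xFF byte and the CS toggle. Putting the extra clocks at the start of the next command rather than at the end of the previous one keeps everything in one place, at the cost of the very last command before power-down not getting its trailing clocks.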
I'll attach the flow chart I followed for SD v1 and v2 and for SDHC. I didn't support MMC or SDXC.