Don't FLASH chips have bad blocks remapped already? They must have, otherwise when a machine starts using it, it would do an awful lot of remapping of bad blocks. Also, I do a factory test (on the 4MB Adesto device) and this checks every block.
Managed Flash does, unmanaged does not. Serial flashes that take commands are managed, and do.
I believe the managed Flash ones, like Winbond serial flashes, have a controller with eeprom to retain the bad blocks list, either in a hash table (a fixed-size array of logicalSector → physicalSector mappings, so each lookup is typically O(1)) or in sorted order, so a binary search on logicalSector takes O(log₂ N) look-ups. If we assume 5% of sectors may become bad, then 32 GiB with 4096-byte pages has 8,388,608 pages, and up to 419,430 may be remapped; that would require about 27 Mbits at 64 bits per mapping, and up to 19 look-ups per sector using a binary search. For a hash table, the performance starts degrading when it is more than about half full (more probes per sector), so there are obvious tradeoffs here.
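To make the hash-table idea concrete, here is a minimal sketch, entirely my own illustration rather than anything from a real controller's firmware, of an open-addressed logicalSector → physicalSector table with linear probing; the table size, hash constant, and names are all assumptions.

```c
#include <stdint.h>

#define MAP_SLOTS   (1u << 20)     /* power of two; enough for ~500k entries at half full */
#define SLOT_EMPTY  UINT32_MAX     /* marks an unused slot */

typedef struct {
    uint32_t logical;              /* logical sector number            */
    uint32_t physical;             /* physical sector it was moved to  */
} remap_entry_t;

/* Table in controller RAM; every .logical must be set to SLOT_EMPTY
 * at startup before entries are loaded from non-volatile storage. */
static remap_entry_t map[MAP_SLOTS];

/* Return the physical sector for 'logical'; identity if not remapped. */
uint32_t remap_lookup(uint32_t logical)
{
    uint32_t i = (logical * 2654435761u) & (MAP_SLOTS - 1u);  /* Knuth multiplicative hash */

    /* The table is kept at most about half full (see above), so probing terminates. */
    while (map[i].logical != SLOT_EMPTY) {
        if (map[i].logical == logical)
            return map[i].physical;            /* found: typically O(1) */
        i = (i + 1u) & (MAP_SLOTS - 1u);       /* linear probing */
    }
    return logical;                            /* not remapped */
}
```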
This sounds like an odd solution to me. It adds complexity and now the solution depends on both eeprom and flash endurance.
Nevertheless!
As an example of non-managed Flash, consider the Micron MT29F2G08ABAGAWP 256M×8 that Mouser sells, and its datasheet; specifically, the Error Management chapter on page 37. You identify bad blocks (up to 40 out of 2048 blocks ≃ 1.95% from the factory, and no more than 2% during the device lifetime) by the first byte of the spare area on the first or second page of each block.
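As an illustration of that procedure, here is a hedged sketch of a factory bad-block scan, assuming a hypothetical nand_read_spare_byte() driver call; the block and page counts match the 2 Gb Micron part discussed above.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS      2048u
#define PAGES_PER_BLOCK 64u

/* Assumed driver call: read one byte from the spare area of a given page. */
extern uint8_t nand_read_spare_byte(uint16_t block, uint8_t page, uint16_t offset);

/* A block is considered bad if the first spare byte of its first or
 * second page is not 0xFF (the erased value). */
bool block_is_bad(uint16_t block)
{
    return nand_read_spare_byte(block, 0, 0) != 0xFF ||
           nand_read_spare_byte(block, 1, 0) != 0xFF;
}

/* Build the initial bad-block list before anything is ever erased:
 * erasing a block destroys the factory bad-block marks. */
uint16_t scan_factory_bad_blocks(uint16_t *bad, uint16_t max_bad)
{
    uint16_t count = 0;
    for (uint16_t b = 0; b < NUM_BLOCKS && count < max_bad; b++)
        if (block_is_bad(b))
            bad[count++] = b;
    return count;
}
```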
As an example of managed Flash, consider the Winbond W25N512GVEIG, also from Mouser. It has an internal Bad Block Management Look-Up Table, BBM LUT, with up to ten relocated blocks. See the 8.2.7 Bad Block Management (A1h) command and the 8.2.8 Read BBM LUT (A5h) command, on pages 34 and 35. As it is a maximum of just 10 relocations, it is rather unlikely they use the Flash itself for this, and instead use some kind of EEPROM that is copied to RAM on power-on; this EEPROM only needs, say, a thousand guaranteed write cycles to be effective in practice.
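For completeness, here is a rough sketch of what driving those two commands might look like from an MCU, using assumed spi_select()/spi_write()/spi_read() helpers. The byte layout (A1h followed by a 16-bit logical and a 16-bit physical block address; A5h followed by a dummy byte and the LUT pairs) is my reading of those sections and should be verified against the datasheet before use.

```c
#include <stdint.h>

/* Assumed platform SPI helpers, not a real library API. */
extern void spi_select(void);
extern void spi_deselect(void);
extern void spi_write(const uint8_t *buf, uint32_t len);
extern void spi_read(uint8_t *buf, uint32_t len);

#define MAX_LUT_ENTRIES 10u   /* up to ten relocations on this part */

/* Add one logical->physical block link to the on-chip BBM LUT (A1h).
 * A Write Enable (06h) is typically required beforehand; omitted here. */
void bbm_swap_blocks(uint16_t lba, uint16_t pba)
{
    uint8_t cmd[5] = { 0xA1,
                       (uint8_t)(lba >> 8), (uint8_t)lba,
                       (uint8_t)(pba >> 8), (uint8_t)pba };
    spi_select();
    spi_write(cmd, sizeof cmd);
    spi_deselect();
}

/* Read back the whole BBM LUT (A5h): MAX_LUT_ENTRIES pairs of
 * 16-bit LBA and 16-bit PBA values. */
void bbm_read_lut(uint16_t lba[], uint16_t pba[])
{
    uint8_t hdr[2] = { 0xA5, 0x00 };          /* opcode + dummy byte */
    uint8_t raw[MAX_LUT_ENTRIES * 4];

    spi_select();
    spi_write(hdr, sizeof hdr);
    spi_read(raw, sizeof raw);
    spi_deselect();

    for (uint32_t i = 0; i < MAX_LUT_ENTRIES; i++) {
        lba[i] = (uint16_t)((raw[4*i]     << 8) | raw[4*i + 1]);
        pba[i] = (uint16_t)((raw[4*i + 2] << 8) | raw[4*i + 3]);
    }
}
```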
These are simple round-robin systems which do wear levelling by themselves. If you need something more complex, it is time to look for a wear levelling layer which does all the heavy lifting.
Wear leveling and bad block management are two completely different things, because even factory-fresh, raw NAND may contain up to 1.9% bad blocks (Micron), with the maximum number of bad blocks allowed over the device lifetime below 2% (Micron), and higher for some other manufacturers.
Most NAND Flash has spare bytes per page; for example, both of the above have 64 extra bytes for each 2048-byte page. The first extra ("spare") byte is often used to record the page or block failure from the factory. The rest are used for ECC and such. On the Micron one, you relocate and mark entire blocks (128 KiB + 4 KiB spare) bad. On a 2 Gb device, you have 2048 blocks, and up to 40 may become bad during the device lifetime. Since 2048 = 2¹¹, you only need 22 bits per block relocation. As the very first block is guaranteed to last 1,000 PROGRAM/ERASE cycles, and typical ECC needs 6 bytes per 512 bytes, the spare bytes left over from ECC in block 0, say one relocation per page (so max. 64 relocations), could be used. It'd be simpler to use page 0 of block 0, though, and not use block 0 for data at all.
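As a sketch of what one such relocation entry could look like when squeezed into the spare bytes, here is one possible 3-byte encoding; the format is entirely my own choice, not from any datasheet. Keeping the erased value 0xFFFFFF as "no entry" means a new relocation can be programmed straight into an already-erased spare area.

```c
#include <stdint.h>

/* 0xFFFFFF (erased flash) means "no entry here"; the packed value always
 * has its top two bits clear, so it can never collide with this sentinel. */
#define RELOC_EMPTY 0xFFFFFFu

/* Pack an 11-bit bad block number and an 11-bit replacement block number
 * (22 bits total) into three bytes. */
void reloc_pack(uint8_t out[3], uint16_t bad_block, uint16_t new_block)
{
    uint32_t v = ((uint32_t)(bad_block & 0x7FF) << 11) |
                 (uint32_t)(new_block & 0x7FF);
    out[0] = (uint8_t)(v >> 16);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v);
}

/* Returns 1 and fills the block numbers if the entry is programmed,
 * 0 if the three bytes are still erased. */
int reloc_unpack(const uint8_t in[3], uint16_t *bad_block, uint16_t *new_block)
{
    uint32_t v = ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
    if (v == RELOC_EMPTY)
        return 0;
    *bad_block = (uint16_t)((v >> 11) & 0x7FF);
    *new_block = (uint16_t)(v & 0x7FF);
    return 1;
}
```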
However, consider the case where your NAND controller/MCU has 1024 bits of EEPROM with > 100 guaranteed write cycles at hand. You could store the bad block information there (at 24 bits per relocation, you can fit 42 relocations in 1008 bits). If you keep a copy of the same table in RAM (24 bits, i.e. two 12-bit fields, per entry) and keep the list sorted, a binary search will find the mapping, or determine there is none, by examining at most six entries. Compared to the 300 µs page program on the Micron, that's nothing.
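A minimal sketch of that RAM-side lookup, assuming the table has already been copied out of the EEPROM and is kept sorted by bad block number:

```c
#include <stdint.h>

#define MAX_RELOCS 42u

typedef struct {
    uint16_t bad_block;   /* 11 bits used */
    uint16_t new_block;   /* 11 bits used */
} reloc_t;

static reloc_t  relocs[MAX_RELOCS];   /* kept sorted by bad_block */
static uint16_t reloc_count;

/* Map a block number through the relocation table; identity if not found.
 * Binary search: at most six probes for 42 entries. */
uint16_t block_remap(uint16_t block)
{
    uint16_t lo = 0, hi = reloc_count;
    while (lo < hi) {
        uint16_t mid = (uint16_t)((lo + hi) / 2);
        if (relocs[mid].bad_block == block)
            return relocs[mid].new_block;
        if (relocs[mid].bad_block < block)
            lo = (uint16_t)(mid + 1);
        else
            hi = mid;
    }
    return block;   /* not relocated */
}
```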
Why use separate memory for the bad block mapping, then? Because you want to read it as fast as possible during startup, and it is typically not much data at all. Existing entries are not modified per se: with a hash table you only ever add new entries, and with a sorted table for binary search you may shuffle entries to insert a new one. The number of write cycles needed is therefore just the maximum number of relocations! It makes sense to make this small section more robust, at the cost of extra silicon.