I used it when I implemented SSL a long time ago, though not on an FPGA. I had a regular version, an MMX version, and the then super-modern SSE2 version (although SSE2 didn't add much). In each case I compared Karatsuba against the simpler algorithms (diagonal or schoolbook). It was slower for shorter multiplications and only pulled ahead for longer ones, starting from 512x512. It did even worse for squaring: the smallest size where it outperformed the others was 1536x1536. The results were the same across all 3 hardware versions.
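To make the crossover idea concrete, here is a minimal Python sketch of Karatsuba that falls back to the schoolbook method below a cutoff. This is not my original code: the names `schoolbook_mul`, `karatsuba_mul`, `cutoff_bits` and `limb_bits` are made up for illustration, and treating the cutoff as a bit length with a default of 512 is an assumption. The real crossover has to be measured on your own hardware.

```python
def schoolbook_mul(a, b, limb_bits=32):
    """Multiply two non-negative ints limb by limb (quadratic number of limb products)."""
    mask = (1 << limb_bits) - 1
    result = 0
    shift_a = 0
    while a:
        limb_a = a & mask
        rest_b = b
        shift_b = 0
        while rest_b:
            # One limb-by-limb product, accumulated at its weight.
            result += (limb_a * (rest_b & mask)) << (shift_a + shift_b)
            rest_b >>= limb_bits
            shift_b += limb_bits
        a >>= limb_bits
        shift_a += limb_bits
    return result


def karatsuba_mul(a, b, cutoff_bits=512):
    """Karatsuba for non-negative ints; uses schoolbook below cutoff_bits (assumed value)."""
    cutoff_bits = max(cutoff_bits, 8)  # guard so the recursion always shrinks
    n = max(a.bit_length(), b.bit_length())
    if n < cutoff_bits:
        return schoolbook_mul(a, b)
    half = n // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask
    lo = karatsuba_mul(a_lo, b_lo, cutoff_bits)
    hi = karatsuba_mul(a_hi, b_hi, cutoff_bits)
    # Third product replaces the two cross products a_hi*b_lo and a_lo*b_hi.
    mid = karatsuba_mul(a_lo + a_hi, b_lo + b_hi, cutoff_bits) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo
```

Timing `karatsuba_mul` against `schoolbook_mul` with different `cutoff_bits` values is the same kind of comparison I did, just without the MMX/SSE2 details.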
I think it's the same on an FPGA. If you use DSP blocks, an extra multiplication is less of a problem than the multiple extra additions Karatsuba needs.
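As a rough illustration of the trade-off, the sketch below counts only limb multiplications for schoolbook versus Karatsuba. The function name `limb_multiplies` and its parameters are hypothetical, and the model is deliberately simplified: it ignores the extra additions and subtractions each Karatsuba step introduces, as well as the carry growth of the half-sums, which are exactly the costs that matter when multipliers are cheap.

```python
def limb_multiplies(n_limbs, cutoff_limbs):
    """Count limb products for an n_limbs x n_limbs multiply (simplified model).

    At or below cutoff_limbs the schoolbook method is used (n^2 products);
    above it, one Karatsuba step replaces 4 half-size multiplies with 3.
    Additions and subtractions are NOT counted here.
    """
    if n_limbs <= cutoff_limbs:
        return n_limbs * n_limbs
    half = (n_limbs + 1) // 2
    return 3 * limb_multiplies(half, cutoff_limbs)


schoolbook = limb_multiplies(16, 16)     # pure schoolbook: 256 products
one_level = limb_multiplies(16, 8)       # one Karatsuba level: 192 products
```

The multiplication count drops by a quarter per level, but every level also adds several wide additions and subtractions, so with dedicated DSP multipliers the saving may not pay for itself until the operands are large.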