lwIP seems to be pretty much the only free TCP/IP stack that is used everywhere: most of the microcontroller vendors integrate it, mbed integrates it, etc. As with all bare-metal, resource-optimized code, there can be issues when you pare things down to the bare minimum and run in low-memory conditions. For example, messing with buffer pool sizes can give you configurations where, after certain conditions, there is not enough memory left to make progress and everything goes bad.
Performance depends a lot on the configuration the stack is used with: how many RX DMA buffers, how much memory for the buffer pool. Increasing the TCP MSS gives good throughput results, but again takes much more memory, and memory is one thing microcontrollers typically lack.
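To make the trade-off concrete, here is a minimal lwipopts.h sketch using the standard lwIP option names from opt.h; the values are purely illustrative, not a recommendation for any particular part:

```c
/* lwipopts.h -- illustrative sizing only; tune for your own chip and load. */

#define MEM_SIZE            (16 * 1024)  /* heap, mostly for outgoing data */
#define PBUF_POOL_SIZE      16           /* RX pbufs; keep in step with the
                                            number of RX DMA descriptors   */
#define PBUF_POOL_BUFSIZE   512

/* A bigger MSS improves throughput, but every unit here costs RAM:
 * lwIP requires TCP_SND_BUF and TCP_WND to be at least 2 * TCP_MSS,
 * so raising TCP_MSS multiplies the per-connection memory footprint. */
#define TCP_MSS             1460
#define TCP_SND_BUF         (4 * TCP_MSS)
#define TCP_WND             (2 * TCP_MSS)

#define MEMP_NUM_TCP_PCB    8            /* max simultaneous TCP connections */
```

On a RAM-starved part you would shrink TCP_MSS (e.g. to 536) and the pbuf pool first, and accept the throughput hit.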
Probably most of the issues happen in user code, because implementing a service on top of the raw API has quite a lot of pitfalls. One example is proper handling of TCP aborts: an abort can come at any time, including before the connection is accepted. It took me a lot of debugging and testing to get everything right (especially when writing a multi-connection service with quite high throughput). Another possible issue is an out-of-memory deadlock, where the application gathers a lot of incoming data and does not free it until it gets a buffer to send out the reply, but the memory it needs for that reply is stuck in the incoming buffers it is holding. So programming network stuff on embedded bare-metal platforms is definitely much more difficult and error-prone than on desktop OSes.
A quirk I've patched out of lwIP is the TCP out-of-memory connection priority handling. For some reason, they implemented it so that in an out-of-memory condition a new incoming connection with the SAME or higher priority is preferred. This means that if another connection comes into the same service while the system is low on memory, the previous connection that is being serviced gets killed and its buffers released. At some point nothing gets done anymore, because retrying clients keep killing each other. This happens only in very-low-memory configurations. After patching it out, my HTTP/1.0 server can answer >400 connections/sec of dynamic JSON system status on a 48 MHz STM32F107.
Price-wise, the STM32F107 + KSZ8081 combo has really good value and is hard to beat, and it is still a Cortex-M3. If it has enough processing power and memory for the task, use it. I've used the LPC1768 (and 69, 7x) too; they are also very good chips, better in most aspects, but not in the same price league as the F107. If the product is higher volume and/or price sensitive, that can be a deal breaker. If your budget allows for a higher-end micro (for TCP, RAM is the first thing you run out of), go with it and don't waste your time trying to optimize and slim everything down. If you're planning to migrate down to the F107 eventually, it's better to start with the F207 or F407, so you can later reuse the same know-how, tools and code on the F107. AFAIK, NXP has nothing close price-wise.