It's not 2 integer units with 1 FP unit; it's 2 integer units with 2 128-bit FP units that can act together as 1 256-bit FP unit. The problem is that the compiler and schedulers need to be set up properly to take advantage of this.
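To make the FP sharing concrete, here's a rough toy model in Python. It is only a sketch: it assumes a module has two 128-bit FP pipes shared by both integer cores and that a 256-bit op occupies both pipes at once. The names (`pipe_slots`, `issue_cycles`) are made up for illustration, and it ignores latency, dependencies, and real scheduler behavior.

```python
# Toy model, not a cycle-accurate simulator: one Bulldozer module is assumed
# to expose two 128-bit FP pipes that are shared by its two integer cores.
PIPES_PER_MODULE = 2    # two 128-bit FP pipes (as described in the post)
PIPE_WIDTH_BITS = 128


def pipe_slots(op_width_bits):
    """Number of 128-bit pipe slots one FP op occupies (hypothetical helper)."""
    if op_width_bits <= 0 or op_width_bits % PIPE_WIDTH_BITS != 0:
        raise ValueError("op width must be a positive multiple of 128 bits")
    return op_width_bits // PIPE_WIDTH_BITS


def issue_cycles(op_widths):
    """Minimum cycles to issue the ops, assuming perfect scheduling and
    ignoring latency, dependencies, and everything else a real core does."""
    total_slots = sum(pipe_slots(w) for w in op_widths)
    return -(-total_slots // PIPES_PER_MODULE)  # ceiling division


# Two 128-bit ops (e.g. one from each core) can share a cycle.
print(issue_cycles([128, 128]))  # 1
# One 256-bit op takes both pipes for that cycle.
print(issue_cycles([256]))       # 1
# Two 256-bit ops need two cycles because each one uses both pipes.
print(issue_cycles([256, 256]))  # 2
```

In this simplified picture the module only loses FP throughput per core when both cores issue wide FP work at the same time, which is why the scheduling matters.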
Of course Intel beats AMD at single-threaded operations, but AMD beats Intel at heavily threaded ones. This has been the status quo forever, and it doesn't mean AMD's design is any less efficient. Also, the whole point of the Bulldozer design is multithreading and, in the future, heterogeneous compute.
And the simple fact is that if an application is old enough to be that lightly threaded, you're not going to see the difference anyway, because it doesn't require that much single-threaded performance. Where you actually need the performance, in multithreaded workloads, AMD will be faster.