In your case, it is about 4ms faster at rendering a single texture.

Optimising 4 ms away is really hard work, especially if it is made up of a hundred small delays scattered across different parts of the driver. That's what my armchair response was about.
4 ms is really a very long time, a huge number of CPU instructions. Plus, CPU utilization is not high. This is not about code optimization. I am certain it is not a hundred little things; it must be a couple of big ones.
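
To put a rough number on "a huge number of CPU instructions", here is a back-of-the-envelope sketch. The 3 GHz clock is my own assumption, not a measurement from this system, and cycles are only a loose stand-in for instructions:

```python
# Rough cycle budget for a 4 ms gap.
# ASSUMED_CLOCK_MHZ is a hypothetical figure, not taken from the hardware discussed here.
ASSUMED_CLOCK_MHZ = 3000      # 3 GHz expressed as cycles per microsecond
GAP_US = 4000                 # the 4 ms difference, in microseconds
PIECES = 100                  # the "hundred little things" scenario

total_cycles = GAP_US * ASSUMED_CLOCK_MHZ   # cycles available in 4 ms
cycles_per_piece = total_cycles // PIECES   # cycles each piece would have to waste
us_per_piece = GAP_US // PIECES             # time each piece would have to waste

print(f"{total_cycles:,} cycles in total")
print(f"{cycles_per_piece:,} cycles ({us_per_piece} us) per piece")
```

Under that assumption, each of the hundred pieces would need to burn on the order of a hundred thousand cycles, which seems hard to square with low CPU utilization.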