From what I understand from the mailing list, the "specific workloads" that are supposed to improve are situations where one core runs two *different* processes that use the same shared library. Due to address space randomization, the same library code can end up at different virtual addresses in each process, which confuses the CPU's instruction cache.
Benchmarking one process at a time shouldn't show any difference, then. Even multi-threaded benchmarks within a single process wouldn't (since the library sits at the same address in every thread). But a desktop where most of your processes heavily use Qt or GTK+, or a server where several different programs encrypt their traffic through libssl, should benefit.
If you want to measure an actual difference, try compressing data with two different binaries that use zlib at the same time, e.g. compile the same zlib-using program twice with slightly different compiler settings and/or source code changes, then run both binaries simultaneously. I'm not sure whether they need to run on the same core for the improvement to show.