Thanks for the benchmarks. The asm vs. compiler-generated ratio is pretty much as expected, but if I'm reading this correctly the PGO versions are not faster (even slightly slower!?), which means you are not getting it to work properly. You need to run the PGO builds through an encode and then recompile so that the compiler can use the generated runtime data. As I recall there is a semi-automated framework for this in x264; I'll see if I can find some proper instructions and redo the PGO tests myself (unless you would like to). According to 'Dark Shikari', PGO gave another 5% performance increase in total even with all assembly optimizations enabled, so PGO isn't working in your tests.
I talked with Dark Shikari on #x264 about the results. He said A.) PGO would help more with the hand asm, since it apparently does not benefit what the pure C build spends most of its time doing (DSP functions), and B.) I screwed something up with the PGO, because it should be ~1% faster. I did the build correctly from what I can tell (make fprofiled VIDS="videohere.y4m"), but I couldn't be bothered to recompile, retest, etc. just to confirm a ~1% performance increase.
Weird, why would PGO help more with the asm version? It should be the exact opposite imo, since the compiler can't optimize the hand-written assembly in any way, but it should at least be able to do some optimizing of the C code. Anyway, I can understand why you wouldn't want to redo the tests, since the argument was about hand-optimized assembly vs. compiler-generated code. Maybe I'll give it a shot myself, since I'm curious about what Shikari said. Again, thanks for the benchmarks!