Thanks for the benchmarks. The ratio of asm to compiler-generated code is pretty much as expected, but if I'm reading this correctly the PGO versions are not faster (even slightly slower!?), which means you are not getting it to work properly. You need to run the PGO build through an encode so it can collect runtime data, and then recompile so the compiler can use that data. As I recall there is a semi-automated framework for this in x264; I'll see if I can find some proper instructions and redo the PGO tests myself (unless you would like to). According to 'Dark Shikari', even with all assembly optimizations enabled, PGO gave another 5% performance increase in total, so PGO isn't working in your tests.
I talked with Dark Shikari on #x264 about the results. He said A) PGO would help more with the hand asm, since it apparently does not benefit what the pure C build spends most of its time doing (DSP functions), and B) I screwed something up with the PGO, because it should be ~1% faster. As far as I can tell I did the build correctly (make fprofiled VIDS="videohere.y4m"), but I couldn't be bothered to recompile, retest, etc. to confirm a ~1% performance increase.
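For anyone who does want to check a gain that small, here is a rough sketch of how the comparison could be done: run each build several times and compare the change in mean fps against the run-to-run spread. The function name compare_fps and all the numbers below are made up for illustration; they are not real x264 results.

```python
from statistics import mean, stdev

def compare_fps(baseline_fps, pgo_fps):
    """Return (percent change in mean fps, noise as percent of baseline mean)."""
    base_mean = mean(baseline_fps)
    pgo_mean = mean(pgo_fps)
    # Relative change of the PGO build versus the baseline build.
    change_pct = (pgo_mean - base_mean) / base_mean * 100.0
    # Larger of the two sample standard deviations, relative to the baseline mean.
    noise_pct = max(stdev(baseline_fps), stdev(pgo_fps)) / base_mean * 100.0
    return change_pct, noise_pct

# Placeholder measurements only (hypothetical, NOT real benchmark data).
baseline = [50.0, 50.5, 49.5]
pgo = [50.5, 51.0, 50.0]

change, noise = compare_fps(baseline, pgo)
print(f"change: {change:+.2f}%, run-to-run noise: {noise:.2f}%")
```

If the noise figure comes out about as large as the change, as it does with these placeholder values, a handful of runs is not enough to say whether PGO helped, which is probably why a ~1% difference is so easy to miss.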