AMD Has A Nice Performance Optimization Coming With Linux 6.8
Queued up into tip/tip.git's x86/cpu branch ahead of the Linux 6.8 merge window opening in a month is an optimization that should prove helpful in cloud/VM scenarios.
The change slated for Linux 6.8 stops serializing certain model-specific register (MSR) accesses on AMD (and Zen 1 derived Hygon) processors. Intel CPUs need a fence ahead of writes to the Time Stamp Counter (TSC) deadline MSR (IA32_TSC_DEADLINE) and the X2APIC MSRs, so that has been the default behavior for all CPUs on Linux x86_64. An Intel Linux engineer previously explained that behavior as:
"The reason the kernel uses a different semantic is that the SDM changed (roughly in late 2017). The SDM changed because folks at Intel were auditing all of the recommended fences in the SDM and realized that the x2apic fences were insufficient.
Why was the plain MFENCE judged insufficient?
WRMSR itself is normally a serializing instruction. No fences are needed because the instruction itself serializes everything.
But, there are explicit exceptions for this serializing behavior written into the WRMSR instruction documentation for two classes of MSRs: IA32_TSC_DEADLINE and the X2APIC MSRs.
Back to x2apic: WRMSR is *not* serializing in this specific case. But why is MFENCE insufficient? MFENCE makes writes visible, but only affects load/store instructions. WRMSR is unfortunately not a load/store instruction and is unaffected by MFENCE. This means that a non-serializing WRMSR could be reordered by the CPU to execute before the writes made visible by the MFENCE have even occurred in the first place.
This means that an x2apic IPI could theoretically be triggered before there is any (visible) data to process.
Does this affect anything in practice? I honestly don't know. It seems quite possible that by the time an interrupt gets to consume the (not yet) MFENCE'd data, it has become visible, mostly by accident.
To be safe, add the SDM-recommended fences for all x2apic WRMSRs.
This also leaves open the question of the _other_ weakly-ordered WRMSR: MSR_IA32_TSC_DEADLINE. While it has the same ordering architecture as the x2APIC MSRs, it seems substantially less likely to be a problem in practice. While writes to the in-memory Local Vector Table (LVT) might theoretically be reordered with respect to a weakly-ordered WRMSR like TSC_DEADLINE."
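The ordering described in the quote can be sketched in user-space C. This is only an illustration under stated assumptions: `mock_wrmsr()`, `fence_before_weak_wrmsr()` and `send_ipi_sketch()` are hypothetical names, the real WRMSR instruction is privileged and is replaced here by a mock that records the write, and on non-x86-64 builds a generic C11 fence stands in for MFENCE plus LFENCE. The MSR number 0x830 is the x2APIC interrupt command register.

```c
#include <stdatomic.h>
#include <stdint.h>

#define X2APIC_ICR_MSR 0x830u /* x2APIC interrupt command register */

/* Hypothetical mock: WRMSR is privileged, so we only record the write. */
struct mock_msr_log {
    uint32_t msr;
    uint64_t value;
    int writes;
};

static void mock_wrmsr(struct mock_msr_log *log, uint32_t msr, uint64_t value)
{
    log->msr = msr;
    log->value = value;
    log->writes++;
}

/* MFENCE makes earlier stores visible; LFENCE then keeps later
 * instructions (the non-serializing WRMSR) from running ahead of it. */
static void fence_before_weak_wrmsr(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile("mfence; lfence" : : : "memory");
#else
    atomic_thread_fence(memory_order_seq_cst); /* portable stand-in only */
#endif
}

/* Publish data for the target CPU, fence, then trigger the IPI. */
static void send_ipi_sketch(struct mock_msr_log *log, uint64_t *mailbox,
                            uint64_t payload, uint32_t dest_apic_id,
                            uint8_t vector)
{
    *mailbox = payload;        /* 1. data the interrupt handler will read */
    fence_before_weak_wrmsr(); /* 2. order the store before the MSR write */
    /* 3. x2APIC ICR layout: destination in bits 63:32, vector in bits 7:0 */
    mock_wrmsr(log, X2APIC_ICR_MSR, ((uint64_t)dest_apic_id << 32) | vector);
}
```

Without step 2, a non-serializing WRMSR could in principle execute before the store in step 1 is visible, which is the scenario the quoted explanation warns about.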
So the Linux x86/x86_64 kernel has defaulted to issuing an MFENCE followed by an LFENCE ahead of these MSR writes, without any CPU-specific checks. It turns out AMD CPUs don't need this, and skipping the fence for TSC_DEADLINE/X2APIC writes can help performance.
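As a rough illustration of the idea (not the actual kernel patch), the decision can be modeled as a per-vendor check that skips the fence on AMD and Hygon. Everything named here is hypothetical: the `cpu_vendor_sketch` enum, `apic_msrs_need_fence()`, and the `fences_issued` counter that stands in for the real MFENCE/LFENCE pair. The real kernel may well resolve this once at boot rather than branching on every call.

```c
#include <stdbool.h>

/* Hypothetical vendor enum for illustration only. */
enum cpu_vendor_sketch {
    VENDOR_INTEL,
    VENDOR_AMD,
    VENDOR_HYGON,
    VENDOR_OTHER
};

/* AMD and Hygon do not need the barrier; everyone else keeps the
 * conservative default of fencing. */
static bool apic_msrs_need_fence(enum cpu_vendor_sketch vendor)
{
    return vendor != VENDOR_AMD && vendor != VENDOR_HYGON;
}

/* Counter standing in for the cost of each "mfence; lfence" pair. */
static unsigned int fences_issued;

static void weak_wrmsr_fence_sketch(enum cpu_vendor_sketch vendor)
{
    if (!apic_msrs_need_fence(vendor))
        return;      /* AMD/Hygon: skip the unnecessary penalty */
    fences_issued++; /* elsewhere: the fence would be executed here */
}
```

Calling `weak_wrmsr_fence_sketch(VENDOR_AMD)` leaves `fences_issued` unchanged, while `weak_wrmsr_fence_sketch(VENDOR_INTEL)` increments it.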
The patch slated for Linux 6.8 will no longer serialize MSR accesses on AMD processors. The patch outlines the performance benefits from this change:
"AMD does not have the requirement for a synchronization barrier when acccessing a certain group of MSRs. Do not incur that unnecessary penalty there.
...
On a AMD Zen4 system with 96 cores, a modified ipi-bench on a VM shows x2AVIC IPI rate is 3% to 4% lower than AVIC IPI rate. The ipi-bench is modified so that the IPIs are sent between two vCPUs in the same CCX. This also requires to pin the vCPU to a physical core to prevent any latencies. This simulates the use case of pinning vCPUs to the thread of a single CCX to avoid interrupt IPI latency.
...
With the above configuration:
*) Performance measured using ipi-bench for AVIC:
Average Latency: 1124.98ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 42.6759M/s [Total number of IPIs sent in a second from 48 vCPUs simultaneously]
*) Performance measured using ipi-bench for x2AVIC:
Average Latency: 1172.42ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 40.9432M/s [Total number of IPIs sent in a second from 48 vCPUs simultaneously]
From above, x2AVIC latency is ~4% more than AVIC. However, the expectation is x2AVIC performance to be better or equivalent to AVIC. Upon analyzing the perf captures, it is observed significant time is spent in weak_wrmsr_fence() invoked by x2apic_send_IPI().
With the fix to skip weak_wrmsr_fence()
*) Performance measured using ipi-bench for x2AVIC:
Average Latency: 1117.44ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 42.9608M/s [Total number of IPIs sent in a second from 48 vCPUs simultaneously]
Comparing the performance of x2AVIC with and without the fix, it can be seen the performance improves by ~4%.
Performance captured using an unmodified ipi-bench using the 'mesh-ipi' option with and without weak_wrmsr_fence() on a Zen4 system also showed significant performance improvement without weak_wrmsr_fence(). The 'mesh-ipi' option ignores CCX or CCD and just picks random vCPU.
Average throughput (10 iterations) with weak_wrmsr_fence(),
Cumulative throughput: 4933374 IPI/s
Average throughput (10 iterations) without weak_wrmsr_fence(),
Cumulative throughput: 6355156 IPI/s"
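To put the quoted figures side by side, the relative changes can be worked out with a small helper. This is just arithmetic on the numbers quoted above; `percent_change()` is a hypothetical helper name, not something from the patch or from ipi-bench.

```c
/* Relative change from `before` to `after`, in percent.
 * Positive means `after` is larger. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Figures copied from the quoted patch description. */
static const double avic_latency_ns         = 1124.98;
static const double x2avic_latency_ns       = 1172.42;
static const double x2avic_fixed_latency_ns = 1117.44;
static const double x2avic_tput_mps         = 40.9432;
static const double x2avic_fixed_tput_mps   = 42.9608;
static const double mesh_with_fence_ips     = 4933374.0;
static const double mesh_without_fence_ips  = 6355156.0;
```

With these inputs, `percent_change(avic_latency_ns, x2avic_latency_ns)` comes out at roughly +4.2%, `percent_change(x2avic_latency_ns, x2avic_fixed_latency_ns)` at roughly -4.7%, `percent_change(x2avic_tput_mps, x2avic_fixed_tput_mps)` at roughly +4.9%, and `percent_change(mesh_with_fence_ips, mesh_without_fence_ips)` at roughly +28.8%, so the random-target 'mesh-ipi' case shows by far the largest gain.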
With this fencing having been the default behavior of the Linux x86_64 kernel for a few years now, it's a bit surprising that AMD or its partners did not spot the optimization opportunity sooner.
Barring any issues coming up with the patch, now that it's part of a TIP branch it should in turn be part of the Linux 6.8 kernel changes in early 2024.