New Patches Aim To Optimize Context Switching With Two Improvements
A set of patches posted Friday night brings some promising context switching optimizations to the Linux kernel.
Longtime Linux developer Rik van Riel of Meta has worked on a set of context switching optimizations. The series targets two improvements, prompted by the finding that on an unspecified web server workload a significant portion of CPU time was spent within the switch_mm_irqs_off function.
Rik van Riel explained in the patch series:
"While profiling switch_mm_irqs_off with several workloads, it appears there are two hot spots that probably don't need to be there.
The first is the atomic clearing and setting of the current CPU in prev's and next's mm_cpumask. This can create a large amount of cache line contention. On a web server, these two together take about 17% of the CPU time spent in switch_mm_irqs_off.
We should be able to avoid much of the cache line thrashing by only clearing bits in mm_cpumask lazily from the first TLB flush to a process, after which the other TLB flushes can be more narrowly targeted.
A second cause of overhead seems to be the cpumask_test_cpu inside the WARN_ON_ONCE in the prev == next branch of switch_mm_irqs_off.
This warning never ever seems to fire, even on a very large fleet, so it may be best to hide that behind CONFIG_DEBUG_VM. With the web server workload, this is also about 17% of switch_mm_irqs_off."
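To make the two ideas in the quote concrete, here is a minimal userspace sketch in C. It is not the kernel code from the patch series: the `toy_` names, the four-CPU limit, the `toy_running` table, and the `mask_writes` counter are all hypothetical, chosen only for illustration. It models the mask as a single atomic word, compares eager bit clearing on every switch against lazy clearing at flush time, and compiles the sanity check out unless a hypothetical `TOY_DEBUG_VM` macro is defined, in the spirit of hiding the warning behind CONFIG_DEBUG_VM.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_NR_CPUS 4

/* Hypothetical stand-in for an mm: one mask bit per CPU. */
struct toy_mm {
    _Atomic unsigned long cpumask;
    unsigned long mask_writes;   /* atomic read-modify-writes issued on the mask */
};

/* Which mm each simulated CPU is currently running (toy bookkeeping only). */
static struct toy_mm *toy_running[TOY_NR_CPUS];

/* The check is only compiled in for debug builds, so normal builds never
 * touch the shared mask just to evaluate a warning that does not fire. */
#ifdef TOY_DEBUG_VM
#define TOY_DEBUG_WARN_ON(cond) \
    do { if (cond) fprintf(stderr, "warning: %s\n", #cond); } while (0)
#else
#define TOY_DEBUG_WARN_ON(cond) do { } while (0)
#endif

static void toy_set_cpu(struct toy_mm *mm, int cpu)
{
    atomic_fetch_or(&mm->cpumask, 1UL << cpu);
    mm->mask_writes++;
}

static void toy_clear_cpu(struct toy_mm *mm, int cpu)
{
    atomic_fetch_and(&mm->cpumask, ~(1UL << cpu));
    mm->mask_writes++;
}

static bool toy_test_cpu(struct toy_mm *mm, int cpu)
{
    return (atomic_load(&mm->cpumask) & (1UL << cpu)) != 0;
}

/* Eager scheme: every switch clears prev's bit and sets next's bit,
 * i.e. two atomic writes to shared cache lines per context switch. */
static void toy_switch_eager(struct toy_mm *prev, struct toy_mm *next, int cpu)
{
    if (prev)
        toy_clear_cpu(prev, cpu);
    toy_set_cpu(next, cpu);
    toy_running[cpu] = next;
}

/* Lazy scheme: never clear on switch, and only write next's bit
 * when it is not already set. */
static void toy_switch_lazy(struct toy_mm *prev, struct toy_mm *next, int cpu)
{
    if (prev == next) {
        TOY_DEBUG_WARN_ON(!toy_test_cpu(next, cpu));
        return;
    }
    if (!toy_test_cpu(next, cpu))
        toy_set_cpu(next, cpu);
    toy_running[cpu] = next;
}

/* Flush for the lazy scheme: target every CPU still in the mask and clear
 * the bit of any CPU that is no longer running this mm, so later flushes
 * target fewer CPUs. Returns how many CPUs this flush had to target. */
static int toy_flush_lazy(struct toy_mm *mm)
{
    int targeted = 0;

    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        if (!toy_test_cpu(mm, cpu))
            continue;
        targeted++;
        if (toy_running[cpu] != mm)
            toy_clear_cpu(mm, cpu);   /* stale bit, cleaned up lazily */
    }
    return targeted;
}
```

In this toy model, a CPU that bounces between two address spaces writes to the masks on every switch under the eager scheme, but only once per address space under the lazy one. The cost moves to the first flush, which still targets the stale CPU once before its bit is dropped.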
With the first patch tested on an AMD EPYC Milan server, a simple Hackbench test case dropped from 4.5 seconds to 4.2 seconds, a reduction of roughly 7%. The second optimization is also credited with significant CPU time savings, though the only figure given is that the check accounts for about 17% of the CPU time spent in switch_mm_irqs_off on the web server workload.
These Linux context switching optimizations are now under review and could be mainlined in an upcoming kernel cycle, perhaps v6.13, depending on how quickly the code is reviewed and signed off.