**bezirg** · 13 June 2022, 06:31 AM

Is this strictly for multi-socket systems (>= 2 sockets), or can single-socket desktop systems with a multi-LLC CPU (Zen 5950X, 5900X, 3950X, 3900X) also benefit from this?

**Vorpal** · 13 June 2022, 07:03 AM

Originally posted by bezirg

Is this strictly for multi-socket systems (>= 2 sockets), or can single-socket desktop systems with a multi-LLC CPU (Zen 5950X, 5900X, 3950X, 3900X) also benefit from this?

The way I read it: any CPU with multiple chiplets. I believe some of the lower-end Ryzen CPUs have only a single chiplet.

**bezirg** · 13 June 2022, 07:37 AM

Originally posted by Vorpal

The way I read it: any CPU with multiple chiplets. I believe some of the lower-end Ryzen CPUs have only a single chiplet.

If this is true, could Michael include a plain desktop Zen 3 5950X in the benchmarks to see the impact? That would be relevant to far more Phoronix readers.

**ET3D** · 13 June 2022, 07:39 AM

Originally posted by bezirg

Is this strictly for multi-socket systems (>= 2 sockets), or can single-socket desktop systems with a multi-LLC CPU (Zen 5950X, 5900X, 3950X, 3900X) also benefit from this?

As I understand it, it won't help Ryzen 3000 and up, as such CPUs don't have any NUMA domains internally. All RAM accesses go through the I/O die, so memory is equally distant from all chiplets. Threadripper 1000 and 2000 worked as NUMA: each die in the package had its own RAM channels, so although data could be transferred between dies, access to a die's local channels was faster. However, whether this helps TR 1000 and 2000 depends on how current scheduling treats them, which I don't know.
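
For anyone who wants to check what their own machine exposes, here is a minimal sketch that lists the NUMA nodes the kernel reports. It assumes a Linux system with the usual sysfs layout under `/sys/devices/system/node`; the function name `numa_nodes` and its `sysfs_root` parameter are made up for illustration. A result with a single node would be consistent with the "no internal NUMA domains" reading, though firmware settings can change what is reported.

```python
import glob
import os
import re


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> CPU list string as reported by the kernel.

    Assumption: the standard Linux sysfs layout (nodeN/cpulist).
    Returns an empty dict if the directory does not exist.
    """
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as f:
                nodes[int(match.group(1))] = f.read().strip()
        except OSError:
            # Node directory without a readable cpulist: skip it.
            continue
    return nodes
```

Calling `numa_nodes()` with no arguments reads the live system; passing another directory is only useful for testing.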

**S.Pam** · 13 June 2022, 08:19 AM

Originally posted by Vorpal

The way I read it: any CPU with multiple chiplets. I believe some of the lower-end Ryzen CPUs have only a single chiplet.

Yeah. On our Epyc servers I see a big boost from enabling NUMA emulation per chiplet/CCX.

**scottishduck** · 13 June 2022, 09:04 AM

Important to remember this is for the fair scheduler. Most people will be using schedutil.

**Michael** · 13 June 2022, 09:07 AM

Originally posted by scottishduck

Important to remember this is for the fair scheduler. Most people will be using schedutil.

Completely different things... This isn't a CPU frequency driver / governor optimization.

**Degra** · 13 June 2022, 09:45 AM

Originally posted by ET3D

As I understand it, it won't help Ryzen 3000 and up, as such CPUs don't have any NUMA domains internally. All RAM accesses go through the I/O die, so memory is equally distant from all chiplets. Threadripper 1000 and 2000 worked as NUMA: each die in the package had its own RAM channels, so although data could be transferred between dies, access to a die's local channels was faster. However, whether this helps TR 1000 and 2000 depends on how current scheduling treats them, which I don't know.

What ?!?
How is this about RAM access?
This is about cache access!

In the Ryzen 5000 series the CPUs up to Ryzen 5800X use only 1 chiplet with 8 cores and 32 MB L3 cache and thus won't benefit,
but Ryzen 5900X and 5950X use 2 chiplets, with 12 or 16 cores respectively and 2 x 32 MB L3 cache.

You will absolutely see a NUMA optimization benefit on 5900X and 5950X, since a process can be pegged to a chiplet that has the data cached in its L3 cache.
There is a significant latency penalty from accessing the L3 cache from a different chiplet, so you always want to have the process running on the chiplet with the appropriate cache.

The bandwidth of the Infinity Fabric is also limited, so it is better to avoid crossing it when you can.
Simply keep the process on the same chiplet as its cached data.
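
As a rough illustration of pinning by hand (this is not the kernel patch from the article, which changes scheduler behaviour), the sketch below groups CPUs by shared L3 and restricts a process to one group. It assumes Linux, and it assumes that `cache/index3` in sysfs is the L3 cache, which is typical on x86 but not guaranteed. The helper names `parse_cpu_list`, `l3_domains` and `pin_to_domain` are made up for this example.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0-5,12-17' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_domains(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct groups of CPUs that share one L3 cache.

    Assumption: cache/index3 is the L3 on this machine.
    Returns an empty list if the sysfs files are not present.
    """
    domains = set()
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "cache", "index3", "shared_cpu_list"
    )
    for path in glob.glob(pattern):
        with open(path) as f:
            domains.add(tuple(parse_cpu_list(f.read())))
    return sorted(domains)


def pin_to_domain(pid, domain):
    """Restrict pid (0 means the calling process) to one L3 domain. Linux only."""
    os.sched_setaffinity(pid, set(domain))
```

On a two-chiplet part one would expect `l3_domains()` to return two groups; `pin_to_domain(0, l3_domains()[0])` would then keep the current process on the first one. Whether that helps depends on the workload, since it also halves the cores available to the process.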

**Linuxxx** · 13 June 2022, 10:11 AM

Originally posted by Degra

What ?!?
How is this about RAM access?
This is about cache access!

In the Ryzen 5000 series the CPUs up to Ryzen 5800X use only 1 chiplet with 8 cores and 32 MB L3 cache and thus won't benefit,
but Ryzen 5900X and 5950X use 2 chiplets, with 12 or 16 cores respectively and 2 x 32 MB L3 cache.

You will absolutely see a NUMA optimization benefit on 5900X and 5950X, since a process can be pegged to a chiplet that has the data cached in its L3 cache.
There is a significant latency penalty from accessing the L3 cache from a different chiplet, so you always want to have the process running on the chiplet with the appropriate cache.

The bandwidth of the Infinity Fabric is also limited, so it is better to avoid crossing it when you can.
Simply keep the process on the same chiplet as its cached data.

True, that's why RPCS3 (PS3 emulator) tends to perform significantly worse on these multi-CCX Ryzens & Threadrippers, because this software actually makes proper use of all available cores & threads while doing real-time 3D rendering, unlike most "modern" AAA games.

With A Few Lines Of Code, AMD's Nice Performance Optimization For Linux 5.20
