AMD Linux EDAC Driver Prepares For Zen 4, RDDR5 / LRDDR5 Memory

  • AMD Linux EDAC Driver Prepares For Zen 4, RDDR5 / LRDDR5 Memory

    Phoronix: AMD Linux EDAC Driver Prepares For Zen 4, RDDR5 / LRDDR5 Memory

    AMD's Linux engineers continue preparing for next-gen EPYC server processors based on Zen 4 and supporting DDR5 memory...


  • #2
    The patches do also confirm up to twelve memory controllers per socket with the next-gen processors, compared to the current limit of eight.
    OMG. Just thinking about the number of contacts on those packages is like... woah.

    I wonder if we'll ever see CPUs with them all enabled. The probability of all the controllers + pins + motherboard traces working perfectly seems low enough that maybe only 10 or 11 will be used in actual practice.

    That reminds me of another concern that recently came to mind. How common is it that a core or even an entire chiplet goes bad? I guess there's probably some kernel parameter you can use to disable some cores, if that happens. I know the kernel has the concept of some cores being offline...
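    The kernel does expose this at runtime through sysfs (and there are boot parameters like maxcpus= and nr_cpus= as well). A minimal sketch of that interface, assuming Linux with CPU hotplug enabled and root to actually write; core 3 is just an example:

```python
# Sketch (assumption: Linux with CPU hotplug enabled; writing needs root).
# Each core N has a sysfs file /sys/devices/system/cpu/cpuN/online that
# holds "1" (online) or "0" (offline).
from pathlib import Path

def online_path(cpu: int) -> Path:
    """Return the sysfs file that controls whether a core is online."""
    return Path(f"/sys/devices/system/cpu/cpu{cpu}/online")

def set_core_online(cpu: int, online: bool) -> None:
    """Bring a core online (True) or take it offline (False)."""
    online_path(cpu).write_text("1" if online else "0")

if __name__ == "__main__":
    # To actually retire a faulty core 3 you would run, as root:
    #   set_core_online(3, False)
    print("control file for core 3:", online_path(3))
```

    Reading the same file back tells you the current state; note that core 0 often cannot be offlined and has no online file on many systems.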

    Another random thought: how popular is NPS=2 or NPS=4, in current EPYC deployments? Because, at some point, it seems like the cost-efficiency of packing ever more cores per package is going to level off. And if most customers aren't even using all of them in the same memory domain, then what's the point?



    • #3
      Originally posted by coder View Post
      I wonder if we'll ever see CPUs with them all enabled. The probability of all the controllers + pins + motherboard traces working perfectly seems low enough that maybe only 10 or 11 will be used in actual practice.
      I don't see why not. EPYC has been 8-channel since the first generation (only the embedded versions are 2- and 4-channel) and it didn't really cause any problems. Having extra traces for ECC also helps.

      Originally posted by coder View Post
      That reminds me of another concern that recently came to mind. How common is it that a core or even an entire chiplet goes bad? I guess there's probably some kernel parameter you can use to disable some cores, if that happens. I know the kernel has the concept of some cores being offline...
      If this happens, the Power-On Self-Test won't complete successfully and the computer most likely won't boot at all.
      The options to disable cores have their uses: for example, you might disable cores that share cache so that one core in that particular cluster gets more cache to itself. That is why models like the EPYC 72F3 exist - it has 8 chiplets with one core enabled in each, to maximize frequency and per-core cache.

      Originally posted by coder View Post
      Another random thought: how popular is NPS=2 or NPS=4, in current EPYC deployments? Because, at some point, it seems like the cost-efficiency of packing ever more cores per package is going to level off. And if most customers aren't even using all of them in the same memory domain, then what's the point?
      NPS and its Intel equivalent (Cluster-on-Die, nowadays called Sub-NUMA Clustering) mostly make sense if your workload is sensitive to the problem they solve: contention from RAM accesses saturating the on-chip interconnect. They also make sense if your workload is sensitive to memory latency, since they keep the RAM controllers physically close to the cores using them. If a workload fits in a single NUMA node, it goes easier on the interconnect, which can translate into increased performance.

      There are other reasons like easier separation in virtualization environments, for example.

      NPS was IMO most useful with first-generation EPYC, which was composed of chiplets without a separate I/O die (the die that now contains the RAM and PCIe controllers). In that configuration the RAM controllers were on the chiplets themselves, while Infinity Fabric was not quite as fast as in later generations. With Rome and its I/O die, the access-cost differences between RAM controllers are not as high as previously. However, NPS is still able to split even the I/O die's RAM controllers and match them to the physically closest chiplets (it can go even further with the L3 as NUMA setting).

      All in all, it is something that has to be tuned for specific workloads. AMD (and Intel) provide a lot of documentation with examples on how to configure their CPUs.
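      To see what NPS actually did on a given machine, the resulting NUMA topology can be read straight out of sysfs. A sketch, assuming Linux (these are the standard /sys/devices/system/node files); with NPS=4 you'd see four nodes per socket, while on non-NUMA hardware you'd see just a single node 0:

```python
# Sketch: list the NUMA nodes the kernel exposes and which CPUs belong
# to each node, using the standard sysfs interface (Linux assumption).
from pathlib import Path

def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel cpulist like '0-3,8-11' into [0,1,2,3,8,9,10,11]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def numa_nodes() -> dict[int, list[int]]:
    """Map each NUMA node id to the CPUs it contains (empty off-Linux)."""
    nodes = {}
    for node_dir in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        nodes[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    return nodes

if __name__ == "__main__":
    for node, cpus in numa_nodes().items():
        print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```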



      • #4
        Originally posted by numacross View Post
        Having extra traces for ECC also helps
        Absolutely. However, if a motherboard trace or the connection at a DIMM or CPU pin is bad, then you permanently give up one bit of an ECC scheme that can only correct 1-bit errors and detect 2-bit errors. So, at that point, I think the rational thing to do would be to swap out the bad part, unless the DIMM or memory channel could be disabled entirely.
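        To make the "correct 1-bit, detect 2-bit" behaviour concrete, here is a toy SECDED code in Python: Hamming(7,4) plus an overall parity bit, protecting a 4-bit value. Real ECC DIMMs use a wider (72,64)-class code, but the principle is the same:

```python
# Toy SECDED (single-error-correct, double-error-detect) code:
# Hamming(7,4) plus one overall parity bit, so 4 data bits -> 8 code bits.

def encode(nibble: int) -> list[int]:
    """Encode a 4-bit value into 8 bits: Hamming(7,4) + overall parity."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]          # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]          # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]          # covers positions 4,5,6,7
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    overall = 0
    for b in code:
        overall ^= b                 # parity over the whole codeword
    return code + [overall]

def decode(bits: list[int]) -> tuple[int, str]:
    """Return (data, status); corrects 1-bit errors, detects 2-bit ones."""
    code = list(bits[:7])
    syndrome = 0
    for pos, b in enumerate(code, start=1):
        if b:
            syndrome ^= pos          # XOR of 1-indexed set positions
    parity_ok = sum(bits) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:              # odd number of flips -> fixable
        if syndrome:
            code[syndrome - 1] ^= 1  # flip the bad bit back
        status = "corrected"
    else:                            # even flips, bad syndrome -> 2 errors
        status = "double-bit error detected"
    data_bits = (code[2], code[4], code[5], code[6])
    return sum(b << i for i, b in enumerate(data_bits)), status
```

        A stuck trace behaves like the same bit arriving flipped in every word, which is exactly why it permanently consumes the single-bit correction capacity: every word already shows up with its one correctable error "used up".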

        Originally posted by numacross View Post
        If this happens, the Power-On Self-Test won't complete successfully and the computer most likely won't boot at all.
        With CPUs having so many cores, it'd be nice if you had the option simply to disable the faulty core(s). Once the CPU is out of warranty, most users would probably prefer to simply disable one or two cores on a 64-core or 96-core CPU, rather than having to replace the whole thing.

        Originally posted by numacross View Post
        The options to disable cores have their uses: for example, you might disable cores that share cache so that one core in that particular cluster gets more cache to itself. That is why models like the EPYC 72F3 exist - it has 8 chiplets with one core enabled in each, to maximize frequency and per-core cache.
        I thought that was just about clock speed. I either didn't know or forgot that it had full cache. That's pretty cool. ...or hot, I guess.

        Originally posted by numacross View Post
        (it can go even further with the L3 as NUMA setting).
        Does that reduce the cache coherency domain? Because that would be a pretty big win, if your broadcasts actually got cut down.

        Originally posted by numacross View Post
        All in all, it is something that has to be tuned for specific workloads. AMD (and Intel) provide a lot of documentation with examples on how to configure their CPUs.
        My point is that the more features like NPS get used, the less sense it makes for CPUs to have so many cores. I'm sure interconnect complexity and cache-coherency overhead scale at something like O(n log n) in the number of cores, if not worse. So, at some point, the overheads of having so many cores become significant relative to the cost savings of having fewer CPU packages.



        • #5
          Originally posted by coder View Post
          Absolutely. However, if a motherboard trace or the connection at a DIMM or CPU pin is bad, then you permanently give up one bit of an ECC scheme that can only correct 1-bit errors and detect 2-bit errors. So, at that point, I think the rational thing to do would be to swap out the bad part, unless the DIMM or memory channel could be disabled entirely.
          Yes, but having 12 channels is not that different from having 8. Motherboard PCBs are not even the most complex boards we're able to reliably produce. I think we'll be alright.

          Originally posted by coder View Post
          With CPUs having so many cores, it'd be nice if you had the option simply to disable the faulty core(s). Once the CPU is out of warranty, most users would probably prefer to simply disable one or two cores on a 64-core or 96-core CPU, rather than having to replace the whole thing.
          How often does that kind of failure happen? I guess we can't really know, but I'd guess not often.

          Originally posted by coder View Post
          Does that reduce the cache coherency domain? Because that would be a pretty big win, if your broadcasts actually got cut down.
          The L3 in an EPYC Rome (Zen 2) chiplet is split into two parts (so an 8-core chiplet with 32MB of L3 is actually 2x4c with 16MB in each CCX), and those are the separation points the L3 as NUMA setting uses.
          EPYC Milan uses Zen 3 chiplets, which do not have this split. I don't know how L3 as NUMA works on those, because I haven't played with them yet.
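          Those separation points are visible on a live system without touching any BIOS options, by grouping cores according to which L3 they report sharing. A sketch assuming Linux; treating cache index3 as the L3 holds on x86 but isn't guaranteed elsewhere:

```python
# Sketch: group CPUs by which L3 slice they share, from the kernel's
# cache-topology files. On a Zen 2 EPYC each group is one CCX; the
# L3 as NUMA setting exposes those same groups as NUMA nodes.
# Assumption: Linux, with cache "index3" being the L3 (true on x86).
from collections import defaultdict
from pathlib import Path

def l3_domains() -> dict[str, list[int]]:
    """Map each distinct shared_cpu_list string to the CPUs reporting it."""
    groups = defaultdict(list)
    for cpu_dir in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
        shared = cpu_dir / "cache" / "index3" / "shared_cpu_list"
        if shared.exists():
            groups[shared.read_text().strip()].append(int(cpu_dir.name[3:]))
    return dict(groups)

if __name__ == "__main__":
    for cpus_sharing, members in sorted(l3_domains().items()):
        print(f"L3 shared by CPUs {cpus_sharing}: {sorted(members)}")
```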

          Originally posted by coder View Post
          My point is that the more features like NPS get used, the less sense it makes for CPUs to have so many cores. I'm sure interconnect complexity and cache coherency overhead probably scale at something like O( n * log n) of the number of cores, if not worse. So, at some point, the overheads of having so many cores become significant, relative to the cost savings of having fewer CPU packages.
          This question would be better answered by someone with intimate knowledge of how the cache works on those CPUs, and that's not me.

          I'm pretty sure that having more cores in one package will always be better than having to use multiple packages. The on-package interconnect will outperform the inter-package one.
          For most use cases, simply throwing your workload at more cores will yield easy performance benefits. Actually understanding the hardware you're running on and tuning accordingly (with NPS and other knobs) will increase the gains even further, but for some the cost of doing so might not be worth it.
          On the other hand, it looks like AMD is pretty good (and maybe Intel will be too, with Ponte Vecchio) at hiding those complexities, so that software can be presented with the illusion of a uniform single processor without significant performance differences. EPYC Genoa is supposed to bring the chiplet count up, and I expect an improved, more performant Infinity Fabric to match the requirements of the extra chiplets and of DDR5.



          • #6
            Originally posted by numacross View Post
            Yes, but having 12 channels is not that different from having 8. Motherboard PCBs are not even the most complex boards we're able to reliably produce. I think we'll be alright.
            What are the most complex, then? And how much do they cost?

            Originally posted by numacross View Post
            I'm pretty sure that having more cores in one package will always be better than having to use multiple packages. The on-package interconnect will outperform the inter-package one.
            But, if you're running them in separate memory domains, then there's no cross-domain communication, right? So, at that point, they're just sharing a package, heatsink, socket, and VRM. The extra interconnect that exists to let the domains be joined is just overhead.

            Originally posted by numacross View Post
            For most use cases, simply throwing your workload at more cores will yield easy performance benefits.
            For a lot of server/cloud deployments, it's not just more throughput they want, but more performance per watt. If the overheads of scaling up cores-per-package get big enough, it might become more appealing to use smaller CPUs.



            • #7
              Originally posted by coder View Post
              What are the most complex, then? And how much do they cost?
              I'd say very specific gear like a high-end oscilloscope, for example this one with a launch price of $1,300,000. Most high-end RF gear requires wizardry in PCBs, with exotic materials, high layer counts, and non-silicon components.

              Originally posted by coder View Post
              But, if you're running them in separate memory domains, then there's no cross-domain communication, right? So, at that point, they're just sharing a package, heatsink, socket, and VRM. The extra interconnect that exists to let the domains be joined is just overhead.
              There is communication: even in a multi-CPU system, the entire memory is usually one address space. The cost of getting from one point in the system to another is complicated by interconnect bandwidth and latency.
              NUMA nodes can, but do not have to, limit memory to certain cores. NUMA-aware software can exploit this and keep the data it's working on close to the cores working on it, while the program as a whole can still span all RAM and cores. For other software you can use the operating system to force this behaviour.
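              As a concrete example of using the operating system to force it: on Linux you can pin a process to one node's CPUs, and under the default first-touch policy its memory will then mostly be allocated on that node too. Strict memory binding would need numactl or libnuma on top; this sketch only sets CPU affinity:

```python
# Sketch: restrict a process to a chosen set of CPUs (e.g. one NUMA
# node's cores). Linux-only; strict *memory* binding would need
# numactl/libnuma in addition to this.
import os

def pin_to_cpus(cpus: set[int]) -> set[int]:
    """Pin the calling process to `cpus`; return the previous affinity."""
    previous = os.sched_getaffinity(0)   # 0 = the current process
    os.sched_setaffinity(0, cpus)
    return previous

if __name__ == "__main__":
    # Pin to a single CPU we definitely have, then restore:
    target = {min(os.sched_getaffinity(0))}
    old = pin_to_cpus(target)
    print("now restricted to:", os.sched_getaffinity(0))
    pin_to_cpus(old)
```

              In practice you'd pass the CPU list of the desired NUMA node (e.g. as read from /sys/devices/system/node/nodeN/cpulist) instead of a hand-picked set.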

              Originally posted by coder View Post
              For a lot of server/cloud apps, they don't only want more throughput, they want more performance per Watt. If the overheads of scaling up the cores/package get big enough, it might become more appealing to use smaller CPUs.
              You will get less perf/watt from multi-socket systems compared to the same number of cores in one package. Sending a lot of data over vast (from the perspective of a CPU) distances takes power, and many more things have to be duplicated in a multi-socket system. There are also savings in space - with dense processors you can fit more cores per rack.



              • #8
                Originally posted by numacross View Post
                I'd say very specific gear like a high-end oscilloscope, for example this one with a launch price of $1,300,000. Most high-end RF gear requires wizardry in PCBs, with exotic materials, high layer counts, and non-silicon components.
                Cutting-edge test equipment is amazing to me. Like, how do you even debug and validate that stuff?

                As for the video... OMG, why is he not using an anti-static mat ??? ...with, like, double wrist straps and those little booties to cover your shoes?

                That gold plated (what I assume is) input section looks friggin' rad, tho. It's funny how there's that one little blurred bit, in the center. As if... you can show everything else, just not this one, tiny square.
                Last edited by coder; 10 December 2021, 01:02 AM.



                • #9
                  Originally posted by coder View Post
                  Cutting-edge test equipment is amazing to me. Like, how do you even debug and validate that stuff?
                  With even more cutting-edge equipment.

                  Originally posted by coder View Post
                  As for the video... OMG, why is he not using an anti-static mat ??? ...with, like, double wrist straps and those little booties to cover your shoes?
                  Because those boards were provided by the manufacturer for demonstration purposes, they are most likely non-functional.

