Linux 6.9 Adding AMD MI300 Row Retirement Support For Problematic HBM Memory

Written by Michael Larabel in AMD on 18 February 2024 at 08:47 AM EST.
For the upcoming Linux 6.9 kernel cycle there are a number of AMD Instinct MI300 additions to the EDAC (Error Detection And Correction) and RAS (Reliability, Availability and Serviceability) drivers.

This work includes adapting the AMD EDAC driver to use the AMD Address Translation Library (ATL), adding MI300 support to the ATL, other MI300 RAS additions, and a new feature for MI300 hardware: row retirement support.

The patch adding MI300 row retirement support to the amd64_edac driver describes it as a way of handling defective or error-prone high bandwidth memory (HBM) on the MI300:
"AMD MI300 systems have on-die High Bandwidth Memory. This memory has a relatively higher error rate, and it is not individually replaceable like DIMMs.

Uncorrectable ECC errors are individually reported as Deferred errors using the AMD Deferred error interrupt. Each reported error corresponds to a single hardware error.

Correctable ECC errors get reported in batches through MCA Thresholding. Users can configure the threshold limit based on their policy. Each reported correctable error represents a single occurrence of the threshold limit being reached.

The current guidance from AMD designers is that memory affected by ECC errors within a DRAM row should be retired. Action should be taken on every reported ECC error.

Add a helper function to apply this policy for MI300 systems.

This and similar functionality can also be best handled in a separate, generic module. In the meantime, do this in AMD64 EDAC for simplicity."
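
The policy described in that commit message can be illustrated with a small toy model. This is only a sketch and not the kernel's implementation: the `RowRetirer` class, the `ROW_SIZE` and `PAGE_SIZE` values, and the threshold default are all hypothetical, and it assumes a DRAM row maps to one contiguous address range, which is a simplification since real hardware needs address translation to find the memory belonging to a row.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumption: 4 KiB pages
ROW_SIZE = 8192   # hypothetical row size in bytes, chosen only for illustration


@dataclass
class RowRetirer:
    """Toy model of 'retire the whole DRAM row on every reported ECC error'."""
    threshold: int = 4  # hypothetical user-configured correctable-error limit
    ce_count: int = 0
    retired_pages: set = field(default_factory=set)

    def retire_row(self, addr):
        # Retire every page in the (assumed contiguous) row containing addr.
        row_start = addr - (addr % ROW_SIZE)
        for offset in range(0, ROW_SIZE, PAGE_SIZE):
            self.retired_pages.add((row_start + offset) // PAGE_SIZE)

    def on_uncorrectable(self, addr):
        # Each uncorrectable error is reported individually, so act at once.
        self.retire_row(addr)

    def on_correctable(self, addr):
        # Correctable errors are only reported once the threshold is reached.
        self.ce_count += 1
        if self.ce_count < self.threshold:
            return False
        self.ce_count = 0
        self.retire_row(addr)
        return True
```

In this model a single uncorrectable error retires its row immediately, while correctable errors trigger retirement only on the error that reaches the configured threshold, mirroring the two reporting paths the commit message describes.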

A code comment within that row retirement support patch reaffirms the intent to retire all memory within the affected DRAM row when an error occurs:
"When a DRAM ECC error occurs on MI300 systems, it is recommended to retire all memory within that DRAM row. This applies to the memory with a DRAM bank."

This latest AMD MI300 work is slated for Linux 6.9 now that the patches are part of the "edac-for-next" branch of RAS.git.