AMD Continues With MCE/SMCA Linux Driver Changes Ahead Of Zen 4 CPUs

    Phoronix: AMD Continues With MCE/SMCA Linux Driver Changes Ahead Of Zen 4 CPUs

    This year AMD engineers working on hardware enablement for Linux have been busy with EDAC driver improvements like RDDR5 and LRDDR5 handling, AMD Scalable Machine Check Architecture (SMCA) additions for "future" CPUs, and the various other areas outside of the error detection and correction field. Today though is a new patch series back in that hardware error handling space with new SMCA code...

  • #2
    Is there any reason certain registers are still either undocumented or listed as "reserved"?
    For example https://developer.amd.com/resources/...uides-manuals/
    19H has no 'Open-Source Register Reference'.
    And in the 17H reference PDF, certain MCE registers are listed as "reserved", e.g. bank 17 and bank 18.

    When an MCE error is triggered on a standard Linux kernel, there are often no details at all on AMD hardware.
    rasdaemon seems to be the solution to this problem, but when a register is not documented it essentially just spits out useless nonsense.
    I can't even be sure whether the CPU reported by rasdaemon is logical or physical.

    For example:
    159 2022-04-18 10:47:32 -0700 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0x13a086560, misc=0xd01a000301000000, walltime=0x625da433, cpuid=0x00a20f12, bank=0x00000011

    It seems to say "Unified Memory Controller" only because that is the entry in the array just before the reserved/missing one (bank=17). I suspect "CPU 2" is equally unreliable, because if I disable CPU 2 (the physical core and its HT partner) the error still says CPU 2.
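
    The status value in the rasdaemon line above can be cross-checked by hand, independent of whatever bank name rasdaemon prints. Below is a minimal Python sketch. The bit positions are an assumption taken from the MCI_STATUS_* definitions in the Linux kernel sources, not from an AMD register reference, and `decode_status` is a hypothetical helper name, not part of rasdaemon or the kernel.

```python
# Flag bits of an MCA status register.
# ASSUMPTION: positions follow the Linux kernel's MCI_STATUS_* macros
# (generic x86 bits plus the AMD-specific ones); verify against the
# processor reference for your CPU family before relying on them.
STATUS_FLAGS = {
    63: "Val",       # register contains a valid error
    62: "Overflow",  # an earlier error was overwritten or dropped
    61: "UC",        # uncorrected error
    60: "En",        # error reporting enabled for this bank
    59: "MiscV",     # MISC register holds valid data
    58: "AddrV",     # ADDR register holds valid data
    57: "PCC",       # processor context corrupt
    53: "SyndV",     # syndrome register valid (AMD)
    46: "CECC",      # corrected ECC error (AMD)
    45: "UECC",      # uncorrected ECC error (AMD)
    44: "Deferred",  # deferred error (AMD)
    43: "Poison",    # poisoned data consumed (AMD)
}


def decode_status(status):
    """Return the names of the flags set in `status`, highest bit first."""
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


if __name__ == "__main__":
    # Status value copied from the rasdaemon line quoted in this post.
    print(decode_status(0xDC2040000000011B))
```

    Under those assumptions, the flags for 0xdc2040000000011b include Overflow and CECC, which agrees with the "Error_overflow CECC" text in the rasdaemon output. That part of the record at least looks trustworthy even if the bank name and CPU number are not.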

    The current state of MCE for Ryzen just isn't very useful these days.
    Hopefully this improves going forward.



    • #3
      Further investigation shows "CPU 2" is indeed wrong.

      dmesg:
      [Sat Apr 23 17:27:18 2022] mce: [Hardware Error]: Machine check events logged
      [Sat Apr 23 17:27:18 2022] [Hardware Error]: Corrected error, no action required.
      [Sat Apr 23 17:27:18 2022] [Hardware Error]: CPU:0 (19:21:2) MC27_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x982000000002080b
      [Sat Apr 23 17:27:18 2022] [Hardware Error]: IPID: 0x0001002e00000500, Syndrome: 0x000000005a020005
      [Sat Apr 23 17:27:18 2022] [Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.
      [Sat Apr 23 17:27:18 2022] [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
      rasdaemon:
      228 2022-04-23 17:27:17 -0700 error: Corrected error, no action required., CPU 2, bank Power, Interrupts, etc. (bank=27), mcg mcgstatus=0, mcgcap=0x0000011c, status=0x982000000002080b, misc=0xd01a000400000000, walltime=0x62649966, cpuid=0x00a20f12, bank=0x0000001b
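
      The dmesg fields above can be reproduced from the raw STATUS and IPID values, which helps when comparing what the kernel and rasdaemon each report for the same event. This is a rough Python sketch. The field layout (extended error code in STATUS bits 21:16, HWID in IPID bits 43:32, McaType in IPID bits 63:48, instance ID in the low 32 bits) is an assumption based on how the Linux kernel's AMD MCE code extracts them, and `decode_smca` is a hypothetical helper name.

```python
def decode_smca(status, ipid):
    """Split SMCA STATUS/IPID values into the fields dmesg reports.

    ASSUMPTION: bit ranges mirror the extraction done in the Linux
    kernel's AMD MCE decoding code; they are not taken from an AMD
    register reference.
    """
    return {
        "error_code": status & 0xFFFF,            # STATUS bits 15:0
        "ext_error_code": (status >> 16) & 0x3F,  # STATUS bits 21:16
        "instance_id": ipid & 0xFFFFFFFF,         # IPID bits 31:0
        "hwid": (ipid >> 32) & 0xFFF,             # IPID bits 43:32
        "mca_type": (ipid >> 48) & 0xFFFF,        # IPID bits 63:48
    }


if __name__ == "__main__":
    # Values copied from the dmesg output quoted in this post.
    fields = decode_smca(0x982000000002080B, 0x0001002E00000500)
    for name, value in fields.items():
        print(f"{name}: {value:#x}")
```

      With these bit ranges the extended error code comes out as 2, matching the "Ext. Error Code: 2" line in dmesg. The hardware ID and MCA type are what the kernel uses to pick the bank name, so they are worth comparing against whatever table rasdaemon uses for bank 27.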
