Problems with ASRock TRX40D8-2N2T, AMD 3970X, and various RAM; “mce: [Hardware Error]: Machine check”

I put together a setup a while back using

  • Motherboard: ASRock TRX40D8-2N2T
  • CPU: AMD 3970X Ryzen Threadripper
  • Cooler: Dynatron A38
  • Memory: OWC 256GB (8x32GB) DDR4 3200MHz PC4-25600 CL22 2RX8 ECC UDIMM 1.2V
  • Drives: 2TB NVMe Samsung 970 EVO Plus (OS RAID1, Storage RAID6)
  • Case: In-Win IW-R200-02N-CR550 with 550W redundant PSUs

I then got sidetracked on other projects at work, but when I came back to this setup I noticed it was having memory issues, and the motherboard appeared to be the cause. I’ve since tried swapping out the MB, CPU, and RAM, and I even have a second setup running for testing now. I’ve tried 2 brand-new and 2 used CPUs; brand-new OWC, Crucial, and Kingston RAM, all 3200MHz ECC UDIMM; and 3 replacement motherboards (used ones, as I couldn’t find anyone with new ones in stock). Even with all of this, I always end up getting some sort of mce hardware error.

I’m not sure whether I’m using an incorrect memory setting that could be causing the issue, but I’d love any guidance on how to check, as it seems like something must be amiss. When I run memtest86+ for 1-2 full passes, it goes through all tests without errors, and when I run the mprime stress test for around 12 hrs it is fine. But after the computer simply stays booted for 3-5 days, the mce hardware error eventually shows up in dmesg. I really don’t want to run anything important on a machine I’m worried has hardware errors, even if the errors I’m finding say they’re being corrected (I know ECC will occasionally report correctable errors, but every few days seems far too often).

Here are a few examples of the error after the computer was running for a few days (OWC RAM, computer 1).

[24908.528062] mce: [Hardware Error]: Machine check events logged
[24908.528103] [Hardware Error]: Corrected error, no action required.
[24908.528262] [Hardware Error]: CPU:2 (17:31:0) MC17_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
[24908.528398] [Hardware Error]: Error Addr: 0x00000005a969a5c0
[24908.528490] [Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0x731320000a800501
[24908.528609] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[24908.528686] EDAC MC0: 1 CE on mc#0csrow#1channel#4 (csrow:1 channel:4 page:0x16e5a69 offset:0x4c0 grain:64 syndrome:0x2000)
[24908.528806] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[31462.065935] mce: [Hardware Error]: Machine check events logged
[31462.065956] [Hardware Error]: Corrected error, no action required.
[31462.066032] [Hardware Error]: CPU:1 (17:31:0) MC18_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
[31462.066097] [Hardware Error]: Error Addr: 0x0000000970b158c0
[31462.066147] [Hardware Error]: IPID: 0x0000009600350f00, Syndrome: 0x003a20000a800300
[31462.066200] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[31462.066236] EDAC MC0: 1 CE on mc#0csrow#0channel#3 (csrow:0 channel:3 page:0x2602c56 offset:0x3c0 grain:64 syndrome:0x2000)
[31462.066295] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[368640.949776] mce: [Hardware Error]: Machine check events logged
[368640.949786] [Hardware Error]: Corrected error, no action required.
[368640.949818] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
[368640.949854] [Hardware Error]: Error Addr: 0x0000000ae7e095c0
[368640.949872] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0x397400400a800e01
[368640.949894] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[368640.949917] EDAC MC0: 1 CE on mc#0csrow#1channel#5 (csrow:1 channel:5 page:0x2bdf825 offset:0x5c0 grain:64 syndrome:0x40)
[368640.950058] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
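
To make it easier to see whether the corrected errors cluster on one DIMM slot, here is a minimal Python sketch that tallies the EDAC lines by memory controller, channel, and csrow. The regex is only fitted to the line format shown above, and the function name `tally_corrected_errors` is my own; treat it as a rough helper rather than a robust dmesg parser.

```python
import re
from collections import Counter

# Matches the kernel's "EDAC MCn: N CE on ..." line in the format seen above.
# This pattern is an assumption based on these logs, not a general EDAC grammar.
EDAC_CE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+EDAC MC(?P<mc>\d+): (?P<count>\d+) CE on .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
)

def tally_corrected_errors(log_text):
    """Return (Counter keyed by (mc, channel, csrow), list of uptime timestamps)."""
    per_slot = Counter()
    timestamps = []
    for line in log_text.splitlines():
        m = EDAC_CE.search(line)
        if m is None:
            continue  # not an EDAC corrected-error line
        key = (int(m["mc"]), int(m["channel"]), int(m["csrow"]))
        per_slot[key] += int(m["count"])
        timestamps.append(float(m["ts"]))
    return per_slot, timestamps

# The three EDAC lines from computer 1, copied from the log above.
SAMPLE = """\
[24908.528686] EDAC MC0: 1 CE on mc#0csrow#1channel#4 (csrow:1 channel:4 page:0x16e5a69 offset:0x4c0 grain:64 syndrome:0x2000)
[31462.066236] EDAC MC0: 1 CE on mc#0csrow#0channel#3 (csrow:0 channel:3 page:0x2602c56 offset:0x3c0 grain:64 syndrome:0x2000)
[368640.949917] EDAC MC0: 1 CE on mc#0csrow#1channel#5 (csrow:1 channel:5 page:0x2bdf825 offset:0x5c0 grain:64 syndrome:0x40)
"""

if __name__ == "__main__":
    per_slot, timestamps = tally_corrected_errors(SAMPLE)
    for (mc, channel, csrow), n in sorted(per_slot.items()):
        print(f"mc{mc} channel {channel} csrow {csrow}: {n} CE")
    span_days = (max(timestamps) - min(timestamps)) / 86400
    print(f"{len(timestamps)} events over {span_days:.1f} days of uptime")
```

On the three events from computer 1, each one lands on a different channel (3, 4, and 5), so at least in this small sample the errors do not pile up on a single slot. Piping the full `dmesg` output through the same function would give a larger sample.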

And here are similar errors from a totally separate machine, but again with the same MB/CPU/RAM combination (OWC RAM, computer 2).

[579337.206932] mce: [Hardware Error]: Machine check events logged
[579337.206942] [Hardware Error]: Corrected error, no action required.
[579337.206983] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
[579337.207036] [Hardware Error]: Error Addr: 0x0000000ad35a5140
[579337.207056] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xc21802000a800802
[579337.207085] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[579337.207106] EDAC MC0: 1 CE on mc#0csrow#2channel#5 (csrow:2 channel:5 page:0x2b8d694 offset:0x540 grain:64 syndrome:0x200)
[579337.207133] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
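
The two status values in these logs (0x9c2040000000011b and 0x9c2041000000011b) differ by a single bit. As a cross-check on the kernel's own decoding, here is a small Python sketch that names the flag bits that are set. The bit positions are taken from my reading of the Linux kernel's `arch/x86/include/asm/mce.h`, so please treat them as an assumption and verify them against AMD's documentation for this CPU family; the low 16 bits (the error code) are not decoded here.

```python
# Flag bit positions for MCx_STATUS. Assumed from the Linux kernel's mce.h
# definitions (MCI_STATUS_*); verify against AMD's docs before relying on them.
STATUS_BITS = [
    (63, "Val"),
    (62, "Over"),
    (61, "UC"),
    (60, "En"),
    (59, "MiscV"),
    (58, "AddrV"),
    (57, "PCC"),
    (55, "TCC"),
    (53, "SyndV"),
    (46, "CECC"),
    (45, "UECC"),
    (44, "Deferred"),
    (43, "Poison"),
    (40, "Scrub"),
]

def decode_status(status):
    """Return the names of the flag bits set in an MCx_STATUS value."""
    return [name for bit, name in STATUS_BITS if (status >> bit) & 1]

if __name__ == "__main__":
    for value in (0x9C2040000000011B, 0x9C2041000000011B):
        print(f"{value:#018x}: {'|'.join(decode_status(value))}")
```

Under those assumptions, both values decode to a valid, enabled, corrected ECC error with address and syndrome present, and the second value additionally has Scrub set. That agrees with the flags dmesg prints, where the later events show `Scrub` in the last position.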