
Is my SSD failing, and should I try to warranty return it?

In May 2021, I purchased a 2.5″ SATA SSD for my HP EliteBook 840 G3 laptop. The SSD is a “Samsung 870 EVO 500GB SATA 2.5″ Internal Solid State Drive (SSD)”, part code “MZ-77E500B/EU”.

The laptop had Windows installed on an M.2 SSD (fitted when I bought the machine, though not its original drive), and I set up Gentoo Linux as a dual boot on the Samsung SATA SSD. Around January 2023 the M.2 SSD failed (no sectors on it seem to be readable), so the machine has been Linux-only since then.

In July 2023 I found that some of the files in my /home partition were no longer readable, even though they had read fine when they were created over a year earlier. The kernel errors logged when trying to read these files are below.

Kernel Error Messages

[Jul23 21:53] ata1.00: exception Emask 0x0 SAct 0x10 SErr 0x0 action 0x0
[  +0.000002] ata1.00: irq_stat 0x40000008
[  +0.000002] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000003] ata1.00: cmd 60/08:20:b0:d8:43/00:00:1b:00:00/40 tag 4 ncq dma 4096 in
                       res 41/40:08:b0:d8:43/00:00:1b:00:00/00 Emask 0x409 (media error) <F>
[  +0.000001] ata1.00: status: { DRDY ERR }
[  +0.000001] ata1.00: error: { UNC }
[  +0.000795] ata1.00: supports DRM functions and may not be fully accessible
[  +0.002737] ata1.00: supports DRM functions and may not be fully accessible
[  +0.002391] ata1.00: configured for UDMA/133
[  +0.000043] scsi_io_completion_action: 3 callbacks suppressed
[  +0.000021] sd 0:0:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[  +0.000015] sd 0:0:0:0: [sda] tag#4 Sense Key : Medium Error [current]
[  +0.000011] sd 0:0:0:0: [sda] tag#4 Add. Sense: Unrecovered read error - auto reallocate failed
[  +0.000013] sd 0:0:0:0: [sda] tag#4 CDB: Read(10) 28 00 1b 43 d8 b0 00 00 08 00
[  +0.000006] print_req_error: 3 callbacks suppressed
[  +0.000011] blk_update_request: I/O error, dev sda, sector 457431216 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  +0.000061] ata1: EH complete
[  +0.000064] ata1.00: Enabling discard_zeroes_data
[  +0.203593] ata1.00: exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x0
[  +0.000023] ata1.00: irq_stat 0x40000008
[  +0.000002] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000003] ata1.00: cmd 60/08:28:b0:d8:43/00:00:1b:00:00/40 tag 5 ncq dma 4096 in
                       res 41/40:08:b0:d8:43/00:00:1b:00:00/00 Emask 0x409 (media error) <F>
[  +0.000001] ata1.00: status: { DRDY ERR }
[  +0.000001] ata1.00: error: { UNC }
[  +0.000814] ata1.00: supports DRM functions and may not be fully accessible
[  +0.003289] ata1.00: supports DRM functions and may not be fully accessible
[  +0.002271] ata1.00: configured for UDMA/133
[  +0.000059] sd 0:0:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[  +0.000015] sd 0:0:0:0: [sda] tag#5 Sense Key : Medium Error [current]
[  +0.000011] sd 0:0:0:0: [sda] tag#5 Add. Sense: Unrecovered read error - auto reallocate failed
[  +0.000012] sd 0:0:0:0: [sda] tag#5 CDB: Read(10) 28 00 1b 43 d8 b0 00 00 08 00
[  +0.000014] blk_update_request: I/O error, dev sda, sector 457431216 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.000018] Buffer I/O error on dev dm-3, logical block 8350230, async page read
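
The numbers in the log above can be cross-checked against each other. The following is a minimal sketch, not part of the original diagnosis: it decodes the READ(10) CDB from the log and compares it with the sector reported by blk_update_request. The 512-byte sector size comes from the smartctl output; the 4096-byte filesystem block size is an assumption (it is consistent with the "dma 4096" in the log), and the derived offset of dm-3 on sda is an inference from these two log lines only.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (start_lba, sector_count) from a SCSI READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in sectors
    return lba, count


SECTOR_SIZE = 512      # from smartctl: "Sector Size: 512 bytes logical/physical"
FS_BLOCK_SIZE = 4096   # assumption: ext4 block size on /home

# CDB copied from the kernel log above.
lba, count = decode_read10("28 00 1b 43 d8 b0 00 00 08 00")
print(f"failed read starts at sector {lba}, length {count * SECTOR_SIZE} bytes")

# "Buffer I/O error on dev dm-3, logical block 8350230": convert that block
# number to sectors and subtract it to see where dm-3 would begin on sda.
fs_block = 8350230
sectors_per_block = FS_BLOCK_SIZE // SECTOR_SIZE
lv_offset_sectors = lba - fs_block * sectors_per_block
print(f"implied start of dm-3 on sda: sector {lv_offset_sectors}")
```

The decoded address equals the sector 457431216 that the kernel reports, and logical block 8350230 is also the first block e2fsck lists for the damaged file, so the kernel log and the filesystem check appear to point at the same location.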

On checking the /home file system from a bootable USB stick (so nothing mounted), things are not happy:

livecd # fsck.ext4 -fck /dev/vg0/home
e2fsck 1.46.2 (28-Feb-2021) 
Checking for bad blocks (read-only test):
99.88% done, 7:37 elapsed. (75/0/0 errors)
done
home: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
                    
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 673706: 8350230 8350237--8350238 8350246 8350254 8350289 8350292 8350297 8350300 8350305 8350308 8350313 8350316 8350422 8350430 8350438 8350446 8350481 8350489 8350497 8350505 8350614 8350622 8350630 8350638 8350673 8350676 8350681 8350684 8350689 8350697 8350806 8350814 8350865 8350873 8350881 8350889 8350998 8351006 8351014 8351022 8351057 8351065 8351068 8351073 8351076 8351081 8351190 8351198 8351249 8351257 8351273 8351382 8351398 8351441 8351449 8351457 8351465 8351574 8351582 8351633 8351641 8351657
Multiply-claimed block(s) in inode 1188624: 4828842--4828843
Multiply-claimed block(s) in inode 3015126: 16730711 16730719 16730727 16730903 16730919 16730927 16731095 16731103 16731303 16731311
Multiply-claimed block(s) in inode 3015662: 13523212 13523220 13523228 13523236 13523412 13523604
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 4 inodes containing multiply-claimed blocks.)
  
File /ra/Documents/folkus/folkus/durham-jail2.aiff (inode #673706, mod time Mon May  6 15:42:48 2019)
  has 63 multiply-claimed block(s), shared with 1 file(s):
    <The bad blocks inode> (inode #1, mod time Mon Jul 24 19:35:22 2023)
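
As a sanity check on the e2fsck output above, the block lists can be counted mechanically. This is only a sketch: it assumes that "a--b" in the output denotes an inclusive range of blocks, and that the filesystem uses 4096-byte blocks (not stated in the output). The helper name count_blocks is made up for illustration.

```python
def count_blocks(block_list: str) -> int:
    """Count blocks in an e2fsck-style list where 'a--b' is an inclusive range."""
    total = 0
    for token in block_list.split():
        if "--" in token:
            first, last = (int(part) for part in token.split("--"))
            total += last - first + 1
        else:
            int(token)  # raises ValueError if the token is not a block number
            total += 1
    return total


# Block lists copied from the "Multiply-claimed block(s)" lines, keyed by inode.
claimed = {
    673706: (
        "8350230 8350237--8350238 8350246 8350254 8350289 8350292 8350297 "
        "8350300 8350305 8350308 8350313 8350316 8350422 8350430 8350438 "
        "8350446 8350481 8350489 8350497 8350505 8350614 8350622 8350630 "
        "8350638 8350673 8350676 8350681 8350684 8350689 8350697 8350806 "
        "8350814 8350865 8350873 8350881 8350889 8350998 8351006 8351014 "
        "8351022 8351057 8351065 8351068 8351073 8351076 8351081 8351190 "
        "8351198 8351249 8351257 8351273 8351382 8351398 8351441 8351449 "
        "8351457 8351465 8351574 8351582 8351633 8351641 8351657"
    ),
    1188624: "4828842--4828843",
    3015126: (
        "16730711 16730719 16730727 16730903 16730919 16730927 "
        "16731095 16731103 16731303 16731311"
    ),
    3015662: "13523212 13523220 13523228 13523236 13523412 13523604",
}

FS_BLOCK_SIZE = 4096  # assumption: ext4 block size on /home

counts = {inode: count_blocks(blocks) for inode, blocks in claimed.items()}
for inode, n in counts.items():
    print(f"inode {inode}: {n} block(s), {n * FS_BLOCK_SIZE // 1024} KiB")
```

For inode 673706 this gives 63 blocks, which agrees with the "63 multiply-claimed block(s)" that e2fsck reports for durham-jail2.aiff.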

The SMART data for the drive says that I have 13 reallocated sectors (this number seems to be constant over the last month and not increasing).

Full smartctl output

$ sudo smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.38-gentooamd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 500GB
Serial Number:    S62BNZ0R429272T
LU WWN Device Id: 5 002538 fc1409fcd
Firmware Version: SVT01B6Q
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Aug 27 17:37:06 2023 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  85) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   097   097   010    Pre-fail  Always       -       13
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1871
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       1193
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       6
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   097   097   010    Pre-fail  Always       -       13
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   097   097   010    Pre-fail  Always       -       13
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       682
190 Airflow_Temperature_Cel 0x0032   072   054   000    Old_age   Always       -       28
195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always       -       682
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       50
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       6080656104
SMART Error Log Version: 1
ATA Error Count: 682 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 682 occurred at disk power-on lifetime: 1852 hours (77 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 70 78 ed 42 40  Error: UNC at LBA = 0x0042ed78 = 4386168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
60 08 70 78 ed 42 40 0e      03:52:56.807  READ FPDMA QUEUED
60 08 68 70 ed 42 40 0d      03:52:56.807  READ FPDMA QUEUED
60 08 60 68 ed 42 40 0c      03:52:56.807  READ FPDMA QUEUED
60 08 58 60 ed 42 40 0b      03:52:56.807  READ FPDMA QUEUED
60 08 50 58 ed 42 40 0a      03:52:56.807  READ FPDMA QUEUED
Error 681 occurred at disk power-on lifetime: 1852 hours (77 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 38 38 ed 42 40  Error: UNC at LBA = 0x0042ed38 = 4386104
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
60 08 38 38 ed 42 40 07      03:52:56.602  READ FPDMA QUEUED
60 08 30 30 ed 42 40 06      03:52:56.602  READ FPDMA QUEUED
60 08 28 28 ed 42 40 05      03:52:56.602  READ FPDMA QUEUED
60 08 20 20 ed 42 40 04      03:52:56.602  READ FPDMA QUEUED
60 08 18 18 ed 42 40 03      03:52:56.602  READ FPDMA QUEUED
Error 680 occurred at disk power-on lifetime: 1852 hours (77 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 98 00 ec 42 40  Error: UNC at LBA = 0x0042ec00 = 4385792
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
60 00 98 00 ec 42 40 13      03:52:56.276  READ FPDMA QUEUED
60 00 90 00 ea 42 40 12      03:52:56.276  READ FPDMA QUEUED
60 00 88 00 e8 42 40 11      03:52:56.276  READ FPDMA QUEUED
60 08 80 f8 e7 42 40 10      03:52:56.276  READ FPDMA QUEUED
60 08 78 f0 e7 42 40 0f      03:52:56.276  READ FPDMA QUEUED
Error 679 occurred at disk power-on lifetime: 1852 hours (77 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 48 f8 e6 42 40  Error: UNC at LBA = 0x0042e6f8 = 4384504
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
60 08 48 f8 e6 42 40 09      03:52:55.872  READ FPDMA QUEUED
60 08 40 f0 e6 42 40 08      03:52:55.872  READ FPDMA QUEUED
60 08 38 e8 e6 42 40 07      03:52:55.872  READ FPDMA QUEUED
60 08 30 e0 e6 42 40 06      03:52:55.872  READ FPDMA QUEUED
60 08 20 d8 e6 42 40 04      03:52:55.872  READ FPDMA QUEUED
Error 678 occurred at disk power-on lifetime: 1852 hours (77 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 b8 e6 42 40  Error: UNC at LBA = 0x0042e6b8 = 4384440
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
60 08 08 b8 e6 42 40 01      03:52:55.603  READ FPDMA QUEUED
60 08 00 b0 e6 42 40 00      03:52:55.603  READ FPDMA QUEUED
60 08 f0 a8 e6 42 40 1e      03:52:55.603  READ FPDMA QUEUED
60 08 e8 a0 e6 42 40 1d      03:52:55.603  READ FPDMA QUEUED
60 08 e0 98 e6 42 40 1c      03:52:55.603  READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1845         -
# 2  Offline             Completed without error       00%      1343         -
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
1        0        0  Not_testing
2        0        0  Not_testing
3        0        0  Not_testing
4        0        0  Not_testing
5        0        0  Not_testing
256        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

For now, I have let fsck delete the offending files and re-checked the filesystems, and things seem to be back to normal. However, I can’t do this level of checking on the NTFS partition or on the unallocated space in the Linux LVM physical volume.

According to SMART (the Total_LBAs_Written attribute), I have only written about 3 TB to the drive, so I should not be running into wear issues.
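
The arithmetic behind that figure can be sketched as follows. It assumes that the raw value of attribute 241 counts 512-byte logical sectors, which matches the sector size smartctl reports but is not stated explicitly in the output.

```python
SECTOR_SIZE = 512                      # from smartctl: 512 bytes logical/physical
TOTAL_LBAS_WRITTEN = 6_080_656_104     # raw value of attribute 241
USER_CAPACITY = 500_107_862_016        # bytes, from "User Capacity"

bytes_written = TOTAL_LBAS_WRITTEN * SECTOR_SIZE
tb_written = bytes_written / 1e12      # decimal terabytes
tib_written = bytes_written / 2**40    # binary tebibytes
drive_fills = bytes_written / USER_CAPACITY

print(f"{bytes_written} bytes = {tb_written:.2f} TB = {tib_written:.2f} TiB")
print(f"equivalent to about {drive_fills:.1f} full-drive writes")
```

Under that assumption the drive has taken roughly 3.1 TB of host writes, or a little over six times its own capacity.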

Question

I’m worried this will go wrong again and cause further data loss. This time the affected files either did not matter (cache) or could be restored from backups elsewhere. Can I return the drive under warranty for a replacement, or should I just buy a new one?