
How to debug my server's hard drive I/O issue

I have a small headless server on my LAN running Ubuntu 20.04. Recently I had an issue where one of my external drives was intermittently going into read-only mode and sometimes failing to mount at all after a restart. This being an older mechanical drive, my assumption was that it was on the way out. I then bought a new 4 TB Crucial MX500 internal drive, installed it, and began rsyncing the data across.
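
For reference, this is roughly how I've been checking drive health so far (a sketch; /dev/sdb is just where the new drive happens to enumerate on my machine):

# overall SMART verdict, plus the full attribute/error dump (smartmontools)
sudo smartctl -H /dev/sdb
sudo smartctl -a /dev/sdb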

Around 60 GB into the process, the new internal drive began to exhibit I/O errors. Here are a few snippets from the syslog:

Nov 15 20:46:33 hm80 kernel: [ 1138.328403] ata3.00: failed command: WRITE FPDMA QUEUED
Nov 15 20:46:33 hm80 kernel: [ 1138.328409] ata3.00: cmd 61/e8:00:08:0f:04/08:00:52:01:00/40 tag 0 ncq dma 1167360 out
Nov 15 20:46:33 hm80 kernel: [ 1138.328409]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Nov 15 20:47:42 hm80 kernel: [ 1206.798072] sd 2:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=98s
Nov 15 20:47:42 hm80 kernel: [ 1206.798084] sd 2:0:0:0: [sdb] tag#18 CDB: Write(16) 8a 00 00 00 00 00 4c 03 d9 00 00 00 09 c8 00 00
Nov 15 20:47:42 hm80 kernel: [ 1206.798090] I/O error, dev sdb, sector 1275320576 op 0x1:(WRITE) flags 0x4000 phys_seg 44 prio class 2
Nov 15 20:47:42 hm80 kernel: [ 1206.798105] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 10 writing to inode 179055013 starting block 159415072)
Nov 15 20:47:42 hm80 kernel: [ 1206.798123] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 10 writing to inode 179055014 starting block 159415080)

Nov 15 20:47:42 hm80 kernel: [ 1206.798250] sd 2:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=98s
Nov 15 20:47:42 hm80 kernel: [ 1206.798252] Buffer I/O error on device sdb1, logical block 159414816
Nov 15 20:47:42 hm80 kernel: [ 1206.798257] sd 2:0:0:0: [sdb] tag#19 CDB: Write(16) 8a 00 00 00 00 00 4c 03 d2 d0 00 00 06 20 00 00
Nov 15 20:47:42 hm80 kernel: [ 1206.798264] I/O error, dev sdb, sector 1275318992 op 0x1:(WRITE) flags 0x0 phys_seg 29 prio class 2
Nov 15 20:47:42 hm80 kernel: [ 1206.798286] Buffer I/O error on device sdb1, logical block 159414817
Nov 15 20:47:42 hm80 kernel: [ 1206.798296] Buffer I/O error on device sdb1, logical block 159414818
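
In case it's useful, this is how I've been pulling these messages back out of the logs (a sketch; the ata3/sdb identifiers are just what this drive shows up as here):

# kernel messages from the current boot, filtered to this disk and its link
journalctl -k -b | grep -Ei 'ata3|sdb'

# the negotiated SATA link speed, to see whether the link ever renegotiated down
sudo dmesg | grep -i 'SATA link'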

After a reboot all seemed to be OK, so I resumed the sync, and everything went fine. The same drive also hosts a handful of backups synced from other servers; that job runs overnight. Today, after 12 hours of rsyncing, the machine had become nearly unresponsive and had only managed to transfer approximately 400 GB. The rsync output showed it was once again writing very slowly. Then some really odd things started happening: I could no longer SSH into the server, and basic commands like ls would hang.
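
Next time the slowdown happens I plan to watch the drive with extended iostat from another session (a sketch, assuming the sysstat package; sdb is the new drive):

# extended per-device stats every 5 seconds: watch for await and %util
# spiking while write throughput collapses
iostat -x sdb 5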

I opened the computer and checked the cable connections. On reboot, the machine actually dropped to the GRUB prompt, which I have never seen it do before. It is now running the same rsync routine, and once again everything seems fine.
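
While the current run is in progress I'm keeping a live tail on the kernel log from a second session (a sketch):

# follow new kernel messages as they arrive, with human-readable timestamps
sudo dmesg -wT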

Since all of this follows on from another (external) drive showing similar symptoms, I'm considering the following possibilities:

  • Purely by coincidence, I got a bad drive. Return it and try again.
  • Something is wrong with the drive controller or the computer itself (which may also explain the original issues with the first external drive).
  • A bad SATA cable or connection (I've sketched how I'd test each of these possibilities below).
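
For what it's worth, here is the rough test plan I've put together for those possibilities (a sketch; the device name and the decision to run a destructive badblocks pass are my own assumptions):

# 1. SMART extended self-test on the new drive (slow on a 4 TB device)
sudo smartctl -t long /dev/sdb
sudo smartctl -l selftest /dev/sdb    # check the result once it finishes

# 2. Re-check attribute 199 (UDMA_CRC_Error_Count) after the next failure;
#    if it climbs, that usually implicates the cable or connector
sudo smartctl -A /dev/sdb

# 3. Swap the SATA cable and port, then re-run the same rsync job

# 4. If the drive is going back anyway, a destructive write test
#    (THIS WIPES ALL DATA) to try to force the failure out into the open
sudo badblocks -wsv /dev/sdb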

How can I debug an issue like this when it is intermittent? Is there anything to be gleaned from those syslog entries that could point to where, specifically, things are going awry?