I have the same processes running on 2 (or more) machines, and on some of them we notice a massive slowdown in disk I/O. The machines have very similar hardware (I am happy to post dmidecode output if necessary).
We are having trouble even diagnosing the issue. So let me describe it the best I can.
Everything is running Ubuntu 16.04 (but we tested on 18.04 with the same results). The program we are running randomly picks 32 files (images) to read. These files are about 512 KB each. On the ‘fast’ machines we can read all 32 files in about 0.5 s, and on the ‘slow’ machines it takes about 2-5 seconds to read 32 files.
Some things we have figured out so far:
- Read-time output of `hdparm` on slow and fast machines is comparable.
- The slowdown is consistent across 5400 RPM disks, 7200 RPM disks, and SSDs.
- The slowdown happens on both RAID 5 and RAID 0.
- The slowdown seems to go away when we don’t shuffle the images (no random access).
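For anyone who wants to reproduce the shuffled versus non-shuffled comparison outside our program, here is a minimal sketch of the access pattern. It is not our actual code: the function names (`list_files`, `pick_batch`, `read_files`, `compare`) are made up for illustration, and it assumes the images sit under a single directory tree. Repeated runs will be served from the page cache, so the numbers only mean something on a cold cache (for example after writing 3 to `/proc/sys/vm/drop_caches` as root).

```python
import os
import random
import time


def list_files(root):
    """Return a sorted list of all regular files under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def pick_batch(all_paths, batch_size=32, shuffle=True, seed=None):
    """Pick batch_size paths: a random sample if shuffle, else the first ones in sorted order."""
    ordered = sorted(all_paths)
    count = min(batch_size, len(ordered))
    if shuffle:
        return random.Random(seed).sample(ordered, count)
    return ordered[:count]


def read_files(paths):
    """Read every file fully; return (elapsed_seconds, total_bytes)."""
    start = time.perf_counter()
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(f.read())
    return time.perf_counter() - start, total


def compare(root, batch_size=32, seed=None):
    """Time one sequential batch and one shuffled batch from the same directory tree."""
    paths = list_files(root)
    seq_time, _ = read_files(pick_batch(paths, batch_size, shuffle=False))
    rnd_time, _ = read_files(pick_batch(paths, batch_size, shuffle=True, seed=seed))
    return {"sequential_s": seq_time, "shuffled_s": rnd_time}
```

Calling `compare("/path/to/images")` (a placeholder path) on a fast and a slow machine should show whether the gap only appears in the shuffled case.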
Here is the output of atop during the tests:
Fast:
MDD | md1 | busy 0% | read 22474 | write 0 | KiB/w 0 | MBr/s 350.73 | MBw/s 0.00 | avio 0.00 ms |
DSK | sdh | busy 20% | read 7251 | write 0 | KiB/w 0 | MBr/s 86.58 | MBw/s 0.00 | avio 0.27 ms |
DSK | sdf | busy 20% | read 7375 | write 0 | KiB/w 0 | MBr/s 88.31 | MBw/s 0.00 | avio 0.27 ms |
DSK | sde | busy 19% | read 7399 | write 0 | KiB/w 0 | MBr/s 89.36 | MBw/s 0.00 | avio 0.26 ms |
DSK | sdg | busy 19% | read 7303 | write 0 | KiB/w 0 | MBr/s 86.50 | MBw/s 0.00 | avio 0.25 ms |
Slow:
MDD | md0 | busy 0% | read 4018 | write 0 | KiB/w 0 | MBr/s 140.46 | MBw/s 0.00 | avio 0.00 ms |
DSK | sdd | busy 100% | read 1181 | write 0 | KiB/w 0 | MBr/s 36.34 | MBw/s 0.00 | avio 8.45 ms |
DSK | sdc | busy 99% | read 1153 | write 0 | KiB/w 0 | MBr/s 35.29 | MBw/s 0.00 | avio 8.55 ms |
DSK | sde | busy 99% | read 1150 | write 0 | KiB/w 0 | MBr/s 34.84 | MBw/s 0.00 | avio 8.57 ms |
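If I am reading it correctly, the disks on the slow machine are pegged at ~100% busy while completing roughly 1/6 as many reads per disk (about 40% of the MBr/s), and avio is about 8.5 ms versus about 0.27 ms on the fast machine.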
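To make that comparison explicit, here is the arithmetic on the atop numbers above as a small sketch. The values are copied from the output; the variable names are just for illustration, and it assumes both samples cover the same atop interval length.

```python
# Values copied from the atop samples above (md device line plus per-disk avio).
fast = {"md_reads": 22474, "md_mbr_s": 350.73, "avio_ms": [0.27, 0.27, 0.26, 0.25]}
slow = {"md_reads": 4018, "md_mbr_s": 140.46, "avio_ms": [8.45, 8.55, 8.57]}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Slow machine relative to fast machine (assumes equal sampling intervals).
read_ratio = slow["md_reads"] / fast["md_reads"]
throughput_ratio = slow["md_mbr_s"] / fast["md_mbr_s"]
avio_ratio = mean(slow["avio_ms"]) / mean(fast["avio_ms"])

print(f"reads:      slow/fast = {read_ratio:.2f}")
print(f"throughput: slow/fast = {throughput_ratio:.2f}")
print(f"avio:       slow/fast = {avio_ratio:.1f}x")
```

On the array as a whole the slow machine completes about 0.18 times as many reads at about 0.40 times the throughput, while each request takes over 30 times longer on average.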
We’ve tried different I/O schedulers, which did not change anything.
I am looking for help even diagnosing the issue. I am happy to help improve the question with any extra details I can. We are at a loss in even figuring out what the issue is, let alone solving it…