I’m currently working on a system that stores files on ext4 partitions on large disks (10 TiB and 16 TiB). All disks are connected via SATA to an HBA (Dell PERC H310).
The file structure looks like this: /mnt/diskN/nXXX/store/some_id/twoletters/random_name, and the files vary in size from a few KiB to a few MiB.
As you can imagine, a 500 GiB /mnt/diskN/nXXX folder holds a total of about 1.5 million files spread across subfolders, and each /mnt/diskN/nXXX/store/some_id/twoletters/ folder can contain hundreds of small files.
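To give a sense of the scale, this is a minimal sketch of how those counts can be checked; the /mnt/disk1/n001 path below is only an illustrative placeholder for one of the /mnt/diskN/nXXX folders:

```bash
# Count all files under one nXXX tree (placeholder path).
find /mnt/disk1/n001/store -type f | wc -l

# Sample a few some_id/twoletters leaf directories and count the files in each.
find /mnt/disk1/n001/store -mindepth 2 -maxdepth 2 -type d | head -n 5 |
while read -r d; do
    printf '%s: %s files\n' "$d" "$(find "$d" -maxdepth 1 -type f | wc -l)"
done
```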
My problem is that I’m seeing very different directory-traversal performance between disks that have the same specs on paper, and in some cases are even the same model.
To measure it, I run du on each /mnt/diskN/ and see roughly a 100x difference between disks with about the same used space.
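A minimal sketch of that measurement, assuming GNU time is installed, that the mounts all live under /mnt/disk*, and that caches are dropped as root so every run starts cold:

```bash
# Time a cold-cache directory walk on each disk (run as root).
for d in /mnt/disk*/; do
    sync
    echo 3 > /proc/sys/vm/drop_caches      # flush page cache, dentries and inodes
    echo "== $d =="
    /usr/bin/time -f 'elapsed: %E' du -sh "$d"
done
```

Dropping caches before each run matters, because otherwise repeated walks hit the dentry/inode cache and the numbers are not comparable.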
I have checked a few things so far (see the commands sketched after this list):
- ext4 has dir_index enabled in all cases.
- All disks are mounted with defaults,noatime.
- I ran e2fsck -fyvD on all disks to optimize directories.
- Ext4 reserved blocks amount to ~50 GiB of space, to allow defragmentation.
- Non-contiguous files are ~3.7%.
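These checks map onto commands roughly like the following; /dev/sdX1 and /mnt/disk2 are placeholders for the actual device and mount point, and findmnt is assumed to be available:

```bash
# Confirm dir_index (and large_dir) appear in the feature list, and check the hash settings.
tune2fs -l /dev/sdX1 | grep -i -e 'features' -e 'hash'

# Confirm noatime is actually in effect on the mounted filesystem.
findmnt -no OPTIONS /mnt/disk2

# Re-index/optimize directories; must be run on an unmounted filesystem.
# With -v, e2fsck also prints the non-contiguous file percentage in its summary.
umount /mnt/disk2
e2fsck -fyvD /dev/sdX1
```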
Example of the tune2fs -l output for one of the disks:
tune2fs 1.46.5 (30-Dec-2021)
Filesystem volume name: disk2
Last mounted on: /app/storage
Filesystem UUID: edc25fd5-48c9-48cc-86f9-115fe9fe1d2b
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index fast_commit filetype needs_recovery extent 64bit flex_bg large_dir sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 305201152
Block count: 2441609203
Reserved block count: 13107200
Overhead clusters: 19533367
Free blocks: 1584334746
Free inodes: 294994216
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Reserved GDT blocks: 883
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 4096
Inode blocks per group: 256
Flex block group size: 16
Filesystem created: Sun Mar 27 16:55:23 2022
Last mount time: Tue Jul 26 20:30:38 2022
Last write time: Tue Jul 26 20:31:37 2022
Mount count: 2
Maximum mount count: -1
Last checked: Tue Jul 26 20:13:48 2022
Check interval: 0 (<none>)
Lifetime writes: 9 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
First orphan inode: 17367446
Default directory hash: half_md4
Directory Hash Seed: c98a7272-eddd-4bd8-aac4-dfe714d4cc48
Journal backup: inode blocks
Checksum type: crc32c
Checksum: 0xfd641daa
The only difference I can see between the two disks is that one of them has ~9 TB of lifetime writes while the other has only ~2 TB, according to tune2fs -l (the currently used space is about the same on both).
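A quick way to pull the same numbers from tune2fs -l on each device; the device names below are placeholders:

```bash
# Compare lifetime writes across the ext4 devices (placeholder device names).
for dev in /dev/sdb1 /dev/sdc1; do
    printf '%s\t' "$dev"
    tune2fs -l "$dev" | grep -i 'lifetime writes'
done

# Used space is roughly the same on both mounts.
df -h /mnt/disk*
```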
Any ideas on how to improve the situation?
Is there any ext4 feature I can disable to improve performance for this use case?