I’m only considering the total inference time.
I’m using a PyTorch model with CUDA for a computer vision task, and I’ve observed two different scenarios:
Multiple models in parallel:
In this setup, n models are launched simultaneously, each processing a batch of 1 image.
The GPU is not heavily utilized, and each image takes approximately 25 ms to infer.
Single model with a batch of n images:
Here, one model processes a batch of n images.
The inference time per image is lower, around 17 ms, but the total time for the batch becomes about 17 × n ms, which is significantly longer than the roughly 25 ms of the parallel setup.
This becomes problematic in real-time applications.
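To make the gap concrete, here is a small arithmetic sketch of the two scenarios. The 25 ms and 17 ms figures are the ones I measured; n = 8 is an arbitrary illustrative value, and the parallel case assumes the n models overlap perfectly on the GPU, which is an idealization.

```python
# Per-image figures from the two scenarios described above.
PARALLEL_MS_PER_IMAGE = 25.0  # n models, each with a batch of 1, run concurrently
BATCHED_MS_PER_IMAGE = 17.0   # one model, one batch of n images


def parallel_latency_ms(n: int) -> float:
    """Time until all n results are ready, assuming perfect overlap (idealized)."""
    return PARALLEL_MS_PER_IMAGE if n > 0 else 0.0


def batched_latency_ms(n: int) -> float:
    """Time until the whole batch is done when cost grows linearly with n."""
    return BATCHED_MS_PER_IMAGE * n


n = 8  # illustrative value, not a measured configuration
print(f"parallel: {parallel_latency_ms(n):.0f} ms")  # parallel: 25 ms
print(f"batched:  {batched_latency_ms(n):.0f} ms")   # batched:  136 ms
```

With these numbers the batched setup does less work per image but returns its results more than five times later, which is what matters for a real-time deadline.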
My question is:
Could this be due to a configuration issue?
And why doesn’t the batch of n images take roughly the same time as a single image, especially when the GPU isn’t saturated?
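In case the measurement itself is part of the problem: CUDA kernels are launched asynchronously, so timings taken without `torch.cuda.synchronize()` can be misleading. Below is a minimal sketch of how the batched case could be timed across batch sizes. The tiny network, the 64×64 input size, and the helper names `make_model` and `time_batch_ms` are hypothetical stand-ins, not my actual model or code, and the script falls back to the CPU if CUDA is unavailable.

```python
import time

import torch
import torch.nn as nn


def make_model() -> nn.Module:
    # Hypothetical stand-in network; swap in the real model here.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 10),
    )


def time_batch_ms(model: nn.Module, batch: torch.Tensor,
                  warmup: int = 3, iters: int = 10) -> float:
    """Average wall-clock milliseconds per forward pass over the whole batch."""
    on_gpu = batch.is_cuda
    with torch.no_grad():
        # Warm-up passes so one-time setup costs are not counted.
        for _ in range(warmup):
            model(batch)
        if on_gpu:
            torch.cuda.synchronize()  # wait for pending kernels before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        if on_gpu:
            torch.cuda.synchronize()  # wait for the timed kernels to finish
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = make_model().to(device).eval()

timings = {}
for n in (1, 2, 4, 8):
    batch = torch.randn(n, 3, 64, 64, device=device)
    timings[n] = time_batch_ms(model, batch)
    print(f"batch={n}: {timings[n]:.2f} ms total, {timings[n] / n:.2f} ms per image")
```

If the total time grows roughly linearly with n even with synchronization in place, the measurement is probably sound and the scaling comes from the model or the GPU rather than from how the timer is used.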