Am I using my GPU right with the AI (llama.cpp)?

My machine has an NVIDIA GeForce GTX 5060 Ti card, which normally runs my two monitors, using about 1846 MiB of the 4096 MiB installed.

I built the latest (as of yesterday) llama.cpp and tried using it. Some GGUF files would not load at all, with the software complaining about a lack of video RAM on the card, but another one did load, even though it warned that the full capacity of the model would not be utilized:

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 1 repeating layers to GPU
load_tensors: offloaded 1/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 23323.42 MiB
load_tensors:      Vulkan0 model buffer size =   563.16 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     2.00 MiB
llama_kv_cache:        CPU KV buffer size =   624.00 MiB
llama_kv_cache:    Vulkan0 KV buffer size =    16.00 MiB
llama_kv_cache: size =  640.00 MiB (  4096 cells,  40 layers,  4/1 seqs), K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:    Vulkan0 compute buffer size =   946.00 MiB
llama_context: Vulkan_Host compute buffer size =    18.01 MiB
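
For reference, I start the server with something along these lines (the model file name here is only a placeholder for the GGUF that did load, and I may be misremembering the exact flags; I also do not remember whether I passed an explicit --n-gpu-layers value):

./build/bin/llama-server -m ./models/model.gguf -c 4096 -np 4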

I notice that, when I ask a question, the GPU is used:

  • The CPU utilization of the llama-server process goes down in top;
  • The GPU utilization, as shown by the nvidia-smi utility, goes up (to 100%). The exact commands I watch are shown below.
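
These are roughly the commands I keep running in two other terminals while the server is working (the pgrep pattern is simply how I find the server's PID):

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
top -p "$(pgrep -d, -f llama-server)"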

But when it is generating the response, all four of my CPU cores become saturated, while the GPU sits idle. The tokens-per-second statistics printed by the program seem to confirm that. For example:

prompt eval time =   31469.38 ms /   225 tokens (  139.86 ms per token,     7.15 tokens per second)
       eval time =  148773.26 ms /   100 tokens ( 1487.73 ms per token,     0.67 tokens per second)
      total time =  180242.64 ms /   325 tokens
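
For what it's worth, if I am adding up the Vulkan numbers from the load log correctly, the card looks nearly full even with just one layer offloaded (the 1846 MiB figure is what the desktop alone already uses):

563.16 MiB (model layer) + 16.00 MiB (KV cache) + 946.00 MiB (compute buffer) ≈ 1525 MiB
1525 MiB + 1846 MiB (desktop) ≈ 3371 MiB of the 4096 MiB on the card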

Is this really about all I can squeeze out of the present hardware, or am I not using it correctly? In particular, is there any way to make it use the GPU for response generation too, not just for processing the prompt?