My machine has an NVIDIA GeForce GTX 5060 Ti card, which normally drives my two monitors using about 1846 MiB of the 4096 MiB of video RAM installed.
I built the latest llama.cpp (as of yesterday) and tried using it. Some GGUF files wouldn't load at all, with the software complaining about a lack of video RAM on the card, but another one loaded, although it warned that it could not use the model's full capabilities:
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 1 repeating layers to GPU
load_tensors: offloaded 1/41 layers to GPU
load_tensors: CPU_Mapped model buffer size = 23323.42 MiB
load_tensors: Vulkan0 model buffer size = 563.16 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 2.00 MiB
llama_kv_cache: CPU KV buffer size = 624.00 MiB
llama_kv_cache: Vulkan0 KV buffer size = 16.00 MiB
llama_kv_cache: size = 640.00 MiB ( 4096 cells, 40 layers, 4/1 seqs), K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: Vulkan0 compute buffer size = 946.00 MiB
llama_context: Vulkan_Host compute buffer size = 18.01 MiB
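For reference, I start the server with something along these lines (the model path is shortened, and I may be misremembering minor options); as far as I understand, the -ngl (--n-gpu-layers) option is what determines how many layers get offloaded:

./llama-server -m ./models/model.gguf -c 4096 -np 4 -ngl 1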
I notice that, when I ask a question, the GPU is used:
- The CPU utilization of the llama-server process, as shown by top, goes down;
- The GPU utilization, as shown by the nvidia-smi utility, goes up (to 100%).
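The monitoring itself is nothing fancy; roughly:

watch -n 1 nvidia-smi   # GPU utilization and VRAM use
top                     # per-process CPU utilization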
But when it is generating the response, all four of my CPU cores become saturated, while the GPU goes unused. The tokens-per-second statistics output by the program seem to confirm this. For example:
prompt eval time = 31469.38 ms / 225 tokens ( 139.86 ms per token, 7.15 tokens per second)
eval time = 148773.26 ms / 100 tokens ( 1487.73 ms per token, 0.67 tokens per second)
total time = 180242.64 ms / 325 tokens
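(Sanity-checking those figures: 225 tokens / 31.47 s ≈ 7.15 tokens per second for prompt evaluation, versus 100 tokens / 148.77 s ≈ 0.67 tokens per second for generation, so generating is roughly ten times slower per token than prompt processing.)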
Is this really about all I can squeeze out of the present hardware, or am I not using it correctly? In particular, is there any way to make it use the GPU for generating the responses too, not just for processing the prompts?