Why does peer-to-peer (P2P) access between two Tesla K40c GPUs fail in CUDA?

I would like to run a CUDA C program on two Tesla K40 devices with peer-to-peer (P2P) access enabled between them, since my data will be shared across the devices. My computer gives the following deviceQuery summary and nvidia-smi output (OS: Windows 10).

deviceQuery:

Device 0: "Tesla K40c"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM):         TCC
Device PCI Domain ID / Bus ID / location ID:   0 / 21 / 0

Device 1: "Tesla K40c"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM):         TCC
Device PCI Domain ID / Bus ID / location ID:   0 / 45 / 0

Device 2: "Quadro P400"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM):         WDDM
Device PCI Domain ID / Bus ID / location ID:   0 / 153 / 0

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.22       Driver Version: 441.22       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          TCC  | 00000000:15:00.0 Off |                    0 |
| 23%   36C    P8    24W / 235W |    809MiB / 11448MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          TCC  | 00000000:2D:00.0 Off |                  Off |
| 23%   43C    P8    24W / 235W |    809MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro P400        WDDM  | 00000000:99:00.0  On |                  N/A |
| 34%   35C    P8    N/A /  N/A |    449MiB /  2048MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+
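For reference, the driver mode (TCC/WDDM) and PCI IDs shown in the deviceQuery listing above can be confirmed directly from the CUDA runtime API. This is only a minimal sketch of that query, not part of my actual program:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        printf("Device %d: \"%s\"\n", dev, prop.name);
        // tccDriver is 1 when the device runs in TCC mode (Windows only)
        printf("  Driver mode (TCC or WDDM):              %s\n",
               prop.tccDriver ? "TCC" : "WDDM");
        printf("  PCI Domain ID / Bus ID / location ID:   %d / %d / %d\n",
               prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}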

I make the Quadro P400 invisible to my CUDA program via set CUDA_VISIBLE_DEVICES=0,1 and run the simpleP2P sample. The sample completes successfully, but the results point to a P2P problem: the peer memcpy bandwidth is only about 0.2 GB/s, even though the two Tesla devices sit in PCIe 3.0 x16 slots attached to CPU0:

Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K40c (GPU0) -> Tesla K40c (GPU1) : Yes
> Peer access from Tesla K40c (GPU1) -> Tesla K40c (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.19GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
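
To be clear about what the 0.19 GB/s figure measures: it is the averaged cudaMemcpyPeer bandwidth the sample reports. A stripped-down sketch of that kind of measurement, using the 64 MB buffer size from the sample but an iteration count I chose myself, and with error checking omitted, looks like this (not the sample's exact code):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;   // 64 MB buffers, as in the sample
    const int    reps  = 100;                // iteration count is my own choice

    float *d0 = NULL, *d1 = NULL;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // let GPU0 access GPU1 memory
    cudaMalloc((void **)&d0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);        // let GPU1 access GPU0 memory
    cudaMalloc((void **)&d1, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i) {
        // alternate the copy direction between the two devices
        if (i % 2)
            cudaMemcpyPeer(d0, 0, d1, 1, bytes);
        else
            cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpyPeer bandwidth: %.2f GB/s\n",
           ((double)bytes * reps) / (ms / 1000.0) / 1e9);
    return 0;
}

I build such small tests with nvcc and run them with CUDA_VISIBLE_DEVICES=0,1 set, as above.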

The simpleP2P sample also fails when I modify the code slightly to examine the P2P performance more closely (please see this question for the programming details if you find them relevant). The tests I have done, and the comments from a few experts on my post here, suggest that the problem is a system/platform issue. My motherboard is an HP 81C7 with BIOS version v02.47 (up to date as of April 11, 2020). I have also reinstalled the NVIDIA driver and CUDA several times and have tried CUDA 10.1 as well, with no luck. Can someone shed some light on how I can dig into this further and find the source of the problem?
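
In case it helps with the diagnosis, one extra check I can run (my own addition, not part of simpleP2P) is to ask the driver how it rates the link between the two K40c cards via cudaDeviceGetP2PAttribute:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int access01 = 0, access10 = 0, rank = 0, atomics = 0;

    // Same capability check simpleP2P performs before enabling peer access
    cudaDeviceCanAccessPeer(&access01, 0, 1);
    cudaDeviceCanAccessPeer(&access10, 1, 0);

    // Additional attributes describing how the driver rates the 0<->1 link
    cudaDeviceGetP2PAttribute(&rank,    cudaDevP2PAttrPerformanceRank,       0, 1);
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, 0, 1);

    printf("Peer access 0->1: %d, 1->0: %d\n", access01, access10);
    printf("Performance rank (0->1): %d, native atomics supported: %d\n",
           rank, atomics);
    return 0;
}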