Windows cluster + generic service + lost quorum -> which timeout? [migrated]

I’m having a problem with my Windows cluster, which consists of two nodes and one generic service. Sometimes, when the node hosting the active service loses communication with both the other node and the witness, the cluster service on that node properly waits for the service to shut down. The cleanup can take up to 60 seconds, but that’s fine; it always waits. At the same time the second node, still connected to the witness, keeps quorum and brings the service up. That’s also fine for me, even though two instances run simultaneously for a while. What matters most is that the service always shuts down gracefully.

Unfortunately, sometimes the node on which the service is shutting down force-kills it (always after 30 seconds) before the graceful shutdown finishes, and I really don’t want that.

I have tested this in my DEV environment and observed that when I move the service too many times, one of the nodes enters the quarantine state. If I then move the service to the quarantined node and disable Ethernet on that node, the service starts its graceful shutdown but is killed after 30 seconds, so the issue is reproducible. On my PROD environment, however, the same thing happened while the node was NOT quarantined. Could you tell me:

  1. Is this related to clussvc.exe being shut down?
  2. What other process takes responsibility for killing the service if clussvc.exe isn’t present?
  3. Where is the 30-second timeout configured, after which my service is forcibly killed?
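For completeness, this is how the quarantine state from my DEV repro can be inspected and cleared (the node name is a placeholder; assumes the FailoverClusters PowerShell module):

```powershell
# Quarantine tuning lives on the cluster object (Windows Server 2016+):
Get-Cluster | Format-List QuarantineThreshold, QuarantineDuration

# A quarantined node can be released manually without waiting:
Start-ClusterNode -Name "NODE2" -ClearQuarantine
```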

I’ve already checked:

  • PendingTimeout : 60000 (so it’s not PendingTimeout)
  • DeadlockTimeout : 90000 (so it’s not DeadlockTimeout)

I’ve also added a registry property (ServicesPipeTimeout) and set it to 40000 (milliseconds), but my service is still killed after 30 seconds.
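In case it matters to anyone checking the same value: ServicesPipeTimeout must be a REG_DWORD and is only read by the Service Control Manager at boot, so a reboot is required for a change to take effect. The way I set it was roughly:

```powershell
# ServicesPipeTimeout must be a REG_DWORD; the Service Control Manager
# only reads it at boot, so a reboot is needed afterwards.
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control' `
    -Name 'ServicesPipeTimeout' -PropertyType DWord -Value 40000 -Force
```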

More details: the service consists of an nssm wrapper and a WildFly application server (so it’s Java).

Another problem: sometimes WildFly does not want to close gracefully. In that scenario nssm is supposed to force-kill the Java process (after a 90 s timeout). If the two scenarios appear simultaneously (i.e. WildFly does not want to close AND the service is killed after 30 s), then only nssm is killed and the Java process becomes a zombie that stays bound to some TCP ports, so the next time the service is failed over it cannot start. (This can, however, be worked around by killing any process that holds the ports the service wants to bind to.)
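The workaround mentioned above (killing whatever still holds the ports before starting the service) can be sketched like this; the port numbers are placeholders for whatever your WildFly instance binds:

```powershell
# Hypothetical port list for the WildFly instance; adjust to your config.
$ports = 8080, 9990

foreach ($port in $ports) {
    # Find listeners on the port and force-kill the owning processes.
    Get-NetTCPConnection -LocalPort $port -State Listen -ErrorAction SilentlyContinue |
        Select-Object -ExpandProperty OwningProcess -Unique |
        ForEach-Object { Stop-Process -Id $_ -Force }
}
```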

Below are some of the properties:

Get-ItemProperty registry::HKLM\SYSTEM\CurrentControlSet\Control ServicesPipeTimeout


ServicesPipeTimeout : 40000
Get-ClusterResource "..." | fl *


Characteristics         : 0
Cluster                 : ...
DeadlockTimeout         : 90000
Description             :
Id                      : 91512ada-c76b-4b3e-84cf-b017daa3de0d
IsCoreResource          : False
EmbeddedFailureAction   : 2
IsAlivePollInterval     : 4294967295
IsNetworkClassResource  : False
IsStorageClassResource  : False
LastOperationStatusCode : 0
LooksAlivePollInterval  : 4294967295
MaintenanceMode         : False
MonitorProcessId        : 3184
Name                    : ...
OwnerGroup              : ...
OwnerNode               : ...
PendingTimeout          : 60000
PersistentState         : 1
ResourceSpecificData1   : 0
ResourceSpecificData2   : 0
ResourceSpecificStatus  :
ResourceType            : Generic Service
RestartAction           : 2
RestartDelay            : 500
RestartPeriod           : 600000
RestartThreshold        : 1
RetryPeriodOnFailure    : 600000
SeparateMonitor         : False
State                   : Online
StatusInformation       : 0
Get-ClusterGroup ... | fl *


AntiAffinityClassNames : {}
AutoFailbackType       : 0
ColdStartSetting       : 0
Cluster                : ...
DefaultOwner           : 4294967295
Description            :
GroupType              : GenericService
FailoverPeriod         : 1
FailoverThreshold      : 150
FailbackWindowEnd      : 4294967295
FailbackWindowStart    : 4294967295
FaultDomain            : 0
IsCoreGroup            : False
Name                   : ...
OwnerNode              : ...
PersistentState        : 1
PlacementOptions       : 0
PreferredSite          : {}
Priority               : 2000
ResiliencyPeriod       : 0
State                  : Online
StatusInformation      : 0
UpdateDomain           : 0
Id                     : df900af6-5e04-49c2-8235-e9241f63e398
Get-ClusterGroup "cluster group" | fl *


AntiAffinityClassNames : {}
AutoFailbackType       : 0
ColdStartSetting       : 0
Cluster                : ...
DefaultOwner           : 4294967295
Description            :
GroupType              : Cluster
FailoverPeriod         : 6
FailoverThreshold      : 4294967295
FailbackWindowEnd      : 4294967295
FailbackWindowStart    : 4294967295
FaultDomain            : 0
IsCoreGroup            : True
Name                   : cluster group
OwnerNode              : ...
PersistentState        : 1
PlacementOptions       : 0
PreferredSite          : {}
Priority               : 13000
ResiliencyPeriod       : 0
State                  : Online
StatusInformation      : 0
UpdateDomain           : 0
Id                     : 3128b123-3322-4c3a-ba08-49508402194d

The aforementioned cluster is the Failover Clustering feature of Windows Server; in my case it’s Windows Server 2019 Standard. A cluster can be created with Windows’ Failover Cluster Manager (and also with PowerShell). There is a single instance of the Windows service (nssm + WildFly) on each node, and the cluster takes care that exactly one instance is running at a time (i.e. active-passive mode). These WildFly instances are NOT connected to each other as a WildFly cluster, i.e. the servers do not see each other; they only know when to start and stop.
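For reference, a PowerShell sketch of how such a setup can be created. All names, the IP address, and the file-share witness path are placeholders, and the witness type in my actual setup may differ:

```powershell
# Hypothetical names; assumes the FailoverClusters module is installed.
New-Cluster -Name "MYCLUSTER" -Node "NODE1", "NODE2" -StaticAddress "10.0.0.10"

# Configure a witness for quorum (a file-share witness here, as an example):
Set-ClusterQuorum -FileShareWitness "\\fileserver\witness"

# Register the existing Windows service (the nssm wrapper) as a
# generic service role managed by the cluster:
Add-ClusterGenericServiceRole -ServiceName "MyWildflyService" -Name "MyServiceGroup"
```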

It’s also not WaitToKillServiceTimeout, because Get-ItemProperty registry::HKLM\SYSTEM\CurrentControlSet\Control WaitToKillServiceTimeout returns WaitToKillServiceTimeout : 5000.