

Heartbeat, failover and quorum in a Windows or Linux cluster

Evidian SafeKit

What are the different scenarios in case of network isolation in a cluster?

A single network

When there is a network isolation, the default behavior is:

  • as heartbeats are lost for each node, each node goes to ALONE and runs the application with its virtual IP address (double execution of the application, each instance modifying its own local data),
  • when the isolation is repaired, one ALONE node is forced to stop and to resynchronize its data from the other node,
  • at the end the cluster is PRIM-SECOND (or SECOND-PRIM, depending on the duplicate virtual IP address detection made by Windows), as sketched below.
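A minimal Python sketch of this default behavior, for illustration only (the node names and the simplified state model are assumptions, not SafeKit code):

```python
# Illustrative sketch of the default behavior on network isolation
# (single network, no split-brain checker). Not SafeKit code.
from enum import Enum

class State(Enum):
    PRIM = "PRIM"      # primary: runs the application
    SECOND = "SECOND"  # secondary: receives replicated data
    ALONE = "ALONE"    # runs the application without a reachable peer

def on_isolation(states):
    """All heartbeats lost: each node assumes the peer is down and goes ALONE."""
    return {node: State.ALONE for node in states}

def on_repair(states, former_prim):
    """Isolation repaired: one ALONE node is forced to stop, resynchronizes
    its data from the other node and becomes SECOND."""
    return {node: State.PRIM if node == former_prim else State.SECOND
            for node in states}

states = {"node1": State.PRIM, "node2": State.SECOND}
states = on_isolation(states)        # both ALONE: double execution of the application
states = on_repair(states, "node1")  # node2 stops, resynchronizes, becomes SECOND
print(states)
```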

Two networks with a dedicated replication network

When there is a network isolation, the behavior with a dedicated replication network is:

  • a dedicated replication network is implemented on a private network,
  • heartbeats on the production network are lost (isolated network),
  • heartbeats on the replication network are working (not isolated network),
  • the cluster stays in PRIM/SECOND state.

A single network and a split-brain checker

When there is a network isolation, the behavior with a split-brain checker is:

  • a split-brain checker has been configured with the IP address of a witness (typically a router),
  • the split-brain checker operates when a server goes from PRIM to ALONE or from SECOND to ALONE,
  • in case of network isolation, before going to ALONE, both nodes test the IP address,
  • the node which can access the IP address goes to ALONE, the other one goes to WAIT,
  • when the isolation is repaired, the WAIT node resynchronizes its data and becomes SECOND.

Note: If the witness is down or disconnected, both nodes go to WAIT and the application is no longer running. That's why you must choose a robust witness such as a router.

How do heartbeats and failover work in a Windows or Linux cluster?

What is a heartbeat?

The basic mechanism for synchronizing two servers and detecting server failures is the heartbeat, which is a monitoring data flow on a network shared by a pair of servers.

The SafeKit software supports as many heartbeats as there are networks shared by two servers. 

The heartbeat mechanism is used to implement Windows and Linux clusters. It is integrated within the SafeKit mirror cluster with real-time file replication and failover.

SafeKit heartbeats

In normal operation, the two servers exchange their states (PRIM, SECOND, the resource states) through the heartbeat channels and synchronize their application start and stop procedures.

In particular, in case of a scheduled failover, the stop script which stops the application is first executed on the primary server, before executing the start script on the secondary server. Thus, replicated data on the secondary server are in a safe state corresponding to a clean stop of the application.
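The ordering can be pictured with a short sketch (hypothetical helper and script names, not the SafeKit API; the remote execution over ssh is only a placeholder):

```python
# Illustrative ordering of a scheduled failover: the stop script runs on the
# current primary first, so the replicated data correspond to a clean stop of
# the application before the start script runs on the secondary.
import subprocess

def run_script(host, script):
    """Placeholder: execute a module script on a node, here simply over ssh."""
    subprocess.run(["ssh", host, script], check=True)

def scheduled_failover(primary, secondary):
    run_script(primary, "./stop_application.sh")     # 1. clean stop on the primary
    # ...the last replicated writes reach the secondary here...
    run_script(secondary, "./start_application.sh")  # 2. start on the new primary

# scheduled_failover("server1", "server2")
```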

Loss of all heartbeats

When all heartbeats are lost on one server, this server considers the other server to be down and transitions to the ALONE state.

If it is the SECOND server that goes to the ALONE state, an application failover takes place: the application is restarted on the secondary server.

Although not mandatory, it is better to have two heartbeat channels on two different networks between the two servers, in order to distinguish a network failure from a server failure.
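A minimal sketch of this detection logic (illustrative Python, not SafeKit internals; the channel names and the 30-second timeout are example values):

```python
# Peer failure detection over several heartbeat channels: the peer is declared
# down only when heartbeats are lost on ALL channels.
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before a channel is considered lost

# timestamp of the last heartbeat received on each network
last_seen = {"production-network": time.time(), "replication-network": time.time()}

def on_heartbeat(channel):
    """Called whenever a heartbeat message arrives on a channel."""
    last_seen[channel] = time.time()

def peer_is_down():
    """True only if every heartbeat channel has timed out."""
    now = time.time()
    return all(now - t > HEARTBEAT_TIMEOUT for t in last_seen.values())

# If only the production network is isolated, the replication channel still
# carries heartbeats, peer_is_down() stays False and the cluster remains
# PRIM/SECOND. If all channels time out, the node goes to ALONE.
```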

Split brain problem and quorum when servers are in two remote computer rooms


Remote computer rooms

A high availability cluster securing a critical application can be implemented with two servers in two geographically remote computer rooms.

Thus, the solution survives the disaster of an entire room.

Split brain

In case of a network isolation between the two computer rooms, all heartbeats are lost and the split brain problem arises.

Both servers start the critical application.

Complexity of solutions

Most often, to solve split brain, quorum is implemented with a third quorum server or a special quorum disk, in order to avoid two masters.

Unfortunately these new quorum devices add cost and complexity to the overall clustering architecture.

Simple cluster quorum with the SafeKit split brain checker

SafeKit split brain checker

With the SafeKit high availability software, the quorum within a Windows or Linux cluster requires no third quorum server and no quorum disk. A simple split brain checker is sufficient to avoid the double execution of an application.

On the loss of all heartbeats between servers, the split brain checker selects only one server to become the primary. The other server goes into the WAIT state until it receives the other server's heartbeats again. It then goes back to secondary after having resynchronized the replicated data from the primary server.

How does the split brain checker work?

The primary server election is based on the ping of an IP address, called the witness. The witness is typically a router that is always available. In case of network isolation, only the server with access to the witness becomes primary (ALONE); the other goes to WAIT.

The witness is not tested permanently but only when all heartbeats are lost. If the witness is down at that time, the cluster goes into the WAIT-WAIT state and an administrator can choose to restart one of the servers as primary through the SafeKit web console.
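A minimal sketch of this election (illustrative Python, not the SafeKit implementation; the witness address is an arbitrary example and the ping flags shown are the Linux ones):

```python
# Split-brain checker logic, evaluated only when all heartbeats are lost.
import subprocess

WITNESS_IP = "192.168.1.1"  # example witness, typically an always-on router

def witness_reachable(ip, timeout_s=2):
    """Single ping to the witness (Linux ping flags shown)."""
    result = subprocess.run(["ping", "-c", "1", "-W", str(timeout_s), ip],
                            capture_output=True)
    return result.returncode == 0

def state_on_loss_of_all_heartbeats():
    # The node that still reaches the witness becomes primary ALONE and runs
    # the application; a node that cannot reach it goes to WAIT.
    # If the witness itself is down, both nodes end up in WAIT-WAIT and an
    # administrator restarts one of them as primary from the console.
    return "ALONE" if witness_reachable(WITNESS_IP) else "WAIT"
```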

What happens without a split brain checker?

In case of network isolation, both servers will go to the ALONE state running the critical application. The replicated directories are isolated and each application is working on its own data in its own directory.

When the network is reconnected, SafeKit by default chooses the server which was PRIM before the isolation as the new primary and forces the other one to SECOND, with a resynchronization of all its data from the PRIM.

Note: Windows can detect a duplicate IP address on one server and remove the virtual IP address on this server. SafeKit has a checker to force a restart in that case.

How does the SafeKit mirror cluster work?

Step 1. Real-time replication

Server 1 (PRIM) runs the application. Clients are connected to a virtual IP address. SafeKit replicates file modifications in real time over the network.

File replication at byte level in a mirror cluster

The replication is synchronous, with no data loss on failure, unlike asynchronous replication.
You just have to configure the names of the directories to replicate in SafeKit. There are no prerequisites on disk organization. Directories may even be located on the system disk.
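The difference can be pictured with a small conceptual sketch (Python for illustration only; the helpers send_to_secondary, wait_for_ack and queue_for_secondary are hypothetical placeholders, not SafeKit internals):

```python
# Synchronous vs. asynchronous replication of a file modification.

def synchronous_write(local_file, data, send_to_secondary, wait_for_ack):
    local_file.write(data)       # apply the modification locally
    send_to_secondary(data)      # ship the same bytes to the secondary
    wait_for_ack()               # return only once the secondary holds the write
    # -> on a primary failure, the write is already on the secondary: no data loss

def asynchronous_write(local_file, data, queue_for_secondary):
    local_file.write(data)       # apply the modification locally
    queue_for_secondary(data)    # shipped later in the background
    # -> writes still queued at failure time are lost on failover
```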

Step 2. Automatic failover

When Server 1 fails, Server 2 takes over. SafeKit switches the virtual IP address and restarts the application automatically on Server 2.
The application finds the files replicated by SafeKit up to date on Server 2. The application continues to run on Server 2 by locally modifying its files, which are no longer replicated to Server 1.

Failover in a mirror cluster

The failover time is equal to the fault-detection time (30 seconds by default) plus the application start-up time.
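As a quick back-of-the-envelope example (the 60-second application start-up time is an arbitrary assumption):

```python
# Failover time = fault-detection time + application start-up time.
fault_detection_s = 30   # SafeKit default detection time
app_startup_s = 60       # entirely application-dependent (example value)

print(f"estimated failover time: {fault_detection_s + app_startup_s} s")  # 90 s in this example
```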

Step 3. Automatic failback

Failback involves restarting Server 1 after fixing the problem that caused it to fail.
SafeKit automatically resynchronizes the files, updating only the files modified on Server 2 while Server 1 was halted.

Failback in a mirror cluster

Failback takes place without disturbing the application, which can continue running on Server 2.
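Conceptually, the failback resynchronization copies back only what changed, as in the following sketch (illustrative Python based on file modification times; SafeKit's actual resynchronization mechanism is not this code):

```python
# Copy back only the files modified on Server 2 since the failure,
# so the application can keep running on Server 2 during failback.
import os
import shutil

def resync_modified_files(src_dir, dst_dir, failure_time):
    """src_dir: replicated directory on Server 2; dst_dir: copy for Server 1.
    failure_time: epoch timestamp of the Server 1 failure."""
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            src = os.path.join(root, name)
            if os.path.getmtime(src) > failure_time:
                dst = os.path.join(dst_dir, os.path.relpath(src, src_dir))
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)

# Example with hypothetical paths:
# resync_modified_files("/data/replicated", "/mnt/server1/replicated", failure_time=1_700_000_000)
```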

Step 4. Back to normal

After reintegration, the files are once again in mirror mode, as in step 1. The system is back in high-availability mode, with the application running on Server 2 and SafeKit replicating file updates to Server 1.

Return to normal operation in a mirror cluster

If the administrator wishes the application to run on Server 1, he/she can execute a "swap" command either manually at an appropriate time, or automatically through configuration.

Typical usage with SafeKit

Why a replication of a few terabytes?

Resynchronization time after a failure (step 3)

  • 1 Gb/s network ≈ 3 hours for 1 terabyte.
  • 10 Gb/s network ≈ 1 hour or less for 1 terabyte, depending on disk write performance (see the estimate sketch below).
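These figures can be approximated with a simple throughput calculation (illustrative Python; the 80% link-efficiency factor is an assumption, and real resynchronization is also bounded by disk write performance):

```python
# Rough full-resynchronization time for a given data volume and network link.
def resync_hours(data_terabytes, link_gbit_per_s, efficiency=0.8):
    data_bits = data_terabytes * 1e12 * 8                  # TB -> bits
    seconds = data_bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600

print(f"{resync_hours(1, 1):.1f} h for 1 TB at 1 Gb/s")    # about 2.8 h
print(f"{resync_hours(1, 10):.1f} h for 1 TB at 10 Gb/s")  # about 0.3 h, disk-bound in practice
```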

Alternative

Why a replication < 1,000,000 files?

  • Resynchronization performance after a failure (step 3).
  • Time needed to check each file between both nodes.

Alternative

  • Put the many files to replicate in a virtual hard disk / virtual machine.
  • Only the files representing the virtual hard disk / virtual machine will be replicated and resynchronized in this case.

Why a failover of at most 32 replicated VMs?

  • Each VM runs in an independent mirror module.
  • Maximum of 32 mirror modules running on the same cluster.

Alternative

  • Use an external shared storage and another VM clustering solution.
  • More expensive, more complex.

Why a LAN/VLAN network between remote sites?

Alternative

  • Use a load balancer for the virtual IP address if the 2 nodes are in 2 subnets (supported by SafeKit, especially in the cloud).
  • Use backup solutions with asynchronous replication for high-latency networks.

Comparison of SafeKit with Traditional High Availability (HA) Clusters

How does SafeKit compare to traditional High Availability (HA) cluster solutions?

This comparison highlights the fundamental differences between SafeKit and traditional High Availability (HA) cluster solutions like Failover Clusters, Virtualization HA, and SQL Always-On. SafeKit is designed as a low-complexity, software-only solution for generic application redundancy, contrasting with the high complexity and specific storage requirements (shared storage, SAN) typical of traditional HA mechanisms.
Comparison of SafeKit with traditional High Availability (HA) clusters
Solution | Complexity | Comments
Failover Cluster (Microsoft) | High | Specific storage (shared storage, SAN)
Virtualization (VMware HA) | High | Specific storage (shared storage, SAN, vSAN)
SQL Always-On (Microsoft) | High | Only SQL is redundant; requires SQL Enterprise Edition
Evidian SafeKit | Low | Simplest, generic and software-only; unsuitable for large data replication

SafeKit's Advantage in Application Redundancy

SafeKit achieves its low-complexity High Availability through a simple, software-based mirroring mechanism that eliminates the need for expensive, dedicated hardware like a SAN (Storage Area Network). This makes it a highly accessible solution for quickly implementing application redundancy without complex infrastructure changes.

SafeKit High Availability (HA) Solutions: Quick Installation Guides for Windows and Linux Clusters

This table presents the SafeKit High Availability (HA) solutions, categorized by application and operating environment (Databases, Web Servers, VMs, Cloud). Identify the specific pre‑configured .safe module (e.g., mirror.safe, farm.safe, and others) required for real‑time replication, load balancing, and automatic failover of critical business applications on Windows or Linux. Simplify your HA cluster setup with direct links to quick installation guides, each including a download link for the corresponding .safe module.

A SafeKit .safe module is essentially a pre‑configured High Availability (HA) template that defines how a specific application will be clustered and protected by the SafeKit software. In practice, it contains a configuration file (userconfig.xml) and restart scripts.

SafeKit High Availability (HA) Solutions: Quick Installation Guides (with downloadable .safe modules)
Application Category | HA Scenario | Technology / Product | .safe Module | Installation Guide
New Applications | Real-Time Replication and Failover | Windows | mirror.safe | View Guide: Windows Replication
New Applications | Real-Time Replication and Failover | Linux | mirror.safe | View Guide: Linux Replication
New Applications | Network Load Balancing and Failover | Windows | farm.safe | View Guide: Windows Load Balancing
New Applications | Network Load Balancing and Failover | Linux | farm.safe | View Guide: Linux Load Balancing
Databases | Replication and Failover | Microsoft SQL Server | sqlserver.safe | View Guide: SQL Server Cluster
Databases | Replication and Failover | PostgreSQL | postgresql.safe | View Guide: PostgreSQL Replication
Databases | Replication and Failover | MySQL | mysql.safe | View Guide: MySQL Cluster
Databases | Replication and Failover | Oracle | oracle.safe | View Guide: Oracle Failover Cluster
Databases | Replication and Failover | Firebird | firebird.safe | View Guide: Firebird HA
Web Servers | Load Balancing and Failover | Apache | apache_farm.safe | View Guide: Apache Load Balancing
Web Servers | Load Balancing and Failover | IIS | iis_farm.safe | View Guide: IIS Load Balancing
Web Servers | Load Balancing and Failover | NGINX | farm.safe | View Guide: NGINX Load Balancing
VMs and Containers | Replication and Failover | Hyper-V | hyperv.safe | View Guide: Hyper-V VM Replication
VMs and Containers | Replication and Failover | KVM | kvm.safe | View Guide: KVM VM Replication
VMs and Containers | Replication and Failover | Docker | mirror.safe | View Guide: Docker Container Failover
VMs and Containers | Replication and Failover | Podman | mirror.safe | View Guide: Podman Container Failover
VMs and Containers | Replication and Failover | Kubernetes K3S | k3s.safe | View Guide: Kubernetes K3S Replication
AWS Cloud | Real-Time Replication and Failover | AWS | mirror.safe | View Guide: AWS Replication Cluster
AWS Cloud | Network Load Balancing and Failover | AWS | farm.safe | View Guide: AWS Load Balancing Cluster
GCP Cloud | Real-Time Replication and Failover | GCP | mirror.safe | View Guide: GCP Replication Cluster
GCP Cloud | Network Load Balancing and Failover | GCP | farm.safe | View Guide: GCP Load Balancing Cluster
Azure Cloud | Real-Time Replication and Failover | Azure | mirror.safe | View Guide: Azure Replication Cluster
Azure Cloud | Network Load Balancing and Failover | Azure | farm.safe | View Guide: Azure Load Balancing Cluster
Physical Security / VMS | Real-Time Replication and Failover | Milestone XProtect | milestone.safe | View Guide: Milestone XProtect Failover
Physical Security / VMS | Real-Time Replication and Failover | Nedap AEOS | nedap.safe | View Guide: Nedap AEOS Failover
Physical Security / VMS | Real-Time Replication and Failover | Genetec (SQL Server) | sqlserver.safe | View Guide: Genetec SQL Failover
Physical Security / VMS | Real-Time Replication and Failover | Bosch AMS (Hyper-V) | hyperv.safe | View Guide: Bosch AMS Hyper-V Failover
Physical Security / VMS | Real-Time Replication and Failover | Bosch BIS (Hyper-V) | hyperv.safe | View Guide: Bosch BIS Hyper-V Failover
Physical Security / VMS | Real-Time Replication and Failover | Bosch BVMS (Hyper-V) | hyperv.safe | View Guide: Bosch BVMS Hyper-V Failover
Physical Security / VMS | Real-Time Replication and Failover | Hanwha Vision (Hyper-V) | hyperv.safe | View Guide: Hanwha Vision Hyper-V Failover
Physical Security / VMS | Real-Time Replication and Failover | Hanwha Wisenet (Hyper-V) | hyperv.safe | View Guide: Hanwha Wisenet Hyper-V Failover
Siemens Products | Real-Time Replication and Failover | Siemens Siveillance suite (Hyper-V) | hyperv.safe | View Guide: Siemens Siveillance HA
Siemens Products | Real-Time Replication and Failover | Siemens Desigo CC (Hyper-V) | hyperv.safe | View Guide: Siemens Desigo CC HA
Siemens Products | Real-Time Replication and Failover | Siemens Siveillance VMS | SiveillanceVMS.safe | View Guide: Siemens Siveillance VMS HA
Siemens Products | Real-Time Replication and Failover | Siemens SiPass (Hyper-V) | hyperv.safe | View Guide: Siemens SiPass HA
Siemens Products | Real-Time Replication and Failover | Siemens SIPORT (Hyper-V) | hyperv.safe | View Guide: Siemens SIPORT HA
Siemens Products | Real-Time Replication and Failover | Siemens SIMATIC PCS 7 (Hyper-V) | hyperv.safe | View Guide: SIMATIC PCS 7 HA
Siemens Products | Real-Time Replication and Failover | Siemens SIMATIC WinCC (Hyper-V) | hyperv.safe | View Guide: SIMATIC WinCC HA