[VMware] Notes on Redundancy in vSAN Configuration


Software-defined storage (SDS) is becoming popular, and I feel it has become stable enough to operate in production. In my project, we operate a system built on the vSAN architecture provided by VMware. However, problems unique to vSAN can cause service-impacting events, so it is important to understand the characteristics of vSAN and be prepared to resolve them.

vSAN Configuration

An example of a vSAN configuration is shown in the figure below. vmk1 (vmkernel1) is allocated for vSAN communication and is designed to be connected to two physical NICs, vmnic2 and vmnic3. This configuration allows service to continue even if one physical NIC fails.
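The redundancy requirement above can be expressed as a simple check. This is only a minimal sketch: the `VMK_UPLINKS` mapping and the function name are hypothetical, written by hand for illustration, and are not read from any VMware API.

```python
# Hypothetical inventory: VMkernel adapter -> physical uplinks.
# The names follow the article; the mapping itself is hand-written.
VMK_UPLINKS = {
    "vmk1": ["vmnic2", "vmnic3"],
}


def lacks_nic_redundancy(vmk_uplinks):
    """Return the VMkernel adapters backed by fewer than two distinct physical NICs."""
    return sorted(
        vmk for vmk, nics in vmk_uplinks.items() if len(set(nics)) < 2
    )


print(lacks_nic_redundancy(VMK_UPLINKS))  # [] means every adapter has two uplinks
```

An adapter listed in the result would lose vSAN connectivity on a single NIC failure.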

Disk write flow across ESXi servers

In vSAN, disk writes issued by virtual servers are processed across ESXi servers. For this traffic, one of vmnic2 and vmnic3 associated with vmk1 is Active and the other is Standby. As of 2023, an Active-Active configuration is not available for this in the VMware product.

Behavior in case of vmnic failure

When a vmnic fails and link down is detected, the Standby vmnic is promoted to Active. An ARP packet is then sent to the L2 switch (L2SW), which learns the newly active vmnic and forwards write packets to it.
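The failover sequence described here can be modeled in a few lines. This is a simplified sketch, not ESXi's actual implementation: the `Uplink` and `ActiveStandbyTeam` classes are hypothetical, and the switch notification is only recorded as a log entry.

```python
from dataclasses import dataclass, field


@dataclass
class Uplink:
    name: str
    link_up: bool = True  # physical link state as seen by the host


@dataclass
class ActiveStandbyTeam:
    active: Uplink
    standby: Uplink
    events: list = field(default_factory=list)

    def check(self):
        """Promote the standby uplink only when the active one reports link down."""
        if not self.active.link_up and self.standby.link_up:
            self.active, self.standby = self.standby, self.active
            # In the real system the switch is notified at this point.
            self.events.append(f"failover to {self.active.name}, notify L2SW")
        return self.active.name


team = ActiveStandbyTeam(Uplink("vmnic2"), Uplink("vmnic3"))
team.active.link_up = False  # simulate link down on vmnic2
print(team.check())          # vmnic3
print(team.events)
```

Note that the model looks only at link state, which is the only trigger for promotion in this sequence.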

Let me add a supplementary note on vmnic behavior during a failure. The Standby vmnic changes to Active when the ESXi server detects link down. Note that if a failure occurs without link down, the failed vmnic continues to be used. If network communication becomes completely impossible, the other ESXi servers detect the failure and disconnect the affected server, and the virtual servers are recovered by vSphere HA. However, if the fault is partial, such as frequent CRC errors, the other ESXi servers do not detect a failure, and disk writes remain unstable until someone recovers the NIC manually or disconnects the server.
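Because this kind of partial failure does not trigger failover, it has to be caught by monitoring. The sketch below compares two samples of NIC counters and flags a high CRC error ratio. The counter names (`rx_packets`, `rx_crc_errors`) and the threshold are assumptions chosen for illustration; how the counters are collected from the host is outside the scope of this sketch.

```python
def crc_error_ratio(prev, curr):
    """Ratio of CRC errors to received packets between two counter samples."""
    packets = curr["rx_packets"] - prev["rx_packets"]
    errors = curr["rx_crc_errors"] - prev["rx_crc_errors"]
    if packets <= 0:
        # No traffic in the interval, or the counters were reset.
        return 0.0
    return errors / packets


def is_degraded(prev, curr, threshold=1e-4):
    """Flag the NIC when the CRC error ratio exceeds the (assumed) threshold."""
    return crc_error_ratio(prev, curr) > threshold


# Hypothetical samples taken some interval apart.
before = {"rx_packets": 1_000_000, "rx_crc_errors": 10}
after = {"rx_packets": 2_000_000, "rx_crc_errors": 510}
print(is_degraded(before, after))  # True: 500 errors in 1,000,000 packets
```

A check like this only raises an alert. Recovery still requires manual action, as described above.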


For more detailed information, I recommend reading the blogs provided by VMware.

