VTEP: VXLAN Tunnel End Point. An entity that originates and/or terminates VXLAN tunnels
VXLAN: Virtual eXtensible Local Area Network
VXLAN Segment: VXLAN Layer 2 overlay network over which VMs communicate
VXLAN Gateway: an entity that forwards traffic between VXLANs
2 Conventions Used in This Document
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119 [RFC2119].
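Today's virtualized environments place additional demands on the MAC address tables of the Top-of-Rack (ToR) switches that connect to the servers. Instead of just one MAC address per server link, the ToR switch now has to learn the MAC addresses of the individual VMs (which could number in the hundreds per server). This is needed because traffic between the VMs and the rest of the physical network traverses the link between the server and the switch. A typical ToR switch could connect to 24 or 48 servers, depending on the number of its server-facing ports. A data center might consist of several racks, so each ToR switch would need to maintain an address table for the communicating VMs across the various physical servers. This places a much larger demand on forwarding table capacity than a non-virtualized environment does.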
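The VNI identifies the scope of the inner MAC frame originated by the individual VM. Thus, MAC addresses may overlap across segments, but traffic never "crosses over" because it is isolated using the VNI. The VNI is carried in an outer header that encapsulates the inner MAC frame originated by the VM. In the following sections, the term "VXLAN segment" is used interchangeably with the term "VXLAN overlay network".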
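As a minimal illustration of this isolation, the sketch below keys a forwarding table on the (VNI, MAC) pair rather than on the MAC address alone. The table layout, the function name, and the sample addresses are assumptions made for this example and are not defined by this document.

```python
# Hypothetical forwarding table keyed on (VNI, inner MAC); values are VTEP IPs.
# The same MAC address appears in two segments without conflict.
forwarding_table = {
    (5001, "00:11:22:33:44:55"): "192.0.2.10",
    (5002, "00:11:22:33:44:55"): "192.0.2.20",
}


def lookup_vtep(vni, mac):
    """Return the VTEP IP serving this MAC within one VXLAN segment, or None."""
    return forwarding_table.get((vni, mac))
```

Because the VNI is part of the key, a lookup in one segment can never return an entry that belongs to another segment.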
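The following sections discuss typical traffic flow scenarios in a VXLAN environment using one type of control scheme: data plane learning. Here, the association of a VM's MAC address with a VTEP's IP address is discovered via source-address learning. Multicast is used for carrying unknown destination, broadcast, and multicast frames. In addition to a learning-based control plane, other schemes are possible for distributing the VTEP IP to VM MAC mapping information. Options include a lookup by the individual VTEPs against a central authority or directory, distribution of the mapping information to the VTEPs by a central authority, and so on. The distribution and lookup approaches are sometimes called push and pull models, respectively. This document focuses on the data plane learning scheme.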
4.1 Unicast VM-to-VM Communication
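A VM within a VXLAN overlay network is unaware of VXLAN. To communicate with a VM on a different host, it sends a MAC frame destined to the target as usual. The VTEP on the physical host looks up the VNI with which this VM is associated. It then determines whether the destination MAC is on the same segment and whether there is a mapping of the destination MAC address to a remote VTEP. If so, an outer header comprising an outer MAC header, an outer IP header, a UDP header, and a VXLAN header is prepended to the original MAC frame. The encapsulated packet is forwarded towards the remote VTEP. Upon reception, the remote VTEP verifies that the VNI is valid and that there is a VM on that VNI using a MAC address that matches the inner destination MAC address. If so, the packet is stripped of its encapsulating headers and passed on to the destination VM. The destination VM never knows about the VNI or that the frame was transported with a VXLAN encapsulation.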
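The unicast procedure above can be sketched roughly as follows. This is a toy model under stated assumptions: the class name, the table names, and the use of a tuple in place of real outer headers are all illustrative and not part of this document.

```python
from dataclasses import dataclass, field


@dataclass
class Vtep:
    """Toy VTEP; every name here is hypothetical."""
    ip: str
    port_to_vni: dict = field(default_factory=dict)  # local VM port -> VNI
    remote_macs: dict = field(default_factory=dict)  # (VNI, MAC) -> remote VTEP IP
    local_macs: dict = field(default_factory=dict)   # (VNI, MAC) -> local VM port

    def encapsulate(self, port, dst_mac, frame):
        """Return (remote VTEP IP, VNI, frame) for a known destination, else None."""
        vni = self.port_to_vni[port]
        remote_ip = self.remote_macs.get((vni, dst_mac))
        if remote_ip is None:
            return None  # unknown destination; the multicast path is not modeled here
        return (remote_ip, vni, frame)

    def decapsulate(self, vni, dst_mac, frame):
        """Return (local port, frame) if this VNI/MAC pair is served here, else None."""
        port = self.local_macs.get((vni, dst_mac))
        if port is None:
            return None  # invalid VNI or no matching VM: the packet is dropped
        return (port, frame)
```

The inner frame is passed through unchanged in both directions, which mirrors the point that the destination VM never sees the VNI or the encapsulation.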
Consider the VM on the source host attempting to communicate with the destination VM using IP. Assuming that they are both on the same subnet, the VM sends out an Address Resolution Protocol (ARP) broadcast frame. In the non-VXLAN environment, this frame would be sent out using MAC broadcast across all switches carrying that VLAN.
With VXLAN, a header including the VXLAN VNI is inserted at the beginning of the packet along with the IP header and UDP header. However, this broadcast packet is sent out to the IP multicast group on which that VXLAN overlay network is realized.
To effect this, we need to have a mapping between the VXLAN VNI and the IP multicast group that it will use. This mapping is done at the management layer and provided to the individual VTEPs through a management channel. Using this mapping, the VTEP can provide IGMP membership reports to the upstream switch/router to join/leave the VXLAN-related IP multicast groups as needed. This will enable pruning of the leaf nodes for specific multicast traffic addresses based on whether a member is available on this host using the specific multicast address (see [RFC4541]). In addition, use of multicast routing protocols like Protocol Independent Multicast - Sparse Mode (PIM-SM see [RFC4601]) will provide efficient multicast trees within the Layer 3 network.
The VTEP will use (*,G) joins. This is needed as the set of VXLAN tunnel sources is unknown and may change often, as the VMs come up / go down across different hosts. A side note here is that since each VTEP can act as both the source and destination for multicast packets, a protocol like bidirectional PIM (BIDIR-PIM -- see [RFC5015]) would be more efficient.
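One way to picture the join/leave behavior is a per-VNI count of local VMs, as in the sketch below. It assumes one multicast group per VNI and leaves the actual IGMP signaling to the caller; the class and method names are hypothetical.

```python
class GroupMembership:
    """Tracks which multicast groups a VTEP should be joined to (illustrative only)."""

    def __init__(self, vni_to_group):
        # Mapping provisioned by the management layer: VNI -> multicast group.
        self.vni_to_group = dict(vni_to_group)
        self.vm_count = {}  # VNI -> number of local VMs currently up

    def vm_up(self, vni):
        """Return the group to join if this is the first local VM on the VNI."""
        self.vm_count[vni] = self.vm_count.get(vni, 0) + 1
        return self.vni_to_group[vni] if self.vm_count[vni] == 1 else None

    def vm_down(self, vni):
        """Return the group to leave if this was the last local VM on the VNI."""
        self.vm_count[vni] -= 1
        if self.vm_count[vni] == 0:
            del self.vm_count[vni]
            return self.vni_to_group[vni]
        return None
```

A caller would send an IGMP membership report when `vm_up` returns a group and a leave when `vm_down` returns one. If several VNIs shared a group, the count would have to be kept per group instead.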
The destination VM sends a standard ARP response using IP unicast. This frame will be encapsulated back to the VTEP connecting the originating VM using IP unicast VXLAN encapsulation. This is possible since the mapping of the ARP response's destination MAC to the VXLAN tunnel end point IP was learned earlier through the ARP request.
Note that multicast frames and "unknown MAC destination" frames are also sent using the multicast tree, similar to the broadcast frames.
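The learning step and the choice of outer destination described in this exchange might look like the following sketch. The function names, the table layout, and the sample addresses used with them are assumptions for illustration only.

```python
def learn(table, vni, inner_src_mac, outer_src_ip):
    """Record that inner_src_mac on this VNI is reachable via the VTEP at outer_src_ip."""
    table[(vni, inner_src_mac)] = outer_src_ip


def outer_destination(table, vni_to_group, vni, inner_dst_mac):
    """Pick the outer IP destination: a learned VTEP, else the VNI's multicast group.

    Broadcast and multicast MAC addresses never appear as source addresses,
    so they are never learned and always fall through to the group.
    """
    return table.get((vni, inner_dst_mac), vni_to_group[vni])
```

In the ARP case, the request is sent to the group because the broadcast MAC is not in the table, and the receiving VTEP calls `learn` with the requester's inner source MAC and outer source IP. The ARP response can then be sent by unicast.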
When IP multicast is used within the network infrastructure, a multicast routing protocol like PIM-SM can be used by the individual Layer 3 IP routers/switches within the network. This is used to build efficient multicast forwarding trees so that multicast frames are only sent to those hosts that have requested to receive them.
Similarly, there is no requirement that the actual network connecting the source VM and destination VM should be a Layer 3 network: VXLAN can also work over Layer 2 networks. In either case, efficient multicast replication within the Layer 2 network can be achieved using IGMP snooping.
VTEPs MUST NOT fragment VXLAN packets. Intermediate routers may fragment encapsulated VXLAN packets due to the larger frame size. The destination VTEP MAY silently discard such VXLAN fragments. To ensure end-to-end traffic delivery without fragmentation, it is RECOMMENDED that the MTUs (Maximum Transmission Units) across the physical network infrastructure be set to a value that accommodates the larger frame size due to the encapsulation. Other techniques like Path MTU discovery (see [RFC1191] and [RFC1981]) MAY be used to address this requirement as well.
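As a worked example of the larger frame size, the sketch below adds up the encapsulation overhead. It assumes the standard header sizes (14-byte inner Ethernet header, 8-byte VXLAN header, 8-byte UDP header, 20-byte IPv4 or 40-byte IPv6 header with no options or extension headers); the function name and its parameters are hypothetical.

```python
INNER_ETHERNET = 14  # inner destination MAC, source MAC, and Ethertype
INNER_VLAN_TAG = 4   # optional inner 802.1Q tag
VXLAN_HEADER = 8
UDP_HEADER = 8
IP_HEADER = {4: 20, 6: 40}  # no IPv4 options, no IPv6 extension headers


def required_underlay_mtu(vm_mtu=1500, ip_version=4, inner_vlan=False):
    """Return the IP MTU the physical network needs to carry one unfragmented VM packet."""
    inner = vm_mtu + INNER_ETHERNET + (INNER_VLAN_TAG if inner_vlan else 0)
    return inner + VXLAN_HEADER + UDP_HEADER + IP_HEADER[ip_version]
```

With a VM MTU of 1500 bytes and an IPv4 underlay, this gives 1500 + 14 + 8 + 8 + 20 = 1550 bytes, so the physical network needs an MTU of at least that value.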
Inner Ethernet Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Inner Destination MAC Address                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Inner Destination MAC Address | Inner Source MAC Address      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Inner Source MAC Address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|OptnlEthtype = C-Tag 802.1Q    | Inner.VLAN Tag Information    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Payload:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Ethertype of Original Payload |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                  Original Ethernet Payload    |
|                                                               |
|(Note that the original Ethernet Frame's FCS is not included)  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Frame Check Sequence:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   New FCS (Frame Check Sequence) for Outer Ethernet Frame     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
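The inner Ethernet header and payload layout can be exercised with a short byte-packing sketch. The helper names are hypothetical, and the optional C-Tag is built with the 802.1Q TPID 0x8100 with the priority and drop-eligible bits left at zero.

```python
import struct


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to its 6-byte representation."""
    return bytes.fromhex(mac.replace(":", ""))


def inner_frame(dst_mac, src_mac, ethertype, payload, vlan_id=None):
    """Build the inner Ethernet frame as carried inside VXLAN (no FCS appended)."""
    header = mac_bytes(dst_mac) + mac_bytes(src_mac)
    if vlan_id is not None:
        # Optional C-Tag: TPID 0x8100 followed by a TCI holding the 12-bit VLAN ID.
        header += struct.pack("!HH", 0x8100, vlan_id & 0x0FFF)
    header += struct.pack("!H", ethertype)  # Ethertype of the original payload
    return header + payload
```

An untagged header is 14 bytes and a tagged one is 18 bytes. The frame check sequence is computed only over the outer Ethernet frame, so it is not added here.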
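For incoming frames on the VXLAN-connected interface, the gateway strips the VXLAN header and forwards the frame to a physical port based on the destination MAC address of the inner Ethernet frame. Decapsulated frames with an inner VLAN ID SHOULD be discarded unless explicitly configured to be passed on to the non-VXLAN interface. In the reverse direction, incoming frames on the non-VXLAN interfaces are mapped to a specific VXLAN overlay network based on the VLAN ID in the frame. Unless explicitly configured to be passed on in the encapsulated VXLAN frame, this VLAN ID is removed before the frame is encapsulated for VXLAN.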
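The gateway's VLAN handling in both directions can be summarized in a small sketch. The function names, the configuration flags, and the VLAN-to-VNI table are assumptions introduced for this example.

```python
def to_vxlan(vlan_to_vni, vlan_id, keep_inner_vlan=False):
    """Map a frame from a non-VXLAN interface to (VNI, inner VLAN ID or None).

    The VLAN ID selects the overlay network and is removed before
    encapsulation unless the gateway is configured to keep it.
    """
    vni = vlan_to_vni[vlan_id]
    return (vni, vlan_id if keep_inner_vlan else None)


def may_forward_from_vxlan(inner_vlan_id, pass_inner_vlan=False):
    """Return True if a decapsulated frame may go out the non-VXLAN interface.

    Frames that still carry an inner VLAN ID are discarded unless the
    gateway is explicitly configured to pass them on.
    """
    return inner_vlan_id is None or pass_inner_vlan
```

Both flags default to the stricter behavior, so a gateway that is not explicitly configured strips the tag on the way in and drops tagged frames on the way out.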
These gateways that provide VXLAN tunnel termination functions could be ToR/access switches or switches higher up in the data center network topology -- e.g., core or even WAN edge devices. The last case (WAN edge) could involve a Provider Edge (PE) router that terminates VXLAN tunnels in a hybrid cloud environment. In all these instances, note that the gateway functionality could be implemented in software or hardware.