I think many people have noticed that the VSL channel is analogous to a stack bus. That makes it interesting to look at the traffic it carries:
- control system traffic (protocols that keep the VSS virtual switch running, including state synchronization between the switches),
- network management traffic (traffic addressed to the control plane but received by the backup switch: CDP, VTP, STP, EIGRP/OSPF, etc.),
- user traffic (including broadcast and multicast traffic),
- service traffic (e.g. SPAN).
We'll look at each type of traffic in more detail as we go.
So, after the switches have started, the protocols responsible for the initial VSS bring-up go into battle:
- Link Management Protocol (LMP)
- Role Resolution Protocol (RRP)
LMP checks that the VSL channel is up and that the devices can see each other. RRP checks the hardware and software compatibility of the devices and determines which one becomes the primary switch and which one the backup.
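The bring-up order described above can be sketched roughly as follows. This is only an illustration, not Cisco's implementation: the `Chassis` fields, the `rrp_resolve` function and the tie-break rule (higher priority wins, then lower switch ID) are assumptions made for the example, and a software mismatch is simplified to "no VSS is formed".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Chassis:
    switch_id: int   # 1 or 2 within the VSS pair
    priority: int    # assumed: higher priority is preferred as primary
    software: str    # opaque software release label
    vsl_up: bool     # what LMP would report for this side of the VSL


def lmp_ok(a: Chassis, b: Chassis) -> bool:
    """LMP step: the VSL must be up on both sides."""
    return a.vsl_up and b.vsl_up


def rrp_resolve(a: Chassis, b: Chassis) -> Optional[Tuple[Chassis, Chassis]]:
    """RRP step: return (primary, backup), or None if the pair cannot form."""
    if not lmp_ok(a, b):
        return None
    if a.software != b.software:  # simplified compatibility check
        return None
    # Assumed tie-break: higher priority first, then lower switch ID.
    primary = max((a, b), key=lambda c: (c.priority, -c.switch_id))
    backup = b if primary is a else a
    return primary, backup


sw1 = Chassis(switch_id=1, priority=100, software="release-A", vsl_up=True)
sw2 = Chassis(switch_id=2, priority=110, software="release-A", vsl_up=True)
primary, backup = rrp_resolve(sw1, sw2)
print(f"primary: switch {primary.switch_id}, backup: switch {backup.switch_id}")
```

The point of the sketch is the ordering: role resolution only makes sense once the link check has passed.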
The control plane on the primary switch performs two functions. The first is the logic of the switch itself: programming the switch based on the configuration, processing all network protocols (L2/L3), building the routing table and CEF tables, managing ports, etc. The second is filling in all the hardware tables (FIB, Adjacency, ACL, QoS, etc.) on both switches so that user traffic is processed in hardware. The control plane on the backup switch is in a hot standby state. Meanwhile, the state of the active control plane is constantly synchronized to the backup one. This is necessary to keep our virtual switch running if the primary physical switch fails.
The following information is synchronized: the boot parameters of the devices, their configuration, the state of the network protocols and various tables running on the active control plane, and the state of the hardware (line cards, ports).
Control data transfer and state synchronization between the primary and backup switches are handled by specialized protocols:
- Serial Communication Protocol (SCP) - provides communication between the processor and line cards (both local and remote switches)
- Inter-process Communication Packets (IPC) - provides communication between distributed device processors
- Inter-Card Communication (ICC) - provides communication between line cards
All of these protocols belong to the control traffic of the system, which is carried between the switches over the VSL channel and forms an inter-chassis Ethernet Out-of-Band Channel (EOBC).
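To keep the four traffic types straight, here is a small sketch that sorts the protocols named in this article into them. The protocol names come from the text; the `VslTrafficClass` enum, the `classify` function and the rule that anything unrecognized counts as user traffic are assumptions made for the example.

```python
from enum import Enum


class VslTrafficClass(Enum):
    CONTROL = "control system traffic"
    MANAGEMENT = "network management traffic"
    USER = "user traffic"
    SERVICE = "service traffic"


# Protocol names are taken from the article; the grouping is a simplification.
_CONTROL = {"LMP", "RRP", "SCP", "IPC", "ICC"}
_MANAGEMENT = {"CDP", "VTP", "STP", "EIGRP", "OSPF"}
_SERVICE = {"SPAN"}


def classify(protocol: str) -> VslTrafficClass:
    """Map a protocol name to the VSL traffic type it belongs to."""
    name = protocol.upper()
    if name in _CONTROL:
        return VslTrafficClass.CONTROL
    if name in _MANAGEMENT:
        return VslTrafficClass.MANAGEMENT
    if name in _SERVICE:
        return VslTrafficClass.SERVICE
    # Assumption: everything else is treated as ordinary user traffic.
    return VslTrafficClass.USER


for proto in ("SCP", "OSPF", "SPAN", "HTTP"):
    print(f"{proto}: {classify(proto).value}")
```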
The Stateful Switchover (SSO) mechanism is responsible for synchronizing state between the switches. This mechanism appeared long ago; for example, it is used for supervisor redundancy within a single 6500 switch. VSS uses the same mechanism (nothing new was invented here). But as we remember, SSO does not synchronize the state of routing protocols. So, on switchover to the backup switch, the dynamic routing protocols start from scratch. That automatically breaks all L3 adjacencies with remote devices, which means a temporary loss of connectivity with the outside world.
To solve this problem, SSO works together with Non-Stop Forwarding (NSF). NSF does the following: it keeps L3 packets flowing during the switchover (in effect, it freezes the old entries for all routes), tells the remote routers that they don't need to tear down the adjacency, and asks them for all the information needed to build a new routing table. Of course, the remote devices must also support NSF for this to work (so to speak, be NSF-aware).
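The "freeze the old routes, then rebuild" idea can be illustrated with a toy forwarding table. This is a sketch under stated assumptions, not how the hardware FIB works: the `ForwardingTable`, `switchover` and `reconverge` names are hypothetical, the prefixes are documentation addresses, and the no-NSF case is modelled simply as the table being flushed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Route:
    next_hop: str
    stale: bool = False  # True while the entry is "frozen" after a switchover


@dataclass
class ForwardingTable:
    routes: Dict[str, Route] = field(default_factory=dict)

    def install(self, prefix: str, next_hop: str) -> None:
        self.routes[prefix] = Route(next_hop)

    def lookup(self, prefix: str) -> Optional[str]:
        route = self.routes.get(prefix)
        return route.next_hop if route else None


def switchover(fib: ForwardingTable, nsf_enabled: bool) -> None:
    """Model the moment the backup switch takes over."""
    if nsf_enabled:
        for route in fib.routes.values():
            route.stale = True   # keep forwarding on the old entries
    else:
        fib.routes.clear()       # simplification: L3 forwarding is lost


def reconverge(fib: ForwardingTable, learned: Dict[str, str]) -> None:
    """Install routes relearned from neighbors, then drop leftover stale ones."""
    for prefix, next_hop in learned.items():
        fib.install(prefix, next_hop)
    for prefix in [p for p, r in fib.routes.items() if r.stale]:
        del fib.routes[prefix]


fib = ForwardingTable()
fib.install("192.0.2.0/24", "198.51.100.1")
switchover(fib, nsf_enabled=True)
print(fib.lookup("192.0.2.0/24"))   # still forwarded on the frozen entry
reconverge(fib, {"192.0.2.0/24": "198.51.100.1"})
print(fib.routes["192.0.2.0/24"].stale)
```

With `nsf_enabled=False` the same lookup returns nothing until `reconverge` has run, which is the gap NSF is there to cover.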
By the way, for reference: the backup switch needs up to 13 seconds to fully restart the dynamic routing processes. So without NSF, things would be pretty sad.
And how long does it take to switch over to the second switch if the first one fails? The vendor gives an average of 200 ms. For the 6500E (and probably for the 6800 too) it can reach 400 ms in some cases (so to speak, the cost of a distributed architecture).
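To get a feel for what these numbers mean, here is a back-of-the-envelope calculation. The traffic rate is a made-up figure chosen only for illustration, and treating the 13-second routing restart as the outage you would see without NSF is an assumption (an upper bound, not a measurement).

```python
def packets_lost(packets_per_second: int, outage_ms: int) -> int:
    """Packets arriving during an outage window, assuming a constant rate."""
    return packets_per_second * outage_ms // 1000


RATE_PPS = 10_000  # hypothetical traffic rate, for illustration only

for label, outage_ms in [
    ("average switchover (200 ms)", 200),
    ("6500E in some cases (400 ms)", 400),
    ("routing restart without NSF (13 s)", 13_000),
]:
    print(f"{label}: ~{packets_lost(RATE_PPS, outage_ms)} packets")
```

At this assumed rate, a 200 ms switchover costs about 2,000 packets, while a 13-second gap would cost about 130,000.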