When connecting network elements in order to create a network, it is common practice to consider and provision for various failures. Ideally, traffic should continue to flow even if a failure of a link or a port has occurred.
There are three types of failures to consider:
Port failure
Link failure
Network Element (NE) failure
Since MEF services define interfaces only, we focus on the first two types of failures. It should be noted that NE failure handling is heavily dependent on the transport technology of the specific network in question.
We focus our discussion on the external Interfaces – namely UNI and ENNI.
MEF 20 defines as a mandatory feature of UNI Type 2.2 the capability to protect against a UNI-N port failure and/or protect against a failure of a physical link crossing the UNI point.
The solution is LAG (Link Aggregation). The basic concept is having standby links that can be used upon failure of a working (active) link. LAG allows a group of at least 2 ports (2 physical links) to be defined and treated as a single logical link. The UNI-N and UNI-C can be perceived as connected over this single logical link. Note that one could connect 4 × 1GbE links and operate 3 of them as active links with one standby link. This would yield a logical link of 3 Gbps between the UNI-N and UNI-C.
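The 4 × 1GbE example above can be sketched as a toy model. This is only an illustration: the `LagGroup` class, its method names, and the port names are hypothetical, and the model ignores LACP negotiation and frame distribution.

```python
from dataclasses import dataclass, field


@dataclass
class LagGroup:
    """Toy LAG model: each member link is active, standby, or failed."""
    link_speed_gbps: float
    active: set = field(default_factory=set)
    standby: set = field(default_factory=set)
    failed: set = field(default_factory=set)

    def capacity_gbps(self) -> float:
        # Capacity of the logical link is the sum of its active members.
        return self.link_speed_gbps * len(self.active)

    def fail(self, link: str) -> None:
        """Mark a link as failed; promote a standby if an active link was lost."""
        if link in self.active:
            self.active.remove(link)
            self.failed.add(link)
            if self.standby:
                # Choosing the lowest name is an arbitrary, deterministic rule.
                promoted = min(self.standby)
                self.standby.remove(promoted)
                self.active.add(promoted)
        elif link in self.standby:
            self.standby.remove(link)
            self.failed.add(link)


lag = LagGroup(link_speed_gbps=1.0,
               active={"port1", "port2", "port3"},
               standby={"port4"})
print(lag.capacity_gbps())  # 3.0
lag.fail("port2")           # port4 is promoted to active
print(lag.capacity_gbps())  # 3.0
lag.fail("port1")           # no standby left, capacity drops
print(lag.capacity_gbps())  # 2.0
```

The first failure is absorbed by the standby link, so the logical link stays at 3 Gbps. Only a second failure reduces capacity.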
For ENNI the definition is stricter. LAG is defined for exactly 2 ports, where only one link is active and the second is standby. The reason for this limitation is the desire to have service frames and SOAM frames traverse the same link, which cannot be guaranteed when LAG operates more than one active link (load-sharing LAG). MEF recommends operating LAG with its control protocol, LACP.
UNI protection is depicted in figure 1.
ENNI protection is depicted in figure 2.
It should be noted that some vendors do allow LAG between different line-cards or chassis and thus utilize LAG not only for port and link protection but also for NE protection. However, there is no industry standard for this solution.
Figure 1 - UNI Protection using LAG
Figure 2 - ENNI Protection using LAG
Service Protection deals with the need to ensure that the EVC/OVC can provide the service even if a specific link or node within the CEN fails. This enables the Service Provider to offer high availability (e.g. five 9s availability). It should be noted that not all services require resiliency. Services that offer this capability are sometimes priced higher. Also, it could be that service protection is offered for a certain CoS ID and not for others. For example, a certain enterprise could request that a business critical application be protected while passing on protection for the lower-priority Internet access. This is supported by the fact that performance attributes like availability are per CoS ID.
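To make the "five 9s" figure concrete, the short calculation below converts an availability target into a downtime budget. It assumes availability is measured over a 365-day year, which is a simplification: the actual measurement interval is set by the service's performance objectives. The function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def max_downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target (e.g. 0.99999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, target in [("three 9s", 0.999),
                      ("four 9s", 0.9999),
                      ("five 9s", 0.99999)]:
    print(f"{label}: {max_downtime_minutes_per_year(target):.2f} min/year")
```

Under this assumption, five 9s leaves a little over five minutes of downtime per year, which is why a protected service cannot rely on manual repair.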
There are two approaches to providing service protection, active/standby EVC and single EVC with transport protection.
Active/standby EVC
In this approach, there are 2 EVCs with identical service attributes. At any given time only one EVC is used for the service. At the ingress UNI, logic is applied to decide which of the 2 EVCs is in use. (See figure 3.)
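The selection logic at the ingress UNI is not specified above, so the following is only one possible sketch. The `Evc` class, the `select_evc` function, and the `revertive` option (switching back to the primary EVC once it recovers) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evc:
    name: str
    up: bool = True


def select_evc(primary: Evc, backup: Evc, current: Evc,
               revertive: bool = False) -> Optional[Evc]:
    """Return the EVC that should carry the service, or None if both are down."""
    if revertive and primary.up:
        # Revertive mode: prefer the primary whenever it is healthy.
        return primary
    if current.up:
        # Non-revertive mode: stay where we are while it works.
        return current
    other = backup if current is primary else primary
    return other if other.up else None


primary, backup = Evc("evc-a"), Evc("evc-b")
current = select_evc(primary, backup, primary)
print(current.name)                      # evc-a

primary.up = False                       # primary EVC fails
current = select_evc(primary, backup, current)
print(current.name)                      # evc-b

primary.up = True                        # primary recovers
print(select_evc(primary, backup, current).name)                  # evc-b
print(select_evc(primary, backup, current, revertive=True).name)  # evc-a
```

Whether to revert is a design choice. Reverting restores the preferred path but causes a second short traffic hit.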
Single EVC with transport protection
In this approach, there is a single EVC. The transport network is provisioned to provide protection in case of link failure. For example, in MPLS-TP there would be 2 LSPs connecting the UNIs, one designated as active and the other as standby.
Figure 3 - Active/Standby EVC
Figure 4 - Internal Resiliency Mechanism
Fault Identification and Recovery
When there are two paths through a network that act as active/standby, the fundamental question is how to detect the situation where the active path is no longer functioning.
A common requirement from the TDM world is to detect a failure and switch over in less than 50 msec. Ethernet protection that was built upon spanning tree would take longer than that (up to a few seconds in some cases) to find a new path.
Since then, new mechanisms have been defined to meet the sub-50 msec requirement. One common approach is to run CCM messages at a high rate (e.g. every 10 msec or even every 3.33 msec). When one end of the EVC does not receive 3 consecutive CCMs, it assumes that the path is not functioning and takes several actions, such as issuing a link fault alarm and switching to the backup path. Note that some implementations may constantly monitor the backup path too, to determine whether it is alive.
Once switchover has occurred, the service is considered recovered.
This approach hides the internal resiliency mechanism from the end user who will feel only a very short traffic disruption.
The concept is depicted in figure 4.
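The CCM-based detection described above can be sketched as follows. This is a simplified model under stated assumptions: the `CcmMonitor` class and its method names are hypothetical, time is passed in explicitly rather than read from a clock, and the failure threshold is taken as exactly 3 missed intervals, as in the text. Real implementations may use a slightly different threshold, and detection is only part of the total switchover time.

```python
def detection_time_ms(ccm_interval_ms: float, missed_ccms: int = 3) -> float:
    """Time to declare a fault after the last CCM was received."""
    return ccm_interval_ms * missed_ccms


class CcmMonitor:
    """Declares the path failed once `missed_limit` CCM intervals pass in silence."""

    def __init__(self, interval_ms: float, missed_limit: int = 3):
        self.interval_ms = interval_ms
        self.missed_limit = missed_limit
        self.last_rx_ms = 0.0

    def on_ccm(self, now_ms: float) -> None:
        # Record the arrival time of the most recent CCM.
        self.last_rx_ms = now_ms

    def path_failed(self, now_ms: float) -> bool:
        silence = now_ms - self.last_rx_ms
        return silence >= self.interval_ms * self.missed_limit


for interval in (3.33, 10.0, 100.0):
    print(f"CCM every {interval} ms -> fault declared after "
          f"{detection_time_ms(interval):.2f} ms")

monitor = CcmMonitor(interval_ms=10.0)
monitor.on_ccm(10.0)
monitor.on_ccm(20.0)              # last CCM before the path breaks
print(monitor.path_failed(49.0))  # False: only 29 ms of silence
print(monitor.path_failed(50.0))  # True: 30 ms of silence, 3 CCMs missed
```

With a 10 msec interval the fault is declared after 30 msec, leaving about 20 msec of the 50 msec budget for the switch itself. A 100 msec interval cannot meet the requirement.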
The ITU-T has defined two standards that handle path protection: G.8031 (Ethernet linear protection switching) and G.8032 (Ethernet ring protection switching). Both utilize Y.1731 CCM messages for fault detection. G.8031 is similar to SONET path protection. It is based on a working path and a protection path between two end points. Switching upon failure can occur in under 50 msec. The concept is illustrated in figure 5.
G.8031 can be implemented over many transport technologies and is network topology independent. G.8032 is specifically for ring architectures, including virtual rings, where there are obvious main and alternate paths along the ring between any 2 points.
The protocol also breaks loops and therefore makes STP redundant. The ring protection is illustrated in figure 6.
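The loop-breaking behaviour of ring protection can be illustrated with a small model. In G.8032, one link (the ring protection link, or RPL) is kept blocked in the normal state, and it is unblocked when another ring link fails. The sketch below captures only that idea. The 6-node ring, the choice of RPL, and all function names are assumptions, and signalling and timers are omitted.

```python
from collections import deque

N = 6                    # nodes 0..5 on the ring (assumed; needs N >= 3)
RPL = frozenset((0, 5))  # link kept blocked in the normal state (assumed)


def ring_links(n):
    """All links of an n-node ring, each as an unordered node pair."""
    return {frozenset((i, (i + 1) % n)) for i in range(n)}


def blocked_links(failed=None):
    # Exactly one link is out of service at a time, so no loop can form:
    # the RPL when the ring is healthy, otherwise the failed link.
    return {failed} if failed is not None else {RPL}


def path(n, blocked, src, dst):
    """Breadth-first search over unblocked links; returns a node list or None."""
    adj = {i: [] for i in range(n)}
    for link in ring_links(n) - blocked:
        a, b = tuple(link)
        adj[a].append(b)
        adj[b].append(a)
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            hops = []
            while node is not None:
                hops.append(node)
                node = prev[node]
            return hops[::-1]
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


print(path(N, blocked_links(), 0, 4))                   # [0, 1, 2, 3, 4]
print(path(N, blocked_links(frozenset((2, 3))), 0, 4))  # [0, 5, 4]
```

In the normal state, traffic from node 0 to node 4 takes the long way around because the RPL is blocked. When link 2-3 fails, the RPL is unblocked and traffic takes the other direction.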
In both mechanisms, assuming that the protection is done between two external interfaces (UNI to UNI, UNI to ENNI, ENNI to ENNI), the EVC/OVC does not sense the protection switching, other than the short interval during which service frames are lost.
MPLS Fast Re-route (FRR)
MPLS FRR is a local protection mechanism in MPLS networks where the LSP can bypass a faulty node or link using a locally created bypass. This is achieved in under 50 msec and can be triggered locally upon node or port down status.
The Ethernet layer does not sense the bypass and the EVC that is carried over the LSP continues to flow normally, other than the short interval where all service frames are lost.