Designing VPC and Routing

I've been seeing some network problems lately at sites where the difficulty was getting the mix of VPC and routing designed correctly. There's plenty of room to make a mistake, and the situation is a bit confusing to most people. So I'm going to try to explain how to separate routing from Layer 2 (L2) forwarding with VPC's, so that the routing will work correctly. I'm hoping to help by explaining the problem situation you need to avoid as simply as I can, and showing some simple examples, with lots of diagrams. For a simple description of how basic VPC works, see my prior posting, How VPC Works.

Cisco has put out some pretty good slideware on the topic, but there are an awful lot of (too many?) diagrams. Either that's confusing folks, or people just aren't aware that VPC port channels have some design limitations: you can't just use them any which way, as you can with normal port channels (or port channels to a VSS'd 6500 pair).

The short version of the problem: routing peering across VPC links is not supported. (Adjacency will be established but forwarding will not work as desired.) The "vpc peer-gateway" command does not fix this, and is intended for another purpose entirely (EMC and NetApp end systems that learn the router MAC address as the source MAC in frames, rather than using ARP and learning the default gateway MAC address).

Let's start by repeating the basic VPC forwarding rule from the prior blog:

VPC Rule 101

VPC peers are expected to forward a frame received on a member link out any other member link that needs to be used. Only if they cannot do so due to a link failure is forwarding across the VPC peer link and then out a member link allowed, and even then, the cross-peer-link traffic can only go out the member link that is paired with the member link that is down.
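To make the rule concrete, here is a rough Python sketch of it as a yes/no decision. This is only a model of the behavior described above, not Cisco's implementation. The names Port, may_forward and paired_link_up are invented for the illustration, and the port-channel numbers are just sample values.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Port:
    name: str
    kind: str                      # "member", "peer_link", or "other"
    vpc_id: Optional[int] = None   # VPC number, set only for member links


def may_forward(ingress: Port, egress: Port, paired_link_up: Dict[int, bool]) -> bool:
    """Decide whether one VPC peer may send a frame from ingress out egress.

    paired_link_up maps a VPC number to whether the OTHER peer's member
    link in that VPC is currently up. (Hypothetical model, not NX-OS code.)
    """
    if ingress.kind != "peer_link":
        return True                # frame did not cross the peer link
    if egress.kind != "member":
        return True                # peer link to a non-member link is fine
    # Peer link to member link: allowed only if the paired link is down.
    return not paired_link_up[egress.vpc_id]


if __name__ == "__main__":
    peer_link = Port("Po1", "peer_link")
    to_c = Port("Po3", "member", vpc_id=3)
    to_d = Port("Po4", "member", vpc_id=4)
    all_up = {3: True, 4: True}
    print(may_forward(to_c, to_d, all_up))        # True: member in, member out
    print(may_forward(peer_link, to_d, all_up))   # False: peer link in, member out
```

Note that the decision only looks at the inbound and outbound interfaces. Nothing about the frame's headers enters into it.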

The same rules apply to routed traffic. Since VPC does no spoofing of the two peers being one L3 device, packets can get black-holed.

The Routing with VPC Problem

Here's the basic situation where we might be thinking of doing VPC and can get into trouble. Note I've been using dots for routed SVI's, just as a graphical way to indicate where the routing hops are. (No connection with the black spot in the novel Treasure Island.)

This is where we have a L3-capable switch and we wish to do L2 LACP port-channeling across two Nexus chassis. If the bottom switch is L2-only, no problem. Well, we do have to think about orphan (singly-homed) devices such as singly-homed servers, non-VPC VLANs, failure modes, etc., but that is much more straightforward.

All is fine if you're operating at Layer 2 only. 

Let's walk through what VPC does with L3 peering over a L2 VPC port-channel. Suppose a packet arrives at the bottom switch C (shown by the green box and arrow in the diagram above or below). The switch has two routing peers. Let's say the routing logic decides to forward the packet to Nexus A on the top left. The same behavior could happen if it chooses to forward to B. The router C at the bottom has a (VPC) port channel. It has to decide which uplink to forward the packet over to get it to the MAC address of the Nexus A at the top left.

Approximately 50% of the time, based on L2 port channel hashing, the bottom L3 switch C will use the left link to get to Nexus A. That works fine. Nexus A can forward the frame and do what is needed, i.e. forward out another member link. 

The other 50% or so of the time, port channel hashing will cause router C to L2 forward the frame up the link to the right, to Nexus B. Since the destination MAC address is not that of Nexus B, Nexus B will L2 forward the frame across the VPC peer link to get it to A. But then the problem arises because of the basic VPC forwarding rule. A is only allowed to forward the frame out a VPC member link if the paired link on Nexus B is down. Forwarding out a non-member link is fine. 

So the problem is in-on-member-link, cross-peer-link, out-another-member-link: no go unless paired member link is down. Routing does not alter this behavior. 
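The walk-through above can be played out in a few lines of Python. This is a toy model under stated assumptions: the CRC-based pick_uplink function merely stands in for whatever hash the real port channel uses, the IP addresses are made up, and delivered and drop_fraction are names invented for the sketch.

```python
import zlib

UPLINKS = ("A", "B")   # C's two port-channel members, one to each Nexus


def pick_uplink(src_ip: str, dst_ip: str) -> str:
    """Toy stand-in for port-channel hashing (real hashes differ)."""
    key = f"{src_ip}>{dst_ip}".encode()
    return UPLINKS[zlib.crc32(key) % len(UPLINKS)]


def delivered(next_hop: str, uplink: str, egress_is_member: bool,
              paired_link_up: bool = True) -> bool:
    """True if the routed packet makes it out of next_hop's egress link."""
    crossed_peer_link = uplink != next_hop
    blocked = crossed_peer_link and egress_is_member and paired_link_up
    return not blocked


def drop_fraction(flows, next_hop: str, egress_is_member: bool) -> float:
    """Fraction of flows black-holed when C routes them all to next_hop."""
    drops = sum(
        not delivered(next_hop, pick_uplink(src, dst), egress_is_member)
        for src, dst in flows
    )
    return drops / len(flows)


if __name__ == "__main__":
    flows = [(f"10.1.1.{i}", "10.2.2.7") for i in range(1, 201)]
    print("out a member link:    ", drop_fraction(flows, "A", True))
    print("out a non-member link:", drop_fraction(flows, "A", False))
```

With the egress being a member link, every flow that hashes to the "wrong" uplink is lost, which is roughly half of them. With a non-member egress link nothing is dropped, whichever uplink the hash picks.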

Yes, if there is only one pair of member links, you cannot have problems, until you add another member link. If you add a 2nd VLAN that is trunked on the same member links, inter-VLAN routing may be a problem. If you just do FHRP routing at the Nexus pair, no, the L2 spoofing handles MAC addresses just fine (using the FHRP MAC so no transit of the peer link is necessary). It's when your inter-VLAN routing is via an SVI on one of the bottom switches routing to a peer SVI on the Nexus pair that you will probably have problems.

You can have similar problems even if only one of the two Nexus switches is operating at L3, or has a L3 SVI in a VLAN that crosses the VPC trunks to the switch at the bottom. We will see an example of this later. 

Conclusion: it is up to us to avoid getting into this situation! That is, VPC is not a no-brainer: if you want to mix it with routing, you must design for that.

You can also do this sort of thing with two switches at the bottom of the picture, e.g. pair of N5K to pair of N7K's. Or even VSS 6500 pair to VPC Nexus pair. See also our Carole Reece's blog about it, Configuring Back-to-Back vPCs on Cisco Nexus Switches, and the Cisco whitepaper with details, http://www.cisco.com/en/US/prod/collateral/switches/ps5718/ps708/white_paper_c11_589890.html. VPC is allowed and works, but we need to design it to operate at L2 only. 

Drilling Down on VPC Routing

We are also OK if we use a FHRP with a VPC to get traffic from a VPC'd server to a pair of Nexii, and then route across non-VPC point-to-point links, e.g. into the campus core or WAN.  VPC does very well at spoofing L2, and the virtual MACs used with the three FHRP's allow direct forwarding out VPC member links by VPC peers. Routing to the core uses non-VPC non-member links, so no problem.

The problem in the L3 story above is that the frame is being forwarded at L2 to the real MAC not virtual MAC of A, and B is not allowed to do the routing on behalf of A. 
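The difference between the two cases comes down to the destination MAC, which a small Python sketch can show. The real MAC addresses below are invented placeholders, the virtual MAC is the usual HSRP version 1 form for group 10 (used here just as a sample FHRP), and crosses_peer_link is a made-up helper name.

```python
FHRP_VMAC = "0000.0c07.ac0a"   # sample FHRP virtual MAC (HSRP v1, group 10)

# Hypothetical "real" SVI MAC addresses of the two VPC peers.
REAL_MAC = {"A": "0000.aaaa.0001", "B": "0000.bbbb.0001"}


def crosses_peer_link(dst_mac: str, receiving_peer: str) -> bool:
    """Does a frame arriving at receiving_peer have to transit the peer link?"""
    if dst_mac == FHRP_VMAC:
        # Either VPC peer routes frames sent to the FHRP virtual MAC locally.
        return False
    # A frame to a real MAC can only be routed by the peer that owns it.
    owner = next(peer for peer, mac in REAL_MAC.items() if mac == dst_mac)
    return owner != receiving_peer


if __name__ == "__main__":
    # Server traffic to its default gateway: fine on either uplink.
    print(crosses_peer_link(FHRP_VMAC, "A"), crosses_peer_link(FHRP_VMAC, "B"))
    # Routing-peer traffic to A's real MAC that hashed to B's uplink.
    print(crosses_peer_link(REAL_MAC["A"], "B"))
```

Only the last case puts the frame on the peer link, and that is the case where the forwarding rule can then drop it.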

The next diagram shows how this typically bites us. If we're migrating from 6500's (bottom) to Nexus (top) and we are inconsistent, we can get in trouble. If our packet hits an SVI, is routed to Nexus B but sent via Nexus A, then Nexus B will not be able to route the frame again out the member link marked with the red X, to get to a L3 SVI on the bottom right switch D.

This might happen from datacenter to user closet, if you have L2 to a collapsed core/distribution Nexus pair, with some SVI's between old 6500 C and new Nexus switches A and B in the datacenter, and closet switches with SVI's on the same switches as the datacenter SVI's (switch D in the diagram). It might also happen if you have some VLANs with SVI's on datacenter access switches like C, and other VLANs on other datacenter access switches like switch D (perhaps even with all SVI's migrated to live only on the Nexus pair). It can even happen on one switch, where C and D are the same switch, and you're routing between VLANs via an SVI on C. (Same picture, just a little more cluttered because the green arrow and red X are on the link back to C.)

Summary: Making Routing Work with VPC

Here's the Cisco-recommended design approach, using my drawing and words. The black links are L2 VPC member links. The red links are additional point-to-point routed links.

The simple design solution is to only allow L2 VLANs with SVI's at the Nexus level across the VPC member links. If you must have some SVI's on the bottom switch(es) and some others on the Nexus switches, block those VLANs on the L2 trunks that are VPC members, and route them instead across separate L3 point-to-point links, shown in red in the above diagram. Of course, if you're routing say VLAN 20, there would be no point to having a routed SVI for VLAN 20 on the bottom switch and on the Nexus switches as well. 

The point-to-point routed interfaces do not belong to VLANs, so they cannot possibly be trunked accidentally over the member links, which are usually trunks.

When you have SVI's rather than routed interfaces or dot1q subinterfaces, you have to be aware of which VLANs you do and do not allow on the VPC member links. If you have many VLANs that need routing, use dot1q subinterfaces on the routed point-to-point links to prevent "VPC routing accidents". Or use SVI's and trunking over the point-to-point non-VPC links, but be very careful to block those VLANs on the VPC trunk member links.
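One way to stay aware of this is a simple audit of the VLAN lists. The Python sketch below assumes you have already collected, by whatever means, the VLANs with SVI's on the bottom switch and the VLANs allowed on the VPC member trunk, written in the familiar "10,20-22" style. The function names and the sample VLAN numbers are made up for the illustration.

```python
from typing import Set


def parse_vlans(spec: str) -> Set[int]:
    """Expand a VLAN list such as "10,20-22" into {10, 20, 21, 22}."""
    vlans: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


def vlans_to_block(bottom_svi_vlans: str, vpc_trunk_vlans: str) -> Set[int]:
    """VLANs routed on the bottom switch that are still allowed on the VPC trunk."""
    return parse_vlans(bottom_svi_vlans) & parse_vlans(vpc_trunk_vlans)


if __name__ == "__main__":
    # Hypothetical inventory: C has SVI's for 20 and 30; the VPC trunk allows 10-20.
    print(sorted(vlans_to_block("20,30", "10-20")))   # [20]
```

Anything the check returns is a candidate "VPC routing accident": a VLAN to remove from the member trunk and route over the point-to-point links instead.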

Using VPC to Buy Time to Migrate to L3 Closets

As you will have noticed in my recent blog, Simplicity and Layer 2, I like L3 closets. That generally means your L2 is mostly confined to the datacenter. No L2 problems out in the closets! 

Our present discussion is highly relevant if you are migrating from L2 to L3 closets. Several hospitals we are working with have had spanning tree problems (or risk). They wish to reduce the size of their L2 domains, and the associated risk, by moving to L3 closets. One way to tackle this is to drop Nexus switches in at the core or distribution layer (they are sometimes combined layers), and start out running VPC to all the L2 closets. That "stabilizes the patient" to buy time and stability for the cure, L3 closets.

If you whittle away at sprawling VLANs spanning closets, buildings and campuses, you can generally manage to clean up one closet at a time. Iterate for the next year or two. Painful, but much more robust! 

Consider a single closet switch that you're working on. You can get yourself to a situation where the SVI's are in the distribution layer Nexuses, say, and you have L2 VPC member trunks to the closet switch (now represented by the bottom switch in our above diagrams). When all the VLANs are single-closet-only VLANs, you can then un-VPC the uplinks to that one closet, turn them into point-to-point routed links, put the SVI's on the closet switch instead, and be done. If you want a slower transition, add separate L3 routed point-to-point links like the above red lines, and control which VLANs are trunked across the VPC member links. All it takes is organization and being clear about where  you're doing L2 and where you're doing L3 -- which I'd say should be part of the design document / planning. 

Another Example

One more real world example shows how it is easy to not see the potential problem. Suppose you have a router, e.g. an MPLS WAN router, and for some reason you have to attach it to a legacy switch at the bottom of the picture, as shown in the following diagram:

Why would you do this? In one case we've seen, the vendor router had a FastEthernet port, and the Nexus switch had no 100 Mbps-capable ports. Another is copper versus fiber ports and locations of the devices in question.

Suppose the uplinks are VPC members, and because of the VPC routing problems, the site is trying to make this work with just a VPC on the right Nexus switch, switch B. In the case in question, C and D were actually the same switch, but I'm presenting it this way since the diagram is clearer when I show two switches.

At the left we see a packet hitting a SVI in the leftover 6500 (which some sites would shift to being a L2-only access switch, and other sites would discard or recycle elsewhere in the network).

The bottom left switch SVI can route to other SVI's that are local. To get to the WAN router, the left switch needs to somehow route the packet via the top right Nexus, Nexus B. It turns out there is exactly one VLAN with SVI's on all three switches, which gives switch C a way to route to the rest of the network. Switch C therefore follows the dynamic EIGRP routing, by routing into the shared VLAN with next hop Nexus B.

In 50% of the flows, the packet goes via the left Nexus A, across the peer link, and thus B cannot forward it out the VPC member link to get to the router. 

Exercise for the reader: Consider traffic going the other way, from the WAN back to the datacenter. See the following diagram:

Does it work? If not, what goes wrong? Can you explain it? [Hint: there's a red X in the above diagram for a reason!]

Possible Solutions

(1) Attach the MPLS VPN WAN router to one or both Nexii directly. Note that dual-homing via the 6500 (bottom right) is a Single Point of Failure (SPoF), so connecting to only one N7K is no worse (or better). 

(2) Put the SVI for the router's VLAN on the bottom right switch, and convert the uplinks to L3 point-to-point. Or use dedicated point-to-point links for all routed traffic from bottom right to the two Nexii. Since point to point routed interfaces don't belong to VLANs, they can't accidentally be trunked over VPC member links. 

(3) Have no SVI's on the Nexii -- do all routing on the bottom switches. That actually works -- but doesn't help in terms of getting the routing onto the much more powerful Nexus switches, which is where you probably want it. 

Conclusion

Please don't draw the conclusion that you can't do routing with VPC. You can in certain ways. What you do not want is a router or L3 switch interacting with routing on VPC peers over a VPC port-channel link. You can route to VPC peers as long as you're not using a VPC port-channel, e.g. just a plain point-to-point link or a L3 port-channel to a single Nexus. If there is an SVI at the bottom (that is, not on the Nexus pair) for a given VLAN, block it from the member links and thereby force it to route over the dedicated routed links. In that case, don't allow the VLAN across the VPC peer link either: that link should only carry the VLANs that are allowed on the VPC  member links, and no others, no routing, nothing else.
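The last point, that the peer link should carry only the VLANs allowed on the member links, is easy to check mechanically. A minimal Python sketch follows, assuming you have the allowed-VLAN sets in hand. The function name, port-channel names and VLAN numbers are all invented for the example.

```python
from typing import Dict, Set


def peer_link_extras(peer_link_vlans: Set[int],
                     member_link_vlans: Dict[str, Set[int]]) -> Set[int]:
    """VLANs on the peer link that no VPC member link carries."""
    carried = set().union(*member_link_vlans.values())
    return peer_link_vlans - carried


if __name__ == "__main__":
    # Hypothetical allowed-VLAN sets for one VPC peer.
    members = {"Po3": {10, 20}, "Po4": {10}}
    print(peer_link_extras({10, 20, 99}, members))   # {99}
```

In this sample VLAN 99 would be the one to remove from the peer link and, if it needs routing between the peers, to move onto a separate routed link.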

You can also route over a VPC port-channel, as long as your routing peers are reached at L2 across the VPC but are not the VPC peers your VPC connects to. That is, routing peering across a L2-only VPC Nexus pair in the middle is OK.  

In the datacenter, stick to pure L2 when doing VPC, up to some sort of L3 boundary. When doing L3, use non-VPC L3 point-to-point links. If you have a pod running off a pair of L3-capable Nexus 55xx's and you feel the need to VPC some L2-ness through your Nexus 7K core, fine, just use dedicated links for the L3 routing. And when doing so, don't use SVI's, use honest to goodness L3 ports, that is, "no switchport" type ports. That way you cannot goof and forget to disallow any relevant VLANs across VPC member links that are trunks. 

Upcoming design consideration: don't VPC multi-hop FCoE. It's OK to VPC FCoE at the access layer, just don't do it beyond there. Why not VPC multi-hop FCoE? Among other reasons, it makes it far too easy to merge fabrics accidentally. That's a Bad Thing, definitely something you do not want to do! Also, you do have to be careful about FCoE with a 2 x 2 VPC -- that's covered in the Nexus course (now named "DCUFI"), which I'm teaching about once a month for FireFly (www.fireflycom.net).

Why Did Cisco Do It This Way?

I think the engineers expected everyone doing L3 to put it on separate links. It's not clear to me why they thought people would WANT to do that. Nor do they seem to have anticipated the confusion a lot of people have about SVI's and where the routing is actually being done (i.e. understanding it is too complex for the real world). It might also have something to do with the datacenter switch positioning of the Nexus products.

References

Cisco NX-OS Virtual PortChannel: Fundamental Design Concepts ...

https://supportforums.cisco.com/thread/2054000

https://supportforums.cisco.com/thread/2047031

Quote from that thread: "We don't support running routing protocols over VPC enabled VLANs."

 

 

Comments (27)
L3 link needed parallel to vPC peer link ?
written by Alois, November 06, 2011
Hello Pete,
I thought about DCI with L3 and L2 interconnections, like in Carole Reece's blog.
Extra L3 links parallel to the L2 vPC between datacenters are okay.
Are there situations where I also need an extra L3 link parallel to the vPC peer link? I think a point-to-point peering VLAN (SVI) trunked on the vPC peer link is okay.
I think when all server VLANs are located in both datacenters, there should not be a situation where the vPC loop prevention kicks in.
If I have separate server VLANs in each datacenter, e.g. L3 closets, then I think there would be a situation where I need the extra L3 link parallel to the vPC peer link.
VPC and DCI
written by Pete Welcher, November 06, 2011

I wouldn't do DCI with VPC, I'd rather do OTV, although admittedly it isn't quite mature yet. Better STP isolation and broadcast behavior.

If you do do DCI with VPC, separate routing onto point-to-point L3 links. Also, you always always want a L3 routed link between the VPC peers in parallel to the peer link. Well, I do, anyway. I don't want any possible problems with traffic crossing the peer-link.

If you do VPC between datacenters, be careful to not have any situations with SVI's for the same VLANs at both ends of the VPC port-channel on the actual VPC peers. Keep the VPC port-channel L2 only.
...
written by Robert, November 08, 2011
Hi Pete,

good article, thank you.

Can you clarify the following statement "If our packet hits an SVI, is routed to Nexus B but sent via Nexus A, then Nexus B will not be able to route the frame again out the member link marked with the red X, to get to a L3 SVI on the bottom right switch D"

I understand that the packet is routed first from SVI on C to SVI on B and then from SVI on B to SVI on D.
In this case B will do an L2 packet rewrite before switching it out of an interface towards D. At that point the packet is different (it has a different L2 header) compared with the packet which arrived from A on the peer-link, hence vPC's loop prevention should not kick in and B should switch the packet towards D. Why doesn't this work?

Regards,
Robert
Response to Robert
written by Pete Welcher, November 09, 2011
The frame arrives at B traveling at L2 via A. It arrives on the VPC peer link. The way Cisco coded it, anything arriving on the peer link cannot be forwarded out any member link from B, unless the matching (paired) link on A is down.

The decision has nothing to do with outbound header, etc. It is based on inbound and outbound interfaces. The frame came in the peer link, lookup says it should get sent out a member link whose paired link is up, therefore the packet gets dropped. It never makes it to being readied to be sent out the outbound interface, no header rewrite etc.
...
written by Robert, November 09, 2011
" The decision has nothing to do with outbound header, etc. It is based on inbound and outbound interfaces. The frame came in the peer link, lookup says it should get sent out a member link whose paired link is up..."

This makes sense if B is doing L2 only. In the case where B is doing L3, which I believe is the case in your example (B is a L3 next hop from C's perspective), the decision can't be based only on inbound and outbound interfaces. The next hop (L3) might not be connected via vPC. In that case B should make a forwarding (L3) decision based on the information in the FIB, rewrite the L2 header and switch the frame out of an appropriate interface even if the packet from C originally arrived on the peer link. Am I correct, or am I still missing something?
Response to Robert (2)
written by Pete Welcher, November 09, 2011

Good discussion point. I've gotten basically the same reaction from some other people, including some very sharp CCIE's I know who had not looked closely at VPC and routing (yet).

The L3 decision determines the outbound interface. If that outbound link is a VPC member link, the same forwarding rule is applied as for L2: the frame (packet) came in the peer-link, the member interface that is VPC-paired to the chosen outbound link is up, therefore drop the packet.

I think what's bothering you is whether that makes sense, in terms of what it should ideally be doing. Or "why the heck would they do it that way?" I personally don't think it does. It strikes me that in making a L3 decision, the fact that the inbound interface was the peer-link (or a flag indicating that is set in some internal data structure) should become irrelevant. So you and I would have coded the behavior differently.

What we need to understand to work with VPC and routing is the way the Cisco team programmed it: what they designed it to do. It might be a Phase 1 stage where Phase 2 is smarter. I haven't seen anything clear in the roadmaps in that regard. I have heard a hint about something deeper involving VPC, but it was vague enough I'm not going to try to speculate in public on what it might be or What It Means.

I get the feeling that VPC was really intended only for L2 links (well, it is) but that important message got lost along the way (and is still muddled in with vast amounts of other detail when people read about VPC). Maybe to the Cisco folks involved (tech, marketing) it is obvious and they think other people know it already. To be fair: maybe it is obvious and you, I, and my small sample of others are all suffering the same form of not getting it?

I'm seeing real world evidence that people have not heard that key piece of information. That's why I wrote this blog article!
An interesting scenario
written by Robert, November 13, 2011
I have created a similar topology to the first diagram; the only difference is that A doesn't have an SVI for VLAN 10 and C has an SVI for VLAN 10. VLAN 10 is active and allowed on all trunks. When I ping from C to B (at L2 the traffic goes via A) I am not getting any responses from B, as expected. When I shut down the link between A and C, then C is getting replies from B, as expected. Now, the interesting part. In the first case, with all links up and C not getting any replies from B, after I create an SVI for VLAN 10 on A I can ping from C to B. When I shut down the SVI on A, I can't! Why?
Why not peer gateway for rescue?
written by Volker, November 15, 2011
Peer gateway will forward frames with the peer switch's MAC address. It works for storage systems that send packets back to the MAC address they received the request from. This is the physical address of the switch, not the FHRP address where they're supposed to send it. The router in step 1 and 2 will do exactly that, because it learned the destination from both switches and will use the configured IP address and physical MAC as the next hop. So why doesn't peer gateway kick in here, so that Nexus B forwards the packet out?

I know, this still does not solve problems with dynamic routing protocols, as the neighborship must be forwarded across the peer link. That's a different feature I guess.
Response to Robert
written by Pete Welcher, November 15, 2011

I like the testing idea. The first part of your test, as you note, went as expected. The second part was surprising. I'm not sure why it turned out that way -- I have no quick answer to that, and would have to work with you to use debug and so on to figure out what's going on (check forwarding tables?). Did you leave your ping going for 30-45 seconds in case of some delayed convergence / switchover event? Even if it did start working, I'd be scratching my head over why. What's coming to mind is that the problem might be the return path somehow.
Response to Volker
written by Pete Welcher, November 15, 2011
Very good comment/question. I had exactly the same question when I started this blog.

The best answer I could find was something about multicast routing getting screwed up if you do this. I would roll that up into "peer gateway might seem to work but since it isn't supported by Cisco, you might get surprises". I couldn't find a lucid explanation of why it might not otherwise solve the VPC routing issue, after spending a few minutes doing various Google searches. I'd love to have one.

I worry whether there's a performance issue with peer-gateway or something like that as well. Since I get the strong feeling that the Cisco folks who designed this don't like traffic on the peer link, does that cause a technical bias on their part in not viewing "peer gateway" as a solution to the L3 issue? Possibly, I suppose.

Anyway, if anyone reading this has a better answer, I'd love to hear it!

Hmm, I just found a partial answer, see http://bradhedlund.com/2010/12...es-and-no/. The peer gateway may or may not solve the problem with peering to the Nexii, but it doesn't solve it in some external situations.
Response to Pete
written by Robert, November 15, 2011
I wasn't able to figure it out yet.

When the A's SVI is up, A knows the MAC of the B’s SVI (first entry), note the static keyword:

tim-n5k-1# sh mac address-table vlan 211
* 211 547f.ee0a.bf3c static 0 F F Po1
* 211 547f.ee0a.c081 static 0 F F Router
* 211 c471.fe8b.cbff dynamic 120 F F Po3

When I shut the A's SVI, A sees only the MAC of the C’s SVI learned when a frame with icmp echo came from C.

tim-n5k-1# sh mac address-table vlan 211
* 211 c471.fe8b.cbff dynamic 60 F F Po3

This shouldn't really make any difference, if A doesn't know the MAC of the B’s SVI it should flood it on all trunks which allow vlan 211, in this case (when the SVI on A is shut) there is no response from B (as expected, vPC's loop prevention), A never learns B's MAC address.

When A's SVI is up, B is getting an ICMP echo and is responding to it:

tim-n5k-2#
2011 Nov 16 12:29:05.345962 netstack: [4650] (default) Rcvd packet on Vlan211 (mbuf_prty 5): s=1.1.1.1, d=1.1.1.22, proto=1 (icmp), ICMP_ECHO, tos/dscp=0x0/0x0, ip_len=100, id=02ee, ttl=254

2011 Nov 16 12:29:05.346223 netstack: [4650] (default) Send packet (mbuf_prty 5): s=1.1.1.22, d=1.1.1.1, proto=1 (icmp), ICMP_ECHOREPLY, tos/dscp=0x0/0x0, ip_len=100, id=02ee, ttl=255

If you want to have a look at this, I can give you a console access to all three devices.
Robert's followup
written by Pete Welcher, November 15, 2011

Hmm, why would A know B's MAC as static? Is "vpc peer-gateway" configured? If it is, A is responding to the MAC that really belongs to B.

When you remove the SVI on A, traffic goes at L2 to A and across the peer-link. If B responds, maybe the traffic gets switched across the peer link due to that MAC table entry, but cannot thereafter be forwarded out A's member link to C. This behavior probably depends on the MAC aging and ARP timers, since B would have to already know C's MAC and have it in the MAC table. I'd want to test this conjecture.

If peer-gateway is not configured, then I'd expect the same behavior even when A's SVI is up -- ping failure whether A's SVI is up or not. The ping probably gets to B: goes to A and across the peer link. It seems like the problem must be B's reply not going across the peer link and out a member link.

It's pretty late (interesting but long day teaching Nexus class) so I may be missing something here ...
Response to Pete
written by Robert, November 15, 2011
vpc peer-gateway is not configured; vpc config from A below:

version 5.0(3)N1(1c)
feature vpc

vpc domain 10
  role priority 10
  peer-keepalive destination 192.168.200.21 source 192.168.200.22

interface port-channel1
  vpc peer-link

interface port-channel3
  vpc 3

When the A's SVI is up and ping from C to B works, B is receiving it via a peer link from A, but sends the reply on the vpc member port to C. Unfortunately SPAN doesn't work on ports belonging to a port-channel, so I can't really wireshark it.

Also, I have created a few other VLANs on A and B, and every time they learn each other's SVI MAC addresses as static, without any prior traffic. Both boxes are 5548UP running the latest code. Time to read the release notes?
Follow-up to my previous post
written by Robert, November 15, 2011
From the latest release notes:


Layer 3 Limitations
Asymmetric Configuration
In a vPC topology, two Cisco Nexus 5000 switches configured as vPC peer switches need to be configured symmetrically for Layer 3 configurations such as SVIs, Peer Gateway, routing protocol and policies, and RACLs.

Note: vPC consistency check does not include Layer 3 parameters.
Reply to 2 comments by Robert
written by Pete Welcher, November 16, 2011

Thanks, Robert!

I'm not sure why the static entries, I haven't seen them documented. VPC peer-gateway was my first guess, but apparently that's not it.

Re your 2nd comment, Cisco says "don't do this, but not why". That's partly helpful of them.
Docs for the Nexus 5500
written by Volker, February 13, 2012
I found some more information for the Nexus 5500 L3 module.
http://www.cisco.com/en/US/doc...l#wp999181

Seems to be different between the Nexus 7k and 5k5. Here they mention, that control packets are forwarded to the other peer switch. OSPF or BGP are nowhere mentioned in the document, but the TTL of 1 indicates an IGP. Though this is not optimized, it is supported. Data traffic forwarded to the next hop via wrong member link should be mitigated by the peer-gateway command.
Response to Volker
written by Peter Welcher, February 14, 2012
That Cisco link says you can use peer-gateway to solve the L3 problem. This keeps coming up (most recently something a Cisco SE told a customer).

I researched it again a couple of weeks ago and found the statement that using peer-gateway for routing peering was "unsupported", which flat out contradicts your fairly recent link. So I don't know quite what the best approach is. Here are a couple of URL's supporting the unsupported theory:

http://www.gossamer-threads.co...nsp/147293

http://www.cisco.com/en/US/doc...mkr1013509

I suppose we could look at dates of documents etc. since things might have changed.

What worries me here is that the issue might be subtle, i.e. performance problems for peer-gateway (invisible until traffic gets large), or the stated multicast problem.

To add to the mix, the "Layer 3 Backup Routing VLAN" feature seems somehow related, but the explanations I'm finding (of why the commands are there or what the feature is intended to do) make no sense whatsoever to me. See for instance http://www.cisco.com/en/US/doc...l#wp374398. (How can a switch be a gateway for its peer if you exclude the VLAN between them from the peer-link? Does this mean you carry it instead on a non-vPC peer-link? If so, why don't they say so?)
peer-gateway
written by Volker, February 14, 2012
The "peer-gateway exclude" seems to be tailored for the transfer VLAN on the routing link between Nexus peer switches. Brad Hedlund describes that in figure #4 in his post (link above in your post). The description of peer-gateway is to L3 forward packets destined to the peer's MAC address - exactly what a connected OSPF router would do. All they'd have to do is to have an exception to the drop rule for IGP protocol traffic, wouldn't they?

I also think there is an increasing gap between supported and working in DC designs in general.
Reply to Volker
written by Peter Welcher, February 14, 2012
You know, that link from Brad looks to me like what he's saying is just plain wrong. The issue as I understand it is that the routing forms an adjacency with both peers on what the downstream thinks is a L2 port-channel. It may only see a peer's MAC on one of the 2 port-channel links, but enters it as an adjacency on the port-channel interface, in the adjacency table. Well, when a packet comes along, L3 forwarding tries to send out the vPC port-channel to the selected routing next hop, but then port-channel hashing sends 1/2 the traffic up the "wrong" link, and then L2 forwarding tries to deliver to the routing peer "real" MAC across the peer-link.

I agree exclude applies to removing VLANs from the peer-link. What escapes me is why you'd want to do that, since how can vPC function properly in that case? And the trunk allow command gives you a perfectly good way to remove non-member-link VLANs from the peer-link. So something's missing in the explanation of the exclude command (or obvious but I'm not seeing it).

I agree that the drop rule seems to be the problem, and at least at first glance, the fix would be to just allow traffic on the peer-link to be forwarded at L3. I.e. set the "dirty / no-forward" bit or whatever, but then ignore it when an L3 lookup is being made. Maybe there are corner cases where that creates looping. I find myself wondering a bit uncharitably if the programmers thought vPC all the way out beforehand, or perhaps charged into it thinking all datacenters were L2 only, and then were surprised by the customer problems. I imagine the peer-gateway command got added that way, in response to problems with EMC and NetApp gear that somewhat violates standard behavior.
After-thoughts to above.
written by Peter Welcher, February 14, 2012
Upon more consideration, maybe Brad's not wrong. What he's saying fits with my explanation. And when you're at L2, the whole adjacency thing gets a little ... complex. The crux either way is some traffic to routing peer going across the peer link and being verboten.

The charitable interpretation of why the drop rule exists might be that either (a) there are complications, or (b) handling it might have required messier EPLD code. And avoiding complexity is sometimes a good thing (complexity leads to more bugs).
vpc peer-gateway exclude revisited
written by Peter Welcher, February 15, 2012
In the light of morning and fresh coffee, the documentation about "vpc peer-gateway exclude VLAN" is making a bit more sense. It seems simply to specify which VLANs do not do vpc peer-gateway, per "The peer-gateway functionality is not enabled for those VLANs specified in the exclude VLAN list".

The part above that about "Layer 3 backup routing VLAN" is what still doesn't make sense. I'm guessing that this is a VLAN running L3 support for the problematic EMC or NetApp devices. But the bullet points I'm seeing in the N7K guide sound like you must put that VLAN in the exclude VLAN command, which sounds backwards / wrong. The other option that comes to mind is that it is talking about a routing VLAN on a non-vpc-peer-link link, and that makes more sense ... Referring to other Cisco docs than the N7K 5.1 Release Notes, that appears to be the intended meaning of the term. I'm guessing the tech writer or someone muddled together two concepts:

(1) You can use the "vpc peer-gateway exclude VLAN" command to not do peer-gateway for certain VLANs on the peer-link.

(2) When you have an L3 VLAN carrying routing on a non-peer-link trunk, you want to exclude that VLAN from being shut down on the vPC secondary in case of a peer-link dual-active situation. That's what the "vpc exclude interface-vlan" command is for.
more vpc peer-gateway -- yay!
written by Tom K, March 24, 2012
@ Peter

"Layer 3 backup routing VLAN" to me simply says it's a backup route in case all your uplinks fail. So if 7K-A has 2 ECMP uplinks, both of which fail, it needs a backup route through 7K-B to still be able to handle northbound traffic it receives from all its downlinks (orphaned ports or vpc-member ports).

So, in essence, peer-gateway WITHOUT exclude will handle frames locally that were destined to the peer's MAC. This should take care of data plane functions, but doesn't really address control plane functions, which is why Cisco says "good luck, we don't support it"... i.e., OSPF peering may work and it may not (think OSPF unicast packets destined to the IP of 7K-B arriving on 7K-A). Peer-gateway WITH exclude will say for VLAN 10 (for example), do NOT handle frames destined to the peer's MAC, and instead send them on over. This way the TTL of the packet is not decremented when sent to the vPC peer, and it will ultimately get routed fine. However, this is meant for OSPF/IGP peering ONLY between the 7Ks, not to an external router.
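
As a rough illustration of the behavior described in this comment, here is a small Python model of what one vPC peer does with a frame addressed to the other peer's MAC. It is only a sketch of the comment's interpretation, not of actual NX-OS internals: the `Frame` fields, the function name, the result labels, and the sample MAC and VLAN numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    vlan: int
    dst_mac: str
    ttl: int


def handle_on_peer_b(frame, peer_a_mac, peer_gateway=True,
                     excluded_vlans=frozenset()):
    """Model 7K-B receiving a frame; returns (action, TTL after 7K-B handled it)."""
    if frame.dst_mac != peer_a_mac:
        # Not addressed to the peer's MAC, so peer-gateway is not involved.
        return ("not_for_peer", frame.ttl)
    if peer_gateway and frame.vlan not in excluded_vlans:
        # peer-gateway without exclude: B routes the packet on A's behalf,
        # which costs one TTL hop.
        return ("routed_locally_by_B", frame.ttl - 1)
    # peer-gateway off, or VLAN excluded: B bridges the frame across the
    # peer link unchanged, so the TTL is intact when A receives it.
    return ("bridged_to_A", frame.ttl)


MAC_A = "0000.aaaa.0001"  # hypothetical MAC of 7K-A

# A TTL-1 unicast packet (typical of routing protocol traffic) on VLAN 10.
hello = Frame(vlan=10, dst_mac=MAC_A, ttl=1)
print(handle_on_peer_b(hello, MAC_A))
print(handle_on_peer_b(hello, MAC_A, excluded_vlans=frozenset({10})))
```

In the first call the TTL reaches 0 on 7K-B, which is one way to read why routing protocol unicasts can fail with plain peer-gateway. In the second call the VLAN is excluded and the packet arrives at 7K-A with its TTL untouched.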


I do agree with you that there may be hidden hw limitations, since it's required for F1 linecards, and optional for M1.
response to Tom K
written by Pete Welcher, March 26, 2012
Ok, thanks for making sense of it. "Backup routing" wasn't parsing for me. I always design a separate routed pt-pt VLAN or link for backup routing, on a link other than the vPC peer-link, so as to preserve routing via the peer if there's no other way to the core.

Good point re control plane functions. If A and B are peers, and C sends to phy-mac-A but up the link to B, does B process it? (One hopes not.) Does B forward it selectively at L2 to A? (One doubts it, that code could get rather complex.) So you could definitely get odd routing protocol behavior, perhaps where multicast-based updates work but unicast routing fallbacks don't.

Peer-gateway with exclude just seems to me to not work. The whole point is peer-gateway avoids sending traffic to mac-A across the peer link from B. If you do that for some routed VLANs, they break.

Ok, if you do exclude a backup pt-pt VLAN across only the peer link, then I could see that allowing for routing peering on the peer-link. I just think it safer to do it elsewhere, even at the not cheap cost of burning another 10 Gbps port. It's perhaps a pay now or it bites you later situation? Let's chalk this up as somewhat unclear documentation (aka "a picture is worth quite a few words") and puzzle solved. Thanks!
...
written by Patrick G, April 08, 2013
Hi,

I have one question regarding HSRP and vPC.

Can you have a non-Nexus switch attached via vPC to two 7Ks and have HSRP running on all three devices?
All routing is configured statically.

I'm still not sure what the N7K does with multicast packets (especially HSRP packets) that it receives on vPC member ports.
Reply to Patrick
written by Peter Welcher, April 16, 2013

I'm not sure why you'd want a 3rd switch running HSRP. It seems like it should work as an ultimate fallback (I'd make it the lowest HSRP priority). Test it? Static routes to HSRP peers don't thrill me. Static routes period. I'm a fan of dynamic routing, even to / through firewalls (with controls). The point being that dynamic routes assure liveness and outage detection, something static routes don't do. Getting static routes to work properly in a dual site DR or business partner situation gets messy, fast. Dynamic routing done well can require less maintenance.

I would hope that N7K receiving HSRP multicast on member ports is fine, whether received that way or on the peer link. For IP multicast of content, DR selection and Forwarder selection should suffice -- there's no clear reason IPmc should need to cross the peer link other than some combination of member ports being down.
VRF routing
written by StevenJ, May 06, 2013
I need to route between VRFs on a pair of N5Ks running a vPC peer-link. How can I achieve this? Can I connect a cable between the two Nexus switches, configure Layer 3 ports, and run OSPF between them?
Reply to StevenJ
written by Peter Welcher, May 07, 2013

Yes. Run a cable, make it a routed link. Or route using a VLAN that is only on the link, a "point-to-point" VLAN for routing. By doing the latter, you can also trunk non-vPC VLANs on the link as well. That way hosts in the non-vPC VLANs are not subject to the orphan ports rules.
