The Missing Link
Thursday, 04 November 2004 21:00
Introduction
Hello again! Last month we discussed some background info about Certificate Authorities. I had planned to write about lab work with the Cisco IOS Certificate Authority for this month's article. However, I didn't look ahead, and the Feature Navigator now tells me this feature is supported on the 2620 XM but not the 2620. It's the usual Cisco IOS image size sort of thing: my convenient lab router has enough RAM (64 M) but not enough flash (16 M, needs 32 M). Time to punt (seeing as it is football season as I write this). So this month we'll talk about something else.

I'd like to thank Chesapeake Netcraftsmen's David Yarashus for this month's topic. He and I have been working on designs for several large enterprise and government organizations that use 3 or 4 layers of campus hierarchy. We've had some good discussions about whether distribution layer switches should be linked. Since he and I have both recently had fun evenings troubleshooting situations where the lack of a link was part of the problem, it seems like a timely topic. It may start off sounding a bit trivial, but I hope by the time we're done you'll see that it really does matter! And I do hope that along the way I manage to make it interesting as well.

The rest of this article discusses the question at hand, and then goes through some Real World situations where we've seen this problem arise.

The Basic Setting

To explain what the fuss is about, see the following figure. Recall that L2/L3 Hierarchical Design uses L3 switches at the distribution layer, which might serve the entire building, a group of floors, part of the building, etc. This approach has the virtue of modularity, so that you can use a cookie-cutter design.

The big virtue of Hierarchical Design is limiting the size of spanning tree domains, so that L2 issues, broadcast storms, etc. don't affect too much of your network.
My coworkers and I now see security folks and some others recommending that STP (Spanning Tree Protocol) be disabled on user and server ports. If you've ever had to troubleshoot a Spanning Tree loop, you'll know why we strongly advise that you leave STP running on access ports. Think of it as insurance against having a really bad day (or week!). We've seen 3 or 4 enterprise melt-downs now. They aren't pretty, it takes considerable time to track down the cause, and they can be career-impacting events. We've seen too many shops say "we'll be careful and so we don't have to worry about spanning tree loops". Well, other people may connect or rearrange things. They might not be as careful. And it's too easy to have accidents. Conclusion: don't disable STP on switches!

OK, if you have a couple of hundred users on two to four switches, this all may seem a bit abstract to you. I do recommend using a two-layer hierarchy there, with say one or two core L3 switches and a bunch of access switches, sized to fit the number of users, budget, and other needs. The key thing to realize in smaller networks is that networks tend to grow, sometimes quickly, and if you start out modular, you won't have to spend time reworking things. We've seen some ad hoc and daisy-chained switch networks that "just kept growing" over time. They may work, but they can also take substantial time to administer.

As more and more sites use L3 switches in wiring closets and elsewhere, the traditional link between distribution layer switches seems to be vanishing. We also see sites with L2 closets not using the link, which can be a rather bad idea as well, unless you have amazing levels of internal discipline. The next section explains why.

VLAN Black-Holing

For this section, please refer to the following figure.
Suppose the "Missing Link" isn't there. In this story, the access switches are Layer 2 only. Suppose the blue links are, say, VLAN 10, all in one subnet. Perhaps all the users on the connected access switches are in VLAN 10 as well. The two distribution switches are connected to that subnet. Since they are L3 switches (routers), they advertise the connected subnet to RON (the Rest Of the Network). The problem I'm about to describe can happen even if VLAN 10 only goes to one closet, as long as some other port somewhere on a distribution switch is in VLAN 10.

Suppose a failure occurs, as indicated by the red X. Because another port on switch A is in VLAN 10 and is up, the L3 interface VLAN 10 is also considered to be still up. So switch A continues advertising that subnet to RON. Switch B also advertises that subnet, since it too is connected to VLAN 10. As packets are sent from RON back towards users, the packets may go to A or B. But if they go to A, they have no way within VLAN 10 to reach switch C; VLAN 10 has become discontiguous (not all in one piece). In that case, A has to drop the packets. This is known as "black-holing": the packets go into A and never come out. Even after the user MAC address ages out, flooding the packets through the parts of VLAN 10 that A is connected to doesn't reach switch C. Eventually the ARP entry ages out, and the ARP broadcast from A also fails to reach C.

The diagram was updated 1/16/2005 for clarity, since either a double outage is needed to create the problem, or there has to be only one closet switch (plus some other port on switch A) in VLAN 10. Even with a single outage in the above picture, you have traffic to C going via one of the other closet switches and switch B, which is undesirable. Better to add the dotted link!

Usually we design for this setting with the Missing Link as a trunk between A and B. We then allow VLAN 10 across the trunk. Consider our failure scenario now.
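The failure scenario, with and without the A-to-B trunk, can be sketched as a reachability check over the links that carry VLAN 10. This is only a toy model under stated assumptions, not anything a switch actually runs: the single-closet topology, the extra port "srv" on switch A, and the helper names (reachable, svi_up, black_holed) are all hypothetical.

```python
from collections import deque

def reachable(links, start):
    """Return every node reachable from start over the given VLAN links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def svi_up(links, switch):
    """Assumption: the VLAN interface stays up while any port in the VLAN is up."""
    return any(switch in link for link in links)

def black_holed(links, switch, target):
    """The switch still advertises the subnet but cannot reach the target in the VLAN."""
    return svi_up(links, switch) and target not in reachable(links, switch)

# Hypothetical VLAN 10: closet switch C dual-homed to A and B,
# plus one other VLAN 10 port on A (a server, "srv").
vlan10 = {("A", "C"), ("B", "C"), ("A", "srv")}
after_failure = vlan10 - {("A", "C")}          # the red X
with_trunk = after_failure | {("A", "B")}      # the Missing Link, as a trunk

without_link_result = black_holed(after_failure, "A", "C")
with_link_result = black_holed(with_trunk, "A", "C")
print(without_link_result, with_link_result)
```

In this model the first result is True (A keeps advertising the subnet but cannot reach C) and the second is False once the trunk carries VLAN 10 between A and B.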
When the packet arrives at A, A still has a path within VLAN 10 via the trunk, even after the outage. Flooding frames in the VLAN still reaches C. After STP settles, packets from the user will probably update the MAC tables in A, B, and C, providing an optimal L2 path from A to C and the end user.

Yes, you can plan to absolutely never ever put more than one port on a distribution layer switch into any closet/user VLAN. But will everybody faithfully do that? Will new employees get the word and understand the importance? If somebody fails to follow the plan, you won't notice until there's an outage and you get to figure out why the user lost connectivity, despite the redundant connections. Or maybe there's a department server you hook up to the distribution switch and put into the user VLAN. In short, planning this way fails to tolerate exceptions. And while we try not to have to create them, some business or networking situations require exceptions, time after time.

By the way, this can also happen with L3 access switches if some other port on the distribution switch gets put into the same VLAN as the link to the L3 switch. That's why with the newer Cisco code we prefer to make switch-to-switch links routed links, i.e. put the address on the Gig interface rather than on a VLAN interface.

Please note that I still do believe in portfast (and specifying non-trunking) for user ports. Portfast short-circuits some of STP startup. BPDU filtering, by contrast, disables the ability of STP to function, which is almost the same thing as turning it off.

Summarization Misses That Link!
Change the story now. For now, suppose the routing protocol is OSPF. Suppose some or all of the access layer switches are L3 switches, i.e. routers. Suppose the distribution switches are OSPF Area Border Routers (ABRs). This is commonly done to allow the subnets in the building or L3 module to be summarized outside the building. Suppose you're summarizing like that. Now suppose a link failure occurs. You may see traffic following a path like that in the following picture.
What happened is that the packet arrived at switch/router A due to the summary route. Router A learned the specific subnet route from D, which learned it from B. And B is connected to the relevant subnet, if switch C is L2-only. So traffic takes the indicated round-about path. This can be particularly disconcerting if one intended all the access switches to be non-transit. It gets worse if the volume of traffic clobbers switch D's links, especially if D happens to connect to the mainframe (as in one outage we recently saw).

Nothing said above is really OSPF-specific, other than the picture. So the above might in fact also happen with EIGRP, assuming you're doing route summarization at switches A and B (sort of an EIGRP ABR setting). The problem is related to the summarization, not the routing protocol.

The reason the story started out with OSPF is that OSPF can take some really weird paths when there's an outage. The reason is OSPF's routing rules:
The way to prevent OSPF weird paths is to link A and B in the picture, preferably with a link on each side of the area border. That way traffic can get from A to B within area 0 and also within area 5. What about with EIGRP? It seems like if you're summarizing, it's a good idea to have a link within the "area 5" part of the above picture, i.e. a link between A and B that does not summarize the building routes. And make any routers in the "area" EIGRP stubs (non-transit routers).
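To make the round-about path concrete, here is a minimal sketch of hop-by-hop forwarding with longest-prefix match after the link failure. It is an illustration only: the prefixes (10.5.0.0/16 as the building summary, 10.5.10.0/24 as the user subnet), the routing table contents, and the function names are assumptions made up for this example, not taken from the network described above.

```python
import ipaddress

def lookup(table, dst):
    """Longest-prefix match: return (network, next_hop) or None if no route matches."""
    addr = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(prefix), next_hop)
               for prefix, next_hop in table.items()
               if addr in ipaddress.ip_network(prefix)]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)

def trace(tables, start, dst, max_hops=8):
    """Follow next hops from start until a router is connected to dst's subnet."""
    path = [start]
    node = start
    for _ in range(max_hops):
        hit = lookup(tables[node], dst)
        if hit is None:
            path.append("drop")
            return path
        next_hop = hit[1]
        if next_hop == "connected":
            return path
        path.append(next_hop)
        node = next_hop
    path.append("loop?")
    return path

# Hypothetical tables after the failure. RON sees only the summary and
# happens to pick A; A knows the specific subnet only via D.
no_link = {
    "RON": {"0.0.0.0/0": "Internet", "10.5.0.0/16": "A"},
    "A": {"10.5.10.0/24": "D"},
    "D": {"10.5.10.0/24": "B"},
    "B": {"10.5.10.0/24": "connected"},
}
# Same failure, but with a non-summarizing link between A and B.
with_link = dict(no_link, A={"10.5.10.0/24": "B"})

path_no_link = trace(no_link, "RON", "10.5.10.25")
path_with_link = trace(with_link, "RON", "10.5.10.25")
print(path_no_link, path_with_link)
```

Under these assumed tables, the first trace goes RON, A, D, B (access switch D becomes transit), while the second goes RON, A, B. The summary hides the failure from RON in both cases; the A-to-B link just gives A a sensible place to send the traffic.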
We've seen a lot of sites going to "passive-interface default" for large L3 switches. This keeps the switch from forming a lot of OSPF or EIGRP adjacencies. It is a nice control for enterprise switches where most interfaces connect to user or server VLANs: you un-passive only the interfaces that connect to the outside world or to other routers. If you do this, you do have to remember to explicitly "no passive-interface ..." for the link between A and B.

HSRP and What It Misses

Now we get to a situation I ran into recently at 4 AM. The following figure illustrates the problem.
The network in question happened to have an MPLS VPN core running over optical gear. While those added complexity and other factors to troubleshoot, they're irrelevant to the present story. A chain of events led to no OSPF adjacencies over the link with the red X (and in one or two other places). My guess at the time was stuck OSPF state due to configuration changes, probably the removal of OSPF authentication. (The order in which you delete the commands may matter in some IOS releases.) Router A connected to a GSR, and I didn't have permission to bounce the OSPF process on the GSR.

The first focus was getting up and running. Router B was happily talking OSPF to the MPLS core network. It had an MPLS VPN configuration error at the time, but that's not really relevant here. For security and simplicity, the Customer Edge routers had static routes, and we didn't have access to them at the time because another organization controlled them. But I guessed that the design had the HSRP address as next hop, and it later turned out that the static routes did in fact forward traffic to the HSRP primary, A. Some voice gateways in a separate VLAN had B as HSRP primary, and I was told that they were actually happily forwarding.

The problem was that the customer static routes were forwarding to A, as shown by the blue arrows. But A had no routes to anywhere but connected subnets, so it was dropping packets for other destinations. It had no way to pass them to B, because there was no OSPF on the customer side of routers A and B.

The solution at the time was to change the HSRP priorities in A and B to make B primary and allow B to preempt A. That swung the traffic from the customer edge routers over to B and got the customer up and running. That's how I determined that the guess about the HSRP address being the static route next hop was correct.

The interesting thing to me here is the design point it makes. There are two design methods that come to mind for avoiding this black-holing problem:
The other approach would be to use HSRP tracking. In the past, this feature allowed you to configure HSRP to track the state of a link. If the link went down, the HSRP priority would be lowered, allowing the other router to become the primary for HSRP. In this particular incident, the routers considered the A to Core link to be up, so link tracking alone would not have triggered failover. In fact, we used that link to telnet to A from the NOC (telnet to the core GSR, then to the directly connected interface).

However, HSRP can now track the state of objects. So loss of routes or connectivity could be used to trigger HSRP failover. In the case above, the customer routers needed to establish a GRE tunnel to run IBGP over (I'm not going to go into why). So loss of a route to the other tunnel endpoint might have been an interesting object to track. Tracking routing state would not have sufficed: router A thought OSPF was active but saw no neighbors.

We don't have space to go into details here. The object tracking feature was added beginning in Cisco IOS 12.2(15)T, with enhancements in 12.3 as well. Things you can track:
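As a rough illustration of how tracking a route (rather than a link) would have moved the HSRP primary role in the incident above, here is a small model of priority-based election. It is a sketch under stated assumptions and not IOS behavior in full: the class and function names are hypothetical, the priorities (110 and 100) and the decrement of 20 are made-up values, and preemption is assumed to be enabled on both routers.

```python
from dataclasses import dataclass

@dataclass
class HsrpRouter:
    name: str
    priority: int          # configured HSRP priority (hypothetical values below)
    decrement: int = 0     # amount subtracted while the tracked object is down
    tracked_up: bool = True

    def effective_priority(self):
        """Configured priority, lowered while the tracked object is down."""
        if self.tracked_up:
            return self.priority
        return self.priority - self.decrement

def primary(routers):
    """Assumes preemption is enabled: the highest effective priority wins.
    Priorities are assumed distinct, so tie-breaking is not modeled."""
    return max(routers, key=lambda r: r.effective_priority()).name

# Router A tracks a route to the far tunnel endpoint; B tracks nothing.
a = HsrpRouter("A", priority=110, decrement=20)
b = HsrpRouter("B", priority=100)

before = primary([a, b])   # A's tracked route is present
a.tracked_up = False       # A loses the route while its core link stays up
after = primary([a, b])
print(before, after)
```

With these assumed numbers, A is primary while its tracked route exists, and B takes over once the route disappears, even though A's link to the core never went down.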
There's a new acronym to go with this: FHRP = First Hop Redundancy Protocol (HSRP, GLBP, VRRP). They can all track objects in the new code. Static and Policy-Based Routing can also do this!

Summary

I haven't seen most of this anywhere in print / web form, with the exception of the HSRP design issue noted above. Here are the links concerning FHRPs and tracking objects:

Standby track command: http://www.cisco.com/univercd/cc/td/doc/product/software/ios122/122cgcr/fipras_r/1rfip2.htm#wp1021373
FHRPs and SAA: http://www.cisco.com/en/US/products/sw/iosswrel/ps5207/products_feature_guide09186a00801d2d74.html
Objects and routing: http://www.cisco.com/en/US/products/sw/iosswrel/ps5207/products_feature_guide09186a00801d1e95.html
http://www.cisco.com/en/US/about/ac123/ac114/ac173/Q2-04/department_techtips.html

This topic definitely feels like an article I'm going to want to write, after some lab time! Your comments, questions, and suggestions for future articles are of course welcome! See below to decipher my email address.
11/4/2004

















