
Lessons Learned: Route Filtering

A Real World Story

Peter J. Welcher


Introduction

My regular column for CiscoWorld magazine was a bit short this month, so I've written up a real world story, sanitized of names and identifying information. It is presented in the form of a puzzle, where you're challenged to think about what might be the problem.

I hope you find it interesting. If you do (or do not), please let me know. I've got a whole collection of little stories that I think might be of interest, stories I sometimes tell in the appropriate class...

The Story

The network of consulting customer Company X was centered on a large headquarters (HQ) campus in the Reston, Virginia area. The campus core was Bay Centillion switches, and the ATM LANE core extended (bridged) from Reston to a building in downtown DC, although that only tangentially affects the story. The campus LAN was large, with 2000 users (more or less), so the company's Class B address was subnetted with a /20 mask (say 172.16.32.0 /20 for purposes of this story).
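For readers who want to check the arithmetic behind that /20, here is a minimal Python sketch using the standard ipaddress module. It only restates the example prefix from the story; the variable names are mine.

```python
import ipaddress

# The example campus subnet from the story.
campus = ipaddress.ip_network("172.16.32.0/20")

total_addresses = campus.num_addresses      # 2 ** (32 - 20)
usable_hosts = total_addresses - 2          # minus network and broadcast addresses
first_address = campus.network_address
last_address = campus.broadcast_address

print(campus.netmask)                       # 255.255.240.0
print(usable_hosts)                         # 4094
print(first_address, "-", last_address)     # 172.16.32.0 - 172.16.47.255
```

So a single /20 leaves comfortable headroom for roughly 2000 users on one bridged campus subnet.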

The Company X campus site had a 7500 router connecting to the WAN (ignoring a few others). The 7500 connected to business partners, namely foreign resellers. These foreign resellers were legally distinct companies within various countries. There were firewalls, but it wasn't clear who was being protected from whom by them -- another side issue. The resellers needed to enter sales information and retrieve billing data to service accounts in their country. The information was on database servers located across the HQ campus LAN, which was supposed to be kept very secure (another interesting side issue).

One not so fine day, Company X lost a LANE blade in an ATM switch, disrupting HQ campus connectivity. Staff ran around and eventually got the problem fixed. The emphasis is on eventually: it took a while.

At about this time, people at HQ started reporting connectivity problems, whereby some desktops could not connect to the database servers. Since this might be somehow related to the ATM LANE outage, staffers focussed on that at first.

Question: Was this a reasonable assumption? What else could it have been?

Since PCs were involved, there was some rebooting of PCs. It was soon noticed that the problem was intermittent and flaky. A PC might be connected just fine, but when rebooted would be unable to connect to the database server(s). Meanwhile, the neighboring PC which had not been connecting might start connecting. The scope of the outage, based on a limited sample, seemed to be most or all of the HQ site.

Question: What should be checked next, besides the PC address, subnet mask, and default gateway, which hadn't changed?

Upon checking the ARP tables in the PCs, the server's IP address was noted next to a MAC address of 00:00:0c:something.

Question: What does this tell us?

That's the old Cisco MAC vendor code, present in 7000 series routers. The WAN Router was a 7500.
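If you want to automate that kind of eyeball check, the idea can be sketched in a few lines of Python. This is only an illustration: the function name is mine, and the sample MAC address is hypothetical (only the 00:00:0c prefix comes from the story; the last three octets are invented).

```python
def oui(mac: str) -> str:
    """Return the first three octets (the vendor prefix) of a MAC address, lowercased."""
    # Keep only hex digits so that 00:00:0c:..., 00-00-0C-... and 0000.0c.. all work.
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in (0, 2, 4))


CISCO_OLD_OUI = "00:00:0c"  # the vendor code named in the story

# Hypothetical ARP cache entry; the last three octets are made up.
arp_entry_mac = "00-00-0C-12-34-56"
looks_like_cisco = oui(arp_entry_mac) == CISCO_OLD_OUI
print(looks_like_cisco)  # True
```

A server answering for its own address would normally show its NIC vendor's prefix instead, which is what makes a router prefix in that ARP entry suspicious.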

Question: So why would the PC think the IP address of the database server should be paired up with the MAC address of the router? What is the name of the protocol involved? Does this tie in with the intermittent nature of the symptoms reported?

So it looked like the router was responding to the PC ARP request for the MAC address of the server. If the server and the router were both replying, that would explain the puzzling intermittent nature of the symptoms. It would just depend on which answered first, whether the PC got connected or not. So, apparently, the router thought it should be sending proxy ARP responses all of a sudden.

Question: What are the prerequisites before a router will respond to an ARP query, performing proxy ARP?

When we looked at the routing table for the WAN Router, we saw 172.16.32.0 255.255.240.0 as a connected route out the LAN interface. This /20 is as expected. The minor surprise was that we also saw the prefix 172.16.32.0 255.255.255.0, a /24. The route was learned by the official dynamic routing protocol (EIGRP or OSPF).

Question: How do you figure out from the routing table which router the WAN Router learned the route from?

The next hop in the routing table showed that the route had been advertised by one of the foreign resellers. During the LANE outage, the reseller's staff concluded that the servers were unreachable due to a routing problem (perhaps failing to note 172.16.32.0 /20 in their routing table). So they added a static route to 172.16.32.0 /24, which wasn't too terrible a reaction. But they also redistributed it (possibly due to an existing redistribute static or redistribute connected in their router config), and that created the surprise.

Lessons Learned

The small scale lesson is that router proxy ARP depends on having a matching prefix in the routing table. A router does not send proxy ARP replies to ARP requests received on an interface for addresses matching that interface's configured subnet or secondaries. However, this can be overridden by a more specific route that points elsewhere. If you examine the LAM technique (Local Area Mobility, see also http://www.netcraftsmen.net/welcher/papers/mobileip.html ), you'll realize that it uses a /32 host route and proxy ARP to let a host computer move onto another LAN segment.
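To make the interaction concrete, here is a minimal Python sketch of the decision as I understand it. It is a simplified model, not router code: it assumes the router answers an ARP request only when its longest-match route for the target points out a different interface than the one the request arrived on. The interface names and the server and PC addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: ipaddress.IPv4Network
    out_interface: str
    source: str  # "connected" or "dynamic" in this sketch


def longest_match(table, address):
    """Return the most specific route containing address, or None."""
    candidates = [r for r in table if address in r.prefix]
    return max(candidates, key=lambda r: r.prefix.prefixlen, default=None)


def would_proxy_arp(table, target_ip, arp_in_interface):
    """Simplified rule: reply only if the best route leaves by another interface."""
    best = longest_match(table, ipaddress.ip_address(target_ip))
    return best is not None and best.out_interface != arp_in_interface


connected = Route(ipaddress.ip_network("172.16.32.0/20"), "LAN", "connected")
bogus = Route(ipaddress.ip_network("172.16.32.0/24"), "WAN", "dynamic")

before = [connected]            # normal state
after = [connected, bogus]      # after the reseller's /24 leaked in

server = "172.16.32.10"         # hypothetical database server address
print(would_proxy_arp(before, server, "LAN"))   # False
print(would_proxy_arp(after, server, "LAN"))    # True
```

Note that in this model a hypothetical host at 172.16.40.5 is unaffected, since only the /20 matches it. Only targets inside the leaked /24 draw proxy ARP replies.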

One might also suggest that dynamic routing is gentle persuasion, and static routing is the use of force. When persuasion fails, don't resort to force! Instead, figure out the real problem! There are places for static routes, but generally folks are far too eager to shove in static routes to solve problems (creating other problems). Static routes should only be used as part of a carefully considered design. Static routes are like graffiti: they clutter up your router config, are annoying when troubleshooting, and can be hard to get rid of. (Nobody can take them out, because they were put in to fix something, and if you take them out you might break whatever it is all over again.) Once you've got some static routes, they start accumulating faster and faster, too.

The higher level observation is that routing protocols trust whatever peer routers send. A human, by contrast, would (I hope) apply common sense: "what on earth is that route doing being advertised from over there, that's not the way to that subnet!"

We can help prevent nasty surprises by putting in some common-sense inbound route filters. Outside organizations shouldn't be advertising our subnets or other companies' subnets to us, just subnets of their major networks. Europe should not be telling North America about North American subnets (at least, not under most of the usual network topologies). And so on.

In this case, the foreign reseller should not have been telling the WAN Router about Campus HQ subnets. There should have been inbound route filtering, to make sure that each foreign reseller only advertised their block of subnets to the core.
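The logic of such an inbound filter can be sketched in Python. This is a model of the idea, not router configuration: the reseller's address block (10.20.0.0/16) and the list of advertised prefixes are hypothetical, and the function name is mine.

```python
import ipaddress


def inbound_filter(advertised, allowed_blocks):
    """Split advertised prefixes into (accepted, rejected).

    A prefix is accepted only if it lies inside one of the blocks
    assigned to the peer that sent it.
    """
    accepted, rejected = [], []
    for text in advertised:
        prefix = ipaddress.ip_network(text)
        if any(prefix.subnet_of(block) for block in allowed_blocks):
            accepted.append(prefix)
        else:
            rejected.append(prefix)
    return accepted, rejected


# Hypothetical block assigned to one reseller.
reseller_blocks = [ipaddress.ip_network("10.20.0.0/16")]

# Hypothetical advertisements, including the leaked campus /24 from the story.
advertised = ["10.20.1.0/24", "10.20.8.0/22", "172.16.32.0/24", "0.0.0.0/0"]

accepted, rejected = inbound_filter(advertised, reseller_blocks)
print("accepted:", [str(p) for p in accepted])
print("rejected:", [str(p) for p in rejected])
```

With a filter like this in place, the leaked 172.16.32.0 /24 would never have reached the WAN Router's routing table, and a stray default route from the reseller would be dropped as well.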

Further questions for the reader: what two things do you think outbound route filters might be good for? Could we make a mistake and then pass it on to our peers? What can happen if we advertise someone else's subnets back to them, at a time when they are experiencing internal loss of connectivity?

There also really should be an outbound route filter (distribute list), so that only the key server subnet(s) are advertised to the remote sites. That way, they can't use the WAN Router to route traffic to each other (unless such transit traffic is desired, of course). Furthermore, there's a security liability issue lurking here: if a hacker hacks into one foreign subsidiary, then uses the WAN Router to get to another, well, that sounds like a lawsuit waiting to happen to me.
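As a rough illustration of the outbound side, here is a small Python sketch of an exact-match permit list. Everything specific in it is an assumption for illustration: the server subnet 172.16.48.0/24, the two reseller blocks, and the function name are all hypothetical.

```python
import ipaddress


def outbound_filter(routing_table, permitted):
    """Return only the routes that exactly match the permit list."""
    permitted_set = {ipaddress.ip_network(p) for p in permitted}
    routes = [ipaddress.ip_network(r) for r in routing_table]
    return [r for r in routes if r in permitted_set]


# Hypothetical WAN Router routing table.
routing_table = [
    "172.16.32.0/20",   # campus LAN (from the story)
    "172.16.48.0/24",   # hypothetical dedicated server subnet
    "10.20.0.0/16",     # hypothetical reseller A
    "10.30.0.0/16",     # hypothetical reseller B
    "0.0.0.0/0",        # default route
]

# Advertise only the server subnet to the resellers.
advertised = outbound_filter(routing_table, ["172.16.48.0/24"])
print([str(p) for p in advertised])  # ['172.16.48.0/24']
```

Because reseller B's block is never advertised to reseller A (and vice versa), neither one learns a route to the other through the WAN Router.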

In this design, I'd prefer to see separate servers for different legal entities. I'd also prefer to see the databases accessed by the foreign resellers isolated on a separate subnet and physical segment, and near the WAN Router. I don't like seeing untrusted traffic transiting a supposedly secure campus LAN to get to servers. The good thing about having a separate subnet for the foreign database servers is that now you know exactly what one subnet to advertise to the outside.

Question: Why should the WAN Router not advertise 0.0.0.0 /0 to the foreign resellers?

(Lesson learned from another story, which might make it into print sometime. Thanks to Jon Kadis of Mentor Tech. for the insight.)

Conclusion

Ironically enough, I'd been at this site about a month earlier, had written up a design review, and among the recommendations was inbound and outbound route filtering. I believed in route filtering before this, and I now believe in route filtering even more strongly.

Thanks to the person who told me the unsanitized version of this tale (with full gory details). (You know who you are.) Any changes are mine, either due to faulty memory, invention over the course of retelling the story, or pure exaggeration!


Dr. Peter J. Welcher (CCIE #1773, CCSI #94014) is a Senior Consultant with Chesapeake NetCraftsmen. NetCraftsmen is a high-end consulting firm and Cisco Premier Partner dedicated to quality consulting and knowledge transfer. NetCraftsmen has nine CCIE's, with expertise including large network high-availability routing/switching and design, VoIP, QoS, MPLS, network management, security, IP multicast, and other areas. See http://www.netcraftsmen.net for more information about NetCraftsmen. Pete's links start at http://www.netcraftsmen.net/welcher . New articles will be posted under the Articles link. Questions, suggestions for articles, etc. can be sent to pjw@netcraftsmen.net . 



6/13/2001
Copyright (C) 2001, Peter J. Welcher