Troubleshooting Poor Performance, and Dsniff Woes
Sunday, 10 August 2003 21:00
Introduction
This month's article is about security in a different form. Has it occurred to you that poor network performance might be due to a security problem? The article also contains a tale of troubleshooting, where I invite you to put on your thinking cap and follow along as we work through how an actual real-world troubleshooting incident evolved. The article finishes with some thoughts about managing performance in networks, and why doing so may be becoming more critical.

Network Health Checks
Chesapeake Netcraftsmen does some business involving "network health checks". This can entail checking someone's network design for conformity with industry best practices, security, and network management practices. It can also involve looking for the cause of "user slowness". That is, not sluggish or dense users, but users who complain that the network is slow.

We've had discussions about procedurizing this. Enough of these engagements seem to be different, one-of-a-kind events that it is a challenge to come up with a fixed procedure for dealing with them. Experience also plays a role. If one doesn't use one's wits, one can mindlessly collect data that isn't really helpful, and miss other things that matter greatly. I can't claim perfection in any of these regards. It's like sports: some days you play a better game than others.

I'm going to stick my neck out a bit now and shoot my mouth (keyboard?) off a little, hopefully without offending anyone. The networks of the various organizations I look at seem to fall into four approximate categories, over and over again. These to some extent correspond with industry, staff size, budget, and so on. They also form part of a network evolution, where most networks move up the scale and get better as time goes on. We hope and trust our consulting plays a role in speeding up that process. One thought here is that each Stage has its own set of issues and needs.
Each Stage also differs in its readiness or need to deal with some aspects of networking or network management. You've got to fix the big problems before you move on to the smaller ones!

I provide these stages so you can ponder, "where does my network fit in?" You might also think about: "how ready am I to deal with performance problems in my network?" If you have constant outages, performance problems are the least of your worries. Similarly, if you have bottlenecks or stability problems in your network design, you will have widespread and frequent performance problems. So far, I only see firms in Stage 4 really trying to manage performance. Everybody else responds to performance problems.

Stage 1
Stage 1 is the beginner network. It often happens as a company grows. The "server guy(s)" buy a few routers and switches, and things snowball. Sometimes this sort of network makes it to the 1000-user stage without a separate network staffer. Characteristics: a lot of ad hoc, unreliable cabling and gear. The network happened, and was not particularly designed or planned. There are daily outages where a hero runs around and touches things and fixes them. (I used to do this!) Network event monitoring tools: the phone, perhaps What's Up Gold or something similar, used maybe every quarter or when some need comes up.

In older firms, this sort of network usually occurs in conjunction with tight financial controls, and in such cases there's usually still some Token Ring or FDDI, Bay switches (remember Bay?), etc., with no firm plans to get rid of them. I'm not implying any criticism; this is just trying to describe the way things are. The problem here is usually where to start in improving things. The staff is usually overwhelmed: they're so busy fighting fires that they don't have time to improve things. The old saying about fighting alligators versus draining the swamp comes to mind. The biggest problems in this network are usually physical layer and design.
Stage 2
The network still kind of just happened, but the cabling is often not as big a problem. The network may have spanning tree instability, user port speed/duplex/autonegotiation mismatches, or other such problems. Generally there are still Single Points of Failure in networks at this stage. Mixed media types are less common but may still occur. Routing may not fail over dynamically but instead may require manual intervention (this is becoming rarer, as it should). Outages are still a bit frequent. Network management still consists of What's Up Gold or such, but at least somebody now uses it more regularly. Physical layer may or may not be part of the problem. Inexpensive and unmanageable equipment probably is part of the problem. Design review may also be part of the solution. This kind of network has fewer alligators, but still has enough to make life challenging.

Stage 3
This is possibly a larger organization. There's often Novell or SNA traffic in this network. There may be Token Ring, with some legacy SNA servers on it. The network design may be ad hoc or it may be more carefully done. The staff has HP OpenView (HPOV) Network Node Manager (NNM), and somebody is often found with the map showing on their computer screen. The staff is managing reactively: responding to crises, but not very proactive. There is little attention paid to (a) HPOV events, (b) CiscoWorks syslog reporting, or (c) CiscoWorks Device Fault Manager (DFM). The network management tools get used, but have not been tuned to eliminate false alarms, etc. Issues here might include: still some physical layer or design woes, pace of new technology, capacity planning, and the choice, use, and procedures concerning network management tools.

The other problem that can occur is "too much redundancy", i.e. some component fails, but nobody notices, because the redundant components take over. This sometimes comes to light months later when the redundant device also fails.
The no-outage part of this is good; that's why you built the redundant network. The problem with not noticing comes down to more and different operational and management procedures: you need to notice the smaller things now. That's partly a pickier attitude, partly using the tools mentioned above more. To summarize, the alligators in Stage 3 are fewer, but they're big, nasty, and sometimes a bit hard to outwit.

Stage 4
Resembles Stage 3, but makes more use of the network management tools. This organization usually has Concord for performance reporting and/or capacity planning. But there's usually no 1/3-time person dedicated to the care and feeding of Concord (or whatever tool is used). It tends to get spiffed up perhaps every 3-6 months, for a spot check of things. Alligators aren't much of a problem at Stage 4; the occasional alligator outbreak is easily dealt with. But occasionally the swamp grows back a little faster than you realize.

So What?
There's no criticism in any of the above. We all do what we can with the people, time, and budget we have. I've run into several organizations in the Stage 3 to 4 range, and there the problem is staffing level: no time to frequently view and respond to network management and reporting tools, let alone tweak them. And let's face it, if folks have plenty of time to do perfect net management reports, documentation, etc., you might be overstaffed. There's a fine line here, and lately it seems folks are definitely understaffed.

What Kinds of Network Problems?
Let's shift gears now, and consider what sorts of problems a network can have. There are three basic types of outages or problems:
Tools used for this: Lotus Notes, Excel spreadsheets. Trouble ticket systems may do this, but I have yet to run into the same trouble-ticketing system twice (other than perhaps Remedy). And nobody ever seems to know how to get such reports out of them. Logs do help: they give you a rational basis for giving your WAN provider a hard time. Or your hardware vendor.

Outages are actually easy. Something isn't working, and you can easily tell when you've fixed the problem. They're binary: up/down sorts of things.

Flakiness is harder. You have to catch it in the act. This is where network management tools like HPOV can help, because you can see the link or router turning red occasionally. (You need to realize there may be other times it went out but came back before HPOV noticed.) Once you're aware that something is flapping up/down, you can fix flakiness: swap out hardware, call the WAN provider, etc.

That leaves poor user performance. That's the hard one to diagnose. The problem can also be intermittent, which makes it worse. It's end-to-end, so you need to consider the user PC, the network, the server(s) the PC is interacting with, and perhaps even the protocol or application in use.

This is also why I'm a big fan of structured, tested cabling. If you comply with standards, and if you had 10-20% or more of the cabling tested in your presence, then you know whether you can trust the cabling job. The point is to use the 10-20% sample as an indicator of the quality of the remainder of the cabling. Don't forget patch panels and patch cables. See also http://www.bicsi.org/. It costs some real money to get to trusted cabling. The benefit is that from then on, the cabling is not a problem, or at least rarely is. This actually has a real pair of benefits. One is that staff spends less time running around and testing cabling in response to trouble calls. The money spent on testing cabling comes back quickly in terms of staff productivity and less stress.
The other benefit is MTTR (Mean Time to Repair): things get fixed faster if you don't have to run around testing the cabling.

I see similar benefits to designed networks with redundancy. Keywords here: design, and redundancy. The proper design and configuration can make your network a lot easier to manage. (See some of my other articles.) Proper design means you don't have outages, or that troubleshooting is simpler (e.g. knowing where your root switch is, and keeping VLANs very small). Some simple configuration examples: number the Frame Relay subinterface with the DLCI number. Use named ACLs when defining ACLs for each site in a crypto map. Use remark statements in your ACLs.

The redundancy buys time: outages are no longer crises, but can be managed. This means you don't get seduced into shotgunning, where you wildly try things to get the network back up ASAP. Instead, you can take a somewhat more relaxed approach, and take time to really understand the problem.

Causes for Poor Performance
Let's consider just some of the many possible causes of poor performance.

Physical layer (OSI layer 1): cabling, cables that got pulled, cables bent severely, bad crimp jobs, cheap patch panel or patch cable, corrosion or oxidation in cheap components, water damage, cabling near electrical noise, fiber patch cables that don't match the cable run type (MMF versus SMF). Hard-coded speed/duplex at one end but not the other. Corrupted driver software, improper NIC settings.

Data link layer (OSI layer 2): large VLAN, spanning tree instability, (rarely) mismatched keepalive or other timers. Or bit errors on some link (bad CRCs). Contention with other traffic on trunks. Capacity bottlenecks.

Network layer (OSI layer 3): duplicate IP address, excess broadcast or multicast traffic, strange routing. The latter can happen in a particularly nasty form if you get inconsistent with secondary addresses on the LAN port on a pair of routers.
Traffic on the LAN between local subnets may end up getting routed across the WAN, which doesn't do much for performance locally, and can also clobber other legitimate WAN traffic.

Other OSI layers: usually an application or protocol problem at this level. Consider whether the application does "ping-pong": request something, wait for the reply, request the next chunk, etc. Despite running over TCP, Microsoft Server Message Block (SMB) does this, because it requests 32 KB file chunks, and waits before getting the next. This is very far from streaming performance. Add in WAN or VPN delays, and the performance may easily become intolerable. As a rough illustration, one 32 KB chunk per round trip over a path with 50 ms of round-trip delay cannot exceed about 640 KB per second, no matter how fast the links are. General response: use Citrix, and get the Microsoft database traffic localized to the data center, so that latency isn't a performance-killer.

This can also happen with large relational databases (PeopleSoft, SAP/R3, etc.). The cause may be DBAs or other programmers who pull a major chunk of the database (very large file, whatever) back to the user PC to do the join or whatever logic locally. All DBAs and application developers should have to test their code through an old pair of Cisco 2500s with a serial back-to-back cable running at 128 Kbps. Such coding practices then become painfully clear!

A word about Ethernet auto-negotiation. You cannot just set one end and "let the other one figure it out". If only one end is auto-negotiating, it can sense the speed of the hard-coded end but not its duplex setting, so it falls back to half duplex (typically 100 Mbps half duplex). If the other end is hard-coded to full duplex, you now have a duplex mismatch. Things have improved enough now that you probably ought to have auto-negotiation set on all PCs and switches. Whatever you do, have a site standard. There's nothing worse than not being sure what the user PC is set to, since you then have to run and check it: more delay and time spent on something rather trivial. This is a matter where staff has to pick the standard and then be 100% consistent. (I'm ranting about this a little, since it seems to be so often misunderstood.)

Troubleshooting Poor Performance
The first thing that is really critical is scoping.
Is everyone having the problem? Or is it more localized? Which sites, floors, or VLANs have the problem?

This is where helpdesk staff can be very helpful. Even if the helpdesk staff have a limited skill set and role in your organization, if you can work with them to get them to automatically think about scoping, you can then delegate a very distracting chore and make your troubleshooting better.

The reason scoping is so critical is that otherwise, you don't have any particular guide to where to look. Collecting and examining broad performance data is time consuming. With some automation, we can check out a smaller site or network (20-100 routers) in a few days. That's not exactly what you're looking for if your users are revolting now. Having said that, if their problems are severe, you are generally getting enough calls and enough information between four-letter words that scoping can easily be done.

Some real-world examples of scoping and how it helps...

Case 1: All users seem to be having performance problems. You then get someone (server staff?) to check the server performance specs, and you go check the load and error rates on the links in the core and to the server farm. It turns out backup is running on the user VLAN and port on the server. Upstream, this is going to a router-on-a-stick to route back to the same switch, to get to the backup server VLAN. All the traffic on this trunk and router is adversely affecting WAN user traffic. The fix: use the correct address on the backup server, to keep the backup traffic "out of band". (And yes, we're upgrading to Catalyst 6500s, which will be doing Layer 3 switching in the core switches.) This struck me as a "could happen anywhere" story. Do you know where your backup traffic is really going? Will you notice when it goes by some other Layer 2 or 3 path?

Case 2: Only users on 2 or 3 VLANs are affected. Then it's worth checking for problems on the uplinks from their switches. If it is just isolated users, why them?
Bad cabling, common switch, recent software upgrade? Scoping allows you to focus your troubleshooting efforts. And yes, you know where to use your Sniffer.

So scoping helps you focus your problem and performance checking efforts. This is especially critical if you're using CLI and show commands to troubleshoot. We'll come back to network management tools later. They can make a big difference, particularly when it comes to eliminating potential causes quickly and finding the real problem indicators.

A Real World Story
Let's now shift to a real world story. I recently spent a quality Saturday with some folks with very good network management and troubleshooting skills. It was day 5 of a performance-related outage, and things were getting ugly. Senior management was of course affected.

The cabling techs were there and tested some cables: no problems found. Low error/collision counts on switch ports, so probably not duplex or speed mismatches. Traffic (utilization) reporting and server reporting: no obvious problems. No error rate or other issues apparent in nearby switches or routers. DFM was showing some gripes, but no history of spanning tree problems in the affected region. Reasonable broadcast and multicast traffic rates in the affected building.

Thinking about finding obscure problems, I was looking through CiscoWorks reports, especially syslog. Perhaps something was rebooting every once in a while, fast enough to not show up in Tivoli Netview. Or perhaps there would be some evidence of port flapping, errors, or spanning tree issues. What I found: some griping about duplicate IP addresses. And the address was the virtual IP for the HSRP router(s) on one of the afflicted VLANs. Except the griping was only late Tuesday afternoon and Wednesday morning. That didn't seem to relate to the outage. Looking more closely, the Tuesday griping started about 4:25 PM, and that's about when problems started.
I still wasn't too sure this was our problem, but it was better than anything else we had to go on. We looked for the vendor OUI code on the web and couldn't find it: no confirmation that the duplicate was on a PC. Note that a PC using the address of the VLAN default gateway could be a real problem. Traffic sent to the PC would normally black hole (be thrown away). The question became: why was this showing up as slowness with occasional Lotus Notes session loss, as opposed to total disconnects?

One of the staff on site had a detailed Sniffer capture of all traffic on Friday from an affected user. He started looking at it. The capture had used a filter to only capture traffic to/from the one user, to allow a multi-hour sample. Curiously, there were a lot of ARP broadcasts from the offending MAC address. It was ARP announcing itself as the default gateway rather frequently. It was also ARPing more normally, as it would to talk to other hosts on the network. The conjecture became malware or a virus. We guessed that the lack of ongoing router gripes about duplicate IP addresses was because they possibly only occur on PC bootup, or because of some other form of throttling. (I must admit, I still don't feel I can explain that part of things.)

To cut to the chase, when the MAC address in question showed up on Monday morning, scanning switch MAC tables quickly led to the port and then the cubicle in question. Shortly thereafter, network staff and security were there to greet the user. It turned out to be a college intern. I have no more details of the story.

The site had recently been burned by SQL Slammer on a laptop. The firewalls blocked it fine, until the laptop was hooked up on the inside. This plus the severity of the outage may have contributed to the willingness to coordinate with security and aggressively target the source of the disruption.

I'm going into this not to embarrass anyone; I think that in fact the situation was handled about as well as it could be.
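The ARP pattern in this story, one MAC address repeatedly claiming the default gateway's IP address, is something a script can look for in capture data. The sketch below is only an illustration under assumptions: it takes ARP observations that have already been pulled out of a capture as (sender IP, sender MAC) pairs, and a table of known-good gateway MAC addresses that you would have to maintain yourself. The function name and all addresses are made up for the example.

```python
from collections import Counter


def find_gateway_impostors(arp_observations, gateways):
    """Count ARP packets that claim a gateway IP from an unexpected MAC.

    arp_observations: iterable of (sender_ip, sender_mac) pairs.
    gateways: dict mapping gateway IP -> set of legitimate MACs in lower
              case (for HSRP, the virtual MAC plus the routers' own MACs).
    Returns a Counter keyed by (ip, mac) for every unexpected claim.
    """
    suspects = Counter()
    for ip, mac in arp_observations:
        allowed = gateways.get(ip)
        # Only gateway addresses are checked; ordinary hosts are ignored.
        if allowed is not None and mac.lower() not in allowed:
            suspects[(ip, mac.lower())] += 1
    return suspects


# Hypothetical sample data: one gateway, one well-behaved host, one impostor.
gateways = {"10.1.20.1": {"00:00:0c:07:ac:14"}}
observations = [
    ("10.1.20.1", "00:00:0c:07:ac:14"),   # the real gateway
    ("10.1.20.57", "00:11:22:33:44:55"),  # a host ARPing as itself
    ("10.1.20.1", "00:11:22:33:44:55"),   # same host claiming the gateway IP
    ("10.1.20.1", "00:11:22:33:44:55"),
]
impostors = find_gateway_impostors(observations, gateways)
for (ip, mac), count in impostors.items():
    print(f"{mac} claimed gateway {ip} in {count} ARP packets")
```

A count of zero proves little, since the capture may simply have missed the offender, but any non-zero count for a gateway address is worth chasing through the switch MAC tables.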
The cause and the difficulty troubleshooting this one were (I hope) well beyond the norm. I'm not identifying the people or company, to avoid any public exposure or embarrassment. As usual, I feel I learned some things from this, and that's why I'm sharing the story. We'll look at lessons learned at the end of this article.

Would you have been able to track this down? How would you handle something like this? Would you be willing to coordinate with management and security and go politely find out what was on the user's computer?

About Dsniff
Part of the lesson learned is that one has to know about the various security vulnerabilities that are out there. One example: Cisco has recently posted information about tightening up on Layer 2 vulnerabilities. Worthwhile reading! Take a look at:

The tool I didn't really know enough about was DSNIFF, which has been around for a while. For details, see:
The purpose of this: it allows the DSNIFF user to capture all traffic leaving a VLAN. They can then analyze it for problems (the alleged legitimate use) or, if a hacker, analyze it for passwords or other "useful" data. Not good!

My personal take is that only a naive person could think there would ever be a legitimate use for this tool. It has a PC acting as a Layer 2 switch and then some (forwarding frames and doing MAC rewrite) for multiple 100 Mbps attached PCs. The PC and/or the switch port to the PC would become a massive bottleneck on all but small, lightly used networks. This would then be visible as a performance problem, as it actually was!

Internal hacker use of this tool would be tough to catch. You'd need to know normal performance patterns in your network. The router syslog messages, and possibly User Tracking in CiscoWorks Campus, are the two thoughts I've had on how to spot actual live DSNIFF usage.

Performance Management
The popular tool of choice these days seems to be SolarWinds Orion. See http://www.solarwinds.net/. The combination of affordable price, reasonably good scaling, and ease of use seems to be a real winner! Drawback: limited visibility into other MIB variables that I, as an (unusually?) demanding user, would like. This tool is a quick way to rule out servers, utilization, and errors as problem causes. It also gives good visibility via the Top-N reports into the hotspots in your network.

I've noticed people reaching for the Sniffer at the first sign of trouble. I've sometimes compared this to using a microscope to hunt deer (overwhelming detail, too narrow a focus). In general, I still think scoping and thought are crucial up front. Once you've found the deer, the Sniffer or other protocol analyzer is a great tool, provided you actually need the full details!

One network management / design lesson bears repeating however.
It is really a very good idea to have either a Sniffer on a SPAN port of each core switch (per building), or Cisco NAM blades in them. This allows you to remotely capture filtered data, once you've scoped a problem down somewhat. With NAM you can then either analyze the data or pull it back to analyze with a local physical Sniffer. This all gives you that crucial data that can sometimes show what's going wrong with a protocol, application, security, firewall ruleset, or other problem cause. When you need this, you need it badly!

Conclusion
Lessons learned or repeated:
Something occurred to me during the troubleshooting and health checks I've done recently. I'd love your feedback on the following not-so-wild idea. I have yet to get a chance to try it on a live network.

If you have Frame Relay, cable, DSL or IPSec VPN, or cabling or other problems, one of the causes of slowness may be lost frames or packets. With Layer 2, you get switch collision, error, or CRC counts. When carriers or ISPs lose frames or packets, as in Frame Relay, DSL, cable, or IPSec VPN, you get nothing: there's no indicator in the router that a packet or frame went missing.

On the other hand, the server that sent the packet will bump up its retransmission counter, assuming TCP is being used. SNMP agents for servers are fairly readily available, and TCP Retransmissions is part of the MIB-II variable set. It seems like it would be very useful to monitor this variable on all servers, and have Top-N reporting and/or graphing. That way, when you see a high retransmission rate on a server, you fire off your Sniffer that is on a SPAN port, spanning the server port in question. (Possibly RSPAN if the server is not on the same 6500 switch.) Now you're getting real, live scoping data, as the Sniffer Expert mode shows which users are getting retransmissions.

Does this address the one real performance troubleshooting issue, lost packets with no (other) counter reporting on the problem?

Thanks to Keith B. for pursuing and providing the key piece of Sniffer data that locked in the resolution of this problem. And thanks to the other guys involved, for allowing me the chance to work with them on the problem related above. I have assumed you and your employer don't want to be mentioned more conspicuously here.

And do please send email comments, suggestions for future articles, etc. to me at the email link below.
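The retransmission-monitoring idea from the conclusion can be sketched in a few lines. This is a minimal sketch, not a tested tool: it assumes you already poll the MIB-II TCP counters tcpOutSegs and tcpRetransSegs from each server by whatever SNMP means you have, and that two successive samples are available as plain dictionaries. The server names, sample values, and function names are hypothetical. One wrap of the 32-bit counters is handled; an agent restart between samples is not.

```python
COUNTER32_MODULUS = 2 ** 32  # MIB-II TCP counters are 32-bit and wrap around


def counter_delta(old, new):
    """Difference between two Counter32 samples, allowing for one wrap."""
    return (new - old) % COUNTER32_MODULUS


def retransmission_rates(previous, current):
    """Return {server: retransmitted segments per original segment sent}.

    previous/current map server name -> (tcpOutSegs, tcpRetransSegs).
    tcpOutSegs excludes segments carrying only retransmitted octets, so
    the ratio compares retransmissions against original segments.
    Servers missing from `previous`, or idle in the interval, are skipped.
    """
    rates = {}
    for server, (out_new, retrans_new) in current.items():
        if server not in previous:
            continue
        out_old, retrans_old = previous[server]
        sent = counter_delta(out_old, out_new)
        retrans = counter_delta(retrans_old, retrans_new)
        if sent > 0:
            rates[server] = retrans / sent
    return rates


def top_n(rates, n=5):
    """Servers with the highest retransmission rates, worst first."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical samples from two polling cycles; files02's tcpOutSegs wraps.
previous = {"notes01": (1_000_000, 500), "files02": (4_294_967_000, 100)}
current = {"notes01": (1_100_000, 5_500), "files02": (704, 300)}

for server, rate in top_n(retransmission_rates(previous, current)):
    print(f"{server}: {rate:.1%} retransmitted")
```

Working from deltas between polls rather than raw counter values matters here: the raw counters only say how much has been retransmitted since the agent started, while the per-interval rate is what points at a server whose users are losing packets right now.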
8/10/2003












