One technical component of what I'm referring to is called Service Assurance Agent (SAA). SAA is a marketing renaming of the Response Time Reporter (RTR) features embedded in the Cisco IOS, emphasizing the many new capabilities that were added in Cisco IOS Releases 12.0(5) T and 12.0(7) T.
SAA allows routers to measure and report network application round trip times. The routers can send SNMP Traps when specified round trip time thresholds are violated. Besides taking measurements, the SAA code can store them over a period of time, similar to an RMON history. These measurements can be retrieved via SNMP using the Cisco RTTMon MIB. We'll go into more detail on SAA below.
The other new technical component is a new box that hasn't received much press attention, the Management Engine ME1110. This box represents a novel and exciting approach to simple distributed data collection. The ME1110 is a simple network measurement collection engine, managed via CLI and HTTP. It has been described as a "data collection toaster". The ME1110 is based on the Web Cache Engine minimal operating system: think of it as a disk drive with a web server and the ability to load and run data collection applications. The idea is to scatter ME1110's around a network, to collect and store the measurements from the routers performing SAA measurements. Applications get the data back out via XML.
Oh, along the way we'll also talk a little bit about the Cisco Internetwork Performance Monitor and Service Level Manager network management software.
The Cisco SAA approach has one big thing going for it: you already have routers at the remote sites! Thus the basic instrumentation of the network adds little or no cost.
Admittedly, configuring SAA on numerous routers might become somewhat tedious, and manually examining show command output is not a scalable way to manage a network. That's why Cisco has also produced some graphical configuration and reporting tools to work with the SAA code in the routers: Internetwork Performance Monitor (IPM) and Service Level Manager (SLM).
Another piece of the SLA management puzzle is scalable collection of data. Probes can only do a certain amount of polling, and the same is true of centralized network management stations. Hence, the ME1110 box.
The SLM application works with one or more ME1110's. SLM (and CiscoWorks 2000 version 3) can configure SAA in multiple routers via the ME1110 (which uses SNMP set to configure RTR or SAA). SLM uses the ME1110's as a distributed data repository for recent SLA compliance data in the network, periodically retrieving data to a central data repository to preserve it for longer periods of time. CiscoWorks 2000 Service Management Solution (SMS), not to be confused with any other Cisco SMS product, is a bundle consisting of the core CW2000 engine, SLM, and an ME1110. Recommended scale is approximately one SLM and ME1110 per 1000 response time measurements. You can pair the SLM server with a couple of ME1110's if more polling performance is needed. One would hope that going forward you could get even better scalability by pairing the SLM server with a larger number of ME1110's.
The following Cisco partners have announced plans to ship products by the end of 2000, using the XML interfaces of CiscoWorks2000 SLM: Agilent, Avesta, Computer Associates, Compuware, Desktalk, Fluke, Fujitsu, HP, Integrated Research, Nextpoint Networks, Paradyne, Precise Software, ProactiveNET, Response Networks, Trendium, Visionael, and Visual Networks. As you can see, this has the attention of some very serious players!
There's another business opportunity here: for the managed service provider, and for the network management systems integrator. Although the knowledge threshold has been lowered, there is still room for considerable value add based on knowing how to deploy effective SLA's with the above tools, or building a business service offering based on some subset of them.
SAA supports a variety of SNMP Traps (thresholds): immediate, N consecutive, X of Y, or average. That is, a rising or falling threshold can immediately trigger a trap. Or, to reduce trap volume at the central trap handler, you can configure SAA to only send a trap after a certain number of consecutive threshold violations, after X out of Y measurements violate the threshold, or if the average of 5 consecutive measurements violates the threshold.
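To make the four threshold types concrete, here is a minimal Python sketch of the decision each one makes as response time samples arrive. It is an illustration only, not how IOS implements it: the function name `make_reaction`, the type labels, and the parameters `n`, `x`, and `y` are assumptions chosen for this example, and it models only a rising threshold (no falling threshold or trap suppression).

```python
from collections import deque


def make_reaction(threshold_type, threshold_ms, n=5, x=3, y=5):
    """Return a function that takes one RTT sample (ms) and reports
    whether the trap condition currently holds.

    Hypothetical helper for illustration; labels mirror the four types
    described in the text: immediate, consecutive, xOfy, average.
    """
    # xOfy remembers the last y violation flags; average remembers the last n samples.
    window = deque(maxlen=y if threshold_type == "xOfy" else n)
    consecutive = 0

    def observe(rtt_ms):
        nonlocal consecutive
        violated = rtt_ms > threshold_ms
        if threshold_type == "immediate":
            # Any single violation triggers.
            return violated
        if threshold_type == "consecutive":
            # Count violations in a row; a good sample resets the count.
            consecutive = consecutive + 1 if violated else 0
            return consecutive >= n
        if threshold_type == "xOfy":
            # Trigger when at least x of the last y samples violated.
            window.append(violated)
            return sum(window) >= x
        if threshold_type == "average":
            # Trigger when the mean of the last n samples exceeds the threshold.
            window.append(rtt_ms)
            return len(window) == n and sum(window) / n > threshold_ms
        raise ValueError(f"unknown threshold type: {threshold_type}")

    return observe


# Example: require 3 consecutive samples over 2000 ms before reporting.
check = make_reaction("consecutive", 2000, n=3)
results = [check(rtt) for rtt in (2500, 2500, 1500, 2500, 2500, 2500)]
print(results)
```

The three non-immediate types trade detection speed for lower trap volume: a single slow sample never fires on its own.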
SAA can also report various error counters.
Some hardware stats: the ME1110 is 1 RU in height, provides 10/100 connectivity, with 384 MB SDRAM, 8 MB Flash, and 9 GB Ultra II SCSI disk.
More exciting: the ME1110 can run multiple Java polling applets downloaded via HTTP. The SLM software can download software updates to associated ME's. This leaves plenty of room for 3rd party value-add to add their own Java SNMP or other polling engines running on the ME chassis in addition to the SLM SAA data retrieval applet.
The Cisco literature on the ME1110 points out that an SAA router will hold data for 2 hours if there is a loss of connectivity between it and the collecting ME box. The ME box in turn can operate for up to 3 days without contact with the central SLM software. Thus the distributed ME boxes provide a robust data collection system without the polling limits and scalability issues of "the one huge central polling station". Coming to the ME operating system: syslog collection, to allow scalable syslog handling (CW2000 already has remote collector capability), and CDP (to simplify discovery and management of the ME boxes).
I have not seen any mention of it (so I'm not violating any NDA here), but I'm wondering "what about RADIUS, and IP Call Detail Records"? Can these be far behind? My sources tell me Service Providers have enough fun collecting phone billing records, as do ISP's collecting data on user accesses. With DSL and cable modems for both data and voice, there is a vast and growing amount of data to collect, and simple robust collection appliances might be very attractive to a Service Provider. (Might beat running distributed UNIX boxes with expensive database software? Or is the volume of transactions likely to require too many ME's to be easy to manage?)
What about collecting data from some of the vast quantity of other MIB variables in Cisco equipment? Presumably some of the third party vendors will do this, in key areas. ISDN dial reporting and security come to mind.
But first we need some terminology. A source is the SAA router which sends the test traffic. A target is the host computer the test traffic is sent to. The target may be a responder, an SAA router configured to respond to SAA measurement traffic. A collection is a combination of one source, one target, and one type of test traffic. Collection is the IPM/SLM term; the Cisco IOS documentation refers to it as an operation (formerly, probe).
To configure a router for basic SAA, it is (obviously?) going to be the source. So you have to specify the target and the type of operation or collection. Since a router may simultaneously be measuring response time for a number of servers and applications, each collection or operation is numbered, so that you have some way to refer to it.
When the target is an SAA router, the packets sent include information from an SAA control protocol. This should be disabled when measuring response time to a real host computer.
The following configuration shows how to set up measurement of the time it takes to connect to TCP port 23 (telnet) on remote host 10.1.1.1. We'll do this without using the SAA control protocol, since the responder is a server, not a router. The default measurement interval is once per minute. We change that to every 120 seconds. The example starts the measurements immediately. Since no lifetime is specified, the default of one hour (3600 seconds) applies. The tag line allows grouping of collections (operations) from the same or different routers. That is, we might give the same tag to all collections monitoring different applications on one server, or we might use the same tag on several routers for one or several applications being monitored to one server or set of servers.
Router(config)# rtr 10
Router(config-rtr)# type tcpConn dest-ipaddr 10.1.1.1 dest-port 23 control disable
Router(config-rtr)# frequency 120
Router(config-rtr)# tag TelnetPollServer1
Router(config-rtr)# exit
Router(config)# rtr schedule 10 start now
To set up but not start a VoIP jitter measurement based on 15 packets sent at 30 millisecond intervals (the interval value is in milliseconds):
Router(config)# rtr 200
Router(config-rtr)# type jitter dest-ipaddr 172.16.1.1 dest-port 20000 num-packets 15 interval 30
To track response time from www.cisco.com (perhaps as a way of checking remote site Internet connectivity):
Router(config)# rtr 27
Router(config-rtr)# type http operation get url http://www.cisco.com
Router(config-rtr)# timeout 10000
Router(config-rtr)# threshold 2000
Router(config-rtr)# exit
Router(config)# rtr reaction-configuration 27 threshold-type immediate action-type trapOnly
The timeout is in milliseconds: the get operation is given 10 seconds to time out. If the response time is greater than 2000 milliseconds (2 seconds), we consider this to be a rising threshold crossing (see the Threshold Manager and RMON articles). Default threshold is 5000 msec. The response to crossing the rising threshold is to send an immediate SNMP trap.
To measure UDP on port 45678 with IP Precedence 5 (loosely simulating VoIP traffic):
Router(config)# rtr 39
Router(config-rtr)# type udpEcho dest-ipaddr 10.1.1.1 dest-port 45678
Router(config-rtr)# tos 160
The ToS field is 160, corresponding to a bit pattern of 10100000. You have to remember that the IP Precedence is the left-most three bits of the ToS byte. You can also measure DiffServ as it comes into use. Voice in DiffServ-speak is Expedited Forwarding (EF), corresponding to decimal 46. DiffServ uses the left 6 bits of ToS, so we left shift by two bits. To the non-programmer, this is also known as "multiplying by 4". That means Differentiated Services Code Point (DSCP) value 46 needs to be ToS value 184.
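The ToS arithmetic for this example is easy to check with a few lines of Python. This is just a sketch of the bit shifting described in the text; the function names are made up for illustration.

```python
def tos_from_precedence(precedence):
    """IP Precedence occupies the left-most 3 bits of the ToS byte."""
    if not 0 <= precedence <= 7:
        raise ValueError("IP Precedence must be 0..7")
    return precedence << 5  # shift past the 5 low-order bits


def tos_from_dscp(dscp):
    """DSCP occupies the left-most 6 bits of the ToS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be 0..63")
    return dscp << 2  # same as multiplying by 4


prec5 = tos_from_precedence(5)
ef = tos_from_dscp(46)
print(prec5, format(prec5, "08b"))  # 160 10100000
print(ef, format(ef, "08b"))        # 184 10111000
```

So `tos 160` marks the test traffic with IP Precedence 5, and `tos 184` would mark it as DSCP 46 (EF).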
To configure a router to be an RTR responder (for example, for the prior collection's udpEcho), configure:
Router(config)# rtr responder
For security, you can use authentication keys.
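Since every source router needs the same few lines per collection, the configuration examples above lend themselves to being generated from a small description rather than typed by hand. The following Python sketch shows one way that might look. The function `render_operation` and its parameters are hypothetical; it only emits the command forms already shown in this article and does not validate them against any particular IOS release.

```python
def render_operation(number, type_line, frequency=None, tos=None,
                     tag=None, start_now=False):
    """Return the IOS configuration lines for one SAA operation.

    Hypothetical helper: 'type_line' is the text after the 'type' keyword,
    passed through unchanged.
    """
    lines = [f"rtr {number}", f" type {type_line}"]
    if frequency is not None:
        lines.append(f" frequency {frequency}")  # seconds between measurements
    if tos is not None:
        lines.append(f" tos {tos}")              # full ToS byte, e.g. 160
    if tag is not None:
        lines.append(f" tag {tag}")              # groups related collections
    lines.append(" exit")
    if start_now:
        lines.append(f"rtr schedule {number} start now")
    return lines


# Rebuild the telnet example (operation 10) from its parameters.
config = render_operation(
    10,
    "tcpConn dest-ipaddr 10.1.1.1 dest-port 23 control disable",
    frequency=120,
    tag="TelnetPollServer1",
    start_now=True,
)
print("\n".join(config))
```

In practice IPM or SLM does this job for you via SNMP; a script like this is only a stopgap for a handful of routers.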
Show commands:
| Command | What it shows |
| --- | --- |
| show rtr application | supported operation types and protocols |
| show rtr authentication | authentication method and name of key chain |
| show rtr collection-statistics | statistical errors for all collections or for one specified collection |
| show rtr configuration | all configuration values including defaults for all collections or for one specified collection |
| show rtr distributions-statistics | statistical information about response times |
| show rtr history | the history: data recorded over a period of time |
| show rtr operational-state | connection losses, timeouts, and over threshold count, as well as remaining life, whether the collection is active, and completion time, among other statistics |
| show rtr reaction-trigger | any reaction triggers you've configured (used for diagnostics) |
| show rtr responder | recent SAA control message sources, etc. |
| show rtr totals-statistics | total error and completion counts |
Recent history: in 12.1(1) T Cisco delivered local Voice Busyout, which tracks a local interface or group of interfaces. If it goes down, the voice router presents a busyout/seized condition to the attached CPE (Customer Premises Equipment) or PBX (phone switch). The point here is, if the WAN connection goes down, you can't place your call on your network, so the PBX has to call via the PSTN (Public Switched Telephone Network).
In 12.1(3) T this has been improved upon to allow Advanced Voice Busyout (AVBO). With AVBO, you can busyout a port or group of ports if RTR/SAA response time is greater than a threshold. Both features apply to CAS trunks only. There is no way to track bandwidth or lack of DSP resources currently. What I've seen in print is that the competition may have ping-based busyout, which would seem to be a rather unreliable round-trip time estimator. This appears to be more sophisticated. Neat stuff!
There are new courses coming on basic CiscoWorks 2000 and the RWAN and LAN bundles. Mentor Technologies may be offering them around February 2001. Check our web page for the latest information and schedules. We currently offer the CVOICE and CIPT voice classes (voice devices in general and Call Manager / IP Telephony, respectively), if you're interested in voice/telephony topics.
Your comments, preferences and ideas and suggestions for topics are always more than welcome! I enjoy hearing from you!
Dr. Peter J. Welcher (CCIE #1773, CCSI #94014) is a Senior Consultant with Chesapeake NetCraftsmen. NetCraftsmen is a high-end consulting firm and Cisco Premier Partner dedicated to quality consulting and knowledge transfer. NetCraftsmen has nine CCIE's, with expertise including large network high-availability routing/switching and design, VoIP, QoS, MPLS, network management, security, IP multicast, and other areas. See http://www.netcraftsmen.net for more information about NetCraftsmen. Pete's links start at http://www.netcraftsmen.net/welcher . New articles will be posted under the Articles link. Questions, suggestions for articles, etc. can be sent to pjw@netcraftsmen.net .