Service Assurance Agent (SAA) and the Management Engine

Introduction

This month's article covers some new technology that lies at the heart of an exciting Cisco network management strategy. What has been announced or is shipping has the potential to become THE accepted way to instrument a network, and may become essential to both Enterprises and Service Providers (IP, ISP, and ASP) in monitoring Service Level Agreements. Cisco has also opened the door for partners to produce their own applications based on this technology, which may lead to a wealth of solutions targeting different market niches. Now that I've gotten your attention ...

One technical component of what I'm referring to is called Service Assurance Agent (SAA). SAA is a marketing renaming of the Response Time Reporter (RTR) features embedded in the Cisco IOS, emphasizing the many new capabilities that were added in Cisco IOS Releases 12.0(5) T and 12.0(7) T.

SAA allows routers to measure and report network application round trip times. The routers can send SNMP Traps when specified round trip time thresholds are violated. In addition to just conducting measurements, the SAA code can store measurements over a period of time, similar to an RMON history. These measurements can be retrieved via SNMP using the Cisco RTTMon MIB. We'll go into more details of SAA below.

The other new technical component is a new box that hasn't received much press attention, the Management Engine ME1110. This box represents a novel and exciting approach to simple distributed data collection. The ME1110 is a simple network measurement collection engine, managed via CLI and HTTP. It has been described as a "data collection toaster". The ME1110 is based on the Web Cache Engine minimal operating system: think of it as a disk drive with a web server and the ability to load and run data collection applications. The idea is to scatter ME1110's around a network, to collect and store the measurements from the routers performing SAA measurements. Applications get the data back out via XML.

Oh, along the way we'll also talk a little bit about the Cisco Internetwork Performance Monitor and Service Level Manager network management software.

Why This Solution is Compelling

The reason I think the SAA technology is so compelling is that it beats the alternatives hands down. Alternatives I know of:
  • run processes on key servers, tracking key processes (and make sure the tracking processes stay alive?)
  • run a dedicated (probe) server that tracks key services on other machines (some of the software prices here are amazing; installation and support is also a cost)
  • run many NT or Linux-based probes at remote sites (implementation and support cost?)
  • use AppScout in the NetScout probes/Cisco SwitchProbes you've scattered around your network (oh, you haven't scattered probes around?)
  • Visual Networks Frame Relay or ATM probes (which measure Frame Relay or ATM statistics, not the direct application response time measurements needed for SLA's)
These all suffer from two real and major problems:
  • cost
  • scalable implementation, administration, and support
If you have to add boxes to remote sites, it gets expensive. The high-end application and service tracking software isn't cheap. Managing remote computer systems at more than a few sites is costly.

The Cisco SAA approach has one big thing going for it: you already have routers at the remote sites! Thus the basic instrumentation of the network adds little or no cost.

Where Do I Get It?

The SAA functionality appears to be included in all Cisco IOS feature sets for releases 12.0 and 12.1, for 2500 model routers and bigger, starting with Release 12.0(6) or later. The functionality is also supported in newer Route Processor components of the Catalyst switches. As far as CPU impact, if you need to conduct a little polling per site, it should be no big deal. If you have larger needs (such as a Service Provider monitoring multi-customer SLA's), then a dedicated router can do the SAA measurements, say for a remote POP, or for SLA customers connected through one or several Provider Edge routers.

Admittedly, configuring SAA on numerous routers might become somewhat tedious. Manually examining show command output is not a scalable way to manage a network. That's why Cisco has also produced some graphical configuration and reporting tools to work with the SAA code in the routers. They are Internetwork Performance Monitor (IPM) and Service Level Manager (SLM).

Another piece of the SLA management puzzle is scalable collection of data. Probes can only do a certain amount of polling, and the same goes for centralized network management stations. Hence, the ME1110 box.

The SLM application works with one or more ME1110's. SLM (and CiscoWorks 2000 version 3) can configure SAA in multiple routers via the ME1110 (which uses SNMP set to configure RTR or SAA). SLM uses the ME1110's as a distributed data repository for recent SLA compliance data in the network, periodically retrieving data to a central data repository to preserve it for longer periods of time. CiscoWorks 2000 Service Management Solution (SMS), not to be confused with any other Cisco SMS product, is a bundle consisting of the core CW2000 engine, SLM, and an ME1110. Recommended scale is approximately one SLM and ME1110 per 1000 response time measurements. You can pair the SLM server with a couple of ME1110's if more polling performance is needed. One would hope that going forward you could get even better scalability by pairing the SLM server with a larger number of ME1110's.

The following Cisco partners have announced plans to ship products by the end of 2000, using the XML interfaces of CiscoWorks2000 SLM: Agilent, Avesta, Computer Associates, Compuware, Desktalk, Fluke, Fujitsu, HP, Integrated Research, Nextpoint Networks, Paradyne, Precise Software, ProactiveNET, Response Networks, Trendium, Visionael, and Visual Networks. As you can see, this has the attention of some very serious players!

There's another business opportunity here: for the managed service provider, and for the network management systems integrator. Although the knowledge threshold has been lowered, there is still room for considerable value add based on knowing how to deploy effective SLA's with the above tools, or building a business service offering based on some subset of them.

More About SAA

SAA measurements can use either real Web, DHCP, DNS or other traffic to real servers, or can use simulated application traffic to routers configured as responders. You can think of the latter as an "application level PING". The simulated application traffic can be TCP or UDP packets with specified sizes and IP Precedence settings, which allows for simulated VoIP traffic. The IOS code also can iterate measurements to calculate variance, useful in measuring jitter for voice traffic. There's also SNA (LU or PU echo), DLSw, PING, and trace. The software is apparently smart enough to decipher trace output when there are multiple equal-cost paths in the network.
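To make the variance idea concrete, here is a minimal Python sketch of summarizing a set of repeated round trip times. It is only an illustration of the statistic, not the computation the IOS code actually performs, and the sample values are hypothetical.

```python
import statistics

# Hypothetical round trip times in milliseconds from repeated probes.
rtt_ms = [20, 22, 19, 25, 21]

mean_rtt = statistics.mean(rtt_ms)
# Population variance: how widely the samples spread around the mean.
variance = statistics.pvariance(rtt_ms)
# Standard deviation is in the same units (ms) as the measurements.
std_dev = statistics.pstdev(rtt_ms)

print(f"mean={mean_rtt} ms variance={variance} std_dev={std_dev:.2f} ms")
```

A steady delay gives a variance near zero. A large variance means packet spacing is uneven, which is what hurts voice quality.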

SAA supports a variety of SNMP Traps (thresholds): immediate, N consecutive, X of Y, or average. That is, a rising or falling threshold can immediately trigger a trap. Or, to reduce trap volume at the central trap handler, you can configure SAA to only send a trap after a certain number of consecutive threshold violations, after X out of Y measurements violate the threshold, or if the average of 5 consecutive measurements violates the threshold.
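The four threshold styles can be sketched in a few lines of Python. This is an illustration of the decision logic as described above, not the IOS implementation; the function names and sample values are made up for the example, and only the rising threshold is modeled.

```python
def immediate(samples, limit):
    """Trap as soon as the latest measurement exceeds the limit."""
    return bool(samples) and samples[-1] > limit


def consecutive(samples, limit, n):
    """Trap only when the last n measurements all exceed the limit."""
    recent = samples[-n:]
    return len(recent) == n and all(s > limit for s in recent)


def x_of_y(samples, limit, x, y):
    """Trap when at least x of the last y measurements exceed the limit."""
    recent = samples[-y:]
    return sum(s > limit for s in recent) >= x


def average(samples, limit, n=5):
    """Trap when the mean of the last n measurements exceeds the limit."""
    recent = samples[-n:]
    return len(recent) == n and sum(recent) / n > limit


# Hypothetical round trip times in milliseconds, oldest first.
rtts = [1800, 2100, 1900, 2300, 2500]
limit_ms = 2000

print("immediate:  ", immediate(rtts, limit_ms))
print("consecutive:", consecutive(rtts, limit_ms, 3))
print("x of y:     ", x_of_y(rtts, limit_ms, 3, 5))
print("average:    ", average(rtts, limit_ms))
```

With these samples the immediate, X of Y, and average checks fire, while the 3-consecutive check does not, because the 1900 ms measurement breaks the run. That is the sense in which the stricter types cut down trap volume.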

SAA can also report various error counters.

  • Errors measured for TCP: total disconnects, total timeouts, total lost connections, total dropped connections, total out of sequence packets, total number of times response contained unexpected data.
  • Errors measured for UDP: packets lost from source to destination, packets lost from destination to source, out of sequence packets, packets lost with unknown direction ("MIA"), late arriving packets, busies (couldn't start because previous measurement not complete).
The SAA history feature allows collection of n data points. The number n is configurable, using the buckets-of-history-kept command. History is not available for jitter and HTTP.

More About the ME1110

The ME1110 is a black box data toaster, so there's not that much to say. The ME1100 may become a product line, with the ME1110 as first offering.

Some hardware stats: the ME1110 is 1 RU in height, provides 10/100 connectivity, with 384 MB SDRAM, 8 MB Flash, and 9 GB Ultra II SCSI disk.

More exciting: the ME1110 can run multiple Java polling applets downloaded via HTTP. The SLM software can download software updates to associated ME's. This leaves plenty of room for third parties to add value with their own Java SNMP or other polling engines, running on the ME chassis alongside the SLM SAA data retrieval applet.

The Cisco literature on the ME1110 points out that an SAA router will hold data for 2 hours if there is a loss of connectivity between it and the collecting ME box. The ME box in turn can operate for up to 3 days without contact with the central SLM software. Thus the distributed ME boxes provide a robust data collection system without the polling limits and scalability issues of "the one huge central polling station". Coming to the ME operating system: syslog collection, to allow scalable syslog handling (CW2000 already has remote collector capability), and CDP (to simplify discovery and management of the ME boxes).

I have not seen any mention of it (so I'm not violating any NDA here), but I'm wondering "what about RADIUS, and IP Call Detail Records"? Can these be far behind? My sources tell me Service Providers have enough fun collecting phone billing records, and ISP's collecting data on user accesses. With DSL and cable modems for both data and voice, there is a vast and growing amount of data to collect, and simple robust collection appliances might be very attractive to a Service Provider. (Might beat running distributed UNIX boxes with expensive database software? Or is the volume of transactions likely to require too many ME's to be easy to manage?)

What about collecting data from some of the vast quantity of other MIB variables in Cisco equipment? Presumably some of the third party vendors will do this, in key areas. ISDN dial reporting and security come to mind.

Configuring SAA

Let's shift now from business and inspirational (?!) message to hardcore configuration. There is a large and somewhat confusing collection of commands for configuring SAA, so we'll focus on the more basic commands, to make sure you see how easy the basics are.

But first we need some terminology. A source is the SAA router which sends the test traffic. A target is the host computer the test traffic is sent to. The target may be a responder, an SAA router configured to respond to SAA measurement traffic. A collection is a combination of one source, one target, and one type of test traffic. Collection is the IPM/SLM term; the Cisco IOS documentation refers to it as an operation (formerly, probe).

To configure a router for basic SAA, it is (obviously?) going to be the source. So you have to specify the target and the type of operation or collection. Since a router may simultaneously be measuring response time for a number of servers and applications, each collection or operation is numbered, so that you have some way to refer to it.

When the target is an SAA router, the packets sent include information from an SAA control protocol. This should be disabled when measuring response time to a real host computer.

The following configuration shows how to set up measurement of the time it takes to connect to TCP port 23 (telnet) on remote host 10.1.1.1. We'll do this without using the SAA control protocol, since the responder is a server, not a router. The default measurement interval is once per minute. We change that to every 120 seconds. The example starts the measurements immediately. Since no lifetime is specified, the default of one hour (3600 seconds) applies. The tag line allows grouping of collections (operations) from the same or different routers. That is, we might give the same tag to all collections monitoring different applications on one server, or we might use the same tag on several routers for one or several applications being monitored to one server or set of servers.

Router(config)# rtr 10
Router(config-rtr)# type tcpConn dest-ipaddr 10.1.1.1 dest-port 23 control disable
Router(config-rtr)# frequency 120
Router(config-rtr)# tag TelnetPollServer1
Router(config-rtr)# exit
Router(config)# rtr schedule 10 start now
To set up but not start a VoIP jitter measurement based on 15 packets sent at 30 millisecond intervals (the interval keyword takes milliseconds):
Router(config)# rtr 200
Router(config-rtr)# type jitter dest-ipaddr 172.16.1.1 dest-port 20000 num-packets 15 interval 30
To track response time from www.cisco.com (perhaps as a way of checking remote site Internet connectivity):
Router(config)# rtr 27
Router(config-rtr)# type http operation get url http://www.cisco.com
Router(config-rtr)# timeout 10000
Router(config-rtr)# threshold 2000
Router(config-rtr)# exit
Router(config)# rtr reaction-configuration 27 threshold-type immediate action-type trapOnly
The timeout is in milliseconds: the get operation is given 10 seconds to time out. If the response time is greater than 2000 milliseconds (2 seconds), we consider this to be a rising threshold crossing (see the Threshold Manager and RMON articles). Default threshold is 5000 msec. The response to crossing the rising threshold is to send an immediate SNMP trap.

To measure UDP on port 45678 with IP Precedence 5 (loosely simulating VoIP traffic):

Router(config)# rtr 39
Router(config-rtr)# type udpEcho dest-ipaddr 10.1.1.1 dest-port 45678
Router(config-rtr)# tos 160
The ToS field is 160, corresponding to a bit pattern of 10100000. You have to remember that the IP Precedence is the left-most three bits of the ToS byte. You can also measure DiffServ as it comes in. Voice in DiffServ-speak is Expedited Forwarding (EF), corresponding to decimal 46. DiffServ uses the left 6 bits of ToS, so we left shift by two bits. To the non-programmer, this is also known as "multiplying by 4". That means Differentiated Services Code Point (DSCP) value 46 needs to be ToS value 184.
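The bit shifting above is easy to check with a short Python sketch. The helper names are invented for this example; the arithmetic just restates the paragraph above.

```python
def tos_from_precedence(precedence):
    """IP Precedence is the left-most 3 bits of the ToS byte: shift left 5."""
    if not 0 <= precedence <= 7:
        raise ValueError("IP Precedence must be in the range 0-7")
    return precedence << 5


def tos_from_dscp(dscp):
    """DSCP is the left-most 6 bits of the ToS byte: shift left 2 (times 4)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0-63")
    return dscp << 2


for label, tos in (("Precedence 5", tos_from_precedence(5)),
                   ("DSCP 46 (EF)", tos_from_dscp(46))):
    # Print the decimal value to type after "tos", plus the bit pattern.
    print(f"{label}: tos {tos} = {tos:08b}")
```

Precedence 5 comes out as 160 (10100000) and DSCP 46 as 184 (10111000), matching the values worked out by hand.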

To configure a router to be an RTR responder (for example, for the prior collection's udpEcho), configure:

Router(config)# rtr responder
For security, you can use authentication keys.

Show commands:

show rtr application               shows supported operation types and protocols
show rtr authentication            shows authentication method and name of key chain
show rtr collection-statistics     shows statistical errors for all collections or for one specified collection
show rtr configuration             shows all configuration values including defaults for all collections or for one specified collection
show rtr distributions-statistics  shows statistical information about response times
show rtr history                   shows the history: data recorded over a period of time
show rtr operational-state         shows connection losses, timeouts, and over threshold count, as well as remaining life, whether the collection is active, and completion time, among other statistics
show rtr reaction-trigger          shows any reaction triggers you've configured (used for diagnostics)
show rtr responder                 shows recent SAA control message sources, etc.
show rtr totals-statistics         shows total error and completion counts

Advanced Ways to Use SAA

For you voice aficionados, as of 12.1(3) T there's a tie from SAA to voice.

Recent history: in 12.1(1) T Cisco delivered local Voice Busyout, which tracks a local interface or group of interfaces. If it is down, the voice router presents a busyout/seized condition to the attached CPE (Customer Premises Equipment) or PBX (phone switch). The point here is, if the WAN connection goes down, you can't place your call on your network, so the PBX has to call via the PSTN (Public Switched Telephone Network).

In 12.1(3) T this has been improved upon to allow Advanced Voice Busyout (AVBO). With AVBO, you can busyout a port or group of ports if RTR/SAA response time is greater than a threshold. Both features apply to CAS trunks only. There is currently no way to track bandwidth or lack of DSP resources. What I've seen in print is that the competition may have ping-based busyout, which would seem to be a rather unreliable round-trip time estimator. This appears to be more sophisticated. Neat stuff!

Summary

Links to downloadable tutorials (30 MB or so each!) on CiscoWorks 2000 components, including IPM, SLM, and the ME1110: (Thanks, Jeff!)

There are new courses coming on basic CiscoWorks 2000 and the RWAN and LAN bundles. Mentor Technologies may be offering them around February 2001. Check our web page for the latest information and schedules. We currently offer the CVOICE and CIPT voice classes (voice devices in general and Call Manager / IP Telephony, respectively), if you're interested in voice/telephony topics.

Your comments, preferences and ideas and suggestions for topics are always more than welcome! I enjoy hearing from you!

Dr. Peter J. Welcher (CCIE #1773, CCSI #94014) is a Senior Consultant with Chesapeake NetCraftsmen. NetCraftsmen is a high-end consulting firm and Cisco Premier Partner dedicated to quality consulting and knowledge transfer. NetCraftsmen has eleven CCIE's (4 of whom are double-CCIE's, R&S and Security). NetCraftsmen has expertise including large network high-availability routing/switching and design, VoIP, QoS, MPLS, network management, security, IP multicast, and other areas. See http://www.netcraftsmen.net for more information about NetCraftsmen. New articles will be posted under the Articles link.

11/8/2000
Copyright (C) 2000, Peter J. Welcher