WO2002042923A1

WO2002042923A1 - Service monitoring system

Info

Publication number: WO2002042923A1
Application number: PCT/US2001/043130
Authority: WO
Inventors: Timothy Keough; Gregory Keough
Original assignee: Aperserv Technologies, Inc.
Priority date: 2000-11-20
Filing date: 2001-11-20
Publication date: 2002-05-30
Also published as: AU2002225627A1

Abstract

A method, system, and computer program product for monitoring services (e.g., communications services and information server services) for compliance with a specified set of target criteria (e.g., as specified in a contract). A monitoring computer system (100) is used to monitor at least one of a monitored link (210) and a monitored server (200). By analyzing the quality of the service from a customer point-of-view, the customer can identify whether or not they are entitled to reimbursement for shortcomings in the service coverage and the service provider can determine performance problems in the services they provide. The monitoring computer system may include one or more measurement agents that request and receive performance data from the monitored server. The collected performance data is stored in a performance database (220). Such a customer point-of-view, however, does not require that the customer perform the testing, only that an entity working on behalf of the customer perform the monitoring. In a recursive case, a monitoring service may monitor the services of a monitoring service.

Description

SERVICE MONITORING SYSTEM

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates generally to methods and systems for monitoring a service (e.g., a network, communications, or information service) for compliance with specified objectives related, for example, to the performance and reliability of the service.

Discussion of the Background

As the number of users of wide-area networks (WAN) (e.g., the Internet) expands, an increasing number of companies are advertising, providing information, and processing orders on-line using information servers (e.g., Web servers). Depending on the size and sophistication of a company, it may not be possible or practical to maintain in-house the information servers and their communication networks. Accordingly, some companies offer server and communication outsourcing services that not only provide services, but also guarantee certain characteristics of those services, e.g., performance, reliability, and up-time.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a monitoring method, system, and computer program product that analyzes characteristics of a service, e.g., a network or information service, for conformance with a specified target set of those characteristics.

It is another object of the present invention to monitor performance of services and then, utilizing the parameters set forth in a specified set of target characteristics, e.g., in a service contract, to identify when equipment or services (e.g., provided by a third-party service provider) are not performing to the level of service specified.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:

Figure 1 is a schematic illustration of a computer for providing monitoring and analysis services according to the present invention;

Figure 2 is a block diagram of a monitoring system for monitoring at least one of a monitored communications link and a monitored server;

Figure 3 is an exemplary chart showing compliance with a longest allowable link downtime; and

Figure 4 is an exemplary selection of database records corresponding to the longest allowable link downtime of Figure 3.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views, Figure 1 is a schematic illustration of a computer system for providing service (e.g., network and information service) monitoring. A monitoring computer system 100 implements the method of the present invention, wherein the computer housing 102 houses a motherboard 104 which contains a CPU 106, memory 108 (e.g., DRAM, ROM, EPROM, EEPROM, SRAM, SDRAM, and Flash RAM), and other optional special purpose logic devices (e.g., ASICs) or configurable logic devices (e.g., GAL and reprogrammable FPGA). The computer 100 also includes a plurality of input devices (e.g., a keyboard 122 and mouse 124), and a display card 110 for controlling monitor 120. In addition, the computer system 100 further includes a floppy disk drive 114; other removable media devices (e.g., compact disc 119, tape, and removable magneto-optical media, not shown); and a hard disk 112, or other fixed, high-density media drives, connected using an appropriate device bus (e.g., a SCSI bus, an Enhanced IDE bus, or an Ultra DMA bus). Also connected to the same device bus or another device bus, the computer 100 may additionally include a compact disc reader 118, a compact disc reader/writer unit (not shown) or a compact disc jukebox (not shown). Although compact disc 119 is shown in a CD caddy, the compact disc 119 can be inserted directly into a CD-ROM drive that does not require a caddy. In addition, a printer (not shown) also provides printed listings of how closely the service guarantees were met.

As stated above, the computer system includes at least one computer readable medium. Examples of computer readable media are compact discs 119, hard disks 112, floppy disks, tape, magneto-optical disks, PROMs (EPROM, EEPROM, Flash EPROM), DRAM, SRAM, SDRAM, etc. Stored on any one or on a combination of computer readable media, the present invention includes software for controlling the hardware of the monitoring computer system 100 and for enabling the monitoring computer system 100 to interact with a human user. Such software may include, but is not limited to, device drivers, operating systems and user applications, such as development tools. Such computer readable media further include the computer program product of the present invention for monitoring services for compliance with guarantees. The computer code devices of the present invention can be any interpreted or executable code mechanism, including, but not limited to, scripts, interpreters, dynamic link libraries, Java classes, and complete executable programs. Computer code devices may also be downloaded across a network using a network adapter (e.g., token ring or Ethernet) as an equivalent to embedding the computer code device within a computer readable medium.

Turning now to the block diagram of Figure 2, the monitoring computer system 100 of Figure 1 is used to monitor at least one of (1) a monitored link 210 (or monitored links 210') and (2) a monitored server 200. (As would be appreciated by one of ordinary skill in the art, links 210 and 210' may be either wired (electrical or optical) or wireless and carry traffic of any protocol, e.g., Voice over IP, without departing from the teachings of the present invention.) The monitoring computer system 100, which can be considered a back-end computer, generally simulates requests of various formats, types, and natures to different services and then records and monitors the information that is returned. For each information service of a monitored server 200 to be analyzed by the monitoring computer system 100, requests are transmitted (e.g., using HTTP, ICMP, FTP, HTTPS, TELNET, SSH, WAP, and other protocols) on a per-service basis to the monitored server 200, and the details of the request made to the system (e.g., request type, exact time and date, service identification) are recorded. Preferably, the requests are formatted similar to the format expected by the server. For example, to test the up-time of a link carrying IP traffic, the request must be IP-compatible so that it is forwarded by routers as appropriate. Subsequently, again on a per-service basis, the response (e.g., from the corresponding server) is recorded and associated with the request that was made, along with the response time, any returned messages (e.g., indicating error or success), as well as protocol-specific response data (e.g., packet loss, latency, file size, file data).
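The request-and-record cycle described above can be illustrated with a minimal sketch. It is not taken from the specification: the ProbeRecord fields and the run_probe function are hypothetical names, and the actual transport (HTTP, ICMP, FTP, etc.) is abstracted behind a caller-supplied send function so that the recording logic stands on its own.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class ProbeRecord:
    """One request and its associated response (field names are illustrative)."""
    service_id: str
    request_type: str
    sent_at: datetime
    response_time_s: float
    success: bool
    message: str
    payload_size: Optional[int]


def run_probe(service_id: str, request_type: str,
              send: Callable[[], bytes]) -> ProbeRecord:
    """Issue one simulated request via `send` and record what came back."""
    sent_at = datetime.now(timezone.utc)      # exact time and date of the request
    start = time.perf_counter()
    try:
        body = send()
        success, message, size = True, "OK", len(body)
    except Exception as exc:                  # a failure is recorded, not raised
        success, message, size = False, f"{type(exc).__name__}: {exc}", None
    elapsed = time.perf_counter() - start
    return ProbeRecord(service_id, request_type, sent_at,
                       elapsed, success, message, size)
```

In such a sketch, each ProbeRecord would be one row associated with the stored request definition; protocol-specific data such as packet loss would require additional fields.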

Exemplary services include, but are not limited to, web, database, and application server services, firewall services, co-location services, routing and networking services (switches, routers, etc.), telecommunications connectivity services (e.g., T1, T3, OC3, OC48, ISDN, DSL, Frame Relay, ATM, cable), application service provider services (e.g., caching services, free email, online CRM, chat, customer support tools, streaming media services, file sharing services, productivity services, storage services, back-up services), content services (news feeds, stock feeds, information feeds, account aggregation, etc.), proprietary services (e.g., Lexis-Nexis, Bloomberg, bill payment, ATM networks), training services, conferencing services, and productivity services. Moreover, performance measurement of any TCP- or UDP-based applications, services, or providers over IP is possible.

The monitoring computer system 100 may further include one or more measurement agents that request and receive performance data from the monitored server 200. The collected performance data is subsequently stored in the performance database 220.

For example, a database measurement agent is capable of remotely and non-intrusively connecting to a database server and gathering extensive details regarding availability, response time, disk utilization, CPU utilization, back-up performance, job performance, and other system-level information. By connecting to the database via ODBC or native database connections (authentication and correct privileges are necessary to perform transactions), the database measurement agent is capable of gathering detailed information, on both the application and the system level. The database measurement agent is designed as a rule-based system provided with the information necessary to connect, authenticate, and perform transactions on the database server. Once connected and authenticated, the database measurement agent can perform standard SQL-based transactions against any portion of the database(s) and tables. Additionally, using internal components of the database system (stored procedures, extended stored procedures, etc.), the agent is capable of gathering performance information regarding the underlying system, including disk space utilization, CPU utilization, memory utilization, and other system-level information.
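As a rough illustration of a database measurement agent checking availability and response time, the following sketch connects, runs a probe query, and reports the outcome. It is an assumption-laden stand-in: Python's built-in sqlite3 module substitutes for an ODBC or native connection, and the measure_database function and its result keys are hypothetical. The conn.execute shortcut is specific to sqlite3; other drivers typically require a cursor.

```python
import sqlite3
import time
from typing import Callable, Dict, Union


def measure_database(connect: Callable[[], sqlite3.Connection],
                     probe_sql: str = "SELECT 1") -> Dict[str, Union[bool, float, str]]:
    """Connect, run one probe query, and report availability and response time."""
    start = time.perf_counter()
    try:
        conn = connect()
        try:
            conn.execute(probe_sql).fetchall()   # the SQL-based probe transaction
        finally:
            conn.close()
        return {"available": True,
                "response_time_s": time.perf_counter() - start,
                "error": ""}
    except Exception as exc:                     # connection or query failure
        return {"available": False,
                "response_time_s": time.perf_counter() - start,
                "error": str(exc)}
```

System-level figures such as disk or CPU utilization would come from vendor-specific stored procedures and are outside the scope of this sketch.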

As another example, a storage area network (SAN) measurement agent is capable of remotely and non-intrusively connecting to a SAN to determine if the SAN is available, responding appropriately, has capacity, etc. The SAN measurement agent first connects to the SAN, e.g., via a TCP/IP connection, or possibly over a proprietary IP address class and subnet. The agent sends requests to the SAN to determine the response of the SAN to the requests. Additionally, the SAN measurement agent queries the SAN to determine the state of transactions on the system and logs the performance as well as the transaction problems that have occurred or are occurring. The SAN measurement agent is also capable of querying the SAN to determine the utilization of the SAN, thereby providing insight into capacity utilization and the need to upgrade or expand capacity.

Similarly, Voice-over-IP (VoIP) systems can be measured by a measurement agent by connecting to the service (via a TCP/UDP connection over IP). Performance of VoIP systems can be measured in terms of response time, availability, quality, and data integrity.

In addition to directly gathering information from a monitored server (e.g., through management agents), the monitoring computer system 100 can also download relevant performance information from external data sources (programs, systems, or databases). Examples of such sources include HP OpenView, Microsoft, Citrix Systems, Tivoli Systems, Network Associates, NetScout Systems, BMC, InfoVista, Micromuse, Computer Associates, and Compuware Corporation.

Similar to the management/analysis of servers, for each communications link 210 to be analyzed that is directly connected to the monitoring computer system 100, the monitoring computer system 100 transmits at least one message on that link 210 to determine if the link 210 is in the intended state (e.g., up or down). The monitoring computer system 100, however, need not be directly connected to the link 210' to be monitored. The monitoring computer system 100 may send messages (e.g., via gateways, routers, or bridges) to other links 210' that are not directly connected to the monitoring computer system 100.

As shown in Figure 2, the monitoring computer system 100 is connected (either directly or via a network, such as a WAN) to a database 220. By knowing the rules for making the request (request type, server identification (e.g., URL, IP Address, port number), and other request parameters (e.g., file path, packet size)), the monitoring computer system 100, or a sub-portion thereof including a monitoring application, is able to make the appropriate request, record the results (page data, file size returned, response time, packet loss, latency, etc.), and appropriately associate them with the request information stored in the database 220. Preferably the monitoring computer system 100 tracks all of the individual components of the request as applied to the monitored server or service. In addition, the monitoring computer system 100 records the status of each stage of the transaction, the description/messages returned, as well as the total time to perform each step in the transaction including, but not limited to, DNS look-ups, making the request, initial responses from the servers/service, initial data transmission, and data transmission completion.
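The per-stage tracking described above (DNS look-up, making the request, initial response, data transmission) can be sketched as a small timer that records a status and an elapsed time for each named stage. The TransactionTimer class and the stage names are illustrative assumptions, not part of the specification.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class TransactionTimer:
    """Records elapsed time and status for each named stage of one transaction."""

    def __init__(self) -> None:
        self.durations: Dict[str, float] = {}
        self.status: Dict[str, str] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
            self.status[name] = "ok"
        except Exception as exc:
            self.status[name] = f"error: {exc}"   # keep the returned message
            raise
        finally:
            # The duration is recorded whether the stage succeeded or failed.
            self.durations[name] = time.perf_counter() - start

    def total(self) -> float:
        """Total time across all recorded stages."""
        return sum(self.durations.values())
```

A caller would wrap each step in its own block, for example `with timer.stage("dns_lookup"): ...` followed by `with timer.stage("request"): ...`, and then store `timer.durations` and `timer.status` alongside the request record.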

The monitoring computer system 100 preferably also performs monitoring based upon set limits and definitions of the type of monitoring that is required. This is especially useful in a monitoring service that simulates different bandwidth speeds (14.4K modem, 28.8K modem, 33.6K modem, 56K modem, 128K ISDN, 256 Fractional T1/DSL, as well as all different bandwidth dedicated connections).

The monitoring computer system 100 further includes a compliance application that analyzes the performance data received by the monitoring computer system 100. Such an analysis can be performed in real-time (e.g., looking for links 210 that are down) or off-line using collected real-time information from the database 220. To have a set of target performance characteristics, e.g., regarding service availability, against which to compare actual measured characteristics, the present invention preferably receives information through a web-based, front-end application or interface (described below). Such characteristics can represent the contractual obligations of the service provider (e.g., a service level agreement (SLA)) that are to be monitored by the compliance application or comprise a company's internal quality requirements (e.g., for ISO 9000 compliance). The compliance application compares the performance data to the performance characteristics that are specified (or guaranteed for a service provider) and compiles the information into a report indicating whether or not the service or link is meeting its specified goals, e.g., contractual obligations.

The compliance application further provides information display capabilities (e.g., timeline-based error charts) portraying the compliance with specified performance characteristics, as shown in the exemplary bar chart of Figure 3. Such a display may be updated either in real-time or periodically and is preferably based upon a variety of factors. The compliance application customizes the compliance analysis based upon the specifications that are entered into the system and is able to determine non-compliance based upon single, multiple, or cumulative failures. For example, if a provider guarantees that the web site that it is hosting will be up and available (e.g., the web server is responding with the correct requested web page) 100% of the time, and if the compliance application finds any occurrence of a failure, the service provider is in violation of its contractual obligations. Additionally, if the service provider guarantees no more than five failures, the compliance application can keep track of the number of failures and determine the level of compliance by the service provider. Similarly, as shown in Figure 4, if the service provider guarantees that a link will be down no more than a predetermined length of time (e.g., 10 seconds) in any day, then the compliance engine can analyze, from real-time or stored data, the longest period of time that the link was down (and preferably when that outage occurred).
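The failure-count and longest-downtime checks described above can be sketched as follows. This is only one possible reading: it assumes the stored data is a time-ordered list of (timestamp, link-was-up) samples, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Each sample is (timestamp, link_was_up); samples are assumed sorted by time.
Sample = Tuple[datetime, bool]


def longest_outage(samples: List[Sample]) -> Tuple[timedelta, Optional[datetime]]:
    """Return the longest observed down period and the time it began."""
    longest, longest_start = timedelta(0), None
    down_since: Optional[datetime] = None
    for ts, up in samples:
        if not up and down_since is None:
            down_since = ts                       # outage begins
        elif up and down_since is not None:
            if ts - down_since > longest:
                longest, longest_start = ts - down_since, down_since
            down_since = None                     # outage ends
    if down_since is not None:                    # still down at the last sample
        tail = samples[-1][0] - down_since
        if tail > longest:
            longest, longest_start = tail, down_since
    return longest, longest_start


def count_failures(samples: List[Sample]) -> int:
    """Count distinct outages, i.e. transitions from up to down."""
    failures, prev_up = 0, True
    for _, up in samples:
        if prev_up and not up:
            failures += 1
        prev_up = up
    return failures


def is_compliant(samples: List[Sample], max_outage: timedelta,
                 max_failures: int) -> bool:
    """True if neither the outage-length nor the failure-count guarantee is broken."""
    outage, _ = longest_outage(samples)
    return outage <= max_outage and count_failures(samples) <= max_failures
```

With a 10-second guarantee, a day containing a single 12-second outage would be reported as non-compliant, and longest_outage also returns when that outage began.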

The compliance application can also do an analysis of the accuracy and validity of an actual response in comparison with an expected response. Additionally, the response data can be analyzed against other factors, including the last update of the data. The application is capable of doing these types of response analytics based upon the data that is returned and stored in response to the monitoring computer system requests.

Further, the compliance application can generate messages (e.g., email, instant messages, phone calls) automatically and in real-time to at least one of the provider and the subscriber if a provider is failing and is not complying with the signed service agreement. These messages can be custom tailored to notify only when the compliance application determines that it meets the customer's predefined messaging requirements. These requirements can be based upon any of the metrics that are tracked and monitored by the monitoring application.

In a preferred embodiment, the compliance application analyzes the performance data gathered by the measurement agents, either periodically or continuously, to determine if the data indicates that a performance problem is occurring or is about to occur. For example, a problem can be indicated when packet loss on a network is exceeding 0.5%, although contractual terms state that a violation is when packet loss exceeds 1.0%. The compliance application is capable of identifying this degraded level of performance as a problem, even though it may not be exceeding the limits defined in the contract or SLA.
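The two-level comparison in the packet-loss example can be sketched as below, using the 0.5% problem level and the 1.0% contractual level from the text as default values. The Status names and the classify_packet_loss function are illustrative assumptions.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    DEGRADED = "degraded"      # a problem, but not yet a contract violation
    VIOLATION = "violation"    # exceeds the contractual limit


def classify_packet_loss(loss_pct: float,
                         warn_above: float = 0.5,
                         contract_limit: float = 1.0) -> Status:
    """Compare measured packet loss with a problem level and a contract level."""
    if warn_above > contract_limit:
        raise ValueError("warning threshold must not exceed the contract limit")
    if loss_pct > contract_limit:
        return Status.VIOLATION
    if loss_pct > warn_above:
        return Status.DEGRADED
    return Status.OK
```

Under these defaults, a measured loss of 0.7% is classified as degraded rather than as a violation, which is the early-warning case the paragraph describes.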

The compliance application preferably analyzes performance data and identifies performance issues using a formulaic or rule-based approach. The compliance application analyzes the results of an entire request-response transaction, including each step during the process. It then compares the results to predefined levels of performance that indicate problems as well as to the levels stated in contracts or SLAs. This comparison is not based upon straight comparisons of performance levels (packet loss as a percentage) only, but also upon performance relative to a period of time (packet loss as a percentage during some time period (month, week, day, year, or subset thereof)). Moreover, the compliance application automatically accounts for maintenance windows and scheduled downtimes when analyzing the data for service performance problems. Additionally, the compliance application is capable of defining contract violations based upon response time of the service provider when problems occur. For example, a contract might indicate that the service provider must communicate with the client within two hours of identifying a problem.
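One way to account for maintenance windows when measuring performance over a period is to subtract any overlap with a scheduled window from each outage. The sketch below makes several assumptions not stated in the text: outages and windows are (start, end) intervals, windows do not overlap one another, outages fall inside the measured period, and availability is computed against in-service time only.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # (start, end), with start <= end


def overlap(a: Interval, b: Interval) -> timedelta:
    """Length of the time shared by two intervals (zero if they are disjoint)."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return max(end - start, timedelta(0))


def chargeable_downtime(outages: List[Interval],
                        maintenance: List[Interval]) -> timedelta:
    """Total outage time, minus any part inside a maintenance window."""
    total = timedelta(0)
    for outage in outages:
        excused = sum((overlap(outage, w) for w in maintenance), timedelta(0))
        total += (outage[1] - outage[0]) - excused
    return total


def availability_pct(period: Interval, outages: List[Interval],
                     maintenance: List[Interval]) -> float:
    """Availability over the in-service part of the period, as a percentage."""
    in_service = (period[1] - period[0]) - sum(
        (overlap(period, w) for w in maintenance), timedelta(0))
    down = chargeable_downtime(outages, maintenance)
    return 100.0 * (1 - down / in_service)
```

For a one-day period with a one-hour maintenance window, an outage that lies half inside the window counts only for the half that falls outside it.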

The compliance application is capable of identifying issues that require attention and resolution on the part of the providers. In the preferred embodiment, the compliance application connects directly with the service provider's trouble-ticket system, creates a trouble ticket on the identified issue providing details as to the problem and location, and tracks the progress of the problem until it is fixed and closed. In addition to integrating with trouble-tracking systems, the compliance application is capable of sending emails, pages, text messages, and instant messages, and placing phone calls when problems occur, allowing for a flexible methodology to notify the provider of the service problems.

The compliance application is capable of looking at the performance measurement data including latency, packet loss, response time, error messages, status codes, response validation, etc. to determine if service problems are occurring. These are issues in which there is service degradation prior to complete service failure. By comparing the performance results to the continuously updated definitions of a performance problem, the compliance application is capable of defining performance as problematic in order to resolve the problem prior to it resulting in a failure. Additionally, the compliance application determines the owner of the component that is performing poorly, attributing the component (router, switch, server, etc.) to the provider or client that is responsible for the component.

Once a problem is discovered, the compliance application preferably connects to the provider or client's trouble-ticket/reporting system (e.g., via a TCP/IP-based connection (HTTP, HTTPS, email, SNMP, RMON, other TCP connection) or another communication protocol-based connection). The compliance application creates a trouble ticket for the problem that has been identified. This trouble ticket includes detailed performance information regarding the problem, including any latency, packet loss, response time, error messages, status codes, response content, etc. that is applicable to the problem. This creates an actionable trouble ticket with full, detailed information.

Additionally, the compliance application tracks the progress of the trouble ticket after it has been created, by updating its status as necessary, associating all provider or client communications with the problem, continuously measuring the performance of the service to determine when the problem has been fixed, and finally closing the trouble ticket when it is fixed. The compliance application thereby monitors the issue from identification to closure.
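The ticket life cycle described above (creation with performance details, association of communications, continued measurement, and closure) can be sketched as a small record with a status field. The TroubleTicket class is hypothetical, and the rule of closing after three consecutive healthy measurements is an assumption chosen only for illustration; the text does not specify how a fix is confirmed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TroubleTicket:
    component: str
    owner: str                      # provider or client responsible for the component
    details: Dict[str, float]       # e.g. latency, packet loss, response time
    status: str = "open"
    healthy_streak: int = 0
    history: List[str] = field(default_factory=list)

    def add_communication(self, note: str) -> None:
        """Associate a provider or client communication with this problem."""
        self.history.append(note)

    def observe(self, healthy: bool, close_after: int = 3) -> None:
        """Update the ticket from one new measurement of the service."""
        if self.status == "closed":
            return
        self.healthy_streak = self.healthy_streak + 1 if healthy else 0
        if self.healthy_streak >= close_after:
            self.status = "closed"
            self.history.append("closed: service measured healthy")
        else:
            self.status = "in_progress"
```

Submitting such a ticket to a real trouble-ticket system would depend on that system's own interface, which this sketch does not attempt to model.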

The same compliance application is capable of connecting to different provider and client trouble-ticketing/reporting systems using the flexible, rule-based framework that is the basis for the measurement agents. This provides the ability to connect to several different types of systems via a single interface in the compliance application. By passing parameters such as the trouble-ticket server address, the protocol, the message format, authentication credentials, and other pertinent information, the compliance application is capable of opening tickets in a variety of trouble-reporting and tracking systems at remote locations. This type of integration can occur with many types of systems, including Tivoli, CA's Unicenter, other proprietary systems, etc.

The monitoring computer system 100 preferably further includes a prediction engine that uses the large quantities of data that are gathered by the measurement agents. The prediction engine analyzes the performance data, either periodically or continuously, and compares the information to historic information collected in the past. Using a proprietary algorithm, the prediction engine analyzes the data to determine if current problems are indicators of future problems (increased traffic levels, utilization, etc.) or one-time, fluke events (e.g., power cable accidentally unplugged, network cable accidentally unplugged). Additionally, using detailed knowledge of the systems and services that are employed by the different providers and clients, the prediction engine is capable of tailoring a prediction based upon the type, location, function, and performance of a component.
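The prediction algorithm itself is described only as proprietary, so it cannot be reproduced here. As a stand-in, the following sketch uses a simple and well-known substitute technique: recent readings are compared with a limit derived from the mean and standard deviation of historic readings, and an excursion is labeled as a trend only if a sufficient fraction of recent readings exceed that limit. The function name, labels, and default thresholds are all assumptions.

```python
from statistics import mean, pstdev
from typing import List


def classify_anomaly(history: List[float], recent: List[float],
                     z_threshold: float = 3.0,
                     sustained_fraction: float = 0.5) -> str:
    """Label recent readings as 'normal', 'one_time', or 'trend'."""
    baseline, spread = mean(history), pstdev(history)
    limit = baseline + z_threshold * spread       # upper bound of "usual" values
    high = [x for x in recent if x > limit]
    if not high:
        return "normal"
    if len(high) / len(recent) >= sustained_fraction:
        return "trend"                            # sustained rise, e.g. growing utilization
    return "one_time"                             # isolated spike, e.g. an unplugged cable
```

A production system would presumably also weigh the type, location, and function of the component, as the paragraph notes; this sketch considers only the numeric readings.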

As discussed above, the performance characteristics to be tracked and their possible ranges are preferably entered via a web-based front-end application. The front-end application is preferably a form-driven mechanism through which the specifics of the service are entered (those needed to perform the monitoring), including the request type, server identification (e.g., URL, IP Address, Port Number), and other request parameters. This information is validated for formatting and data accuracy, and can be additionally checked by simulating a request to the service that is indicated. After validating the request, the data is then stored in the performance database 220 and is accessible to the compliance application and the prediction engine.
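The formatting checks applied to the service definition might resemble the following sketch. The form field names (request_type, target, port) and the validate_service_definition function are hypothetical; the list of request types is taken from the protocols named earlier in the description.

```python
import ipaddress
from typing import Dict, List
from urllib.parse import urlparse

ALLOWED_REQUEST_TYPES = {"HTTP", "HTTPS", "FTP", "ICMP", "TELNET", "SSH", "WAP"}


def validate_service_definition(form: Dict[str, str]) -> List[str]:
    """Return a list of validation errors; an empty list means the form is valid."""
    errors: List[str] = []
    if form.get("request_type", "").upper() not in ALLOWED_REQUEST_TYPES:
        errors.append("unsupported request type")

    target = form.get("target", "")
    parsed = urlparse(target)
    if not (parsed.scheme and parsed.netloc):
        try:
            ipaddress.ip_address(target)          # accept a bare IP address
        except ValueError:
            errors.append("target must be a URL or an IP address")

    port = form.get("port", "")
    if not (port.isdecimal() and 1 <= int(port) <= 65535):
        errors.append("port must be an integer from 1 to 65535")
    return errors
```

A form that passes these checks could then be confirmed by simulating one request to the indicated service before the definition is stored.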

Additionally, the front-end application receives information related to the corresponding guarantees (e.g., contract requirements regarding up-time, response time, DNS look-up time, connection time, download time, packet loss, latency, etc.). In an alternate embodiment, the front-end application also accepts an indication as to the level of failure (number of occurrences) equivalent to non-compliance. This information is also validated for accuracy (numeric values, alpha values, formatting, acceptable ranges, etc.) to ensure that the data is accurate. This information is then stored in the performance database 220 utilized by the compliance application to perform the analytical and notification functions of the service. The compliance application can then automatically calculate, for example, the discounts that should be applied to the service contracts based upon the performance of the provider during the invoice period.
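The automatic discount calculation can be illustrated with a tiered credit schedule keyed on measured availability. The schedule below is entirely hypothetical; actual thresholds and credit percentages would come from the contract terms entered through the front-end application.

```python
from typing import List, Tuple

# Hypothetical schedule: (minimum availability %, credit as % of invoice),
# ordered from best to worst service. Real figures would come from the contract.
CREDIT_SCHEDULE: List[Tuple[float, float]] = [
    (99.9, 0.0),
    (99.0, 10.0),
    (95.0, 25.0),
    (0.0, 50.0),
]


def service_credit(invoice_amount: float, measured_availability_pct: float,
                   schedule: List[Tuple[float, float]] = CREDIT_SCHEDULE) -> float:
    """Return the discount owed for the invoice period under the given schedule."""
    for min_availability, credit_pct in schedule:
        if measured_availability_pct >= min_availability:
            return round(invoice_amount * credit_pct / 100.0, 2)
    raise ValueError("availability must be non-negative")
```

Under this assumed schedule, a provider that delivered 99.5% availability on a 1000.00 invoice would owe a credit of 100.00.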

Preferably the front-end application provides robust reporting tools that allow a human user to view reports of the performance of the services that are monitored, including real-time analysis and data, historic and trend information, costs/savings estimates (including exact/projected dollar savings calculations), and other projections. Additionally, the front-end application preferably generates reports based upon the monitored industry and service, providing company comparison reports detailing the results for each of the individual service providers in the report in comparison to their competitors.

Additionally, reports are preferably customizable and dynamic based upon certain parameters (e.g., date range, request type, response information, industry, service provider, etc.) that can be specified at report-generation time. This data then can be used to maintain a ranking of the providers with respect to the services that are provided.

Other exemplary services that can be monitored include: Online Credit Card Processing, electronic Procurement services, Internet Faxing, ASP services (e.g., E-Commerce ASP services, ERP ASP services, Human Resource ASP services, Financial/Accounting ASP services, Supply Chain Management ASP services, CRM services), and electronic monitoring services generally. Both publicly available services (web sites, ASP services, etc.) and private, internal services and service level agreements can be monitored. This can include intra-company service level agreements between departments and custom performance monitoring based upon these agreements.

In yet another alternate embodiment, the front-end application receives other performance data to be entered into the performance database 220 and analyzed by the compliance application and/or prediction engine. This data, which can be entered through the web-based interface, includes data necessary to perform the compliance analysis. This is especially useful for using the compliance engine to measure and analyze compliance of services that are not electronic. Exemplary non-electronic services include call center response and auditing.

Obviously, numerous modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that, within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.

Claims

CLAIMS:

1. A computer program product, comprising: a computer storage medium and a computer program code mechanism embedded in the computer storage medium for causing a computer to monitor characteristics of a service including at least one of a server and a communications link, the computer program code mechanism comprising: a first computer code device configured to receive an indication of the service to be monitored; a second computer code device configured to receive tests to be performed to test the characteristics of the service to be monitored; a third computer code device configured to receive results of the tests performed to test the characteristics of the service to be monitored; a fourth computer code device configured to receive target ranges for the characteristics of the service to be monitored; and a fifth computer code device configured to indicate to a customer of the service whether the characteristics of the service to be monitored are within the target ranges.

2. The computer program product as claimed in claim 1, wherein one of the characteristics is up-time over a specified period.

3. The computer program product as claimed in claim 1, wherein one of the characteristics is longest down-time over a fixed period.

4. The computer program product as claimed in claim 1, wherein one of the characteristics is transactions per second.

5. The computer program product as claimed in claim 1, wherein the service comprises a web service.

6. The computer program product as claimed in claim 1, wherein the service comprises a database service.

7. The computer program product as claimed in claim 1, wherein the service comprises a communications link service.

8. The computer program product as claimed in claim 1, wherein the service comprises an optical communications link service.

9. The computer program product as claimed in claim 1, wherein the service comprises a wireless communications link service.

10. A method for causing a computer to monitor characteristics of a service including at least one of a server and a communications link, the method comprising the steps of: receiving an indication of the service to be monitored; receiving tests to be performed to test the characteristics of the service to be monitored; receiving results of the tests performed to test the characteristics of the service to be monitored; receiving target ranges for the characteristics of the service to be monitored; and indicating to a customer of the service whether the characteristics of the service to be monitored are within the target ranges.

11. The method as claimed in claim 10, wherein one of the characteristics is up-time over a specified period.

12. The method as claimed in claim 10, wherein one of the characteristics is longest down-time over a fixed period.

13. The method as claimed in claim 10, wherein one of the characteristics is transactions per second.

14. The method as claimed in claim 10, wherein the service comprises a web service.

15. The method as claimed in claim 10, wherein the service comprises a database service.

16. The method as claimed in claim 10, wherein the service comprises a communications link service.

17. The method as claimed in claim 10, wherein the service comprises an optical communications link service.

18. The method as claimed in claim 10, wherein the service comprises a wireless communications link service.

19. A system, comprising: a first receiver configured to receive an indication of a service to be monitored; a second receiver configured to receive tests to be performed to test characteristics of the service to be monitored; a third receiver configured to receive results of the tests performed to test the characteristics of the service to be monitored; a fourth receiver configured to receive target ranges for the characteristics of the service to be monitored; and a compliance calculator configured to indicate to a customer of the service whether the characteristics of the service to be monitored are within the target ranges.

20. The system as claimed in claim 19, wherein one of the characteristics is up-time over a specified period.

21. The system as claimed in claim 19, wherein one of the characteristics is longest down-time over a fixed period.

22. The system as claimed in claim 19, wherein one of the characteristics is transactions per second.

23. The system as claimed in claim 19, wherein the service comprises a web service.

24. The system as claimed in claim 19, wherein the service comprises a database service.

25. The system as claimed in claim 19, wherein the service comprises a communications link service.

26. The system as claimed in claim 19, wherein the service comprises an optical communications link service.

27. The system as claimed in claim 19, wherein the service comprises a wireless communications link service.