Moving From Reactive to Proactive: Why Your Existing Monitoring Tools are in the Way

Most financial institutions and payment processors are working hard to mature their IT operations functions. The goal is to shift from reactive troubleshooting to proactive problem discovery and isolation. Unfortunately, this is easier said than done.

Payment processors are seeing a new ‘no boundaries’ global payments model emerging. The line between application and infrastructure performance is blurring as payment systems evolve into a tangled ‘global maze’ consisting of:

  • TCP/IP, dial-up, mobility and wireless networks, national payment networks and expanding cross-border payment channels.
  • Multiple switches, data centres, applications, third party services and banking technology platforms.
  • Growing volumes of electronic transaction data, with a widening range of payment transaction message types.

IT teams face the growing challenge of managing this complex global maze. It is their responsibility to ensure payment systems can handle increasing electronic payment transaction loads and remain running at high performance and low latency. They are expected to continuously monitor expanding systems, multiple switches, various payment channels and transaction types – all with reduced headcount and declining IT budgets. They also need the flexibility to quickly expand performance monitoring capabilities to integrate new customers and payment services in a low risk, cost effective manner.

Lost in the Metric Translation Challenge

There is, no doubt, much going on behind the scenes to ensure the success of every transaction. Arguably, both network management and application tools are necessary to guarantee the health of a payment network. But deploying these types of tools alone will leave you constantly trying to piece together and translate the fragmented metrics you get (network errors, device status, server performance, etc) into the comprehensive, real-time metrics you need (transaction response times, error rates, and the resulting business impact).

The metric translation challenge refers to the time it takes to translate the metrics and statistics generated by various monitoring tools (traditional system monitors, sniffers, switch log analysers, and application and network performance monitoring tools) into the metrics IT and operations need to proactively isolate and resolve the root cause of problems impacting transaction performance.

The metric translation challenge adds costs, slows down the problem solving process, and makes it almost impossible for IT to shift from reactive troubleshooting to proactive problem resolution. This challenge manifests in every step of the problem resolution process:

Step 1: problem discovery

Problem discovery is often the result of angry inbound customer calls, problem re-creation efforts, or alerts from network and application management tools. Capturing real-time, actionable performance data with these tools can be an issue. Frequent polling mechanisms on traditional tools are sometimes turned off due to potential performance impact. In an effort to reduce traffic load on monitored systems, many of these systems may alternatively poll on five or 15 minute intervals. Thresholds are also set to suppress ‘noisy’ alerts and often require multiple events to trigger an alert, delaying the discovery of a potential transaction performance issue by several times your polling interval.
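The arithmetic here is unforgiving: if a fault appears just after a poll, and an alert requires several consecutive polled events, detection takes roughly the event count multiplied by the polling interval. A minimal sketch (the interval and event-count values are illustrative assumptions, not taken from any particular tool):

```python
# Sketch: worst-case detection delay when a fault occurs just after a
# poll and the alert threshold requires several consecutive events.
# Each observation arrives one polling interval apart, so the alert
# fires roughly events_to_alert * poll_interval after the fault.

def worst_case_detection_delay(poll_interval_min: float, events_to_alert: int) -> float:
    """Return the approximate worst-case minutes from fault to alert."""
    return poll_interval_min * events_to_alert

# A 15-minute poll requiring 3 consecutive events: ~45 minutes to alert.
print(worst_case_detection_delay(15, 3))  # -> 45
```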

This makes it tough to proactively discover and isolate problems within a transaction environment. In addition, most metrics reported by traditional monitoring tools (such as network errors and links, application log entries, device status, and server performance) are great for reactive troubleshooting and technical deep dives, but are too fragmented to translate into useful business performance indicators.

Step 2: problem identification

Once an alert is generated, a trouble ticket is created by the front line customer service team and forwarded to IT operations. IT operations needs to quickly determine how an alert is (or isn’t) impacting transaction performance or customer service, and assign an appropriate level of severity to the issue. This tends to be tough to do, as there is often a ‘performance information gap’ in the reported metrics. Important information, such as end-to-end transaction timings, response codes, query and call timings, error rates, and rates of approved, declined, reversed, failed and unsupported transactions, is not reported.

The inability to tie an alert to a meaningful change in transaction performance means you could end up treating every event equally. This translates into time wasted tracking down low priority issues and potentially missed precursors to a higher priority issue.

Step 3: problem isolation

Once the problem has been confirmed, the source of the problem still needs to be isolated. In most cases, various network, application and customer service teams that own responsibility for different parts of the payment system environment work in siloed IT departments. These teams are simultaneously logging into multiple monitoring tools to track down information. This translates into a time consuming, labour intensive ‘blamestorm’ of determining who really owns the issue. This is a serial process where the key metric for each participant is ‘mean time to innocence’. Joking aside, this is the most labour-intensive and frustrating step.

Reconstructing transaction performance also takes a lot of time and experience. It is an extremely inefficient process that takes up the cycles of people you depend upon to architect, develop and deploy new services – your tier 2 and 3 application developers and network engineers.

Step 4: problem diagnosis

Properly diagnosing the issue usually requires multiple parties (the application team, network team, customer service team and third party service providers) to spend a large amount of time piecing their fragmented data together into a holistic picture, scrambling to identify the root cause while the fix is delayed. If no symptoms are apparent, the various teams turn on diagnostics (usually left off because they exact a performance cost) and wait for the problem to happen again. The result: customer service levels take a hit, operational costs rise, and outages and slowdowns last longer.

Figure 1 illustrates how metric translation time impacts the problem resolution process.

Figure 1: Problem Resolution Using Traditional Tools: The Translation Effect

Source: INETCO

Transaction-level Monitoring Tools – A Proactive Shift in Monitoring

Emerging technologies such as transaction-level monitoring tools (also known as business transaction management (BTM) or transaction processing management (TPM)) enable a change in the way you manage your payment systems. They provide the intelligence that IT departments need to align with key business performance indicators. Transaction-level monitoring enables you to manage the one thing that has the biggest impact on customer retention and steady corporate revenue streams – the payment transaction.

Transaction-level monitoring directly addresses the metric translation challenge across all steps of the problem resolution process, making it easy for IT operations teams to shift from a high cost, labour intensive reactive troubleshooting approach to a proactive problem resolution process. Let’s look at the same steps and what changes with transaction-level monitoring:

Problem discovery is proactive

With transaction monitoring software, front-line customer service support will know you have a transaction performance problem before the customer does. This will eliminate the need for problem re-creation and decrease angry inbound customer calls. When deployed as the ‘precursor step’ to doing deep dives into network and application performance, transaction monitoring software will instantly detect transaction performance anomalies and raise a real-time alert without affecting traffic loads.

Transaction monitoring tools also address the metric translation challenge by capturing business metrics relevant across all IT departments such as:

  • Transaction volume ratios.
  • Transaction anomalies, such as high dollar volumes, recurring card swipes and decline patterns.
  • Stand-in mode detection.
  • Concurrent transaction rates.
  • Rates of approved, declined, reversed, failed and unsupported transactions.
  • User impact and geographic locations.
  • CPU utilisation per transaction for capacity planning.
  • Attempted transactions (number of transactions processed versus number attempted).
  • End-to-end transaction query and call timings (back end and front end links).
  • Transaction response codes.
  • Transaction error rates.
  • Terminal identities of the ATM/POS device.
  • PAN of the card.
  • Rate of good/bad transactions.
  • Rate of closes.
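Several of the metrics above can be derived directly from a stream of completed transaction records. A minimal sketch (the record fields `status` and `response_ms` are assumed names for illustration, not a specific product schema):

```python
# Sketch: deriving business-level metrics (approval/decline/failure
# rates and average response time) from completed transaction records.
from collections import Counter

def summarise(transactions):
    """Return per-status rates and the average response time."""
    total = len(transactions)
    statuses = Counter(t["status"] for t in transactions)
    avg_ms = sum(t["response_ms"] for t in transactions) / total
    return {
        "total": total,
        "avg_response_ms": round(avg_ms, 1),
        **{s: round(n / total, 3) for s, n in statuses.items()},
    }

sample = [
    {"status": "approved", "response_ms": 120},
    {"status": "approved", "response_ms": 180},
    {"status": "declined", "response_ms": 95},
    {"status": "failed",   "response_ms": 2400},
]
print(summarise(sample))
```

The same aggregation, run continuously over live traffic rather than in batch, is what turns raw message capture into the business indicators listed above.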

Problem identification occurs in real time

With transaction monitoring tools, you gain visibility into what the end customer is experiencing every time they execute a transaction. This means you can quickly assign a level of severity to each issue based upon service level impact. These tools also bridge transaction performance information gaps by correlating end-to-end payment transaction data flows and timings from the point of customer execution to the data centre back end, and providing a complete profile on how every transaction executes across all the application components and infrastructure tiers in your payments system environment. This complete view includes the capture of message types and response codes as the transaction enters and leaves multiple components on the network, payment channels and multiple switch platforms.
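The per-hop correlation described above can be sketched simply. This is an illustration only: it assumes each transaction carries a shared identifier and that request/response timestamps are captured at a front-end and a back-end observation point (all names hypothetical):

```python
# Sketch: correlating request/response timestamps captured at two
# observation points to attribute latency per hop of a transaction.

def per_hop_latency(front_end, back_end):
    """front_end/back_end map txn_id -> (request_ts_ms, response_ts_ms)."""
    breakdown = {}
    for txn_id, (fe_req, fe_resp) in front_end.items():
        be_req, be_resp = back_end[txn_id]
        total = fe_resp - fe_req          # end-to-end, as the customer sees it
        host_time = be_resp - be_req      # time spent past the back-end link
        breakdown[txn_id] = {
            "end_to_end_ms": total,
            "back_end_ms": host_time,
            "network_and_switch_ms": total - host_time,
        }
    return breakdown

fe = {"txn-1": (0, 250)}
be = {"txn-1": (40, 210)}
print(per_hop_latency(fe, be))
# txn-1: 250 ms end to end, 170 ms at the back end, 80 ms in between
```

With more observation points, the same subtraction pinpoints which tier is adding latency without any team having to reconstruct the flow by hand.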

Problem isolation is focused

It is more efficient and cost effective to deploy a proactive monitoring tool that can isolate potential problem areas versus taking a reactive deep dive troubleshooting approach involving multiple departments, external service providers, and customers that have to recreate scenarios.

Transaction monitoring tools eliminate blamestorms and quickly determine what kind of problem you are dealing with (i.e. application-level, network-level, third party-related), so trouble tickets and resources can be assigned more accurately. Tier 2 and 3 application developers and engineers do not have to waste valuable time reconstructing transaction performance issues. IT teams can shorten mean time to resolution rates by quickly drilling into under-performing transactions to see a complete analysis of the transaction and isolate the source of the problem.

Problem diagnosis efforts are streamlined

Transaction monitoring software will store several days’ worth of holistic transaction data and response codes. This allows support teams to analyse historical transaction performance and run inquiries to quickly isolate the root cause of an issue, shorten outages and fix latency issues. Reporting capabilities will also improve customer communication, especially during incidents, when customers will be questioning your service levels.

The Move from Reactive to Proactive Problem Resolution

In summary, there are five unique business performance insights transaction level monitoring can provide that will eliminate the metric translation challenge:

  1. Key business performance metrics.
  2. End-to-end transaction data flow visibility.
  3. Network-, application-, customer experience-, and business transaction-level data correlation.
  4. Continuous transaction monitoring and real-time alerting.
  5. Historical capture of complete transaction information.

Figure 2 illustrates the time savings possible by eliminating the translation of metrics.

Figure 2: Problem Resolution Using Transaction Monitoring: Moving from Reactive to Proactive

Source: INETCO

Transaction monitoring software allows IT teams to make a strategic shift from reactive problem resolution to proactive problem resolution. This shift will improve customer service, free up precious application and network staff time, and enable continuous operating efficiency improvements.
