
Where legacy operations management comes up short

Most legacy operations management products miss the point and come up short. Many of these products, although implemented, are ghosts: they lurk around in buildings with limited or no business benefit. Their primary focus is on monitoring outages. Such a product is often referred to as a RAG tool: Red, Amber and Green, where Red signifies down, Amber signifies intermittent connectivity problems and Green signifies good connectivity. This serves a limited business purpose and cannot justify any return on investment. Product development has focused on reporting on acceptable situations, rather than giving equal focus to operation in dire straits. How is this limited view operational monitoring? All it does is give you a comfortable feeling. With this approach, there is no difference between the value proposition of a cheap 'keep-alive pulse' and that of an operational management framework product, using more complex protocols, worth millions.
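To make the point concrete, the entire logic of a RAG tool fits in a few lines. This is a minimal sketch, assuming a probe that yields a success ratio; the thresholds are illustrative, not taken from any vendor.

```python
# Minimal sketch of the RAG "keep-alive pulse" view.
# The success-ratio input and thresholds are illustrative assumptions.

def rag_status(success_ratio: float) -> str:
    """Map a probe success ratio to a Red/Amber/Green status."""
    if success_ratio >= 0.99:
        return "GREEN"   # good connectivity
    if success_ratio > 0.0:
        return "AMBER"   # intermittent connectivity problems
    return "RED"         # down
```

That a tool costing millions can reduce to this classification is exactly the value-proposition problem described above.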
What is required is to report against the business service level agreement while providing causation for the downtime. Outages should be reported from an outage-timeline perspective and aggregated to reveal a pattern or trend. It is not the total length of the outage that is important, but the crucial time periods within it that provide the metrics. These metric times are aligned with ITIL's expanded incident life-cycle.
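An outage timeline of this kind can be sketched as a record of life-cycle timestamps from which the metric periods are derived. The field names below are illustrative assumptions loosely following ITIL's expanded incident life-cycle, not a vendor schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class OutageTimeline:
    # Illustrative life-cycle timestamps, in chronological order.
    occurred: datetime
    detected: datetime
    diagnosed: datetime
    determined: datetime   # fix determined; "rolling wheels" ends here
    repaired: datetime
    recovered: datetime
    restored: datetime

    def phase_durations(self) -> dict:
        """Break the outage into its metric periods, not just total length."""
        stamps = [self.occurred, self.detected, self.diagnosed,
                  self.determined, self.repaired, self.recovered, self.restored]
        names = ["detection", "diagnosis", "rolling wheels",
                 "repair", "recovery", "restoration"]
        return {n: (b - a) for n, a, b in zip(names, stamps, stamps[1:])}
```

Aggregating these per-phase durations across many outages is what exposes the pattern or trend the paragraph above calls for.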
The period between diagnosis and determination is often referred to as rolling wheels, because that is when a technician is physically dispatched to resolve an outage. There has always been an industry-wide infatuation with “9’s”. Tools that report only on 99.9% availability have limited meaning because:

  • We cannot ascertain which services as opposed to which assets were affected;
  • We are unable to establish the business impact and resulting consequences;
  • We don’t have insight into the prevailing conditions during the outage;
  • The resolution or temporary fix is unknown;
  • The visual, proximate and root causes have not been annotated;
  • We don’t know the resources assigned to work on the outage; and
  • The outage classification and prioritisation are unknown, and we are unable to assess the risk and mitigating actions.
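The questions above amount to a data model. A hypothetical incident record covering those fields might look like this; every field name and value is an illustrative assumption, not a real schema.

```python
# Sketch of the incident fields needed to answer the questions above.
# All names and values are hypothetical, for illustration only.
incident = {
    "services_affected": ["payments-api"],        # services, not just assets
    "business_impact": "card payments declined during peak trading",
    "prevailing_conditions": "planned power maintenance at the site",
    "resolution": "failover to standby supply (temporary fix)",
    "causes": {"visual": "rack dark",
               "proximate": "PDU failure",
               "root": "worn transfer switch"},
    "assigned_resources": ["ops-engineer-1", "field-tech-7"],
    "classification": "major",
    "priority": 1,
    "risk_and_mitigation": "single supply until part arrives; spare ordered",
}
```

A tool that captures nothing beyond a Red/Amber/Green state can populate none of these fields.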

No operations management product in existence answers the above questions in a suitable manner. As an example: does a historical snapshot exist of the other outages that were in progress at the time of a major incident with severe negative business consequences?
Crucially, the monitoring of downtime needs to record the delta between the time the actual outage happened and the time the outage was acknowledged and escalated to the responsible stakeholder via a notification process. If there was an outage with no notification, it needs to appear in an exception report.
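That exception report is straightforward to produce once the two timestamps are recorded. A minimal sketch, assuming each outage is a dict with an `occurred` timestamp and an optional `notified` timestamp; the structure and the 15-minute threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta

def notification_exceptions(outages, max_delta=timedelta(minutes=15)):
    """Flag outages that were never notified, or whose notification delta
    exceeded the acceptable threshold."""
    report = []
    for o in outages:
        notified = o.get("notified")
        if notified is None:
            report.append((o["id"], "no notification sent"))
        elif notified - o["occurred"] > max_delta:
            report.append((o["id"],
                           f"notification delta {notified - o['occurred']}"))
    return report
```

An outage with no notification at all is the most serious case, which is why it is flagged unconditionally.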
Every operations management product vendor bases their sales pitch on the promise of a faster detection time for an outage. This is pure Kool-Aid: in most cases, when the downtime of an outage is analysed, detection time is the smallest contributor to the overall length of the outage. Why would anyone buy a tool whose primary purpose is to address the smallest requirement?

Another misconception is that an operations monitoring tool will somehow miraculously prevent an outage. The reality is that significant havoc in technology happens, and if your tool is focused only on prevention, it will fail when you try to use it to cure. The value of a tool is in how it deals with an outage, because the HOW is the major factor in driving service improvements. Optimal management of outages results in two trends:

  • the time between outages increases; and
  • the length of the outage reduces.

There is no value in your operational monitoring tool if it cannot influence these trends and, worse, if it cannot report on or measure them.
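Measuring the two trends needs nothing more than the outage history. The sketch below uses a naive first-half versus second-half comparison over gap and length figures in hours; the split-half method is an illustrative assumption, not a statistical recommendation.

```python
def trends_improving(gaps_hours, lengths_hours):
    """Check the two trends: the gap between outages should grow,
    and the outage length should shrink, over the history.
    Naive split-half comparison; inputs are lists of hours."""
    def avg(xs):
        return sum(xs) / len(xs)
    mid_g, mid_l = len(gaps_hours) // 2, len(lengths_hours) // 2
    gaps_early, gaps_late = gaps_hours[:mid_g], gaps_hours[mid_g:]
    len_early, len_late = lengths_hours[:mid_l], lengths_hours[mid_l:]
    return avg(gaps_late) > avg(gaps_early) and avg(len_late) < avg(len_early)
```

Any tool that records outage start and end times can compute this; a tool that cannot is, by the argument above, of no value.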
The next step is diagnosis, and diagnosis monitoring is core to this. What often happens is that an outage occurs; an engineer diagnoses the issue and, after initial investigation, proposes a solution. Often this knowledge remains in the engineer’s head and is never recorded or annotated in a knowledge base. If the practice of recording incidents and their resulting resolutions were consistently followed, the diagnosis time of an incident would consistently reduce. A tool that provided this ability would immediately have a justifiable return on investment.
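The knowledge-base mechanism described above can be sketched very simply. Matching on an exact symptom key is an illustrative assumption; a real system would use fuzzy or keyword matching, and the example symptom and resolution below are hypothetical.

```python
# Sketch of recording diagnoses so later incidents resolve faster.
knowledge_base = {}

def record_resolution(symptom: str, diagnosis: str, resolution: str):
    """Capture what the engineer learned, instead of leaving it in their head."""
    knowledge_base[symptom] = {"diagnosis": diagnosis, "resolution": resolution}

def suggest(symptom: str):
    """Return a prior diagnosis and resolution for a known symptom, if any."""
    return knowledge_base.get(symptom)

# Hypothetical example entry:
record_resolution("BGP session flapping on edge-1",
                  "MTU mismatch after circuit upgrade",
                  "correct interface MTU and clamp MSS")
```

The second time the symptom appears, diagnosis time collapses to a lookup, which is exactly the return on investment argued for above.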
The repair-recovery-restore time periods are not difficult to measure if an operations management product is service aware, as these metric times will be obvious.
To be of value, the output required is an analysis of the consequences of downtime:

  • what is the total outage time and what was the time of degradation (brown outs versus black outs)?
  • what is the deviation from the norm for this outage compared against the historical trend data?
  • what are the average incident times for: detection, diagnosis, repair, recover and restore?
  • what are the top services affected by outages?
  • what are the top causes of outages?
  • what are the MTTR (mean time to repair), MTBF (mean time between failures) and MTBSI (mean time between service incidents)?
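The last of these metrics follow directly from the outage history. The sketch below uses common conventions for MTTR, MTBF and MTBSI over an observation period; these formula choices are an assumption, as definitions vary between frameworks.

```python
def availability_metrics(outage_lengths_hours, period_hours):
    """MTTR, MTBF and MTBSI from outage lengths (hours) within an
    observation period. Common conventions, assumed for illustration:
    MTTR = total downtime / incidents, MTBF = total uptime / incidents,
    MTBSI = period / incidents."""
    n = len(outage_lengths_hours)
    downtime = sum(outage_lengths_hours)
    uptime = period_hours - downtime
    return {
        "MTTR": downtime / n,
        "MTBF": uptime / n,
        "MTBSI": period_hours / n,
    }
```

With the same history, the deviation-from-norm and top-causes questions above become simple aggregations as well.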

There are many possible causes of extended downtime periods. How does a RAG tool assist in resolving them?

  • Long detection times or even misses;
  • Diagnostics;
  • Logistic issues delaying repair;
  • Slow recovery, like having to rebuild from scratch as there is no known last configuration;
  • Slow return to service even though the device is recovered; and
  • No workarounds being available or documented.

Operations management needs to step up a gear from the current RAG mindset: it really isn’t about the traffic lights, but about driving the vehicle!
The influx of Internet of Things (IoT) tools and sensors will improve operational management in two crucial ways:

  • Expand the dimensions of operations that can be monitored; and
  • Improve the effectiveness and efficiency of rolling wheels.
As an example, network connectivity tools have basic problems with causation. When an outage happens, the proximate cause is typically either a network failure or a power failure. IoT devices can provide this causation and assist further root cause analysis by using additional sensors such as presence, door contacts, vibration and even smoke sensors.
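The correlation step can be sketched as follows. This is a minimal, hypothetical example: the sensor event format and the five-minute correlation window are assumptions, not a description of any real product.

```python
from datetime import datetime, timedelta

def probable_cause(outage_start, sensor_events, window=timedelta(minutes=5)):
    """Suggest a proximate cause for an outage by correlating it with
    nearby IoT sensor events (power, door contact, vibration, smoke).
    sensor_events is a list of (timestamp, sensor, reading) tuples."""
    for ts, sensor, reading in sensor_events:
        if abs(ts - outage_start) <= window:
            return f"{sensor}: {reading}"
    return "network failure (no correlated site event)"
```

A power sensor event landing inside the window immediately distinguishes a power failure from a pure network failure, which is the causation gap described above.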
The use of IoT in this use case will be discussed in detail in a further article.
