Where legacy operations management comes up short

Most legacy operations management products miss the point. Many of these products, although implemented, are ghosts: they lurk around in buildings with limited or no business benefit. Their primary focus is on monitoring outages. Such a product is often referred to as a RAG tool (Red, Amber and Green), where Red signifies down, Amber signifies intermittent connectivity problems and Green signifies good connectivity. This serves a limited business purpose and cannot justify a return on investment. Product development has focused on reporting on acceptable situations, as opposed to giving equal focus to operations in dire straits. How does this limited view qualify as operational monitoring? All it does is give you a comfortable feeling. With this approach, there is no difference in value proposition between a cheap 'keep alive pulse' and an operational management framework product, worth millions, that uses more complex protocols.
What is required is to report against the business service level agreement and also to provide causation for the downtime. Outages should be reported on from an outage timeline perspective and aggregated to reveal a pattern or trend. It is not the overall length of the outage that is important, but the crucial time periods within the outage that provide the metrics. These metric times are aligned with ITIL’s expanded incident life-cycle.
The period between diagnosis and determination is often referred to as rolling wheels, because that is when a technician is physically dispatched to resolve an outage. There has always been an industry-wide infatuation with “9’s”. Tools that report only on 99.9% availability have limited meaning because:

  • We cannot ascertain which services, as opposed to which assets, were affected;
  • We are unable to establish the business impact and resulting consequences;
  • We don’t have insight into the prevailing conditions during the outage;
  • The resolution or temporary fix is unknown;
  • The visual, proximate and root causes have not been annotated;
  • We don’t know the resources assigned to work on the outage; and
  • The outage classification and prioritisation are unknown, so we are unable to assess the risk and mitigating actions.

No operations management product in existence answers the above questions in a suitable manner. As an example, does a historical snapshot exist of the other outages that were in progress at the time of a major incident with severe negative business consequences?
Crucially, the monitoring of downtime needs to record the delta between the time when the outage actually happened and the time when it was acknowledged and escalated to the responsible stakeholder via a notification process. If there was an outage with no notification, this needs to appear in an exception report.
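The delta and the exception report described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Outage` record, its field names and the sample services are hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Outage:
    """Hypothetical outage record; the field names are assumptions."""
    service: str
    occurred_at: datetime
    notified_at: Optional[datetime] = None  # None means no notification was sent


def notification_delay(outage: Outage) -> Optional[timedelta]:
    """Delta between the actual outage and the stakeholder notification."""
    if outage.notified_at is None:
        return None
    return outage.notified_at - outage.occurred_at


def exception_report(outages: List[Outage]) -> List[Outage]:
    """Outages that happened without any notification."""
    return [o for o in outages if o.notified_at is None]


# Illustrative data only.
outages = [
    Outage("payments", datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 12)),
    Outage("email", datetime(2024, 1, 6, 14, 30)),
]
delays = [notification_delay(o) for o in outages]
missed = exception_report(outages)
```

In this sample, the payments outage has a 12 minute notification delay, while the email outage lands in the exception report because nobody was notified.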
Every operations management product vendor bases their sales pitch on the promise of a faster detection time for an outage. This is pure Kool-Aid: in most cases, when the downtime of an outage is analysed, the detection time is the smallest contribution to the overall length of the outage. Why would anyone buy a tool whose primary purpose is to address the smallest requirement?

Another misconception is that an operations monitoring tool will somehow miraculously prevent an outage. The reality is that significant havoc in technology happens, and if your tool is focused only on prevention, it will fail when you try to use it as a cure. The value of a tool is in how it deals with an outage, because the HOW is the major factor in service improvement. Optimal management of outages results in two trends:

  • the time between outages increases; and
  • the length of the outage reduces.

There is no value in your operational monitoring tool if it cannot influence these trends and, worse, if it cannot report on or measure them.
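As a rough sketch of how these two trends could be measured, the Python below takes outages as (start, end) pairs and compares the earlier half of the history with the later half. The function names, the half-versus-half comparison and the sample data are assumptions made for illustration; a real tool would likely use a more robust trend calculation.

```python
from datetime import datetime, timedelta
from statistics import mean


def gaps_and_durations(outages):
    """outages: list of (start, end) datetime pairs, sorted by start.

    Returns (gaps, durations) in minutes, where a gap is the time between
    the end of one outage and the start of the next.
    """
    durations = [(end - start).total_seconds() / 60 for start, end in outages]
    gaps = [
        (nxt[0] - cur[1]).total_seconds() / 60
        for cur, nxt in zip(outages, outages[1:])
    ]
    return gaps, durations


def is_improving(values, want_increase):
    """Compare the mean of the earlier half with the mean of the later half."""
    half = len(values) // 2
    if half == 0:
        return None  # not enough history to judge a trend
    earlier, later = mean(values[:half]), mean(values[-half:])
    return later > earlier if want_increase else later < earlier


# Illustrative history: outages get shorter and further apart.
t0 = datetime(2024, 1, 1)
history = [
    (t0, t0 + timedelta(minutes=90)),
    (t0 + timedelta(days=2), t0 + timedelta(days=2, minutes=60)),
    (t0 + timedelta(days=6), t0 + timedelta(days=6, minutes=45)),
    (t0 + timedelta(days=13), t0 + timedelta(days=13, minutes=20)),
]
gaps, durations = gaps_and_durations(history)
gaps_growing = is_improving(gaps, want_increase=True)
outages_shrinking = is_improving(durations, want_increase=False)
```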
The next step is diagnosis, and monitoring of the diagnosis is core to this. What often happens is that an outage occurs, an engineer diagnoses the issue and, after initial investigation, proposes a solution. Often this knowledge remains in the engineer’s head and is not recorded or annotated in a knowledge base. If incidents and their resolutions were consistently recorded, the diagnosis time of an incident would steadily reduce. A tool that provides this ability would immediately have a justifiable return on investment.
The repair, recovery and restore time periods are not difficult to measure if an operations management product is service-aware, as these metric times will be obvious.
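To make the metric times concrete, here is a minimal Python sketch that splits a single outage into its life-cycle periods. It assumes the outage record carries one timestamp per milestone; the milestone names and the sample times are hypothetical and purely illustrative, not measured data.

```python
from datetime import datetime

# Milestones in the order they occur; each phase runs from one to the next.
MILESTONES = ["occurred", "detected", "diagnosed", "repaired", "recovered", "restored"]
PHASES = ["detection", "diagnosis", "repair", "recovery", "restoration"]


def phase_minutes(timestamps):
    """Return the length of each life-cycle phase in minutes.

    timestamps: dict mapping each milestone name to a datetime.
    """
    result = {}
    for phase, start, end in zip(PHASES, MILESTONES, MILESTONES[1:]):
        result[phase] = (timestamps[end] - timestamps[start]).total_seconds() / 60
    return result


# Illustrative outage only.
outage = {
    "occurred": datetime(2024, 3, 1, 8, 0),
    "detected": datetime(2024, 3, 1, 8, 5),
    "diagnosed": datetime(2024, 3, 1, 9, 5),
    "repaired": datetime(2024, 3, 1, 11, 5),
    "recovered": datetime(2024, 3, 1, 11, 35),
    "restored": datetime(2024, 3, 1, 11, 50),
}
phases = phase_minutes(outage)
total_minutes = sum(phases.values())
shortest_phase = min(phases, key=phases.get)
```

With these made-up timestamps, detection accounts for 5 of the 230 minutes of downtime, which is the kind of breakdown that shows where the time in an outage is really spent.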
To be of value, the output required is an analysis of the consequences of downtime:

  • what is the total outage time and what was the time of degradation (brownouts versus blackouts)?
  • what is the deviation from the norm for this outage compared against the historical trend data?
  • what are the average incident times for detection, diagnosis, repair, recovery and restoration?
  • what are the top services affected by outages?
  • what are the top causes of outages?
  • what are the MTTR, MTBF and MTBSI?
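The last question in the list can be answered directly from an outage timeline. The sketch below assumes the common definitions: MTTR as the mean time from the start of an outage to restoration of service, MTBF as the mean uptime between restoration and the start of the next outage, and MTBSI as the mean time between the starts of consecutive outages. The function name and the sample data are illustrative assumptions.

```python
from datetime import datetime, timedelta


def reliability_metrics(outages):
    """outages: list of (start, restored) datetime pairs, sorted by start.

    Needs at least two outages, because MTBF and MTBSI are measured
    between consecutive outages.
    """
    if len(outages) < 2:
        raise ValueError("at least two outages are required")
    pairs = list(zip(outages, outages[1:]))
    downtime = [restored - start for start, restored in outages]
    uptime = [nxt[0] - cur[1] for cur, nxt in pairs]
    between_starts = [nxt[0] - cur[0] for cur, nxt in pairs]
    return {
        "MTTR": sum(downtime, timedelta()) / len(downtime),
        "MTBF": sum(uptime, timedelta()) / len(uptime),
        "MTBSI": sum(between_starts, timedelta()) / len(between_starts),
    }


# Illustrative timeline: three outages, ten days apart.
timeline = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0)),
    (datetime(2024, 1, 11, 0, 0), datetime(2024, 1, 11, 1, 0)),
    (datetime(2024, 1, 21, 0, 0), datetime(2024, 1, 21, 3, 0)),
]
metrics = reliability_metrics(timeline)
```

Grouping the same timeline by service or by annotated cause, instead of averaging over everything, would give the "top services" and "top causes" views from the list.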

There are many possible causes of extended downtime periods. How does a RAG (Red-Amber-Green) tool help to resolve any of the following?

  • Long detection times or even misses;
  • Lengthy diagnostics;
  • Logistic issues delaying repair;
  • Slow recovery, such as having to rebuild from scratch because there is no last known configuration;
  • Slow return to service even though the device is recovered; and
  • No workarounds being available or documented.

Operations management needs to step up a gear from the current RAG mindset: it really isn’t about the traffic lights, but about driving the vehicle!
The influx of Internet of Things (IoT) tools and sensors will improve operational management in two crucial ways:

  • expanding the range of operations that can be monitored; and
  • improving the effectiveness and efficiency of rolling wheels.

As an example, network connectivity tools have basic problems with causation. When an outage happens, the proximate cause is usually either a network failure or a power failure. IoT devices can provide this causation and assist in further root cause analysis by using additional sensors such as presence, door contact, vibration and even smoke sensors.
The use of IoT in this use case will be discussed in detail in a future article.
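As a simple illustration of the idea, the Python sketch below classifies the proximate cause of an unreachable site from sensor readings and collects hints for root cause analysis. Everything here is hypothetical: the `SiteReadings` fields, the rules and the hint wording are assumptions for the sake of the example and do not describe a specific IoT product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteReadings:
    """Hypothetical sensor readings taken when a site becomes unreachable."""
    mains_power_ok: bool
    door_open: bool = False
    presence_detected: bool = False
    vibration_detected: bool = False
    smoke_detected: bool = False


def proximate_cause(readings: SiteReadings) -> str:
    """Distinguish a power failure from a network failure."""
    return "network failure" if readings.mains_power_ok else "power failure"


def root_cause_hints(readings: SiteReadings) -> List[str]:
    """Collect observations that may point towards the root cause."""
    hints = []
    if readings.smoke_detected:
        hints.append("smoke detected: possible fire or overheating")
    if readings.door_open or readings.presence_detected:
        hints.append("door open or presence detected: possible activity on site")
    if readings.vibration_detected:
        hints.append("vibration detected: possible physical disturbance")
    return hints


# Illustrative reading: power is lost and the cabinet door is open.
readings = SiteReadings(mains_power_ok=False, door_open=True)
cause = proximate_cause(readings)
hints = root_cause_hints(readings)
```

Even rules this basic tell the technician whether to bring a generator or a network engineer, which is where the efficiency of rolling wheels improves.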

