Most legacy operations management products miss the point and come up short.
Many of these products, although implemented, are ghosts. They lurk around in
buildings with limited or no business benefit. Their primary focus is on
monitoring outages. This is often referred to as a RAG tool: Red, Amber, and
Green, where Red signifies down, Amber signifies intermittent connectivity
problems and Green signifies good connectivity. This serves a limited business
purpose and cannot
justify any return on investment. Product development has focused on
reporting on acceptable situations, as opposed to giving equal focus to
operations in dire straits. How does this limited view qualify as operational
monitoring? All it does is give you a comfortable feeling. With this approach,
there is no difference in value proposition between a cheap 'keep alive pulse'
and an operational management framework product worth millions that uses more
complex protocols.
What is required is not only to report against the business service level
agreement but also to provide causation for the downtime. Outages should be
reported from an outage timeline perspective and aggregated to provide a
pattern or trend.
It is not the length of time of the outage that is important but the
crucial time periods within the outage that provide the metrics. These metric
times are aligned with ITIL’s expanded incident life-cycle.
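As a rough illustration of measuring the periods within an outage rather than only its total length, the sketch below derives per-phase durations from a set of timestamps. The milestone names and the sample timestamps are assumptions made up for this example; they are not taken from any particular product.

```python
from datetime import datetime

# Milestones of the expanded incident life-cycle, in order.
# These field names are illustrative assumptions.
MILESTONES = ["occurred", "detected", "diagnosed", "repaired", "recovered", "restored"]

def phase_durations(outage):
    """Return the minutes spent between each pair of consecutive milestones."""
    durations = {}
    for start, end in zip(MILESTONES, MILESTONES[1:]):
        delta = outage[end] - outage[start]
        durations[f"{start}->{end}"] = delta.total_seconds() / 60
    return durations

# Made-up sample outage, purely for demonstration.
outage = {
    "occurred":  datetime(2024, 1, 10, 9, 0),
    "detected":  datetime(2024, 1, 10, 9, 5),
    "diagnosed": datetime(2024, 1, 10, 10, 5),
    "repaired":  datetime(2024, 1, 10, 12, 5),
    "recovered": datetime(2024, 1, 10, 12, 35),
    "restored":  datetime(2024, 1, 10, 12, 50),
}
durations = phase_durations(outage)
longest_phase = max(durations, key=durations.get)
```

In this invented sample the longest phase is the repair period, not detection, which is the kind of breakdown a single availability percentage cannot show.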
The period between diagnosis and determination is often referred to as rolling
wheels because that is when a technician is physically dispatched to resolve an
outage.
There has always been an industry-wide infatuation with “9’s”. Tools that
report only on 99.9% availability have limited meaning because:
- We cannot ascertain which services as opposed to which assets were affected;
- We are unable to establish the business impact and resulting consequences;
- We don’t have insight into the prevailing conditions during the outage;
- The resolution or temporary fix is unknown;
- The visual, proximate and root causes have not been annotated;
- We don’t know the resources assigned to work on the outage; and
- The outage classification and prioritisation are unknown, so we are unable to assess the risk and mitigating actions.
No operations management product in existence answers the above questions in a
suitable manner. As an example, does a historical snapshot exist of the other
outages in progress at the time of a major incident that had severe negative
business consequences?
Crucially, the monitoring of downtime needs to record the delta between the
time when the outage actually happened and the time when it was acknowledged
and escalated to the responsible stakeholder via a notification process. If
there was an outage with no notification, this needs to appear in an exception
report.
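A minimal sketch of that delta and the exception report, assuming each outage record carries a start time and an optional notification time. The record layout, the `notification_report` function and the sample data are all hypothetical.

```python
from datetime import datetime

def notification_report(outages):
    """Split outages into (delays, exceptions).

    delays:     outage id -> minutes between the outage starting and the
                responsible stakeholder being notified.
    exceptions: ids of outages for which no notification was recorded.
    """
    delays, exceptions = {}, []
    for o in outages:
        if o["notified_at"] is None:
            exceptions.append(o["id"])
        else:
            delta = o["notified_at"] - o["started_at"]
            delays[o["id"]] = delta.total_seconds() / 60
    return delays, exceptions

# Made-up sample records.
sample = [
    {"id": "OUT-1", "started_at": datetime(2024, 3, 1, 8, 0),
     "notified_at": datetime(2024, 3, 1, 8, 20)},
    {"id": "OUT-2", "started_at": datetime(2024, 3, 2, 14, 0),
     "notified_at": None},
]
delays, exceptions = notification_report(sample)
```

Here `OUT-2` lands in the exception list because nobody was ever notified, while `OUT-1` shows a 20-minute gap between the outage and the escalation.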
Every operations management product vendor bases their sales pitch on the claim
that they will provide a faster detection time for an outage. This is pure
Kool-Aid: in most cases, when the downtime of an outage is analysed, the
detection time is the smallest contributor to the overall length of the outage.
Why would anyone buy a tool whose primary purpose is to address the smallest
requirement?
Another misconception is that an operations monitoring tool will somehow
miraculously prevent an outage. The reality is that significant havoc in
technology happens, and if your tool is only focused on prevention, it will
fail when you try to use it to cure. The value of a tool is in how it deals
with an outage, because the HOW is the major factor in service improvement.
Optimal management of outages results in two trends:
- the time between outages increases; and
- the length of the outage reduces.
There is no value
in your operational monitoring tool if it cannot influence these trends and
worse, if it cannot report on or measure them.
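One crude way to check both trends is to compare the later half of the outage history with the earlier half. The sketch below assumes a hypothetical history of (hours since the previous outage, outage length in minutes) pairs; a real tool would more likely fit a regression over far more data.

```python
from statistics import mean

def trend(values):
    """Mean of the later half minus mean of the earlier half.

    Positive means the series is rising, negative means it is falling.
    Needs at least two values.
    """
    half = len(values) // 2
    return mean(values[half:]) - mean(values[:half])

# Made-up history: (hours since previous outage, outage length in minutes).
history = [(100, 90), (120, 80), (150, 60), (200, 40)]
gaps = [gap for gap, _ in history]
lengths = [length for _, length in history]

gap_trend = trend(gaps)        # positive: outages are becoming less frequent
length_trend = trend(lengths)  # negative: outages are becoming shorter
improving = gap_trend > 0 and length_trend < 0
```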
The next step is diagnosis. Diagnosis monitoring is core to this. What often
happens is that an outage occurs; an engineer diagnoses the issue and, after
initial investigation, proposes a solution. Often this knowledge remains in the
engineer’s head and is not recorded or annotated in a knowledge base. If
incidents and their resolutions were consistently recorded, the diagnosis time
of an incident would steadily reduce. A tool that provides this ability would
immediately have a justifiable return on investment.
The repair-recovery-restore time periods are not difficult to measure if an operations management product is service aware, as these metric times will be obvious.
To be of value, the
output required is an analysis of the consequences of downtime:
- what is the total outage time and what was the time of degradation (brown outs versus black outs)?
- what is the deviation from the norm for this outage compared against the historical trend data?
- what are the average incident times for: detection, diagnosis, repair, recover and restore?
- what are the top services affected by outages?
- what are the top causes of outages?
- what is the MTTR, MTBF and MTBSI?
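The last question in the list can be answered directly from incident start and restore times. The sketch below assumes the following definitions, which should be checked against your own service management practice: MTTR as the mean time from incident start to service restored, MTBF as the mean uptime from one restoration to the next incident start, and MTBSI as the mean time from one incident start to the next. The function name and sample incidents are hypothetical.

```python
from datetime import datetime

def service_metrics(incidents):
    """Compute MTTR, MTBF and MTBSI in hours from (start, restored) pairs.

    Uses the assumed definitions given in the text above.
    Needs at least two incidents.
    """
    def hours(delta):
        return delta.total_seconds() / 3600

    incidents = sorted(incidents)
    pairs = list(zip(incidents, incidents[1:]))
    downtimes = [hours(end - start) for start, end in incidents]
    uptimes = [hours(nxt[0] - cur[1]) for cur, nxt in pairs]
    between = [hours(nxt[0] - cur[0]) for cur, nxt in pairs]
    return {
        "MTTR": sum(downtimes) / len(downtimes),
        "MTBF": sum(uptimes) / len(uptimes),
        "MTBSI": sum(between) / len(between),
    }

# Made-up incidents as (start, restored) timestamps.
incidents = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0)),
    (datetime(2024, 1, 3, 0, 0), datetime(2024, 1, 3, 4, 0)),
    (datetime(2024, 1, 6, 0, 0), datetime(2024, 1, 6, 3, 0)),
]
metrics = service_metrics(incidents)
```

For this invented sample the result is an MTTR of 3 hours, an MTBF of 57 hours and an MTBSI of 60 hours.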
There are many possible causes of extended downtime periods. How does a RAG
(Red-Amber-Green) tool help to resolve any of the following?
- Long detection times or even misses;
- Diagnostics;
- Logistic issues delaying repair;
- Slow recovery, like having to rebuild from scratch as there is no known last configuration;
- Slow return to service even though the device is recovered; and
- No workarounds being available or documented.
Operations management needs to step up a gear from the current RAG mindset: it
really isn’t about the traffic lights, but about driving the vehicle!
The influx of Internet of Things (IoT) tools and sensors will improve
operational management in two crucial ways:
- Expand the scope of operations that can be monitored; and
- Improve the effectiveness and efficiency of rolling wheels.
As an example, network connectivity tools have basic problems with causation.
When an outage happens, the proximate cause is either a network or a power
failure, and a connectivity tool alone cannot tell which. IoT devices can
provide this causation and assist in further root cause analysis by using
additional sensors such as presence, door contacts, vibration and even smoke
sensors.
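To make the idea concrete, here is a minimal sketch of combining sensor readings into a suggested proximate cause plus supporting observations. The reading names (`mains_power`, `smoke`, `door_open`, `vibration`, `presence`) and the simple rule used are assumptions for illustration only, not a description of any real device or product.

```python
def classify_outage(readings):
    """Suggest a proximate cause and list supporting sensor observations.

    `readings` is a dict of hypothetical sensor values captured at the
    time connectivity was lost.
    """
    # If mains power is gone, assume power is the proximate cause;
    # otherwise the fault is assumed to lie in the network.
    proximate = "network failure" if readings.get("mains_power", True) else "power failure"

    hints = []
    if readings.get("smoke"):
        hints.append("smoke detected")
    if readings.get("door_open"):
        hints.append("cabinet door opened")
    if readings.get("vibration"):
        hints.append("vibration detected")
    if readings.get("presence"):
        hints.append("person present on site")
    return proximate, hints

cause, hints = classify_outage({"mains_power": False, "smoke": True})
```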
The use of IoT in
this use case will be discussed in detail in a further article.