ITSI - Getting to the heart of the problem
26/06/19 – Author: Andy Gibbs – Certified Splunk Consultant
It’s 3 o’clock in the morning and you are unceremoniously awoken by your smartphone. That beautiful dream you were just having dissolves into a stark reality – you’re being beckoned onto a priority 1 emergency incident call. Critical business services have been down for the past half-hour, nobody seems to know why, and customers are ringing the phones off the hook. Your boss has just sent you a very curt text demanding a status update in the next 15 minutes. That’s the third major incident this week!
It’s the start of yet another long and tortuous day, much of which you’ll spend gathering incident updates from the various IT teams. This wouldn’t be so bad if they all spoke the same language, but each team seems to have its own technobabble, and each blames someone else for the issue. Identifying the root cause will be a nightmare.
The symptoms sound familiar. They are just like an outage you had last Tuesday. You wish you could remember how you’d resolved that one. You recall being told it was something to do with “… an application lock-up due to a database tablespace issue because of a caching error resulting from a lack of fault-tolerant infrastructure when a disk drive failed” or something like that – phew! Someone somewhere must surely be monitoring this stuff – so why aren’t they taking action before these issues occur? You also remember the flood of customers complaining that their services had just stopped without warning – no error messages – just a hung screen. Oh, and the service desk going crazy, with over 400 calls queued and 700 abandoned at one point. You just hope your customers will be a little more understanding than last time. Your boss certainly won’t be!
Is this a typical day at the office for your business operations?
Are you struggling to meet your service KPIs due to operational underperformance?
Are technical complexity and a lack of visibility into the moving parts of your organisation causing real service performance issues?
In our experience at Somerford, many organisations find this an all-too-familiar picture, particularly those with significant service delivery and operational commitments. Many have already invested in event monitoring tools, but often discover these are point solutions that only help identify localised failures. With the ever-growing complexity of the technology ‘stack’ used for service operations, it’s difficult to get a holistic picture of the full delivery platform. Keeping track of all the interdependencies between the moving parts within the organisation can be a nightmare, and identifying the root cause of issues can be complex and time-consuming.
How can Splunk ITSI help?
Splunk IT Service Intelligence (ITSI) is a monitoring and analytics tool for Service and IT Operations. It enhances business effectiveness by improving the efficiency, reliability and cost-effectiveness of key services. Powered by machine learning, it provides visibility into the health state and performance of an organisation’s critical business services, and the underlying IT infrastructure upon which these depend. Splunk ITSI provides a core platform for service operations yielding a number of key benefits:
- It provides real-time insights to understand how key services are performing
- It uses advanced analytics to identify patterns, anomalies and trends
- It simplifies service monitoring and analytics, allowing faster, more informed decision making
- It supports in-depth analysis of service issues, helping to identify the root cause and reduce resolution times
- It helps automate operations, saving effort on laborious, repetitive operational tasks
- It allows organisations to pre-empt and avoid service failure
Where Splunk ITSI really wins out is through its Predictive Analytics and Machine Learning capabilities. By learning from patterns of behaviour over time, it can alert on and respond to conditions that have previously led to adverse events or service failures. These ‘actionable events’ can be used as a catalyst for deeper investigation and preventative action.
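To make the idea of "looking at patterns of behaviour over time" a little more concrete, here is a minimal sketch in Python. It is not ITSI's actual algorithm, and the function name `find_anomalies`, the `window` and `threshold` parameters and the sample response times are all hypothetical, chosen purely for illustration. It simply flags any KPI reading that sits far outside the baseline set by the readings just before it.

```python
from statistics import mean, stdev

def find_anomalies(values, window=12, threshold=3.0):
    """Return the indices of readings that deviate strongly from recent behaviour.

    Each reading is compared with the mean and standard deviation of the
    `window` readings immediately before it (a simple rolling baseline).
    This is an illustrative assumption, not how ITSI works internally.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical KPI: database response time in milliseconds, one reading every 5 minutes.
response_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 100, 180, 101]
print(find_anomalies(response_ms))  # prints [13], the 180 ms spike
```

A real deployment would need to cope with seasonality, noisy data and many KPIs at once, which is where a purpose-built platform earns its keep. The sketch only shows why a learned baseline can raise a flag before anyone picks up the phone.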
Over time, operations teams can move from a reactive to a predictive state whereby adverse events can be detected and corrected before they have the opportunity to affect live services. Importantly, this saves the cost of incident management and the associated business disruption. Time is returned to the business by avoiding a service failure, and the customer benefits from service continuity, while you maintain your business reputation.
Using the predictive capabilities of Splunk, customers are claiming some significant operational benefits such as:
- 70-90% reduction in incident investigation time
- 30-45% reduction in outages
- Ability to predict imminent outages 30-45 minutes in advance
- 90+% reduction in alert noise