
Zühlke Engineering


A new era of incident management


Just how valuable is incident management? What are the standards in this line of work?

Pour some coffee, put on some relaxing music and join us for the eighth episode of The Hüb where Tijana Krstajić shares the valuable experience she obtained while working as a Project Manager at Zühlke, sometimes covering the Service Manager role in various projects.

What is the value of incident management?

I was responsible for organising incident management at all levels and for owning that framework and all its sub-processes. This was a challenging project for me. The story is interesting mostly because I created that incident management process in a FinTech company, a virtual bank. Lately, FinTech companies have been emerging everywhere, particularly neobanks, also called challenger banks: virtual banks that have no physical presence on the market and exist only as systems, usually accessed via a mobile application.

In a company like a neobank, the greatest value lies in technology, which is the core of everything and which provides automation. Far fewer operational personnel are needed to run the bank: where a traditional bank had a staff of 10,000, you can create a bank with 50 employees that is capable of providing the same services, thanks to technology. At an institution with such a large investment in technology, incident management in the technological sense is of paramount importance.

Incident management is not important only for developers. An incident is no longer just a bug in an application that prevents a user from paying for something or seeing a screen; it matters on a larger scale. It affects clients and their perception of the bank, and it affects the bank's regulatory compliance. Regulators pay great attention to it. For instance, in Hong Kong, where my project took place, the CIO, COO or CEO can face enormous fines in the case of, for example, a security incident; they can even end up in prison on criminal charges. So these people carry both financial and personal liability.

How does incident management fit into the wider business perspective?

In a modern company, the distinction between IT and business is fading, slowly but steadily. All departments share a common vision and goal: happy customers and a successful company. I've already mentioned the importance of technology in FinTech. In banking, an incident can have a great impact, so the bank's entire operation must be considered, together with the risk appetite of the institution. We need to know, for instance, how many incidents per month the bank considers acceptable, and how many of those may be high-severity incidents in which, for example, clients cannot log on or make payments. This requires close coordination with management and with other parts of the bank, such as the treasury, customer service, risk management and cybersecurity risk departments. On the other hand, it was also necessary to run a large training course for developers, since many of them did not know what an "incident" is, what it means when a service that is critical to an institution's operations does not work, and what consequences that may have. So on my side, the work was not just creating frameworks but also raising awareness of what an incident is and what consequences it may have for the bank and its clients.
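As an illustration of how a risk appetite might be turned into something checkable, here is a minimal Python sketch. The `RiskAppetite` class, the `check_month` function, the severity labels and the threshold values are all hypothetical; the interview does not state the bank's actual limits or tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskAppetite:
    # Hypothetical limits agreed with management; not taken from the interview.
    max_incidents_per_month: int
    max_high_severity_per_month: int


def check_month(severities, appetite):
    """Return the names of the limits breached by one month of incidents."""
    counts = Counter(severities)
    breaches = []
    if len(severities) > appetite.max_incidents_per_month:
        breaches.append("total")
    if counts["high"] > appetite.max_high_severity_per_month:
        breaches.append("high")
    return breaches


# Assumed example data: one severity label per incident recorded in a month.
appetite = RiskAppetite(max_incidents_per_month=10, max_high_severity_per_month=1)
march = ["low", "high", "medium", "high"]
print(check_month(march, appetite))  # ['high']
```

In this made-up month the total stays within the limit, but two high-severity incidents exceed the assumed limit of one, which is the kind of signal that would be raised with management and the risk departments.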

What are the official standards in incident management?

ITIL is the best-known standard within service management. Incident management is a process within ITIL, which first defines what an incident is and then gathers best practices based on feedback from many companies. I used to have a concern about ITIL and its strictly defined procedures and ways of reacting, but that has changed since it was upgraded to its 4th version. I think strict procedures can significantly limit the awareness of the people who need to react to incidents, and they often prescribe a way of reacting even when that way does not serve the purpose. In some situations, people should have the freedom to make a good decision based on the circumstances, because procedures are written for certain kinds of reactions and cannot be applied to absolutely all cases.

It is also good to have defined roles in Incident Management. One important role is the Incident Commander. This is a person who has the executive authority to manage an incident. They can make decisions based on the proposed solutions, monitor the troubleshooting process, and coordinate everything that is happening during an incident. Ideally this should be a technical person, but quite often it is a service manager or project manager. It is of great importance that the person understands the broader context and the processes in the technological domain in which they are appointed incident manager.

When it comes to the process itself, I could easily draw parallels between Incident and Project Management. An incident is like a small project; it begins when a disruption of service occurs, and it ends when the service returns to normal operations.

As for reviewing the events of an incident, my practice for every high-severity incident has been a complete review. This means we summarize how the incident occurred, why it occurred, what the resolution flow was, whether the right people were involved at the right point, and what we can do differently and better next time. The term for that is an incident post mortem.

After we have detailed the timeline, we can move on to the way the incident resolution was supposed to flow in an ideal scenario. We discuss who was supposed to be included at what point, what could have been done differently, and most importantly, Root Cause Analysis. When we say Root Cause Analysis, we don't refer to just problems. I don't appreciate Problem Management much anymore. I realize its purpose and significance when it comes to regulatory bodies and to reporting on risks at the higher levels of financial institutions. But as such it has no value in IT, because it cannot be said that an incident or any service disruption has a single root cause. Usually, a set of circumstances leads to a situation, and fixing it and avoiding a similar situation in the future takes several actions, never just one. We determine which actions need to be taken so that the problem doesn't repeat itself, prioritize them at the end of the post mortem, appoint an owner for each action, and then track their resolution against a timeline.
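The post-mortem record described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the `Action` and `PostMortem` classes, their fields, the dates and the example entries are hypothetical and are not the format used on the project. The sketch records several contributing factors rather than a single root cause, and each follow-up action has an owner and a priority.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Action:
    description: str
    owner: str
    priority: int  # 1 = most urgent (assumed convention)
    done: bool = False


@dataclass
class PostMortem:
    started_at: datetime   # when the service disruption began
    resolved_at: datetime  # when normal operation was restored
    contributing_factors: list
    actions: list = field(default_factory=list)

    def duration_minutes(self):
        """Length of the incident, from disruption to restored service."""
        return (self.resolved_at - self.started_at).total_seconds() / 60

    def open_actions(self):
        """Unfinished actions, most urgent first."""
        return sorted((a for a in self.actions if not a.done),
                      key=lambda a: a.priority)


# Made-up example incident, for illustration only.
pm = PostMortem(
    started_at=datetime(2024, 1, 15, 9, 0),
    resolved_at=datetime(2024, 1, 15, 10, 30),
    contributing_factors=["expired certificate", "missing expiry alert"],
    actions=[
        Action("Add certificate expiry alert", "platform team", 2),
        Action("Renew and rotate certificates", "security team", 1),
        Action("Update on-call runbook", "service manager", 3, done=True),
    ],
)

print(pm.duration_minutes())  # 90.0
for action in pm.open_actions():
    print(action.priority, action.owner, action.description)
```

Listing the open actions by priority is one simple way to follow their resolution after the review meeting has ended.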

What are the challenges of cross-team communication?

I will start with what was the most difficult for me. Working in large organizations, especially financial institutions, produces an inevitable culture clash. People coming from Risk Management in banks, and I believe the same applies to all institutions with Risk Management, simply have a certain way of thinking. They have picked up established practices in their previous jobs, and they expect that we, as people looking for new solutions to old problems, will just fit into their existing solutions and ways of thinking.

This has usually been resolved through a lot of convincing, and by breaking problems down and laying them out, so that in the end we would realize that we were talking about the same thing in two different languages. They already have their own vision, which I cannot comply with because it is not efficient, requires a lot of time and would be a waste of effort and money. So I convince them to follow my version for a couple of months, just to see how it goes, on a trial-and-error basis, and then to decide on that basis whether to change something, go back to the old ways of working, or stay with the approach we trialled. That is not a simple thing, I've learned, but the good point was that I had strong support from the management of the project, especially where Incident Management was concerned, and on that basis I succeeded in achieving what I intended: making us ready for the large incidents that happened later.
