I saw this meme on my social media feed, and it reminded me of my first rule of troubleshooting.
Never, ever, try to prove it’s not your area.
If you’re in IT, you’re familiar with ‘critical’ issues. You might call them SevA, Sev1, TITSUP, outage, all-hands or something else. But we all have them, and we’ve all been involved in some way.
How many times in one of those situations did you hear:
“It’s not the network”
“It’s not the storage”
“It’s not VMware”
“It’s not my code”
What you’re really saying is: “It’s not MY fault”.
Stop thinking this way. Stop trying to prove it’s not your fault or your area, or your systems. Instead, ask yourself how can you solve the problem. How can you make it better. Maybe your area did not create the issue at hand. But if all you’re are trying to do is prove it’s not your responsibility, you’re not actually trying to solve the problem, rather you’re only trying to get out of the situation. It reminds me of childhood neighborhood games yelling “1-2-3 NOT IT”. If everyone is simply trying to be “not it” then the problem will never get solved.
Rather than trying to prove it’s not you, I urge everyone to prove it IS. Why? Well for starters if I keep trying to proof it’s my area, and I work under the assumption it might be… I might find out it actually is. It’s very easy for us to overlook a detail about why our area is part of the problem if all we’re trying to do is prove it’s not.
More than that, it’s a mindset.
The goal should always be to restore the service at any cost; does it matter why it happened during the outage? If during a critical issue I can find a way to improve the area I’m responsible for enough to alleviate the pain, I can help de-escalate the situation enough to restore service, then, get to true root cause.
Everything in IT is related, the components all work together. If I leave the situation after proving it’s not my area, I’m not present in the conversation to help answer questions. We see this result as waiting to get someone back on the phone or in the war room to answer a question, delaying the resolution.
Continuing to stay engaged I will learn more about how my role fits into the larger ecosystem my area supports. With that knowledge, I improve my ability to contribute. Not just to the issue at hand, but to future designs. Plus if I wish for “my area” to grow, a.k.a. get promotions; the more I know about the other areas the better suited I am for a wider set of responsibilities.
Digging deeper and deeper into the tools and metrics I have may help uncover the key to solving the problem. I might be able to find the nugget that helps my peer solve the problem. Tracing network packets can help developers; comparing IOPS volume from before the incident can point to workload changes, leveraging security tools might help find code issues; I’ve witnessed this over and over again.
I have a great real-world example of this I use when talking to teams about critical response practices.
Years ago, we were experiencing an issue where a database server would crash every night during major ETL loads. For days no one could figure out why. The database looked fine, the logs were clean, but the whole server would panic every night. I was not responsible for the operating system up at the time, so I was not involved in the initial troubleshooting effort. But with the problem ongoing the teams who were responsible started reaching out for help.
I offered to take a look. While initially I didn’t see anything of concern, I asked when the issues happened and if I could watch. It was the middle of the night, so I agreed to stay up late with them that night to watch in real time. While the other team looked at the OS and Database monitoring tools; I opened up mine, vCenter, storage, etc. Right before the crash happened, in real-time monitoring mode inside vCenter, I noticed an enormous spike in packets per second at the network layer. We repeated the workload, and the crash and the spike repeated as well.
Why and what was happening? The ETL load was causing a large influx of data over the network, increasing the packets per second. While the 10Gbs bandwidth was not a bottleneck, the virtual network card was an older model (E1000) which in turn was overwhelming the kernel processor usage, confirmed by the Linux admin after I asked him to look at each processor usage statistic individually. The solution was to adjust the virtual nic (to VXNET3), as well enable Receive Side Scaling to spread the network processing workload across multiple cores, avoiding starving the kernel on core 0.
By looking at the tools for my area, we were able to find data that led us down the path to the ultimate cause of the issue and solved it. It wasn’t the vSphere Hypervisor causing the issue, but the monitoring at that level could point to the issue. I could help solve the issue, even though it wasn’t my fault. Because I was trying to help, not just trying to prove it wasn’t my fault.
This story also demonstrates another important point… it often is not anyone’s or any areas fault; but the combination of them. Which means no one team can solve it on their own.
My last point, and maybe the most important personally, is also the easiest to forget. This time, it might not be your area, but next time it might be. When it is, don’t you want your peers there to help you? More over, isn’t it better to solve it together and make it a team problem? It might not be your culture today, but it can be with your help.
These are all the reasons I’ve told my teams “Don’t try to prove it’s NOT your area, try to prove IT IS, because if it is, YOU can fix it, and I need it fixed”.
So if you find yourself saying: “It’s not the <my area>”. Try instead “How can <my area> help?”