“It’s not the…”

I saw this meme on my social media feed, and it reminded me of my first rule of troubleshooting.

Never, ever, try to prove it’s not your area.

If you’re in IT, you’re familiar with ‘critical’ issues. You might call them SevA, Sev1, TITSUP, outage, all-hands or something else. But we all have them, and we’ve all been involved in some way.

How many times in one of those situations did you hear:
“It’s not the network”
“It’s not the storage”
“It’s not VMware”
“It’s not my code”

What you’re really saying is: “It’s not MY fault”.

Stop thinking this way. Stop trying to prove it’s not your fault, or your area, or your systems. Instead, ask yourself how you can solve the problem and how you can make it better. Maybe your area did not create the issue at hand. But if all you’re trying to do is prove it’s not your responsibility, you’re not actually trying to solve the problem; you’re only trying to get out of the situation. It reminds me of childhood neighborhood games, yelling “1-2-3 NOT IT”. If everyone is simply trying to be “not it”, then the problem will never get solved.

Rather than trying to prove it’s not you, I urge everyone to prove it IS. Why? Well, for starters, if I keep trying to prove it’s my area, and I work under the assumption it might be… I might find out it actually is. It’s very easy for us to overlook a detail about why our area is part of the problem if all we’re trying to do is prove it’s not.

More than that, it’s a mindset.

The goal should always be to restore the service at any cost; while the outage is ongoing, does it matter why it happened? If during a critical issue I can find a way to improve the area I’m responsible for enough to alleviate the pain, I can help de-escalate the situation enough to restore service, and then get to true root cause.

Everything in IT is related; the components all work together. If I leave the situation after proving it’s not my area, I’m not present in the conversation to help answer questions. The result is time spent waiting to get someone back on the phone or in the war room to answer a question, delaying the resolution.

By staying engaged, I learn more about how my role fits into the larger ecosystem my area supports. With that knowledge, I improve my ability to contribute, not just to the issue at hand but to future designs. Plus, if I wish for “my area” to grow (a.k.a. get promotions), the more I know about the other areas, the better suited I am for a wider set of responsibilities.

Digging deeper and deeper into the tools and metrics I have may help uncover the key to solving the problem. I might be able to find the nugget that helps my peer solve it. Tracing network packets can help developers; comparing IOPS volume from before the incident can point to workload changes; leveraging security tools might help find code issues. I’ve witnessed this over and over again.

I have a great real-world example of this I use when talking to teams about critical response practices.

Years ago, we were experiencing an issue where a database server would crash every night during major ETL loads. For days no one could figure out why. The database looked fine and the logs were clean, but the whole server would panic every night. I was not responsible for the operating system and up at the time, so I was not involved in the initial troubleshooting effort. But with the problem ongoing, the teams who were responsible started reaching out for help.

I offered to take a look. While initially I didn’t see anything of concern, I asked when the issues happened and if I could watch. It was the middle of the night, so I agreed to stay up late with them that night to watch in real time. While the other team looked at the OS and database monitoring tools, I opened up mine: vCenter, storage, etc. Right before the crash happened, in real-time monitoring mode inside vCenter, I noticed an enormous spike in packets per second at the network layer. We repeated the workload, and the crash and the spike repeated as well.

Why, and what was happening? The ETL load was causing a large influx of data over the network, increasing the packets per second. While the 10 Gb/s bandwidth was not a bottleneck, the virtual network card was an older model (E1000), and processing its traffic was overwhelming a single processor core in the kernel. The Linux admin confirmed this after I asked him to look at the usage statistics for each processor individually. The solution was to change the virtual NIC to VMXNET3 and enable Receive Side Scaling to spread the network processing workload across multiple cores, so the kernel was no longer starved on core 0.
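For anyone who wants to run the same per-core check, here is a minimal sketch in Python. It is not what we ran that night; it is an illustration that assumes a Linux guest and the standard per-CPU field order in /proc/stat. The function names, the 50% threshold, and the sample snapshots are all invented for the example. It compares two snapshots and reports how busy each core was and how much of that time went to softirq, which is where network receive processing typically shows up.

```python
from typing import Dict, List, Tuple

# Field order of a per-CPU line in Linux /proc/stat (cumulative ticks).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_stat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse per-core lines (cpu0, cpu1, ...) from /proc/stat-style text."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        cores[parts[0]] = dict(zip(FIELDS, values))
    return cores


def per_core_usage(before: str, after: str) -> Dict[str, Tuple[float, float]]:
    """Return {core: (busy_pct, softirq_pct)} between two snapshots."""
    old, new = parse_stat(before), parse_stat(after)
    usage = {}
    for core, counters in new.items():
        if core not in old:
            continue
        delta = {f: counters[f] - old[core][f] for f in FIELDS}
        total = sum(delta.values())
        if total <= 0:
            continue
        idle = delta["idle"] + delta["iowait"]
        usage[core] = (100.0 * (total - idle) / total,
                       100.0 * delta["softirq"] / total)
    return usage


def hot_cores(usage: Dict[str, Tuple[float, float]],
              softirq_threshold: float = 50.0) -> List[str]:
    """Cores whose softirq share meets the (arbitrary) threshold."""
    return sorted(c for c, (_, si) in usage.items() if si >= softirq_threshold)


# Invented sample snapshots: core 0 buried in softirq, core 1 mostly idle.
SAMPLE_BEFORE = "cpu0 100 0 100 800 0 0 0 0\ncpu1 100 0 100 800 0 0 0 0\n"
SAMPLE_AFTER = "cpu0 110 0 120 810 0 10 950 0\ncpu1 150 0 150 1700 0 0 0 0\n"

if __name__ == "__main__":
    for core, (busy, softirq) in per_core_usage(SAMPLE_BEFORE, SAMPLE_AFTER).items():
        print(f"{core}: busy {busy:.1f}%, softirq {softirq:.1f}%")
```

On a real system you would read /proc/stat twice, a few seconds apart, and pass both reads in. An averaged CPU number can look healthy while one core is pinned, which is why looking at each core individually mattered here.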

By looking at the tools for my area, we were able to find data that led us down the path to the ultimate cause of the issue and solved it. It wasn’t the vSphere Hypervisor causing the issue, but the monitoring at that level could point to the issue. I could help solve the issue, even though it wasn’t my fault. Because I was trying to help, not just trying to prove it wasn’t my fault.

This story also demonstrates another important point… it often is not any one person’s or any one area’s fault, but the combination of them. Which means no one team can solve it on their own.

My last point, and maybe the most important personally, is also the easiest to forget. This time, it might not be your area, but next time it might be. When it is, don’t you want your peers there to help you? Moreover, isn’t it better to solve it together and make it a team problem? It might not be your culture today, but it can be with your help.

These are all the reasons I’ve told my teams “Don’t try to prove it’s NOT your area, try to prove IT IS, because if it is, YOU can fix it, and I need it fixed”.

So if you find yourself saying “It’s not the <my area>”, try instead: “How can <my area> help?”

April 27th, 2016 | Opinions, Pet Peeve, Soapbox

Soapbox Topic=“FUD”


FUD: Fear, Uncertainty and Doubt. More specifically for this blog, a vendor who attempts to discredit a competitor rather than speak of their own value, be it with straw man arguments, comparing apples to oranges, or simply outright lies. I find the practice personally disgusting. If you cannot speak well enough about your product without speaking poorly of others, I’ll assume you have a bad product. It’s like when my kids tattle on each other trying to get out of trouble; then I know for sure they did the deed I’m asking about. It happens every day, but I can recall a few cases in the storage world that raised my ire.

Years ago, a large technology manufacturer (I owned many of their products, as well as their competitor’s) was pitching me on a need for net-new growth. Their product line had been suffering in my operations, so it was not the front-runner. Rather than speaking about how to improve the situation, or how the newer generation would resolve the issues, they tried to make their competition appear overly expensive. In doing so, they pulled prices from eBay (yes really, eBay). Now, the argument had some merit on the surface. Sales teams are not privy to competitors’ pricing, and eBay has to be cheaper than list price, or even the discounted price, right? It’s eBay! I can see their line of thought: “certainly if we’re cheaper than eBay then they’ll use us”. Sadly, the eBay prices were incredibly inflated, higher than current list prices, not to mention the discount I’d receive. I recall losing my cool in that conversation, dressing down the sales rep about FUD practices and about failing to address our concerns in the new sale, not to mention our operational issues. To boot, they believed we as the customer couldn’t do the basic math in a cost comparison. He was walked off our campus, never to return (seriously, did I mention I don’t like FUD?).

In another case, a vendor was telling me how their product was superior to their competitor’s because it tiered at the sub-LUN level, telling me my product of choice would only tier at the whole-LUN level. I was in management at the time and storage was one of my many departments, so I have to imagine the sales team believed I simply wasn’t aware of the details. The detail being, they were comparing their current product to their competitor’s product of 2 years ago. Not only did I correct them on their misinformation, but since they sold other products I liked and wanted, I had the account team replaced because of that breach of trust (again, I really don’t like FUD).

Today, with social media, my exposure to this practice is no longer limited to personal interactions. Almost daily I see a tweet about one product replacing another, and when the product being replaced is 3-5 years old, and likely 2+ generations behind, again my hackles rise. Especially because in many of these cases, I believe the product has technical merit. The bitter use of a logical fallacy, comparing different generations in the world of Moore’s Law, causes me to assume they are trying to cover up something. The approach erodes my trust in the people and the company that spread the misinformation.

If you are reading this and have an involvement in the sales channel, please, compete with integrity. Stand on your own merits. If your pitch is rooted in bashing your competitor, educate yourself and focus on your product’s positive aspects; leave it up to the customer to weigh them against the competition. If the product you are competing against truly has issues you want to inform your customer of, leverage a reference customer to have an unbiased call.

You might just win the deal based on your integrity.


February 4th, 2016 | Soapbox