DevOps Lessons Learned: Don't become the problem (again)
01 Jun 2018 - Aaron Dodd
This was published in the company newsletter on the topic of a DevOps-related failure and what was learned from it.
The following is a true story. The names have been changed to protect the guilty.
I glanced at my phone as I pushed the “brew” button on my coffee maker. My notifications were the typical overnight ones: a few emails that could wait, a bunch to delete, one that I should probably read but didn’t want to deal with right after waking up, and a slew of escalations in our client’s group chat. That last one was odd, as we were over a year into our DevOps engagement and had ironed out most of the troublesome processes and issues months ago. I scrolled through the chat log and frowned. One of my engineers and one of the client’s developers had been arguing back and forth for the past few hours over a recurring high-CPU alert on one web server in a farm. Both were obviously frustrated. Seeing my status change from “away” to “online,” my overnight engineer pinged me.
“If Chris’ team has a code issue, why won’t they debug it?” he asked. Chris was the manager for the customer’s PHP development team whereas I ran our operations group that supported the infrastructure. Despite some good progress getting our guys to collaborate better, the stress of an abnormally high number of new environment builds in the past few weeks had caused old tensions to flare up.
I finished scanning the chat log before replying. It was the same issue we’d highlighted for over a week. One server in a farm of thirty became unresponsive due to high CPU once every few days at varying times, but always overnight. We could find no infrastructure or configuration issue with that node, and each time we confirmed the node had the identical software stack and code base as the others in the farm. The only clue was that one PHP thread would consume the CPU, but there were no cron jobs or access log entries to indicate a scheduled task either on the server or executed remotely.
“You escalated the ticket to development with your validation checks?” I asked.
“And again,” he said, “it was closed as ‘not an issue’ because it isn’t happening by the time they look into it.”
I wrote to the group chat: “If the server is healthy now and the application isn’t impacted, let’s write up our findings and discuss on the morning stand up. I’ll join in person.” To my engineer, I asked for all the tickets we had opened for the development team.
I was on my third cup of coffee, this time from Starbucks on the way to the office, as I sat down for the morning sync between operations and development. The agenda was straightforward: discuss what happened yesterday, what’s planned for today, and what blockers exist. When we reached blockers, I brought up the seventeen tickets I could find for the development team about the CPU issue and asked what we could do to stop this from happening.
Chris sighed. “Look,” he said, “it doesn’t happen in dev, QA, or the load environments. It’s only one server and the application is still up. It’s not a code issue, it’s an operations one. Let’s just kill the instance when it happens and let it be respun. That’s the beauty of the cloud, right?”
“What is different about the application usage in production as opposed to lower environments?” I asked. “Aside from significantly higher load and more servers, are there any application configurations that don’t match?”
Chris glanced at his watch. “I have a hard stop, but if it’ll make you feel better, I’ll have someone dump the config tables and compare. I guarantee code is the same. If there’s no differences in configs, just restart PHP or the node and let’s move on. I’m not wasting any more development time on non-issues.”
I received Chris’ comparison of the configurations a few days later via email. He had also cc’d Rich, the VP of Technology to whom we both reported. As I’d expected, he found no differences to explain the CPU spikes and suggested to Rich that my team “use the cloud as it’s meant and put in self-healing.” I scheduled a follow-up meeting and included Rich, but I was tired of fighting what felt like a small battle and was more interested in just keeping Chris from painting us in a negative light. Both Chris and I eventually agreed to several action items: my team would automate restarting PHP if the CPU suddenly spiked, or respin the node if it became unavailable, and Chris would implement “watchdog” logic in code to throttle the process.
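The self-healing policy we agreed to can be sketched roughly as follows. This is a minimal illustration, not our actual tooling; the threshold, action names, and the `php-fpm` service mentioned in the comment are all hypothetical stand-ins.

```python
# Sketch of the agreed remediation policy: restart PHP on a CPU spike,
# respin the node if it stops responding entirely.
# The 95% threshold and action names are hypothetical.

def choose_remediation(cpu_percent: float, node_responsive: bool) -> str:
    """Return the action a monitoring hook would take for one node."""
    if not node_responsive:
        return "respin-node"   # terminate the instance; autoscaling replaces it
    if cpu_percent >= 95.0:
        return "restart-php"   # e.g. restart the php-fpm service on the node
    return "no-action"
```

Note what the policy never asks: *what* the runaway PHP thread was doing. It reacts purely to symptoms, which is exactly how we ended up suppressing the alert instead of the cause.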
The alerts stopped occurring. Since we logged all performance metrics and alerts for analysis, we could still see that, even though the CPU alerts were no longer firing, a random node in the farm was still being respun every few nights, sometimes several times a night. Since we could find no availability issues and no one complained, the “issue” was soon forgotten. Both Chris and I mentioned the additional changes we implemented as “continuous improvements” in our monthly reports to Rich: a “win-win” for both our teams.
Later that month, Rich invited everyone for drinks after work as congratulations on the past year’s successful application launches. I found myself having a beer with Darren, introduced as the somewhat redundant-sounding “Content Management System Manager” for one of the brands Rich–and, therefore, all of us–supported. “So, you’re the guy that pushes pretty buttons on websites,” I joked. Darren laughed. “Nah, my team develops the custom Drupal modules and front-end design.”
I was surprised as I was only aware of one development group. “Do you work with Chris’ team?” I asked.
He hesitated a moment, trying to place the name, then said, “Sort of. We commit to Git. They wave a wand to make it live. Sometimes we collaborate on Drupal core issues. They’re okay, I guess. I mean, we’re live, right? But, we’re pretty sure there’s something wrong with the setup and they keep saying it has to be our module.”
I smiled, remembering similar interactions, and asked what the issue was.
“A few months ago, we implemented an ingest hook to get assets from our vendor and process them for display on the site. In dev, we can do a full ingest of live data in just under an hour, but in production it seems to die randomly. It happens so often we implemented batching so we can re-process only parts of the feed after the ingest dies, just so it can eventually complete. Even so, it takes upwards of six hours in prod, and it’s reaching about eight hours this month. We keep spending time refactoring but I’m sure it’s not the module.”
I had a sinking feeling. “Does this run overnight?” I asked.
“Yeah, every two or three days.”
I’m an idiot, I thought. I fished a business card out of my bag and asked Darren to call me, saying that I might know what’s going on. And worse, I kept to myself, we might be the cause.
After I ended the conference bridge with Darren and his lead developer the next day, I looked over my notes on the ingest logic he described. There was no doubt. This change was the cause of the random production issues we’d been experiencing. Darren’s team had the foresight to realize ingestion couldn’t occur from every node in the farm at the same time, nor could they rely on any single node always being available, and they knew the additional processing might impact production traffic, so their logic would randomly choose one on which to execute an ingest event after business hours. Given the typical load in production as the node served content to end users, the overhead of this added process was enough to impair or kill a single server, something they wouldn’t notice in the non-production environments.
Worse still, our remediation steps treated the issue as just a CPU alert and, in our attempt to “fix” that, we caused bigger problems. I realized we had broken several key concepts of “DevOps” that we needed to address right away, in addition to actually helping Darren with getting their ingestion logic working.
We called ourselves a “DevOps” team and, while our KPIs on the monthly reports testified to the improvements we had brought to the project, they weren’t enough if we weren’t properly aligned to the business. The RACI matrix we had agreed to with the client listed Chris’ development team as responsible for anything code- and application-configuration-related, while my team was responsible for infrastructure and operating systems. Although each team was designated as “to be consulted” by the other, in practice we’d regressed to the old siloed “operations versus development” mindset, since we could simply point to the RACI. More importantly, we had completely missed including the other critical development teams in our processes. This meant we were all working on tiny pieces of the larger puzzle without understanding what was actually needed or what effects we were having, which decreased efficiency and reliability across the board.
These were key lessons and they seemed obvious in retrospect, but they led me to a more important one. I had become part of the problem. I had grown complacent with the quality of my team’s work. Instead of keeping focused on continually improving, I had grown weary of fighting political battles that shouldn’t have existed in the first place had I been properly highlighting project risks and working to improve the processes. I let myself, my team, and the customer down by not remaining vigilant, by acting like an old-school operations manager and not a DevOps lead.