In almost every profession, it seems like there are two types of workers: the ones who get the glory, and the ones who do the essential work no one ever sees — unless something goes wrong.
In enterprise computing, those overlooked people are known as operations engineers. They're the ones who keep the rickety Rube Goldberg machine that is the modern internet from falling to pieces every day, while their glamorous counterparts — software developers — get to bask in the recognition that comes with shipping a new feature or creating a new service.
A little over 10 years ago, a group of operations-oriented engineers decided they were fed up with software developers who didn't care if their code actually worked, so long as it shipped. They were tired of abuse at the hands of management who forced their teams to be on call 24/7 with little to no internal support, let alone recognition.
Those engineers created the Velocity Conference in order to band together: to share their lived experiences including the intense pressure to keep Fortune 500 companies up and running, to discuss tips and tricks for navigating tricky problems and to come together as a community of people who know what it's like to be at the bottom of the food chain when everything has gone to hell.
That community sparked a revolution known as DevOps, the idea that software developers and operations professionals needed to work together more closely to support the ever more complex task of running sophisticated software over the internet. Big companies such as Amazon and Google started to develop the operations career path with incentives and rewards parallel to those on the development side, while acknowledging that these people needed support from the highest levels of the company to do their very difficult jobs.
And out of this community came a Twitter hashtag, an in-group signal to their peers during the most stressful moments of their careers that a team had their back. When a major cloud service goes down, such as during Slack's early January outage, most people on Twitter see an opportunity to vent their frustration and score points at the affected company's expense.
At those moments, the people who know what it takes to keep these services afloat spread a hashtag: #hugops.
This is the story of the engineers who keep the cloud running, and how they created their own culture of empathy when nobody else cared.
Life of a sysadmin
Adam Jacob, CEO of The System Initiative, co-founder and former CTO of Chef: Systems administrators — a now almost basically nonexistent job title — were not the most beloved humans in the technical world. We didn't get a lot of respect.
We were sort of in the same bucket like secretaries; we had a System Administrator Appreciation Day. The people who do the stuff you don't see get appreciation days because, by definition, it means I'm not being appreciated every other day.
[One team leader] took us and my whole team, there were like 20 systems administrators, and he took us all out for beer on System Administrator Appreciation Day. And he sat down with the pitchers of beer and the first thing he said was, "Here's your guys' beer. Too bad none of you are smart enough to be engineers. Cheers."
My response to that was to just be mean to him.
Werner Vogels, CTO, Amazon: I think sysadmins mostly came out at a time when most companies were buying software. Traditionally at those operations, [software] development is on one side. Then there's this wall, and you throw software over the wall; and you don't care anymore.
Tim O'Reilly, founder, O'Reilly Media: There were all the, effectively, software janitors who were cleaning up after them. And the software janitors were kind of going: That doesn't really work.
Jesse Robbins, a former firefighter and present-day hugger, at the Velocity Conference in 2010. Image: James Duncan Davidson/O'Reilly Conferences
Kolton Andrus, co-founder and CEO, Gremlin: At Amazon, I was one of 10 people that was paged when the website went down. And I took and managed the resolution of those calls from the side of the freeway next to my motorcycle because I had to pull over, call in and handle it immediately; it couldn't wait 10 minutes until I got home.
There was an Amazon Christmas party that I was at where I got a page, I had to run out to my car, get my backpack, come into a war room, sit down and resolve an incident before going back to the party. There's a lot of work that the engineers and the ops folks do behind the scenes, a lot of thankless work to help make sure things go well and get fixed.
Nathen Harvey, developer advocate, Google: What do we celebrate in technology? We celebrate new; new features, shipping new capabilities that we're delivering to customers. And we get angry when systems fail. Basically what you're saying is: We celebrate the developers, and we recognize the operators when everything goes to shit. That's not great.
Jacob: I sat in a room early on at Chef with a bunch of video game developers that were running the U.S. operations for one of the biggest video games of all time. And their boss sat across the table from them, and to my face, in front of them, said, "My guys aren't smart enough to learn Ruby." If you just interviewed system administrators from that era, 100% of them have that story.
Jesse Robbins, founder and executive chairman of Orion Labs, former co-founder and CEO of Chef: In operations, we always missed the launch party, because we were too busy in the data center or locked in an office looking at green screens trying to support a launch. We were never there for the fun part. We were always the ones that were giving up our nights and our weekends, and we're powerless to actually improve things.
When emergencies are a day job
Andrus: The on-call training I received at every company amounted to: "Here's your pager, good luck. You're smart, you'll figure it out."
Harvey: I remember a conversation I had with Ron Vidal, who is a firefighter in the San Francisco area. And one of the things he said to me was: "A firefighter has never, in their life at work, responded to an emergency. If your house is on fire, that's an emergency for you, but for the firefighters, that's their job."
Robbins: I'm a firefighter by training, and when I joined Amazon in 2001, "master of disaster" was my title. I realized that the way that we were running operations at Amazon was fundamentally not going to scale and that we needed a process and almost a cultural overhaul.
I began turning Amazon into a fire department. I literally took the sort of incident management principles that we used in the fire service and turned that into what we call GameDays and Scale Days, using essentially the incident command system in order to support people through the various ways of thinking when the red light is on.
Davis: A lot of what operations is like encourages this heroism: You have to do everything to keep it running and just throw yourself into it. It's not sustainable work. It's not great, it's terrible, and you're celebrated when you save the day but the reality is, it's terrible. It harms your relationships, and it harms your health and just frames how you work with other people.
Nora Jones, founder and CEO, Jeli: We're shifting towards a kind of a time where people see issues and incidents as a symptom rather than a cause of something, and trying to understand the bigger system that is playing out in those organizations.
Robbins: I owned availability at Amazon, and when I say owned it, I was sort of a tyrant, and ran it very aggressively. There was this big outage that we had [in the early 2000s], and there was a person early in their career who was literally shaking when I walked into the room because they were so afraid of what was going to happen.
I realized, "I've got to change the way that I approach this entirely and make it safe to experiment, safe to do these other things, to not have this punitive model and approach." It was seeing that person's face where I'm like, "Oh, I'm not the fire department, I'm like a bad guy. I'm being a villain."
Davis: If we reduce the heroism, we can reduce burnout.
Robbins: There is an ethos that came from all of that early work that recognizes how it is important to be kind to each other. And part of what I did early on at Amazon was create a culture of safety. You only get to do really big, great things when you're able to take great risks safely.
A meeting of like minds
John Allspaw, founder and principal, Adaptive Capacity Labs: These topics deserved an entire conference. I guess it was less that it deserved an entire conference, but more that a few folks convinced Tim O'Reilly to actually do it.
O'Reilly: They said, "Look, we need a gathering place for our tribe." We had done that before, for these various open-source communities. A lot of these things are rooted in communities, and so if you can figure out what community you want to bring together, you start by bringing them together.
Allspaw: What [the Velocity Conference] did was important, because it was a signal that operating software and understanding how things are running and anticipating things that can go wrong could be considered distinct from software development.
Artur Bergman, co-founder and chief architect, Fastly: What we were doing was just as critical as writing the code. If you can't run the code, it has no value.
Vogels: The time to develop software is actually quite small [compared] to the time that you have to operate it. So even though you may be building something complex, it may take a year or two years [to build], you may have to operate it for many, many more years to come.
Jacob: Velocity was like the first time that there was a non-academic place where everybody who is doing that work could get together. And it was like, well-funded and pretty. It wasn't like we were meeting up in the American Legion hall or whatever. It was a fucking conference.
Allspaw: We were finding this pretty significant common ground. For many, many years, they didn't have a place to put these ideas, or even labels or terms or vocabulary to talk about the dread — or actually sort of outright terror — that can come with, "shit's broken, and we have no idea."
So there's this lived experience of, "OK, you're with your colleagues and shit's broken and you don't have 100% clarity, but you've got a couple of good ideas that look sort of fruitful. And OK, so it seems like we should connect this thing to this thing and restart this other thing? We should do it in that order. What do you think about that?" You'd see this in IRC; we didn't have Slack back then.
This conference exists because we've got this shared experience with incidents and the general challenge is not just responding to incidents, but trying to work out how to prevent the ones in the future. And it's difficult work.
Time for a hug
Jacob: I'm a very huggy person. And so I hugged all of those people [at Velocity], all the time. Because it was happening to this group of people who … their work environment was not a place where you got a fucking hug.
Davis: We're building complex systems that include the people. And so how do we handle the unpredictable stress of complex systems? When you think about hugs, hugs are used to reduce pain. They're used to show that you care and they're used to reduce fear.
Jacob: So Artur Bergman was — is? — a particularly salty dude. He swears as much as I do, maybe more, and he's Swedish, so like when he swears, it's better.
Artur is not a person who was huggy. Artur would maybe suffer a hug from me, or suffer a hug from John [Allspaw]. At some point, John made a T-shirt that is the earliest I remember of the #hugops-y thing, and on the back of it it basically says, "Hug Artur Bergman."
Bergman: [During one Velocity] I gave a keynote and then [Adam] gave a keynote where he told people to hug me, and I was not aware that he had said that. During the day around the conference, random people started coming up and hugging me, which was, you know, quite uncomfortable, especially because I had no idea why. And so I ended up hiding for the rest of the day until I finally found out at the end of the day why this was happening.
Artur Bergman, who is not the naturally huggy type, at the Velocity Conference in 2010. Image: James Duncan Davidson/O'Reilly Conferences
Jacob: It was a very special moment in time where there was this very high degree of camaraderie, there was this really high degree of familiarity.
Allspaw: Capturing this real dread, these pretty scary, pressure-filled situations, sort of fueled that you're part of this tribe. I don't know who you are, but you're here and you're talking and, so having that common ground is what I think genuinely got people [to be] like, "Can I give you a hug?"
Jacob: We knew people at all of those [big tech companies], right? And so as everybody starts to know each other, when like, Facebook would have an outage, you'd use the #hugops hashtag and you were like literally talking to your people.
Robbins: It's not a surprise that what began with a sarcastic joke to troll one of my best friends became an idea that a lot of people have rallied around because it reflects the world that they're building continuously, that they're continuously improving.
Davis: It's just a message of caring. It's a shorthand to show that I have empathy for where you're at, because I'm going to be there at some point. And I hope you show me that empathy too, but also, you know what? You are not alone.
The future according to #hugops
Jones: What we're really seeing right now is a shift in the software industry and us buttoning up and understanding that our software is quite critical. But the pressures that people are under to write this software is a lot.
Take Slack. During that outage, they had all just come back, it was the Monday that everyone came back from New Year. I can't imagine being in that office, because you're just getting used to writing code again, you're just getting used to deploying things again, and then all of a sudden, all the world is signing on to Slack at the exact same time. It makes total sense that they had an incident that day.
I think part of what we're seeing from the "learning from incidents" community is just a shift in thinking and software to say, "OK, they didn't do something wrong. Something happened that made sense for them to do what they did," and kind of allowing for that conversation to happen.
Robbins: That shift happened because we made it happen, in part because we simply made it so clear that large businesses, large organizations cannot succeed with this kind of outdated enterprise software legacy mindset. To be always on, to be always available, you're always improving, and that means dealing with failures and enabling rapid change.
I think we're in the second chapter now of a movement that has new leaders emerging and evolving. It's not a part of the MBA curriculum yet, but it soon will be.
Andrus: Inertia within an organization is hard. You can get a team of 10 to pivot quickly. You're a startup, you've got 100 people, you can change your process. You've got 10,000 engineers, it's a lot harder to get everyone to change how they've done things the last decade or two.
Harvey: The #hugops movement and the ideas behind it really speak about, "How do we build more empathy for the other humans that we interact with every day?" In my mind, it certainly goes beyond technology.
As a society, we could take some real lessons from this: How do we just have better empathy for and respect for the work and the way that people show up in the work that they do, and the fact that you know, everyone is out there doing absolutely the best that they can with what they have? I think that's really, really important.
Davis: Every time I hear "NoOps" or "NoDev," I'm like, "Nooooo…." Because when people are saying that the robots and automation are gonna take over, that doesn't think through all of these complexities that humans are really great at.
Yes, reducing the toil is so great. And we can have these conversations about how to balance out what availability is, and how much I'm going to spend on resolving things, and have those kinds of conversations separate from like, "We're gonna just eliminate all the humans because humans make mistakes." Humans make mistakes building the stuff that then we're relying on; we need humans as the safety checks.
Bergman: If you have a long outage, you need to care about your people and their sleep schedules, and the fact that they have to eat. And by day four or five, if you didn't do that, you're just gonna have a bunch of really tired and grumpy people who are going to make more mistakes.
Andrus: I did enjoy at Amazon and at Netflix the approach of, "You should know how your software behaves." If you've written software and deployed it and then you're turning a blind eye to it, that's just not good engineering.
Davis: What is so fascinating is that the next generation isn't putting up with this negative stuff. They're setting the expectations and they're very vocal about what they want their work environments to be like and how they want to work.
John Allspaw values the shared experience of the Velocity Conference.
Jones: We need to be asking different questions and we need to give more people seats at the table. I've been at way too many organizations where the incident was just the [site reliability engineers] in the room. It should have had marketing in the room, it should have had PR in the room, it should have had customer service in the room, it should have had leadership in the room. But it's thought of as kind of an SRE issue, like SREs have to prepare for any type of situation that gets thrown their way.
I was at one organization a while back where we launched a Super Bowl commercial. And we had some bumps when we launched the commercial, but the SRE team didn't get a ton of notice that the commercial was happening, I think it was either same-day notice or a couple days beforehand, and that was not really mentioned in the post-incident review.
Andrus: The flip side of #hugops is I do think there is responsibility that should be held to the leadership of those companies. We're empathetic to the engineers that are dealing with the situation they have, but in part that's because leadership isn't prioritizing their actions, or resilience and reliability in the same way that they prioritize some of their product efforts.
Allspaw: As my colleague Dr. Richard Cook has said, we shouldn't be surprised that these systems go down. We should be more surprised that they stay up as often as they do.
Bergman: We took a job that was critical to running the world's largest websites and the internet, that was kind of under-appreciated, and turned it into a movement, modernized it with DevOps, and gave those individuals career paths.
Jacob: Who gets credit when you see a beautiful car? You don't give credit to the mechanics. You're like, "Man, those guys at Porsche really make beautiful cars." You might know, like, one legendary mechanic in the history of great mechanics.
But that's why it's so persistent: because the mechanics know the mechanics.